David Ziegler's Internets: A Python Script to Extract Excerpts From Articles

  • Bartek · 6 months ago
    Your removeHeaders(soup) looks scary. Could you rewrite it as something like this:

    [tag.extract() for tag in soup(['h1', 'h2', 'h3'])]

    Didn't test it, and it's the morning, but that chunk should definitely be done differently. Regardless, nice little article :)

    edit: Just realized you already did something similar in your github code. My bad!
  • dziegler · 6 months ago
    Yeah, that was pretty ugly. I updated the post to reflect the changes.
  • Michael · 6 months ago
    A slightly more computer-science-like approach to extracting text from websites is http://www.aidanf.net/software/bte-body-text-ex.... It works pretty well for decently structured HTML, although your cleaning suggestions for CSS etc. help, too. If I need a shorter version of the full text, I just use the first N chars.
  • Gadget_Blog · 4 months ago
    What a useful post. Very informative for me. Thank you, friends.

    Cheers,
    gadgettechblog.com
  • sordyl · 3 months ago
    @trifilij, did you have any luck with a Ruby equivalent? I'm about to try this with RubyfulSoup and could use a head start.
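
The two ideas raised in this thread, dropping header tags and keeping only the first N chars, can be sketched together using just the Python standard library. This is a hypothetical illustration, not the post's BeautifulSoup code: the names `ExcerptParser`, `excerpt`, and `SKIP_TAGS`, the choice of skipped tags, and the 200-character default are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Assumption: tags whose text should never appear in an excerpt.
SKIP_TAGS = {"h1", "h2", "h3", "script", "style"}


class ExcerptParser(HTMLParser):
    """Collect text that is not nested inside any tag in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # number of skipped tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def excerpt(html, max_chars=200):
    """Return at most max_chars characters of body text, plus an ellipsis if cut."""
    parser = ExcerptParser()
    parser.feed(html)
    parser.close()
    # Join the collected chunks and collapse runs of whitespace.
    text = " ".join(" ".join(parser.chunks).split())
    if len(text) <= max_chars:
        return text
    # Cut at the last space within the limit so no word is split in half.
    cut = text.rfind(" ", 0, max_chars + 1)
    return text[:cut if cut > 0 else max_chars].rstrip() + "..."
```

Calling `excerpt(page_html, 300)` on a fetched page (where `page_html` is whatever string you downloaded) would give a short header-free teaser. Cutting at a word boundary is a design choice layered on top of the plain "first N chars" idea; a bare `text[:N]` slice is simpler but can end mid-word.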