<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>David Ziegler's Internets - Latest Comments in A Python Script to Extract Excerpts From Articles</title><link>http://davidziegler.disqus.com/</link><description>David Ziegler's personal blog of computing, math, and other geekery.</description><atom:link href="https://davidziegler.disqus.com/a_python_script_to_extract_excerpts_from_articles/latest.rss" rel="self"></atom:link><language>en</language><lastBuildDate>Fri, 10 May 2013 23:21:37 -0000</lastBuildDate><item><title>Re: A Python Script to Extract Excerpts From Articles</title><link>http://blog.davidziegler.net/post/122176962#comment-892805227</link><description>&lt;p&gt;Hey i got this error msg: global name 'SoupStrainer' is not defined&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Sujit</dc:creator><pubDate>Fri, 10 May 2013 23:21:37 -0000</pubDate></item><item><title>Re: A Python Script to Extract Excerpts From Articles</title><link>http://blog.davidziegler.net/post/122176962#comment-627286651</link><description>&lt;p&gt;This is awesome! Thanks for sharing :)&lt;/p&gt;&lt;p&gt;-tk&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tahia</dc:creator><pubDate>Fri, 24 Aug 2012 00:53:35 -0000</pubDate></item><item><title>Re: A Python Script to Extract Excerpts From Articles</title><link>http://blog.davidziegler.net/post/122176962#comment-86916665</link><description>&lt;p&gt;Wow, thanks so much! 
I'm searching for such information!&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Remove Antispy Safeguard Virus</dc:creator><pubDate>Thu, 14 Oct 2010 10:48:54 -0000</pubDate></item><item><title>Re: A Python Script to Extract Excerpts From Articles</title><link>http://blog.davidziegler.net/post/122176962#comment-45686985</link><description>&lt;p&gt;Actually that link is broken. 
The link to that post is &lt;a href="http://www.aidanf.net/archive/software/bte-body-text-extraction" rel="nofollow noopener" target="_blank" title="http://www.aidanf.net/archive/software/bte-body-text-extraction"&gt;http://www.aidanf.net/archi...&lt;/a&gt;&lt;/p&gt;&lt;p&gt;The latest code for BTE is on GitHub: &lt;a href="http://github.com/aidanf/BTE" rel="nofollow noopener" target="_blank" title="http://github.com/aidanf/BTE"&gt;http://github.com/aidanf/BTE&lt;/a&gt;&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">aidanf</dc:creator><pubDate>Tue, 20 Apr 2010 13:48:10 -0000</pubDate></item><item><title>Re: A Python Script to Extract Excerpts From Articles</title><link>http://blog.davidziegler.net/post/122176962#comment-28575997</link><description>&lt;p&gt;Was hoping to find the equivalent of this in Ruby, but no luck so far. Not sure how easy it would be to do, so if anyone has a heads-up, feel free to let us know.&lt;br&gt;------------------------------------------&lt;br&gt;Tommy - &lt;a href="http://www.simplywoodengifts.co.uk" rel="nofollow noopener" target="_blank" title="http://www.simplywoodengifts.co.uk"&gt;Personalised Childrens Gifts&lt;/a&gt;&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Baby Gifts</dc:creator><pubDate>Tue, 05 Jan 2010 16:58:29 -0000</pubDate></item><item><title>Re: A Python Script to Extract Excerpts From Articles</title><link>http://blog.davidziegler.net/post/122176962#comment-17498178</link><description>&lt;p&gt;@trifilij did you have any luck with a Ruby equivalent? 
I'm about to try this with RubyfulSoup and could use a headstart.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">sordyl</dc:creator><pubDate>Sat, 26 Sep 2009 21:06:56 -0000</pubDate></item><item><title>Re: A Python Script to Extract Excerpts From Articles</title><link>http://blog.davidziegler.net/post/122176962#comment-14517598</link><description>&lt;p&gt;What a useful post here. Very informative for me..TQ friends...&lt;/p&gt;&lt;p&gt;Cheers,&lt;br&gt;&lt;a href="http://gadgettechblog.com" rel="nofollow noopener" target="_blank" title="gadgettechblog.com"&gt;gadgettechblog.com&lt;/a&gt;&lt;br&gt;&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Nail Arts Designer</dc:creator><pubDate>Sun, 09 Aug 2009 13:09:38 -0000</pubDate></item><item><title>Re: A Python Script to Extract Excerpts From Articles</title><link>http://blog.davidziegler.net/post/122176962#comment-11006808</link><description>&lt;p&gt;A slightly more computer-science-like approach for extracting text from websites is &lt;a href="http://www.aidanf.net/software/bte-body-text-extraction" rel="nofollow noopener" target="_blank" title="http://www.aidanf.net/software/bte-body-text-extraction"&gt;http://www.aidanf.net/softw...&lt;/a&gt;. It works pretty well for decently structured html although your cleaning suggestions for CSS etc. help, too. If I need a shorter version of the full text I just use the first N chars.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Michael</dc:creator><pubDate>Tue, 16 Jun 2009 15:57:04 -0000</pubDate></item><item><title>Re: A Python Script to Extract Excerpts From Articles</title><link>http://blog.davidziegler.net/post/122176962#comment-10839881</link><description>&lt;p&gt;Yeah, that was pretty ugly. 
I updated the post to reflect the changes.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">dziegler</dc:creator><pubDate>Sat, 13 Jun 2009 11:37:33 -0000</pubDate></item><item><title>Re: A Python Script to Extract Excerpts From Articles</title><link>http://blog.davidziegler.net/post/122176962#comment-10837451</link><description>&lt;p&gt;Your removeHeaders(soup): looks scary. Could you not rewrite it to something like so:&lt;/p&gt;&lt;p&gt;[tree.extract() for arg in ['h1','h2','h3'] for tree in soup(arg)]&lt;/p&gt;&lt;p&gt;Didn't test it, and it's the morning, but that chunk should definitely be done differently. Regardless, nice little article :)&lt;/p&gt;&lt;p&gt;edit: Just realized you already did something similar in your GitHub code. My bad!&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Bartek</dc:creator><pubDate>Sat, 13 Jun 2009 09:28:43 -0000</pubDate></item></channel></rss>