
September 12, 2007

The last markup parsers you will ever need: FeedParser and BeautifulSoup

It's becoming increasingly common to access and reuse data in markup formats, from the XML prevalent in web services and the RSS and Atom (XML) formats used in news feeds to the HTML used in web pages. I don't know about your experience, but I've always found processing markup data extremely frustrating and time-consuming, so I was happily amazed at the functionality I discovered in these two Python tools, which I would dub the last markup parsers you will ever need.


Markup languages are nasty to process because of the very traits that make them useful: they are open-ended and have low barriers to distribution. Anyone can create their own markup structures to satisfy their needs, from an XML-based sales catalogue to a funky HTML layout. On the other hand, while there have been many efforts to guarantee markup integrity, such as DTDs and Schemas, a small sample of news feeds and web pages around the net will demonstrate that it's common, and way too easy, to distribute broken markup.

So what is the best approach to processing markup? Well, nowadays every programming language has more than a few ways to do it, so why am I making the far-fetched claim that FeedParser and BeautifulSoup will be the last markup parsers you will ever need, and that they are written in Python of all things? Two recurring problems with other parsers explain it.

Error handling (try/catch) hell: In an effort to guarantee the utmost integrity, most parsers and markup processors inform you of (read 'choke on') every possible error, so it's up to you to decide what to do in each case. Kudos to programming integrity, but I don't know about you: writing try/catch blocks to deal with everything from unclosed attributes to orphan tags gets a little tiresome after the 10th markup variation.
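To make the complaint concrete, here is a minimal sketch using only Python's standard library. The strict XML parser (xml.etree.ElementTree) refuses markup with unclosed tags, so the caller has to catch the error, while a lenient pass (html.parser, used here purely as a stand-in for the forgiving approach, not as either of the tools discussed in this entry) keeps going. The sample markup and the helper names strict_parse, TagCollector and lenient_tags are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: both <item> tags are left unclosed.
BROKEN = "<items><item>first<item>second</items>"


def strict_parse(markup):
    """Return the root tag, or the parser's complaint if the markup is malformed."""
    try:
        return ET.fromstring(markup).tag
    except ET.ParseError as err:
        # This is the try/catch the caller is forced to write.
        return f"parse error: {err}"


class TagCollector(HTMLParser):
    """Lenient pass: record every start tag seen, well-formed or not."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def lenient_tags(markup):
    """Feed the markup to the lenient parser and return the start tags found."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


print(strict_parse("<items><item>ok</item></items>"))  # well-formed: root tag
print(strict_parse(BROKEN))                            # malformed: error text
print(lenient_tags(BROKEN))                            # lenient: tags anyway
```

Multiply that except branch by every kind of breakage you meet in the wild and you get the tiresome part.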

One size doesn't fit all: Given the sheer number of markup variations, it's somewhat difficult to find one single tool for all your needs. Take the case of data feeds: you have four versions of RSS (0.9, 1.0, 1.1 and 2.x) and two Atom versions (0.3 and 1.0) tossed into the mix, so is there a common denominator? Of course not; that's the reason for having so many versions. Each one has quirks that have to be dealt with separately, much to the dismay of those trying to work with data feeds programmatically.

These impressions come primarily from dealing with numerous parsers across languages, everything from Java's SAX events, DOM trees and JDOM model, through .NET's XmlReader and XmlDocument, to PHP's XML facilities. I won't say I've tried every parser in existence, but I've tried quite a few, enough to see the difference in FeedParser and BeautifulSoup.

FeedParser: If you are going to be working with news feeds, this little library is a must-have; I haven't seen anything close to its features in Java or .NET. Its main benefit is a single extraction syntax for the numerous feed versions in the wild: just ask for d.channel.title and that's it, the RSS or Atom markup intricacies are taken care of under the hood. Among other things, it is also HTTP-aware (it checks last-modified dates and ETags to avoid repeated downloads), sanitizes HTML, and falls back on very comfortable defaults when markup is not up to par.
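A minimal sketch of that single syntax, assuming the third-party feedparser package is installed (pip install feedparser). The two inline sample feeds, the example.com links and the helper names summarize and fetch_if_changed are made up for illustration. The sketch reads d.feed.title, which is the spelling current feedparser releases document for the feed-level title.

```python
# Hypothetical sample feeds: one RSS 2.0, one Atom 1.0.
RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example RSS feed</title>
  <item><title>First RSS entry</title><link>http://example.com/rss/1</link></item>
</channel></rss>"""

ATOM_SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Atom feed</title>
  <entry><title>First Atom entry</title><link href="http://example.com/atom/1"/></entry>
</feed>"""


def summarize(source):
    """Return (detected format, feed title, entry titles) for any feed.

    `source` may be a URL, a file path or the feed text itself.
    """
    import feedparser  # third-party; imported here so the module loads without it

    d = feedparser.parse(source)
    # The same attribute access works whether the feed is RSS or Atom.
    return d.version, d.feed.title, [entry.title for entry in d.entries]


def fetch_if_changed(url, previous=None):
    """Re-fetch a feed, passing along the ETag / Last-Modified of an earlier result."""
    import feedparser

    if previous is None:
        return feedparser.parse(url)
    # If the server answers 304 Not Modified, the result has status 304
    # and no entries, so nothing is downloaded twice.
    return feedparser.parse(
        url, etag=previous.get("etag"), modified=previous.get("modified")
    )
```

Calling summarize(RSS_SAMPLE) and summarize(ATOM_SAMPLE) goes through exactly the same code path; only the reported format differs.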

BeautifulSoup: If it's open-ended parsing you need, you can't go wrong with BeautifulSoup, which is primarily intended as an HTML scraper (for legally scraped content, hopefully). You know the drill: HTML markup in the wild is rarely standards-compliant, and BeautifulSoup does the best job I have seen so far of dealing with broken markup. It simply bypasses obvious markup annoyances during processing and complains only when it really counts. Other benefits include support for processing XML and pretty much any other markup.
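As a rough illustration, the sketch below pulls links out of deliberately sloppy HTML. It assumes the current packaging of the library (pip install beautifulsoup4, imported as bs4) and Python's built-in "html.parser" backend; the sample page and the helper name extract_links are made up for this example.

```python
# Hypothetical page: unclosed <h1>, <li> and <p>, and no closing </html>.
BROKEN_HTML = """
<html><body>
<h1>Headlines
<ul>
  <li><a href="/one">First story</a>
  <li><a href="/two">Second story</a>
</ul>
<p>Unclosed paragraph
</body>
"""


def extract_links(html):
    """Return (href, text) pairs for every anchor, however sloppy the markup."""
    from bs4 import BeautifulSoup  # third-party; imported here so the module loads without it

    soup = BeautifulSoup(html, "html.parser")
    return [(a.get("href"), a.get_text(strip=True)) for a in soup.find_all("a")]
```

Calling extract_links(BROKEN_HTML) needs no error handling around it: the unclosed tags are simply absorbed into the parse tree and both links come back.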

One caveat: both are written in Python, which has two drawbacks, albeit workable ones.

Speed: Speed is not a particular strength of interpreted languages like Python, and it's even less of one when you are crunching markup into memory for processing. Well, you can't have everything, can you? But in my experience, the extra processing cycles are well worth the time saved writing a program to process any given markup. Give both tools a try to see for yourself, and don't forget about Moore's Law: processing cycles will get cheaper, while the cost of programming time has always been, and will continue to be, flat.

Tied to Java or .NET: If you have to abide by an IT policy that states you can only use Java or .NET in your projects, or if it's simply your own choice to keep things simple by using one mainstream language, well then you are in luck: you can still use FeedParser and BeautifulSoup (or any Python program, for that matter) in both Java and .NET. If you're into Java, have a look at Jython, and if .NET is your thing, check out IronPython; both will allow you to test out Python without leaving your Java or .NET comfort zone.


Posted by Daniel at September 12, 2007 3:53 PM

