Monday, January 5, 2015

Explorations in Node: Using the Request and Cheerio Modules

The web keeps growing at a remarkable pace, so there is a practically limitless amount of data available, and most of it is transmitted in clear text for you to process as you see fit. Traditionally I would use a language like Perl or C#, since both come equipped with plenty of web-friendly request classes, but today I decided to fumble through a bit of Node.js instead.

Investigating the Problem Space

After a couple of web searches I found that the request module is a popular way to fetch the initial content. It effectively replaces concepts like LWP::Simple or System.Net.HttpWebRequest. Next, I knew I would want to parse the result. Regular expressions are powerful and would previously have been my solution of choice, but this time I wanted to use whatever the cool kids were using. There are a lot of examples for the jsdom module, but a pretty decent Stack Overflow post enumerated about 15 options, and that is where I found out that the cheerio module is essentially jQuery for Node.js.

Since I'm a browser developer, I've had a ton of experience with jQuery, mainly debugging why it isn't working and creating patches to make it work. That made cheerio seem like the perfect approach to the problem, assuming it could process the content in the expected way. Cheerio also installed and worked on a raw Windows machine without my having to provide a Python path, which further increased my confidence that it was a good, stand-alone module that wouldn't give me a bunch of setup grief. Time to write some code.

Implementing the Solution

To implement my simple scraper I grabbed some documentation pages that my wife has been spending time with lately. Might as well process something that could be of some use in the future. These are Android documentation pages, most likely automatically generated, which means they have a fairly regular structure. This is one of the keys to processing web pages: the regular structure of fixed identifiers and classes allows simple jQuery-style CSS queries to find whatever you might need. You start with the structural nodes and then process the individual content within. An example would be something as simple as the following:
// Processes all descendants of the element with id pubmethods
// that have a class name of jd-linkcol
$('#pubmethods').find('.jd-linkcol').each(...);
You can use just about ANY CSS query though. You can select based on attributes and their values, test for presence or absence using the :not() selector syntax, and of course use multiple ancestor predicates just like in your style sheets. For a web developer, something like XPath or a standard regular expression just wouldn't be nearly as familiar: maybe more powerful, but certainly not as easy to use. This has always made me love the querySelector API in the browser as well.
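
As a rough illustration of the kinds of queries described above, here is a minimal sketch. It assumes `$` is the function returned by `cheerio.load(html)`. The `#pubmethods` id and `jd-linkcol` class come from the snippet above, while the function names and the `href` filter are hypothetical choices for illustration, not what the gist does.

```javascript
// Sketch only: `$` is assumed to be the function returned by cheerio.load(html).
// The id and class come from the snippet above; the function names are made up.

// Ancestor predicate plus class: intended to match the same elements as
// $('#pubmethods').find('.jd-linkcol').
function listLinkColumnText($) {
  const names = [];
  $('#pubmethods .jd-linkcol').each(function (index, element) {
    names.push($(element).text().trim());
  });
  return names;
}

// Attribute selector combined with :not(): links inside #pubmethods that
// have an href which does not start with "http" (a hypothetical filter).
function listRelativeLinks($) {
  return $('#pubmethods a[href]:not([href^="http"])')
    .map(function (index, element) {
      return $(element).attr('href');
    })
    .get();
}
```

With cheerio installed, both functions would be called with a loaded document, for example `listLinkColumnText(cheerio.load(html))`. Passing `$` in as a parameter keeps the query logic separate from however the page was fetched.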

There were some interesting challenges though. It turns out cheerio is not a compliant HTML5 parser. It doesn't know how to handle the various insertion modes, and it fails at managing the stack of open elements and the list of active formatting elements. For this reason you may find that malformed documents require you to be more precise. Swapping the find method for the children method can help when elements end up nested where they shouldn't be, because children only matches direct children. This is equivalent to using the child selector (>), which also didn't work as expected in cheerio when used with the find method.
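
The swap from find to children can be sketched roughly as follows. Again `$` is assumed to be the function returned by `cheerio.load(html)`, and the function name and its parameters are hypothetical, chosen only for illustration.

```javascript
// Sketch only: `$` is assumed to be the function returned by cheerio.load(html).
// Collects the text of the direct children of `rootSelector` that match
// `childSelector`, ignoring matches nested deeper in the tree.
function directChildText($, rootSelector, childSelector) {
  const texts = [];
  // .children() looks exactly one level down, whereas .find() would also
  // return matches nested at any depth below the root.
  $(rootSelector).children(childSelector).each(function (index, element) {
    texts.push($(element).text().trim());
  });
  return texts;
}
```

A call such as `directChildText($, '#toc', 'li')` (with a made-up `#toc` id) would return only the top-level list items, even if a badly nested list put further `li` elements underneath them.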

With that I'll point you at the code. The gist has been revised once. You'll notice in revision 2, shared below, that I added a mechanism for testing multiple URLs and also made portions of the code more robust to different page structures. I wish the diff between revisions 1 and 2 were more useful, but when viewed on GitHub it looks like I deleted all of the content and completely replaced it. Reading version 1 and then version 2 might still provide some additional insights. There are also many additional parsing strategies that I didn't discuss that might be interesting to you.
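
For readers who want the general shape of a multiple-URL test mechanism without opening the gist, here is a minimal sketch. It is not the gist's code. `scrapeAll`, `fetchPage`, and `parsePage` are hypothetical names, and the fetcher is passed in so the loop itself does not depend on any particular HTTP module.

```javascript
// Sketch only: not the gist's code. `fetchPage(url, callback)` is a
// hypothetical wrapper expected to call back with (error, html), and
// `parsePage(html, url)` is whatever extraction logic you want to test.
function scrapeAll(urls, fetchPage, parsePage, done) {
  const results = [];
  let remaining = urls.length;
  if (remaining === 0) {
    return done(results);
  }
  urls.forEach(function (url, index) {
    fetchPage(url, function (error, html) {
      // Record failures instead of throwing so one bad page does not
      // hide the results from the others.
      if (error) {
        results[index] = { url: url, error: error };
      } else {
        try {
          results[index] = { url: url, data: parsePage(html, url) };
        } catch (parseError) {
          results[index] = { url: url, error: parseError };
        }
      }
      remaining -= 1;
      if (remaining === 0) {
        done(results);
      }
    });
  });
}
```

With the request module, `fetchPage` could be as small as `function (url, cb) { request(url, function (error, response, body) { cb(error, body); }); }`, and `parsePage` could call `cheerio.load(html)` before running its queries. Results are stored by index, so they come back in the same order as the input URLs even if the responses arrive out of order.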

Code
