Investigating the Problem Space
After a couple of web searches I found that the request module is a popular way to get the initial content. It effectively replaces concepts like LWP::Simple or System.Net.HttpWebRequest. Next I knew I would want to parse the result, and while regular expressions are powerful and would previously have been my solution of choice, I instead wanted to use whatever the cool kids were using. It turns out there are a lot of examples for the jsdom module, but a pretty decent Stack Overflow article enumerated about 15 options, and in that article I found out that the cheerio module is a jQuery for Node.js.

Being a browser developer I've had a ton of experience with jQuery, mainly debugging why it isn't working and creating patches to make it work. That made cheerio seem like the perfect approach to the problem, assuming it could process the content in the expected way. Cheerio also installed and worked on a raw Windows machine without needing a Python path, which further increased my confidence that it was a good, stand-alone module that wouldn't give me a bunch of setup grief. Time to write some code.
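Before diving in, here's a minimal sketch of how those two modules fit together; the URL is just a placeholder, not one of the pages from the actual scraper.

var request = require('request');
var cheerio = require('cheerio');

// Fetch the raw HTML, then hand it to cheerio for jQuery-style querying.
request('https://example.com/docs/page.html', function (error, response, body) {
  if (error || response.statusCode !== 200) {
    return console.error('Request failed:', error || response.statusCode);
  }
  var $ = cheerio.load(body);
  console.log($('title').text());
});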
Implementing the Solution
To implement my simple scraper I grabbed some documentation pages that my wife has been spending time with lately; might as well process something that could have some use in the future. These are Android documentation pages, most likely auto-generated, meaning they have a fairly regular structure. This is one of the keys to processing web pages: a regular structure of fixed identifiers and classes allows simple jQuery CSS-based queries to find whatever you might need. You start with the structural nodes and then process the individual content within. An example would be something as simple as the following...

// Processes all descendants of the item with id pubmethods
// that have a classname of jd-linkcol
$('#pubmethods').find('.jd-linkcol').each(...);

You can use ANY CSS query though. You can select based on attributes and their values, on presence or absence using the :not() selector syntax, and of course you can have multiple ancestor predicates just like in your style sheets. For a web developer, something like XPath or a plain regular expression just wouldn't be nearly as familiar; maybe more powerful, but certainly not as easy to use. This has always made me love the querySelector API in the browser as well.
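A few of those richer selector forms look something like this in cheerio; the href prefix and the header class here are hypothetical, chosen only to show the syntax.

// Attribute selector: anchors whose href starts with a given path
$('a[href^="/reference/"]').each(function (i, el) {
  console.log($(el).attr('href'));
});

// :not() selector: skip rows that carry a (hypothetical) header class
var dataRows = $('#pubmethods tr:not(.header)');

// Multiple ancestor predicates, just like in a style sheet
var firstLink = $('#pubmethods td.jd-linkcol a').first().text();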
There were some interesting challenges, though. It turns out cheerio is not a compliant HTML5 parser: it doesn't implement the various insertion modes, and it fails to manage the open element stack and the list of active formatting elements. For this reason you may find that malformed documents require you to be more precise. Swapping the find method for the children method can help when elements end up nested where they shouldn't be. This is equivalent to using the child selector (>), which also didn't work as expected in cheerio when combined with the find method.
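To make the difference concrete, here's a sketch assuming a malformed page where a sibling section ended up nested inside #pubmethods after parsing.

// find() matches every descendant, so on a malformed page it can also
// pick up .jd-linkcol nodes from a mis-nested sibling section.
var tooBroad = $('#pubmethods').find('.jd-linkcol');

// children() inspects only immediate children, which keeps the query
// precise even when the parser mishandled the open element stack.
var precise = $('#pubmethods').children('.jd-linkcol');

// In the browser, find('> .jd-linkcol') would be the equivalent child-selector
// form, but as noted above that combination misbehaved in cheerio.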
With that I'll point you at the code. The gist has two revisions. You'll notice in revision 2, shared below, that I added a mechanism for testing multiple URLs and made portions of the code more robust to different page structures. I wish the diff between revisions 1 and 2 were readable, but viewed on GitHub it looks like I deleted all of the content and replaced it wholesale. Still, reading revision 1 and then revision 2 might provide you some additional insights. There are also many additional parsing strategies that I didn't discuss that might be interesting to you.