PhantomJS (Amazing library for web scraping)

Web scraping comes up in almost any project that involves a large amount of data. At some point during development, you will need to download the data, extract it and store it (either by dumping it into a database or by saving it to static files). There are many libraries that scrape data (to name one, Scrapy for Python). Scrapy and similar web scraping frameworks have one major limitation: they don’t come with a headless browser, so on their own they cannot run a page’s JavaScript or emulate the dynamic Ajax queries it makes.

Sample use case: You have two drop-down menus on a page and you need to select values in both of them to generate content. The second drop-down menu is created dynamically when a value is selected in the first. You need to write a script that selects every choice in both menu1 and menu2.

Here’s where PhantomJS comes into the picture.

Think of PhantomJS as a headless browser that can be controlled by writing JavaScript code. Once you request a page, there are several callback functions where you can place your code.

1) onPageCreated

2) onResourceRequested

3) onResourceReceived

4) onLoadFinished

These are some of the commonly used callback functions when you are scraping a website. The others are listed in the PhantomJS documentation for the webpage module.
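To make the callbacks concrete, here is a minimal sketch of how they could be wired up to log what a page is doing. The helper name attachLogging, its log and done parameters, and the URL http://example.com/ are assumptions made for illustration; only the four callback names come from PhantomJS itself. The final block is guarded so that it only runs inside PhantomJS.

```javascript
// Attach logging callbacks to a PhantomJS page object.
// `attachLogging`, `log` and `done` are hypothetical names for this sketch.
function attachLogging(page, log, done) {
    // Fired for every resource (HTML, CSS, JS, XHR) the page asks for.
    page.onResourceRequested = function (requestData, networkRequest) {
        log('requested: ' + requestData.url);
    };
    // Fired as a response arrives; stage is 'start' or 'end'.
    page.onResourceReceived = function (response) {
        if (response.stage === 'end') {
            log('received: ' + response.status + ' ' + response.url);
        }
    };
    // Fired when loading finishes; status is 'success' or 'fail'.
    page.onLoadFinished = function (status) {
        log('load finished: ' + status);
        if (done) {
            done(status);
        }
    };
    // Fired when the page opens a child window (for example via window.open).
    page.onPageCreated = function (newPage) {
        log('child page created');
    };
    return page;
}

// Only start a real browser when running under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = attachLogging(
        require('webpage').create(),
        function (message) { console.log(message); },
        function (status) { phantom.exit(status === 'success' ? 0 : 1); }
    );
    page.open('http://example.com/'); // placeholder URL
}
```

Saved as, say, callbacks.js, this would be run with "phantomjs callbacks.js". Watching the onResourceRequested output is often the quickest way to find out which Ajax URL a drop-down actually calls.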

In addition to these callbacks, PhantomJS exposes the page’s entire DOM to your script (through page.evaluate). So you can manipulate individual elements and trigger an Ajax request. Run the code in a loop to select all the combinations in both drop-down menus. Tada! Your scraper is ready.
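The two-menu use case could be sketched roughly as follows. Everything specific here is an assumption: the selectors #menu1 and #menu2, the URL, the helper names, and above all the fixed 500 ms pause used to let the Ajax call repopulate the second menu (a real scraper would more likely poll until the options change). The sketch only collects the value pairs; selecting each menu2 value and reading the generated content would follow the same pattern.

```javascript
// Runs inside the page: list the option values of a <select>.
function getOptionValues(selector) {
    var select = document.querySelector(selector);
    var values = [];
    for (var i = 0; i < select.options.length; i++) {
        values.push(select.options[i].value);
    }
    return values;
}

// Runs inside the page: choose a value and fire the 'change' event
// so the page's own handler issues its Ajax request.
function selectValue(selector, value) {
    var select = document.querySelector(selector);
    select.value = value;
    var evt = document.createEvent('HTMLEvents');
    evt.initEvent('change', true, false);
    select.dispatchEvent(evt);
}

// Walk every menu1 value, wait for menu2 to refresh, record each pair.
// `wait(callback)` is a hypothetical hook that calls back once menu2 is ready.
function scrapeCombinations(page, wait, done) {
    var results = [];
    var firstValues = page.evaluate(getOptionValues, '#menu1');

    function next(i) {
        if (i >= firstValues.length) {
            done(results);
            return;
        }
        page.evaluate(selectValue, '#menu1', firstValues[i]);
        wait(function () {
            var secondValues = page.evaluate(getOptionValues, '#menu2');
            for (var j = 0; j < secondValues.length; j++) {
                results.push([firstValues[i], secondValues[j]]);
            }
            next(i + 1);
        });
    }
    next(0);
}

// Only start a real browser when running under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://example.com/form', function (status) { // placeholder URL
        if (status !== 'success') {
            phantom.exit(1);
            return;
        }
        scrapeCombinations(
            page,
            function (callback) { setTimeout(callback, 500); }, // crude fixed delay
            function (results) {
                console.log(JSON.stringify(results));
                phantom.exit();
            }
        );
    });
}
```

Note that functions passed to page.evaluate run inside the page, not in the PhantomJS script, so they cannot see outer variables; values have to be passed in as arguments, as done with the selector and value above.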

Bonus Section

“Screen Capture”: This is a really cool PhantomJS feature that lets you capture the whole page as a screenshot, exactly as rendered by the browser.
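A screen capture could look something like the sketch below, which uses the page’s render method. The helper name captureScreenshot, the viewport size, the URL and the output file name example.png are all assumptions chosen for illustration.

```javascript
// Open a URL and save the rendered page as an image.
// `captureScreenshot` and its parameters are hypothetical names for this sketch.
function captureScreenshot(page, url, outputFile, done) {
    // Width of the virtual browser window; the full page height is still captured.
    page.viewportSize = { width: 1280, height: 800 };
    page.open(url, function (status) {
        var ok = status === 'success';
        if (ok) {
            // The image format is taken from the file extension.
            page.render(outputFile);
        }
        done(ok);
    });
}

// Only start a real browser when running under PhantomJS.
if (typeof phantom !== 'undefined') {
    captureScreenshot(
        require('webpage').create(),
        'http://example.com/', // placeholder URL
        'example.png',         // placeholder output file
        function (ok) { phantom.exit(ok ? 0 : 1); }
    );
}
```

Screenshots are also handy while debugging a scraper: rendering the page just before a step fails shows what the headless browser was actually looking at.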
