PhantomJS (an amazing library for web scraping)

Web scraping is an extremely common need when developing any project that involves large amounts of data. At some point during development, you will need to download the data, extract it and store it (either by dumping it into a database or by writing static files). There are many libraries that scrape data (to name one, we have Scrapy from Python). Scrapy and other web scraping frameworks have one major limitation: they don’t come with a headless browser, so they cannot perform/emulate dynamic AJAX queries.

Sample use case: you have two drop-down menus on a page and you need to select values in both of them to generate content. The second drop-down menu is dynamically created on selection of the first. You need to write a script that goes and selects all the choices in both menu1 and menu2.

Here’s where PhantomJS comes into the picture.

Think of PhantomJS as a headless browser that can be controlled by writing JavaScript code. Once you request a page, you have several callback functions in which you can place your code:

1) onPageCreated

2) onResourceRequested

3) onResourceReceived

4) onLoadFinished

These are some of the commonly used callbacks when you are scraping a website. Others are listed in the PhantomJS API documentation.

In addition to providing different callbacks, PhantomJS exposes the entire DOM. So you can manipulate individual elements and trigger an AJAX request. Run the code in a loop to select all the combinations in both your drop-down menus. Tada! Your scraper is ready.

Bonus Section

“Screen Capture”: this is a really cool PhantomJS feature that lets you capture the entire page as a screenshot, exactly as the browser renders it.

How I kicked off GSoC

Zero to hero

What prompted me??

I started my third year thinking I should do something that would set me apart from the rest, and one of my professors suggested I apply for GSoC. I don’t know why, but I took the suggestion rather seriously, thanks to a bet I had with one of my friends (who is about to complete his MBBS) that whoever earned first would buy the other a pair of Ray-Ban shades. Well, that was it. I was determined. I started my research early, probably at the start of February (I knew I wanted to buy my friend his shades, and mine too, in the process).

What experience did I have before??

I started looking at previous years’ GSoC projects (having had little experience with open source) and started learning how to contribute. I was also fascinated by the amount of knowledge one could gain just by googling and browsing web pages. I discovered very soon what an immensely great tool email is, through which I could chat with anyone in the open-source world, ask seemingly stupid questions, and always expect a gentle reply with an answer. Well, that held me spellbound, and I knew I wanted to contribute to open source.

How did I begin??

Around the middle of March, I found that my passion for Python as a programming language had grown, after understanding how easy it is as a language. Added to that, my popularity among my classmates increased when I started evangelizing Python (thanks to my seniors for introducing it; I guess I did a decent job popularizing the language). I started contributing to the PSF (Python Software Foundation), beginning with a simple documentation bug fix; slowly my activity on IRC increased, and I started liking a project that one of the community members had proposed.

A twist in the story??

There I was, still a noob, not knowing how to convince my probable mentor that I could complete the project, given direction. At about this juncture, a fellow student (from some university in France) mailed this particular mentor saying he was interested in the project. Do remember, I was part of the mailing list and followed its happenings. So I was furious to learn that I had competition (having put in so much effort), and I was not willing to give up my project, knowing that this was the one project I actually understood and had started researching a little, too; the other projects required domain knowledge I didn’t have. I went back to my teachers, seniors, friends and Google and started asking the question, “how would I solve the problem the mentor posted?”. I framed a couple of answers, though very noobish, but at least I could reply to the email thread, posting my understanding of the problem and how I would solve it, and also ask the various questions I had in mind. Well, the mentor replied, immediately to my surprise, with comments as well as answers to the questions I had posed. Again, my nemesis/competitor replied back (he had good knowledge of the problem domain). I knew it was not going to be easy. Hence, I went back again through all my sources, deepened my understanding of the problem and posted again. I guess there were about 20 mails in the thread before we (all three of us) decided we should catch up on IRC and discuss more.

The conclusion:

Well, on IRC, most of the senior members of the community were present, and they suggested extending the scope of the project (since two students were interested in one project and showed immense passion). Unsurprisingly, over multiple meetings, the project scope was expanded, both students were given equally important but independent tasks, and both got the opportunity to say they were Google Summer of Code students. Thank goodness we decided to build the project from scratch, giving us more than enough work on our plates.

Movie titles:

1) In the open source world, there is no competition, there is only “COLLABORATION”.

2) Why give up, when you can win??

3) From Zero to Hero!!

4) A prodigy in the making

p.s. I still owe my friend his shades. *sshole, I am still waiting for him to visit me so that I can buy him his shades and buy mine too. Also, I know it’s been two years since this story happened, but it is never too late to share, don’t you agree??

Calculation of Pi

After my initial encounter with OpenCL, I wanted to explore more. So I wrote a small program to calculate Pi.
It uses a very simple Monte Carlo algorithm:

  1. Inscribe a circle in a square
  2. Randomly generate points in the square
  3. Determine the number of points in the square that are also in the circle
  4. Let r be the number of points in the circle divided by the number of points in the square
  5. PI ~ 4 r
  6. Note that the more points generated, the better the approximation
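The steps above can be sketched in plain Python (no OpenCL) as a quick sanity check. This is a minimal standalone version using a unit circle instead of the radius of 100 used in the OpenCL code below; the seed is just for reproducibility:

```python
import random

def estimate_pi(n_points, seed=42):
    # fixed seed so this sketch gives a reproducible estimate
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # does the point fall inside the inscribed unit circle?
        if x * x + y * y <= 1.0:
            inside += 1
    # area ratio circle/square = pi/4, so pi ~ 4 * ratio
    return 4.0 * inside / n_points

print(estimate_pi(1_000_000))
```

The more points you generate, the closer the estimate gets to 3.14159, at a slow 1/sqrt(n) rate.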

This gave the value of Pi as 3.141172. Pretty decent, I suppose :)
 

The code

 

import pyopencl as cl 
import numpy
import random

class circle(object):
    def __init__(self, r=100):
        self.r = r
    def inside_the_circle(self, x, y):
        if x**2 + y**2 <= self.r**2:
           return 1
        else:
           return 0

class CL:
    def __init__(self, size=1000):
        self.size = size
        self.ctx = cl.create_some_context()
        self.queue = cl.CommandQueue(self.ctx)

    def load_program(self):
        f = """
        __kernel void picalc(__global float* a, __global float* b, __global float* c)
        {
         unsigned int i = get_global_id(0);

        if (a[i]*a[i]+ b[i]*b[i] < 100*100)
           {
            c[i] = 1;
           }
        else
           {
            c[i]=0;
           }
        }
        """
        self.program = cl.Program(self.ctx, f).build()

    def popcorn(self):
        mf = cl.mem_flags
        x = [random.uniform(-100,100) for i in range(self.size)]

        y = [random.uniform(-100,100) for i in range(self.size)]

        self.a = numpy.array(x, dtype=numpy.float32)
        self.b = numpy.array(y, dtype=numpy.float32)

        self.a_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.a)
        self.b_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.b)
        self.dest_buf = cl.Buffer(self.ctx, mf.WRITE_ONLY, self.b.nbytes)

    def execute(self):
        self.program.picalc(self.queue, self.a.shape, None, self.a_buf, 
                            self.b_buf, self.dest_buf)
        self.c = numpy.empty_like(self.a)
        cl.enqueue_read_buffer(self.queue, self.dest_buf, self.c).wait()
        
    def calculate_pi(self):
        number_in_circle = 0
        for i in self.c:
            number_in_circle = number_in_circle + i
        # multiply by 4.0 so the division stays floating point under Python 2
        pi = number_in_circle * 4.0 / self.size
        print 'pi = ', pi



if __name__ == '__main__':
    ex = CL(1000000)
    ex.load_program()
    ex.popcorn()
    ex.execute()
    ex.calculate_pi()

 

CPU vs GPU performance comparison with OpenCL

I recently had the opportunity to explore an awesome library called OpenCL (Open Computing Language), which lets me write programs that use the computation power of my graphics card. I wanted to see how much faster a simple program (element-wise addition of two arrays) would run if I parallelized it using OpenCL.

Source
Using OpenCL

import pyopencl as cl
import numpy
import sys

class CL(object):
    def __init__(self, size=10):
        self.size = size
        self.ctx = cl.create_some_context()
        self.queue = cl.CommandQueue(self.ctx)

    def load_program(self):
        fstr = """
        __kernel void part1(__global float* a, __global float* b, __global float* c)
        {
            unsigned int i = get_global_id(0);
            c[i] = a[i] + b[i];
        }
        """
        self.program = cl.Program(self.ctx, fstr).build()

    def popCorn(self):
        mf = cl.mem_flags

        # must be float32: the kernel's float* arguments expect 32-bit floats
        self.a = numpy.array(range(self.size), dtype=numpy.float32)
        self.b = numpy.array(range(self.size), dtype=numpy.float32)

        self.a_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
                               hostbuf=self.a)
        self.b_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
                               hostbuf=self.b)
        self.dest_buf = cl.Buffer(self.ctx, mf.WRITE_ONLY, self.b.nbytes)

    def execute(self):
        self.program.part1(self.queue, self.a.shape, None, self.a_buf, self.b_buf, self.dest_buf)
        c = numpy.empty_like(self.a)
        cl.enqueue_read_buffer(self.queue, self.dest_buf, c).wait()
        print "a", self.a
        print "b", self.b
        print "c", c

if __name__ == '__main__':
    matrixmul = CL(10000000)
    matrixmul.load_program()
    matrixmul.popCorn()
    matrixmul.execute()

Normal program without Optimization

def add(size=10):
    a = tuple([float(i) for i in range(size)])
    b = tuple([float(j) for j in range(size)])
    c = [None for i in range(size)]
    for i in range(size):
        c[i] = a[i]+b[i]

    #print "a", a
    #print "b", b
    print "c", c[:1000]

add(1000000)

I compared the performance of both programs using the “time” tool available on Linux, noting down the “sys” time.
Here’s the comparison:

 

Size        Using GPU    Without GPU (i.e. CPU)
100         0.130s       0.030s
1000        0.100s       0.010s
10000       0.130s       0.010s
100000      0.150s       0.050s
1000000     0.170s       0.150s
10000000    0.600s       1.150s

 
Clearly, the GPU outperforms the CPU at larger sizes, since the program can spread the work across the many parallel threads the GPU provides. At smaller sizes there is an appreciable fixed overhead in setting up and transferring data to the GPU, so the CPU is faster.
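The same fixed-overhead-versus-throughput trade-off can be sketched on the CPU alone by comparing a per-element Python loop with NumPy's vectorized addition. This is only an analogy to the GPU numbers above (vectorization stands in for parallel execution), not a reproduction of them:

```python
import timeit
import numpy

def add_loop(a, b):
    # per-element Python loop: tiny fixed cost, large per-element cost
    return [a[i] + b[i] for i in range(len(a))]

def add_vectorized(a, b):
    # one vectorized call: the fixed cost is amortized over all elements
    return a + b

size = 100_000
a = numpy.arange(size, dtype=numpy.float32)
b = numpy.arange(size, dtype=numpy.float32)

t_loop = timeit.timeit(lambda: add_loop(a, b), number=3)
t_vec = timeit.timeit(lambda: add_vectorized(a, b), number=3)
print("loop: %.4fs  vectorized: %.4fs" % (t_loop, t_vec))
```

At this size the vectorized version wins by a wide margin; shrink `size` to a handful of elements and the gap largely disappears, mirroring the crossover in the table above.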

Simple query request with Wolfram api

Wolfram Alpha is a really good website if you want to perform some computation or ask a basic query like “distance between Paris and London”. So I wanted to test out the Wolfram API (alpha).
The following program gets a query from the user and retrieves the result. Of course, you will need an app ID, which can be obtained from the wolfram.com website.

import sys
import urllib2
import urllib
import httplib
from xml.etree import ElementTree as etree

class wolfram(object):
    def __init__(self, appid):
        self.appid = appid
        self.base_url = 'http://api.wolframalpha.com/v2/query?'
        self.headers = {'User-Agent':None}

    def _get_xml(self, ip):
        url_params = {'input':ip, 'appid':self.appid}
        data = urllib.urlencode(url_params)
        req = urllib2.Request(self.base_url, data, self.headers)
        xml = urllib2.urlopen(req).read()
        return xml

    def _xmlparser(self, xml):
        data_dics = {}
        tree = etree.fromstring(xml)
        #retrieving every tag with label 'plaintext'
        for e in tree.findall('pod'):
            for item in [ef for ef in list(e) if ef.tag=='subpod']:
                for it in [i for i in list(item) if i.tag=='plaintext']:
                    if it.tag=='plaintext':
                        data_dics[e.get('title')] = it.text
        return data_dics

    def search(self, ip):
        xml = self._get_xml(ip)
        result_dics = self._xmlparser(xml)
        #return result_dics 
        #print result_dics
        print result_dics['Result']

if __name__ == "__main__":
    appid = sys.argv[1]   # sys.argv[0] is the script name itself
    query = sys.argv[2]
    w = wolfram(appid)
    w.search(query)
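The nested loops in `_xmlparser` can be exercised without an app ID by feeding ElementTree a canned response; the XML below is a hypothetical, trimmed-down stand-in for what the API returns, and the path syntax walks the same pod/subpod/plaintext nesting:

```python
from xml.etree import ElementTree as etree

# hypothetical, trimmed-down sample of a Wolfram|Alpha <queryresult>
sample = """
<queryresult>
  <pod title="Input interpretation">
    <subpod><plaintext>distance | Paris to London</plaintext></subpod>
  </pod>
  <pod title="Result">
    <subpod><plaintext>214.9 miles</plaintext></subpod>
  </pod>
</queryresult>
"""

tree = etree.fromstring(sample)
# 'subpod/plaintext' expresses the same nesting as the loops in _xmlparser
data = {pod.get('title'): pod.find('subpod/plaintext').text
        for pod in tree.findall('pod')}
print(data['Result'])
```

Parsing a saved response like this is also a handy way to debug the extraction logic without burning through API calls.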
      

Pycon India

My first visit to any PyCon conference (this time in Pune, India) happened last week. It was really great (as expected!!!). Incidentally, I gave a talk there too, on PyTI (Python Testing Infrastructure). I got to make a few more friends and learnt some interesting stuff there. One talk that is definitely worth watching is the keynote by Raymond Hettinger (again, no surprises here, as the keynote is the soul of any conference). I had a great time in Pune, and also in Mumbai, which I visited on Sunday :)

VDFuse replacing GuestFS

Just a week ago, we decided to replace GuestFS with VDFuse. Though GuestFS has certain advantages over VDFuse, in terms
of support for several file formats (which makes it easy to extend to other formats) and also its Python bindings, we faced a lot
of problems working with it. Some of them are:

  • Installing GuestFS is a big headache, as there are a lot of dependencies to install, and even then it does not always
    work. This might turn out to be one reason for developers not to try out PyTI.
  • GuestFS is not very stable. There were many times it did not work, and a couple of occasions when it hung for no apparent reason.
  • It takes a lot of time to initialize GuestFS for reading from or writing to the disk.

So, as I mentioned, we moved to VDFuse for now. It avoids the difficulties of GuestFS and is a decent solution for
reading from and writing to the disk.