collin park: De-google-izing search results

Tuesday, April 03, 2012

De-google-izing search results

This keeps happening to me so I wrote a script to deal with it:

1. Google search on something, say "measuring software modularity".
2. See some PDF files in the results.
3. "Open in new tab" one of those PDF results.
4. The new tab has something that looks like
   http://www.google.com/url?sa=t&rct=j&q=measuring%20software%20modularity&source=web&cd=6&ved=0CGEQFjAF&url=http%3A%2F%2Frise.cs.drexel.edu%2F~sunny%2Fpapers%2Facom08_drh.pdf&ei=CC57T-XGBYb20gGJ7eSvBg&usg=AFQjCNF4WWHzd2WFkx-6_YupSut7Z4XVNA&cad=rja
5. A "what to do with this file?" dialog box appears, suggesting
   - Save the file;
   - Open with something (Acroread, Preview, okular, kpdf, xpdf, etc.)

And what you've got when this is done is the ridiculous URL above.

What I want instead of the above monstrosity is simply http://rise.cs.drexel.edu/~sunny/papers/acom08_drh.pdf.

Here's some code that'll do that. I have Python 2.3.something on this Mac OS X 10.4 PowerBook (from 2006) and...

#!/usr/bin/python -utt
# vim:et
'''Given a google-ized URL on stdin return the URL of interest.
Example input:
    http://www.google.com/url?sa=t&rct=j&q=measuring%20software%20modularity&source=web&cd=2&ved=0CEUQFjAB&url=http%3A%2F%2Fwww2.dbd.puc-rio.br%2Fpergamum%2Ftesesabertas%2F0410867_08_cap_02.pdf&ei=CC57T-XGBYb20gGJ7eSvBg&usg=AFQjCNEEhsr8h5IOUWtKMoIMk7eMSdi41A&cad=rja

Example output:
    http://www2.dbd.puc-rio.br/pergamum/tesesabertas/0410867_08_cap_02.pdf'''

import sys
import urllib

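# The real address is the value of the "url" query parameter: it starts
# right after '&url=' and runs to the next '&' (or to the end of the line).
USTART = '&url='
UEND = '&'

def main(infile):
    for aline in infile:
        aline = aline.rstrip()
        ustart_at = aline.find(USTART)
        if ustart_at < 0:
            # Complain on stderr so stdout carries only decoded URLs.
            print >>sys.stderr, "Can't find '%s'; ignoring" % USTART
            continue
        url_start = ustart_at + len(USTART)
        url_end = aline.find(UEND, url_start)
        if url_end == -1:
            # No trailing '&': the value runs to the end of the line.
            url_end = None
        # Undo the percent-encoding (%3A -> ':', %2F -> '/', and so on).
        print urllib.unquote(aline[url_start:url_end])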

if __name__ == '__main__':
    main(sys.stdin)
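
For anyone on a newer machine, here is a rough sketch of the same idea for Python 3, where the standard library's urllib.parse can split the query string instead of searching for '&url=' by hand. This is only one way to do it, and it assumes a Python 3 interpreter is available; the function name degoogle is made up for this example. One side effect of parsing the query properly is that it also copes with "url" being the first parameter (right after the '?'), which the string search above would miss.

```python
from urllib.parse import urlsplit, parse_qs


def degoogle(redirect_url):
    """Return the target of a google.com/url?... link, or None.

    'degoogle' is a hypothetical name chosen for this sketch.
    """
    # Everything after the '?' in the redirect link.
    query = urlsplit(redirect_url.strip()).query
    # parse_qs splits on '&' and undoes the percent-encoding,
    # giving a dict that maps each parameter name to a list of values.
    targets = parse_qs(query).get('url')
    if targets:
        return targets[0]
    return None
```

Calling degoogle() on the long URL from the steps above should give back the plain rise.cs.drexel.edu address; a line with no "url" parameter gives None, so the caller can decide whether to complain or skip it.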
