Tuesday, April 03, 2012

De-google-izing search results

This keeps happening to me so I wrote a script to deal with it:
  • Google search on something, say "measuring software modularity"
  • See some PDF files in the results
  • "Open in new tab" one of those PDF results
  • The new tab has something that looks like
    http://www.google.com/url?sa=t&rct=j&q=measuring%20software%20modularity&source=web&cd=6&ved=0CGEQFjAF&url=http%3A%2F%2Frise.cs.drexel.edu%2F~sunny%2Fpapers%2Facom08_drh.pdf&ei=CC57T-XGBYb20gGJ7eSvBg&usg=AFQjCNF4WWHzd2WFkx-6_YupSut7Z4XVNA&cad=rja
  • A "what to do with this file?" dialog box appears, suggesting
    • Save the file;
    • Open with something (Acroread, Preview, okular, kpdf, xpdf, etc.)
And when this is done, what you've got is the ridiculous URL above.

What I want instead of the above monstrosity is simply http://rise.cs.drexel.edu/~sunny/papers/acom08_drh.pdf.

Here's some code that'll do that. I have Python 2.3.something on this Mac OS X 10.4 PowerBook (from 2006) and...

#!/usr/bin/python -utt
# vim:et
'''Given google-ized URLs on stdin, one per line, print the URL of interest for each.
Example input:
    http://www.google.com/url?sa=t&rct=j&q=measuring%20software%20modularity&source=web&cd=2&ved=0CEUQFjAB&url=http%3A%2F%2Fwww2.dbd.puc-rio.br%2Fpergamum%2Ftesesabertas%2F0410867_08_cap_02.pdf&ei=CC57T-XGBYb20gGJ7eSvBg&usg=AFQjCNEEhsr8h5IOUWtKMoIMk7eMSdi41A&cad=rja

Example output:
    http://www2.dbd.puc-rio.br/pergamum/tesesabertas/0410867_08_cap_02.pdf'''

import sys
import urllib

USTART = '&url='
UEND = '&'

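def main(infile):
    for aline in infile:
        aline = aline.rstrip()
        # Locate the start of the percent-encoded target URL.
        ustart_at = aline.find(USTART)
        if ustart_at < 0:
            # Report on stderr so stdout carries only extracted URLs.
            print >>sys.stderr, "Can't find '%s'; ignoring" % USTART
            continue
        url_start = ustart_at + len(USTART)
        # The value runs to the next '&', or to end of line if there is none.
        url_end = aline.find(UEND, url_start)
        if url_end == -1:
            url_end = None
        print urllib.unquote(aline[url_start:url_end])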

if __name__ == '__main__':
    main(sys.stdin)
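
For anyone on a newer Python, here is a rough sketch of the same idea for Python 3. This is not the script above, just one possible variant: it assumes Python 3's urllib.parse module, and the function name extract_target is made up for illustration. Instead of searching for '&url=' by hand, it lets parse_qs split the query string, so it should also cope with a link where url happens to be the first parameter. One difference to keep in mind: parse_qs turns a '+' in a value into a space, which urllib.unquote does not do.

```python
import sys
from urllib.parse import urlsplit, parse_qs


def extract_target(google_url):
    """Return the decoded 'url' query parameter, or None if it is absent.

    extract_target is a hypothetical helper name, not part of the original script.
    """
    query = urlsplit(google_url.strip()).query
    # parse_qs splits the query on '&' and percent-decodes each value.
    values = parse_qs(query).get('url')
    if values:
        return values[0]
    return None


def main(infile):
    """Print the target URL for each google-ized URL read from infile."""
    for line in infile:
        target = extract_target(line)
        if target is None:
            # Complaints go to stderr so stdout holds only extracted URLs.
            print("No 'url' parameter found; ignoring", file=sys.stderr)
            continue
        print(target)
```

To use it as a filter like the original, call main(sys.stdin) at the bottom of the file and pipe the google-ized URLs in, one per line.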
 
