Tuesday, May 18, 2010

Website secrets! Extracting lists keyed off a selection widget

The lovely Carol wanted to get a list of the staff at our church for a prayer-related event, so she went to the website to find a page with a selection widget -- you can choose a department and see its staff. Since she wanted the list to be organized by department, one could imagine selecting each department in turn, then using the mouse to snarf'n'barf the staff names for the list. Two problems with that:
  • For some reason or another, we can't "select" the names on those pages with the mouse. I suspect JavaScript, but can't really tell.
  • No fewer than twenty (20) departments!
I don't want to defend the number of departments here; what I do want to do is share how I overcame all of that to produce the list that the lovely Carol wanted to facilitate the prayer meeting.

First, I wanted to get the source of the overall staff page; I used "lynx -dump -source", but "wget -O -" will probably do if you don't have lynx. The selection widget had a single very long line of the form

<select name="field_group_xyz" class="form-select" id="edit-field-group-xyz" > <option value="All" selected="selected">&lt;Any&gt;</option><option value="123">Algebra</option><option value="456">Analytic Geometry</option><option value="789">Hilbert&#39;s Tenth Problem</option> [...lots more...] </select>
(I changed the numbers and words because I want to show the technique, rather than our church's actual department names and codes.) Here's that line reformatted:
<select name="field_group_xyz" class="form-select" id="edit-field-group-xyz" > 
    <option value="All" selected="selected">&lt;Any&gt;</option> 
    <option value="123">Algebra</option> 
    <option value="456">Analytic Geometry</option> 
    <option value="789">Hilbert&#39;s Tenth Problem</option> 
    [...lots more...]
</select>
A couple of observations: first, we have "nice" one-word department names like "Algebra", but we also have two- and even three-word department names. Then there's "Hilbert&#39;s Tenth Problem" -- what's with "&#39;"? Well, that's just an HTML character code -- it's an apostrophe. More on that later.

Now from playing with the selection widget, I found that when I selected the "Algebra" department in the pull-down, I got http://www.something.org/about-us/staff?field_group_nid=123 so I concluded that one could simply append the department code to "field_group_nid=" to get its staff list.

This was good; I could write a loop that would grab the various department names (Algebra, Analytic Geometry, etc.) and print those out, then do whatever processing I needed on the corresponding departmental staff pages. I didn't want the "Any" page so I got rid of that using "grep -v Any". So here's the beginning of our answer:

lynx -dump -source "$STAFFURL" | grep -m1 "<select" | tr '/' '\n' | grep -F value= | grep -v Any
A word about that funny "tr" command: I noticed that the only "/" characters in that line were the ones in the closing tags, separating the body of each entry from its closing "</option>". To make it easy to grab each part, I wanted each entry on its own line, and I didn't care about the words "<option>" or "</option>", so I just divided the line at each "/" character. Once the two greps had dropped the pieces without "value=" and the "Any" entry, the output looked like this:
option><option value="123">Algebra<
option><option value="456">Analytic Geometry<
option><option value="789">Hilbert&#39;s Tenth Problem<
        [...lots more...]
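Here is a minimal offline sketch of that splitting step, for anyone who wants to try it without hitting a website. The sample line is the made-up widget from above, cut short, and split_options is a hypothetical helper name chosen just for this illustration.

```shell
#!/bin/sh
# Sample data only: a shortened stand-in for the real <select> line.
sample='<select name="field_group_xyz" class="form-select" id="edit-field-group-xyz" > <option value="All" selected="selected">&lt;Any&gt;</option><option value="123">Algebra</option><option value="456">Analytic Geometry</option><option value="789">Hilbert&#39;s Tenth Problem</option></select>'

# split_options (hypothetical helper): read HTML on stdin and print one
# line per department option, leaving out the "Any" entry.
split_options() {
    grep -m1 '<select' |   # keep only the first line containing the widget
    tr '/' '\n' |          # break that line at every "/"
    grep -F 'value=' |     # keep the pieces that carry an option value
    grep -v 'Any'          # discard the catch-all entry
}

printf '%s\n' "$sample" | split_options
```

One caveat with this approach: "grep -v Any" would also throw away any real department whose name happens to contain "Any", so a stricter pattern might be needed on other sites.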
So I could pick out the value (the department code) and then, later on each line, the department name. And while I was at it, fix the funky HTML character code and just display an apostrophe, too. Something like this would do it:
  sed -e 's/<$//' -e 's/^.*value="//' -e 's/">/ /' -e "s/&#0*39;/'/g"
When applied to the output above, the result is this:
123 Algebra 
456 Analytic Geometry 
789 Hilbert's Tenth Problem
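The cleanup can be tried on its own with a few sample lines. This is only a sketch: clean_options is a hypothetical helper name, and the "&#0*39;" pattern is an assumption on my part so that both the "&#39;" and "&#039;" spellings of the apostrophe code are handled.

```shell
#!/bin/sh
# clean_options (hypothetical helper): turn lines such as
#   option><option value="123">Algebra<
# into "123 Algebra", and decode the apostrophe character code.
clean_options() {
    sed -e 's/<$//' \
        -e 's/^.*value="//' \
        -e 's/">/ /' \
        -e "s/&#0*39;/'/g"
}

# Sample input: the pieces produced by the tr/grep step.
clean_options <<'EOF'
option><option value="123">Algebra<
option><option value="456">Analytic Geometry<
option><option value="789">Hilbert&#39;s Tenth Problem<
EOF
```

The four expressions run in order on each line: strip the trailing "<", strip everything through value=", turn the "> between code and name into a space, then fix the apostrophe.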
So we want to split the department code (e.g., 456) from the department name (e.g., "Analytic Geometry"), but the name might have 1, 2, or 3 words. How to deal with that?

Since I'm using bash, I decided to just provide two variable names to the "read" builtin. The bash manpage description of read says that the...

      first word is assigned to the first name, the second word to the
      second name, and so on, with leftover words and their intervening
      separators assigned to the last name.
So if I pipe "789 Hilbert's Tenth Problem" to "read DCODE DNAME" then I'd get $DCODE=789 and $DNAME="Hilbert's Tenth Problem". If we were writing this in Python, we could use "split" with a limit, like this:
>>> "789 Hilbert's Tenth Problem".split(" ", 1) 
['789', "Hilbert's Tenth Problem"]
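The shell equivalent can be checked with a small sketch like the one below. It feeds one sample line to read through a here-document rather than a pipe; that is a choice made for this illustration, not something taken from the original script.

```shell
#!/bin/sh
# Feed one sample line to read: the first word goes to DCODE and
# everything left over (spaces included) goes to DNAME.
read -r DCODE DNAME <<'EOF'
789 Hilbert's Tenth Problem
EOF

echo "code: $DCODE"
echo "name: $DNAME"
```

In bash, a plain "echo ... | read DCODE DNAME" runs read in a subshell by default, so the variables are gone once the pipeline ends. Using them inside the body of a "while read" loop, or reading from a here-document as above, avoids that.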
Anyway, our script could print the department name (which might be multiple words), then use the code for each department to get a department page. A typical department page would include these lines:
<h3>René Descartes</h3>
<h3>Euclid</h3>
<h3>H.S.M. Coxeter</h3>
-- those were the names we wanted, and the "<h3>"s appeared at the beginning of each line.

Hence, we could just look for "<h3>" at the start of each line, then take the bytes between the "<h3>" and the corresponding "</h3>" to get the names that we want. What I actually wrote was:
grep '^<h3>' | cut '-d>' -f2 | cut '-d<' -f1 | sed 's/^/        /'
The grep says to take only the lines that begin with "<h3>"; the first cut eliminates everything up to and including the first ">" on each line. The second cut deletes the "<" and everything after it. The sed inserts several " " characters at the front of each line, so the net effect would be something like this:
        René Descartes 
        Euclid 
        H.S.M. Coxeter
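To try the name extraction without fetching anything, here is a minimal sketch that runs the same four commands over a made-up page. The surrounding HTML lines and the helper name extract_names are assumptions for illustration; only the "<h3>" lines mirror the real pages.

```shell
#!/bin/sh
# extract_names (hypothetical helper): read a department page on stdin and
# print the text of each line-leading <h3> element, indented by 8 spaces.
extract_names() {
    grep '^<h3>' |         # only lines that start with <h3>
    cut -d'>' -f2 |        # drop everything through the first ">"
    cut -d'<' -f1 |        # drop the "<" of </h3> and what follows
    sed 's/^/        /'    # indent the name
}

# Made-up sample page; a real page has much more around the names.
extract_names <<'EOF'
<html><body>
<h2>Analytic Geometry</h2>
<h3>René Descartes</h3>
<h3>Euclid</h3>
<p>Not a name</p>
<h3>H.S.M. Coxeter</h3>
</body></html>
EOF
```

This relies on each name sitting alone on a line that starts with "<h3>". A page that indents its markup or nests other tags inside the heading would need a different pattern.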
Putting it all together, it looked like this:
STAFFURL="http://www.something.org/about-us/staff"
DEPTURL="http://www.something.org/about-us/staff?field_group_nid="
# Take the <select> line, put each option on its own line, drop the "Any" entry
lynx -dump -source "$STAFFURL" | grep -m1 "<select" | tr '/' '\n' |
  grep -F value= | grep -v Any |
  sed -e 's/<$//' -e 's/^.*value="//' -e 's/">/ /' -e "s/&#0*39;/'/g" |
while read -r DCODE DNAME; do
    echo "$DNAME"
    # Fetch this department's page and indent the names found in its <h3> lines
    lynx -dump -source "${DEPTURL}${DCODE}" |
      grep '^<h3>' | cut '-d>' -f2 | cut '-d<' -f1 | sed 's/^/        /'
done
and the output is like
Algebra 
        George Boole 
        Somebody Else 
Analytic Geometry 
        René Descartes 
        Euclid 
        H.S.M. Coxeter 
Hilbert's Tenth Problem 
        David Hilbert 
        Bertrand Russell 
        Alonzo Church 
which is just what we're after.

Well, that was fun! Useful, too. I hope you enjoyed that as much as I did. Please let me know if you have any questions.
