- For some reason or another, we can't "select" the names on those pages with the mouse. I suspect JavaScript, but can't really tell.
- No fewer than twenty (20) departments!
First, I wanted to get the source of the overall staff page; I used "lynx -dump -source", but wget will probably do if you don't have lynx. The selection widget was a single very long line of this form:

    <select name="field_group_xyz" class="form-select" id="edit-field-group-xyz" > <option value="All" selected="selected">&lt;Any&gt;</option><option value="123">Algebra</option><option value="456">Analytic Geometry</option><option value="789">Hilbert&#039;s Tenth Problem</option> [...lots more...] </select>

(I changed the numbers and words because I want to show the technique, rather than our church's actual department names and codes.) Here's that line reformatted:

    <select name="field_group_xyz" class="form-select" id="edit-field-group-xyz" >
      <option value="All" selected="selected">&lt;Any&gt;</option>
      <option value="123">Algebra</option>
      <option value="456">Analytic Geometry</option>
      <option value="789">Hilbert&#039;s Tenth Problem</option>
      [...lots more...]
    </select>

A couple of observations: first, we have "nice" department names like "Algebra"; we also have two- and even three-word department names. Then there's "Hilbert&#039;s Tenth Problem" -- what's with "&#039;"? Well, that's just an HTML character code -- it's an apostrophe. More on that later.
Now from playing with the selection widget, I found that when I selected the "Algebra" department in the pull-down, I got http://www.something.org/about-us/staff?field_group_nid=123 so I concluded that one could simply append the department code to "field_group_nid=" to get its staff list.
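As a small illustration of that URL pattern, here is a sketch that builds the per-department URLs. The helper name dept_url is made up for this example, and the codes are the same placeholder ones used above.

```shell
#!/bin/bash
# Base URL from the article; the department code gets appended to it.
DEPTURL="http://www.something.org/about-us/staff?field_group_nid="

# dept_url is a hypothetical helper: print the staff-list URL for one code.
dept_url() {
    printf '%s%s\n' "$DEPTURL" "$1"
}

# Placeholder codes, not real ones.
for code in 123 456 789; do
    dept_url "$code"
done
```

Each output line is the URL of one department's staff list, ready to hand to lynx or wget.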
This was good; I could write a loop that would grab the various department names (Algebra, Analytic Geometry, etc.) and print those out, then do whatever processing I needed on the corresponding departmental staff pages. I didn't want the "Any" page so I got rid of that using "grep -v Any". So here's the beginning of our answer:

    lynx -dump -source $STAFFURL |
        grep -m1 "<select" |
        tr '/' '\n' |
        grep -F value= |
        grep -v Any

A word about that funny "tr" command: I noticed that the only "/" characters in that line were the ones in the closing tags, right after the body of each entry. To make it easy to grab each part, I wanted to put each entry on its own line -- and I didn't care about the words "<option>" or "</option>" -- so I just split the line at each "/" character. The output looked like this:

    option><option value="123">Algebra<
    option><option value="456">Analytic Geometry<
    option><option value="789">Hilbert&#039;s Tenth Problem<
    [...lots more...]
    option><
    select>

So I could pick out the value (the department code) and then, later on each line, the department name. And while I was at it, I could fix the funky HTML character code and just display an apostrophe, too. Something like this would do it:

    sed -e 's/<$//' -e 's/^.*value="//' -e 's/">/ /' -e 's/\&#039;/'"'/"g

When applied to the output above, the result is this:

    123 Algebra
    456 Analytic Geometry
    789 Hilbert's Tenth Problem

So we want to split the department code (e.g., 456) from the department name (e.g., "Analytic Geometry"), but the name might have 1, 2, or 3 words. How to deal with that?
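Before tackling that question, the extraction steps so far can be tried offline. This is only a minimal sketch: SAMPLE and extract_departments are names made up for the illustration, the sample line uses the same placeholder codes and names as above, and it assumes the apostrophe arrives as the character code &#039;. The "grep -m1" step is left out because the sample is already a single line, and the last sed expression is quoted a little differently but does the same substitution.

```shell
#!/bin/bash
# One-line sample of the selection widget (invented codes and names).
SAMPLE='<select name="field_group_xyz" class="form-select" id="edit-field-group-xyz" > <option value="All" selected="selected">&lt;Any&gt;</option><option value="123">Algebra</option><option value="456">Analytic Geometry</option><option value="789">Hilbert&#039;s Tenth Problem</option></select>'

# Hypothetical helper: turn the widget line into "CODE NAME" lines.
extract_departments() {
    tr '/' '\n' |          # split the line at every "/"
    grep -F 'value=' |     # keep only pieces that carry a department code
    grep -v Any |          # drop the catch-all "Any" entry
    sed -e 's/<$//' \
        -e 's/^.*value="//' \
        -e 's/">/ /' \
        -e "s/&#039;/'/g"  # show the character code as a plain apostrophe
}

printf '%s\n' "$SAMPLE" | extract_departments
```

Run as written, this prints the same three "code name" lines shown above.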
Since I'm using bash, I decided to just provide two variable names to the "read" builtin. The bash manpage description of read says that the...

    first word is assigned to the first name, the second word to the
    second name, and so on, with leftover words and their intervening
    separators assigned to the last name.

So if I pipe "789 Hilbert's Tenth Problem" to "read DCODE DNAME" then I'd get $DCODE=789 and $DNAME="Hilbert's Tenth Problem". If we were writing this in Python, we could use "split" with a limit, like this:

    >>> "789 Hilbert's Tenth Problem".split(" ", 1)
    ['789', "Hilbert's Tenth Problem"]

Anyway, our script could print the department name (which might be multiple words), then use the code for each department to get a department page. A typical department page would include these lines:

    <h3>René Descartes</h3>
    <h3>Euclid</h3>
    <h3>H.S.M. Coxeter</h3>

Those were the names we wanted, and the "<h3>"s appeared at the beginning of each line, so a short pipeline could pull them out:

    grep '^<h3>' | cut '-d>' -f2 | cut '-d<' -f1 | sed 's/^/    /'

The grep says to take only the lines that begin with "<h3>"; the first cut eliminates everything up to and including the first ">" on each line. The second cut deletes the "<" and everything after it. The sed inserts several " " characters at the front of each line, so the net effect would be something like this:

        René Descartes
        Euclid
        H.S.M. Coxeter

Putting it all together, it looked like this:

    STAFFURL="http://www.something.org/about-us/staff"
    DEPTURL="http://www.something.org/about-us/staff?field_group_nid="
    lynx -dump -source $STAFFURL |
        grep -m1 "<select" |
        tr '/' '\n' |
        grep -F value= |
        grep -v Any |
        sed -e 's/<$//' -e 's/^.*value="//' -e 's/">/ /' -e 's/\&#039;/'"'/"g |
        while read DCODE DNAME; do
            echo "$DNAME"
            lynx -dump -source ${DEPTURL}$DCODE |
                grep '^<h3>' | cut '-d>' -f2 | cut '-d<' -f1 | sed 's/^/    /'
        done

and the output is like

    Algebra
        George Boole
        Somebody Else
    Analytic Geometry
        René Descartes
        Euclid
        H.S.M. Coxeter
    Hilbert's Tenth Problem
        David Hilbert
        Bertrand Russell
        Alonzo Church

which is just what we're after.
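If you'd like to watch the loop work without touching the network, here is a rough sketch in which a stub stands in for lynx. The names fetch_page, staff_names and list_staff are hypothetical, and the canned pages are invented sample data rather than real output. The one deliberate change is "read -r", which keeps read from treating backslashes in the input specially.

```shell
#!/bin/bash
# fetch_page is a hypothetical stand-in for "lynx -dump -source URL":
# it prints a canned department page for a given department code.
fetch_page() {
    case "$1" in
        123) printf '%s\n' '<h3>George Boole</h3>' '<p>Office 1</p>' '<h3>Somebody Else</h3>' ;;
        789) printf '%s\n' '<h3>David Hilbert</h3>' '<h3>Alonzo Church</h3>' ;;
    esac
}

# Keep only the text inside each leading <h3>...</h3>, indented by four spaces.
staff_names() {
    grep '^<h3>' | cut -d'>' -f2 | cut -d'<' -f1 | sed 's/^/    /'
}

# Read "CODE NAME" lines: the first word goes to DCODE and everything
# left over, however many words, goes to DNAME.
list_staff() {
    while read -r DCODE DNAME; do
        echo "$DNAME"
        fetch_page "$DCODE" | staff_names
    done
}

printf '%s\n' "123 Algebra" "789 Hilbert's Tenth Problem" | list_staff
```

With the canned data, this prints "Algebra" followed by its two indented names, then "Hilbert's Tenth Problem" followed by its two. The multi-word department name survives intact because it lands in the last variable given to read.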
Well, that was fun! Useful, too. I hope you enjoyed that as much as I did. Please let me know if you have any questions.