The lovely Carol had her wordpress site
hacked.
Not knowing squat about wordpress except
this, I naively thought it would be a Good Idea to pull the posts and the media
off the site,
blow it away, create a new site on top of it, and put the content back.
It wasn't a problem finding the media; I
tar(1)ed it up, gzip-ed (in vain) like this:
$ tar -tzf ~/uploaded-images.tgz
wp-content/uploads/
wp-content/uploads/2013/
wp-content/uploads/2013/01/
wp-content/uploads/2013/01/masthead_web.jpg
wp-content/uploads/2013/01/masthead_web-150x150.jpg
wp-content/uploads/2013/01/masthead_web-300x112.jpg
wp-content/uploads/2013/01/masthead_web-1000x288.jpg
wp-content/uploads/2013/01/masthead_web-500x187.jpg
…
But what about the text? I asked the lovely Carol, who said there was no need to
capture the comments and a bunch of other stuff. She did, however, want her
subscribers.
Following the method shown in that other posting, I went to the database and said
mysql> show tables;
+--------------------------+
| Tables_in_REDACTED |
+--------------------------+
| wp_commentmeta |
| wp_comments |
| wp_links |
| wp_options |
| wp_postmeta |
| wp_posts |
| wp_subscribe2 |
| wp_term_relationships |
| wp_term_taxonomy |
| wp_termmeta |
| wp_terms |
| wp_usermeta |
| wp_users |
+--------------------------+
I saved the subscribers off by typing
mysql> select * from wp_subscribe2; which will do, because we don't
have bazillions of them; a little scripting will get all the email addresses set up as
subscribers for the new site.
She doesn't care about saving the old options, or comments, so all I need to do is save the posts and pages. We'll ask the designer who did the original design to please
re-do it, once I can give her access.
So, what about saving the posts? It turns out that wp_posts has both the
posts and something wordpress calls "pages." My first thought was to save everything into a nice file, like this:
SELECT * INTO OUTFILE '/$HOME/posts-outfile' CHARACTER SET utf8 FROM wp_posts;
which didn't work, because I don't have FILE privilege. Dang.
This worked fine though:
echo -e "use REDACTED\nSELECT post_type, post_date, \
post_modified, post_title, post_content \
FROM wp_posts where post_status = 'publish'; " \
| mysql -h REDACTED -u REDACTED -p | tee $HOME/tempfile
So now I had her posts and "pages"
I then moved everything in the "root" directory of her domain into a subdirectory
where nobody will look. By "nobody will look" I mean "nobody has permission."
And by "nobody has permission" I mean 0700.
Now getting the data out of the aforementioned temp file was harder than I thought.
My first naive thought was to do something like
sed -n 5p TEMPFILE
and just
snarf'n'barf the rather long line I got.
Bad idea. It almost worked, except for a few things:
- a bunch of newline characters, which were represented as "\n";
of course, HTML doesn't care about "\n" and just renders it like that....
- Some unwelcome <CR> (aka '\015' or '^M' or '\r') characters, which
do a "carriage-return," putting the cursor back at the left margin but
without scrolling the window. These caused a jumble of nonsense, but
it was over toward the right, where you wouldn't see it without reading
the whole thing through
- characters with the high-order bit set. These are things in the range
0x91—0x97, which are translated in this chart
I couldn't quite do all this with a shell one-liner, but I tried this:
grep '^page' published-posts-pages.out |sed -n 1p | tr -d '^M' | \
sed 's/\\n\\n/<p>/g'
which was mostly okay. the bold blue
^M
was entered by using <ctrl-V><ctrl-M> I think.
For the "page"s I think I just did the funky characters by hand, addressing
(mostly) #1 and #2 in the one-liner. To take care of #3 (the "posts" had a lot more
of these than the "pages") I wrote a little Python.
#!/usr/bin/python2.7 -utt
# vim:et:sw=4
import sys
the_line = sys.stdin.read()
sub_table = [
[ 0x91, '‘' ],
[ 0x92, '’' ],
[ 0x93, '“' ],
[ 0x94, '”' ],
[ 0x96, '–' ],
[ 0x97, '—' ],
]
for old_thing, new_thing in sub_table:
the_line = the_line.replace(chr(old_thing), new_thing)
sys.stdout.write(the_line)
sys.exit(0)
I stuck this into $HOME/bin/chars.py and then, to handle a typical post, said:
$ LANG=C; grep '^post ' published-posts-pages.out | tr -d '^M' \
| sed 's/\\n\\n/<p>/g' | sed 's/\\n/<br>/g' | sed -n 12p \
| ~/bin/chars.py | pbcopy
A few things about this:
- LANG=C was because I had been trying some other character sets. They didn't work.
- after translating all the "\n\n"s into "<p>", there were some single "\n"s
left, which I translated into "<br>"
- chars.py of course was to do the right thing with those odd characters
- I used the Mac OS utility "pbcopy" because the terminal emulator renders unknown
characters as '?'. Of course, now that I have chars.py in the pipeline,
this was probably superfluous. Well, I could just CMD-v in the wordpress window
to add a new post.
That's about it. The site is now ready for the lovely Carol to fine-tune.
The
.htaccess file currently will allow access only to the IP address
of this particular house. When it's ready to publish, we'll remove the IP address
restriction.