Sunday, April 17, 2016

Recovering posts from a toasted wordpress site

The lovely Carol had her wordpress site hacked. Not knowing squat about wordpress except this, I naively thought it would be a Good Idea to pull the posts and the media off the site, blow it away, create a new site on top of it, and put the content back. It wasn't a problem finding the media; I tar(1)ed it up, gzip-ed (in vain) like this:
$ tar -tzf ~/uploaded-images.tgz
wp-content/uploads/
wp-content/uploads/2013/
wp-content/uploads/2013/01/
wp-content/uploads/2013/01/masthead_web.jpg
wp-content/uploads/2013/01/masthead_web-150x150.jpg
wp-content/uploads/2013/01/masthead_web-300x112.jpg
wp-content/uploads/2013/01/masthead_web-1000x288.jpg
wp-content/uploads/2013/01/masthead_web-500x187.jpg
…
But what about the text? I asked the lovely Carol, who said there was no need to capture the comments and a bunch of other stuff. She did, however, want her subscribers.

Following the method shown in that other posting, I went to the database and said

mysql> show tables;             
+--------------------------+
| Tables_in_REDACTED       |
+--------------------------+
| wp_commentmeta           |
| wp_comments              |
| wp_links                 |     
| wp_options               |
| wp_postmeta              |
| wp_posts                 |
| wp_subscribe2            |
| wp_term_relationships    |
| wp_term_taxonomy         |
| wp_termmeta              |
| wp_terms                 |
| wp_usermeta              |
| wp_users                 |       
+--------------------------+
I saved the subscribers off by typing mysql> select * from wp_subscribe2; which will do, because we don't have bazillions of them; a little scripting will get all the email addresses set up as subscribers for the new site.

She doesn't care about saving the old options, or comments, so all I need to do is save the posts and pages. We'll ask the designer who did the original design to please re-do it, once I can give her access.

So, what about saving the posts? It turns out that wp_posts has both the posts and something wordpress calls "pages." My first thought was to save everything into a nice file, like this:

SELECT * INTO OUTFILE '/$HOME/posts-outfile' CHARACTER SET utf8 FROM wp_posts;
which didn't work, because I don't have FILE privilege. Dang.

This worked fine though:

echo -e "use REDACTED\nSELECT post_type, post_date, \
    post_modified, post_title, post_content \
    FROM wp_posts where post_status = 'publish'; " \
 | mysql -h REDACTED -u REDACTED -p | tee $HOME/tempfile
So now I had her posts and "pages"

I then moved everything in the "root" directory of her domain into a subdirectory where nobody will look. By "nobody will look" I mean "nobody has permission." And by "nobody has permission" I mean 0700.

Now getting the data out of the aforementioned temp file was harder than I thought. My first naive thought was to do something like

sed -n 5p TEMPFILE
and just snarf'n'barf the rather long line I got. Bad idea. It almost worked, except for a few things:
  1. a bunch of newline characters, which were represented as "\n"; of course, HTML doesn't care about "\n" and just renders it like that....
  2. Some unwelcome <CR> (aka '\015' or '^M' or '\r') characters, which do a "carriage-return," putting the cursor back at the left margin but without scrolling the window. These caused a jumble of nonsense, but it was over toward the right, where you wouldn't see it without reading the whole thing through
  3. characters with the high-order bit set. These are things in the range 0x91—0x97, which are translated in this chart
I couldn't quite do all this with a shell one-liner, but I tried this:
grep '^page' published-posts-pages.out |sed -n 1p | tr -d '^M' | \
   sed 's/\\n\\n/<p>/g'
which was mostly okay. the bold blue ^M was entered by using <ctrl-V><ctrl-M> I think.

For the "page"s I think I just did the funky characters by hand, addressing (mostly) #1 and #2 in the one-liner. To take care of #3 (the "posts" had a lot more of these than the "pages") I wrote a little Python.

#!/usr/bin/python2.7 -utt
# vim:et:sw=4
import sys

the_line = sys.stdin.read()
sub_table = [
    [ 0x91, '&lsquo;' ],
    [ 0x92, '&rsquo;' ],
    [ 0x93, '&ldquo;' ],
    [ 0x94, '&rdquo;' ],
    [ 0x96, '&ndash;' ],
    [ 0x97, '&mdash;' ],

]
for old_thing, new_thing in sub_table:
    the_line = the_line.replace(chr(old_thing), new_thing)

sys.stdout.write(the_line)
sys.exit(0)
I stuck this into $HOME/bin/chars.py and then, to handle a typical post, said:
$ LANG=C; grep '^post    ' published-posts-pages.out | tr -d '^M' \
   | sed 's/\\n\\n/<p>/g' | sed 's/\\n/<br>/g' | sed -n 12p \
   | ~/bin/chars.py | pbcopy
A few things about this:
  • LANG=C was because I had been trying some other character sets. They didn't work.
  • after translating all the "\n\n"s into "<p>", there were some single "\n"s left, which I translated into "<br>"
  • chars.py of course was to do the right thing with those odd characters
  • I used the Mac OS utility "pbcopy" because the terminal emulator renders unknown characters as '?'. Of course, now that I have chars.py in the pipeline, this was probably superfluous. Well, I could just CMD-v in the wordpress window to add a new post.
That's about it. The site is now ready for the lovely Carol to fine-tune. The .htaccess file currently will allow access only to the IP address of this particular house. When it's ready to publish, we'll remove the IP address restriction.

No comments: