collin park: Recovering posts from a toasted wordpress site

The lovely Carol had her wordpress site hacked. Not knowing squat about wordpress except this, I naively thought it would be a Good Idea to pull the posts and the media off the site, blow it away, create a new site on top of it, and put the content back. It wasn't a problem finding the media; I tar(1)ed it up, gzip-ed (in vain) like this:

$ tar -tzf ~/uploaded-images.tgz
wp-content/uploads/
wp-content/uploads/2013/
wp-content/uploads/2013/01/
wp-content/uploads/2013/01/masthead_web.jpg
wp-content/uploads/2013/01/masthead_web-150x150.jpg
wp-content/uploads/2013/01/masthead_web-300x112.jpg
wp-content/uploads/2013/01/masthead_web-1000x288.jpg
wp-content/uploads/2013/01/masthead_web-500x187.jpg
…

But what about the text? I asked the lovely Carol, who said there was no need to capture the comments and a bunch of other stuff. She did, however, want her subscribers.

Following the method shown in that other posting, I went to the database and said

mysql> show tables;             
+--------------------------+
| Tables_in_REDACTED       |
+--------------------------+
| wp_commentmeta           |
| wp_comments              |
| wp_links                 |     
| wp_options               |
| wp_postmeta              |
| wp_posts                 |
| wp_subscribe2            |
| wp_term_relationships    |
| wp_term_taxonomy         |
| wp_termmeta              |
| wp_terms                 |
| wp_usermeta              |
| wp_users                 |       
+--------------------------+

I saved the subscribers off by typing mysql> select * from wp_subscribe2; which will do, because we don't have bazillions of them; a little scripting will get all the email addresses set up as subscribers for the new site.

She doesn't care about saving the old options, or comments, so all I need to do is save the posts and pages. We'll ask the designer who did the original design to please re-do it, once I can give her access.

So, what about saving the posts? It turns out that wp_posts has both the posts and something wordpress calls "pages." My first thought was to save everything into a nice file, like this:

SELECT * INTO OUTFILE '/$HOME/posts-outfile' CHARACTER SET utf8 FROM wp_posts;

which didn't work, because I don't have FILE privilege. Dang.

This worked fine though:

echo -e "use REDACTED\nSELECT post_type, post_date, \
    post_modified, post_title, post_content \
    FROM wp_posts where post_status = 'publish'; " \
 | mysql -h REDACTED -u REDACTED -p | tee $HOME/tempfile

So now I had her posts and "pages"

I then moved everything in the "root" directory of her domain into a subdirectory where nobody will look. By "nobody will look" I mean "nobody has permission." And by "nobody has permission" I mean 0700.

Now getting the data out of the aforementioned temp file was harder than I thought. My first naive thought was to do something like

sed -n 5p TEMPFILE

and just snarf'n'barf the rather long line I got. Bad idea. It almost worked, except for a few things:

a bunch of newline characters, which were represented as "\n"; of course, HTML doesn't care about "\n" and just renders it like that....
Some unwelcome <CR> (aka '\015' or '^M' or '\r') characters, which do a "carriage-return," putting the cursor back at the left margin but without scrolling the window. These caused a jumble of nonsense, but it was over toward the right, where you wouldn't see it without reading the whole thing through
characters with the high-order bit set. These are things in the range 0x91—0x97, which are translated in this chart

I couldn't quite do all this with a shell one-liner, but I tried this:

grep '^page' published-posts-pages.out |sed -n 1p | tr -d '^M' | \
   sed 's/\\n\\n/<p>/g'

which was mostly okay. the bold blue ^M was entered by using <ctrl-V><ctrl-M> I think.

For the "page"s I think I just did the funky characters by hand, addressing (mostly) #1 and #2 in the one-liner. To take care of #3 (the "posts" had a lot more of these than the "pages") I wrote a little Python.

#!/usr/bin/python2.7 -utt
# vim:et:sw=4
import sys

the_line = sys.stdin.read()
sub_table = [
    [ 0x91, '&lsquo;' ],
    [ 0x92, '&rsquo;' ],
    [ 0x93, '&ldquo;' ],
    [ 0x94, '&rdquo;' ],
    [ 0x96, '&ndash;' ],
    [ 0x97, '&mdash;' ],

]
for old_thing, new_thing in sub_table:
    the_line = the_line.replace(chr(old_thing), new_thing)

sys.stdout.write(the_line)
sys.exit(0)

I stuck this into $HOME/bin/chars.py and then, to handle a typical post, said:

$ LANG=C; grep '^post    ' published-posts-pages.out | tr -d '^M' \
   | sed 's/\\n\\n/<p>/g' | sed 's/\\n/<br>/g' | sed -n 12p \
   | ~/bin/chars.py | pbcopy

A few things about this:

LANG=C was because I had been trying some other character sets. They didn't work.
after translating all the "\n\n"s into "<p>", there were some single "\n"s left, which I translated into "<br>"
chars.py of course was to do the right thing with those odd characters
I used the Mac OS utility "pbcopy" because the terminal emulator renders unknown characters as '?'. Of course, now that I have chars.py in the pipeline, this was probably superfluous. Well, I could just CMD-v in the wordpress window to add a new post.

That's about it. The site is now ready for the lovely Carol to fine-tune. The .htaccess file currently will allow access only to the IP address of this particular house. When it's ready to publish, we'll remove the IP address restriction.

collin park

Sunday, April 17, 2016

Recovering posts from a toasted wordpress site

No comments:

more selected postings

Who is this guy?

Links

Blog Archive