One of the annoying things about so-called WYSIWYG word processors is that you can't quite tell what's going on. The image at right, for example, shows somewhat uneven spacing between lines. The gap between the first two lines is a little wider than the gap between the second and third. And the 2nd line from the bottom is a little farther away from its predecessor.
Where do these things come from? Well, if you jam a half-dozen ".doc" files together, you might have a diversity of font faces and sizes. Or if you copy/paste some text from one document (which started off with a larger font size for example) you might find font sizes varying even within a paragraph.
Now a close look at the 2nd-to-last line reveals that the closing quotation mark looks a little large. Indeed, if you position the cursor right there and watch the appropriate toolbar, you might see the font size window change from '10' to '12' and back. This explains why that line is a little lower than you'd otherwise expect. But what about the 2nd line? The same careful trick with the cursor would show that one of the '.'s had a larger font size.
So in a 112-page document, would you want to inspect the line spacing and watch the font-size box on every single character to see where the size changed? Or select the entire document and change the font size to 10pt? That last trick might actually work, but what if you have some characters in a different font, say "Albany AMT" instead of "Verdana"?
If you know the entire document should be one font, one size, one style (etc.), then that would work, but often there are words or sentences in italics, or a section in a different font. So the brute-force method feels just a little risky.
So what's a techno-weenie to do with this? Since this particular techno-weenie wrote this article about manipulating ODF files with Python, the natural thing is to write a Python script. I'll spare you the gruesome details but basically I did this:
- Use OpenOffice/NeoOffice to convert a ".doc" file to ".odt"
- Use unzip to unpack the ".odt" file, and examine "content.xml" using emacs (or firefox)
- Write a Python script to examine and modify properties
- Play with it a bit, and save the modified version...
- Use OpenOffice/NeoOffice to convert the ".odt" file back to ".doc"
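The "unpack and examine" step can also be done from Python without writing anything to disk. What follows is only a sketch, written for Python 3; `pretty_member` and `demo_archive` are names made up for this illustration, and the demo archive is a toy stand-in rather than a valid document.

```python
import io
import xml.dom.minidom
import zipfile


def pretty_member(odf, member='content.xml'):
    """Return one XML member of an ODF archive, indented for reading."""
    with zipfile.ZipFile(odf) as archive:
        raw = archive.read(member)
    return xml.dom.minidom.parseString(raw).toprettyxml(indent='  ')


def demo_archive():
    """Build a tiny in-memory stand-in for an .odt file (not a valid document)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as archive:
        archive.writestr('mimetype', 'application/vnd.oasis.opendocument.text')
        archive.writestr('content.xml', '<doc><p>Hello</p></doc>')
    buf.seek(0)
    return buf


print(pretty_member(demo_archive()))
```

With a real file you would pass its path instead, for example `pretty_member('CreativeProjectFeb10.odt')`.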
    collin@p3:/mnt/home/collin/kstyle/tmp> unzip ../CreativeProjectFeb10.odt
    Archive:  ../CreativeProjectFeb10.odt
     extracting: mimetype
       creating: Configurations2/statusbar/
      inflating: Configurations2/accelerator/current.xml
       creating: Configurations2/floater/
       creating: Configurations2/popupmenu/
       creating: Configurations2/progressbar/
       creating: Configurations2/menubar/
       creating: Configurations2/toolbar/
       creating: Configurations2/images/Bitmaps/
      inflating: layout-cache
      inflating: content.xml
      inflating: styles.xml
     extracting: meta.xml
      inflating: Thumbnails/thumbnail.png
      inflating: Thumbnails/thumbnail.pdf
      inflating: settings.xml
      inflating: META-INF/manifest.xml
    collin@p3:/mnt/home/collin/kstyle/tmp>

Then I pointed firefox at /mnt/home/collin/kstyle/tmp/content.xml and observed that some text styles specified different fonts, different sizes, etc.
    collin@p3:~/kstyle> ./kstyles.py CreativeProjectFeb10.odt junk.odt -l | grep -v '<'
    T1: 701 spans
    T2: 77 spans
    T3: 2 spans
    T4: 10 spans
    T5: 2 spans
    T6: 4 spans
    T7: 18 spans
    T8: 8 spans
    T9: 13 spans
    T10: 1 spans
    T11: 221 spans
    T12: 1 spans
    T13: 112 spans
    T14: 2 spans
    T15: 6 spans
    T16: 1 spans
    collin@p3:~/kstyle>

The script also prints the properties of each style, which I've filtered out above with grep. Anyway, here's a slightly less censored version, looking at style T15; I've folded the output lines so you can see 'em all:
    collin@p3:~/kstyle> ./kstyles.py CreativeProjectFeb10.odt junk.odt -l -dT15
    T1: 701 spans
    …
    T15: 6 spans
        <style:style style:family="text" style:name="T15"><style:text-properties
            fo:font-size="12pt" fo:font-style="italic"
            style:font-name-asian="Albany AMT" style:font-name-complex="Albany AMT"
            style:font-size-asian="12pt" style:font-style-asian="italic"
            style:font-style-complex="italic"/></style:style>
    T16: 1 spans
    …
    === Text style T15:
    .
    .
    .
    .
    .
    .”
    collin@p3:~/kstyle>

So the font size is too big here -- also it's the wrong font! And did you notice that the big font was just a '.' in several cases? How could you ever find those?
After looking at a bunch of them, I eventually decided I could collapse T3 and T5 into T2, and most of the rest into T1. I ended up with this:
    collin@p3:~/kstyle> ./kstyles.py CreativeProjectFeb10.odt d.odt T3=T2 T5=T2 T6=T1 \
        T7=T1 T8=T1 T9=T1 T10=T1 T11=T1 T12=T1 T13=T1 T14=T1 T15=T2
    Changing style "T3" to "T2".
    Changing style "T5" to "T2".
    Changing style "T6" to "T1".
    Changing style "T7" to "T1".
    Changing style "T8" to "T1".
    Changing style "T9" to "T1".
    Changing style "T10" to "T1".
    Changing style "T11" to "T1".
    Changing style "T12" to "T1".
    Changing style "T13" to "T1".
    Changing style "T14" to "T1".
    Changing style "T15" to "T2".
    collin@p3:~/kstyle>

The output file is called "d.odt" (due to typing laziness), and it looked fine. Here's the script (Python 2):
      1 #!/usr/bin/python -utt
      2 # vim:et:sw=4
      3 '''Unzip an ODF document and list or kill/substitute text styles.
      4
      5 Usage: kstyles INPUT OUTPUT [-l] [-dsty1]... [sty1=sty2]...
      6     -l
      7         list styles
      8
      9     -dsty1
     10         Show text spans having property sty1
     11
     12     sty1=sty2
     13         Text elements which are sty1 are assigned sty2
     14
     15 $Id: kstyles.py,v 0.4 2012/02/12 01:25:52 collin Exp collin $
     16 '''
     17
     18 import codecs
     19 import os
     20 import sys
     21 import xml.dom.minidom
     22 import zipfile
     23
     24 CONTENT = 'content.xml'
     25 STYLE = 'style:style'
     26 SFAMILY = 'style:family'
     27 SNAME = 'style:name'
     28 TSTYLENAME = 'text:style-name'
     29 TSPAN = 'text:span'
     30
     31 def main(args):
     32     '''Unpack ODF/XML (zipfile). Discover text styles.
     33     Find # of text elements which have each style; if "-l", display.
     34     If "-dXX", display text:spans with style XX
     35     For each "sty1=sty2" provided:
     36         change any text elements of sty1 to sty2'''
     37     try:
     38         infile_name = args[0]
     39         outfile_name = args[1]
     40         ops = args[2:]
     41     except IndexError:
     42         usage()
     43     if not os.path.exists(infile_name):
     44         print "Couldn't find input file %s" % infile_name
     45         usage()
     46     INFILE = zipfile.ZipFile(infile_name, 'r')
     47     # Sanity-check input file before doing anything else.
     48     if CONTENT not in INFILE.namelist():
     49         print "Couldn't find %s in %s's zip archive" % (CONTENT, infile_name)
     50         print 'Is it an ODF file?'
     51         sys.exit(1)
     52     # Read and parse content.
     53     cdata = INFILE.read(CONTENT)
     54     cdom = xml.dom.minidom.parseString(cdata)

The above checks parameters, opens the ".odt" file as a zip archive, and ensures that CONTENT (viz., "content.xml"; see line 24) is there. Then it creates a document object model (DOM) from what's in the content.
     55     # Find text styles
     56     cstyles = cdom.getElementsByTagName(STYLE)
     57     text_styles = [X for X in cstyles
     58                    if X.getAttribute(SFAMILY) == 'text']
     59     text_style_names = [X.getAttribute(SNAME) for X in text_styles]
     60     # print text_styles
     61     style_counts = dict()
     62     for astyle in text_style_names:
     63         style_counts[astyle] = 0
     64     for aspan in [X for X in cdom.getElementsByTagName(TSPAN)
     65                   if X.hasAttribute(TSTYLENAME)]:
     66         style_counts[aspan.getAttribute(TSTYLENAME)] += 1

The above looks for all the text-styles, and counts how many "text:span" items refer to each text-style.
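For readers on Python 3, the counting step might be sketched as below. This is not the script's code: `count_span_styles` and the `SAMPLE` document are invented for illustration, and a `Counter` stands in for the plain dict so that a span naming a style that isn't declared in this file is counted rather than raising `KeyError`.

```python
from collections import Counter
import xml.dom.minidom

# A tiny stand-in for content.xml; real files declare many more namespaces.
SAMPLE = (
    '<office:document-content'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    '<style:style style:name="T1" style:family="text"/>'
    '<style:style style:name="T2" style:family="text"/>'
    '<style:style style:name="P1" style:family="paragraph"/>'
    '<text:p text:style-name="P1">'
    '<text:span text:style-name="T1">Hello</text:span>'
    '<text:span text:style-name="T1">, </text:span>'
    '<text:span text:style-name="T2">world</text:span>'
    '</text:p>'
    '</office:document-content>'
)


def count_span_styles(dom):
    """Return a Counter mapping style name -> number of text:span users."""
    counts = Counter()
    # Start every declared text style at zero so unused styles show up too.
    for style in dom.getElementsByTagName('style:style'):
        if style.getAttribute('style:family') == 'text':
            counts[style.getAttribute('style:name')] = 0
    for span in dom.getElementsByTagName('text:span'):
        if span.hasAttribute('text:style-name'):
            counts[span.getAttribute('text:style-name')] += 1
    return counts


counts = count_span_styles(xml.dom.minidom.parseString(SAMPLE))
for name, total in sorted(counts.items()):
    print('%s: %d spans' % (name, total))
```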
     67
     68     if '-l' in ops:
     69         for idx in range(len(text_styles)):
     70             astyle = text_style_names[idx]
     71             print '%s: %d spans' % (astyle, style_counts[astyle])
     72             if style_counts[astyle]:
     73                 print '\t%s' % text_styles[idx].toxml()
     74         for astyle in [X for X in style_counts if X not in text_style_names]:
     75             print '??? %s: %d spans' % (astyle, style_counts[astyle])
     76         while '-l' in ops:
     77             ops.remove('-l')

...and this part prints the information, if you want it.
     78
     79     # Before the following fun stuff, make stdout be utf8
     80     utf8_enc = codecs.getencoder('utf8')

I need line 80 to avoid encoding errors. The next part handles each operation (or "command") -- "-dT15" for example.
     81
     82     for op in ops:
     83         if op.startswith('-d'):
     84             astyle = op[2:]
     85             if astyle not in style_counts:
     86                 print "*** Couldn't find style %s" % astyle
     87                 continue
     88             print "=== Text style %s:" % astyle
     89             for aspan in [X for X in cdom.getElementsByTagName(TSPAN)
     90                           if X.getAttribute(TSTYLENAME) == astyle]:
     91                 print utf8_enc(aspan.firstChild.data)[0]
     92             continue

So that was the "-d" part -- dump out the text spans referring to a particular style.
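Line 91 assumes that a span's first child is a text node. A more defensive way to collect a span's text, sketched here in Python 3, walks all the children instead. The helper name `span_text` is made up for this example, and it relies on ODF's `<text:s/>` element standing for one or more spaces, with the count in an optional `text:c` attribute.

```python
import xml.dom.minidom


def span_text(node):
    """Concatenate the text under node, treating <text:s/> as spaces."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            if child.tagName == 'text:s':
                # text:c holds the number of spaces; treat a missing one as 1.
                parts.append(' ' * int(child.getAttribute('text:c') or 1))
            else:
                # Recurse into any other nested element.
                parts.append(span_text(child))
    return ''.join(parts)


# Illustrative input: a span whose first child is not a text node.
demo = xml.dom.minidom.parseString(
    '<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    '<text:span text:style-name="T15"><text:s/>.</text:span>'
    '</text:p>')
print(repr(span_text(demo.getElementsByTagName('text:span')[0])))
```

An empty span simply yields an empty string here, where `firstChild.data` would fail because `firstChild` is `None`.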
     93         styles = op.split('=')
     94         if len(styles) > 2:
     95             print >> sys.stderr, "Can't parse: '%s'" % op
     96             usage()
     97         if len(styles) < 2:
     98             print >> sys.stderr, 'Not yet implemented: %s' % op
     99             continue
    100         print 'Changing style "%s" to "%s".' % (styles[0], styles[1])
    101         for aspan in [X for X in cdom.getElementsByTagName(TSPAN)
    102                       if X.getAttribute(TSTYLENAME) == styles[0]]:
    103             aspan.setAttribute(TSTYLENAME, styles[1])

Line 93 interprets "T15=T1" and assigns styles[0]="T15", styles[1]="T1". Then lines 101-103 find all the text-spans matching "T15" and call setAttribute on each one to change it to "T1". The last part puts the content back into a new ODF file:
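The same substitution can be tried on a toy document in Python 3. This is a sketch only: `parse_mapping` and `remap_spans` are hypothetical names, and unlike the script this version raises `ValueError` for anything that isn't exactly one "old=new" pair.

```python
import xml.dom.minidom


def parse_mapping(op):
    """Split 'T15=T1' into ('T15', 'T1'); reject anything else."""
    old, sep, new = op.partition('=')
    if not sep or not old or not new or '=' in new:
        raise ValueError("Can't parse: %r" % op)
    return old, new


def remap_spans(dom, old, new):
    """Point every text:span that uses style `old` at `new`; return the count."""
    changed = 0
    for span in dom.getElementsByTagName('text:span'):
        if span.getAttribute('text:style-name') == old:
            span.setAttribute('text:style-name', new)
            changed += 1
    return changed


dom = xml.dom.minidom.parseString(
    '<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    '<text:span text:style-name="T1">He said "stop</text:span>'
    '<text:span text:style-name="T15">."</text:span>'
    '</text:p>')
old, new = parse_mapping('T15=T1')
print('Changed %d span(s)' % remap_spans(dom, old, new))
```

The style definition for T15 is left in place; only the references change, which is also what lines 101-103 do.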
    104
    105     if os.path.exists(outfile_name):
    106         os.unlink(outfile_name)
    107     OUTFILE = zipfile.ZipFile(outfile_name, 'w')
    108     for oldinfo in INFILE.infolist():
    109         fname = oldinfo.filename
    110         fsize = oldinfo.file_size
    111         #print 'archive member "%s", %d bytes' % (fname, fsize)
    112         if fsize > 0:
    113             if fname == CONTENT:
    114                 OUTFILE.writestr(fname, utf8_enc(cdom.toxml())[0])
    115             else:
    116                 OUTFILE.writestr(fname, INFILE.read(fname))
    117         else:
    118             OUTFILE.writestr(fname, '')
    119     OUTFILE.close()
    120
    121
    122 def usage():
    123     print >> sys.stderr, __doc__
    124     sys.exit(1)
    125
    126 if __name__ == '__main__':
    127     main(sys.argv[1:])
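The repacking step might look like this in Python 3. It is a sketch under a couple of assumptions: `rewrite_member` is a made-up name, and it reuses each member's original `ZipInfo` so that the member order and per-file compression carry over. That matters because, as far as I know, ODF wants "mimetype" to be the first entry and stored uncompressed.

```python
import io
import zipfile


def rewrite_member(src, dst, member, new_bytes):
    """Copy the zip archive src to dst, replacing one member's contents."""
    with zipfile.ZipFile(src, 'r') as zin, zipfile.ZipFile(dst, 'w') as zout:
        for info in zin.infolist():
            if info.filename == member:
                data = new_bytes
            else:
                data = zin.read(info.filename)
            # Passing the ZipInfo keeps the name, timestamp and compress_type.
            zout.writestr(info, data)


# Illustrative round trip on an in-memory archive (not a valid document).
source = io.BytesIO()
with zipfile.ZipFile(source, 'w') as z:
    z.writestr('mimetype', 'application/vnd.oasis.opendocument.text')
    z.writestr('content.xml', '<old/>', compress_type=zipfile.ZIP_DEFLATED)

target = io.BytesIO()
rewrite_member(source, target, 'content.xml', b'<new/>')
with zipfile.ZipFile(target) as z:
    print(z.namelist(), z.read('content.xml'))
```

With real files, `src` and `dst` would be paths, and `new_bytes` would be the modified DOM serialized as UTF-8, e.g. `cdom.toxml().encode('utf8')`.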
Update: May 2014
I ran into this issue again with a collection of short stories. A single document (a short story, to which a half-dozen other stories were appended via snarf'n'barf with the mouse) had something like 35 or 36 text styles, which were largely unnecessary. I had to update the above script (which by now I've renamed kstyles.py, but I don't remember what "k" meant) to account for a few things:
- Sometimes line 91 didn't work:
       91                 print utf8_enc(aspan.firstChild.data)[0]
  because firstChild wasn't Plain Old Text; it was this: <text:s/>
- Sometimes a text style couldn't be found in content.xml; I had to look in styles.xml (no, really); who would have known?
- While I was in there, I added some sanity checks and updated the documentation
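Looking in both files might be sketched like this in Python 3. The updated script itself isn't shown here, so this is only one possible approach; `text_style_sources` is a hypothetical helper and the demo archive is a toy, not a valid document.

```python
import io
import xml.dom.minidom
import zipfile


def text_style_sources(odf, members=('content.xml', 'styles.xml')):
    """Map each text style name to the first archive member that defines it."""
    found = {}
    with zipfile.ZipFile(odf) as archive:
        present = set(archive.namelist())
        for member in members:
            if member not in present:
                continue
            dom = xml.dom.minidom.parseString(archive.read(member))
            for style in dom.getElementsByTagName('style:style'):
                if style.getAttribute('style:family') == 'text':
                    found.setdefault(style.getAttribute('style:name'), member)
    return found


# Toy archive: one text style in each file.
NS = 'xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"'
demo = io.BytesIO()
with zipfile.ZipFile(demo, 'w') as z:
    z.writestr('content.xml',
               '<c %s><style:style style:name="T1" style:family="text"/></c>' % NS)
    z.writestr('styles.xml',
               '<s %s><style:style style:name="Emphasis" style:family="text"/>'
               '<style:style style:name="Standard" style:family="paragraph"/></s>' % NS)

print(text_style_sources(demo))
```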
kstyles (/mnt/home/collin/projects/kstyle/kstyles.py)
Unzip an ODF document and list or kill/substitute text styles.
Usage: kstyles INPUT OUTPUT [-l] [-dsty1]... [sty1=sty2]...
INPUT
name of file to read
OUTPUT
name of file to write new (modified) file
-l
list styles
-dsty1
Show text spans having property sty1
sty1=sty2
Text elements which are sty1 are assigned sty2
$Id: kstyles.py,v 0.6 2014/05/10 22:19:13 collin Exp collin $
Data
    CONTENT = 'content.xml'
    SFAMILY = 'style:family'
    SNAME = 'style:name'
    STYLE = 'style:style'
    STYLES = 'styles.xml'
    TSPAN = 'text:span'
    TSTYLENAME = 'text:style-name'