My wife, the lovely Carol, is a writer. Unsurprisingly,
she uses word-processing software, NeoOffice in particular. Her grad
school profs use Microsoft Word, so what she sends them are ".doc" files.
One of the annoying things about so-called WYSIWYG word processors is that you can't
quite tell what's going on. The image at right, for example, shows somewhat uneven
spacing between lines. The gap between the first two lines is a little wider than
the gap between the second and third. And the 2nd line from the bottom is a little
farther away from its predecessor.
Where do these things come from? Well, if you jam a half-dozen ".doc" files together,
you might have a diversity of font faces and sizes. Or if you copy/paste some text from
one document (which started off with a larger font size for example) you might find font sizes
varying even within a paragraph.
Now a close look at the 2nd-to-last line reveals that the closing quotation mark looks
a little large. Indeed, if you position the cursor right there and watch the appropriate
toolbar, you might see the font size window change from '10' to '12' and back. This explains
why that line is a little lower than you'd otherwise expect. But what about the 2nd line?
The same careful trick with the cursor would show that one of the '.'s had a larger font
size.
So in a 112-page document, would you want to look carefully at the line spacings and
watch the font-size on every single character to see where the font size changed? Or
grab the entire document and change the font size to 10pt? That last trick might actually
work, but what if you have some characters in a different font—"Albany AMT" for example
instead of "Verdana"?
If you know the entire document shall be of one font, one size, one style (etc.), then
that would work, but often there are words or sentences in italics, or a section in a
different font. So the brute-force method feels just a little risky.
So what's a techno-weenie to do with this? Since this particular techno-weenie wrote
this article about manipulating ODF files with Python, the natural thing is to write a
Python script. I'll spare you the gruesome details, but basically I did this:
1. Use OpenOffice/NeoOffice to convert a ".doc" file to ".odt"
2. Use unzip to unpack the ".odt" file, and examine "content.xml" using emacs (or firefox)
3. Write a Python script to examine and modify properties
4. Play with it a bit, and save the modified version...
5. Use OpenOffice/NeoOffice to convert the ".odt" file back to ".doc"
So items #1 and #5 are just a matter of "Save as"; for #2 I said:
collin@p3:/mnt/home/collin/kstyle/tmp> unzip ../CreativeProjectFeb10.odt
Archive: ../CreativeProjectFeb10.odt
extracting: mimetype
creating: Configurations2/statusbar/
inflating: Configurations2/accelerator/current.xml
creating: Configurations2/floater/
creating: Configurations2/popupmenu/
creating: Configurations2/progressbar/
creating: Configurations2/menubar/
creating: Configurations2/toolbar/
creating: Configurations2/images/Bitmaps/
inflating: layout-cache
inflating: content.xml
inflating: styles.xml
extracting: meta.xml
inflating: Thumbnails/thumbnail.png
inflating: Thumbnails/thumbnail.pdf
inflating: settings.xml
inflating: META-INF/manifest.xml
collin@p3:/mnt/home/collin/kstyle/tmp>
Then I pointed firefox at /mnt/home/collin/kstyle/tmp/content.xml and observed that some
text styles specified different fonts, different sizes, etc.
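What to look for in content.xml can be sketched in a few lines. This is a minimal illustration, not part of kstyles.py: it is written in Python 3 (the script below is Python 2), the `SAMPLE` document is made up, and `text_style_properties` is a hypothetical helper name. It collects the attributes of each `style:text-properties` element under every style whose family is "text".

```python
import xml.dom.minidom

# Made-up stand-in for a real content.xml (assumption: only the parts
# relevant to text styles are shown).
SAMPLE = '''<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0">
  <office:automatic-styles>
    <style:style style:name="T1" style:family="text">
      <style:text-properties fo:font-size="10pt"/>
    </style:style>
    <style:style style:name="T2" style:family="text">
      <style:text-properties fo:font-size="12pt" fo:font-style="italic"/>
    </style:style>
    <style:style style:name="P1" style:family="paragraph"/>
  </office:automatic-styles>
</office:document-content>'''


def text_style_properties(content_xml):
    """Map each text style name to a dict of its text-properties attributes."""
    dom = xml.dom.minidom.parseString(content_xml)
    result = {}
    for style in dom.getElementsByTagName('style:style'):
        if style.getAttribute('style:family') != 'text':
            continue  # skip paragraph styles and the like
        props = {}
        for tprops in style.getElementsByTagName('style:text-properties'):
            props.update(tprops.attributes.items())
        result[style.getAttribute('style:name')] = props
    return result


for name, props in sorted(text_style_properties(SAMPLE).items()):
    print(name, props)
```

For a real document the XML could be read straight out of the archive with `zipfile.ZipFile('CreativeProjectFeb10.odt').read('content.xml')`, skipping the manual unzip.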
collin@p3:~/kstyle> ./kstyles.py CreativeProjectFeb10.odt junk.odt -l | grep -v '<'
T1: 701 spans
T2: 77 spans
T3: 2 spans
T4: 10 spans
T5: 2 spans
T6: 4 spans
T7: 18 spans
T8: 8 spans
T9: 13 spans
T10: 1 spans
T11: 221 spans
T12: 1 spans
T13: 112 spans
T14: 2 spans
T15: 6 spans
T16: 1 spans
collin@p3:~/kstyle>
The script also prints the characteristics of each style, which I've filtered out above
with grep. Anyway, here's a slightly less censored version, looking at style T15; I've
folded the output lines so you can see 'em all:
collin@p3:~/kstyle> ./kstyles.py CreativeProjectFeb10.odt junk.odt -l -dT15
T1: 701 spans
…
T15: 6 spans
<style:style style:family="text" style:name="T15"><style:text-properties
fo:font-size="12pt" fo:font-style="italic" style:font-name-asian="Albany AMT"
style:font-name-complex="Albany AMT" style:font-size-asian="12pt"
style:font-style-asian="italic" style:font-style-complex="italic"/></style:style>
T16: 1 spans
…
=== Text style T15:
.
.
.
.
.
.”
collin@p3:~/kstyle>
So the font size is too big here -- also it's the wrong font! And did you notice that
the big font was just a '.' in several cases? How could you ever find those?
After looking at a bunch of them, I eventually decided I could collapse T3 and T5 into T2,
and most of the rest into T1. I ended up with this:
collin@p3:~/kstyle> ./kstyles.py CreativeProjectFeb10.odt d.odt T3=T2 T5=T2 T6=T1 \
T7=T1 T8=T1 T9=T1 T10=T1 T11=T1 T12=T1 T13=T1 T14=T1 T15=T2
Changing style "T3" to "T2".
Changing style "T5" to "T2".
Changing style "T6" to "T1".
Changing style "T7" to "T1".
Changing style "T8" to "T1".
Changing style "T9" to "T1".
Changing style "T10" to "T1".
Changing style "T11" to "T1".
Changing style "T12" to "T1".
Changing style "T13" to "T1".
Changing style "T14" to "T1".
Changing style "T15" to "T2".
collin@p3:~/kstyle>
The output file is called "d.odt" (due to typing laziness), and it looked fine.
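The renaming step boils down to rewriting the `text:style-name` attribute on matching spans. Here is a minimal sketch of that idea in Python 3, separate from the Python 2 script; `parse_remaps`, `apply_remaps`, and the `SAMPLE` snippet are hypothetical names and data made up for illustration.

```python
import xml.dom.minidom

# Made-up fragment with one oversized span (assumption, not real output).
SAMPLE = ('<doc xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
          '<text:p><text:span text:style-name="T1">She said "no</text:span>'
          '<text:span text:style-name="T15">."</text:span></text:p></doc>')


def parse_remaps(ops):
    """Turn ['T3=T2', 'T15=T2'] into {'T3': 'T2', 'T15': 'T2'}."""
    remap = {}
    for op in ops:
        old, sep, new = op.partition('=')
        if not (old and sep and new) or '=' in new:
            raise ValueError("Can't parse: %r" % op)
        remap[old] = new
    return remap


def apply_remaps(dom, remap):
    """Rewrite text:style-name on spans named in remap; return how many changed."""
    changed = 0
    for span in dom.getElementsByTagName('text:span'):
        old = span.getAttribute('text:style-name')
        if old in remap:
            span.setAttribute('text:style-name', remap[old])
            changed += 1
    return changed


dom = xml.dom.minidom.parseString(SAMPLE)
print(apply_remaps(dom, parse_remaps(['T15=T1'])), 'span(s) changed')
```

One design difference worth noting: this sketch applies all the mappings in a single pass, so "T3=T2 T2=T1" would not cascade T3 all the way to T1, whereas handling the arguments one after another would.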
Here's the script:
  1 #!/usr/bin/python -utt
  2 # vim:et:sw=4
  3 '''Unzip an ODF document and list or kill/substitute text styles.
  4
  5 Usage: kstyles INPUT OUTPUT [-l] [-dsty1]... [sty1=sty2]...
  6     -l
  7         list styles
  8
  9     -dsty1
 10         Show text spans having property sty1
 11
 12     sty1=sty2
 13         Text elements which are sty1 are assigned sty2
 14
 15 $Id: kstyles.py,v 0.4 2012/02/12 01:25:52 collin Exp collin $
 16 '''
 17
 18 import codecs
 19 import os
 20 import sys
 21 import xml.dom.minidom
 22 import zipfile
 23
 24 CONTENT = 'content.xml'
 25 STYLE = 'style:style'
 26 SFAMILY = 'style:family'
 27 SNAME = 'style:name'
 28 TSTYLENAME = 'text:style-name'
 29 TSPAN = 'text:span'
 30
 31 def main(args):
 32     '''Unpack ODF/XML (zipfile). Discover text styles.
 33     Find # of text elements which have each style; if "-l", display.
 34     If "-dXX", display text:spans with style XX
 35     For each "sty1=sty2" provided:
 36         change any text elements of sty1 to sty2'''
 37     try:
 38         infile_name = args[0]
 39         outfile_name = args[1]
 40         ops = args[2:]
 41     except IndexError:
 42         usage()
 43     if not os.path.exists(infile_name):
 44         print "Couldn't find input file %s" % infile_name
 45         usage()
 46     INFILE = zipfile.ZipFile(infile_name, 'r')
 47     # Sanity-check input file before doing anything else.
 48     if CONTENT not in INFILE.namelist():
 49         print "Couldn't find %s in %s's zip archive" % (CONTENT, infile_name)
 50         print 'Is it an ODF file?'
 51         sys.exit(1)
 52     # Read and parse content.
 53     cdata = INFILE.read(CONTENT)
 54     cdom = xml.dom.minidom.parseString(cdata)
The above checks parameters, unpacks the ".odt" file, and ensures that
CONTENT (viz., "content.xml"; see line 24) is there. Then it creates a
document object model (DOM) from what's in the content.
 55     # Find text styles
 56     cstyles = cdom.getElementsByTagName(STYLE)
 57     text_styles = [X for X in cstyles
 58                    if X.getAttribute(SFAMILY) == 'text']
 59     text_style_names = [X.getAttribute(SNAME) for X in text_styles]
 60     # print text_styles
 61     style_counts = dict()
 62     for astyle in text_style_names:
 63         style_counts[astyle] = 0
 64     for aspan in [X for X in cdom.getElementsByTagName(TSPAN)
 65                   if X.hasAttribute(TSTYLENAME)]:
 66         style_counts[aspan.getAttribute(TSTYLENAME)] += 1
The above looks for all the text-styles, and counts how many "text:span" items
refer to each text-style.
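One thing to watch in the counting loop: `style_counts` only has keys for styles defined in content.xml, so a span that names any other style would raise a KeyError at line 66. A possible variant, sketched in Python 3 with `collections.Counter`, counts whatever names turn up. The helper name `count_span_styles` and the `SAMPLE` snippet are made up for illustration.

```python
import collections
import xml.dom.minidom

# Made-up fragment (assumption): three spans, two distinct styles.
SAMPLE = ('<doc xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
          '<text:p><text:span text:style-name="T1">Hello</text:span>'
          '<text:span text:style-name="T15">.</text:span>'
          '<text:span text:style-name="T1"> world</text:span></text:p></doc>')


def count_span_styles(dom):
    """Count text:span elements per text:style-name value."""
    return collections.Counter(
        span.getAttribute('text:style-name')
        for span in dom.getElementsByTagName('text:span')
        if span.hasAttribute('text:style-name'))


counts = count_span_styles(xml.dom.minidom.parseString(SAMPLE))
for name, n in sorted(counts.items()):
    print('%s: %d spans' % (name, n))
```

A Counter returns 0 for names it has never seen, so styles that are defined but unused can still be reported as "0 spans" by looking them up by name.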
 67
 68     if '-l' in ops:
 69         for idx in range(len(text_styles)):
 70             astyle = text_style_names[idx]
 71             print '%s: %d spans' % (astyle, style_counts[astyle])
 72             if style_counts[astyle]:
 73                 print '\t%s' % text_styles[idx].toxml()
 74         for astyle in [X for X in style_counts if X not in text_style_names]:
 75             print '??? %s: %d spans' % (astyle, style_counts[astyle])
 76         while '-l' in ops:
 77             ops.remove('-l')
...and this part prints the information, if you want it
 78
 79     # Before the following fun stuff, make stdout be utf8
 80     utf8_enc = codecs.getencoder('utf8')
I need line 80 to avoid encoding errors. The next part handles each
operation (or "command") -- "-dT15" for example.
 81
 82     for op in ops:
 83         if op.startswith('-d'):
 84             astyle = op[2:]
 85             if astyle not in style_counts:
 86                 print "*** Couldn't find style %s" % astyle
 87                 continue
 88             print "=== Text style %s:" % astyle
 89             for aspan in [X for X in cdom.getElementsByTagName(TSPAN)
 90                           if X.getAttribute(TSTYLENAME) == astyle]:
 91                 print utf8_enc(aspan.firstChild.data)[0]
 92             continue
So that was the "-d" part -- dump out the text spans referring to a
particular style.
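Line 91 assumes each span's first child is a text node. An empty span (where `firstChild` is None) or one that starts with a child element such as `text:s` would raise an AttributeError there. A more forgiving dump might gather all the text under the span instead. This is a Python 3 sketch with made-up names (`span_text`, `dump_spans`) and sample data, not what kstyles.py does.

```python
import xml.dom.minidom

# Made-up fragment (assumption): a plain span, a span starting with an
# element, and an empty span, all in style T15.
SAMPLE = ('<doc xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
          '<text:span text:style-name="T15">.</text:span>'
          '<text:span text:style-name="T15"><text:s/>end.</text:span>'
          '<text:span text:style-name="T15"/>'
          '</doc>')


def span_text(node):
    """Concatenate all text beneath node, descending into child elements."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(span_text(child))
    return ''.join(parts)


def dump_spans(dom, style_name):
    """Return the text of every text:span that uses style_name."""
    return [span_text(span)
            for span in dom.getElementsByTagName('text:span')
            if span.getAttribute('text:style-name') == style_name]


for text in dump_spans(xml.dom.minidom.parseString(SAMPLE), 'T15'):
    print(repr(text))
```

Note that this sketch does not expand `text:s` elements into the spaces they stand for; they simply contribute nothing to the result.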
 93         styles = op.split('=')
 94         if len(styles) > 2:
 95             print >> sys.stderr, "Can't parse: '%s'" % op
 96             usage()
 97         if len(styles) < 2:
 98             print >> sys.stderr, 'Not yet implemented: %s' % op
 99             continue
100         print 'Changing style "%s" to "%s".' % (styles[0], styles[1])
101         for aspan in [X for X in cdom.getElementsByTagName(TSPAN)
102                       if X.getAttribute(TSTYLENAME) == styles[0]]:
103             aspan.setAttribute(TSTYLENAME, styles[1])
Line 93 interprets "T15=T1" and assigns styles[0]="T15", styles[1]="T1". Then
lines 101-103 find all the text-spans matching "T15" and call setAttribute
to change each one to "T1". The last part puts the content back into
a new ODF file:
104
105     if os.path.exists(outfile_name):
106         os.unlink(outfile_name)
107     OUTFILE = zipfile.ZipFile(outfile_name, 'w')
108     for oldinfo in INFILE.infolist():
109         fname = oldinfo.filename
110         fsize = oldinfo.file_size
111         #print 'archive member "%s", %d bytes' % (fname, fsize)
112         if fsize > 0:
113             if fname == CONTENT:
114                 OUTFILE.writestr(fname, utf8_enc(cdom.toxml())[0])
115             else:
116                 OUTFILE.writestr(fname, INFILE.read(fname))
117         else:
118             OUTFILE.writestr(fname, '')
119     OUTFILE.close()
120
121
122 def usage():
123     print >> sys.stderr, __doc__
124     sys.exit(1)
125
126 if __name__ == '__main__':
127     main(sys.argv[1:])
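The repacking step (lines 105-119) writes every member with zipfile's default setting, which stores data uncompressed. That works, and it keeps "mimetype" first and uncompressed as the ODF packaging rules expect, but the output file comes out larger than the input. Below is a Python 3 sketch of a variant that reuses each member's original compression. The function name `rewrite_odf` and the in-memory demo archive are assumptions for illustration, not part of kstyles.py.

```python
import io
import zipfile


def rewrite_odf(src, dst, new_content, member='content.xml'):
    """Copy every member of src into dst, swapping in new_content for member.

    src and dst may be file names or file-like objects; new_content is bytes.
    """
    with zipfile.ZipFile(src, 'r') as zin, zipfile.ZipFile(dst, 'w') as zout:
        for info in zin.infolist():
            if info.filename == member:
                data = new_content
            else:
                data = zin.read(info.filename)
            # Keep the member order and each member's original compression,
            # so 'mimetype' stays first and stored uncompressed.
            zout.writestr(info.filename, data,
                          compress_type=info.compress_type)


# Demo on a tiny made-up archive held in memory.
source = io.BytesIO()
with zipfile.ZipFile(source, 'w') as z:
    z.writestr('mimetype', 'application/vnd.oasis.opendocument.text',
               compress_type=zipfile.ZIP_STORED)
    z.writestr('content.xml', '<old/>' * 50,
               compress_type=zipfile.ZIP_DEFLATED)

target = io.BytesIO()
rewrite_odf(source, target, b'<new/>')

with zipfile.ZipFile(target) as z:
    print(z.namelist())
    print(z.read('content.xml'))
```

With a minidom document in hand, the replacement bytes could come from `dom.toxml(encoding='utf-8')`.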
Update: May 2014
I ran into this issue again with a collection of short stories. A single document
(a short story, to which a half-dozen other stories were appended via snarf'n'barf
with the mouse) had something like 35 or 36 text styles, which were largely unnecessary.
I had to update the above script (which by now I've renamed kstyles.py, but
I don't remember what "k" meant) to account for a few things.
You can see the resulting script at
http://cpwriter.net/kstyles.py-v0.6
Wait, did I say documentation?
Unzip an ODF document and list or kill/substitute text styles.
Usage: kstyles INPUT OUTPUT [-l] [-dsty1]... [sty1=sty2]...
INPUT
name of file to read
OUTPUT
name of file to write new (modified) file
-l
list styles
-dsty1
Show text spans having property sty1
sty1=sty2
Text elements which are sty1 are assigned sty2
$Id: kstyles.py,v 0.6 2014/05/10 22:19:13 collin Exp collin $
Functions
    main(args)
        Unpack ODF/XML (zipfile). Discover text styles.
        Find # of text elements which have each style; if "-l", display.
        If "-dXX", display text:spans with style XX
        For each "sty1=sty2" provided:
            change any text elements of sty1 to sty2
    usage()

Data
    CONTENT = 'content.xml'
    SFAMILY = 'style:family'
    SNAME = 'style:name'
    STYLE = 'style:style'
    STYLES = 'styles.xml'
    TSPAN = 'text:span'
    TSTYLENAME = 'text:style-name'