Saturday, February 11, 2012

Strange fonts and spacing in a ".doc" file

My wife, the lovely Carol, is a writer. Unsurprisingly, she uses word-processing software, NeoOffice in particular. Her grad school profs use Microsoft Word, so what she sends them are ".doc" files.

One of the annoying things about so-called WYSIWYG word processors is that you can't quite tell what's going on. The image at right, for example, shows somewhat uneven spacing between lines. The gap between the first two lines is a little wider than the gap between the second and third. And the 2nd line from the bottom is a little farther away from its predecessor.

Where do these things come from? Well, if you jam a half-dozen ".doc" files together, you might have a diversity of font faces and sizes. Or if you copy/paste some text from one document (which started off with a larger font size for example) you might find font sizes varying even within a paragraph.

Now a close look at the 2nd-to-last line reveals that the closing quotation mark looks a little large. Indeed, if you position the cursor right there and watch the appropriate toolbar, you might see the font size window change from '10' to '12' and back. This explains why that line is a little lower than you'd otherwise expect. But what about the 2nd line? The same careful trick with the cursor would show that one of the '.'s had a larger font size.

So in a 112-page document, would you want to look carefully at the line spacings and watch the font-size on every single character to see where the font size changed? Or grab the entire document and change the font size to 10pt? That last trick might actually work, but what if you have some characters in a different font—"Albany AMT" for example instead of "Verdana"?

If you know the entire document shall be of one font, one size, one style (etc) then that would work, but often there are words or sentences in italics, or a section in a different font. So the brute-force method feels just a little risky.

So what's a techno-weenie to do with this? Since this particular techno-weenie wrote this article about manipulating ODF files with Python, the natural thing is to write a Python script. I'll spare you the gruesome details but basically I did this:

  1. Use OpenOffice/NeoOffice to convert a ".doc" file to ".odt"
  2. Use unzip to unpack the ".odt" file, and examine "content.xml" using emacs (or firefox)
  3. Write a Python script to examine and modify properties
  4. Play with it a bit, and save the modified version...
  5. Use OpenOffice/NeoOffice to convert the ".odt" file back to ".doc"
So items #1 and #5 are just a matter of "Save as"; for #2 I said:
collin@p3:/mnt/home/collin/kstyle/tmp> unzip ../CreativeProjectFeb10.odt 
Archive:  ../CreativeProjectFeb10.odt
 extracting: mimetype                
   creating: Configurations2/statusbar/
  inflating: Configurations2/accelerator/current.xml  
   creating: Configurations2/floater/
   creating: Configurations2/popupmenu/
   creating: Configurations2/progressbar/
   creating: Configurations2/menubar/
   creating: Configurations2/toolbar/
   creating: Configurations2/images/Bitmaps/
  inflating: layout-cache            
  inflating: content.xml             
  inflating: styles.xml              
 extracting: meta.xml                
  inflating: Thumbnails/thumbnail.png  
  inflating: Thumbnails/thumbnail.pdf  
  inflating: settings.xml            
  inflating: META-INF/manifest.xml   
collin@p3:/mnt/home/collin/kstyle/tmp> 
Then I pointed firefox at /mnt/home/collin/kstyle/tmp/content.xml and observed that some text styles specified different fonts, different sizes, etc.
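If you'd rather stay in Python than use unzip and a browser, the same peek can be sketched as follows. This is Python 3 (the script further down is Python 2). The function name `peek_content` is made up for illustration, and the tiny in-memory archive is only a stand-in for a real ".odt" file.

```python
import io
import xml.dom.minidom
import zipfile


def peek_content(odt_file, member='content.xml'):
    '''Return (member names, pretty-printed XML of one member).

    odt_file may be a path or a file-like object; an ODF file is a zip archive.
    '''
    with zipfile.ZipFile(odt_file, 'r') as archive:
        names = archive.namelist()
        dom = xml.dom.minidom.parseString(archive.read(member))
    return names, dom.toprettyxml(indent='  ')


# Stand-in for a real .odt: a tiny in-memory zip (an assumption for this demo).
demo = io.BytesIO()
with zipfile.ZipFile(demo, 'w') as z:
    z.writestr('mimetype', 'application/vnd.oasis.opendocument.text')
    z.writestr('content.xml', '<doc><p>Hello</p></doc>')

names, pretty = peek_content(demo)
print(names)
print(pretty)
```

With a real document you would pass the path of the ".odt" file instead of `demo`.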
collin@p3:~/kstyle> ./kstyles.py CreativeProjectFeb10.odt junk.odt -l | grep -v '<'
T1: 701 spans
T2: 77 spans
T3: 2 spans
T4: 10 spans
T5: 2 spans
T6: 4 spans
T7: 18 spans
T8: 8 spans
T9: 13 spans
T10: 1 spans
T11: 221 spans
T12: 1 spans
T13: 112 spans
T14: 2 spans
T15: 6 spans
T16: 1 spans
collin@p3:~/kstyle> 
The script gives the characteristics of the styles, which I've filtered out above. Anyway, here's a slightly less censored version, looking at style T15; I've folded the output lines so you can see 'em all:
collin@p3:~/kstyle> ./kstyles.py CreativeProjectFeb10.odt junk.odt -l -dT15
T1: 701 spans
…
T15: 6 spans
        <style:style style:family="text" style:name="T15"><style:text-properties 
fo:font-size="12pt" fo:font-style="italic" style:font-name-asian="Albany AMT" 
style:font-name-complex="Albany AMT" style:font-size-asian="12pt" 
style:font-style-asian="italic" style:font-style-complex="italic"/></style:style>
T16: 1 spans
…
=== Text style T15:
. 
. 
. 
. 
. 
.”
collin@p3:~/kstyle> 
So the font size is too big here -- also it's the wrong font! And did you notice that the big font was just a '.' in several cases? How could you ever find those?

After looking at a bunch of them, I eventually decided I could collapse T3, T5 and T15 into T2, and most of the rest into T1. I ended up with this:

collin@p3:~/kstyle> ./kstyles.py CreativeProjectFeb10.odt d.odt T3=T2  T5=T2  T6=T1 \
T7=T1 T8=T1 T9=T1 T10=T1 T11=T1 T12=T1 T13=T1  T14=T1 T15=T2
Changing style "T3" to "T2".
Changing style "T5" to "T2".
Changing style "T6" to "T1".
Changing style "T7" to "T1".
Changing style "T8" to "T1".
Changing style "T9" to "T1".
Changing style "T10" to "T1".
Changing style "T11" to "T1".
Changing style "T12" to "T1".
Changing style "T13" to "T1".
Changing style "T14" to "T1".
Changing style "T15" to "T2".
collin@p3:~/kstyle> 
The output file is called "d.odt" (due to typing laziness), and it looked fine. Here's the script:
    1   #!/usr/bin/python -utt
    2   # vim:et:sw=4
    3   '''Unzip an ODF document and list or kill/substitute text styles.
    4
    5   Usage: kstyles INPUT OUTPUT [-l] [-dsty1]... [sty1=sty2]...
    6       -l
    7           list styles
    8
    9       -dsty1
   10           Show text spans having property sty1
   11
   12       sty1=sty2
   13           Text elements which are sty1 are assigned sty2
   14
   15   $Id: kstyles.py,v 0.4 2012/02/12 01:25:52 collin Exp collin $
   16   '''
   17
   18   import codecs
   19   import os
   20   import sys
   21   import xml.dom.minidom
   22   import zipfile
   23
   24   CONTENT = 'content.xml'
   25   STYLE = 'style:style'
   26   SFAMILY = 'style:family'
   27   SNAME = 'style:name'
   28   TSTYLENAME = 'text:style-name'
   29   TSPAN = 'text:span'
   30
   31   def main(args):
   32       '''Unpack ODF/XML (zipfile).  Discover text styles.
   33       Find # of text elements which have each style; if "-l", display.
   34       If "-dXX", display text:spans with style XX
   35       For each "sty1=sty2" provided:
   36           change any text elements of sty1 to sty2'''
   37       try:
   38           infile_name = args[0]
   39           outfile_name = args[1]
   40           ops = args[2:]
   41       except IndexError:
   42           usage()
   43       if not os.path.exists(infile_name):
   44           print "Couldn't find input file %s" % infile_name
   45           usage()
   46       INFILE = zipfile.ZipFile(infile_name, 'r')
   47       # Sanity-check input file before doing anything else.
   48       if CONTENT not in INFILE.namelist():
   49           print "Couldn't find %s in %s's zip archive" % (CONTENT, infile_name)
   50           print 'Is it an ODF file?'
   51           sys.exit(1)
   52       # Read and parse content.
   53       cdata = INFILE.read(CONTENT)
   54       cdom = xml.dom.minidom.parseString(cdata)
The above checks parameters, unpacks the ".odt" file, and ensures that CONTENT (viz., "content.xml"; see line 24) is there. Then it creates a document object model (DOM) from what's in the content.
   55       # Find text styles
   56       cstyles = cdom.getElementsByTagName(STYLE)
   57       text_styles = [X for X in cstyles
   58                           if X.getAttribute(SFAMILY) == 'text']
   59       text_style_names = [X.getAttribute(SNAME) for X in text_styles]
   60       # print text_styles
   61       style_counts = dict()
   62       for astyle in text_style_names:
   63           style_counts[astyle] = 0
   64       for aspan in [X for X in cdom.getElementsByTagName(TSPAN)
   65                       if X.hasAttribute(TSTYLENAME)]:
   66           style_counts[aspan.getAttribute(TSTYLENAME)] += 1
The above looks for all the text-styles, and counts how many "text:span" items refer to each text-style.
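The counting step can be tried on its own with a small hand-made fragment. What follows is a minimal Python 3 sketch, not the script itself: `SAMPLE` and `count_text_styles` are invented for illustration. Using a `collections.Counter` is also a swapped-in choice, so that a span naming a style not defined in this XML gets counted instead of raising a KeyError.

```python
import collections
import xml.dom.minidom

# Hand-made stand-in for content.xml (an assumption, not a real document).
SAMPLE = '''<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:automatic-styles>
    <style:style style:family="text" style:name="T1"/>
    <style:style style:family="text" style:name="T2"/>
    <style:style style:family="paragraph" style:name="P1"/>
  </office:automatic-styles>
  <office:body><office:text><text:p text:style-name="P1">
    <text:span text:style-name="T1">one</text:span>
    <text:span text:style-name="T1">two</text:span>
    <text:span text:style-name="T2">.</text:span>
  </text:p></office:text></office:body>
</office:document-content>'''


def count_text_styles(xml_text):
    '''Map each text style name to the number of text:span elements using it.'''
    dom = xml.dom.minidom.parseString(xml_text)
    counts = collections.Counter()
    # Start every declared text style at zero so unused styles show up too.
    for style in dom.getElementsByTagName('style:style'):
        if style.getAttribute('style:family') == 'text':
            counts[style.getAttribute('style:name')] = 0
    for span in dom.getElementsByTagName('text:span'):
        if span.hasAttribute('text:style-name'):
            counts[span.getAttribute('text:style-name')] += 1
    return counts


counts = count_text_styles(SAMPLE)
for name in sorted(counts):
    print('%s: %d spans' % (name, counts[name]))
```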
   67
   68       if '-l' in ops:
   69           for idx in range(len(text_styles)):
   70               astyle = text_style_names[idx]
   71               print '%s: %d spans' % (astyle, style_counts[astyle])
   72               if style_counts[astyle]:
   73                   print '\t%s' % text_styles[idx].toxml()
   74           for astyle in [X for X in style_counts if X not in text_style_names]:
   75               print '??? %s: %d spans' % (astyle, style_counts[astyle])
   76           while '-l' in ops:
   77               ops.remove('-l')
...and this part prints the information, if you want it.
   78
   79       # Before the following fun stuff, make stdout be utf8
   80       utf8_enc = codecs.getencoder('utf8')
I need line 80 to avoid encoding errors. The next part handles each operation (or "command") -- "-dT15" for example.
   81
   82       for op in ops:
   83           if op.startswith('-d'):
   84               astyle = op[2:]
   85               if astyle not in style_counts:
   86                   print "*** Couldn't find style %s" % astyle
   87                   continue
   88               print "=== Text style %s:" % astyle
   89               for aspan in [X for X in cdom.getElementsByTagName(TSPAN)
   90                               if X.getAttribute(TSTYLENAME) == astyle]:
   91                   print utf8_enc(aspan.firstChild.data)[0]
   92               continue
So that was the "-d" part -- dump out the text spans referring to a particular style.
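A note on lines 80 and 91: `codecs.getencoder` returns a function that hands back a pair (the encoded bytes and the number of characters consumed), which is why the script indexes the result with `[0]`. Here is a small Python 3 illustration; the sample string is made up to resemble the ".”" span shown earlier.

```python
import codecs

utf8_enc = codecs.getencoder('utf8')

# A period followed by a closing curly quote (U+201D).
encoded, consumed = utf8_enc(u'.\u201d')
print(repr(encoded))   # the UTF-8 bytes
print(consumed)        # how many input characters were encoded
```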
   93           styles = op.split('=')
   94           if len(styles) > 2:
   95               print >> sys.stderr, "Can't parse: '%s'" % op
   96               usage()
   97           if len(styles) < 2:
   98               print >> sys.stderr, 'Not yet implemented: %s' % op
   99               continue
  100           print 'Changing style "%s" to "%s".' % (styles[0], styles[1])
  101           for aspan in [X for X in cdom.getElementsByTagName(TSPAN)
  102                           if X.getAttribute(TSTYLENAME) == styles[0]]:
  103               aspan.setAttribute(TSTYLENAME, styles[1])
Line 93 interprets "T15=T1" and assigns styles[0]="T15", styles[1]="T1". Then lines 101-103 find all the text-spans matching "T15" and call setAttribute to change each one to "T1". The last part puts the content back into a new ODF file:
  104
  105       if os.path.exists(outfile_name):
  106           os.unlink(outfile_name)
  107       OUTFILE = zipfile.ZipFile(outfile_name, 'w')
  108       for oldinfo in INFILE.infolist():
  109           fname = oldinfo.filename
  110           fsize = oldinfo.file_size
  111           #print 'archive member "%s", %d bytes' % (fname, fsize)
  112           if fsize > 0:
  113               if fname == CONTENT:
  114                   OUTFILE.writestr(fname, utf8_enc(cdom.toxml())[0])
  115               else:
  116                   OUTFILE.writestr(fname, INFILE.read(fname))
  117           else:
  118               OUTFILE.writestr(fname, '')
  119       OUTFILE.close()
  120
  121
  122   def usage():
  123       print >> sys.stderr, __doc__
  124       sys.exit(1)
  125
  126   if __name__ == '__main__':
  127       main(sys.argv[1:])
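
The substitute-and-repack steps can be exercised in isolation. Here is a minimal Python 3 sketch under some assumptions: `restyle` and `repack` are hypothetical names, the in-memory archive stands in for a real ".odt", and passing `compress_type` to `writestr` (available from Python 3.7) is a choice made for this sketch so that each member keeps its original compression setting. The script above does not do that.

```python
import io
import xml.dom.minidom
import zipfile

CONTENT = 'content.xml'


def restyle(xml_bytes, old, new):
    '''Return content XML (bytes) with text:span style `old` renamed to `new`.'''
    dom = xml.dom.minidom.parseString(xml_bytes)
    for span in dom.getElementsByTagName('text:span'):
        if span.getAttribute('text:style-name') == old:
            span.setAttribute('text:style-name', new)
    return dom.toxml(encoding='utf-8')


def repack(src, dst, op):
    '''Copy zip archive src to dst, applying one "sty1=sty2" op to content.xml.'''
    old, new = op.split('=')
    with zipfile.ZipFile(src, 'r') as zin, zipfile.ZipFile(dst, 'w') as zout:
        for info in zin.infolist():          # preserves the member order
            data = zin.read(info.filename)
            if info.filename == CONTENT:
                data = restyle(data, old, new)
            zout.writestr(info.filename, data, compress_type=info.compress_type)


# Stand-in for a real .odt (an assumption for this demo).
src = io.BytesIO()
with zipfile.ZipFile(src, 'w') as z:
    z.writestr('mimetype', 'application/vnd.oasis.opendocument.text')
    z.writestr(CONTENT,
               '<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
               '<text:span text:style-name="T15">.</text:span></text:p>')

dst = io.BytesIO()
repack(src, dst, 'T15=T2')
with zipfile.ZipFile(dst, 'r') as z:
    members = z.namelist()
    result = z.read(CONTENT).decode('utf-8')
print(members)
print(result)
```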

Update: May 2014

I ran into this issue again with a collection of short stories. A single document (a short story, to which a half-dozen other stories were appended via snarf'n'barf with the mouse) had something like 35 or 36 text styles, which were largely unnecessary. I had to update the above script (which by now I've renamed kstyles.py, but I don't remember what "k" meant) to account for a few things:
  • Sometimes line 91 didn't work:
       91                   print utf8_enc(aspan.firstChild.data)[0]
    because firstChild wasn't Plain Old Text; it was this: <text:s/>
  • Sometimes a text style couldn't be found in content.xml; I had to look in styles.xml (no, really); who would have known?
  • While I was in there, I added some sanity checks and updated the documentation
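The first two items can be sketched as follows. This is a guess at one way to handle them in Python 3, not a copy of what v0.6 does. `span_text` and `text_style_names` are hypothetical helpers, and the two XML strings are hand-made stand-ins for content.xml and styles.xml. In ODF, `<text:s/>` stands for a space, with an optional `text:c` attribute giving a repeat count.

```python
import xml.dom.minidom

TEXT_NS = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0'
STYLE_NS = 'urn:oasis:names:tc:opendocument:xmlns:style:1.0'


def span_text(node):
    '''Collect the text under node, turning <text:s/> into spaces.'''
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            if child.tagName == 'text:s':
                # text:c holds the number of spaces; it defaults to one.
                parts.append(' ' * int(child.getAttribute('text:c') or 1))
            else:
                parts.append(span_text(child))
    return ''.join(parts)


def text_style_names(*xml_docs):
    '''Names of family="text" styles found in any of the given XML strings.'''
    names = set()
    for xml_text in xml_docs:
        dom = xml.dom.minidom.parseString(xml_text)
        for style in dom.getElementsByTagName('style:style'):
            if style.getAttribute('style:family') == 'text':
                names.add(style.getAttribute('style:name'))
    return names


# Hand-made stand-ins for content.xml and styles.xml (assumptions).
content = ('<text:p xmlns:text="%s" xmlns:style="%s">'
           '<style:style style:family="text" style:name="T1"/>'
           '<text:span text:style-name="T1"><text:s text:c="2"/>end.</text:span>'
           '</text:p>' % (TEXT_NS, STYLE_NS))
styles = ('<root xmlns:style="%s">'
          '<style:style style:family="text" style:name="Emphasis"/>'
          '</root>' % STYLE_NS)

span = xml.dom.minidom.parseString(content).getElementsByTagName('text:span')[0]
first_is_element = span.firstChild.nodeType == span.ELEMENT_NODE
text = span_text(span)
names = text_style_names(content, styles)
print(first_is_element, repr(text), sorted(names))
```

Here the span's first child is an element, not a text node, so `firstChild.data` would fail on it; walking all the children avoids that.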
You can see the resulting script at http://cpwriter.net/kstyles.py-v0.6
Wait, did I say documentation? Here it is:
 
 
kstyles
/mnt/home/collin/projects/kstyle/kstyles.py

Unzip an ODF document and list or kill/substitute text styles.

Usage: kstyles INPUT OUTPUT [-l] [-dsty1]... [sty1=sty2]...
    INPUT
        name of file to read

    OUTPUT
        name of file to write new (modified) file

    -l
        list styles

    -dsty1
        Show text spans having property sty1

    sty1=sty2
        Text elements which are sty1 are assigned sty2

$Id: kstyles.py,v 0.6 2014/05/10 22:19:13 collin Exp collin $

Modules
    codecs
    os
    sys
    xml
    zipfile

Functions
    main(args)
        Unpack ODF/XML (zipfile).  Discover text styles.
        Find # of text elements which have each style; if "-l", display.
        If "-dXX", display text:spans with style XX
        For each "sty1=sty2" provided:
            change any text elements of sty1 to sty2
    usage()

Data
    CONTENT = 'content.xml'
    SFAMILY = 'style:family'
    SNAME = 'style:name'
    STYLE = 'style:style'
    STYLES = 'styles.xml'
    TSPAN = 'text:span'
    TSTYLENAME = 'text:style-name'
