Usage

Converting files using the command line interface

With the pydocx command, specify the output format, followed by the input file and the output file:

$ pydocx --html input.docx output.html
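
To convert several documents at once, the command can be driven from a short script. The following is a minimal sketch, assuming the pydocx command is on the PATH and accepts exactly the arguments shown above. The helper names build_command and convert_directory are hypothetical and not part of pydocx.

```python
import os
import subprocess


def build_command(src, dst):
    """Return the argument list for one conversion (hypothetical helper)."""
    # Mirrors the documented usage: pydocx --html input.docx output.html
    return ['pydocx', '--html', src, dst]


def convert_directory(directory):
    """Convert every .docx file in `directory` to an .html file beside it."""
    outputs = []
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith('.docx'):
            continue
        src = os.path.join(directory, name)
        dst = os.path.splitext(src)[0] + '.html'
        # check_call raises CalledProcessError if the command exits non-zero
        subprocess.check_call(build_command(src, dst))
        outputs.append(dst)
    return outputs
```

Calling convert_directory('reports') would then run one pydocx invocation per .docx file found in that directory.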

Converting files using the library directly

Choose the conversion class, then pass in either the full path to an existing MS Word document on the filesystem or a file-like object. The converted output is then available through the parsed attribute.

Examples:

from pydocx.parsers import Docx2Html

# Pass in a path to an existing file
parser = Docx2Html(path='file.docx')
print parser.parsed

# Pass in a file pointer
parser = Docx2Html(open('file.docx', 'rb'))
print parser.parsed

# Pass in a file-like object
from cStringIO import StringIO
buf = StringIO()
with open('file.docx', 'rb') as f:  # a .docx file is a zip archive, so read it as binary
    buf.write(f.read())
parser = Docx2Html(buf)
print parser.parsed

Currently Supported HTML elements

  • tables
    • nested tables
    • rowspans
    • colspans
    • lists in tables
  • lists
    • list styles
    • nested lists
    • list of tables
    • list of paragraphs
  • justification
  • images
  • styles
    • bold
    • italics
    • underline
    • hyperlinks
  • headings

HTML Styles

The base parser Docx2Html relies on certain CSS classes being set for certain behaviour to occur. Currently these include:

  • class pydocx-insert -> Turns the text green.
  • class pydocx-delete -> Turns the text red and draws a line through the text.
  • class pydocx-center -> Aligns the text to the center.
  • class pydocx-right -> Aligns the text to the right.
  • class pydocx-left -> Aligns the text to the left.
  • class pydocx-comment -> Turns the text blue.
  • class pydocx-underline -> Underlines the text.
  • class pydocx-caps -> Makes all text uppercase.
  • class pydocx-small-caps -> Makes all text uppercase; however, letters that were originally lowercase will be smaller than their uppercase counterparts.
  • class pydocx-strike -> Strikes a line through the text.
  • class pydocx-hidden -> Hides the text.
  • class pydocx-tab -> Represents a tab within the document.
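
The class list above can be turned into a stylesheet for pages that embed the converted HTML. The following is a minimal sketch: the CSS declarations are assumptions inferred from the descriptions above, not the rules pydocx itself emits. The tab width in particular is purely illustrative, and build_stylesheet is a hypothetical helper.

```python
# Assumed CSS declarations, one per class described above.
PYDOCX_STYLES = [
    ('pydocx-insert', 'color: green;'),
    ('pydocx-delete', 'color: red; text-decoration: line-through;'),
    ('pydocx-center', 'text-align: center;'),
    ('pydocx-right', 'text-align: right;'),
    ('pydocx-left', 'text-align: left;'),
    ('pydocx-comment', 'color: blue;'),
    ('pydocx-underline', 'text-decoration: underline;'),
    ('pydocx-caps', 'text-transform: uppercase;'),
    ('pydocx-small-caps', 'font-variant: small-caps;'),
    ('pydocx-strike', 'text-decoration: line-through;'),
    ('pydocx-hidden', 'visibility: hidden;'),
    # Illustrative only: the source does not say how a tab is rendered.
    ('pydocx-tab', 'display: inline-block; width: 4em;'),
]


def build_stylesheet(styles=PYDOCX_STYLES):
    """Render the (class name, declarations) pairs as one CSS rule per line."""
    return '\n'.join('.%s { %s }' % (name, rule) for name, rule in styles)


print(build_stylesheet())
```

Keeping the rules in a list of pairs makes it easy to override a single class, for example to hide deleted text instead of striking it through.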

Exceptions

There is only one custom exception, MalformedDocxException. It is raised if either the xml or the zipfile library raises an exception while the document is being read.
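
The wrapping behaviour described above can be illustrated with a self-contained sketch. Note the assumptions: the MalformedDocxException class below is a local stand-in rather than the one shipped with pydocx, read_document_xml is a hypothetical helper, and the real library may catch a different set of errors.

```python
import io
import zipfile
from xml.etree import ElementTree


class MalformedDocxException(Exception):
    """Local stand-in for illustration; not imported from pydocx."""


def read_document_xml(fileobj):
    """Return the parsed word/document.xml root from a .docx file object."""
    try:
        with zipfile.ZipFile(fileobj) as archive:
            raw = archive.read('word/document.xml')
        return ElementTree.fromstring(raw)
    except (zipfile.BadZipFile, ElementTree.ParseError) as exc:
        # Collapse zip and XML failures into a single exception type
        raise MalformedDocxException(str(exc))


try:
    read_document_xml(io.BytesIO(b'this is not a zip archive'))
except MalformedDocxException as exc:
    print('Could not read document: %s' % exc)
```

Callers then need only one except clause, regardless of whether the archive or the XML inside it was damaged.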

Deviations from the ECMA-376 Specification

Missing val attribute in underline tag