Usage¶
Converting files using the command line interface¶
Using the pydocx
command,
you can specify the output format
with the input and output files:
$ pydocx --html input.docx output.html
Converting files using the library directly¶
Choose the conversion class, and then pass in either the full path to an existing MS Word document on the filesystem or pass in a file-like object. The parsed content can then be accessed using the parsed attribute.
Examples:
from pydocx.parsers import Docx2Html
# Pass in a path to an existing file
parser = Docx2Html(path='file.docx')
print parser.parsed
# Pass in a file pointer
parser = Docx2Html(open('file.docx', 'rb'))
print parser.parsed
# Pass in a file-like object
from cStringIO import StringIO
buf = StringIO()
with open('file.docx') as f:
buf.write(f.read())
parser = Docx2Html(buf)
print parser.parsed
Currently Supported HTML elements¶
- tables
- nested tables
- rowspans
- colspans
- lists in tables
- lists
- list styles
- nested lists
- list of tables
- list of pragraphs
- justification
- images
- styles
- bold
- italics
- underline
- hyperlinks
- headings
HTML Styles¶
The base parser Docx2Html
relies on certain css class being set for certain behaviour to occur.
Currently these include:
- class
pydocx-insert
-> Turns the text green. - class
pydocx-delete
-> Turns the text red and draws a line through the text. - class
pydocx-center
-> Aligns the text to the center. - class
pydocx-right
-> Aligns the text to the right. - class
pydocx-left
-> Aligns the text to the left. - class
pydocx-comment
-> Turns the text blue. - class
pydocx-underline
-> Underlines the text. - class
pydocx-caps
-> Makes all text uppercase. - class
pydocx-small-caps
-> Makes all text uppercase, however truly lowercase letters will be small than their uppercase counterparts. - class
pydocx-strike
-> Strike a line through. - class
pydocx-hidden
-> Hide the text. - class
pydocx-tab
-> Represents a tab within the document.
Exceptions¶
There is only one custom exception (MalformedDocxException
).
It is raised if either the xml
or zipfile
libraries raise an exception.
Deviations from the ECMA-376 Specification¶
Missing val attribute in underline tag¶
In the event that the
val
attribute is missing from au
(ST_Underline
type), we treat the underline as off, or none. See also http://msdn.microsoft.com/en-us/library/ff532016%28v=office.12%29.aspxIf the val attribute is not specified, Word defaults to the value defined in the style hierarchy and then to no underline.