Extending PyDocX

Customizing The Parser

DocxParser declares abstract methods that each concrete parser overrides to suit its output format. The abstract methods are as follows:

class DocxParser:

    @property
    def parsed(self):
        return self._parsed

    @abstractmethod
    def escape(self, text):
        return text

    @abstractmethod
    def linebreak(self):
        return ''

    @abstractmethod
    def paragraph(self, text):
        return text

    @abstractmethod
    def heading(self, text, heading_level):
        return text

    @abstractmethod
    def insertion(self, text, author, date):
        return text

    @abstractmethod
    def hyperlink(self, text, href):
        return text

    @abstractmethod
    def image_handler(self, path):
        return path

    @abstractmethod
    def image(self, path, x, y):
        return self.image_handler(path)

    @abstractmethod
    def deletion(self, text, author, date):
        return text

    @abstractmethod
    def bold(self, text):
        return text

    @abstractmethod
    def italics(self, text):
        return text

    @abstractmethod
    def underline(self, text):
        return text

    @abstractmethod
    def superscript(self, text):
        return text

    @abstractmethod
    def subscript(self, text):
        return text

    @abstractmethod
    def tab(self):
        return True

    @abstractmethod
    def ordered_list(self, text):
        return text

    @abstractmethod
    def unordered_list(self, text):
        return text

    @abstractmethod
    def list_element(self, text):
        return text

    @abstractmethod
    def table(self, text):
        return text

    @abstractmethod
    def table_row(self, text):
        return text

    @abstractmethod
    def table_cell(self, text):
        return text

    @abstractmethod
    def page_break(self):
        return True

    @abstractmethod
    def indent(self, text, left='', right='', firstLine=''):
        return text

Docx2Html inherits from DocxParser and implements basic HTML handling.

class Docx2Html(DocxParser):

    #  Escape '&', '<', and '>' so we render the HTML correctly
    def escape(self, text):
        return xml.sax.saxutils.quoteattr(text)[1:-1]

    # return a line break
    def linebreak(self, pre=None):
        return '<br />'

    # add paragraph tags
    def paragraph(self, text, pre=None):
        return '<p>' + text + '</p>'
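
The escape override above relies on quoteattr from the standard library's xml.sax.saxutils module, which escapes the text and wraps it in a pair of quote characters; the [1:-1] slice then drops that pair. A minimal standalone sketch of the same trick follows, where escape_html is a hypothetical name used only for illustration and is not part of PyDocX:

```python
from xml.sax.saxutils import quoteattr


def escape_html(text):
    # quoteattr() escapes '&', '<' and '>' and wraps the result in quotes;
    # the [1:-1] slice strips that surrounding pair of quote characters.
    return quoteattr(text)[1:-1]


print(escape_html('a < b & c'))
print(escape_html('Tom & "Jerry"'))
```

Note that quoteattr also rewrites tabs and newlines as character references, so it does slightly more than the three characters named in the comment.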

However, let's say you want to add a specific style to your HTML document. To do this, you want every paragraph to carry the CSS class my_implementation. Simply extend Docx2Html and override what you need.

class My_Implementation_of_Docx2Html(Docx2Html):

    def paragraph(self, text, pre=None):
        return '<p class="my_implementation">' + text + '</p>'

Or, let's say FOO is your new favorite markup language. Simply write your own parser, overriding the abstract methods of DocxParser.

class Docx2Foo(DocxParser):

    # because line breaks are denoted by '!!!!!!!!!!!!' in the FOO markup language  :)
    def linebreak(self):
        return '!!!!!!!!!!!!'
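
The override pattern described above can be tried without a .docx file. The sketch below is only an illustration: MiniParser, Mini2Html, Mini2Foo, and render are hypothetical stand-ins, not PyDocX classes, and the base class is assumed to be declared with Python's abc machinery.

```python
from abc import ABC, abstractmethod


class MiniParser(ABC):
    """Hypothetical stand-in for DocxParser; not part of PyDocX."""

    @abstractmethod
    def linebreak(self):
        """Return the markup for a line break."""

    @abstractmethod
    def paragraph(self, text):
        """Return the markup for one paragraph."""

    def render(self, paragraphs):
        # The base class drives the conversion and calls the hooks;
        # each paragraph is a list of lines joined by a line break.
        return ''.join(
            self.paragraph(self.linebreak().join(lines))
            for lines in paragraphs
        )


class Mini2Html(MiniParser):
    def linebreak(self):
        return '<br />'

    def paragraph(self, text):
        return '<p>' + text + '</p>'


class Mini2Foo(MiniParser):
    def linebreak(self):
        return '!!!!!!!!!!!!'

    def paragraph(self, text):
        return text + '\n'


doc = [['one', 'two'], ['three']]
print(Mini2Html().render(doc))
print(Mini2Foo().render(doc))
```

Under that assumption, Python refuses to instantiate a subclass that leaves any abstract method unimplemented, so a real FOO parser would have to override every abstract method, not only linebreak.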

Custom Pre-Processor

When creating your own parser (as described above), you can also add your own custom pre-processor. To do so, set the pre_processor_class field on the custom parser.

class Docx2Foo(DocxParser):
    pre_processor_class = FooPreProcessor

The FooPreProcessor needs to subclass PydocxPreProcessor and override perform_pre_processing:

class FooPreProcessor(PydocxPreProcessor):
    def perform_pre_processing(self, root, *args, **kwargs):
        super(FooPreProcessor, self).perform_pre_processing(root, *args, **kwargs)
        self._set_foo(root)

    def _set_foo(self, root):
        pass

If you want _set_foo to be called, you must call it from perform_pre_processing, which the base parser for pydocx invokes.

Everything done during pre-processing is executed prior to parse being called for the first time.
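
The ordering described above can be sketched with stand-in classes. Everything here is hypothetical and only mirrors the shape of the real API: BasePreProcessor stands in for PydocxPreProcessor, SketchParser for the base parser, and the 'seen' and 'foo' attributes are invented for the demonstration.

```python
import xml.etree.ElementTree as ET


class BasePreProcessor:
    """Hypothetical stand-in for PydocxPreProcessor."""

    def perform_pre_processing(self, root, *args, **kwargs):
        # Pretend base step: mark every element as visited.
        for element in root.iter():
            element.set('seen', '1')


class FooPreProcessor(BasePreProcessor):
    def perform_pre_processing(self, root, *args, **kwargs):
        # Run the inherited steps first, then the custom one.
        super().perform_pre_processing(root, *args, **kwargs)
        self._set_foo(root)

    def _set_foo(self, root):
        # Custom step: tag each paragraph-like element.
        for element in root.iter('p'):
            element.set('foo', 'yes')


class SketchParser:
    """Hypothetical parser that pre-processes before parsing."""
    pre_processor_class = BasePreProcessor

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)
        # Pre-processing finishes before parse() is first called.
        self.pre_processor_class().perform_pre_processing(self.root)
        self.parsed = self.parse(self.root)

    def parse(self, root):
        return [dict(p.attrib) for p in root.iter('p')]


class Sketch2Foo(SketchParser):
    pre_processor_class = FooPreProcessor


print(Sketch2Foo('<doc><p>a</p><p>b</p></doc>').parsed)
```

Because the parser looks up pre_processor_class on the class, swapping the pre-processor is a one-line change in the subclass, and parse only ever sees the already-modified tree.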

Optional Arguments

You can pass convert_root_level_upper_roman=True to the parser, and it will convert all root-level upper-roman lists to headings instead.