HTMLElement class

This class can be used for parsing or for building a DOM manually.

DOM building

If you want to build a DOM from HTMLElement instances, you can use one of these four constructors:

HTMLElement()
HTMLElement("<tag>")
HTMLElement("<tag>", {"param": "value"})
HTMLElement("tag", {"param": "value"}, [HTMLElement("<tag1>"), ...])

Tag or parameter specification parts can be omitted:

HTMLElement("<root>", [HTMLElement("<tag1>"), ...])
HTMLElement(
   [HTMLElement("<tag1>"), ...]
)
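
The way a single constructor can accept these different argument combinations can be sketched as type-based dispatch on the positional arguments. This is only an illustration under stated assumptions, not dhtmlparser's actual code: normalize_args() is a hypothetical helper, and only the (tag, second, third) argument names are taken from the class signature.

```python
def normalize_args(tag="", second=None, third=None):
    """Hypothetical helper: sort positional arguments into (tag, params, childs)."""
    params, childs = {}, []

    if isinstance(tag, (list, tuple)):
        # Only a list of children was given, as in HTMLElement([child, ...]).
        return "", params, list(tag)

    if isinstance(second, dict):
        # Second argument is the parameter dictionary.
        params = dict(second)
        if isinstance(third, (list, tuple)):
            childs = list(third)
    elif isinstance(second, (list, tuple)):
        # Parameters were omitted, so the second argument holds the children.
        childs = list(second)

    return tag, params, childs


only_tag = normalize_args("<tag>")
tag_and_childs = normalize_args("<root>", ["child"])
only_childs = normalize_args(["child"])
```

Each call returns the same three-part tuple, so the rest of a constructor written this way would not need to care which form the caller used.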

Examples

Blank element

>>> from dhtmlparser import HTMLElement
>>> e = HTMLElement()
>>> e
<dhtmlparser.HTMLElement instance at 0x7fb2b39ca170>
>>> print e

>>>

Usually, it is better to use HTMLElement("").

Nonpair tag

>>> e = HTMLElement("<br>")
>>> e.isNonPairTag()
True
>>> e.isOpeningTag()
False
>>> print e
<br>

Notice that the closing tag wasn’t created automatically.

Pair tag

>>> e = HTMLElement("<tag>")
>>> e.isOpeningTag() # this doesn't check if tag actually is paired, just if it looks like opening tag
True
>>> e.isPairTag()    # this does check if element is actually paired
False
>>> e.endtag = HTMLElement("</tag>")
>>> e.isOpeningTag()
True
>>> e.isPairTag()
True
>>> print e
<tag></tag>

In short:

>>> e = HTMLElement("<tag>")
>>> e.endtag = HTMLElement("</tag>")

Or you can always use the string parser:

>>> import dhtmlparser as d
>>> e = d.parseString("<tag></tag>")
>>> print e
<tag></tag>

But don’t forget that elements returned from parseString() are wrapped in a blank “root” element:

>>> e = d.parseString("<tag></tag>")
>>> e.getTagName()
''
>>> e.childs[0].tagToString()
'<tag>'
>>> e.childs[0].endtag.tagToString() # referenced through the .endtag property
'</tag>'
>>> e.childs[1].tagToString() # end tag picked manually from childs - don't use this
'</tag>'

Tags with parameters

A tag (with or without <>) can take a dictionary of parameters as its second argument.

>>> e = HTMLElement("tag", {"param":"value"})  # without <>, because normal text can't have parameters
>>> print e
<tag param="value">
>>> print e.params  # parameters are accessed through the .params property
{'param': 'value'}

Tags with content

You can create content manually:

>>> e = HTMLElement("<tag>")
>>> e.childs.append(HTMLElement("content"))
>>> e.endtag = HTMLElement("</tag>")
>>> print e
<tag>content</tag>

But there is also an easier way:

>>> print HTMLElement("tag", [HTMLElement("content")])
<tag>content</tag>

or:

>>> print HTMLElement("tag", {"some": "parameter"}, [HTMLElement("content")])
<tag some="parameter">content</tag>

HTMLElement class API

HTMLElement class used in DOM representation.

dhtmlparser.htmlelement.NONPAIR_TAGS = ['br', 'hr', 'img', 'input', 'meta', 'spacer', 'frame', 'base']

List of non-pair tags. Set this to an empty list if you wish to parse XML.

class dhtmlparser.htmlelement.HTMLElement(tag='', second=None, third=None)

This class is used to represent a singly linked DOM (see makeDoubleLinked() for a doubly linked one).

childs list

List of child nodes.

params dict

SpecialDict instance holding tag parameters.

endtag obj

Reference to the ending HTMLElement or None.

openertag obj

Reference to the opening HTMLElement or None.

find(tag_name, params=None, fn=None, case_sensitive=False)

Same as findAll(), but without end tags.

You can always get them from the endtag property.

findB(tag_name, params=None, fn=None, case_sensitive=False)

Same as findAllB(), but without end tags.

You can always get them from the endtag property.

findAll(tag_name, params=None, fn=None, case_sensitive=False)

Search for elements by their parameters using a depth-first algorithm.

Parameters:
  • tag_name (str) – Name of the tag you are looking for. Set to "" if you wish to use only the fn parameter.
  • params (dict, default None) – Parameters which have to be present in the tag for it to be considered matching.
  • fn (function, default None) – Use this function to match tags. The function expects one parameter, an HTMLElement instance.
  • case_sensitive (bool, default False) – Use case sensitive search.
Returns:

list – List of HTMLElement instances matching your criteria.

findAllB(tag_name, params=None, fn=None, case_sensitive=False)

Simple search engine using a breadth-first algorithm.

Parameters:
  • tag_name (str) – Name of the tag you are looking for. Set to "" if you wish to use only the fn parameter.
  • params (dict, default None) – Parameters which have to be present in the tag for it to be considered matching.
  • fn (function, default None) – Use this function to match tags. The function expects one parameter, an HTMLElement instance.
  • case_sensitive (bool, default False) – Use case sensitive search.
Returns:

list – List of HTMLElement instances matching your criteria.
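
The practical difference between a depth-first search (findAll()) and a breadth-first search (findAllB()) is the order in which nodes are visited, and therefore the order of the results. The sketch below illustrates this on a toy tree. It does not use dhtmlparser: Node, depth_first() and breadth_first() are hypothetical names made up for this illustration, and whether the library's own traversal includes the starting element is not something this sketch claims.

```python
from collections import deque


class Node(object):
    """Hypothetical stand-in for an element: a name plus a list of children."""

    def __init__(self, name, childs=None):
        self.name = name
        self.childs = childs or []


def depth_first(root):
    """Visit a node, then each child's whole subtree before the next child."""
    order = [root.name]
    for child in root.childs:
        order.extend(depth_first(child))
    return order


def breadth_first(root):
    """Visit nodes level by level using a FIFO queue."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node.name)
        queue.extend(node.childs)
    return order


tree = Node("root", [
    Node("a", [Node("a1"), Node("a2")]),
    Node("b", [Node("b1")]),
])

dfs_order = depth_first(tree)    # ['root', 'a', 'a1', 'a2', 'b', 'b1']
bfs_order = breadth_first(tree)  # ['root', 'a', 'b', 'a1', 'a2', 'b1']
```

Both walks visit the same nodes; only the order differs, which matters when you take the first match from the result.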

wfind(tag_name, params=None, fn=None, case_sensitive=False)

This method works the same as find(), but searches only one level of childs.

This allows you to chain wfind() calls:

>>> dom = dhtmlparser.parseString('''
... <root>
...     <some>
...         <something>
...             <xe id="wanted xe" />
...         </something>
...         <something>
...             asd
...         </something>
...         <xe id="another xe" />
...     </some>
...     <some>
...         else
...         <xe id="yet another xe" />
...     </some>
... </root>
... ''')
>>> xe = dom.wfind("root").wfind("some").wfind("something").find("xe")
>>> xe
[<dhtmlparser.htmlelement.HTMLElement object at 0x8a979ac>]
>>> str(xe[0])
'<xe id="wanted xe" />'

Parameters:
  • tag_name (str) – Name of the tag you are looking for. Set to "" if you wish to use only the fn parameter.
  • params (dict, default None) – Parameters which have to be present in the tag for it to be considered matching.
  • fn (function, default None) – Use this function to match tags. The function expects one parameter, an HTMLElement instance.
  • case_sensitive (bool, default False) – Use case sensitive search.
Returns:

obj – Blank HTMLElement with all matches in childs property.

Note

The returned element also has its _container property set to True.

match(*args, **kwargs)

wfind() is a nice function, but it is still verbose to use, because you have to chain all the calls together manually, and in the end you get an HTMLElement container instance.

This function calls wfind() recursively for you, and in the end you get a list of matching elements:

xe = dom.match("root", "some", "something", "xe")

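is an alternative to the following chain (except that match() returns a list of matches rather than a container element):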

xe = dom.wfind("root").wfind("some").wfind("something").wfind("xe")

You can use all arguments used in wfind():

dom = dhtmlparser.parseString('''
    <root>
        <div id="1">
            <div id="5">
                <xe id="wanted xe" />
            </div>
            <div id="10">
                <xe id="another wanted xe" />
            </div>
            <xe id="another xe" />
        </div>
        <div id="2">
            <div id="20">
                <xe id="last wanted xe" />
            </div>
        </div>
    </root>
''')

xe = dom.match(
    "root",
    {"tag_name": "div", "params": {"id": "1"}},
    ["div", {"id": "5"}],
    "xe"
)

assert len(xe) == 1
assert xe[0].params["id"] == "wanted xe"

Parameters:
  • *args – List of wfind() parameters.
  • absolute (bool, default None) – If True, the first element will be searched for from the root of the DOM. If None, the _container attribute will be used to decide the value of this argument. If False, a find() call will be run first to find the first element, then wfind() will be used to progress through the remaining arguments.
Returns:

list – List of matching elements (empty if no matching element is found).
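
The idea of chaining one-level searches into a path match can be sketched without the library. The code below is a simplified illustration, not dhtmlparser's implementation: Node, level_find() and path_match() are hypothetical names, matching is by tag name only, and the absolute argument is not modelled.

```python
class Node(object):
    """Hypothetical stand-in for an element with a name, an id and children."""

    def __init__(self, name, childs=None, ident=None):
        self.name = name
        self.childs = childs or []
        self.ident = ident


def level_find(nodes, name):
    """Collect direct children named `name` from every node in `nodes`."""
    return [c for n in nodes for c in n.childs if c.name == name]


def path_match(root, *names):
    """Apply level_find() once per name, feeding each result into the next step."""
    current = [root]
    for name in names:
        current = level_find(current, name)
    return current


# A blank container element wrapping the document, as parseString() is
# described to return.
dom = Node("", [
    Node("root", [
        Node("some", [
            Node("something", [Node("xe", ident="wanted xe")]),
            Node("xe", ident="another xe"),
        ]),
        Node("some", [Node("xe", ident="yet another xe")]),
    ]),
])

found = path_match(dom, "root", "some", "something", "xe")
found_ids = [node.ident for node in found]  # ['wanted xe']
```

Because each step looks only at direct children, the xe elements that sit directly under some are not matched by the "something" / "xe" part of the path.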

isTag()

Returns: bool – True if the element is considered to be an HTML tag.

isEndTag()

Returns: bool – True if the element is an end tag (</endtag>).

isNonPairTag(isnonpair=None)

True if the element is listed in the non-pair tag table (br for example) or if it ends with /> (<hr /> for example).

You can also change the state from pair to non-pair if you use this as a setter.

Parameters: isnonpair (bool, default None) – If set, the internal non-pair state is changed.
Returns: bool – True if the tag is non-pair.

isPairTag()

Returns: bool – True if this is a pair tag - <body> .. </body> for example.

isOpeningTag()

Detect whether this tag is an opening tag or not.

Returns: bool – True if it is an opening tag.

isEndTagTo(opener)

Parameters: opener (obj) – HTMLElement instance.
Returns: bool – True if this element is the end tag to opener.

isComment()

Returns: bool – True if this element encapsulates an HTML comment.

tagToString()

Get the HTML representation of the tag, but only the tag itself, not the childs or the endtag.

Returns: str – HTML representation.

toString(original=False)

Returns an almost original string (use original=True if you want an exact copy).

If you want a prettified string, try prettify().

Parameters: original (bool, default False) – If True, return the element as it was parsed, so if you changed something in params, there will be no trace of those changes.
Returns: str – Complete representation of the element with childs, endtag and so on.

getTagName()

Returns: str – Tag name, or the whole element in case of normal text (not isTag()).

getContent()

Returns: str – Content of the tag (everything between the opener and the endtag).

prettify(depth=0, separator=' ', last=True, pre=False, inline=False)

Same as toString(), but returns the prettified element with its content.

Note

This method is partially broken and can sometimes produce unexpected results.

Returns: str – Prettified string.

containsParamSubset(params)

Test whether this element contains at least all of the given params (it may contain more).

Parameters: params (dict/SpecialDict) – Subset of parameters.
Returns: bool – True if all params are contained in this element.
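
The subset test described above can be sketched with plain dictionaries. This is an illustration only: contains_param_subset() is a hypothetical free function, and it compares keys and values exactly, which may differ from how SpecialDict compares them.

```python
def contains_param_subset(element_params, wanted):
    """Return True if every key/value pair in `wanted` is also in `element_params`."""
    return all(
        key in element_params and element_params[key] == value
        for key, value in wanted.items()
    )


params = {"id": "main", "class": "box"}

ok = contains_param_subset(params, {"id": "main"})                      # subset
missing = contains_param_subset(params, {"id": "main", "lang": "en"})   # extra key
differs = contains_param_subset(params, {"id": "other"})                # other value
```

An empty `wanted` dictionary is a subset of any parameters, so it always matches.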

isAlmostEqual(tag_name, params=None, fn=None, case_sensitive=False)

Compare the element with the given tag_name, params and/or lambda function fn.

The lambda function is the same as in find().

Parameters:
  • tag_name (str) – Compare just name of the element.
  • params (dict, default None) – Compare also parameters.
  • fn (function, default None) – Function which will be used for matching.
  • case_sensitive (bool, default False) – Use case sensitive matching of the tag_name.
Returns:

bool – True if two elements are almost equal.

replaceWith(el)

Replace the values in this element with the values from el.

This is useful when you don’t want to change all references to the object.

Parameters: el (obj) – HTMLElement instance.

removeChild(child, end_tag_too=True)

Remove a subelement (child) specified by reference.

Note

This can’t be used for removing subelements by value! If you want to do such a thing, try:

for e in dom.find("value"):
    dom.removeChild(e)

Parameters:
  • child (obj) – HTMLElement instance which will be removed from this element.
  • end_tag_too (bool, default True) – Remove also child endtag.
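
The distinction between removing by reference and removing by value is the distinction between object identity (`is`) and equality (`==`). The sketch below shows it on a plain list. It is not dhtmlparser code: Item and remove_by_reference() are hypothetical names used only for this illustration.

```python
class Item(object):
    """Hypothetical element whose equality compares by value (its text)."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        return isinstance(other, Item) and self.text == other.text


def remove_by_reference(childs, child):
    """Return a new list without the object `child` itself (identity, not equality)."""
    return [c for c in childs if c is not child]


first = Item("value")
second = Item("value")  # equal to `first`, but a different object
childs = [first, second]

remaining = remove_by_reference(childs, first)  # only `second` is left
```

A value-based removal would have dropped both items here, because they compare equal. Removing by reference keeps every element other than the exact object passed in.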
