Marc XML / Marc OAI parser

Module for parsing and high-level processing of MARC XML records.

About the format and how the class works: a standard MARC record consists of three parts:

  • leader - 24-character fixed-length header, you can probably ignore it
  • controlfields - MARC fields with tags below 010
  • datafields - the important information you actually want

Basic MARC XML scheme uses this structure:

<record xmlns=definition..>
    <leader>optional_binary_something</leader>
    <controlfield tag="001">data</controlfield>
    ...
    <controlfield tag="010">data</controlfield>
    <datafield tag="011" ind1=" " ind2=" ">
        <subfield code="scode">data</subfield>
        <subfield code="a">data</subfield>
        <subfield code="a">another data, but same code!</subfield>
        ...
        <subfield code="scode+">another data</subfield>
    </datafield>
    ...
    <datafield tag="999" ind1=" " ind2=" ">
    ...
    </datafield>
</record>

<leader> is optional and is parsed into MARCXMLRecord.leader as a string.

<controlfield>s are optional and are parsed into the MARCXMLRecord.controlfields dictionary. For the data from the example above, the dictionary would look like this:

MARCXMLRecord.controlfields = {
    "001": "data",
    ...
    "010": "data"
}
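
A minimal sketch of how such a dictionary could be built with the standard library, assuming a record without an XML namespace. The SAMPLE record and the parse_controlfields() helper are hypothetical and are not part of this module:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; the xmlns attribute is left out for brevity.
SAMPLE = """<record>
    <leader>-----nam-a22------aa4500</leader>
    <controlfield tag="001">cpk19990652691</controlfield>
    <controlfield tag="003">CZ-PrNK</controlfield>
</record>"""


def parse_controlfields(xml_string):
    """Return (leader, controlfields) read from a namespace-free record."""
    root = ET.fromstring(xml_string)

    # <leader> is optional, so a missing element maps to None.
    leader_el = root.find("leader")
    leader = leader_el.text if leader_el is not None else None

    # Each <controlfield> becomes one "tag": "data" pair.
    controlfields = {
        field.get("tag"): field.text or ""
        for field in root.iter("controlfield")
    }
    return leader, controlfields


leader, controlfields = parse_controlfields(SAMPLE)
print(controlfields["001"])
```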

<datafield>s are mandatory and are parsed into MARCXMLRecord.datafields, which is a slightly more complicated dictionary. The complication comes mainly from the fact that the tag parameter is not unique, so there can be several <datafield>s with the same tag!

scode (subfield code) is always a single character: an ASCII lowercase letter or a digit.

Example dict:

MARCXMLRecord.datafields = {
    "011": [{
        "ind1": " ",
        "ind2": " ",
        "scode": ["data"],
        "scode+": ["another data"]
    }],

    # real example
    "928": [{
        "ind1": "1",
        "ind2": " ",
        "a": ["Portál"]
    }],

    "910": [
        {
            "ind1": "1",
            "ind2": " ",
            "a": ["ABA001"]
        },
        {
            "ind1": "2",
            "ind2": " ",
            "a": ["BOA001"],
            "b": ["2-1235.975"]
        },
        {
            "ind1": "3",
            "ind2": " ",
            "a": ["OLA001"],
            "b": ["1-218.844"]
        }
    ]
}

As you can see in the 910 example, a single tag can hold a list with multiple records!

Warning

NOTE THAT RECORDS ARE STORED IN AN ARRAY, NO MATTER WHETHER THERE IS JUST ONE RECORD OR SEVERAL. THE SAME APPLIES TO SUBFIELDS.

The example above corresponds to this piece of real-world XML:

<datafield tag="910" ind1="1" ind2=" ">
<subfield code="a">ABA001</subfield>
</datafield>
<datafield tag="910" ind1="2" ind2=" ">
<subfield code="a">BOA001</subfield>
<subfield code="b">2-1235.975</subfield>
</datafield>
<datafield tag="910" ind1="3" ind2=" ">
<subfield code="a">OLA001</subfield>
<subfield code="b">1-218.844</subfield>
</datafield>
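
The mapping from this XML to the nested dictionary can be sketched as follows. This is only an illustration under the assumption of a namespace-free <record> wrapper; parse_datafields() is a hypothetical helper, not the parser used by MARCXMLRecord:

```python
import xml.etree.ElementTree as ET

# Shortened version of the 910 example, wrapped in a bare <record>.
SAMPLE = """<record>
<datafield tag="910" ind1="1" ind2=" ">
<subfield code="a">ABA001</subfield>
</datafield>
<datafield tag="910" ind1="2" ind2=" ">
<subfield code="a">BOA001</subfield>
<subfield code="b">2-1235.975</subfield>
</datafield>
</record>"""


def parse_datafields(xml_string):
    """Build {tag: [{"ind1": .., "ind2": .., code: [values]}]}."""
    datafields = {}
    for field in ET.fromstring(xml_string).iter("datafield"):
        entry = {
            "ind1": field.get("ind1", " "),
            "ind2": field.get("ind2", " "),
        }
        # Subfield codes may repeat, so values are always kept in lists.
        for sub in field.iter("subfield"):
            entry.setdefault(sub.get("code"), []).append(sub.text or "")

        # Tags may repeat as well, so entries are always kept in lists.
        datafields.setdefault(field.get("tag"), []).append(entry)
    return datafields


datafields = parse_datafields(SAMPLE)
print(len(datafields["910"]))
```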

OAI

To keep things from being too simple, there is also another type of MARC XML document: the OAI format.

OAI documents differ in element and attribute names, but are almost the same in structure.

leader is optional and is stored in MARCXMLRecord.controlfields["LDR"], but also in MARCXMLRecord.leader for backward compatibility.

<controlfield> is renamed to <fixfield> and its “tag” parameter to “id”.

<datafield> is renamed to <varfield>, its “tag” parameter to “id”, and ind1/ind2 to i1/i2, but they work the same way.

The <subfield> parameter “code” is renamed to “label”.

Real world example:

<oai_marc>
<fixfield id="LDR">-----nam-a22------aa4500</fixfield>
<fixfield id="FMT">BK</fixfield>
<fixfield id="001">cpk19990652691</fixfield>
<fixfield id="003">CZ-PrNK</fixfield>
<fixfield id="005">20130513104801.0</fixfield>
<fixfield id="007">tu</fixfield>
<fixfield id="008">990330m19981999xr-af--d------000-1-cze--</fixfield>
<varfield id="015" i1=" " i2=" ">
<subfield label="a">cnb000652691</subfield>
</varfield>
<varfield id="020" i1=" " i2=" ">
<subfield label="a">80-7174-091-8 (sv. 1 : váz.) :</subfield>
<subfield label="c">Kč 182,00</subfield>
</varfield>
...
</oai_marc>
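
The renaming can be captured in a small lookup table, so that one parsing routine serves both dialects. The sketch below is an assumption about how this might be done, not the module's actual code: the NAMES table and parse() are hypothetical, and the indicators are normalised to ind1/ind2 for simplicity, which may differ from what MARCXMLRecord stores internally.

```python
import xml.etree.ElementTree as ET

# Element and attribute names for both dialects (False = MARC XML, True = OAI).
NAMES = {
    False: {"control": "controlfield", "data": "datafield", "tag": "tag",
            "i1": "ind1", "i2": "ind2", "code": "code"},
    True: {"control": "fixfield", "data": "varfield", "tag": "id",
           "i1": "i1", "i2": "i2", "code": "label"},
}

OAI_SAMPLE = """<oai_marc>
<fixfield id="LDR">-----nam-a22------aa4500</fixfield>
<fixfield id="001">cpk19990652691</fixfield>
<varfield id="015" i1=" " i2=" ">
<subfield label="a">cnb000652691</subfield>
</varfield>
</oai_marc>"""


def parse(xml_string):
    """Return (oai_marc, controlfields, datafields) for either dialect."""
    root = ET.fromstring(xml_string)
    oai_marc = root.tag == "oai_marc"
    names = NAMES[oai_marc]

    controlfields = {
        field.get(names["tag"]): field.text or ""
        for field in root.iter(names["control"])
    }

    datafields = {}
    for field in root.iter(names["data"]):
        entry = {
            "ind1": field.get(names["i1"], " "),
            "ind2": field.get(names["i2"], " "),
        }
        for sub in field.iter("subfield"):
            entry.setdefault(sub.get(names["code"]), []).append(sub.text or "")
        datafields.setdefault(field.get(names["tag"]), []).append(entry)

    return oai_marc, controlfields, datafields


oai_marc, controlfields, datafields = parse(OAI_SAMPLE)
print(controlfields["LDR"])
```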

MARC documentation

The definition of MARC OAI, the simplified MARC XML schema and the full MARC XML specification of all elements (19492 lines of code) are freely accessible to anyone interested.

API

edeposit.amqp.aleph.marcxml.resorted(values)

Sort values, but put numbers after alphabetically sorted words.

This function is here to make outputs diff-compatible with Aleph.

Example

>>> sorted(["b", "1", "a"])
['1', 'a', 'b']
>>> resorted(["b", "1", "a"])
['a', 'b', '1']
Parameters: values (iterable) – any iterable object (list, tuple, ...).
Returns: list of sorted values, but with numbers after words
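
One way to get this ordering is sketched below. It assumes that all values are strings and that "number" means a string made of digits only; resorted_sketch() is a hypothetical stand-in and the real resorted() may be implemented differently:

```python
def resorted_sketch(values):
    """Sort values, then move purely numeric strings behind the words."""
    ordered = sorted(values)

    # Both partial lists keep the order produced by sorted().
    words = [value for value in ordered if not value.isdigit()]
    numbers = [value for value in ordered if value.isdigit()]

    return words + numbers


print(resorted_sketch(["b", "1", "a"]))  # ['a', 'b', '1']
```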
class edeposit.amqp.aleph.marcxml.Person

Bases: edeposit.amqp.aleph.marcxml.Person

This class represents information about persons as defined in the MARC standards.

name str
second_name str
surname str
title str
class edeposit.amqp.aleph.marcxml.Corporation

Bases: edeposit.amqp.aleph.marcxml.Corporation

Information about corporations (fields 110, 610, 710, 810).

name str

Name of the corporation.

place str

Location of the corporation/action.

date str

Date in unspecified format.

class edeposit.amqp.aleph.marcxml.MarcSubrecord(arg, ind1, ind2, other_subfields)

Bases: str

This class is used to store data returned from MARCXMLRecord.getDataRecords().

It may look like overkill, but when you are parsing values from MARC XML subrecords, you need to know the context in which the subrecord appears.

This context is provided by the i1/i2 values, but sometimes it is also useful to have access to the other subfields from this subrecord.

This class provides this access through the getI1()/getI2() and getOtherSubfiedls() getters. As a bonus, it can be used anywhere a plain string is expected, in which case only the value of the subrecord is preserved.

arg str

Value of subrecord.

ind1 char

Indicator one.

ind2 char

Indicator two.

other_subfields dict

Dictionary with other subfields from the same subrecord.

getI1()
getI2()
getOtherSubfiedls()

Return a reference to the dictionary from which the subrecord was taken.

Note

This method is used to get a backlink to the other fields (a reference to the field in MARCXMLRecord.datafields). It is not clean, but it works.
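
The idea of a string that carries its context can be sketched with a small str subclass. SubrecordSketch is a hypothetical illustration that uses plain attributes instead of the getters above; it is not the actual MarcSubrecord implementation:

```python
class SubrecordSketch(str):
    """String value that also remembers its indicators and sibling subfields."""

    def __new__(cls, value, ind1, ind2, other_subfields):
        # str is immutable, so the text has to be set in __new__.
        obj = super().__new__(cls, value)
        obj.ind1 = ind1
        obj.ind2 = ind2
        # Reference (not a copy) to the dict the value was taken from.
        obj.other_subfields = other_subfields
        return obj


field = {"ind1": "2", "ind2": " ", "a": ["BOA001"], "b": ["2-1235.975"]}
sigla = SubrecordSketch(field["a"][0], field["ind1"], field["ind2"], field)

print(sigla.upper(), sigla.ind1, sigla.other_subfields["b"])
```

Because the object is a real str, it can be compared, concatenated or passed to any function that expects a string, while the extra attributes stay available to code that needs the context.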

class edeposit.amqp.aleph.marcxml.MARCXMLRecord(xml=None)

Class for serialization/deserialization of MARC XML and MARC OAI documents.

This class parses everything inside the <root> element. It checks that the root element is present, so please give it the full XML.

The internal format is described in the module docstring. You can access the internal data directly, or use a few handy methods on two different levels of abstraction.

No abstraction at all

You can choose to access the data directly. For this use, there are a few important properties:

leader string

Leader of MARC XML document.

oai_marc bool

True/False, depending on whether the document is an OAI document or not.

controlfields dict

Controlfields stored in dict.

datafields dict of arrays of dicts of arrays of strings ^-^

Datafields stored in nested dicts/arrays.

controlfields is a simple and easy to use dictionary, where keys are field identifiers (string, 3 chars, all chars digits). The value is always a string.

datafields is a little more complicated; it is a dictionary of arrays of dictionaries, which in turn consist of arrays of strings and two special parameters.

It sounds horrible, but it is not that hard to understand:

.datafields = {
    "011": [{"ind1": " ", "ind2": " "}],  # array of 0 or more dicts
    "012": [
        {
            "a": ["a) subsection value"],
            "b": ["b) subsection value"],
            "ind1": " ",
            "ind2": " "
        },
        {
            "a": [
                "multiple values in a) subsections are possible!",
                "another value in a) subsection"
            ],
            "c": [
                "subsection identificator is always one character long"
            ],
            "ind1": " ",
            "ind2": " "
        }
    ]
}

Notice the ind1/ind2 keys, which are reserved for indicators and are used in a few places throughout the MARC standard.

The dict structure is not that hard to understand, but it is rather tedious to access, so there are also higher-level access methods.

Lowlevel abstraction

To make data access a little easier, two methods are defined for reading and two for adding data to the internal dictionaries: getControlRecord() and getDataRecords() for reading, addControlField() and addDataField() for adding.

The getters are simple to use:

getControlRecord() is just a wrapper over controlfields and works the same way as accessing .controlfields[controlfield].

.getDataRecords(datafield, subfield, throw_exceptions) returns a list of MarcSubrecord objects* with information from subsection subfield of section datafield.

If the throw_exceptions parameter is set to False, the method returns an empty list instead of raising KeyError.

*As noted, the function returns a list of MarcSubrecord objects. They are almost the same thing as a normal str (they are actually subclassed strings), but define a few important methods which can make your life a little easier:

getOtherSubfiedls() returns a dictionary with the other subsections of the datafield from which the value was taken by getDataRecords(). It works as a backlink to the object from which you got the record.
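
The lookup behind such a getter can be sketched on a plain dictionary. get_data_records() below is a hypothetical stand-alone function that returns ordinary strings; raising KeyError only when no record contains the subfield is an assumption and the real method may behave differently:

```python
DATAFIELDS = {
    "910": [
        {"ind1": "1", "ind2": " ", "a": ["ABA001"]},
        {"ind1": "2", "ind2": " ", "a": ["BOA001"], "b": ["2-1235.975"]},
        {"ind1": "3", "ind2": " ", "a": ["OLA001"], "b": ["1-218.844"]},
    ]
}


def get_data_records(datafields, datafield, subfield, throw_exceptions=True):
    """Collect the values of `subfield` from every record under `datafield`."""
    if datafield not in datafields:
        if throw_exceptions:
            raise KeyError(datafield)
        return []

    values = []
    for record in datafields[datafield]:
        # A record without the requested subfield contributes nothing.
        values.extend(record.get(subfield, []))

    if not values and throw_exceptions:
        raise KeyError(subfield)
    return values


print(get_data_records(DATAFIELDS, "910", "a"))
print(get_data_records(DATAFIELDS, "999", "a", throw_exceptions=False))
```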

Highlevel abstractions

There are also a lot of high-level getters:

addControlField(name, value)

Add a new control field value under the given name into the control field dictionary controlfields.

addDataField(name, i1, i2, subfields_dict)

Add a new datafield into datafields.

Parameters:
  • name (str) – name of datafield
  • i1 (char) – value of i1/ind1 parameter
  • i2 (char) – value of i2/ind2 parameter
  • subfields_dict (dict) – dictionary containing subfields

subfields_dict is expected to be in this format:

{
    "field_id": ["subfield data",],
    ...
    "z": ["X0456b"]
}

Warning

field_id can be only one character long!

The function also takes care of OAI MARC.
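
A minimal sketch of what adding a datafield to the nested dictionary involves. add_data_field() is a hypothetical helper working on a plain dict; its validation rules are assumptions and it ignores the OAI variant:

```python
def add_data_field(datafields, name, i1, i2, subfields_dict):
    """Append one record with indicators and subfields under tag `name`."""
    for code, values in subfields_dict.items():
        if len(code) != 1:
            raise ValueError("subfield code %r must be one character" % code)
        if not isinstance(values, list):
            raise ValueError("values of subfield %r must be a list" % code)

    record = {"ind1": i1, "ind2": i2}
    # Copy the lists so later changes to the input do not leak in.
    record.update({code: list(values) for code, values in subfields_dict.items()})

    # A tag may repeat, so records are always appended to a list.
    datafields.setdefault(name, []).append(record)


datafields = {}
add_data_field(datafields, "910", "1", " ", {"a": ["ABA001"]})
add_data_field(datafields, "910", "2", " ", {"a": ["BOA001"], "b": ["2-1235.975"]})
print(len(datafields["910"]))
```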

getControlRecord(controlfield)

Return the record from the given controlfield. Returned type: str.

getDataRecords(datafield, subfield, throw_exceptions=True)

Return the content of the given subfield in datafield.

Parameters:
  • datafield (str) – Section name (for example “001”, “100”, “700”)
  • subfield (str) – Subfield name (for example “a”, “1”, etc.)
  • throw_exceptions (bool) – If True, KeyError is raised if the method can't find the given datafield/subfield. If False, a blank array [] is returned.
Returns:

list – of MarcSubrecord. MarcSubrecord is practically the same thing as a string, but has the getI1() and getI2() methods defined. Believe me, you will need these, because MARC XML depends on the indicators from time to time (names of authors, for example).

getName()

Returns: str – Name of the book.
Raises: KeyError – when name is not specified.

getSubname(undefined='')

Parameters: undefined (optional) – returned if sub-name record is not found.
Returns: str – Sub-name of the book or undefined if name is not defined.

getPrice(undefined='')

Returns: str – Price of the book (with currency).

getPart(undefined='')

Returns: str – Which part of the book series is this record.

getPartName(undefined='')

Returns: str – Name of the part of the series.

getPublisher(undefined='')

Returns: str – name of the publisher (“Grada” for example)

getPubDate(undefined='')

Returns: str – date of publication (month and year usually)

getPubOrder(undefined='')

Returns: str – information about order in which was the book published

getFormat(undefined='')

Returns: str – dimensions of the book (‘23 cm’ for example)

getPubPlace(undefined='')

Returns: str – name of city/country where the book was published

getAuthors()

Returns: list – authors represented as Person objects

getCorporations(roles=['dst'])

Parameters: roles (list, optional) – specify which types of corporations you need. Set to ["any"] for any role, ["dst"] for distributors, etc.. See http://www.loc.gov/marc/relators/relaterm.html for details.
Returns: list – Corporation objects specified by roles parameter.

getDistributors()

Returns: list – distributors represented as Corporation object

getISBNs()

Returns: list – array with ISBN strings

getBinding()

Returns: list – array of strings with bindings (["brož."]) or blank list

getOriginals()

Returns: list – of original names

getI(num)

This method is used mainly internally, but it can be handy if you work with the raw MARC XML object instead of using the getters.

Returns: str – current name of the i1/ind1 parameter based on the oai_marc property.
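
The naming rule can be sketched as a tiny function. indicator_name() is a hypothetical stand-alone equivalent written from the description above, with the oai_marc flag passed in explicitly; the range check is an assumption:

```python
def indicator_name(num, oai_marc):
    """Return "i1"/"i2" for OAI documents and "ind1"/"ind2" otherwise."""
    if num not in (1, 2):
        raise ValueError("num must be 1 or 2")

    prefix = "i" if oai_marc else "ind"
    return prefix + str(num)


print(indicator_name(1, oai_marc=True), indicator_name(2, oai_marc=False))
```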
toXML()

Convert the object back to an XML string.

Returns: str – String which should be the same as the original input, if everything works as expected.
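
Serialization in the opposite direction can be sketched with xml.etree.ElementTree. to_xml() is a hypothetical helper for the plain MARC XML dialect only; it sorts tags and subfield codes and omits the namespace, so unlike toXML() it makes no attempt to reproduce the original input exactly:

```python
import xml.etree.ElementTree as ET


def to_xml(leader, controlfields, datafields):
    """Serialize the internal dictionaries into a <record> XML string."""
    root = ET.Element("record")

    if leader is not None:
        ET.SubElement(root, "leader").text = leader

    for tag in sorted(controlfields):
        ET.SubElement(root, "controlfield", {"tag": tag}).text = controlfields[tag]

    for tag in sorted(datafields):
        for record in datafields[tag]:
            attributes = {"tag": tag, "ind1": record["ind1"], "ind2": record["ind2"]}
            field = ET.SubElement(root, "datafield", attributes)

            # Everything except the two indicator keys is a subfield code.
            codes = sorted(key for key in record if key not in ("ind1", "ind2"))
            for code in codes:
                for value in record[code]:
                    ET.SubElement(field, "subfield", {"code": code}).text = value

    return ET.tostring(root, encoding="unicode")


xml_string = to_xml(
    "-----nam-a22------aa4500",
    {"001": "cpk19990652691"},
    {"910": [{"ind1": "1", "ind2": " ", "a": ["ABA001"]}]},
)
print(xml_string)
```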
