Cheshire3 Objects: DocumentFactory
DocumentFactories are the main means by which Documents are ingested into the system. Once the 'load' method has been called, a DocumentFactory should be able to return, on request, one or more Documents. How it does this depends on how it has been configured and on how 'load' was called. For example, it may locate all documents and cache them internally (e.g. for multiple XML documents within a single file), or it may crawl, locating and returning the documents one at a time (e.g. for many large files in a directory structure).
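The two strategies described above can be sketched in plain Python. This is a toy model, not Cheshire3 code: the `ToyFactory` class, its `cache` flag, and the rule that indexed access requires a cached load are all assumptions made for illustration.

```python
class ToyFactory:
    """Toy stand-in for a DocumentFactory (hypothetical, not the Cheshire3 class)."""

    def __init__(self):
        self._cached = None    # documents located up front (caching strategy)
        self._iterator = None  # documents located on demand (crawling strategy)

    def load(self, data, cache=False):
        if cache:
            # Locate everything now and keep it in memory.
            self._cached = list(data)
            self._iterator = iter(self._cached)
        else:
            # Keep only an iterator; each document is located when requested.
            self._cached = None
            self._iterator = iter(data)

    def get_document(self, index=None):
        if index is not None:
            # Assumption of this sketch: indexed access needs a cached load.
            if self._cached is None:
                raise ValueError("indexed access needs a cached load")
            return self._cached[index]
        return next(self._iterator)


factory = ToyFactory()
factory.load(["<a/>", "<b/>", "<c/>"], cache=True)
third = factory.get_document(2)  # indexed access on a cached load
first = factory.get_document()   # sequential access starts at the front
```

Caching trades memory for random access; crawling keeps memory use flat but only supports moving forward through the documents.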
The following implementations are pre-configured and ready to use.
They may be used out-of-the-box in configurations for Workflows, or in code by getting the object from a Server.
defaultDocumentFactory: A 'smart' DocumentFactory that will attempt to do the most sensible thing to generate Documents, based on the arguments passed to its 'load' method.
defaultAccumulatingDocumentFactory: A DocumentFactory that accumulates data across multiple 'load' calls to produce one or more Documents.
The main difference between this and the defaultDocumentFactory is that 'load' can be called repeatedly without losing Documents already discovered.
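The difference in 'load' behaviour can be illustrated with two toy classes. These are hypothetical stand-ins written for this sketch, not the Cheshire3 implementations, and they assume that a non-accumulating factory simply discards earlier results on each call.

```python
class ToyDefaultFactory:
    """Hypothetical non-accumulating factory."""

    def __init__(self):
        self.documents = []

    def load(self, data):
        # Each call replaces whatever was found by the previous call.
        self.documents = list(data)


class ToyAccumulatingFactory(ToyDefaultFactory):
    """Hypothetical accumulating factory."""

    def load(self, data):
        # Each call adds to what was found by previous calls.
        self.documents.extend(data)


plain = ToyDefaultFactory()
plain.load(["doc1"])
plain.load(["doc2"])

accumulating = ToyAccumulatingFactory()
accumulating.load(["doc1"])
accumulating.load(["doc2"])
```

After both calls, `plain.documents` holds only the second batch, while `accumulating.documents` holds both.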
The DocumentFactory will try to guess the format based on the data argument passed to it. If you already know the format, you can tell the DocumentFactory by using the format keyword argument, e.g.
documentFactory.load(session, "/home/user/data/", format="dir")
documentFactory.load(session, "/home/user/data.zip", format="zip")
A DocumentFactory will use an appropriate DocumentStream to deal with each format. Part of the 'smartness' of DocumentFactories is that DocumentStreams can be called recursively. For example, you could call 'load' on a directory containing a number of zip files, each of which is made up of a number of XML files. The DocumentFactory would use a DirectoryDocumentStream, a ZipDocumentStream, a FileDocumentStream and an XmlDocumentStream in turn to find and return XML Documents.
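The recursive idea can be sketched with the Python standard library alone. The helper below is hypothetical and does not use or reproduce the Cheshire3 DocumentStream classes; it assumes formats can be recognised from a directory check and from `.zip` and `.xml` file name suffixes.

```python
import io
import os
import zipfile


def iter_xml(location, data=None):
    """Yield (name, bytes) for every XML file reachable from location.

    Hypothetical helper: recurses through directories and zip archives,
    much as chained streams would hand data on to one another.
    """
    if data is None and os.path.isdir(location):
        # Directory: walk it and recurse into every file found.
        for root, dirs, files in os.walk(location):
            dirs.sort()
            for name in sorted(files):
                yield from iter_xml(os.path.join(root, name))
    elif location.endswith(".zip"):
        # Zip archive, either on disk or already read into memory.
        source = io.BytesIO(data) if data is not None else location
        with zipfile.ZipFile(source) as archive:
            for member in archive.namelist():
                if member.endswith("/"):
                    continue  # skip directory entries
                yield from iter_xml(member, archive.read(member))
    elif location.endswith(".xml"):
        # Plain XML file: read it from disk unless the bytes were passed in.
        if data is None:
            with open(location, "rb") as handle:
                data = handle.read()
        yield location, data
```

Calling `iter_xml` on a directory of zip files yields the contents of each XML member in turn, and files with any other suffix are silently ignored.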
At the present time, the following formats are supported by defaultDocumentFactory and defaultAccumulatingDocumentFactory.
Note Well: DocumentStreams are only intended for use by DocumentFactories, and are unlikely to behave correctly if called directly by users' scripts.
Short name | DocumentStream used | Description |
---|---|---|
xml | XmlDocumentStream | Given data, finds XML instances within it and treats each as a Document. By default the documentFactory will use the first tag that it encounters as the basis of all future Documents, but if you know the name of the tag to use, you can supply this with the tagName keyword argument, e.g. documentFactory.load(session, "/home/user/myFile.xml", format="xml", tagName="myTag") |
marc | MarcDocumentStream | Given data containing MARC records, treats each MARC record as a Document (see also docs for MarcParser and MarcRecord.) |
dir | DirectoryDocumentStream | Given a directory name, walks through all files and sub-directories within it looking for Documents. |
tar | TarDocumentStream | Given the data which makes up a tar file, extract the files from it as Documents. |
zip | ZipDocumentStream | Given the data which makes up a zip file, extract the files from it as Documents. |
cluster | ClusterDocumentStream | Given the path to a raw cluster data file (as created by a ClusterExtractionDocumentFactory), merge and create documents. |
locate | LocateDocumentStream | Given a name or pattern, locates files whose names match. |
component | ComponentDocumentStream | Given a Record, finds component Documents using a configured Selector. |
termHash | TermHashDocumentStream | Given data consisting of a hash of terms, treat each term as a Document. |
file | FileDocumentStream | Given the path to a file, open it, and read the contents. |
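The tagName behaviour described for the xml format can be illustrated with a small sketch. The `split_xml` function is hypothetical and is not how XmlDocumentStream is implemented; it uses a simple regular expression and assumes elements of the chosen name are not nested inside one another.

```python
import re


def split_xml(data, tag_name=None):
    """Return each element named tag_name found in data, as a string.

    Hypothetical helper: if tag_name is not given, the first start tag
    encountered is used (declarations and comments are skipped).
    """
    if tag_name is None:
        match = re.search(r"<([A-Za-z_][\w.:-]*)", data)
        if match is None:
            return []
        tag_name = match.group(1)
    name = re.escape(tag_name)
    # Match either a self-closing element or a start tag up to the
    # nearest matching end tag.
    pattern = re.compile(
        rf"<{name}(?:\s[^>]*)?(?:/>|>.*?</{name}\s*>)", re.DOTALL
    )
    return pattern.findall(data)


records = split_xml('<?xml version="1.0"?><rec>a</rec><rec>b</rec>')
wrapped = "<wrap><myTag>1</myTag><myTag>2</myTag></wrap>"
by_name = split_xml(wrapped, tag_name="myTag")
by_first_tag = split_xml(wrapped)
```

With no tag name, the wrapper element of `wrapped` is treated as a single document; naming the inner tag yields one document per inner element.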
Module: cheshire3.documentFactory
Classes:
DocumentFactory Methods:
Function | Parameters | Returns | Description |
---|---|---|---|
__init__ | session, config, parent | | The constructor takes the config node for the object, and its parent (usually a database). |
load | session, ?data, ?cache, ?format, ?tagName, ?codec | | Load the data provided (or use the configured default if not provided). The way the data is loaded depends on the other parameters (or their configured defaults if absent). |
get_document | session, ?index | Document | Return the index'th document in the factory if index is provided, otherwise return the next document. |
register_stream | session, format, class | | Class method to register the supplied class of DocumentStream with the document factory for the given format. This class will be used the next time 'load' is called with this format. |
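The registration mechanism described for register_stream can be sketched as a class-level lookup table. Everything below is a hypothetical toy (the `RegistryFactory`, `ToyStream` and `LineStream` classes, and the `fmt` parameter name); it is not the Cheshire3 implementation and omits the session argument.

```python
class ToyStream:
    """Hypothetical stream that treats all of its data as one document."""

    def __init__(self, data):
        self.data = data

    def find_documents(self):
        yield self.data


class LineStream(ToyStream):
    """Hypothetical stream that treats each non-blank line as a document."""

    def find_documents(self):
        for line in self.data.splitlines():
            if line.strip():
                yield line


class RegistryFactory:
    # Class-level registry mapping a format name to a stream class.
    streams = {"raw": ToyStream}

    @classmethod
    def register_stream(cls, fmt, stream_class):
        cls.streams[fmt] = stream_class

    def load(self, data, fmt="raw"):
        # The stream class is looked up at load time, so a newly
        # registered class takes effect on the next call.
        self.documents = list(self.streams[fmt](data).find_documents())


RegistryFactory.register_stream("lines", LineStream)
factory = RegistryFactory()
factory.load("a\n\nb", fmt="lines")
```

Because the lookup happens inside `load`, registering a class affects later loads without requiring the factory to be rebuilt.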
DocumentStream Methods:
Function | Parameters | Returns | Description |
---|---|---|---|
__init__ | session, stream, format, ?tagName, ?codec, ?factory | | The constructor takes the location of the data stream and the format. Optional arguments are the tagName to look for, the codec to use to read in data, and the DocumentFactory that initialized the stream. |
open_stream | streamLocation | data stream | Perform any operations needed before the data stream can be read (e.g. open files). |
fetch_document | index | data/Document | Return the index'th piece of data or Document. |
find_documents | session, cache | | Find documents within the data stream. |
Sub-Package: web
Module: cheshire3.web.documentFactory
Classes:
Sub-Package: vdb
Module: cheshire3.vdb.documentFactory
Classes: