Developer Interface

Extracting items

class ocd_backend.extractors.BaseExtractor(source_definition)

The base class that other extractors should inherit.

run()

Starts the extraction process.

This method must be implemented by the class that inherits from BaseExtractor and should return a generator that yields one item at a time. Each item should be a tuple containing the following elements (in this order):

  • the content-type of the data retrieved from the source (e.g. application/json)
  • the data in its original format, as retrieved from the source (as a string)
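
The tuple format described above could be produced by an extractor along the following lines. This is a minimal sketch, not the project's implementation: the BaseExtractor class here is a stand-in so the example runs on its own, and StaticListExtractor and the 'items' key are hypothetical names used only for illustration.

```python
class BaseExtractor(object):
    """Stand-in for ocd_backend.extractors.BaseExtractor (illustration only)."""

    def __init__(self, source_definition):
        self.source_definition = source_definition

    def run(self):
        raise NotImplementedError


class StaticListExtractor(BaseExtractor):
    """Hypothetical extractor that reads serialized items from its source definition."""

    def run(self):
        # 'items' is a made-up key used only for this example.
        for raw_item in self.source_definition.get('items', []):
            # Yield (content-type, data as a string), in that order.
            yield 'application/json', raw_item


extractor = StaticListExtractor({
    'id': 'example',
    'items': ['{"title": "A"}', '{"title": "B"}'],
})
extracted = list(extractor.run())
```
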
class ocd_backend.extractors.HttpRequestMixin

A mixin that can be used by extractors that use HTTP as a method to fetch data from a remote source. A persistent requests.Session is used to take advantage of HTTP Keep-Alive.

http_session

Returns a requests.Session object. A new session is created if it doesn’t already exist.
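
The create-on-first-access behaviour can be sketched as a property that caches the session. To keep the example free of third-party dependencies, FakeSession stands in for requests.Session; both FakeSession and HttpSessionSketch are hypothetical names and do not reflect how the mixin is actually written.

```python
class FakeSession(object):
    """Stand-in for requests.Session so the sketch runs without extra packages."""


class HttpSessionSketch(object):
    """Hypothetical mixin illustrating a lazily created, reused session."""

    _http_session = None

    @property
    def http_session(self):
        # Create the session on first access, then keep returning the same object.
        if self._http_session is None:
            self._http_session = FakeSession()
        return self._http_session


client = HttpSessionSketch()
first = client.http_session
second = client.http_session
```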

class ocd_backend.extractors.oai.OaiExtractor(*args, **kwargs)
get_all_records()

Retrieves all available OAI records.

Returns:a generator that yields a tuple for each record; each tuple consists of the content-type and the content as a string.
oai_call(params={})

Makes a call to the OAI endpoint and returns the response as a string.

Parameters:params (dict) – a dictionary sent as arguments in the query string
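
As an illustration of how a params dictionary ends up in the query string, the sketch below builds a request URL with the standard library. The helper build_oai_url and the endpoint URL are assumptions made for this example; the real method also performs the HTTP request, which is left out here.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_oai_url(endpoint, params=None):
    """Append params to an OAI endpoint as a query string (hypothetical helper)."""
    # Use None rather than a mutable default, and sort for a stable result.
    query = urlencode(sorted((params or {}).items()))
    return endpoint + '?' + query if query else endpoint


url = build_oai_url(
    'http://example.org/oai',
    {'verb': 'ListRecords', 'metadataPrefix': 'oai_dc'},
)
```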

Transforming items

class ocd_backend.transformers.BaseTransformer
run(*args, **kwargs)

Start transformation of a single item.

This method is called by the extractor and expects args to contain the content-type and the original item (as a string). Kwargs should contain the source_definition dict.

Parameters:
  • raw_item_content_type (string) – the content-type of the data retrieved from the source (e.g. application/json)
  • raw_item (string) – the data in its original format, as retrieved from the source (as a string)
  • source_definition (dict.) – The configuration of a single source in the form of a dictionary (as defined in the settings).
Returns:

the output of transform_item()

transform_item(raw_item_content_type, raw_item, item)

Transforms a single item.

The output of this method serves as input of a loader.

Parameters:
  • raw_item_content_type (string) – the content-type of the data retrieved from the source (e.g. application/json)
  • raw_item (string) – the data in its original format, as retrieved from the source (as a string)
  • item (dict) – the deserialized item
Returns:

a tuple containing the new object id, the item structured for the combined index (as a dict) and the item structured for the source-specific index.
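
The flow from run() to transform_item() might look roughly like the sketch below. It uses plain functions rather than the real class, assumes that args holds the content-type followed by the raw item, and invents the object id scheme and the 'title' field purely for illustration.

```python
import json


def deserialize(raw_item_content_type, raw_item):
    """Turn the raw string into a Python object (only JSON is handled here)."""
    if raw_item_content_type == 'application/json':
        return json.loads(raw_item)
    raise ValueError('Unsupported content-type: %s' % raw_item_content_type)


def transform_item(raw_item_content_type, raw_item, item):
    """Return (object id, combined index document, source-specific document)."""
    object_id = 'example-%s' % item['id']  # made-up id scheme
    combined_index_doc = {'title': item['title']}
    doc = dict(item)
    return object_id, combined_index_doc, doc


def run(*args, **kwargs):
    # Assumption: args holds the content-type followed by the raw item.
    raw_item_content_type, raw_item = args
    source_definition = kwargs['source_definition']  # not used in this sketch
    item = deserialize(raw_item_content_type, raw_item)
    return transform_item(raw_item_content_type, raw_item, item)


result = run(
    'application/json',
    '{"id": 7, "title": "Night Watch"}',
    source_definition={'id': 'example'},
)
```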

class ocd_backend.items.BaseItem(source_definition, data_content_type, data, item, processing_started=None)

Represents a single extracted and transformed item.

Parameters:
  • source_definition (dict) – The configuration of a single source in the form of a dictionary (as defined in the settings).
  • data_content_type (str) – The content-type of the data retrieved from the source (e.g. application/json).
  • data (unicode) – The data in its original format, as retrieved from the source.
  • item – the deserialized item retrieved from the source.
  • processing_started (datetime or None) – The datetime we started processing this item. If None, the current datetime is used.
meta_fields

Allowed key-value pairs for the item’s meta

combined_index_fields

Allowed key-value pairs for the document inserted in the ‘combined index’

get_all_text()

Retrieves all textual content of the item as a concatenated string. This text is used in the combined index to allow retrieving content that is not included in one of the combined_index_fields fields.

This method should be implemented by the class that inherits from BaseItem.

Return type:unicode.
get_collection()

Retrieves the name of the collection the item belongs to.

This method should be implemented by the class that inherits from BaseItem.

Return type:unicode.
get_combined_index_data()

Returns a dictionary containing the data that is suitable to be indexed in a combined/normalized repository, together with items from other collections. Only keys defined in combined_index_fields are allowed.

This method should be implemented by the class that inherits from BaseItem.

Return type:dict
get_combined_index_doc()

Construct the document that should be inserted into the ‘combined index’.

Returns:a dict ready to be indexed.
Return type:dict
get_index_data()

Returns a dictionary containing index-specific data that you want to index, but that does not belong in the combined index. It can contain arbitrary fields, and should be handled and validated (with care) in the class that inherits from BaseItem.

Return type:dict
get_index_doc()

Construct the document that should be inserted into the index belonging to the item’s source.

Returns:a dict ready for indexing.
Return type:dict
get_object_id()

Generates a new object ID which is used within OCD to identify the item.

By default, we use a hash containing the id of the source, the original object id of the item (get_original_object_id()) and the original urls (get_original_object_urls()).

Raises UnableToGenerateObjectId:
 when both the original object id and urls are missing.
Return type:str
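
One way to derive a stable identifier from the source id, the original object id and the original URLs is sketched below. The use of SHA-1 over a JSON serialization is an assumption made for this example; the documentation above only states that a hash of these three values is used. The function generate_object_id is a hypothetical standalone helper, and the exception class is a local stand-in.

```python
import hashlib
import json


class UnableToGenerateObjectId(Exception):
    """Stand-in for the exception of the same name."""


def generate_object_id(source_id, original_object_id, original_object_urls):
    """Hash the identifying values of an item into a hexadecimal string."""
    if not original_object_id and not original_object_urls:
        raise UnableToGenerateObjectId('Both the original id and urls are missing')
    # Serialize deterministically so equal inputs always give the same hash.
    payload = json.dumps(
        {
            'source_id': source_id,
            'id': original_object_id,
            'urls': original_object_urls,
        },
        sort_keys=True,
    )
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()


urls = {'html': 'http://example.org/objects/42'}
object_id = generate_object_id('example', '42', urls)
```
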
get_original_object_id()

Retrieves the ID used by the source to identify this item.

This method should be implemented by the class that inherits from BaseItem.

Return type:unicode.
get_original_object_urls()

Retrieves the item’s original URLs at the source location. Each key of the returned dictionary should name the document format (e.g. json, html or csv) of the URL stored as its value.

This method should be implemented by the class that inherits from BaseItem.

Return type:dict.
get_rights()

Retrieves the rights of the item as defined by the source. By ‘rights’ we mean information about copyright, licenses, instructions for reuse, etc. “Creative Commons Zero” is an example of a possible value of rights.

This method should be implemented by the class that inherits from BaseItem.

Return type:unicode.
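
To show how the methods that subclasses are expected to implement fit together, here is a hedged sketch of an item class. ExampleItem is hypothetical and does not inherit from the real BaseItem, so that the example runs on its own; the field names ('id', 'title', 'description'), the URLs and the collection name are all invented for illustration.

```python
class ExampleItem(object):
    """Hypothetical item; a real one would inherit from ocd_backend.items.BaseItem."""

    def __init__(self, item):
        self.item = item  # the deserialized item from the source

    def get_original_object_id(self):
        return self.item['id']

    def get_original_object_urls(self):
        # Keys name the document format that each URL points to.
        base = 'http://example.org/objects/%s' % self.item['id']
        return {'html': base, 'json': base + '.json'}

    def get_rights(self):
        return 'Creative Commons Zero'

    def get_collection(self):
        return 'Example Collection'

    def get_combined_index_data(self):
        # Assumes 'title' is one of the allowed combined index fields.
        return {'title': self.item['title']}

    def get_index_data(self):
        return dict(self.item)

    def get_all_text(self):
        return ' '.join([self.item['title'], self.item['description']])


example = ExampleItem({
    'id': '42',
    'title': 'Night Watch',
    'description': 'A painting.',
})
```
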
class ocd_backend.items.StrictMappingDict(mapping)

A dictionary that can only contain a select number of predefined key-value pairs.

When setting an item, the key is first checked against mapping. If the key is not in the mapping, a KeyError is raised. If the value is not of the datatype that is specified in the mapping, a TypeError is raised.

Parameters:mapping (dict) – the mapping of allowed keys and value datatypes.
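
The checks described above can be sketched with a small dict subclass. This is an illustrative reimplementation under the name StrictMappingDictSketch, not the project's code, and the example mapping is made up.

```python
class StrictMappingDictSketch(dict):
    """Illustrative dict that only accepts predefined keys and value types."""

    def __init__(self, mapping):
        super(StrictMappingDictSketch, self).__init__()
        self.mapping = mapping

    def __setitem__(self, key, value):
        # Reject keys that are not listed in the mapping.
        if key not in self.mapping:
            raise KeyError('%r is not an allowed key' % (key,))
        # Reject values that are not of the datatype given in the mapping.
        if not isinstance(value, self.mapping[key]):
            raise TypeError('Value of %r must be of type %r' % (key, self.mapping[key]))
        super(StrictMappingDictSketch, self).__setitem__(key, value)


meta = StrictMappingDictSketch({'title': str, 'year': int})
meta['title'] = 'Night Watch'
meta['year'] = 1642
```

Note that in this sketch only assignments of the form d[key] = value are checked; dict.update() on a dict subclass does not call __setitem__ and would bypass the checks.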

Enriching items

class ocd_backend.enrichers.BaseEnricher

The base class that enrichers should inherit.

enrich_item(enrichments, object_id, combined_index_doc, doc)

Enriches a single item.

This method should be implemented by the class that inherits from BaseEnricher. The method should modify and return the passed enrichments dictionary. The contents of the combined_index_doc and doc can be used to generate the enrichments.

Parameters:
  • enrichments (dict) – the dict that should be modified by the enrichment task. It is possible that this dictionary already contains enrichments from previous tasks.
  • object_id (str) – the identifier of the item that is being enriched.
  • combined_index_doc (dict) – the ‘combined index’ representation of the item.
  • doc (dict) – the collection specific index representation of the item.
Returns:

a modified enrichments dictionary.
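
A minimal enrich_item could look like the sketch below, written as a plain function for brevity. The 'title_word_count' enrichment and the 'title' field are hypothetical; the point is only that the passed dictionary is modified and returned, and that enrichments from earlier tasks are preserved.

```python
def enrich_item(enrichments, object_id, combined_index_doc, doc):
    """Add a made-up enrichment and return the same dictionary."""
    title = combined_index_doc.get('title', '')
    # Add to the existing enrichments instead of replacing them.
    enrichments['title_word_count'] = len(title.split())
    return enrichments


existing = {'previous_enrichment': True}
enriched = enrich_item(existing, 'abc123', {'title': 'The Night Watch'}, {})
```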

run(*args, **kwargs)

Start enrichment of a single item.

This method is called by the transformer or by another enricher and expects args to contain a transformed (and possibly enriched) item. Kwargs should contain the source_definition dict.

Parameters:
  • item – The item tuple as returned by a transformer or by a previously run enricher.
  • source_definition (dict.) – The configuration of a single source in the form of a dictionary (as defined in the settings).
  • enricher_settings (dict.) – The settings for the requested enricher, as provided in the source definition.
Returns:

the output of enrich_item()

class ocd_backend.enrichers.media_enricher.MediaEnricher

An enricher that is responsible for enriching external media (images, audio, video, etc.) referenced in items (in the media_urls array).

Media items are fetched from the source and then passed on to a set of registered tasks that are responsible for the analysis.

enrich_item(enrichments, object_id, combined_index_doc, doc)

Enriches the media objects referenced in a single item.

First, a media item will be retrieved from the source, then the registered and configured tasks will run. In case fetching the item fails, enrichment of the media item will be skipped. In case a specific media enrichment task fails, only that task is skipped, which means that we move on to the next task.
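
The "skip only the failing task" behaviour can be illustrated as follows. The function run_media_tasks and both example tasks are hypothetical, and the media content is represented by a plain bytes value instead of a fetched file.

```python
def run_media_tasks(media_content, tasks):
    """Run each (name, callable) task; skip tasks that raise an exception."""
    results = {}
    for name, task in tasks:
        try:
            results[name] = task(media_content)
        except Exception:
            # A failing task is skipped; the remaining tasks still run.
            continue
    return results


def failing_task(media_content):
    raise ValueError('cannot analyse this file')


def size_task(media_content):
    return len(media_content)


results = run_media_tasks(b'abc', [('broken', failing_task), ('size', size_task)])
```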

fetch_media(url, partial_fetch=False)

Retrieves a given media object from a remote (HTTP) location and returns the content-type, the content-length and a file-like object containing the media content.

The file-like object is a temporary file that, depending on its size, lives in memory or on disk. Once the file is closed, the contents are removed from storage.

Parameters:
  • url (str.) – the URL of the media asset.
  • partial_fetch (bool.) – determines if the complete file should be fetched, or if only the first 2 MB should be retrieved. This feature is used to prevent complete retrieval of large a/v material.
Returns:

a tuple with the content-type, content-length and a file-like object containing the media content. The value of content-length will be None in case a partial fetch is requested and content-length is not returned by the remote server.
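
A temporary file that lives in memory or on disk depending on its size can be obtained with tempfile.SpooledTemporaryFile from the standard library. The sketch below assumes that approach and shows how a partial fetch could stop after a byte limit. The helper buffer_media, its max_in_memory threshold and the chunk iterable are illustrative assumptions; no HTTP request is made.

```python
import tempfile


def buffer_media(chunks, partial_fetch=False, max_bytes=2 * 1024 * 1024,
                 max_in_memory=1024 * 1024):
    """Write chunks to a temporary file, optionally stopping at max_bytes."""
    # Stays in memory up to max_in_memory bytes, then rolls over to disk.
    media_file = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    written = 0
    for chunk in chunks:
        if partial_fetch and written + len(chunk) > max_bytes:
            # Keep only the part of the chunk that fits within the limit.
            chunk = chunk[:max_bytes - written]
        media_file.write(chunk)
        written += len(chunk)
        if partial_fetch and written >= max_bytes:
            break
    media_file.seek(0)
    return media_file


chunks = [b'a' * 1024] * 4
partial = buffer_media(chunks, partial_fetch=True, max_bytes=2500)
complete = buffer_media(chunks)
partial_size = len(partial.read())
complete_size = len(complete.read())
# Closing the files discards their contents.
partial.close()
complete.close()
```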

Loading items

class ocd_backend.loaders.BaseLoader

The base class that other loaders should inherit.

run(*args, **kwargs)

Start loading of a single item.

This method is called by the transformer and expects args to contain the output of the transformer as a tuple. Kwargs should contain the source_definition dict.

Parameters:
  • item
  • source_definition (dict.) – The configuration of a single source in the form of a dictionary (as defined in the settings).
Returns:

the output of transform_item()

class ocd_backend.loaders.ElasticsearchLoader

Indexes items into Elasticsearch.

Each item is added to two indexes: a ‘combined’ index that contains items from different sources, and an index that only contains items of the same source as the item.

Each URL found in media_urls is added as a document to the RESOLVER_URL_INDEX (if it doesn’t already exist).
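
The indexing behaviour described above is sketched below with nested dictionaries in place of Elasticsearch, so that it runs without a server. The index names, the load_item helper and the assumption that media_urls holds plain URL strings are all made up for this illustration.

```python
def load_item(indexes, object_id, combined_index_doc, doc, source_index,
              combined_index='combined', resolver_index='resolver'):
    """Store an item in two 'indexes' and register its media URLs once."""
    indexes.setdefault(combined_index, {})[object_id] = combined_index_doc
    indexes.setdefault(source_index, {})[object_id] = doc
    resolver = indexes.setdefault(resolver_index, {})
    for url in doc.get('media_urls', []):
        # Only add a resolver document if the URL is not already present.
        resolver.setdefault(url, {'original_url': url})
    return indexes


indexes = {}
image = 'http://example.org/a.jpg'
load_item(indexes, 'id1', {'title': 'A'}, {'title': 'A', 'media_urls': [image]},
          source_index='example_source')
load_item(indexes, 'id2', {'title': 'B'}, {'title': 'B', 'media_urls': [image]},
          source_index='example_source')
```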

Command Line Interface

The OpenCultuurData source code provides a Command Line Interface (CLI) for managing your instance. The CLI is largely self-documented (run ./manage.py [<COMMAND>] --help for further assistance).

Dumps

manage.create_dump()

Create a dump of an index. If you don’t provide an --index option, you will be prompted with a list of available index names. Dumps are stored as a gzipped txt file in settings.DUMPS_DIR/<index_name>/<timestamp>_<index-name>.gz, and a symlink <index-name>_latest.gz is created, pointing to the last created dump.

Parameters:
  • ctx – Click context, so we can issue other management commands
  • index – name of the index you want to create a dump for
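
The naming scheme for dump files can be illustrated by the helper below, which only computes the two paths and does not touch Elasticsearch or the file system. The function dump_paths is hypothetical, and the timestamp format is an assumption, since the format used by the command is not specified here.

```python
import os
from datetime import datetime


def dump_paths(dumps_dir, index_name, now=None):
    """Return (path of the new dump, path of the 'latest' symlink)."""
    now = now or datetime.now()
    timestamp = now.strftime('%Y%m%d%H%M%S')  # assumed timestamp format
    directory = os.path.join(dumps_dir, index_name)
    dump_path = os.path.join(directory, '%s_%s.gz' % (timestamp, index_name))
    latest_path = os.path.join(directory, '%s_latest.gz' % index_name)
    return dump_path, latest_path


dump_path, latest_path = dump_paths(
    'dumps', 'ocd_example', now=datetime(2015, 1, 2, 3, 4, 5))
```
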
manage.download_dumps()

Download dumps of OCD collections to your machine, for easy ingestion.

Parameters:
  • api_url – URL to API instance to fetch dumps from. Defaults to ocd_frontend.settings.API_URL, which is set to the API instance hosted by OpenCultuurData itself.
  • destination – path to local directory where dumps should be stored. Defaults to ocd_frontend.settings.LOCAL_DUMPS_DIR.
  • collections – Names of collections to fetch dumps for. Optional; you will be prompted to select collections when not provided.
  • all_collections – If this flag is set, download all available dumps. Optional; you will be prompted to select collections when not provided.
manage.list_dumps()

List available dumps of the API instance at api_url. Use this command to obtain information about dumps available at other OpenCultuurData API instances.

Parameters:api_url – URL of API location
manage.load_dump()

Restore an index from a dump file.

Parameters:
  • collection_dump – Path to a local gzipped dump to load.
  • collection_name – Name for the local index to restore the dump to. Optional; will be derived from the dump name, at your own risk. Note that the pipeline will add an “ocd_” prefix to the collection name, to ensure the proper mapping and settings are applied.

Elasticsearch

manage.es_put_template()

Put a template into Elasticsearch. A template contains settings and mappings that should be applied to multiple indices. Check es_mappings/ocd_template.json for an example.

Parameters:template_file – Path to JSON file containing the template. Defaults to es_mappings/ocd_template.json.
manage.es_put_mapping()

Put a mapping for a specified index.

Parameters:
  • index_name – name of the index to PUT a mapping for.
  • mapping_file – path to JSON file containing the mapping.
manage.create_indexes()

Create all indexes for which a mapping and settings file is available.

It is assumed that mappings in the specified directory use this naming convention: “ocd_mapping_{SOURCE_NAME}.json”. For example: “ocd_mapping_rijksmuseum.json”.
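
As an illustration of the naming convention, the sketch below extracts the source name from a mapping file name with a regular expression. The helper source_name_from_mapping is hypothetical and not part of manage.py.

```python
import re

# Matches file names such as "ocd_mapping_rijksmuseum.json".
MAPPING_RE = re.compile(r'^ocd_mapping_(?P<source_name>.+)\.json$')


def source_name_from_mapping(filename):
    """Return the source name, or None if the file name does not match."""
    match = MAPPING_RE.match(filename)
    return match.group('source_name') if match else None


source_name = source_name_from_mapping('ocd_mapping_rijksmuseum.json')
not_a_mapping = source_name_from_mapping('ocd_template.json')
```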

manage.delete_indexes()

Delete all Open Cultuur Data indices. If option --delete-template is provided, delete the index template too (index template contains default index configuration and mappings).

Parameters:delete-template – if provided, delete template too
manage.available_indices()

Shows a list of collections available at ELASTICSEARCH_HOST:ELASTICSEARCH_PORT.

Extract

manage.extract_list_sources()

Show a list of available sources (preconfigured pipelines).

Parameters:sources_config – Path to file containing pipeline definitions. Defaults to the value of settings.SOURCES_CONFIG_FILE
manage.extract_start()

Start extraction for a pipeline specified by source_id defined in --sources-config. --sources-config defaults to settings.SOURCES_CONFIG_FILE.

Parameters:
  • sources_config – Path to file containing pipeline definitions. Defaults to the value of settings.SOURCES_CONFIG_FILE
  • source_id – identifier used in --sources_config to describe pipeline

Frontend

manage.frontend_runserver()

Run development server on host:port.

Parameters:
  • host – host to run dev server on (defaults to 0.0.0.0)
  • port – defaults to 5000