Description
If we want to expand the number of supported content models in the Datahub, we need to design a consistent content model across all APIs.
Detailed description
Current situation
Version 1.0 of the Datahub is based on a simple content model:
- Data is stored as "records" in the database. A record is a container which stores a serialised representation of the data.
- Data is ingested through POST/PUT requests on the API, which only accepts XML strings. Upon ingest, the XML string is stored in a record and transformed into a JSON string via Clark Notation. As such, the transformation doesn't imply a mapping between data models.
- The REST API returns records as JSON documents, which are Clark-notated representations of the underlying XML documents. The API leverages HATEOAS for discovery of these JSON documents.
- The OAI-PMH endpoint returns the actual XML documents from the records.
- We only support one format per instance. It's not possible to ingest multiple formats (MODS, JSON-LD, EAD, MARCXML, etc.) side by side.
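To make the current model concrete, here is a minimal sketch (hypothetical code, not the Datahub's actual implementation) of how an ingested XML string ends up as a Clark-notated JSON document. Python's `xml.etree.ElementTree` already expands namespaced element names into Clark Notation (`{namespace-uri}localname`), which is what makes the resulting JSON keys so unwieldy for clients:

```python
import json
import xml.etree.ElementTree as ET

# A trivial MODS fragment standing in for an ingested record.
MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>A Painting</title></titleInfo>
</mods>"""

def to_clark(el):
    # ElementTree tags are already Clark-notated, e.g.
    # "{http://www.loc.gov/mods/v3}titleInfo".
    return {el.tag: {
        "attributes": dict(el.attrib),
        "text": (el.text or "").strip(),
        "children": [to_clark(child) for child in el],
    }}

# The JSON string a REST client would have to unpack.
print(json.dumps(to_clark(ET.fromstring(MODS)), indent=2))
```

Note how every key carries the full namespace URI; a client has to reconstruct the XML semantics from these keys before it can extract any data, which is the consumption problem described below.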
A foundational principle of the application is that it needs to stay agnostic about the content model of the data. That is, the application doesn't do transformations of the stored data. Such transformations are by definition opinionated and context-bound. Supporting such transformations would also add hard-to-manage complexity to the application.
Instead, the application is designed to act as an end-point that packs data and publishes data as an abstract interface, hiding the underlying layers in a larger network of distributed data repositories. Data processing should happen as information gets funnelled through ETL or ELT pipelines.
However, the current simple model has severe drawbacks:
- It's not possible to combine multiple content models in the application (e.g. EAD and MODS)
- Clark Notation is very hard to consume. It forces clients to do an extra conversion step: converting the JSON back to XML and then extracting data from the XML.
- Clark Notation isn't a common model. Only a few parsers and libraries support it.
- JSON and XML are serialisation formats. Whereas XML allows for explicit description of the content model (XSD schemas), JSON doesn't do that by design. Presenting data simply as application/json or application/xml isn't enough to give clients a cue about the actual content model.
- Even if we did add support for multiple formats and models, we need to hash out how we are going to approach unambiguous identification of resources. Multiple knowledge domains and contexts (e.g. library vs. collection vs. archive, plus various sub-collections, hybrids, etc.) may wield identifier schemes that overlap (e.g. record "000.123" may refer to a monograph in the library collection and a painting in a museum collection). What are our design principles governing identification of resources?
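One possible direction for the identification question, purely as a discussion starter: qualify each local identifier with a namespace per knowledge domain, OAI-identifier style. The repository host and namespace names below are placeholders, not an agreed scheme:

```python
def qualified_id(namespace: str, local_id: str) -> str:
    """Build a globally unique identifier from a domain namespace
    and a locally scoped record identifier (hypothetical scheme)."""
    return f"oai:datahub.example.org:{namespace}:{local_id}"

# The same local id no longer collides across collections:
print(qualified_id("library", "000.123"))  # oai:datahub.example.org:library:000.123
print(qualified_id("museum", "000.123"))   # oai:datahub.example.org:museum:000.123
```

Whether namespaces map to collections, institutions, or content models is exactly the governance question this issue needs to settle.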
Context
Hashing this out as a set of governance principles / architectural design / content model is core to the entire application.
Clearly defining what is and is not possible is important because (a) it determines the flexibility of the application to cater to as many business cases as possible and (b) it allows potential users to assess whether the application is a good fit for their needs.
Finally, this is important because a solid foundation is crucial for defining a clear long-term roadmap in terms of adding new functionality and maintaining existing functionality.
Possible implementation
To be discussed in this issue.