Looking Towards an API for SNAP:DRGN

During the first SNAP:DRGN workshop a breakout group was convened to discuss the potential API for the project. Rather than come up with a specific API during that session, we instead focused on creating a “wish list” of applications and functions that we wanted to support. We were then able to abstract the functions that would be needed to support the list.

The most vital functions support querying the dataset to retrieve the original id of an entity and all the information about that entity. Since the reference and provider information can be extracted from the general data about an entity, the dedicated functions for that data are not strictly necessary; testing will be needed to see whether they help reduce the load on the server, since they are much more directed and thus require less overhead. A rough sketch of how these calls might look follows the list below.

getChildIds(id) takes SNAP id, returns ids of entity/entities that the current entity has been derived from
getSnapIds(id) takes partner id, returns list of ids of entity/entities that are derived from the current entity
getLiveSnapIds(id) takes partner id, returns list of ids of non-deprecated entity/entities that are derived from the current entity
getPersonInfo(id, ?content_type) takes id, returns full details of the entity with that id in the format specified (‘rdf’ or ‘json’). If no content_type is specified then the function will return RDF
getReferences(id) takes id, returns a list of all the references associated with the given entity
getProviders(id) takes id, returns the list of data provider(s)/project(s) associated with the given entity
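
As a concrete illustration (and nothing more), the sketch below shows how a couple of these lookups might be wrapped as HTTP GET requests. The base URL, path pattern and parameter names are placeholders invented for the example; no transport or endpoint layout has been decided.

# Minimal sketch only: the base URL, paths and parameter names below are
# invented for illustration; the real endpoint layout is still undecided.
import requests

BASE = "https://snapdrgn.example.org/api"  # hypothetical base URL

def get_person_info(snap_id, content_type="rdf"):
    """Full details of an entity, as RDF (default) or JSON."""
    resp = requests.get(f"{BASE}/getPersonInfo",
                        params={"id": snap_id, "content_type": content_type})
    resp.raise_for_status()
    return resp.json() if content_type == "json" else resp.text

def get_snap_ids(partner_id):
    """SNAP ids of entities derived from a partner project's id."""
    resp = requests.get(f"{BASE}/getSnapIds", params={"id": partner_id})
    resp.raise_for_status()
    return resp.json()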


These functions will support the main internal functions which will be used to populate the main body of the website:

getPersonPage(personId) takes id, returns html page with details of the entity with that id
getProjectPage(projectId) takes dataset/project identifier, returns html page with the information about that dataset/project

The remaining API functions that were identified have been divided by priority. Higher priority has been given to functions that seem likely to be most useful to users, that will be used by other functions, or that will support immediate needs such as disambiguation and basic filtering. Priority has also been given to functions that rely on information currently supported within the ingested data or that support the promotion of our partner projects.

High Priority:

getBaseIds(id) takes a SNAP id, returns the earliest id(s) of the entity/entities that the current entity has been derived from, i.e. the external URIs that the entity was ultimately derived from, no matter how long the chain of derivation (see the sketch after this list)
getDatasetInfo(dataset, content_type) takes dataset identifier, returns full details of the dataset as per data-info in the format specified (‘rdf’ or ‘json’)
getDates(id) takes id, returns the list of associated date values for the given entity
getNames(id) takes id, returns the list of name values for the given entity
getNames(id, lang) takes id, returns the list of name values that match the specified language code for the given entity
getPlaces(id) takes id, returns the list of associated place value(s) for the given entity
getEntitiesByRef(reference) takes reference, returns a list of all the entities that point to the given reference
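
To make the difference between getChildIds and getBaseIds explicit: getChildIds goes back one step in the derivation history, while getBaseIds keeps following the chain until it reaches ids with no earlier source. A rough sketch of that chain-walking, using a hypothetical get_child_ids helper rather than any actual implementation, is:

# Illustrative only: get_child_ids is assumed to return the list of ids the
# given entity was derived from (empty for an original, underived id).
def get_base_ids(snap_id, get_child_ids):
    """Follow derivation links back to the ids the entity ultimately came from."""
    base_ids, frontier, seen = [], [snap_id], set()
    while frontier:
        current = frontier.pop()
        if current in seen:        # guard against cycles in the derivation data
            continue
        seen.add(current)
        parents = get_child_ids(current)
        if parents:
            frontier.extend(parents)
        else:                      # no earlier source: this is a base id
            base_ids.append(current)
    return base_ids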


Medium Priority:

getEntityNetwork(id) takes id, returns the ids of any entities that are derived from the given entity, that the given entity is derived from, or that are asserted to be the same entity
getRelatedEntities(id) takes id, returns list of ids of related entities and how they are related
getEntitiesByName(name) takes name value, returns list of entities with an associated name value that exactly matches the given value.
getEntitiesBySource(source) takes source value, returns list of entities referenced within that source document


Low Priority:

getRelation(id, relationship) takes id, returns list of entities that have the given relationship to the given entity
getRelationships(id1, id2) returns the direct relationship (if there is one) from id1 to id2
getEntitiesByPublisher(publisher) takes publisher identifier, returns list of entities associated with the given provider
getEntitiesByDate(date) takes date value, returns list of entities with an associated date value that matches or contains the given value
getEntitiesByPlace(place) takes place value, returns list of entities with an associated place value that matches or contains the given value
getCountByPublisher(publisher) takes publisher identifier, returns count of entities from that publisher
getCountByName(name) takes name, returns count of entities with that exact name
getCountByPlace(place) takes place, returns count of entities associated with that place
getCountByRef(reference) takes reference, returns count of entities associated with the reference
getCountBySource(source) takes source, returns count of entities associated with that source document
getTemporalRange(publisher) takes publisher identifier, returns temporal range of dataset as given in dataset description
getGeoRange(publisher) takes publisher identifier, returns geographical range of dataset as given in dataset description


Future work, probably beyond the scope of the immediate pilot project, will expand the available functions, both in the public API and in the internal systems. In addition to the functions listed below, the existing functions that return lists of entities will gain optional parameters allowing the returned results to be filtered by date range and (potentially) place.

Aspirational functions:

getRelationshipPath(id1, id2) returns the sequence of relationship links (if there is one) that joins id1 to id2
getRelatedNames(name) takes name, returns list of names that are specified as variants of the given name
getEntitiesByDate(tpq, taq) takes start and end date values, returns list of entities with an associated date value that matches or falls within the given period
getEntitiesByApproxName(name) takes name value, returns list of entities with an associated name value that matches or closely matches the given value
getEntitiesByAltName(name) takes name value, returns list of entities with an associated name value that is an alternate of the given value
getAssertionAuthority(id) takes id, returns the given authority (may be original publisher) that identified the entity as an entity
isDeprecated(id) takes id, returns true if the entity has been marked as deprecated
getCurrent(replaced-id) takes id of replaced entity and returns the list of ids of entities that have replaced it
getAgreementMetric(id) returns a value for the level of agreement on the existence of a given entity, as determined by conflicting assertions
getCertainty(id, authority) returns the certainty value for a given authority for a given assertion (id)


Beyond the basic querying functions, the highest priority will be to add functions to support making assertions about existing entities. These assertions will result in the creation of new entities, and as such each assertion will need to carry an identifier for the person making the claim. Optional values will allow the asserter to specify their level of certainty and to give a reference to a resource which backs up their statement. The primary assertions (that the entities specified in the first argument do or do not represent co-references) will be implemented initially, but as the assertion system is finalised this list will be expanded and refined. A sketch of what such a call might look like follows the list below.

coRefAssertion(entity-list, authority, ?certainty, ?reason) make a co-reference assertion
notCoRefAssertion(entity-list, authority, ?certainty, ?reason) make an assertion that two entities are not co-references
deprecateEntity(old-id, replacement-id) annotates the entity identified by the old-id as deprecated and adds a pointer to the replacement entity as defined by replacement-id.
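
As a sketch of what this might look like in practice, the call below posts a co-reference assertion with the arguments described above. The endpoint, payload shape and the form of the certainty value are assumptions made up for the example; only the arguments themselves (entity list, authority, optional certainty and reason) come from the plan.

# Hypothetical sketch: endpoint URL, payload keys and return value are all
# invented for illustration; the assertion system itself is not yet finalised.
import requests

def co_ref_assertion(entity_ids, authority, certainty=None, reason=None):
    """Assert that the listed entities co-refer; returns the server's response."""
    payload = {"entities": entity_ids, "authority": authority}
    if certainty is not None:
        payload["certainty"] = certainty   # asserter's confidence in the claim
    if reason is not None:
        payload["reason"] = reason         # URI of a resource backing the claim
    resp = requests.post("https://snapdrgn.example.org/api/coRefAssertion",
                         json=payload)
    resp.raise_for_status()
    return resp.json()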


Internally, restricted functions intended to support the addition of new datasets and allow more automation of the data ingestion system will be added. This will allow us to streamline the ingestion workflow and more easily add data from partner projects:

addDataset(data-info-uri, data-uri) ingests RDF from the source indicated by data-uri, and the RDF description of the dataset from the file indicated by data-info-uri
addDatasetDescription(data-info-uri) adds the description of the dataset to the triple store
addDatasets(data-info-uri, data-uri-list) ingests RDF from each of the sources given in data-uri-list, and the RDF description of the datasets from the file indicated by data-info-uri
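
One way to read the relationship between the single and bulk calls is that addDatasets is simply addDataset applied to each source, with the dataset description loaded once. The sketch below expresses that reading with hypothetical helper names; it is not a description of the actual ingestion code.

# Hypothetical helpers: add_dataset_description loads the dataset description
# (data-info) into the triple store, ingest_rdf_source ingests one RDF source.
def add_datasets(data_info_uri, data_uri_list,
                 add_dataset_description, ingest_rdf_source):
    """Bulk ingestion expressed as repeated single-source ingestion."""
    add_dataset_description(data_info_uri)          # describe the dataset(s) once
    for data_uri in data_uri_list:
        ingest_rdf_source(data_info_uri, data_uri)  # ingest each RDF source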


Finally, purely internal functions to query externally held linked data will allow us to connect with datasets that have not been directly ingested. These datasets may be outside the scope of SNAP, for example place rather than person data, or may represent prosopography datasets which have been published and hosted in a compatible format by other projects, with just the minimum information (SNAP identifier and reference) held locally:

lookupEntity(uri) query external datasource for RDF relating to entity as specified by URI
lookupSource(uri) query external datasource for information on/text of the source as given URI
lookupPlace(uri) query external datasource, e.g. Pelagios, for RDF information on place as specified by URI
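
These lookups could plausibly use ordinary linked-data content negotiation, i.e. requesting an RDF serialisation directly from the remote URI. The sketch below assumes that, though whether a given source (a Pelagios/Pleiades place URI, for instance) actually serves RDF this way, and in which serialisation, would need to be confirmed per source.

# Sketch only: assumes the remote server honours an Accept header asking for
# RDF, which is common for linked-data sources but not guaranteed.
import requests

def lookup_entity(uri):
    """Fetch RDF describing the entity identified by an external URI."""
    resp = requests.get(uri, headers={"Accept": "text/turtle, application/rdf+xml"})
    resp.raise_for_status()
    return resp.text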


The “wish list” identified in the workshop is detailed below. The list is not presented in any particular order and was conceived as purely aspirational.

  • Lightweight (popup) widgets for embedding in external websites
  • Data to support visualisations
  • Find mappings (that is, given one identifier, find all coreferencing identifiers)
  • Correlate identifier to provider
  • Filter identifiers by a variety of criteria
    • provider
    • date (created / modified)
    • subjectOf assertions (depends on precise character of assertions)
    • objectOf assertions (depends on precise character of assertions)
  • Read information about entity
    • Retrieve information from providers
    • Retrieve information/text from source (where available online)
  • Search on name – get result or disambiguation page
    • Language independent as much as possible
    • Disambiguation
      • The Fuzzy Person (Entity and closely related entities)
      • The Fuzzy Name (name variations)
  • Autosuggest
    • match hinting
    • name variation (soundex, name variation…)
  • Assertion creation
    • must allow external systems to publish assertions
    • public identification needed to link assertion to who did it
  • Dataset ingestion
    • single dataset
    • bulk datasets
  • Trust/Authority analysis on assertions
    • certainty assertions
    • agreement assertions
  • Prosopography/Project Summaries
    • Size
    • Temporal span
    • Geographic span
    • EAC Suggested Archive Descriptions:
        • Reference Code (Required)
        • Name and Location of Repository (Required)
        • Title (Required)
        • Date (Required)
        • Extent (Required)
        • Name of Creator(s) (Required, If Known)
        • Administrative/Biographical History (Optimum)
        • Conditions Governing Access (Required)
        • Physical Access (Added Value)
        • Technical Access (Added Value)
        • Conditions Governing Reproduction and Use (Added Value)
        • Languages and Scripts of the Material (Required)
        • Custodial History (Added Value)
        • Immediate Source of Acquisition (Added Value)
        • Appraisal, Destruction, and Scheduling Information (Added Value)
        • Accruals
        • Related Materials Elements
        • Existence and Location of Originals (Added Value)
        • Existence and Location of Copies (Added Value)
        • Related Archival Materials (Added Value)
        • Publication Note (Added Value)
        • Notes Element
        • Description Control Element
  • List by Source/Reference
  • List by Project
  • Deprecation
  • Language translation
    • Latin transliteration if not provided
    • Greek from Betacode?
    • Other languages?
  • Re-ingestion
    • Partner projects mark entities as deprecated/changed
  • Output
    • Web query – web response
    • RDF dump
    • SPARQL Endpoint
    • Other formats by demand later
