Now that the SNAP project has started ingest finalized data from the initial core datasets, it is time to think about how to bring in material from the other partners. For some, this will be easy, as they already know to make available their data in RDF form on the open web and simply need to follow the guidelines in the Cookbook. For others quite a lot of work will be involved getting SNAP ready. This post describes some of the stages you may go through, and some of the problems that you may meet.
I have divided the work into six steps:
- Decide whether you have a set of names, a set of attestations, or a prosopography
- Identify your records
- Establish the identities online
- Data wrangling
- Transform the data
- Make the RDF available
The first step is the most critical – what does your data actually represent? If your starting point is a text, and you have extracted the personal names, you do not have prosopographical records yet, but rather a set of attestations of names in a text. This is in itself a good thing to do, but to get to the prosopography you have to decide how many actual individuals are referred in the text – 10 occurrences of the name Marcus Aurelius may refer to between 1 to 10 persons (or one cat!).
If what you have is a set of attestations, what you probably want to is contribute a slightly different form of data to the SNAP network, namely links from people to sources. This will mean deciding which person in SNAP your name (Aelius Florus natione Pann., for example) refers to, and then generate RDF which links the URL of the source text to the SNAP person.
The second task, data wrangling, can be quite laborious. It involves turning the list of names into a set of person records, and creating a single record for each person which (for example) lists all the different names they are attested under, and what the source of those attestations is. At this point you will also be assembling other information from your sources (place of birth, sex, profession etc). You will under some circumstances be comparing what you have with established authority lists of names or persons.
Now that we have a set of people, the third task, identifying your records. It seems obvious, but if all you are doing is analyzing the information in a spreadsheet, you may never have needed to.
The fourth minimum piece of work is to establish your identifiers online, ie available as URIs. If you have a person called Soranus, you have allocated number 1001 to him, and you are able to use the domain http://my.people.net/, you might decide that his public identifier is http://my.people.net/person/1001. At the least, make it so that this displays an HTML page about the person for anyone visiting that URL. The very simplest way to do this is to make an HTML file for each person, and arrange the web server so that the request for /person/1001 returns that web page (look at facilities for URL rewriting). This assumes, of course, that you have a web server you can use to put up files, with some reasonable prospect of that remaining in place for the coming decades or longer. Larger institutions may have a repository which can do this job for you, and even assign Digital Object Identifiers (DOIs) to your data.
If you can go further, and map your permanent urls to queries against a database (where, for example, http://my.people.net/person/1001 is turned behind the scenes into http://my.people.net/query.php?id=1001, and retrieves the data from a relational database), you will have a more powerful (but harder to sustain) resource. At this point you can consider having your web server do content negotiation, and return different formats of data in response to Accept headers (technical details at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html) so that humans read HTML and computers read RDF.
The fifth stage is transforming your data to the RDF format which SNAP wants, as explained in the cookbook. This may involve an XSLT transform of XML data (RDF can be represented in XML), or a Python script reading a database, or an output template in your publication system. Incidentally, if you’re wondering how to get data out of an Excel spreadsheet, the OxGarage can turn the sheet into XML for you.
If you’ve got this far, and created some RDF suitable for submitting to SNAP, that’s great. You may also want to move onto the sixth stage, making your RDF available on the web, that’s even better, so that SNAP can harvest it at intervals and keep up to date. This stage means looking at your own RDF triple store. Setting this sort of software up yourself isn’t so easy, but you may wish to read up on open source projects like OntoWiki, Sesame, and Fuseki. Maybe one of us will blog more about this in future.