Using CIDOC CRM for CLAROS
The Future of the Past: Using CIDOC CRM for CLAROS (Classical Arts Research Online Services)
David Shotton, University of Oxford (firstname.lastname@example.org)
The CLAROS data web (http://www.clarosnet.org/about/default.htm) integrates data about classical art objects held in museums worldwide. The data are curated by the following academic research centres: the Beazley Archive, Oxford; the Lexicon of Greek Personal Names, Oxford; the Lexicon Iconographicum Mythologiae Classicae (LIMC), Paris; and the Research Archive for Ancient Sculpture, Cologne. These distributed resources, which contain heterogeneous data, use incompatible database systems and employ different data models, have been integrated into a single data web, following the data web vision that I first proposed at the CIDOC CRM workshop at Imperial College in March 2006 (http://cidoc.ics.forth.gr/london_workshop.htm).
Our first exemplar data web, OpenFlyData (http://openflydata.org), integrates distributed heterogeneous gene expression data about the fruit fly Drosophila melanogaster. Building it involved three steps: establishing a SPARQL endpoint for each resource; solving the co-reference problem caused by gene name synonyms by means of a FlyBase look-up service; and then, at query time, harvesting from each endpoint the RDF metadata relevant to a particular SPARQL query and integrating this information into a single Web browser window for use by the research scientist (see live applications, or poster at http://imageweb.zoo.ox.ac.uk/pub/2009/publications/Shotton_16-09-09_OpenFlyData_poster.ppt).
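The three steps can be illustrated with a minimal Python sketch. It is not the OpenFlyData implementation: the endpoints are replaced by in-memory dictionaries, the look-up service by a small table, and all names (SYNONYMS, ENDPOINTS, "gene-A", "alias-1", "source-1" and so on) are hypothetical placeholders chosen only to show the flow from synonym resolution to merged results.

```python
# Hypothetical synonym table standing in for a look-up service that maps
# any known name of an entity to one canonical name.
SYNONYMS = {"alias-1": "gene-A", "alias-2": "gene-A", "gene-A": "gene-A"}

# Hypothetical stand-ins for per-resource SPARQL endpoints. Each maps a
# canonical name to the records that resource holds about it.
ENDPOINTS = {
    "source-1": {"gene-A": [{"kind": "image", "id": "img-7"}]},
    "source-2": {
        "gene-A": [{"kind": "expression", "id": "expr-3"}],
        "gene-B": [{"kind": "expression", "id": "expr-9"}],
    },
}


def canonical_name(name):
    """Resolve a synonym to its canonical name (co-reference step)."""
    return SYNONYMS.get(name, name)


def harvest(name):
    """Query every endpoint for one entity and merge the results."""
    key = canonical_name(name)
    merged = []
    for source, data in ENDPOINTS.items():
        for record in data.get(key, []):
            # Keep track of which resource each record came from.
            merged.append({"source": source, **record})
    return merged


results = harvest("alias-1")
```

Whichever synonym the user supplies, the same canonical key is sent to every endpoint, so the merged view does not depend on which name a given resource happens to prefer.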
We have now directly repurposed these technologies to create the CLAROS data web, having chosen CIDOC CRM to provide the data model to which the individual database schemas of the various classical arts data providers have been harmonized. This has worked well, and the selected CLAROS metadata for some two million records and images held by our CLAROS Partners are now aligned to this international standard, and held as ~10 million RDF triples in a Jena TDB triple store.
To accommodate the particulars of our partners' datasets, we have extended CIDOC CRM with some additional RDF vocabulary, particularly for time metadata relating to imprecise periods and eras. For example, the properties claros:not_before and claros:not_after, applied to a crm:E61.Time_Primitive object, allow us to capture partial quantitative information that is not expressed by a crm:has_PrimitiveTime property. We have made these extensions minimally and in a principled manner, so as not to obscure the information that can be encoded in CIDOC CRM. That is to say, everything that can be expressed in CIDOC CRM is expressed in such a way that an application that knows about CIDOC CRM, but does not understand the added vocabulary, can still access it. In particular, we have not replaced a general property with an extension property that captures the intended meaning more precisely, as that would cause the more general relationship information to be lost to a basic CIDOC CRM processing application. In general terms, this approach follows that recommended in our 2005 paper Ontologies for Sharing, Ontologies for Use: the standard CIDOC CRM provides the common 'ontology for sharing', while CIDOC CRM plus our CLAROS extensions forms the local 'ontology for use'.
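The effect of extending by addition rather than by replacement can be shown with a small Python sketch. It is an illustration under stated assumptions, not the CLAROS data: triples are plain tuples with abbreviated prefixes, and only crm:E61.Time_Primitive, claros:not_before and claros:not_after are taken from the text above. The resource names (ex:vase1, ex:time1), the linking property crm:hypothetical_has_date and the year values are invented for the example.

```python
# Namespace prefix of the extension vocabulary (abbreviated, not a full URI).
CLAROS = "claros:"

# Hypothetical triples. "crm:hypothetical_has_date" is a placeholder for
# whatever standard CIDOC CRM property links the object to its date.
TRIPLES = [
    ("ex:vase1", "crm:hypothetical_has_date", "ex:time1"),
    ("ex:time1", "rdf:type", "crm:E61.Time_Primitive"),
    ("ex:time1", "claros:not_before", -500),  # invented value
    ("ex:time1", "claros:not_after", -450),   # invented value
]


def crm_only_view(triples):
    """Triples kept by an application that ignores the extension vocabulary."""
    return [t for t in triples if not t[1].startswith(CLAROS)]


def date_bounds(triples, node):
    """Read the extension bounds for a node; None where a bound is absent."""
    values = {p: o for s, p, o in triples if s == node}
    return values.get("claros:not_before"), values.get("claros:not_after")


basic = crm_only_view(TRIPLES)
bounds = date_bounds(TRIPLES, "ex:time1")
```

Because the extension properties sit alongside the standard ones, dropping them leaves the link from the object to its time primitive, and the primitive's type, intact. An application that does understand the extension gains the numeric bounds in addition.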
The CLAROS data web is a work in progress, which I will demonstrate at the workshop. Further work is in hand to improve the performance of the back-end RDF triple store, which is at present slow for certain types of SPARQL query that require extensive sorting or that return large result sets. Enhanced performance, anticipated to involve multiple LARQ indexes, will be achieved prior to the planned public launch of CLAROS in 2010.