Week 2 began with standardizing a term list for researchers to use when creating metadata and for users to query the database. On its face, this seems a simple task; in reality, it is a great example of the complexities presented by “packaging up” the humanities to generate data for use by digital tools. A significant aspect of this process coincided with a linked open data presentation I gave at our weekly meeting. Linked open data requires packing humanities data into a machine-readable format, achievable only by converting information into standardized data. Database structuring and querying ultimately reduce complex information to binary distinctions: a record either matches a query or it does not. Creating digital representations of the complex lives and relationships of people in the past first requires untangling these complexities and imposing the logic and rationale needed for computation.
Using methods implemented by digital archives to organize collections as data, PRINT datasets will be standardized according to established authority conventions and controlled vocabularies, or ontologies. Implementing these authority structures not only enables querying the database, but also produces datasets that connect the project and partner institutions with larger digital scholarship networks. We are imposing authority ontologies and controlled vocabularies on data pulled from the “Keywords” field, which contains everything from sender and receiver names and places to people and locations mentioned within the body of a letter and themes identified in its contents. Turning these keywords into database relationships, and eventually a visualization, requires hand-coding the different elements to tell the computer how to read the data during a query.
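As a minimal sketch of what that hand-coding looks like in practice (the variant spellings and canonical terms below are invented for illustration, not drawn from the actual Pemberton Papers keywords), a lookup table maps each raw keyword to its controlled vocabulary term:

```python
# Minimal sketch: mapping raw "Keywords" entries to controlled vocabulary
# terms. All variants and canonical labels here are hypothetical examples.

CONTROLLED_VOCABULARY = {
    # variant (as found in the Keywords field) -> canonical term
    "Phila.": "Philadelphia",
    "Philada": "Philadelphia",
    "Philadelphia": "Philadelphia",
    "Quakers": "Society of Friends",
    "Friends": "Society of Friends",
}

def standardize(raw_keyword: str) -> str:
    """Return the canonical term for a raw keyword, or flag it for review."""
    term = CONTROLLED_VOCABULARY.get(raw_keyword.strip())
    if term is None:
        return f"UNRESOLVED: {raw_keyword!r}"  # needs a human decision
    return term

print(standardize("Philada"))  # -> Philadelphia
print(standardize("Friends"))  # -> Society of Friends
```

In practice, a table like this grows as unresolved keywords are reviewed by hand and folded into the vocabulary.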
At the moment, the elements we are standardizing fall into several categories: document types (correspondence, manuscripts, ledgers, etc.), religions and organizations, and people and place names. Standardized controlled vocabularies are crucial for capturing the variations that arise when incorporating these elements into datasets. With plans for our documents to span several languages, and with historic documents containing a variety of spelling conventions, controlled vocabularies need to be supplemented by URIs, or Uniform Resource Identifiers. Assigning alphanumeric codes to the aspects of a letter’s contents that we wish to describe ensures that all instances are captured, despite variations in language and spelling.
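To illustrate (the identifier scheme and variant forms below are invented for the example, not the project’s actual URIs), a single URI acts as the stable key to which every spelling and language variant resolves:

```python
# Hypothetical illustration: one URI anchors every variant form of a name,
# so a query on any variant resolves to the same entity.

ENTITY_URIS = {
    "urn:print:person:0042": {  # invented identifier scheme
        "canonical": "James Pemberton",
        "variants": ["Jas. Pemberton", "James Pembertone", "J. Pemberton"],
    },
}

def resolve(name: str) -> str | None:
    """Find the URI whose canonical name or variants match the input."""
    for uri, entity in ENTITY_URIS.items():
        if name == entity["canonical"] or name in entity["variants"]:
            return uri
    return None

print(resolve("Jas. Pemberton"))  # -> urn:print:person:0042
```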
To standardize our data, I am currently working through a spreadsheet containing an export of the Keywords from the Pemberton Papers EndNote database. This involves identifying individuals by connecting variations in names, spelling, and input structure, and often determining whether misspellings refer to different entities or the same one. While we had taken steps to assign URIs to people, we had not previously considered assigning them to place names. After working through the spreadsheet, I realized how much variation exists within place names, specifically British ones. Many cities share names, requiring disambiguation by county name. However, over the years and following a series of legislative acts, the counties in which British cities are located have changed. For the project, we want to capture the world in which our correspondents were living. After a bit of research, I found the Gazetteer of British Place Names and the Historic Counties Standard, maintained by the Historic Counties Trust (HCT), accessible as both a searchable map and a CSV dataset. The Gazetteer offers a URI that maintains the properties of the place being described and captures this information regardless of the historic or modern county name associated with the city. Both the Gazetteer and the HCT are open with their data, making it accessible twofold: not only is it published in a user-friendly way, but it is also easily downloadable in a variety of formats for incorporation into a user’s own project.
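As a rough sketch of how a downloaded gazetteer export might be used (the file name and column headers below are assumptions for illustration; the actual Gazetteer CSV’s columns may differ):

```python
# Sketch: looking up place-name URIs from a gazetteer CSV export.
# The file name and column headers are assumed for illustration.
import csv

def load_gazetteer(path: str) -> dict:
    """Index gazetteer rows by (place name, historic county) pairs."""
    index = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            index[(row["place_name"], row["historic_county"])] = row
    return index

gazetteer = load_gazetteer("gazetteer.csv")  # hypothetical local export
entry = gazetteer.get(("Newport", "Monmouthshire"))
if entry:
    print(entry["uri"])  # stable identifier, independent of modern county names
```

Indexing by both place name and historic county handles the many-cities-one-name problem, while the URI column carries a stable identifier forward into the project’s own data.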
To truly exist as linked open data, controlled vocabularies, authority ontologies, and digital schemas must be publicly documented and accessible by all. This aligns with PRINT’s commitment to do the same with documents received from partner institutions. As such, we plan to implement schemas and authority files such as VIAF (the Virtual International Authority File), Wikidata, LCNAF (the Library of Congress Name Authority File), and now the Gazetteer to identify people and place names and to describe their relationships to one another. This level of standardization not only makes our datasets more effective for internal use, but also ensures that they connect more readily to digital scholarship networks upon export.
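To show what that linking might look like (the numeric identifiers below are placeholders, not this person’s real authority IDs; the URI patterns follow each service’s published conventions, and “sameAs” is the common linked data idiom for asserting that two identifiers describe the same entity):

```python
# Sketch of a person record linked out to multiple authority files.
# The numeric identifiers are placeholders, not real IDs.
person = {
    "label": "James Pemberton",
    "sameAs": [
        "https://viaf.org/viaf/0000000",                   # VIAF (placeholder ID)
        "https://www.wikidata.org/entity/Q0000000",        # Wikidata (placeholder ID)
        "https://id.loc.gov/authorities/names/n00000000",  # LCNAF (placeholder ID)
    ],
}
print(person["sameAs"])
```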