With encoding underway among the project teams, I gave another presentation to ensure that everyone understood and was aligned with the protocols and standards. I addressed encoders’ questions while pointing to the relevant section of the protocol documents, to make those documents easier to navigate in the future. Observing how users interact with the project materials also informs future refinements to improve usability. These presentations not only address team members’ concerns but also highlight potential incompatibilities or issues with the database structure that were not previously evident. Questions about the nature of the encoding application often prompt discussions between the project PI and the database team about the functions that certain data encoding protocols need to accommodate.
With a new round of issues addressed, I returned to the Standards List from earlier in the semester. The Standards List will provide a data set to test both the database structure and the incorporation of standardized vocabularies for different project components. This week, I focused on the Place Names aspect of this data set, ensuring each term adhered to the established protocols. Variations in data input across rounds of metadata encoding produced similar terms throughout the Place Name data set. To flag potential matches among these variations, so that multiple instances all refer to the same entity, I cross-referenced them by noting line numbers in a dedicated column. Once every term is cross-referenced, this standardization process will store the different iterations of a term as variant spellings of a standardized entity name.
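The cross-referencing step above could be sketched in Python roughly as follows. This is only an illustration, not the project's workflow, which was done by hand in Excel. The function names `normalize` and `cross_reference`, the 0.85 similarity threshold, and the sample terms are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(term):
    """Lowercase, drop punctuation, and collapse whitespace (for comparison only)."""
    cleaned = "".join(ch for ch in term.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def cross_reference(terms, threshold=0.85):
    """Map each 1-based row number to the rows holding likely variant spellings.

    The threshold is an assumed starting point; a real data set would need
    it tuned, and every suggested match checked by a person.
    """
    keys = [normalize(t) for t in terms]
    matches = {row: [] for row in range(1, len(terms) + 1)}
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if SequenceMatcher(None, keys[i], keys[j]).ratio() >= threshold:
                matches[i + 1].append(j + 1)
                matches[j + 1].append(i + 1)
    return matches


# Hypothetical sample terms, not taken from the project data.
sample_terms = ["Charleston", "Charles Town", "Charlestown", "Savannah"]
for row, variants in cross_reference(sample_terms).items():
    print(row, sample_terms[row - 1], variants)
```

The returned row numbers play the same role as the dedicated cross-reference column: they only suggest candidates, and the decision to merge terms under one standardized entity name stays with the encoder.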
Partly due to the retroactive application of data processing procedures, exporting the data from EndNote to Excel left many of the Place Names in the Keywords list only. To sort these from a mixed list of Place Names and People Names scattered among descriptions, subject terms, and Mentions, I applied a system of color coding: green for Place Names, yellow for People Names, orange for potential variations in spelling, and red for future deletions. [Figure 1]
I used the “Sort by Color” Excel function to group the term types together. This moved the Place Names to the top, so I could select the grouped green cells together before copying and pasting them to the Place Names sheet. I then reversed the Sort by Color in the Keywords list to preserve the line numbers used for cross-referencing. [Figure 2]
With the Place Names list now consolidated across the two sheets, the terms required further description to ensure machine readability. Further description involves assigning latitude and longitude coordinates for georeferencing, associating variant spellings, and separating out term attributes according to our defined encoding structure. Developing standards to properly describe and encode place names required extensive thought regarding the hierarchical structure of the place name attributes.
The database team designed the Place Names protocol around five characteristics, encoded as separate subfields. Encoding them as separate subfields within an Excel document ensures that the attributes of the Place Name data are properly described and machine-readable, because each attribute is defined by the hierarchy level it describes. Figure 3 shows the Place Name table within the EndNote protocol document, which describes the hierarchy to the encoding teams. Figure 4 shows how these levels translate to machine-readable encoding in this sample set. Terms from the EndNote export (stored in column A) follow a specific-to-general format with five elements: Venue, or an individual site; Description, or street; City; County/Territory; and Colony/Country. Separating these elements into individual columns assigns each one a specific characteristic, which the database will eventually read for display within the visualization elements of the project.
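As a rough illustration of how a term from column A might be separated into the five subfields, here is a minimal Python sketch. It assumes that the elements are comma-separated and that an unknown level is left as an empty slot between commas; neither detail comes from the protocol documents. The function name `split_place_name` and the sample term are placeholders.

```python
# The five hierarchy levels, ordered from specific to general.
FIELDS = ["Venue", "Description", "City", "County/Territory", "Colony/Country"]


def split_place_name(term, delimiter=","):
    """Split one specific-to-general term into the five subfields.

    Assumes (hypothetically) that every term carries exactly five
    delimiter-separated slots, with empty slots for unknown levels.
    """
    parts = [part.strip() for part in term.split(delimiter)]
    if len(parts) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} elements, found {len(parts)}: {term!r}"
        )
    return dict(zip(FIELDS, parts))


# Placeholder term, not taken from the project data.
record = split_place_name(
    "Example Tavern, Example Street, Exampletown, Example County, Example Colony"
)
print(record["City"])
```

Rejecting terms that do not have exactly five slots is a deliberate choice in this sketch: guessing which level a shorter term omits would risk filing a city under County/Territory, so such rows are better surfaced for manual review.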