Lisa Cliggett, Associate Professor at the University of Kentucky, and Oona Schmid, Director of Publishing at the American Anthropological Association
A data registry would provide a centralized finding guide. If developed, it would help researchers sift through the range of source materials that are currently dispersed across myriad archives, institutional repositories, and subject-level data banks. The registry would not contain data artifacts; it would point to the extant source materials, be they ethnographic photographs, physical specimens, linguistic recordings, archaeological data, field notes, LiDAR scans, biometric databases, or sound recordings. While the registry would support a uniform interface, it would respect subdisciplinary standards and, above all, encourage researchers to continue to negotiate on a case-by-case basis the tricky space between the AAA Principles of “do no harm” and “protect and preserve your records.” A registry thus would support anthropologists in navigating individual solutions for their records; complement the existing investments being made in analog archives and digital repositories; and maximize discoverability and increase access to the range of anthropological source materials.
Imagine being able to locate a nearly comprehensive list of source records by linguistic-cultural group. In the current widely distributed system, scholars need a priori knowledge of source records. A registry might help searchers identify lesser-known works and might foment discovery across the subdisciplines. If a researcher seeks to build on prior research on Kiowa language, currently s/he might turn to the published record and then to the authors of these works: William Meadows and John P. Harrington. A registry might well point to Harrington’s papers at the National Anthropological Archives and Meadows’ linguistic materials deposited at the American Philosophical Society’s Library. But the registry might also draw attention to Kiowa grammar texts and linguistic audio recordings (like those catalogued in the Open Language Archives Community); recordings of dances such as those within the University of New Mexico’s digital repository; the Kiowa drawings at the National Anthropological Archives; archaeological surveys of Kiowa lands deposited at tDAR; Jane Richardson Hanks’s field notes and correspondence held at the Newberry Library; and much more. A fully realized registry would also connect the dispersed works of a single researcher, which are often separated by media for optimal preservation and storage, such as when the sound recordings go to a sound / music library and the paper records are held by an ephemeral archive.
With funds from the NSF (grant number: BCS-1159109), Oona Schmid and Lisa Cliggett assembled some of the foremost curators of digital and analog collections, including Stephen Abrams of the California Digital Library; George Alter of the Inter-university Consortium for Political and Social Research; Aaron Bittel of the UCLA Ethnomusicology Archive; Sonia Barbosa of the Murray Archive; Chris Cieri of the Linguistic Data Consortium; Louise Corti of the UK Data Archive; Kathleen Creeley of the Tuzin Melanesian Archive; Carol Ember of the Human Relations Area Files; Candace Greene of the National Anthropological Archives; Robert Hilliker of Columbia University’s Academic Commons; Bert Lyons of the American Folklife Center; Frank McManamon of Digital Antiquity; and Chris Miller of Cross-Cultural Dance Resources. In addition, two funding agencies sent representatives: Mark Mahoney of Wenner-Gren and Deborah Winslow of the NSF. These 17 individuals met in September 2012, discussed the best means to collect the necessary metadata, defined the various data fields that would be critical to a data registry, and elaborated on the best ways to take the project forward.
SECTION 1 (SATURDAY MORNING DISCUSSION):
Discussions on Saturday morning focused on the collated datamap, an Excel sheet showing how the metadata from the participating archives synchronize. This discussion honed the list of fields that would be critical to include in a registry of anthropological data sets.
Conversation and Identification of Critical Fields:
The workshop attendees discussed which of these fields are essential for a data registry to support its aim of aiding in the discovery of relevant anthropological materials. The overwhelming consensus of these experts was that less is more. George Alter of ICPSR commented that it is better to “err on the side of less structure,” and Stephen Abrams suggested the focus be “just get it [the information], because you can enrich it over time.” Sonia Barbosa of IQSS added that the Murray Archive currently has more than 100 fields, of which only one is mandatory: the title field. The participants of the workshop felt that users are no longer fluent in structured searching, further undermining reasons to create a highly structured registry.
Value of Controlled Vocabulary
There was a lot of conversation about the value versus the limitations of using a controlled vocabulary in different fields. Clearly a controlled vocabulary can help produce more consistent records, and Frank McManamon pointed out that tDAR gets a lot of “junk” (misspellings, for instance) in the fields where researchers can type in free-form text. But many participants felt that controlled vocabularies, particularly in the context of the registry, would limit the number of records that archives and repositories would be able to contribute.
Interestingly, the participants agreed that crowd-sourced records produce richer description, and there was a widespread sentiment that no controlled vocabulary can adequately cover the real circumstances and nuances of data sets. For instance, Lisa Cliggett pointed out that in the case of Somali immigrants in Maine, it is virtually impossible to use a controlled geographic list or a country authority list to describe the data set. Most of the archives represented in the discussion offered examples of relying on free-form descriptions. QualiData uses a geographic drop-down and then an open “population” field. Similarly, OLAC uses ISO abbreviations for languages but allows extension coding to address the limitations of the ISO language codes. tDAR reviews its subject matter field regularly and incorporates keywords into a dynamic controlled vocabulary. Chris Cieri summarized this discussion by concluding that if the registry used an authority list, he would encourage extensibility of that system.
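The extensible authority list described above can be illustrated with a minimal sketch. It assumes a hypothetical `ExtensibleVocabulary` class (not part of any system discussed at the workshop) that never rejects a contributed term: unknown terms are set aside as proposals, and registry staff can later promote frequently used proposals into the approved list, loosely in the spirit of the periodic review tDAR was described as doing.

```python
from dataclasses import dataclass, field


@dataclass
class ExtensibleVocabulary:
    """Hypothetical controlled list that accepts free-form additions for later review."""

    approved: set[str]  # normalized (lowercase, single-spaced) approved terms
    proposed: dict[str, int] = field(default_factory=dict)  # term -> times contributed

    @staticmethod
    def _normalize(term: str) -> str:
        # Collapse whitespace and ignore case so trivial variants match.
        return " ".join(term.split()).casefold()

    def register(self, term: str) -> str:
        """Return 'approved' or 'proposed'; a contributed term is never rejected."""
        key = self._normalize(term)
        if key in self.approved:
            return "approved"
        self.proposed[key] = self.proposed.get(key, 0) + 1
        return "proposed"

    def promote(self, min_uses: int) -> list[str]:
        """Move proposals contributed at least min_uses times into the approved list."""
        ready = sorted(t for t, n in self.proposed.items() if n >= min_uses)
        for term in ready:
            self.approved.add(term)
            del self.proposed[term]
        return ready


# Illustrative data only: a geographic list that cannot describe a diaspora population.
places = ExtensibleVocabulary(approved={"somalia", "maine"})
status_known = places.register("Maine")
status_new = places.register("Somali immigrants in Maine")
places.register("somali immigrants in  maine")  # same term, different spacing and case
promoted = places.promote(min_uses=2)
```

The promotion threshold is an arbitrary choice for the sketch; a real registry might instead rely on editorial review.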
There were three places, however, where the workshop conversation underscored the value of a controlled vocabulary.
One place where there needs to be some oversight of the data records would be with the “primary investigator” field, the field that captures or describes the creator and primary research collector of the data set. Bert Lyons recommended an internal authority system of names, and shared that the Folklore Archives Initiative started with OCLC’s Virtual International Authority File (VIAF) and then allowed users to submit additions and updates. Thus, users have to go first into the name table to verify that the individual is not already listed. Carol Ember pointed out that this field should include institutions, which may sometimes play the role of author.
One concern with even a controlled list is the problem of disambiguation: ascertaining whether Jeffrey Clark (the archaeologist at North Dakota) is the same as Jeffrey Clark (the archaeologist of the American Southwest). Oona Schmid suggested AAA might investigate whether its dissertation lists could help disambiguate anthropologists, although obviously people’s names do change. The group was split over the efficacy of Open Researcher and Contributor ID (ORCID) identifiers, unique numbers that researchers have begun to use and that facilitate disambiguation between authors. Concerns about ORCID hinged on how long it will be before deceased researchers have ORCIDs and on the incompleteness of the initiative.
However the controlled list is developed, a significant contribution of a registry would be to bring together the products of individual authors. This would be a huge benefit to users of the registry. The current state of records is that a researcher’s corpus is often divided, such as when the sound files go to a different archive from the papers and field notes of the same individual. Joel Halpern, for example, has donated his photographs to the Human Relations Area Files and his papers to the University of Massachusetts at Amherst. It came out of the discussion that he plans to give his quantitative data to ICPSR.
Candace Greene pointed out that authorship has become clouded because there’s a feeling that informant communities are co-authors. The key take-away from the workshop, however, emphasized that searching for data sets is very likely to continue to be done around the “primary investigator.” I take from this conversation that the registry would have a controlled vocabulary for various contributor roles. While “primary investigator” may be the only mandatory field, the registry would offer the ability to include other roles and provide definitions to help ensure parallel use. The commonly held definition at the table for “author/primary investigator” was the person who collected the data. Some instances of additional roles thrown out in the course of discussion included:
- co-authors/co-creators (or collaborators) [individuals who worked with the primary investigator in the acquisition of the data set];
- collectors [individuals who maintained and archived the data record for some period of time];
- informants [subjects of images, recordings, interviews, and surveys];
- donors [individuals who deposited the data set, in the event this is not the co-author or collector];
- grantees [institution or person who funded the research inquiry];
- producers/editors [individuals who shaped or coded the source materials]; and
- distributors/publishers [individuals who play a role in the public availability of the data set]
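The role vocabulary above lends itself to a small illustration. The following is a sketch only, assuming hypothetical names (`Role`, `Contribution`, `validate`); the definitions are paraphrased from the list as recorded at the workshop, and the single rule enforced is the one the discussion pointed toward, namely that “primary investigator” is the only mandatory role.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    """Hypothetical controlled vocabulary; each value is the shared definition."""

    PRIMARY_INVESTIGATOR = "person who collected the data"
    CO_CREATOR = "worked with the primary investigator in acquiring the data set"
    COLLECTOR = "maintained and archived the data record for some period of time"
    INFORMANT = "subject of images, recordings, interviews, or surveys"
    DONOR = "deposited the data set, if not the co-author or collector"
    GRANTEE = "institution or person who funded the research inquiry"  # wording as in the list above
    PRODUCER_EDITOR = "shaped or coded the source materials"
    DISTRIBUTOR_PUBLISHER = "plays a role in the public availability of the data set"


@dataclass(frozen=True)
class Contribution:
    name: str  # a person or an institution
    role: Role


def validate(contributions: list[Contribution]) -> list[str]:
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []
    if not any(c.role is Role.PRIMARY_INVESTIGATOR for c in contributions):
        problems.append("a primary investigator is required")
    problems.extend(
        f"empty name for role {c.role.name}" for c in contributions if not c.name.strip()
    )
    return problems


# Placeholder names, not real depositors.
record = [
    Contribution("A. Researcher", Role.PRIMARY_INVESTIGATOR),
    Contribution("Example Historical Society", Role.DONOR),
]
problems = validate(record)
```

Storing the definition as the enum value keeps the vocabulary and its guidance to data hosts in one place, which is one way to encourage parallel use of the roles.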
Relationship of Data Sets
Participants felt that a data registry’s value would in part stem from its ability to codify relationships between data sets. For instance, a registry would link the waves of a longitudinal study, or link an initial data set to its reanalyses, recodings, or secondary transcriptions. Chris Cieri gave the example of datasets that are annotations or refactorings of older datasets, each of which is its own data record in OLAC. These relationships would need to be defined with a controlled vocabulary.
Level of Granularity
Furthermore, the participating organizations at the workshop create records with varying levels of granularity. A registry would need to support this range of granularity, in which some archives would contribute collection-level records and some data repositories and institutional repositories would provide record-level entries. A controlled vocabulary of different entry types, with definitions provided to the data hosts, would be critical for making the registry useful.
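The two controlled vocabularies discussed above, one for relationships between data sets and one for entry types, could be sketched as follows. The type names, relation names, and identifiers are all assumptions made for illustration; the workshop did not settle on a specific list.

```python
from dataclasses import dataclass, field
from enum import Enum


class EntryType(Enum):
    """Hypothetical granularity levels for registry entries."""

    COLLECTION = "collection-level record"
    ITEM = "record-level entry"


class Relation(Enum):
    """Hypothetical relation types drawn from the examples in the discussion."""

    CONTINUES = "later wave of a longitudinal study"
    REANALYZES = "reanalysis of an initial data set"
    RECODES = "recoding of an initial data set"
    TRANSCRIBES = "secondary transcription"
    ANNOTATES = "annotation of an older data set"


@dataclass
class Entry:
    entry_id: str
    title: str
    entry_type: EntryType
    # Each relation points at the entry_id of another registry entry.
    relations: list[tuple[Relation, str]] = field(default_factory=list)


def related_to(entries: list[Entry], target_id: str) -> list[str]:
    """Return the ids of entries that declare any relation to target_id."""
    return sorted(
        e.entry_id
        for e in entries
        if any(other == target_id for _, other in e.relations)
    )


# Invented entries for illustration.
entries = [
    Entry("e1", "Household survey, first wave", EntryType.COLLECTION),
    Entry("e2", "Household survey, recoded", EntryType.ITEM, [(Relation.RECODES, "e1")]),
    Entry("e3", "Household survey, second wave", EntryType.COLLECTION, [(Relation.CONTINUES, "e1")]),
]
linked = related_to(entries, "e1")
```

Because each derived data set remains its own entry, as in the OLAC example, the relation only needs to store a pointer to the other entry.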
Important Information for Registry:
The group did not come to a consensus on the question of the scope of the registry regarding published materials. Louise Corti pointed out that the publication can provide the documentation and contextualize the data. Rob Hilliker articulated a case for including dissertations and theses. My feeling is that unpublished data sets are less discoverable than electronic theses and dissertations (which, for instance, are collected by the granting institution and indexed by Dissertation Abstracts) or published books and journals (which are widely circulated and archived through multiple sites). Frank McManamon pointed out that in archaeology there is a lot of gray literature that is not being published in traditional (academic) outlets. George Alter felt that much of the resolution of this question would come from the definition of a mission statement.
Another key finding from the conversation was that the field which has led to the most discovery of data sets is the abstract. Abstracts offer tremendous flexibility and depth of rich description of the object. The abstract of the dataset must be full-text searchable for registry users to locate data sets of significance to them.
Everyone agreed that the registry would ideally link to the data object. If a repository or archive assigns a DOI, this would be the preferred handle. If a repository or archive uses a different fixed handle, participants felt this link, such as to a MARC record, would serve the need equally well. Participants at the table felt that if this link is well-handled and well-maintained, the registry would not need to include permissions, access, and copyright guidance, as the host institution’s fixed handle for the record would provide this information. A recommendation was made that the registry run a link checker once a week or more often.
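The preference order described above (a DOI first, otherwise whatever fixed handle the host provides) can be expressed in a few lines. This is a sketch under stated assumptions: the function name `preferred_link` and the example identifiers are hypothetical, and DOIs are turned into links through the public `https://doi.org/` resolver.

```python
from typing import Optional


def preferred_link(doi: Optional[str] = None, fixed_handle: Optional[str] = None) -> Optional[str]:
    """Pick the link a registry entry should point to: a DOI first, else a fixed handle."""
    doi = (doi or "").strip()
    if doi:
        # Accept common ways of writing a DOI and normalize to a resolver URL.
        for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
            if doi.lower().startswith(prefix):
                doi = doi[len(prefix):]
                break
        return "https://doi.org/" + doi
    handle = (fixed_handle or "").strip()
    return handle or None  # None signals that the host has no stable record to link to


# The DOI and catalog URL below are placeholders, not real identifiers.
from_doi = preferred_link(doi="doi:10.1234/example", fixed_handle="https://catalog.example.org/record/42")
from_handle = preferred_link(fixed_handle="https://catalog.example.org/record/42")
no_link = preferred_link()
```

Returning `None` rather than raising an error leaves room for entries whose host has no record or handle at all.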
The question arose what to do if the host doesn’t have a record. For instance, Kathy Creeley raised the example of the small local historical society that has a collection without a MARC record or fixed handle. Oona Schmid asked about the one-off websites that house anthropological data, inclusion of which would be a main contribution of a data registry. Louise Corti indicated that in the UK, the creation of Qualidata encouraged smaller societies to enter records, even very short records.
SECTION 2 (SATURDAY AFTERNOON DISCUSSION):
Discussion in the afternoon focused on defining the relationship between a registry and its source archives, the hosts of the actual data sets.
Data Sources for Registry:
Clearly any data registry would need a controlled table of hosts, with up-to-date contact information about the various data hosts that provide information to the registry.
One of the conversations at the workshop related to the differences among commitments to data preservation, particularly the ongoing stability of digital data. Everyone agreed that the registry could and should indicate the differences between archives that have made a commitment to ongoing data migration and those committed only to bit-level preservation, and the differences between archives that have a succession plan and those that do not. George Alter felt that the registry could play a crucial edifying role by including these details and educating anthropologists about these differences in preservation. In addition, the group felt that this awareness might be essential as more journals request data but make no promises regarding the ongoing preservation of and access to this data. Robert Hilliker gave the example of Neuroscience, a journal that encourages supplemental materials and video files but offers no provision for the archiving of those supporting materials.
In addition, on Sunday morning, the discussion pointed to the need to educate depositors about the differences in data control and the management/protection of informants. Just as a registry could be coded by differing archival promises, the varying policies regarding access and registration of secondary uses could also be defined for users of the registry. This might also facilitate education and awareness.
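A controlled table of hosts that records these preservation and access differences might look like the following minimal sketch. The category names and the example host are assumptions for illustration, not the policies of any archive named in this report.

```python
from dataclasses import dataclass
from enum import Enum


class Preservation(Enum):
    MIGRATION = "ongoing data migration"
    BIT_LEVEL = "bit-level preservation"
    UNSTATED = "no stated commitment"


class Access(Enum):
    OPEN = "open access"
    AGREEMENT = "data use agreement required"
    DEPOSITOR_SET = "terms set by depositor"


@dataclass(frozen=True)
class Host:
    """One row in a hypothetical controlled table of data hosts."""

    name: str
    contact: str
    preservation: Preservation
    has_succession_plan: bool
    access: Access


def describe(host: Host) -> str:
    """Summarize a host's commitments in one line for display next to a registry entry."""
    plan = "yes" if host.has_succession_plan else "no"
    return (
        f"{host.name}: {host.preservation.value}; "
        f"succession plan: {plan}; access: {host.access.value}"
    )


# A made-up host used only to show the output format.
example_host = Host(
    name="Example University Repository",
    contact="repository@example.edu",
    preservation=Preservation.BIT_LEVEL,
    has_succession_plan=False,
    access=Access.DEPOSITOR_SET,
)
summary = describe(example_host)
```

Showing such a summary beside each entry is one way the registry could carry out the educational role discussed above.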
Data Feeds or Data Pulls
In terms of the connection of the registry to the data hosts, participants felt that different data set owners would feel differently about the advantages and disadvantages of the push or pull options and encouraged the registry to facilitate either. Essentially, the consensus was that each archive or data bank would prefer to negotiate on a case-by-case basis whether it would allow a pull or prefer to manage a push process. Many comments stressed the need to ease the human burden. Candace Greene pointed out that in the case of archives, requiring depositors to provide metadata is impossible; the National Anthropological Archives struggles to manage processes that recur infrequently, because by the next occurrence personnel must essentially be retrained or the original personnel have left. Robert Hilliker pointed out that the Directory of Open Access Repositories includes the status of each repository’s participation in the Open Archives Initiative, a standard protocol for sharing (pulling) metadata. Most participants felt that in either a pull or a push case, hosts should be allowed to mark off different records (the records the content owners presume to be a good fit) and then the registry personnel would be able to override and reject any specific records that were not of interest.
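The two-stage selection described above, in which hosts mark candidate records and registry personnel may then reject individual ones, works the same way whether records arrive by push or by pull. A minimal sketch, with hypothetical names and invented records:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    record_id: str
    host: str
    via: str  # "push" or "pull", as negotiated with each host
    flagged_by_host: bool  # the host presumes this record is a good fit


def select_for_registry(candidates: list[Candidate], rejected_ids: set[str]) -> list[Candidate]:
    """Keep records the host flagged, minus those registry staff rejected."""
    return [
        c for c in candidates
        if c.flagged_by_host and c.record_id not in rejected_ids
    ]


# Invented records from two placeholder hosts.
incoming = [
    Candidate("a-1", "Archive A", "push", True),
    Candidate("a-2", "Archive A", "push", False),  # host did not mark it
    Candidate("b-1", "Repository B", "pull", True),
    Candidate("b-2", "Repository B", "pull", True),
]
accepted = select_for_registry(incoming, rejected_ids={"b-2"})
accepted_ids = [c.record_id for c in accepted]
```

Keeping the transport (`via`) separate from the selection logic reflects the suggestion that the registry facilitate either option.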
As noted above, there was a recommendation that a link check run weekly to test the URLs and prevent link rot. Rob Hilliker suggested the site be reindexed weekly as well.
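A weekly link check of the kind recommended above could be sketched as follows. The function names are hypothetical, the scheduling itself (for example, a weekly cron job) is left out, and the probe is passed in as a parameter so the logic can be exercised without network access. The default probe uses a HEAD request from the Python standard library; some servers do not answer HEAD requests properly, so a real checker might fall back to GET.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 if the URL cannot be reached."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status such as 404
    except (urllib.error.URLError, OSError, ValueError):
        return 0  # DNS failure, refused connection, timeout, or malformed URL


def find_broken_links(
    links: dict[str, str],
    probe: Callable[[str], int] = http_status,
) -> dict[str, int]:
    """Map each registry entry id whose link fails to the status that was observed."""
    broken = {}
    for entry_id, url in links.items():
        status = probe(url)
        if not 200 <= status < 400:
            broken[entry_id] = status
    return broken


# Offline demonstration with a stand-in probe and placeholder URLs.
def fake_probe(url: str) -> int:
    return 404 if url.endswith("/gone") else 200


demo_links = {
    "e1": "https://archive.example.org/record/1",
    "e2": "https://archive.example.org/gone",
}
broken = find_broken_links(demo_links, probe=fake_probe)
```

The resulting report could be sent to the contact listed for each host so that broken handles are repaired at the source.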
Potential Participating Data Hosts
One unresolved question was whether a data registry should be international in scope. tDAR and ESRC QualiData report a relationship to funding agencies and, because of this, refer a depositor to the appropriate funding partner. For instance, research funded by the UK would be eligible for deposit in the Archaeology Data Service and ESRC QualiData; conversely, research not funded in the UK is not eligible for these data banks.
Another unresolved question was whether data deposits are more likely to flow to subject-specific repositories (like tDAR and OLAC) or to a local institution (the researcher’s university, for instance). Because this trend is hard to predict, a data registry might need to create alliances with both communities.
Candace Greene and Deborah Winslow both expressed a desire that the registry draw on a four-field understanding, as allowing discovery and awareness of data across the subdisciplines is another potential contribution. Deborah Winslow cited the example of linguistic data being collated with genomic data, and Candace Greene described the importance of breaking down the very high wall between physical and cultural anthropologists.
SECTION 3 (SUNDAY MORNING DISCUSSION):
Discussion on Sunday morning, around the topic of “how to encourage scholars to deposit primary source materials into archives” centered around understanding some of the barriers to archiving that are specific to anthropology, and steps that various constituents could take to encourage and facilitate data archiving, with attention to specific next steps.
Discussion of Barriers:
The most frequently discussed barriers to archiving for qualitative researchers include: loss of control of data, management of confidentiality, IRB requirements (of extreme confidentiality and / or data destruction at the end of a project), concerns about being scooped by other researchers, notions of data ownership and copyright, fear of critique / embarrassment (e.g., famous cases such as Derek Freeman’s attack on Margaret Mead, Malinowski’s diaries, general scrutiny), time investment with no reward (issues of academic valuing and reward / recognition), and a tendency toward technological aversion among some qualitative researchers (although it was pointed out that this is changing: the current generation of students is “born digital” in its skills and practices).
Because the participants in this workshop represented the archival community, discussion primarily centered on issues of data control and managing confidentiality. During this discussion it became clear that no standard protocol exists among archives for handling qualitative data. Instead, there is a continuum from highly controlled data access to fully open access with little knowledge of the secondary data user. The discussion demonstrated the need for anthropologists and qualitative researchers to actively anonymize and clean their data (remove sensitive material) before archiving it, but also to investigate the policies of a given archive about data access. For example, ICPSR, DataVerse, and Qualidata require (or allow depositors to set the terms for) some level of data use agreement for secondary users, which typically includes evidence of research intentions (usually demonstrated by affiliation with a university). Other archives, such as the TAMA (Melanesian), the UCLA Ethnomusicology Archive, and the California Digital Library, support access with minimal constraints, in keeping with ideas about public institutions, the need for access by non-academic groups (often the host or study populations), and so on.
Based on the discussion around the range of data access policies, qualitative researchers may be justified in their concern about loss of control of their data and about confidentiality. However, these concerns can be ameliorated with two specific steps: 1) thoroughly clean and anonymize data (with a code book for proper names, which might or might not be deposited with the data set) prior to deposit; 2) investigate which archive will best serve the anthropologist’s needs, whether through data use agreements, embargoes, and depositor-established controls, or through more open access to facilitate data sharing among a broad population (including study/host communities). Additionally, with the growth of university-based repositories, the qualitative research community must educate administrators about the needs of this particular type of data, ideally establishing policies for qualitative data that meet the needs of this scholarly community. Such education is much more likely to result in appropriate policies now, when the repository movement is starting and policies can be developed alongside the repositories themselves.
Supportive Steps to Encourage Archiving
During discussion of supportive steps to encourage archiving among anthropologists, the constituents mentioned included: funders, societies, archives, universities, IRBs, publishers, anthropologists and students. We discussed aspects associated with all of these groups to some extent, but the majority of discussion centered on activities of funders, societies, and archives.
Funders and Associations
A range of activities that funders and associations (such as the AAA) can undertake to promote archiving include publication of (and supporting / offering courses on) “best practices” and guidelines related to archiving (and using secondary data). These documents or short courses could include the topics: steps in metadata capture and documentation, preparing data for archiving (at point of capture / data collection), proposal development for projects using secondary data, how to publish without revealing identities (whether using primary or secondary data); tips and “boiler plate” language for use with IRB / Human Subjects reviews that clarifies how data archiving does not undermine confidentiality and protection of research subjects.
These kinds of guides and courses would help publicize and raise awareness of the idea, and need, of archiving data and using secondary data to the broad anthropological community. Workshop participants also suggested that funding agencies could fund projects to use secondary data as one way of mainstreaming data sharing and reuse. Overall, there was strong support for the role of funders and associations in publicizing, promoting, funding and disseminating knowledge about archiving and archival practices across the broad population of anthropologists.
The archives represented at the workshop have a broad range of practices to attract depositors, although only archives specific to anthropology target that qualitative research community. In broad terms, archives reach out to scholarly communities by attending professional conferences – both to present to the research community and to informally meet with researchers, and authoring articles and chapters about aspects of archiving. Some archives have sufficient staff and resources to reach out to specific groups – such as graduate students – to offer small grants and/ or technical support (equipment loan, software packages, transcription services, tools for backup in the field, etc) in exchange for deposit of data upon return from fieldwork / data collection. Other archives publicize the value of archiving data as a path to data repatriation (to study / host communities) as one way to attract data depositors.
There was animated discussion around the importance of early adopters for promoting archiving of data. Specifically, Corti (from the UK Qualidata archive, arguably the earliest/ oldest qualitative digital archive) said that strong mandates do not work in encouraging scholars to archive data. The better strategy is to attract early adopters and celebrate them as “pioneers of social science research.” In Qualidata’s case, they had a group of depositors join together in a conference presentation, with a subsequent special issue of a leading journal (International Journal of Social Research Methods). From that base, other scholars became aware of archiving practices and also wanted to be seen as pioneers, thus joining the archiving community.
Other participants mentioned identifying well-known scholars or data sets and seeking them out for deposit. If archives attract well-known studies, other scholars may follow along. The importance of networking, both word of mouth among researchers and representatives from archives meeting with individual researchers, was echoed by many workshop participants. One suggestion was for archive staff to reach out to one or two faculty members from each department at the home institution, allowing those faculty members to help spread the word among their colleagues on campus.
On Saturday, several conversations touched on the potential role of journals. On the one hand, there are journals that are including data in their publications as supporting (or supplemental) materials. These journals, such as Neuroscience, do not necessarily have an archiving provision for the data. Above, under the discussion on “Data Sources for Registry,” I include some discussion in relation to archiving and preservation of data sets. On the other hand, journals could play a key role in aiding in the citing of data sets. On Saturday, there was extensive conversation about the importance of researchers citing data sets. Citation acknowledges the value and contribution of data sets, but more importantly, citation indexers are beginning to track it: Thomson Reuters (publisher of the Journal Citation Reports) launched a data citation index, announced on October 16th, after the workshop ended. Journal style guides could clearly support this with recommendations on citation practices.
Prioritizing post-workshop activities
The final session on Sunday morning sought to summarize and prioritize the next steps in building the anthropological data registry (and the undergirding activity of promoting deposit of data among the anthropological community).
Overall, the general sense of the group was to 1) target strategies that will help generate a group of “early adopters” in archiving, and 2) produce a set of guidelines and best practices that can easily be made available through funding agencies and professional associations.
Some of the most well received ideas for attracting early adopters included: offering funding (small and large) to researchers wanting to deposit their data in an archive, to archives for attracting and digitizing high profile data sets, and for secondary users to work with archived data sets. Publicizing examples of what can happen to data that isn’t archived (the “stories of woe”) can also speak to scholars at a more personal level, pointing out that technological and natural disasters can happen at any moment, and are outside of an individual’s control.
A remaining challenge for promoting archiving is the demonstration of how archiving and secondary data use moves the discipline further, and advances new understanding in human life. Thus, demonstrating how secondary data use / longitudinal studies can produce significant findings is key to celebrating the success of data archiving. Publicizing examples of successful re-analysis / longitudinal research that builds on existing archived data sets is one way to demonstrate the value of archiving beyond simple preservation. Such publicizing could occur through professional association meetings, publication of a collection of articles based on secondary data use / longitudinal work, and other types of high profile public presentations.
Production of guides and best practices facilitates outreach to the anthropological community in an encouraging and supportive tone, rather than authoritarian mandates and regulation with no related support for the desired behavior. Framing the guides as work flow suggestions for different points in the career cycle – archiving data around dissertation time frames, at end of projects and near retirement – and guides for proposal/ project development tied to secondary data use, publication with appropriate controls for confidentiality and sensitivity to privacy were the most clearly articulated suggestions.
Workshop participants also agreed that establishing the anthropological data registry itself and beginning to populate the registry with high-profile examples of successful archiving would also serve to mainstream the idea of archiving and attract more depositors. Importantly, the registry should include both analog (hard copy) and digital archives in order to capture the broadest possible representation of archived data.
See “Statement on Ethics: Principles of Professional Responsibility,” American Anthropological Association, approved by membership November 2012.