Keeping Up With... Cultural Heritage Crowdsourcing

This edition of Keeping Up With… was written by Victoria Van Hyning.

Victoria Van Hyning is an Assistant Professor of Library Innovation at the University of Maryland, College Park, College of Information Studies (iSchool), email: vvh@umd.edu.

Introduction

Online cultural heritage crowdsourcing, also known as commons-based peer production, niche-sourcing, and co-creation, invites people, usually volunteers, to enhance the descriptions of Library, Archive, and Museum (LAM) items through tagging, transcription, and providing first-hand knowledge. Cultural heritage crowdsourcing has surged since 2015, when several tools and platforms became widely available, such as the free project builder on Zooniverse.org, the subscription transcription platform FromThePage, PyBossa, Scripto for Omeka, MicroPasts, Transkribus, and social media sites such as Flickr. Several institutions have built their own platforms and made the code available, such as the Library of Congress’ Concordia platform, underpinning the By the People project. Projects on these platforms span centuries, languages, continents, and subject matter, and have engaged millions of people around the world. Projects can be designed to limit engagement to a specific group of people, e.g., those with a particular language skill, or may be open to anyone with an internet connection, time, and interest.

Studies of volunteer motivation since 2010 reveal strong links between altruism and engagement, and significant learning outcomes for some participants. These findings resonate with the experiences of many crowdsourcing community managers, of whom I was one from 2018 to 2020, working on By the People. During the first year of the COVID-19 pandemic, many LAM crowdsourcing projects experienced significant increases in participation from the public, typically more than double previous levels. Projects were also used by many LAMs to engage staff in meaningful work during lockdowns and extended work-from-home periods.

As part of my early career project Crowdsourced Data: Accuracy, Accessibility, Authority (CDAAA), funded by the IMLS, my students and I have created a public Zotero library of crowdsourcing articles and other resources. Anyone is free to consult this library and extract or add sources to it. You’ll find ample literature with examples, advice, and resources for LAM practitioners to create their own projects.[1] This article will therefore not focus on how to create, launch, and sustain projects, but instead on less-frequently discussed considerations, affordances, and barriers to successful crowdsourcing projects, and a final thought about how crowdsourcing might intersect with emerging machine learning (ML) and artificial intelligence (AI) methods in the near to medium-term. 

Crowdsourcing Platforms as Reference Forums

Crowdsourcing discussion platforms provide excellent opportunities for outreach, troubleshooting, and engagement between LAMs and participants. But these platforms can also be understood as spaces for conducting open reference services, and gathering information about what users need to know to engage with your collections online.

History Hub is a good example of open reference in action. It was designed by the National Archives and Records Administration as an open reference forum that could be used by any federal LAM. Thus far, there are dedicated spaces for NARA’s Citizen Archivist and LOC’s By the People, where dedicated crowdsourcing community managers and reference staff from multiple divisions at NARA and LOC answer questions in a public setting, which can in turn be consulted by future users. Open reference covers basic FAQs, as well as more complex questions stemming from crowdsourcing volunteers' deep engagement with LAM materials. 

Though sophisticated, many of these open reference questions are not attached to in-progress research projects, for which researchers and LAM staff alike might prefer an in-person reference interview or other communication. Community managers for crowdsourcing projects have many roles, and open reference is one that should be counted in regular reporting about reference services, as well as public engagement.

Data and Metadata

Crowdsourcing data and metadata may not fit neatly into the CMSs and descriptive workflows at your institution. Forethought and planning can help avoid wasted staff and volunteer effort. 

Many crowdsourcing projects are built in applications other than LAMs’ core discovery systems, and the resulting data needs to be brought back into a CMS or published in a data repository. Transcriptions, tags, notes, and other content can radically enhance discovery, but only if the content can be incorporated into appropriate meta/data fields and integrated into the search functionality of the discovery platform. Common technical barriers to data integration include character length limits for metadata fields, and notes fields that are not exposed to search, while human barriers can include concerns about integrating crowdsourced data into the authoritative record.[2]

I strongly recommend that institutions test their systems and explore affordances and concerns with staff before embarking on crowdsourcing projects. Document from the outset (if possible) who will take ownership of the data or metadata, where it will live, in what format(s), and how it will be incorporated into the authoritative record, if at all. Alternatives to the latter include making the data available as a bundle and depositing it in Dataverse, Zenodo, or an institutional repository.  
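
One way to begin testing a system is to check a sample of crowdsourced records against the destination fields' character limits before attempting an import. The following is a minimal sketch under stated assumptions, not a description of any particular CMS: the field names, the limits in FIELD_LIMITS, and the sample records are all hypothetical and would need to be replaced with values from your own system. It also shows one simple way to serialize the same records as a CSV file for deposit in a repository.

```python
import csv
import io

# Hypothetical character limits per field; real limits depend on your CMS.
FIELD_LIMITS = {"title": 255, "transcription": 4000, "tags": 500}


def find_overflows(records, limits=FIELD_LIMITS):
    """Return (record_id, field, length, limit) for each value over its limit."""
    overflows = []
    for record in records:
        for field, limit in limits.items():
            value = record.get(field, "")
            if len(value) > limit:
                overflows.append((record["id"], field, len(value), limit))
    return overflows


def to_csv_bundle(records, fieldnames):
    """Serialize records as CSV text, e.g. for deposit in a data repository."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Illustrative sample data only; not drawn from any real project.
sample_records = [
    {"id": "item-001", "title": "Letter, 1863",
     "transcription": "Dear Sir, " * 10, "tags": "letter; correspondence"},
    {"id": "item-002", "title": "Diary page",
     "transcription": "x" * 4500, "tags": "diary"},
]

overflows = find_overflows(sample_records)
for record_id, field, length, limit in overflows:
    print(f"{record_id}: {field} is {length} characters (limit {limit})")

bundle = to_csv_bundle(sample_records, ["id", "title", "transcription", "tags"])
```

With the sample data above, only the transcription for item-002 is reported as too long. A report like this can inform a conversation with staff about whether to raise a limit, split content across fields, or publish the full data as a separate bundle.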

Accessibility

Crowdsourcing has significant potential to increase accessibility to collections, but only if the systems that hold the resulting data are accessible. 

Access to collections for people with one or more visual, auditory, sensory, mobility or cognitive disabilities is woefully subpar at most institutions. Crowdsourcing has been deployed as a method for increasing access to non-machine-readable information in digitized images of manuscripts, photos, AV and 3D objects, by creating transcriptions, tags, and descriptions. However, if LAM discovery systems cannot in fact hold these kinds of data or present them in ways that are discoverable to, for example, screen-reader technology, then this potential will not be realized.[3] There is significant opportunity for expanding access to collections, which should be a priority for our institutions.

AI Approaches

Emerging human-in-the-loop, machine learning, and other artificial intelligence approaches, including ChatGPT, can be thoughtfully layered with crowdsourcing task flows to automate the boring stuff.

Crowdsourcing projects have long relied on volunteers to do tasks that machines can’t do well enough, e.g., transcribe text, or identify and describe key features of images. As automated approaches have evolved, some teams have found ways of building in ML or AI protocols to narrow the pool of tasks that volunteers need to work on. In June 2023, Ben and Sara Brumfield, of FromThePage, demonstrated how ChatGPT can be used to create regularized transcriptions from diplomatic or semi-diplomatic transcriptions, and to automate indexing, both of which have the power to improve accessibility for people who use screen readers, and search more broadly.[4]

Conclusion

With thoughtful planning and design, crowdsourcing, ML, and AI approaches can be harnessed to dramatically increase access to our collections. Instead of responding to this moment with fear for our jobs and future opportunities for our students, I suggest we meet this moment by asking “What can we do better and more comprehensively if we adopt these tools, that we currently struggle to do because we don’t have the resources or support?” In closing, I’ll echo Jed Bartlet by asking “What’s next?”

Notes

[1] For example, Ridge, Mia, Samantha Blickhan, Meghan Ferriter, Austin Mast, Ben Brumfield, Brendan Wilkins, Daria Cybulska, et al. 2021. The Collective Wisdom Handbook. Digital Scholarship at the British Library. London: British Library. https://britishlibrary.pubpub.org/.

[2] Crowe, Katherine, Katrina Fenlon, Hannah Frisch, Diana Marsh, and Victoria Van Hyning. 2021. “Inviting and Honoring User-Contributed Content.” In The Lighting the Way Handbook: Case Studies, Guidelines, and Emergent Futures for Archival Discovery and Delivery. online: Stanford. https://doi.org/10.25740/gg453cv6438. Van Hyning, Victoria, and Mason Jones. 2021. “Data’s Destinations: Three Case Studies in Crowdsourced Transcription Data Management and Dissemination.” Startwords, no. 2 (December). https://zenodo.org/record/5750691.

[3] Van Hyning, Victoria. 2022. “Crowdsourced Data: Accuracy, Accessibility, Authority.” RE-252344-OLS-22. http://www.imls.gov/grants/awarded/re-252344-ols-22.

[4] Brumfield, Ben, and Sara Brumfield. Prompt Writing and Interacting with ChatGPT for Librarians and Archivists. 2023. Video. YouTube. https://www.youtube.com/watch?v=xdZteUOCaG8.