LibGuides: Linguistics: Data Sets & Corpora

Best Bet

OLAC Language Resource Catalog
OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. The catalog provides access to information about thousands of languages, including details of text collections, audio recordings, dictionaries, and software, sourced from dozens of digital and traditional archives.

Data Sources

The Archive of the Indigenous Languages of Latin America (AILLA)
Free access to original historical linguistic data from 400 indigenous languages. Registration is required to download files.
Child Language Data Exchange System (CHILDES)
CHILDES is the child language component of the TalkBank system. TalkBank is a system for sharing and studying conversational interactions.
corpus.byu.edu
A website offering free partial access to a range of corpora, including News on the Web (NOW), Global Web-Based English (GloWbE), Wikipedia Corpus, Hansard Corpus (British Parliament), Early English Books Online, Corpus of Contemporary American English (COCA), Corpus of Historical American English (COHA), Corpus of US Supreme Court Opinions, TIME Magazine Corpus, Corpus of American Soap Operas, British National Corpus (BYU_BNC), Strathy Corpus (Canada), CORE Corpus, Corpus del Español, Corpus do Português, and Google Books corpora in American English, British English, and Spanish.
Corpus of Historical American English (COHA)
A large, structured corpus of historical American English containing more than 400 million words of text from the 1810s to the 2000s. Reed has purchased and downloaded the full text of COHA; contact the Data Services Librarian to learn how to access the data.
DataCite Search
Federated search across interdisciplinary data repositories. DataCite gathers metadata for each DOI assigned to a research object.
Linguistic Data Consortium
The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and government research laboratories that creates and distributes a wide array of language resources including corpora.
PHOIBLE Online
PHOIBLE (Phonetics Information Base and Lexicon) Online is a repository of cross-linguistic phonological inventory data, which have been extracted from source documents and tertiary databases and compiled into a single searchable convenience sample.
Reed Linguistics Gender and Language Project
This study by Becker, Khan, and Zimman explores the relationship between gender identity and the use of creaky voice ("vocal fry"). The dataset contains audio files and tabular data.

Registry for Research Data Repositories (re3data.org)
re3data.org is a comprehensive registry of research data repositories from different academic disciplines including Biology, Chemistry, Economics, Linguistics, Physics, and Psychology.

TalkBank
Shared databases of recordings and coded transcripts within subfields studying communication, including aphasia, audiology, bilingualism, child language (the Child Language Data Exchange System, CHILDES), conversation analysis, dementia, phonological and phonetic analysis, second language acquisition, and traumatic brain injury.
UCLA Phonetics Lab Archive
Recordings of over 200 languages.
University of Oxford Text Archive
The Oxford Text Archive collects and preserves electronic literary and linguistic resources in more than 25 different languages.