The following is a listing of corpora (compiled by Alicia Wassink and Rachael Tatman) that may be of interest to phoneticians and sociolinguists. Some of these corpora are freely available or held by the UW; others may require membership or payment.
- 1 Corpora to Know About
- 1.1 The Afranaph Project
- 1.2 The American National Corpus (ANC)
- 1.3 The Audio-Aligned and Parsed Corpus of Appalachian English (AAPCAppE)
- 1.4 The Automatic Tagging and Recognition of Stance (ATAROS) Project
- 1.5 Clitics in Romance Database
- 1.6 Corpus of Contemporary American English (COCA)
- 1.7 Diachronic Electronic Corpus of Tyneside English (DECTE)
- 1.8 Geneva Corpus of Early High German
- 1.9 The Interactive Atlas of Intonation
- 1.10 International Corpus of English (ICE)
- 1.11 UW Linguistics Data Consortium (LDC) Holdings
- 1.12 Library of Congress Voices from Slavery Archives
- 1.13 Michigan Corpus of Spoken Academic English (MICASE)
- 1.14 Middle English Dictionary Project
- 1.15 The Newcastle Electronic Corpus of Tyneside English (NECTE)
- 1.16 Origins of New Zealand English (ONZE) Project
- 1.17 PHOIBLE
- 1.18 The Sign Language AnalYses Database (SLAY)
- 1.19 The Spanish in Texas Corpus
- 1.20 The Syntactic Atlas of Spanish
- 1.21 Talk of the Town (TOON)
- 1.22 USC-TIMIT
- 1.23 University of Washington/Northwestern University Corpora
- 2 Soundfiles Online
- 3 Data and Annotations for Sociolinguistics
- 4 Morewords.com
- 5 Douglas Biber (Northern Arizona University)
Corpora to Know About
The Afranaph Project
- "The main goal of the Afranaph Project, as it is presently constituted, is to develop rich descriptions of a wide range of African languages in order to serve the interests of linguistic research into the nature and distribution of empirical patterns in natural language."
- Contains information on specific syntactic phenomena from linguistically-trained native speakers of African languages.
- Rich annotation system (in many cases more thorough than the traditional four-line gloss) and search functions.
The American National Corpus (ANC)
- The Open ANC (OANC, 14 million words) is available online and free to download
- OANC annotations: structural markup, sentence boundaries, words, noun chunks, verb chunks
- OANC spoken corpora: Charlotte, Switchboard
- OANC written corpora: 911 Report, Berlitz, Biomed, Eggan, ICIC, OUP, PLOS, Slate, Verbatim, Web data (government)
- The ANC directors welcome contributions of written texts and transcribed speech produced in/after 1990
The Audio-Aligned and Parsed Corpus of Appalachian English (AAPCAppE)
- Tortora, C., Santorini, B., & Blanchette, F.
- "The Audio-Aligned and Parsed Corpus of Appalachian English is an in-progress project. The ultimate product will be a 1-million word corpus of Appalachian English, with two basic components: Transcripts which are time-aligned with the speech signal, and fully text-searchable and A part-of-speech tagged and parsed version of the transcripts"
- Includes data from numerous historical recordings, including a parse (with speech disfluencies removed).
- Due to the nature of the recordings, not suited for fine-grained phonetic analysis.
- Project ongoing, will be finished by the end of 2015.
The Automatic Tagging and Recognition of Stance (ATAROS) Project
- Levow, G-A, Wright, R, & Ostendorf, M.
- The ATAROS project aims to identify acoustic signals of stance-taking (opinions, evaluations, judgments, etc.) in order to inform the development of automatic stance recognition in natural speech. Because existing corpora generally have a low frequency of stance-taking in conversation, we have created an audio corpus of dyads completing collaborative tasks designed to elicit a high density of stance-taking at increasing levels of involvement. Funded by NSF grant IIS-1351034.
- For access, email firstname.lastname@example.org.
Clitics in Romance Database
- Repetti, L., & Ordóñez, F.
- "CRL is a searchable database that allows users to examine one of the rare patterns involving a class of pronouns (clitics) that are normally stressless, but which, in the languages under investigation, do exhibit or modify stress. This pattern is found in some minority Romance languages spoken in the South of France, the South and North of Italy, the Balearic Islands, and the islands of Corsica and Sardinia. This system is designed to be free to the public and open-ended. Anyone can use the database to perform queries."
- Includes close transcriptions as well as sound files and syntactic information.
- Many endangered languages included.
Corpus of Contemporary American English (COCA)
- 425 million words, from 1990-2011
- Searchable by section (including "spoken") and year
- Good for frequency info
- Alternative interface for COCA
Diachronic Electronic Corpus of Tyneside English (DECTE)
Geneva Corpus of Early High German
- Zimmerman, R.
- "The Geneva Corpus of early German (GeCeG) is a syntactically parsed corpus of medieval German. It will be released in 2015."
- Parsing includes some new modifications to the markup language used in other CorpusSearch corpora such as the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) and the Penn-Helsinki Parsed Corpus of Middle English (PPCME).
The Interactive Atlas of Intonation
- Prieto, P., Borràs-Comes, J., & Roseano, P.
- "The Interactive Atlas of Romance Intonation presents audio and video materials for the study of intonation of different Romance languages. Such materials are utterances representing different sentence-types, as well as conversations and interviews. These materials are accessible by means of interactive maps of Europe and the Americas. In addition to this, the Atlas offers a selection of resources available online about the intonation of Romance languages."
- Includes a modified ToBI annotation.
- Data collected using parallel elicitation strategies over many languages
International Corpus of English (ICE)
- One million words each of spoken and written English produced after 1989 in 24 different locations.
- Some demographic data
UW Linguistics Data Consortium (LDC) Holdings
- Housed on the Computational Linguistics Laboratory's corpus server
- License conditions (at least for some corpora, maybe all LDC corpora) may prohibit making copies on personal machines.
- Some include speech, some POS tagging, many are text-only, very few have demographic information (see "data type" field in individual records)
- Some LDC corpora are coded for some demographic categories, speech style, and type of files included (.wav).
- The sociolinguistics laboratory members have posted some materials used in current research here.
- Anything ever distributed by the LDC could in principle be installed on the Linguistics server
- LDC corpora of particular interest:
- 1996-7 English Broadcast News Transcripts (HUB4) (NPR archives)
- American National Corpus (see above)
- Brown Corpus, The (ICAME Collection of English Grammar Corpora, 2nd ed.)
- CALLFRIEND American English-Non-Southern Dialect
- CALLFRIEND American English-Southern Dialect
- CALLHOME American English Lexicon (PRONLEX)
- CALLHOME American English Transcripts
- CELEX2 (Lexical Databases)
- CSLU Voices (Imitation)
- Treebanks (various languages, POS-tagged for syntactic analysis)
- European Language Newspaper Text
- Europarl Parallel Corpus (v3) European Parliament Proceedings (Dutch, English, German, Danish, Swedish)
- HCRC Map Task Corpus (dyadic conversations, Scots English)
- HTIMIT, TIMIT (Texas Instruments/MIT Corpus) (controlled sentences, different microphone handsets)
- London-Lund Corpus of Spoken English (ICAME Collection of English Grammar Corpora, 2nd ed.) (text only)
- NPS Internet Chatroom Conversations (text only, chatroom conversations)
- Switchboard-1, Cellular, NXT Switchboard Annotations (telephone conversational data)
- Santa Barbara Corpus of Spoken American English, Parts I-IV (transcribed and time-stamped)
- SLX Corpus of Classic Sociolinguistic Interviews, Talkbank project (transcribed speech)
- Speech in Noisy Environments (SPINE) Evaluation Transcripts
- The CMU Kids Corpus (read sentences)
- The New York Times Annotated Corpus (text)
- Webster's Unabridged Dictionary (1913 Edition) (lexicon)
Library of Congress Voices from Slavery Archives
- To get to the soundfile list, select Browse Collection By > Audio interviews...
Michigan Corpus of Spoken Academic English (MICASE)
- Part of the Michigan Corpus Linguistics Project
- Coded for speech style, speaker demographics
Middle English Dictionary Project
- Multi-University Project housed at University of Michigan, spearheaded by Dr. Henk Aertsen
The Newcastle Electronic Corpus of Tyneside English (NECTE)
Origins of New Zealand English (ONZE) Project
- University of Canterbury, Department of Linguistics (Jennifer Hay, PI)
PHOIBLE
- Moran, S., McCloy, D., & Wright, R.
- PHOIBLE Online is a repository of cross-linguistic phonological inventory data, which have been extracted from source documents and tertiary databases and compiled into a single searchable convenience sample. The 2014 edition includes 2155 inventories that contain 2160 segment types found in 1672 distinct languages.
The Sign Language AnalYses Database (SLAY)
- Tatman, R.
- A meta-analytic sign-language database including information on languages, references (descriptive grammars, computational work and in a few cases primary sources) and grammatical structures.
- Currently contains information on the cross-linguistic distribution of parameters.
- Not yet permanently hosted. Please contact Rachael Tatman for a copy.
- Forking of new versions encouraged.
The Spanish in Texas Corpus
- Bullock, B., Toribio, A. J., & Serigos, J.
- "The goal of this project is to develop a pedagogically useful corpus of Spanish and bilingual Spanish-English speech samples culled from interviews and conversations among speakers of diverse personal profiles and regional origins throughout Texas."
- Includes video and audio data, along with time-aligned transcriptions and other information, such as the language of each word.
- The researchers are actively looking for additional data and ask for contributions of raw data. They may be contacted here.
The Syntactic Atlas of Spanish
- Gallego, A. J., Ordóñez, F., & Roca, F.
- "The Syntactic Atlas of Spanish (ASinEs) is a research project developed by the Centre de Lingüística Teòrica (UAB) that seeks to provide a tool to study the syntactic variation of the different Spanish dialects."
- Contains information on syntactic structures in both Europe and the Americas.
- Rich search interface available in Spanish.
Talk of the Town (TOON)
- An archive of local language and stories
- Daughter site of the NECTE
- Has a cool online quiz for laypeople about Geordie dialects
USC-TIMIT
- Presented by Hagedorn, C.
- "USC-TIMIT is a database of speech production data under ongoing development, which currently includes real-time magnetic resonance imaging data from five male and five female speakers of American English, and electromagnetic articulography data from four of these speakers."
- Includes MRI and EMA data for speakers from a variety of dialectal areas using TIMIT elicitation sentences.
- MRI data can be viewed using a custom MATLAB plug-in. EMA data has been head-corrected but will require additional analysis not included in the database.
University of Washington/Northwestern University Corpora
- Version 1.0:
- McCloy, D. R., Souza, P. E., Wright, R. A., Haywood, J., Gehani, N., & Rudolph, S.
- 3600 audio files with time-aligned textgrids
- Files are readings of 180 of the IEEE “Harvard” sentences by 20 different talkers (5 males and 5 females from each of two dialect regions of American English: the Pacific Northwest and the Northern Cities).
- Version 2.0:
- Panfili, L. M., Haywood, J., McCloy, D. R., Souza, P. E., and Wright, R. A.
- 22,460 audio files with time-aligned textgrids
- Files are readings of the IEEE “Harvard” sentences by 33 different talkers from two dialect regions of American English: the Pacific Northwest (11 males, 9 females) and the Northern Cities (7 males, 6 females). Pacific Northwest speakers read the full set of 720 sentences, while Northern Cities speakers read a subset of 620 sentences.
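The audio file totals listed for the two versions can be checked against the talker and sentence counts. The short sketch below assumes one audio file per talker per sentence (an assumption, though the totals are consistent with it); the variable names are illustrative only.

```python
# Assumption: one audio file per talker per sentence read.

# Version 1.0: 20 talkers, 180 sentences each.
v1_files = 20 * 180

# Version 2.0: Pacific Northwest talkers (11 males + 9 females) read 720 sentences;
# Northern Cities talkers (7 males + 6 females) read a 620-sentence subset.
pnw_talkers = 11 + 9
nc_talkers = 7 + 6
v2_files = pnw_talkers * 720 + nc_talkers * 620

print(v1_files)  # 3600
print(v2_files)  # 22460
```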
Soundfiles Online
Here are some resources for finding soundfiles online (either to stream within your browser or download) that aren't necessarily part of a large corpus.
British Library Sound Archive
International Dialects of English Archive
- features many world-wide English dialects
- includes two commonly-used reading passages ("The Rainbow Passage" and "Comma Gets a Cure") for diagnosing dialect differences:
- Central America
- North America
- South America
- Special Collections
The Voice and Speech Source
- Eric Armstrong, York University, Canada
Data and Annotations for Sociolinguistics
- Hosted at U Penn
- "The Sociolinguistic Annotation project will investigate the well-documented process of t/d deletion in four large digital speech corpora: TIMIT, Switchboard-1, CallHome American English and Hub-4 English Broadcast News."
Morewords.com
Some of us have found this website useful when we have to come up with minimal pairs or sets in English. The list is not exhaustive: it is based upon the Enable2k North American word list, which is used in well-known word games and contains 173,528 words (English and American spellings). Still, it is really helpful if your alternative is to sit with pencil and paper and think up lists of words satisfying certain criteria. To get started, go to the "examples" page for different types of search options.
One type of search of interest to some of us in this class uses our knowledge of orthographic representations to find words exemplifying particular conditioning environments. For example, if you want to come up with forms for a word list examining the prevelar merger in PNW English (BAG, BAKE, BEG), you can search their database of most common words ending with "eg", which generates a list of all words ending with this string and their lexical frequencies (http://www.morewords.com/most-common-ends-with/eg).
You will find you can also search all English words by:
- word length
- combination of letters
- lexical frequency
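If you have a word list on your own machine, the same kinds of orthographic searches can be approximated offline. The following is a minimal sketch and does not query morewords.com: the toy lexicon, its frequency counts, and the function name `search_words` are all hypothetical and stand in for whatever word list and frequency data you actually have.

```python
from typing import Dict, List, Optional

# Hypothetical toy lexicon: word -> made-up frequency count (illustration only).
TOY_LEXICON: Dict[str, int] = {
    "beg": 120, "leg": 900, "peg": 80, "keg": 40,
    "bag": 700, "bake": 300, "egg": 650, "legal": 500,
}


def search_words(
    lexicon: Dict[str, int],
    ends_with: str = "",
    contains: str = "",
    length: Optional[int] = None,
) -> List[str]:
    """Return words matching all given criteria, most frequent first."""
    matches = [
        word
        for word in lexicon
        if word.endswith(ends_with)
        and contains in word
        and (length is None or len(word) == length)
    ]
    return sorted(matches, key=lambda word: lexicon[word], reverse=True)


# Words ending in the string "eg", ordered by (made-up) frequency.
print(search_words(TOY_LEXICON, ends_with="eg"))
# ['leg', 'beg', 'peg', 'keg']

# Three-letter words containing the letter combination "eg".
print(search_words(TOY_LEXICON, contains="eg", length=3))
# ['leg', 'egg', 'beg', 'peg', 'keg']
```

Because the search is purely orthographic, the results still need to be checked by hand against the pronunciation you are targeting.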
Douglas Biber (Northern Arizona University)
- Scholar well-known for his work on variation in written texts.
- Presently active in corpus-based analyses of English, using several large corpora, including: TOEFL 2000 Spoken and Written Academic Language corpus