The Corpus of Spoken Yiddish in Europe (CSYE) is an Open Access digital language archive sourced from interviews with Yiddish-speaking Holocaust survivors. The interviews come from the USC Shoah Foundation Visual History Archive (VHA), a repository of video-recorded testimonies from Jewish survivors of the Holocaust, as well as from survivors and witnesses of other genocides. Although Yiddish accounts for fewer than 700 testimonies in the VHA, these interviews are an extremely rich source of information on all aspects of the language: its regional dialects, grammatical structures, registers and styles, prosody, co-speech gestures, and other topics. The goal of the CSYE is to increase the accessibility of these recordings by producing high-quality transcripts, maps, and other digital resources for researchers, students, educators, and members of the public.
Testimonies in the VHA are conducted by trained volunteers, most often in survivors’ homes. The particular testimonies licensed for inclusion in the CSYE were conducted between 1995 and 2001. Interviews are broken up into thirty-minute “tapes,” which correspond to digitized video cassettes, and they usually follow survivors’ life experiences chronologically: before, during, and after the Holocaust. The first tape begins with an introduction and a spelling of the survivor’s name, and the final tape often ends with a verbal description of the survivor’s family photographs or a brief conversation with members of the survivor’s family. More information on interview methodology is available on the VHA website.
The transcripts in the CSYE are produced by a small team of expert transcribers and reviewers. We first use speaker diarization software to identify stretches of continuous speech (as opposed to silence) and assign them to “tiers” according to who is speaking (the survivor or the interviewer); the resulting segmented files are then manually transcribed in ELAN using the Latin alphabet. Our transcription guidelines include conventions for marking personal and place names, lexical borrowings and code-switches, and false starts and other speech disfluencies. After transcription and review, transcript files are respelled in the Hebrew-based Yiddish alphabet and exported to several text formats.
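For readers who want to work with tiered, time-aligned transcripts programmatically, the sketch below shows one way to read speaker tiers from an ELAN annotation file (.eaf, an XML format) using only the Python standard library. It is an illustration under stated assumptions, not a description of the CSYE's own tooling: the tier names "interviewer" and "survivor", the sample utterances, and the function name read_segments are all invented for this example, and the actual files in the corpus may be organized differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature ELAN document. The tier IDs and utterances are
# invented for illustration; they are not taken from the CSYE.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="2000"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="4200"/>
  </TIME_ORDER>
  <TIER TIER_ID="interviewer">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>vi heyst ir?</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="survivor">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>ikh heys Sore.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_segments(eaf_text):
    """Return (start_ms, end_ms, tier_id, text) tuples sorted by start time.

    Assumes every TIME_SLOT carries a TIME_VALUE (in milliseconds); ELAN
    files may also contain unaligned slots, which this sketch does not handle.
    """
    root = ET.fromstring(eaf_text)
    # Map each time-slot ID to its offset in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
    }
    segments = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            text = ann.findtext("ANNOTATION_VALUE", default="")
            segments.append((start, end, tier.get("TIER_ID"), text))
    return sorted(segments)


segments = read_segments(SAMPLE_EAF)
for start, end, speaker, text in segments:
    print(f"{start / 1000:.1f}-{end / 1000:.1f}s  {speaker}: {text}")
```

Sorting the tuples by start time interleaves the two tiers into a single chronological transcript, which is convenient for reading or for generating subtitles.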
Transcripts can be viewed on the CSYE website on each survivor’s testimony page, both in a searchable table and as subtitles embedded in the video player. They can also be downloaded from our public GitHub repository for offline use and more advanced searching. Both the CSYE website and the public repository are updated in real time as transcript files are added or modified.
All materials in the CSYE are provided for free. Use of the corpus is governed by the USC Shoah Foundation Terms of Use as well as the CSYE Terms of Use, both of which can be viewed in our User Guide.
Additional background on the corpus, its development, and possible use cases is provided in Bleaman & Nove (forthcoming).
Dr. Isaac L. Bleaman, Assistant Professor of Linguistics, University of California, Berkeley
Dr. Chaya R. Nove, Postdoctoral Scholar in Linguistics, University of California, Berkeley
Ronald Sprouse, Information Systems Analyst, University of California, Berkeley
Yakov Blum
Eli Jany
Eliezer Niborski
Izzy Posen
Ben Sadock
If you would like to suggest a correction to an interview transcript or corpus metadata, please submit an “Issue” on our transcripts repository. Choose the template that best matches your issue and click “Get started” to fill out the submission form. You will need to sign in to GitHub (or create an account if you are new to the platform) before you can access the form.
For all other questions and comments about the CSYE, please send an email to info@yiddishcorpus.org.
We gratefully acknowledge the USC Shoah Foundation – The Institute for Visual History and Education for its support of this project. All interviews are from the archive of the USC Shoah Foundation – The Institute for Visual History and Education. For more information: http://sfi.usc.edu/
This project was made possible by grant support from the National Science Foundation (Grant No. BCS-2142797) and fellowship support from the University of California, Berkeley. Any opinions, findings, conclusions or recommendations expressed in the material on this website are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the University of California, Berkeley.