What's New
corpusVarious
Author(s):
Description:
The HELLO CAMPANIA! Ghana collection contains 12 sociolinguistic interviews collected with 4 first-generation migrants and 8 second-generation migrants living in Naples. It also contains 9 language portraits.
Academic Use
corpusVarious
Author(s):
Description:
The corpus consists of 48 audio files, for a total of 20:38 of recordings (public), and their corresponding transcriptions in ELAN (available upon request). This collection also includes 15 language portraits. The collection is organized in four bundles:
- 1G_audio: contains all the audio files collected with 1st generation migrants (30 files)
- 1G_portrait: contains the language portraits collected from 1st generation migrants (13 files)
- 2G_audio: contains all the audio files collected with 2nd generation migrants (18 files)
- 2G_portrait: contains the language portraits collected from 2nd generation migrants (2 files)
Academic Use
corpusVarious
Author(s):
Description:
The Ukrainian collection contains data for 26 first-generation (G1) speakers, 19 females and 6 males. The collection contains three folders for each group: the sociolinguistic interview and a language portrait.
Academic Use
Most Viewed Items - Last Month
corpusVarious
Author(s):
Description:
The GerSumCo (German Summary Corpus) is a learner corpus comprising syntheses written by L2 German writers (CEFR B2/C1) and writers of L1 German. The corpus was created to enable a comparative analysis of the academic writing of L1 German and L2 German students. The two subcorpora (L1 and L2) contain a total of 286 texts (178 L1 and 108 L2), written by 286 students at 14 universities and language schools in Germany (Bamberg, Bochum, Dresden, Hamburg, Hildesheim, Kiel, Leipzig, Magdeburg, Osnabrück, Potsdam, Trier, Wuppertal), Poland (Gdansk) and China (Hangzhou). The texts were collected between 2022 and 2024 as part of a PhD research project that uses GerSumCo and Beldeko in a contrastive interlanguage analysis to identify L1-dependent features of cohesion in L2/L1 German.
The metadata files (Meta_GerSumCo_L1 & Meta_GerSumCo_L2) contain the following information:
- Up to three L1s of the writers
- Up to three L2s of the writers
- Collection date
- Topic
- Whether the text was written as homework or in class
- Group of students the texts belonged to
The file names contain the following information:
- Whether the text is part of the L1 or L2 subcorpus
- Topic
The summaries consist of 230 words on average. The texts were produced either in class on computers or as homework, within a 60-minute time frame. Students were permitted to use online dictionaries, but no AI-based tools. They were required to summarise two texts on one of four topics related to language variation in German: Kiezdeutsch, Mundartdebatte in der Schweiz, Viadrinisch and Varianten-Wörterbuch des Deutschen.
This version contains the TXT files of the texts and the CSV files containing the manual annotations of the texts, with token ID, sentence ID, source text form, target form, automatically annotated lemma, POS (STTS) and simple UPOS part-of-speech tag.
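The annotation layout described above (one token per row, with a sentence ID, a source form and a target form) could be read roughly as follows. This is only a minimal sketch: the column names (`token_id`, `sentence_id`, `source_form`, `target_form`, `lemma`, `pos_stts`, `upos`) and the sample rows are hypothetical placeholders, not taken from the GerSumCo files, so the real headers should be checked before reuse.

```python
import csv
import io
from collections import defaultdict

# Made-up placeholder rows with HYPOTHETICAL column names;
# the actual GerSumCo CSV headers and contents may differ.
SAMPLE_CSV = """\
token_id,sentence_id,source_form,target_form,lemma,pos_stts,upos
1,1,Kiezdeutsch,Kiezdeutsch,Kiezdeutsch,NN,NOUN
2,1,ist,ist,sein,VAFIN,AUX
3,1,ein,ein,ein,ART,DET
4,1,Dialeckt,Dialekt,Dialekt,NN,NOUN
5,1,.,.,.,$.,PUNCT
"""


def read_tokens(handle):
    """Read one annotation table into a list of row dictionaries."""
    return list(csv.DictReader(handle))


def group_by_sentence(rows):
    """Collect the target forms of all tokens under their sentence ID."""
    sentences = defaultdict(list)
    for row in rows:
        sentences[row["sentence_id"]].append(row["target_form"])
    return dict(sentences)


def corrected_tokens(rows):
    """Return (source, target) pairs where the annotation changed the form."""
    return [
        (row["source_form"], row["target_form"])
        for row in rows
        if row["source_form"] != row["target_form"]
    ]


rows = read_tokens(io.StringIO(SAMPLE_CSV))
print(group_by_sentence(rows))
print(corrected_tokens(rows))
```

To work with a downloaded file instead of the in-memory sample, the `io.StringIO(...)` argument could be replaced by a file opened with `open(path, newline="", encoding="utf-8")`; the encoding is also an assumption.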
Publicly Available
corpusCMC & WaC
Author(s):
Description:
The DiDi corpus has an overall size of around 600.000 tokens gathered from 136 South Tyrolean Facebook users who participated in the DiDi project. It consists of 11.102 Facebook wall posts, 6.507 wall comments and 22.218 private messages. All messages were written by the participants during the year 2013. Please read the full description of the corpus for further details. Please also consider the description of the method of data collection and the full description of the DiDi project and its research questions.
As every participant could offer their private messages, their texts on the wall, or both, the corpus comprises wall posts and wall comments from 130 profiles and private messages from 56 profiles; 50 participants granted access to both types of data. The wall posts and comments are freely accessible. Due to privacy issues, access to the private messages is restricted: it can be granted for scientific research only, after signing a non-disclosure agreement. If you are interested in the data for scientific reasons, please contact the research team.
All texts were anonymised in order to guarantee that the participants' identity cannot be inferred from the texts. The anonymisation covered person names, group names, geographical names and adjectival references, institution names, hyperlinks, mail addresses, phone numbers, bank account numbers, servers, postal codes and other private information. Please read the anonymisation document for the anonymisation keys.
The corpus offers a vast range of research opportunities for linguists who are interested in CMC in general, and more specifically in multilingual language use, the use of regional varieties, and code switching, code shifting and code mixing phenomena, etc.
Access to the DiDi corpus: https://commul.eurac.edu/annis/didi
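The participant figures above are consistent with one another: profiles that contributed both data types are counted in both the wall group and the message group, so they have to be subtracted once. A minimal sketch of that check, with function and variable names that are purely illustrative:

```python
def total_participants(wall_profiles, message_profiles, both):
    """Inclusion-exclusion: profiles in both groups are counted only once."""
    return wall_profiles + message_profiles - both


# Figures taken from the corpus description above.
wall_profiles = 130      # profiles with wall posts and comments
message_profiles = 56    # profiles with private messages
both = 50                # profiles that granted access to both

total = total_participants(wall_profiles, message_profiles, both)
wall_only = wall_profiles - both
messages_only = message_profiles - both

print(total, wall_only, messages_only)
```

The result matches the 136 Facebook users reported for the corpus, of whom 80 contributed wall texts only and 6 contributed private messages only.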
Academic Use
Author(s):
Description:
LEONIDE is a longitudinal corpus of student essays documenting the language competences and writing development of lower secondary school students in three different languages. The corpus contains 2.512 texts from 163 pupils who participated in the project “One school, many languages”, conducted in eight schools in the officially multilingual Italian province of South Tyrol / Alto Adige (Zanasi & Stopfner, 2018). The aim of the project was to document the development of the pupils' plurilingual linguistic and communicative skills by collecting oral and written language samples in Italian, German and English, in order to obtain a global view of their individual linguistic repertoire. LEONIDE contains all the texts written by the participating students during the course of the project; the overall size of the corpus amounts to ca. 240.000 tokens.
The texts were collected over the span of 3 consecutive years (2015-2018) in public middle schools (i.e. lower secondary school, grade 6 to grade 8). The pupils were 11 years old at the beginning of the data collection and 13 years old at the end. In each grade, two written texts were collected that differ with respect to genre: the first text was elicited using a picture story re-telling task; the second text is an opinion text on different aspects related to the pupils' lives and public discourse. For each genre and each grade, the corpus provides texts in the three languages German, Italian and English.
In order to reflect the school system of the Province of South Tyrol / Alto Adige, about half of the texts were collected in four schools in which German is the main language of teaching and Italian is taught as L2. The other half of the texts were collected in four schools in which Italian is the main language of teaching and German is taught as L2. In all schools, English is taught as L3 (i.e. as a foreign language at school). Subdivided by language, the corpus contains 844 Italian, 833 German and 835 English texts.
Manual annotation: The corpus is fully anonymised and annotated with target hypotheses correcting orthography errors in the text, as well as with annotations on structural elements (paragraphs, line breaks, bullet points, symbols or emoticons etc.), foreign word insertions and transcript surface features (e.g. deletions, corrections or insertions by the student, unreadable or ambiguous items).
Automatic annotation: Automatic linguistic annotation included sentence splitting, tokenisation, lemmatisation and part-of-speech tagging.
Text metadata: The corpus provides a series of relevant person-related metadata (e.g. age, gender, first language(s), school and possible special needs of the students) as well as task-related metadata (e.g. task year, text genre, etc.).
Usage: As the corpus documents the development of plurilingual competences of individual learners over a period of three years, it allows both quantitative research on the characteristics of young learners' language over a relatively long period and investigations of the development of individuals, taking into account a wide range of person-related metadata. In addition, it allows contrastive analyses of the young learners' progress in their L1, L2 and L3.
Availability: The corpus will be available for corpus queries via an ANNIS search interface and as download for academic purposes (ACA-BY-NC-NORED 1.0) on the Eurac Research Clarin Centre by the end of 2020.
References:
Zanasi, L. & Stopfner, M. (2018). Rilevare, osservare, consultare. Metodi e strumenti per l’analisi del plurilinguismo nella scuola secondaria di primo grado. In C. M. Coonan, A. Bier & E. Ballarin (Eds.), La didattica delle lingue nel nuovo millennio. Le sfide dell’internazionalizzazione (pp. 135-148). Edizioni Ca’Foscari. http://doi.org/10.30687/978-88-6969-227-7/009
Glaznieks, A., Frey, J.-C., Stopfner, M., Zanasi, L. & Nicolas, L. (accepted). LEONIDE: A longitudinal trilingual corpus of young learners of Italian, German and English. In: International Journal of Learner Corpus Linguistics.
Academic Use