sentence | wd | sLength | freq | IPA | nSyl | stress |
---|---|---|---|---|---|---|
El sera rode dela careta (2). | el | 6 | 11152 | 'el | 1 | final |
El sera rode dela careta (2). | sera | 6 | 178 | 'se-ɾa | 2 | penult |
El sera rode dela careta (2). | rode | 6 | 34 | 'ɾo-de | 2 | penult |
El sera rode dela careta (2). | dela | 6 | 775 | 'de-la | 2 | penult |
El sera rode dela careta (2). | careta | 6 | 109 | ka-'ɾe-ta | 3 | penult |
Bragheta, quel di, se gavea perso drio el tordo ferio. | bragheta | 10 | 21 | bɾa-'ge-ta | 3 | penult |
Bragheta, quel di, se gavea perso drio el tordo ferio. | quel | 10 | 888 | 'kwel | 1 | final |
Bragheta, quel di, se gavea perso drio el tordo ferio. | di | 10 | 244 | 'di | 1 | final |
Bragheta, quel di, se gavea perso drio el tordo ferio. | se | 10 | 4333 | 'se | 1 | final |
Bragheta, quel di, se gavea perso drio el tordo ferio. | gavea | 10 | 758 | ga-'ve-a | 3 | penult |
Corpus
Our corpus consists of internet texts from the IIA as well as excerpts from books written in Talian. Text processing is done in R (R Core Team 2020), and optical character recognition (OCR) is carried out with Google’s Tesseract (Smith 2007). As a starting point, we used Tesseract's trained data for Italian and then checked the output for potential mismatches. As of July 2024, the corpus contains 308,985 words and 24,226 sentences.
Last updated: July 11, 2024
What the corpus looks like
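The corpus currently has 25 variables (columns). The sample table on this page shows a subset of them: the sentence, the word (wd), sentence length in words (sLength), word frequency (freq), IPA transcription, number of syllables (nSyl), and stress position.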
The corpus follows a tidy data approach (Wickham et al. 2014), so little (if any) data wrangling is needed to analyze the data.
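As a rough illustration of what the tidy layout allows, the sketch below rebuilds the sample rows shown on this page as a small tibble and summarizes them by stress position. The column names (wd, sLength, freq, nSyl, stress) come from the sample table; the object names `sample_rows` and `stress_summary` are made up for this example, and the ten rows are only a toy subset, not the full corpus.

```r
library(dplyr)
library(tibble)

# Toy data: the ten sample rows shown on this page (not the full corpus).
sample_rows <- tribble(
  ~wd,        ~sLength, ~freq, ~nSyl, ~stress,
  "el",       6,        11152, 1,     "final",
  "sera",     6,        178,   2,     "penult",
  "rode",     6,        34,    2,     "penult",
  "dela",     6,        775,   2,     "penult",
  "careta",   6,        109,   3,     "penult",
  "bragheta", 10,       21,    3,     "penult",
  "quel",     10,       888,   1,     "final",
  "di",       10,       244,   1,     "final",
  "se",       10,       4333,  1,     "final",
  "gavea",    10,       758,   3,     "penult"
)

# One row per word means grouping and summarizing need no reshaping.
stress_summary <- sample_rows %>%
  group_by(stress) %>%
  summarize(
    n_words  = n(),         # number of word tokens per stress pattern
    mean_syl = mean(nSyl),  # average syllable count per stress pattern
    .groups  = "drop"
  )

print(stress_summary)
```

With these ten rows, the summary has two groups: 4 words with final stress (all monosyllables) and 6 with penultimate stress (2.5 syllables on average). The same pipeline should apply unchanged to the full corpus once it is loaded.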
Download corpus
To access the corpus, click here. We highly recommend that you load tidyverse before loading the corpus itself.
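A minimal sketch of that loading order follows. It assumes the downloaded corpus is a CSV file, which this page does not confirm; the helper `load_corpus()` and the temporary demo file are hypothetical and stand in for the real download, whose file name and format may differ.

```r
library(tidyverse)  # load tidyverse first, as recommended above

# Hypothetical helper: reads a corpus file, ASSUMING it is a CSV.
# If the real download uses another format, swap in the matching reader.
load_corpus <- function(path) {
  stopifnot(file.exists(path))
  read_csv(path, show_col_types = FALSE)
}

# Demo with a temporary stand-in file so the sketch runs without the real download.
demo_path <- tempfile(fileext = ".csv")
write_csv(tibble(wd = c("el", "sera"), freq = c(11152, 178)), demo_path)

demo <- load_corpus(demo_path)
glimpse(demo)  # quick look at column names and types
```

To use it on the real corpus, replace `demo_path` with the path to the file you downloaded.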
Publications and presentations
Garcia, G. D. & N. B. Guzzo. (2023). Talian Corpus: Um corpus de dados escritos do talian (vêneto brasileiro). Presentation at Abralin em Cena 17, June 27, 2023 [in Portuguese].
Guzzo, N. B. & G. D. Garcia. (2020). Phonological variation and prosodic representation: clitics in Portuguese-Veneto contact. Journal of Language Contact, 13(2):389–427.
How to cite the corpus
APA:
Garcia, G. D., & Guzzo, N. B. (2021, April 12). Talian corpus: a written corpus of Brazilian Veneto. https://doi.org/10.17605/OSF.IO/63NRX. Available at https://nataliaguzzo.github.io/talian.
BibTeX:
@misc{Garcia_Guzzo_2021,
title={Talian corpus: a written corpus of Brazilian Veneto},
url={osf.io/63nrx},
DOI={10.17605/OSF.IO/63NRX},
publisher={OSF},
author={Garcia, Guilherme D and Guzzo, Natália B},
year={2021},
month={Apr}}
Copyright © 2024 Natália Brambatti Guzzo