Corpus

Our corpus consists of internet texts from the IIA as well as excerpts from books written in Talian. Text processing is done in R (R Core Team 2020), and optical character recognition (OCR) is carried out with Google’s Tesseract (Smith 2007). As a starting point, we used Tesseract's trained data for Italian and then checked the output for potential mismatches. As of July 2024, the corpus contains 308,985 words and 24,226 sentences.

Last updated: July 11, 2024


What the corpus looks like

The corpus currently has 25 variables (columns). The table below shows a subset of them.

| sentence | wd | sLength | freq | IPA | nSyl | stress |
|---|---|---|---|---|---|---|
| El sera rode dela careta (2). | el | 6 | 11152 | 'el | 1 | final |
| El sera rode dela careta (2). | sera | 6 | 178 | 'se-ɾa | 2 | penult |
| El sera rode dela careta (2). | rode | 6 | 34 | 'ɾo-de | 2 | penult |
| El sera rode dela careta (2). | dela | 6 | 775 | 'de-la | 2 | penult |
| El sera rode dela careta (2). | careta | 6 | 109 | ka-'ɾe-ta | 3 | penult |
| Bragheta, quel di, se gavea perso drio el tordo ferio. | bragheta | 10 | 21 | bɾa-'ge-ta | 3 | penult |
| Bragheta, quel di, se gavea perso drio el tordo ferio. | quel | 10 | 888 | 'kwel | 1 | final |
| Bragheta, quel di, se gavea perso drio el tordo ferio. | di | 10 | 244 | 'di | 1 | final |
| Bragheta, quel di, se gavea perso drio el tordo ferio. | se | 10 | 4333 | 'se | 1 | final |
| Bragheta, quel di, se gavea perso drio el tordo ferio. | gavea | 10 | 758 | ga-'ve-a | 3 | penult |


The corpus follows a tidy data approach (Wickham 2014), so little (if any) wrangling is needed before analysis.
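As a rough illustration of what the one-word-per-row layout allows, the sketch below rebuilds a few columns of the excerpt shown above as a tibble and summarizes them with tidyverse verbs. The object names (`excerpt`, `stress_by_syl`, `freq_by_stress`) are illustrative assumptions, not part of the corpus. The same calls should apply to the full data once it is loaded.

```r
library(tidyverse)

# Excerpt copied from the table above (five of the 25 columns).
# The object name `excerpt` is hypothetical, chosen for this sketch only.
excerpt <- tribble(
  ~wd,        ~sLength, ~freq, ~nSyl, ~stress,
  "el",        6,       11152, 1,     "final",
  "sera",      6,         178, 2,     "penult",
  "rode",      6,          34, 2,     "penult",
  "dela",      6,         775, 2,     "penult",
  "careta",    6,         109, 3,     "penult",
  "bragheta", 10,          21, 3,     "penult",
  "quel",     10,         888, 1,     "final",
  "di",       10,         244, 1,     "final",
  "se",       10,        4333, 1,     "final",
  "gavea",    10,         758, 3,     "penult"
)

# Each row is one word token, so counting stress patterns by
# syllable count needs no reshaping.
stress_by_syl <- excerpt %>%
  count(nSyl, stress, name = "n_words")

# Mean value of the freq column for each stress pattern in the excerpt.
freq_by_stress <- excerpt %>%
  group_by(stress) %>%
  summarize(mean_freq = mean(freq), .groups = "drop")

print(stress_by_syl)
print(freq_by_stress)
```

Because every variable is a column and every word is a row, grouping by any other column in the corpus follows the same pattern.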


Download corpus

To access the corpus, click here. We highly recommend that you load tidyverse before loading the corpus itself.


Publications and presentations


How to cite the corpus

APA:

Garcia, G. D., & Guzzo, N. B. (2021, April 12). Talian corpus: a written corpus of Brazilian Veneto. https://doi.org/10.17605/OSF.IO/63NRX. Available at https://nataliaguzzo.github.io/talian.

BibTeX:


@misc{Garcia_Guzzo_2021,
  title={Talian corpus: a written corpus of Brazilian Veneto},
  url={https://osf.io/63nrx},
  DOI={10.17605/OSF.IO/63NRX},
  publisher={OSF},
  author={Garcia, Guilherme D and Guzzo, Natália B},
  year={2021},
  month={Apr}
}

Copyright © 2024 Natália Brambatti Guzzo

References

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
Smith, R. 2007. “An Overview of the Tesseract OCR Engine.” In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), 2:629–33.
Wickham, H. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.