Corpus

Our corpus consists of internet texts from the IIA as well as excerpts from books written in Talian. Text processing is done in R (R Core Team 2020), and optical character recognition (OCR) is carried out using Google’s Tesseract (Smith 2007). As a starting point, we used Tesseract’s trained data for Italian and then checked the output for potential mismatches. As of January 2025, the corpus contains 308,985 words and 24,226 sentences.

Last updated: January 14, 2025

What the corpus looks like

The corpus currently has 25 variables (columns). The table below shows a subset of them, with one row per word.

sentence	wd	sLength	freq	IPA	nSyl	stress
El sera rode dela careta (2).	el	6	11152	'el	1	final
El sera rode dela careta (2).	sera	6	178	'se-ɾa	2	penult
El sera rode dela careta (2).	rode	6	34	'ɾo-de	2	penult
El sera rode dela careta (2).	dela	6	775	'de-la	2	penult
El sera rode dela careta (2).	careta	6	109	ka-'ɾe-ta	3	penult
Bragheta, quel di, se gavea perso drio el tordo ferio.	bragheta	10	21	bɾa-'ge-ta	3	penult
Bragheta, quel di, se gavea perso drio el tordo ferio.	quel	10	888	'kwel	1	final
Bragheta, quel di, se gavea perso drio el tordo ferio.	di	10	244	'di	1	final
Bragheta, quel di, se gavea perso drio el tordo ferio.	se	10	4333	'se	1	final
Bragheta, quel di, se gavea perso drio el tordo ferio.	gavea	10	758	ga-'ve-a	3	penult

The corpus follows a tidy data approach (Wickham 2014): each row is one word and each column is one variable, so little (if any) data wrangling is needed before analysis.
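Because each row holds one word together with its annotations, most summaries reduce to grouping and counting. The sketch below is a minimal illustration in Python using only the standard library. It works on the ten rows from the table above, pasted in as tab-separated text, and not on the full corpus, whose file format may differ. The names `SAMPLE`, `rows`, `by_stress`, and `mean_syl` are ours and are not part of the corpus.

```python
import csv
import io
from collections import Counter, defaultdict

# Ten rows copied from the table above (tab-separated); NOT the full corpus.
SAMPLE = (
    "sentence\twd\tsLength\tfreq\tIPA\tnSyl\tstress\n"
    "El sera rode dela careta (2).\tel\t6\t11152\t'el\t1\tfinal\n"
    "El sera rode dela careta (2).\tsera\t6\t178\t'se-ɾa\t2\tpenult\n"
    "El sera rode dela careta (2).\trode\t6\t34\t'ɾo-de\t2\tpenult\n"
    "El sera rode dela careta (2).\tdela\t6\t775\t'de-la\t2\tpenult\n"
    "El sera rode dela careta (2).\tcareta\t6\t109\tka-'ɾe-ta\t3\tpenult\n"
    "Bragheta, quel di, se gavea perso drio el tordo ferio.\tbragheta\t10\t21\tbɾa-'ge-ta\t3\tpenult\n"
    "Bragheta, quel di, se gavea perso drio el tordo ferio.\tquel\t10\t888\t'kwel\t1\tfinal\n"
    "Bragheta, quel di, se gavea perso drio el tordo ferio.\tdi\t10\t244\t'di\t1\tfinal\n"
    "Bragheta, quel di, se gavea perso drio el tordo ferio.\tse\t10\t4333\t'se\t1\tfinal\n"
    "Bragheta, quel di, se gavea perso drio el tordo ferio.\tgavea\t10\t758\tga-'ve-a\t3\tpenult\n"
)

# Parse the tab-separated text into one dict per word (one row per word).
# QUOTE_NONE keeps the stress marks in the IPA column as literal characters.
rows = list(
    csv.DictReader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE)
)

# Number of words per stress position.
by_stress = Counter(row["stress"] for row in rows)

# Mean number of syllables per stress position.
syllables = defaultdict(list)
for row in rows:
    syllables[row["stress"]].append(int(row["nSyl"]))
mean_syl = {stress: sum(n) / len(n) for stress, n in syllables.items()}

print(dict(by_stress))
print(mean_syl)
```

No reshaping step is needed before the summary, which is the practical benefit of the tidy layout. In R, the same grouping can be written as a short tidyverse pipeline.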

Download corpus

To access the corpus, click here. We highly recommend that you load tidyverse before loading the corpus itself.

Publications and presentations

Garcia, G. D. & N. B. Guzzo. (2023). Talian Corpus: Um corpus de dados escritos do talian (vêneto brasileiro). Presentation at Abralin em Cena 17, June 27, 2023 [in Portuguese].
Guzzo, N. B. & G. D. Garcia. (2020). Phonological variation and prosodic representation: clitics in Portuguese-Veneto contact. Journal of Language Contact, 13(2):389–427.

How to cite the corpus

APA:

Garcia, G. D., & Guzzo, N. B. (2021, April 12). Talian corpus: a written corpus of Brazilian Veneto. https://doi.org/10.17605/OSF.IO/63NRX. Available at https://nataliaguzzo.github.io/talian.

BibTeX:

@misc{Garcia_Guzzo_2021,
  title={Talian corpus: a written corpus of Brazilian Veneto},
  url={osf.io/63nrx},
  DOI={10.17605/OSF.IO/63NRX},
  publisher={OSF},
  author={Garcia, Guilherme D and Guzzo, Natália B},
  year={2021},
  month={Apr}}

References

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Smith, R. 2007. “An Overview of the Tesseract OCR Engine.” In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), 2:629–33.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.