Please use the following text to cite this item:
Lyding, Verena; et al., 2013, PAISÀ Corpus of Italian Web Text, Institute for Applied Linguistics, Eurac Research, http://hdl.handle.net/20.500.12124/3
dc.contributor.author: Lyding, Verena
dc.contributor.author: Stemle, Egon
dc.contributor.author: Borghetti, Claudia
dc.contributor.author: Brunello, Marco
dc.contributor.author: Castagnoli, Sara
dc.contributor.author: Dell’Orletta, Felice
dc.contributor.author: Dittmann, Henrik
dc.contributor.author: Lenci, Alessandro
dc.contributor.author: Pirrelli, Vito
dc.date.accessioned: 2018-05-29T11:06:34Z
dc.date.available: 2018-05-29T11:06:34Z
dc.date.issued: 2013-01
dc.description: The PAISÀ corpus is a large collection of Italian web texts licensed under Creative Commons (Attribution-ShareAlike and Attribution-NonCommercial-ShareAlike). It was created in the context of the PAISÀ project. The documents were collected in two ways.

Part of the corpus was constructed using a method inspired by the WaCky project: we created 50,000 word pairs by randomly combining terms from an Italian basic-vocabulary list and used the pairs as queries to the Yahoo! search engine to retrieve candidate pages. We limited hits to pages in Italian with a Creative Commons license of type CC-Attribution, CC-Attribution-ShareAlike, CC-Attribution-NonCommercial-ShareAlike, or CC-Attribution-NonCommercial. Pages wrongly tagged as CC-licensed were eliminated using a blacklist populated by manual inspection of earlier versions of the corpus. The retrieved pages were automatically cleaned using the KrdWrd system.

The remaining pages come from the Italian versions of various Wikimedia Foundation projects, namely Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity, and Wikivoyage. The official Wikimedia Foundation dumps were used, with text extracted by Wikipedia Extractor.

Once all materials were downloaded, the collection was filtered, discarding empty documents and documents containing fewer than 150 words. The corpus contains approximately 380,000 documents from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia and approximately 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services.
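The collection pipeline described above (random word-pair queries, blacklist filtering, and a minimum-length filter) is easy to prototype. The Python sketch below is a minimal illustration under stated assumptions: the vocabulary file name, the helper names, and the whitespace tokenization are hypothetical, and the original project used the Yahoo! search API and the KrdWrd system rather than the plain functions shown here.

import random

# Hypothetical input: any plain-text list with one Italian term per line.
VOCABULARY_FILE = "italian_basic_vocabulary.txt"
NUM_PAIRS = 50_000   # number of query pairs used in the original pipeline
MIN_WORDS = 150      # documents shorter than this were discarded

def load_vocabulary(path: str) -> list[str]:
    """Read one term per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def make_query_pairs(vocabulary: list[str], n_pairs: int, seed: int = 0) -> list[str]:
    """Randomly combine two distinct vocabulary terms into a search query."""
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        first, second = rng.sample(vocabulary, 2)  # two distinct terms
        pairs.add(f"{first} {second}")
    return sorted(pairs)

def is_blacklisted(url: str, blacklist: set[str]) -> bool:
    """Drop pages whose host appears on a manually curated blacklist."""
    host = url.split("/")[2] if "//" in url else url
    return host in blacklist

def long_enough(text: str, min_words: int = MIN_WORDS) -> bool:
    """Keep only documents with at least `min_words` whitespace-separated
    tokens (a rough stand-in for the project's actual word count)."""
    return len(text.split()) >= min_words

if __name__ == "__main__":
    vocabulary = load_vocabulary(VOCABULARY_FILE)
    queries = make_query_pairs(vocabulary, NUM_PAIRS)
    print(f"{len(queries)} queries, e.g. {queries[:3]}")

In the full pipeline, each query would be sent to the search engine (restricted to Italian, CC-licensed pages), returned URLs checked with is_blacklisted, pages boilerplate-cleaned, and the resulting texts kept only if long_enough holds.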
dc.identifier.uri: http://hdl.handle.net/20.500.12124/3
dc.language.iso: ita
dc.publisher: Institute for Applied Linguistics, Eurac Research
dc.relation.isreferencedby: http://aclweb.org/anthology/W14-0406
dc.rights: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.label: PUB
dc.rights.uri: https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source.uri: http://www.corpusitaliano.it
dc.subject: web corpus
dc.subject: language learning
dc.title: PAISÀ Corpus of Italian Web Text
dc.type: corpus
local.branding: CMC & WaC
local.contact.person: Corpus Manager, clarin@eurac.edu, Eurac Research CLARIN Centre (ERCC)
local.files.count: 4
local.files.size: 2538447018 bytes (≈2.5 GB)
local.has.files: yes
local.hasCMDI: false
local.hidden: false
local.language.name: Italian
local.size.info: 380000 pages
local.size.info: 250M words
local.sponsor: nationalFunds | N/A | Ministero dell'Istruzione, dell'Università e della Ricerca (MIUR) | Fondo per gli Investimenti della Ricerca di Base (FIRB)
metashare.ResourceInfo#ContentInfo.mediaType: text