D4.5 Final Report on the Corpus Acquisition & Annotation subsystem and its components (Research project reports)

Type
Label
  • D4.5 Final Report on the Corpus Acquisition & Annotation subsystem and its components (Research project reports) (literal)
Year
  • 2012-01-01T00:00:00+01:00 (literal)
Alternative label
  • Prokopidis, Prokopis [1]; Papavassiliou, Vassilis [1]; Toral, Antonio [2]; Poch Riera, Marc; Frontini, Francesca [3]; Rubino, Francesco [3]; Thurmair, Gregor (2012)
    D4.5 Final Report on the Corpus Acquisition & Annotation subsystem and its components
    (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#autori
  • Prokopidis, Prokopis [1]; Papavassiliou, Vassilis [1]; Toral, Antonio [2]; Poch Riera, Marc; Frontini, Francesca [3]; Rubino, Francesco [3]; Thurmair, Gregor (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#altreInformazioni
  • ID_PUMA: /cnr.ilc/2012-EC-002 (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#url
  • http://www.jotform.com/uploads/fabioaffeilc/30222975566357/225350067351490116/PANACEA (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#supporto
  • Other (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#affiliazioni
  • [1] ILSP "Athena" R.C., Greece; [2] Dublin City University, Ireland; [3] CNR-ILC, Pisa (literal)
Title
  • D4.5 Final Report on the Corpus Acquisition & Annotation subsystem and its components (literal)
Abstract
  • PANACEA WP4 targets the creation of a Corpus Acquisition and Annotation (CAA) subsystem for the acquisition and processing of monolingual and bilingual language resources (LRs). The CAA subsystem consists of tools that have been integrated as web services in the PANACEA platform for LR production. Two earlier deliverables, D4.2 "Initial functional prototype and documentation" (T13) and D4.4 "Report on the revised Corpus Acquisition & Annotation subsystem and its components" (T23), provided initial and updated documentation on this subsystem; this deliverable presents the final documentation of the subsystem as it evolved after the third development cycle of the project. The deliverable is structured as follows. Section 2 describes the Corpus Acquisition Component, i.e. the Focused Monolingual and Bilingual Crawlers (FMC/FBC). Section 3 details the final list of tools for corpus normalization (cleaning and de-duplication). Section 4 documents all NLP tools included in the subsystem. As a final report, this deliverable aggregates considerable parts of all previous WP4 deliverables. The main new additions include a) new functionalities for, among others, crawling strategy, de-duplication, and detection of parallel document pairs; and b) new NLP tools for syntactic analysis, named entity recognition, tweet processing and anonymization. (literal)
Product of
CNR author
Keyword set
