Lumbricus webis: a parallel and distributed crawling architecture for the Italian web (Technical reports/preprint/working paper)

Type
Label
  • Lumbricus webis: a parallel and distributed crawling architecture for the Italian web (Technical reports/preprint/working paper) (literal)
Year
  • 2010-01-01T00:00:00+01:00 (literal)
Alternative label
  • Felicioli C.; Geraci F.; Pellegrini M. (2010)
    Lumbricus webis: a parallel and distributed crawling architecture for the Italian web
    (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#autori
  • Felicioli C.; Geraci F.; Pellegrini M. (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#note
  • Technical report, 2010. (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#descrizioneSinteticaDelProdotto
  • ABSTRACT: Web crawlers have become popular tools for gathering large portions of the web that can be used for many tasks, from statistics to structural analysis of the web. Due to the amount of data and the heterogeneity of the tasks to manage, it is essential for crawlers to have a modular and distributed architecture. In this paper we describe Lumbricus webis (L.webis for short), a modular crawling infrastructure built to mine data from the .it ccTLD and the portions of the web reachable from this domain. Its purpose is to support advanced statistics gathering and advanced analytic tools on the content of the Italian web. This paper describes the architectural features of L.webis and its performance. L.webis can currently download a mid-sized ccTLD such as ".it" in about one week. (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#supporto
  • Other (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#affiliazioni
  • CNR-IIT, Pisa (literal)
Title
  • Lumbricus webis: a parallel and distributed crawling architecture for the Italian web (literal)