A Multi-Sequence Alignment Algorithm for Web Template Detection (Conference proceedings paper)

Type
Label
  • A Multi-Sequence Alignment Algorithm for Web Template Detection (Conference proceedings paper) (literal)
Year
  • 2011-01-01T00:00:00+01:00 (literal)
Alternative label
  • Geraci F. [1], Maggini M. [2] (2011)
    A Multi-Sequence Alignment Algorithm for Web Template Detection
    in KDIR 2011 - International Conference on Knowledge Discovery and Information Retrieval, Paris, France, 26-29 October, 2011
    (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#autori
  • Geraci F. [1], Maggini M. [2] (literal)
Start page
  • 121 (literal)
End page
  • 128 (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#altreInformazioni
  • ID_PUMA: cnr.iit/2011-A2-044 (literal)
Note
  • Scopu (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#affiliazioni
  • [1] CNR-IIT, Pisa, Italy; [2] Dip. Ingegneria dell'Informazione, Università di Siena, Italy (literal)
Title
  • A Multi-Sequence Alignment Algorithm for Web Template Detection (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#isbn
  • 978-989-8425-79-9 (literal)
Abstract
  • Nowadays most Web pages are automatically assembled by content management systems or editing tools that apply a fixed template to give a uniform structure to all the documents belonging to the same site. The template usually contains side information, such as improved graphics, navigation bars and menus, banners and advertisements, that is aimed at improving the users' browsing experience but may hinder tools for the automatic processing of Web documents. In this paper, we present a novel template removal technique that exploits a sequence alignment algorithm from bioinformatics and is able to automatically extract the template from a quite small sample of pages from the same site. The algorithm detects the common structure of HTML tags among pairs of pages and merges the partial hypotheses using a binary tree consensus schema. The experimental results show that the algorithm is able to attain good precision and recall in the retrieval of the real template structure using just 16 sample pages from the site. Moreover, the positive impact of the template removal technique is shown on a Web page clustering task. (literal)
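Illustrative sketch
  • The pipeline outlined in the abstract (align the HTML tag sequences of pairs of pages, then merge the partial results pairwise up a binary tree) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' method: the abstract does not specify the alignment algorithm, so a plain longest-common-subsequence alignment is swapped in, and the names `tag_sequence`, `common_subsequence` and `template_consensus` are hypothetical.

```python
import re

# Matches the name of an opening or closing tag; comments and doctypes are skipped.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)")


def tag_sequence(html):
    """Reduce a page to its sequence of tag names (closing tags prefixed with '/')."""
    return [("/" if slash else "") + name.lower()
            for slash, name in TAG_RE.findall(html)]


def common_subsequence(a, b):
    """Longest common subsequence of two tag sequences.

    Assumption: plain LCS stands in for the paper's alignment algorithm.
    """
    n, m = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table forward to recover one optimal subsequence.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def template_consensus(sequences):
    """Merge per-page tag sequences pairwise, level by level, as in a binary tree."""
    level = list(sequences)
    if not level:
        raise ValueError("at least one page is required")
    while len(level) > 1:
        merged = [common_subsequence(level[k], level[k + 1])
                  for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an unpaired sequence moves up unchanged
            merged.append(level[-1])
        level = merged
    return level[0]
```

  • With a sample size that is a power of two, such as the 16 pages mentioned in the abstract, the merge tree is perfectly balanced: 16 sequences reduce to 8, 4, 2 and finally a single template hypothesis. Calling `template_consensus([tag_sequence(p) for p in pages])` on a list of HTML strings returns the tags shared, in order, by all of them.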