An Incremental Clustering Scheme for Data De-duplication (Journal article)

Type
Label
  • An Incremental Clustering Scheme for Data De-duplication (Journal article) (literal)
Year
  • 2010-01-01T00:00:00+01:00 (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#doi
  • 10.1007/s10618-009-0155-0 (literal)
Alternative label
  • Gianni Costa; Giuseppe Manco; Riccardo Ortale (2010)
    An Incremental Clustering Scheme for Data De-duplication
    in Data Mining and Knowledge Discovery; SPRINGER, DORDRECHT (Netherlands)
    (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#autori
  • Gianni Costa; Giuseppe Manco; Riccardo Ortale (literal)
Start page
  • 152 (literal)
End page
  • 187 (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#url
  • http://www.springerlink.com/content/k73p346831034777/ (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#numeroVolume
  • 20-0 (literal)
Journal
Notes
  • Scopus (literal)
  • Google Scholar (literal)
  • ISI Web of Science (WOS) (literal)
  • DBLP (literal)
Http://www.cnr.it/ontology/cnr/pubblicazioni.owl#affiliazioni
  • ICAR-CNR; ICAR-CNR; ICAR-CNR (literal)
Title
  • An Incremental Clustering Scheme for Data De-duplication (literal)
Abstract
  • We propose an incremental technique for discovering duplicates in large databases of textual sequences, i.e., syntactically different tuples that refer to the same real-world entity. The problem is approached from a clustering perspective: given a set of tuples, the objective is to partition them into groups of duplicate tuples. Each newly arrived tuple is assigned to an appropriate cluster via nearest-neighbor classification. This is achieved by means of a suitable hash-based index that maps any tuple to a set of indexing keys and assigns tuples with high syntactic similarity to the same buckets. Hence, the neighbors of a query tuple can be identified efficiently by retrieving only those tuples that appear in the buckets associated with the query tuple itself, without scanning the entire database. Two alternative schemes for computing indexing keys are discussed and compared. An extensive experimental evaluation on both synthetic and real data shows the effectiveness of our approach. (literal)
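The bucket-based neighbor lookup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the indexing keys here are plain character trigrams, the similarity is Jaccard overlap of key sets, and the names `qgrams`, `jaccard`, `IncrementalDeduplicator`, and the `threshold` parameter are all hypothetical choices made for illustration. Neither of the paper's two key-generation schemes is reproduced.

```python
from collections import defaultdict


def qgrams(text, q=3):
    """Return the set of character q-grams of a whitespace-normalized string.

    Assumption: q-grams stand in for the paper's indexing keys.
    """
    s = " ".join(text.lower().split())
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity between two key sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class IncrementalDeduplicator:
    """Assigns each arriving tuple to the cluster of its nearest indexed neighbor."""

    def __init__(self, threshold=0.5, q=3):
        self.threshold = threshold          # hypothetical similarity cut-off
        self.q = q
        self.buckets = defaultdict(set)     # indexing key -> ids of tuples in that bucket
        self.keys = []                      # tuple id -> its set of indexing keys
        self.cluster_of = []                # tuple id -> cluster id
        self.next_cluster = 0

    def add(self, text):
        """Index a new tuple and return the id of the cluster it joins."""
        keys = qgrams(text, self.q)

        # Candidate neighbors: only tuples sharing at least one bucket
        # with the query, so the full database is never scanned.
        candidates = set()
        for k in keys:
            candidates |= self.buckets.get(k, set())

        # Nearest neighbor among the candidates.
        best_id, best_sim = None, 0.0
        for cid in sorted(candidates):
            sim = jaccard(keys, self.keys[cid])
            if sim > best_sim:
                best_id, best_sim = cid, sim

        if best_id is not None and best_sim >= self.threshold:
            cluster = self.cluster_of[best_id]   # join the neighbor's cluster
        else:
            cluster = self.next_cluster          # no close neighbor: open a new cluster
            self.next_cluster += 1

        # Insert the new tuple into the index.
        tid = len(self.keys)
        self.keys.append(keys)
        self.cluster_of.append(cluster)
        for k in keys:
            self.buckets[k].add(tid)
        return cluster
```

Calling `add` on two spellings of the same record, such as `"Giuseppe Manco, ICAR-CNR"` and `"Giuseppe  Manco ICAR-CNR"`, places them in the same cluster under these assumptions, while an unrelated string opens a new one. Using every trigram as a key keeps the sketch short but produces large buckets; a real index would use more selective keys.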
Publisher
Product of
CNR author
Keyword set
