5 Matching Annotations
- Nov 2015
www.cs.umd.edu
Presentation summarizing an approach to duplicate web page detection, developed by a researcher while at Google in the early 2000s.
- Sep 2015
Given an LSH family H, the LSH scheme amplifies the gap between the high probability P1 and the low probability P2 by concatenating several functions
Useful recap of LSH
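To make the amplification in the quote concrete, here is a minimal sketch. It assumes the concatenated functions are drawn independently from H, so that a concatenation of k functions collides only if all k collide. The values of P1, P2 and k are made up for illustration and the function name is hypothetical.

```python
def concat_collision_prob(p: float, k: int) -> float:
    """Collision probability after concatenating k independent hash
    functions, each of which collides with probability p."""
    return p ** k


# Illustrative values only (not taken from the paper).
P1, P2 = 0.8, 0.4   # per-function collision probability: near vs. far pairs
k = 4               # number of functions concatenated

ratio_before = P1 / P2
ratio_after = concat_collision_prob(P1, k) / concat_collision_prob(P2, k)

# The ratio between the two probabilities grows from P1/P2 to (P1/P2)**k.
print(f"gap before: {ratio_before:.1f}x, after: {ratio_after:.1f}x")
```

Both probabilities shrink under concatenation, but P2 shrinks much faster, which is the sense in which the gap is amplified.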
Recent survey paper for hashing-based approaches to similarity search
www.csd.uoc.gr
Section 9 of this paper has a very useful overview of previous work that is worth reading.
We used the following publicly available real datasets in the experiment
Datasets used are DBLP, ENRON, and UNIREF-4GRAM. All are small (<1M records) in web terms and, I would guess, all have small document sizes.
Given a lengthy paper, one could divide it into smaller documents (1 doc === 1 page) and calculate signatures on a per-page basis. This could bound the search time by limiting the number of pages that need to be rendered to text before the lookup process can start.
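A rough sketch of the per-page idea, under several assumptions that are mine rather than the paper's: pages arrive as an iterable of already-rendered text, signatures are MinHash over word shingles, and lookup is an exact match on the whole signature in a dict (a real system would use LSH bands or a similarity threshold). All function names are hypothetical.

```python
import hashlib
from itertools import islice
from typing import Dict, Iterable, Iterator, Optional, Set, Tuple

Signature = Tuple[int, ...]


def shingles(text: str, size: int = 4) -> Set[str]:
    """Overlapping word n-grams; a short page becomes a single shingle."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def minhash(items: Set[str], num_hashes: int = 16) -> Signature:
    """MinHash signature: the minimum of each salted hash over the set."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # one salt per hash function
        sig.append(min(
            (int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big") for s in items),
            default=0,  # empty page
        ))
    return tuple(sig)


def page_signatures(pages: Iterable[str],
                    num_hashes: int = 16) -> Iterator[Tuple[int, Signature]]:
    """Lazily yield (page_number, signature), treating 1 page as 1 doc."""
    for page_no, text in enumerate(pages, start=1):
        yield page_no, minhash(shingles(text), num_hashes)


def first_match(pages: Iterable[str], index: Dict[Signature, str],
                max_pages: int = 3) -> Optional[Tuple[str, int]]:
    """Look up pages one at a time, consuming at most max_pages of them."""
    for page_no, sig in page_signatures(islice(pages, max_pages)):
        if sig in index:
            return index[sig], page_no
    return None
```

Because `page_signatures` is a generator and `first_match` stops at the first hit, a lazy `pages` iterable only needs to render as many pages as the lookup actually consumes, capped by `max_pages`.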