Short note on Fingerprinting.
Fingerprinting
An approach for comparing a large number of documents is based on the idea of fingerprinting documents. A document may be divided in all possible substrings of length. These substrings are called shingles. Based on the shingles we can define resemblance R (X, Y) and containment C (X, Y) between two documents X and Y as follows. We assume S (X) and S (Y) to be a set of shingles for documents X and Y respectively.
R (X,Y)= (S(X) S (Y)} (S (X) US (Y)}
and
C (X, Y) = {S(X)S (Y) {S (X)}
An algorithm like the following may now be used to find similar documents:
1. Collect all the documents that one wishes to compare
2. Choose a suitable shingle width and compute the shingles for each document
3. Compare the shingles for each pair of documents
4. Identify those documents that are similar
Comments
Post a Comment