Short note on Fingerprinting.

 Fingerprinting

An approach for comparing a large number of documents is based on the idea of fingerprinting documents. A document may be divided in all possible substrings of length. These substrings are called shingles. Based on the shingles we can define resemblance R (X, Y) and containment C (X, Y) between two documents X and Y as follows. We assume S (X) and S (Y) to be a set of shingles for documents X and Y respectively.

R (X,Y)= (S(X) S (Y)} (S (X) US (Y)}

and

C (X, Y) = {S(X)S (Y) {S (X)}

An algorithm like the following may now be used to find similar documents:

1. Collect all the documents that one wishes to compare

2. Choose a suitable shingle width and compute the shingles for each document

3. Compare the shingles for each pair of documents

4. Identify those documents that are similar

Comments