Home‎ > ‎Current Projects‎ > ‎

Secure Encrypted Search

Privacy enhanced Search

  1. Searching an encrypted database without disclosing the contents of the query.
  2. Providing access to a querier only to those parts of the database relevant to the query.
Document matching 
Facilitating two agencies which have collections of documents to determine the set of documents common to their collections, without exchanging the documents.
Encrypted Bloom Filters are used to solve both the above problems in an efficient manner.

Brief note on Bloom filters
Bloom filter is a probabilistic data structure which can decide membership of an element in a set. A bloom filter uses a bit-array and several independent hash functions (say k). 
Bloom filters can yield false positives, but no false negatives. The false positive rate of bloom filters depends on the size of the bit-array and number of hash functions.

When a new member needs to be added to a set, each of the k hash functions are applied on the member to get k positions in the bit array all of which are set to 1.
To test membership of an element, the above process is repeated to get k bit-array positions. If the bits at any of these positions are 0, then the element does not belong to the set.
Otherwise, the element may be present in the set (or the bits might have been set by the insertion of other elements).

Bloom filters for similarity
Bloom filters can be used to test similarity using the following simple method - perform a bitwise AND of 2 bloom filters, and divide the number of set bits by the number of set bits
in the bitwise OR of the 2 filters. The above method can be used to test similarity between 2 documents (document comparison), or similarity between a query and a document (for search).

Document Comparison and Tracking

Encrypted Bloom filters can also be used to perform blind document similarity testing. To accomplish this, we transform each document into a series of n-grams of either tokens or byte values. We then enter each n-gram into an encrypted Bloom filter; this filter is a representation of the document. To compare documents for similarity, we AND together their Bloom filters and perform a population count of 1 bits. Because of our transformation algorithms, we can compare documents from different collections. Thus, two different agencies can see if they have similar documents in their databases, without either disclosing the actual documents unless and until a match is established.

Evaluating features for use with encrypted Bloom filters:
Much of the work on document comparison is related to evaluation of different feature extraction strategies from documents that perform well in the context of encrypted bloom filters.
Among the approaches we are evaluating for feature extraction from documents are 
  1. Conventional NLP approaches which use n-grams of words
    1.  after removal of stop words and stemming 
    2. summary (most unique n-grams in a document)  obtained using Term Frequency * Inverse Document Frequency method.
  2. Using structural features of documents in MS Word format - such as tables, figure captions, etc