ABSTRACT:
Search engine companies collect
the “database of intentions”, the histories of their users’ search queries.
These search logs are a gold mine for researchers. Search engine companies,
however, are wary of publishing search logs in order not to disclose sensitive
information. In this paper we analyze algorithms for publishing frequent
keywords, queries and clicks of a search log. We first show how methods that
achieve variants of k-anonymity are vulnerable to active attacks. We then
demonstrate that the stronger guarantee ensured by differential privacy
unfortunately does not provide any utility for this problem. Our paper
concludes with a large experimental study using real applications where we
compare ZEALOUS and previous work that achieves k-anonymity in search log
publishing. Our results show that ZEALOUS yields comparable utility to k−anonymity
while at the same time achieving much stronger privacy guarantees.
ARCHITECTURE:
EXISTING SYSTEM:
We show that existing proposals to
achieve anonymity in search logs are insufficient in the light of
attackers who can actively influence the search log. However, we show that it
is impossible to achieve good utility with differential privacy.
DISADVANTAGES:
Existing work on publishing frequent item
sets often only tries to achieve anonymity or makes strong assumptions about
the background knowledge of an attacker
PROPOSED SYSTEM:
The main focus of this paper is search
logs, our results apply to other scenarios as well. For example, consider a
retailer who collects customer transactions. Each transaction consists of a
basket of products together with their prices, and a time-stamp. In this case
ZEALOUS can be applied to publish frequently purchased products or sets of
products. This information can also be used in a recommender system or in a
market basket analysis to decide on the goods and promotions in a store.
ADVANTAGES:
Our results show that ZEALOUS yields
comparable utility to k−anonymity while at the same time achieving much
stronger privacy guarantees.
ALGORITHM:
PRIVACY–PRESERVING
ALGORITHM:
1.
For each user u select a set su of up to m distinct items from u’s search
history.
2.
Based on the selected items, create a histogram consisting of pairs (k, ck), where k denotes an
item and ck denotes the number of user’s u that have k in their search history
su. We call this histogram the original histogram.
3.
Delete from the histogram the pairs (k, ck) with count ck smaller
than τ.
4.
For each pair (k,
ck) in
the histogram, sample a random number ηk from the Laplace distribution Lap (λ) 4, and add ηk to the
count ck, resulting in a noisy count: ˜ck ← ck + ηk.
5.
Delete from the histogram the pairs (k, ˜ck).
6.
Publish the remaining items.
MODULES:
QUERY
SUBSTITUTION:
Query substitutions are suggestions to
rephrase a user query to match it to documents or advertisements that do not
contain the actual keywords of the query. Query substitutions can be applied in
query refinement, sponsored search, and spelling error correction. Query
substitution as a representative application for search quality. First, the
query is partitioned into subsets of keywords, called phrases, based on
their mutual information. Next, for
Each
phrase, candidate query substitutions are determined based on the distribution
of queries.
INDEX
CACHING:
Index caching, as a representative
application for search performance the index caching application does not
require high coverage because of its storage restriction. However, high
precision of the top-j most frequent items is necessary to determine which of
them to keep in memory. On the other hand, in order to generate many query
substitutions, a larger number of distinct queries and query pairs are
required. Thus should be set to a large value for index caching and to a small
value for query substitution. In our experiments we fixed the memory size to be
1 GB. Our inverted index stores the document posting list for each keyword
sorted according to their relevance which allows retrieving the documents in
the order of their relevance.
ITEM SET GENERATION AND
RANKING:
All of our results apply to the more
general problem of publishing frequent items / item sets / consecutive item
sets. Our results (positive as well as negative) can be applied more generally
to the problem of publishing frequent items or item sets. We then compare these
rankings with the rankings produced by the original search log which serve as
ground truth. To measure the quality of the query substitutions it does not
only compare the ranks of a substitution in the two rankings, but is also penalizes
highly relevant substitutions according to [q0, . . . , qj−1] that have a very low
rank in [q_0 , . . . , q_j−1].
SYSTEM REQUIREMENTS:
HARDWARE REQUIRED:
System : Pentium
IV 2.4 GHz
Hard Disk : 40
GB
Floppy Drive : 1.44
MB
Monitor : 15
VGA color
Mouse : Logitech.
Keyboard : 110 keys enhanced
RAM : 256
MB
SOFTWARE REQUIRED:
O/S : Windows
XP.
Language
: Asp.Net,
c#.
Data
Base : Sql
Server 2005.
No comments:
Post a Comment