Abstract:-
MODERN information processing is
moving into a realm where we often need to process data that are pushed or pulled
from autonomous data sources through heterogeneous networks. Adaptive join algorithms have recently
attracted a lot of attention in emerging applications where data are provided
by autonomous data sources through heterogeneous network environments
Their main advantage over
traditional join techniques is that they can start producing join results as
soon as the first input tuples are available, thus, improving pipelining by
smoothing join result production and by masking source or network delays.
In this project, we first propose
Double Index NEsted-loops Reactive join (DINER), a new adaptive two-way join
algorithm for result rate maximization. DINER combines two key elements: an
intuitive flushing policy that aims to increase the productivity of in-memory
tuples in producing results during the online phase of the join, and a novel reentrant
join technique that allows the algorithm to rapidly switch between processing
in-memory and disk-resident tuples, thus, better exploiting temporary delays
when new data are not available
We then extend the applicability of
the proposed technique for a more challenging setup: handling more than two
inputs. Multiple Index NEsted-loop Reactive join (MINER) is a multiway join operator that inherits
its principles from DINER. Our experiments using real and synthetic data sets
demonstrate that DINER outperforms previous adaptive join algorithms in
producing result tuples at a significantly higher rate, while making better use
of the available memory. Our experiments also shows that in the presence of
multiple inputs, MINER manages to produce a high percentage of early results
Existing
System:-
All existing algorithms work in
three stages. During the Arriving phase, a newly arrived tuple is stored in
memory and it is matched against memory-resident tuples belonging to the other
relations participating in the join. Since the allocated memory for the join
operation is limited and often much smaller than the volume of the incoming
data, this results in tuple migration to disk. The decision on what to flush to
disk influences significantly the number of results produced during the
Arriving phase. The Arriving phase is suspended when all data sources are
temporarily blocked and a Reactive phase kicks in and starts joining part of
the tuples that have been flushed to disk. An important desideratum of this
phase is the prompt handover to the Arriving phase as soon as any of the data
sources restarts sending tuples. Each algorithm has a handover delay which depends
on the minimum unit of work that needs to be completed before switching phases.
This delay has not received attention in the past, but we show that it can
easily lead to input buffer overflow, lost tuples, and hence incorrect results.
When all sources complete the data transmission, a Cleanup phase is activated
and the tuples that were not joined in
the previous phases (due to flushing of tuples to disk) are brought from disk
and joined. Even if the overall strategy has been proposed for a multiway join, most existing algorithms are
limited to a two-way join. Devising an effective multiway adaptive join operator
is a challenge in which little progress has been mad.
Proposed System:-
In this project, we propose two new
adaptive join algorithms for output rate maximization in data processing over
autonomous distributed sources. The first algorithm, Double Index NEsted-loop
Reactive join (DINER) is applicable for two inputs, while Multiple Index
NEsted-loop Reactive join (MINER) can be used for joining an arbitrary number
of
input sources.
Implementation :-
DINER (Double Index NEsted-loops Reactive)
MODERN
information processing is moving into a realm where we often need to process
data that are pushed or pulled from autonomous data sources through
heterogeneous networks.
The key differences between DINER
and existing algorithms are 1) an intuitive flushing policy for the Arriving
phase that aims at maximizing the amount of overlap of the join attribute
values between memory resident tuples of the two relations and 2) a lightweight
Reactive phase that allows the algorithm to quickly
move into processing tuples that were flushed to disk when both data sources
block. The key idea of our flushing policy is to create and adaptively maintain
three nonoverlapping value regions that partition the join attribute domain,
measure the “join benefit” of each region at every flushing decision point, and
flush tuples from the region that doesn’t produce many join results in a way
that permits easy maintenance of the three-way partition of the values.
When tuples are flushed to disk they
are organized into sorted blocks using an efficient index structure, maintained
separately for each relation (thus, the part “Double Index” in DINER). This
optimization results in faster processing of these tuples during the Reactive
and Cleanup phases. The Reactive phase of DINER employs a symmetric nested loop
join process, combined with novel bookkeeping that allows the algorithm to
react to the unpredictability of the data sources. The fusion of the two
techniques allows DINER to make much more efficient use of available main
memory. We demonstrate in our experiments that DINER has a higher rate of join
result production and is much more adaptive to changes in the environment,
including changes in the value distributions of the streamed tuples and in
their arrival rates
MINER
MINER extends DINER to multiway
joins and it maintains all the distinctive and efficiency generating properties
of DINER. MINER maximizes the output rate by: 1) adopting an efficient probing
sequence for new incoming tuples which aims to reduce the processing overhead
by interrupting index lookups early for those tuples that do not participate in
the overall result; 2) applying an effective flushing policy that keeps in
memory the tuples that produce results, in a manner similar to DINER; and 3)
activating a Reactive phase when all inputs are blocked, which joins on-disk
tuples while keeping
the result correct and being able to
promptly hand over in the presence of new input. Compared to DINER, MINER faces
additional challenges namely: 1) updating and synchronizing the statistics for
each join attribute during the online
phase, and 2) more complicated bookkeeping in order to be able to guarantee
correctness and prompt handover during reactive phase. Weare able to generalize
the reactive phase of DINER for multiple inputs using a novel scheme that
dynamically changes the roles between relations
The contributions of this
project:-
. We introduce DINER a novel adaptive join
algorithm that supports both equality and range join predicates. DINER builds
on an intuitive flushing policy that aims at maximizing the productivity of tuples
that are kept in memory. .
DINER is the first algorithm to address the need to quickly respond to
bursts of arriving data during the Reactive phase. We propose a novel extension
to nested loops join for processing disk-resident tuples when both sources
block, while being able to swiftly respond to new data arrivals.
We
introduce MINER, a novel adaptive multiway join algorithm that maximizes the
output rate, designed for dealing with cases where data are held by multiple
remote sources. We provide a thorough
discussion of existing algorithms, including identifying some important limitations,
such as increased memory consumption because of their inability to quickly
switch to the Arriving phase and not being responsive enough when value
distributions change.
. We provide an extensive experimental
study of DINER including performance comparisons to existing adaptive join
algorithms and a sensitivity analysis. Our results demonstrate the superiority
of DINER in a variety of realistic scenarios. During the online phase of the
algorithm, DINER manages to produce up to three times more results compared to previous
techniques. The performance gains of DINER are realized when using both real
and synthetic data and are increased when fewer resources (memory) are given to
the algorithm. We also evaluate the performance of MINER, and we show that it
is still possible to obtain early a large percentage of results even in more
elaborated setups where data are provided through multiple inputs. Our
experimental study shows that the performance of MINER is 60 times higher
compared to the existing MJoin algorithm when a four-way star join is executed
in a constrained memory environment.
SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
System: Pentium IV 2.4 GHz
Hard Disk: 40GB
Ram: 512 MB
SOFTWARE REQUIREMENTS:
Microsoft visual studio 2008(ASP.NET,c#)
SQL server 2005
No comments:
Post a Comment