Abstract:
The project aims to develop integrated knowledge management tools and techniques for communities of users on the Web that support a) the semi-automatic organization of knowledge based on collaborative participation mechanisms, b) a high degree of freedom in defining the structure used to represent knowledge, c) personalized retrieval and visualization of knowledge by means of the automatic generation of documents adapted to user needs and other runtime conditions, and d) the automatic recognition of users' identity and characteristics, based on biometric analysis and data mining techniques. The different modules of the environment to be developed will communicate with each other through a shared ontology that captures the concepts and categories by which the knowledge of a given domain is described.
The project follows the Semantic Web perspective, which proposes a better organized and structured World Wide Web by introducing explicit semantic knowledge about the resources available on the Web, in order to facilitate resource discovery, sharing and integration. The research proposed here builds on tools and results previously developed by our research team, which has wide experience in the areas involved in this proposal. The techniques to be developed in the current project will be validated through the development of a corporate knowledge management application for the institution to which our team belongs.
System Analysis:
Existing System:
The application of data analysis techniques to the World Wide Web, referred to as Web analysis, has been the focus of several recent research projects and papers. However, there is no established vocabulary, which leads to confusion when comparing research efforts. The term Web analysis has been used in two distinct ways. The first, called Web content analysis in this paper, is the process of information discovery from sources across the World Wide Web. The second, called Web usage mining, is the process of analyzing user browsing and access patterns. In this paper we define Web mining and present an overview of the various research issues, techniques, and development efforts. We briefly describe WEBMINER, a system for Web usage mining, and conclude the paper by listing open research issues.
Proposed System:
Existing visual Web mining tools do not indicate which data mining algorithms are used, nor do they provide effective graphical visualizations of the results obtained. Visual Web mining techniques can be used to determine typical navigation patterns in an organizational web site. The process of combining data analysis and information visualization techniques in order to discover useful information about web usage patterns is called visual Web mining. The goal of this paper is to discuss the development of a data analysis prototype called web patterns, which allows the user to effectively visualize web usage patterns.
MODULES:
Data Cleaning
User Identification
Session Identification
Session Filtering
Data Summarization

Data Cleaning:
The data cleaning process removes requests for resources that are not analyzed, such as .jpg and .gif image files and robot requests, from the server log file; it takes the raw log data as input and produces the cleaned log data as output. Data cleaning is the first step performed in the Web usage mining process. Some low-level data integration tasks may also be performed at this stage, such as combining multiple logs, incorporating referrer logs, etc. After data cleaning, the log entries must be partitioned into logical clusters using one or a series of transaction identification modules.
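As an illustration, the following is a minimal C# sketch of this filtering step, assuming the raw log is a plain-text file with one request per line; the file names, the extension list and the simple keyword test for robot requests are assumptions made for the example, not part of the actual system.

using System;
using System.IO;
using System.Linq;

class DataCleaning
{
    // Resource types and agent keywords treated as "not analyzed" (illustrative values).
    static readonly string[] SkipExtensions = { ".jpg", ".jpeg", ".gif", ".png", ".css", ".ico" };
    static readonly string[] RobotHints = { "bot", "crawler", "spider" };

    static bool ContainsToken(string line, string token)
    {
        return line.IndexOf(token, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    static void Main()
    {
        // Hypothetical file names: raw log in, cleaned log out.
        var cleaned = File.ReadLines("access_raw.log")
            .Where(line => !SkipExtensions.Any(ext => ContainsToken(line, ext)))
            .Where(line => !RobotHints.Any(hint => ContainsToken(line, hint)));

        File.WriteAllLines("access_cleaned.log", cleaned);
    }
}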
The goal of transaction identification is to create meaningful clusters of references for each user. The task of identifying transactions is one of either dividing a large transaction into multiple smaller ones or merging small transactions into fewer larger ones. The input and output transaction formats match, so that any number of modules can be combined in any order, as the data analyst sees fit. Once the domain-dependent data transformation phase is completed, the resulting transaction data must be formatted to conform to the data model of the appropriate data mining task.
For instance, the format of the data for the association rule discovery task may be different from the format necessary for mining sequential patterns. Finally, a query mechanism will allow the user (analyst) to exert more control over the discovery process by specifying various constraints. More details are given in the description of the WEBMINER system.
User Identification:
The user identification process identifies users from the cleaned log data. It assigns a user ID to each unique IP address; if entries with the same IP address report a different operating system or user agent, a different user ID is assigned.
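A minimal sketch of this assignment in C# is given below; keying on the combination of IP address and user agent string is the essential idea, while the concrete method and field names are only illustrative.

using System;
using System.Collections.Generic;

class UserIdentification
{
    // Maps an (IP address, user agent) pair to a numeric user id.
    // Entries with the same IP but a different OS/agent string get a new id.
    static readonly Dictionary<string, int> UserIds = new Dictionary<string, int>();

    static int GetUserId(string ipAddress, string userAgent)
    {
        string key = ipAddress + "|" + userAgent;
        int id;
        if (!UserIds.TryGetValue(key, out id))
        {
            id = UserIds.Count + 1;   // next unused id
            UserIds[key] = id;
        }
        return id;
    }

    static void Main()
    {
        Console.WriteLine(GetUserId("10.0.0.1", "Mozilla/5.0 (Windows)")); // 1
        Console.WriteLine(GetUserId("10.0.0.1", "Mozilla/5.0 (Linux)"));   // 2: same IP, different agent
        Console.WriteLine(GetUserId("10.0.0.1", "Mozilla/5.0 (Windows)")); // 1 again
    }
}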
The behavioral data were obtained from the Web server access logs of both School Web and the German Education Server, in the form of a flat file containing one line per page request. The requests were listed in the commonly used Extended Log Format, which contains information on the host (IP address), the requested URL, the date and time of the transaction, and its status or error code (e.g., 200/success, 404/file not found, 301/permanent redirect). Since log files give rise to a number of uncertainties (e.g., Berendt & Spiliopoulou, 2000), the data needed to be cleaned and prepared before analysis.
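As a sketch of how such a log line can be reduced to the four fields used here, consider the following C# fragment; the space-delimited layout and the field order (date, time, host, URL, status) are assumptions that would have to be matched to the actual server's log configuration.

using System;
using System.Globalization;

// Holds only the fields used in the analysis: host, URL, timestamp and status.
class LogEntry
{
    public string Host;
    public string Url;
    public DateTime Timestamp;
    public int Status;

    // Assumes a space-delimited line in the order: date time host url status.
    // The real field order depends on how the server's log format is configured.
    public static LogEntry Parse(string line)
    {
        string[] f = line.Split(' ');
        return new LogEntry
        {
            Timestamp = DateTime.ParseExact(f[0] + " " + f[1],
                            "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture),
            Host = f[2],
            Url = f[3],
            Status = int.Parse(f[4])
        };
    }
}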
First, hits needed to be transformed into page impressions by filtering out, from the Web server log files, all requests for GIF or other image files. Second, the participants had been instructed to turn off their proxies, so their individual computers’ current IP numbers appeared in the log files (the IP numbers used in the experiment were known). To segment the sequence of requests from one IP number into sessions of different participants, Web pages controlling the experiment were employed; these asked the participant to “log into” and “log out of” the experiment. Finally, the participants were asked to empty their browser’s cache and turn off caching prior to starting the experiment, to ensure that the whole sequence of pages seen by the user, including revisits to previously requested pages, was recorded in the log file.
Session Filtering:
The session filtering process takes the session data as input and removes useless sessions, that is, sessions of length one. In this way we obtained the full sequence of pages visited on the School Web or German Education Server for each participant. Intermediate visits to other servers could only be traced via the log file referrer field if the return move was initiated by a hyperlink on the remote page or if a redirect was employed. This was observed only seldom, because the experimental tasks did not require searching for information outside the respective servers (if it did occur, it was not recorded, because participants generally returned via the back button).
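A minimal sketch of this filter in C# follows; representing a session as an ordered list of page URLs is an assumption made only for the example.

using System.Collections.Generic;
using System.Linq;

class SessionFilter
{
    // A session is simply the ordered list of pages requested by one user.
    // Sessions of length one carry no navigation information and are dropped.
    static List<List<string>> Filter(List<List<string>> sessions)
    {
        return sessions.Where(s => s.Count > 1).ToList();
    }
}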
Measures. The following measures were used.
1. Extent of navigation. This was measured by the number of nodes and the number of edges.
2. Search strategies. We chose the branching factor of the home page as our indicator of how broadly a user searched, that is, of how much the overall strategy resembled breadth-first search. The branching factor equals the number of different new nodes, that is, branches opened up, directly after the home page. Repeat visits were not counted; that is, in a path [home page, X, Y, home page, Y], only X counts toward the branching factor, since this back-and-forth movement is interpreted as landmark use rather than as the examination of new ways of answering the question. In contrast, a depth-first navigation strategy explores a path that leads the user away from the start page until the end of that path is reached, or at least until it becomes obvious that this will not lead to the goal. We considered long paths from the home page to be indicative of such a strategy and chose the average depth of exploration as our measure of a depth-first strategy: the number of forward moves divided by the number of branches emanating from the home page.
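The following C# sketch computes both measures for a single navigation path; the path representation, the identification of the home page by its URL, and the way forward moves are supplied are all illustrative assumptions.

using System;
using System.Collections.Generic;

class NavigationMeasures
{
    // Branching factor: the number of new (not previously visited) nodes
    // entered directly from the home page; repeat visits do not count.
    static int BranchingFactor(List<string> path, string homePage)
    {
        var visited = new HashSet<string>();
        int branches = 0;
        for (int i = 0; i < path.Count; i++)
        {
            bool isNew = !visited.Contains(path[i]);
            visited.Add(path[i]);
            if (isNew && i > 0 && path[i - 1] == homePage && path[i] != homePage)
                branches++;
        }
        return branches;
    }

    // Average depth of exploration: forward moves divided by the number of
    // branches emanating from the home page.
    static double AverageDepth(int forwardMoves, int branches)
    {
        return branches == 0 ? 0.0 : (double)forwardMoves / branches;
    }

    static void Main()
    {
        // The example path from the text: [home page, X, Y, home page, Y].
        var path = new List<string> { "home", "X", "Y", "home", "Y" };
        Console.WriteLine(BranchingFactor(path, "home")); // prints 1: only X counts
        Console.WriteLine(AverageDepth(3, 1));            // e.g. 3 forward moves over 1 branch
    }
}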
Session Identification:
The session identification process takes the user identification file as input and, for each user ID, extracts the pages accessed. If the time difference between subsequent page accesses exceeds the threshold time, a new session is opened; otherwise the access is added to the current session.
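A minimal sketch of this timeout rule in C# is shown below; the PageAccess class is an assumption for the example, and the commonly used 30-minute threshold mentioned in the usage note is not a value stated in the text.

using System;
using System.Collections.Generic;

class PageAccess
{
    public DateTime Time;
    public string Url;
}

class SessionIdentification
{
    // Splits one user's time-ordered page accesses into sessions: a new session
    // is opened whenever the gap between two subsequent requests exceeds the threshold.
    static List<List<PageAccess>> Sessionize(List<PageAccess> accesses,
                                             TimeSpan threshold)
    {
        var sessions = new List<List<PageAccess>>();
        List<PageAccess> current = null;
        DateTime last = DateTime.MinValue;

        foreach (PageAccess a in accesses)
        {
            if (current == null || a.Time - last > threshold)
            {
                current = new List<PageAccess>();   // threshold exceeded: open a new session
                sessions.Add(current);
            }
            current.Add(a);
            last = a.Time;
        }
        return sessions;
    }
}

It would be invoked once per user ID, for example as Sessionize(userAccesses, TimeSpan.FromMinutes(30)).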
Data Summarization:
The data summarization process extracts fields from the session data and displays the results graphically. It shows visits per day, visits per hour, top browsers, operating systems, top referrers, etc.
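The aggregation behind those charts can be sketched in C# as follows; the Visit class and the particular fields summarized are illustrative assumptions, and the graphical display itself is omitted.

using System;
using System.Collections.Generic;
using System.Linq;

class Visit
{
    public DateTime Time;
    public string Browser;
    public string Referrer;
}

class DataSummarization
{
    // Aggregates session data into the counts that are later charted:
    // visits per day, visits per hour of day, and the most frequent browsers.
    static void Summarize(List<Visit> visits)
    {
        var perDay = visits.GroupBy(v => v.Time.Date)
                           .ToDictionary(g => g.Key, g => g.Count());
        var perHour = visits.GroupBy(v => v.Time.Hour)
                            .ToDictionary(g => g.Key, g => g.Count());
        var topBrowsers = visits.GroupBy(v => v.Browser)
                                .OrderByDescending(g => g.Count())
                                .Take(5);

        foreach (var day in perDay)
            Console.WriteLine("{0:yyyy-MM-dd}: {1} visits", day.Key, day.Value);
        foreach (var hour in perHour)
            Console.WriteLine("{0:00}:00  {1} visits", hour.Key, hour.Value);
        foreach (var b in topBrowsers)
            Console.WriteLine("{0}: {1}", b.Key, b.Count());
    }
}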
Web usage data
is collected in various ways, each mechanism collecting attributes relevant for
its purpose. There is a need to pre-process the data to make it easier to mine
for knowledge. Specifically, we believe that issues such as instrumentation and
data collection, data integration and transaction identification need to be
addressed.
Clearly, improved data quality improves the quality of any analysis performed on it. A problem in the Web domain is the inherent conflict between the analysis needs of the analysts (who want more detailed usage data collected) and the privacy needs of users (who want as little data collected as possible).
This has led to the development of cookie files on one side and cache busting on the other. The emerging OPS standard for collecting profile data may be a compromise on what can and will be collected. However, it is not clear how much compliance with it can be expected. Hence, there will be a continual need to develop better instrumentation and data collection techniques, based on whatever is possible and allowable at any point in time. Portions of Web usage data exist in sources as diverse as Web server logs, referral logs, registration files, and index server logs. Intelligent integration and correlation of information from these diverse sources can reveal usage information which may not be evident from any one of them.
SYSTEM REQUIREMENTS:
Hardware Interfaces:
Processor Type : Pentium IV
Speed : 2.4 GHz
RAM : 256 MB
Hard Disk : 20 GB
Software Interfaces:
Operating System : Windows 2000
Programming Package : .NET
Server : IIS Server
Tools : Flowchart .NET