
Sunday 29 January 2012

Visual Data Mining of Web Navigational Data


                                       
Abstract:
The project aims at the development of integrated knowledge management tools and techniques for communities of users on the Web that support: a) the semi-automatic organization of knowledge based on collaborative participation mechanisms; b) a high degree of freedom in defining the structure used to represent knowledge; c) personalized retrieval and visualization of knowledge by means of automatically generated documents adapted to user needs and other runtime conditions; and d) the automatic recognition of users' identity and characteristics, based on biometric analysis and data mining techniques. The different modules of the environment to be developed will communicate with each other through a shared ontology that captures the concepts and categories by which the knowledge of a given domain is described.

The project follows the Semantic Web perspective, which proposes a better organized and structured World Wide Web through the introduction of explicit semantic knowledge about available Web resources, in order to facilitate resource discovery, sharing and integration. The research proposed here builds on previously developed tools and results achieved by our research team, which has wide experience in the areas involved in this proposal. The techniques to be developed in the current project will be validated through the development of a corporate knowledge management application for the institution to which our team belongs.
System Analysis:
Existing System:

Application of data mining techniques to the World Wide Web, referred to as Web mining, has been the focus of several recent research projects and papers. However, there is no established vocabulary, which leads to confusion when comparing research efforts. The term Web mining has been used in two distinct ways. The first, called Web content mining in this paper, is the process of information discovery from sources across the World Wide Web. The second, called Web usage mining, is the process of analyzing user browsing and access patterns. In this paper we define Web mining, present an overview of the various research issues, techniques, and development efforts, briefly describe WEBMINER, a system for Web usage mining, and conclude by listing open research issues.

Proposed System:
Existing visual web mining tools do not indicate which data mining algorithms are used, nor do they provide effective graphical visualizations of the results obtained. Visual web mining techniques can be used to determine typical navigation patterns in an organizational web site. The process of combining data analysis and information visualization techniques in order to discover useful information about web usage patterns is called visual web mining. The goal of this paper is to discuss the development of a data analysis prototype called Web Patterns, which allows the user to effectively visualize web usage patterns.
MODULES:
Data Cleaning
User Identification
Session Identification
Session Filtering
Data Summarization

Data Cleaning:
The data cleaning process removes resource requests that are not analyzed, such as requests for jpg and gif image files and robot requests, from the server log file; it takes the raw log data as input and produces the cleaned log data as output. Data cleaning is the first step performed in the Web usage mining process. Some low-level data integration tasks may also be performed at this stage, such as combining multiple logs, incorporating referrer logs, etc. After data cleaning, the log entries must be partitioned into logical clusters using one or a series of transaction identification modules.
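As an illustration of this cleaning step, the sketch below filters image and robot requests out of a raw access log. It is a minimal Python sketch (the project itself targets .NET); the file names, the field position of the URL, and the filter patterns are assumptions that would have to be adapted to the actual server log format.

import re

# Requests usually excluded from usage analysis: image/style resources and
# known robot signatures (hypothetical patterns; adjust to the actual site).
IMAGE_PATTERN = re.compile(r"\.(jpg|jpeg|gif|png|css)(\?.*)?$", re.IGNORECASE)
ROBOT_PATTERN = re.compile(r"robots\.txt|bot|crawler|spider", re.IGNORECASE)

def clean_log(raw_path, clean_path):
    """Read a raw access log and keep only the analyzable page requests."""
    with open(raw_path) as raw, open(clean_path, "w") as out:
        for line in raw:
            fields = line.split()
            if len(fields) < 7:                  # skip malformed entries
                continue
            url = fields[6]                      # requested URL in Common/Extended Log Format
            if IMAGE_PATTERN.search(url) or ROBOT_PATTERN.search(line):
                continue                         # drop image files and robot traffic
            out.write(line)

clean_log("access.log", "access_clean.log")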
The goal of transaction identification is to create meaningful clusters of references for each user. The task of identifying transactions is one of either dividing a large transaction into multiple smaller ones or merging small transactions into fewer larger ones. The input and output transaction formats match, so that any number of modules can be combined in any order, as the data analyst sees fit. Once the domain-dependent data transformation phase is completed, the resulting transaction data must be formatted to conform to the data model of the appropriate data mining task.
For instance, the format of the data for the association rule discovery task may be different from the format necessary for mining sequential patterns. Finally, a query mechanism allows the user (analyst) to exercise more control over the discovery process by specifying various constraints; further details are given in the description of the WEBMINER system.
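To make the idea of interchangeable transaction identification modules concrete, the following Python sketch (illustrative only; the simple list-of-URLs transaction format and the module names are assumptions) defines two modules, one that divides long transactions and one that merges short ones. Because each module maps transactions to transactions, the analyst can chain them in any order.

from typing import Callable, List

# A transaction is modeled as an ordered list of page URLs (assumed format).
Transaction = List[str]
Module = Callable[[List[Transaction]], List[Transaction]]

def divide_long(max_len: int) -> Module:
    """Module that splits transactions longer than max_len into smaller ones."""
    def run(transactions):
        out = []
        for t in transactions:
            out.extend(t[i:i + max_len] for i in range(0, len(t), max_len))
        return out
    return run

def merge_short(min_len: int) -> Module:
    """Module that merges consecutive transactions until they reach min_len."""
    def run(transactions):
        out, buffer = [], []
        for t in transactions:
            buffer.extend(t)
            if len(buffer) >= min_len:
                out.append(buffer)
                buffer = []
        if buffer:
            out.append(buffer)
        return out
    return run

# Modules share the same input/output format, so they can be combined freely.
pipeline = [divide_long(10), merge_short(3)]
transactions = [["/home", "/courses", "/exams"], ["/contact"]]
for module in pipeline:
    transactions = module(transactions)
print(transactions)   # -> [['/home', '/courses', '/exams'], ['/contact']]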
User Identification:
The user identification process identifies individual users from the cleaned log data. It assigns a user id to each unique IP address; if requests from the same IP address carry a different operating system or user agent, a different user_id is assigned.
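A minimal Python sketch of this assignment rule follows (the project itself targets .NET; the 'ip' and 'agent' field names are assumptions about the records produced by the data cleaning step).

def identify_users(entries):
    """Assign a user_id to each unique (IP address, user agent / OS) pair."""
    user_ids = {}
    for entry in entries:
        key = (entry["ip"], entry["agent"])
        if key not in user_ids:
            user_ids[key] = len(user_ids) + 1   # new user_id for a new IP/agent pair
        entry["user_id"] = user_ids[key]
    return entries

log = [
    {"ip": "10.0.0.1", "agent": "Mozilla/4.0 (Windows 2000)"},
    {"ip": "10.0.0.1", "agent": "Mozilla/4.0 (Linux)"},   # same IP, different agent
]
print([e["user_id"] for e in identify_users(log)])        # -> [1, 2]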
The behavioral data were obtained from the Web server access logs of both the School Web and the German Education Server, in the form of a flat file containing one line per page request. The requests were listed in the commonly used Extended Log Format, which contains information on the host (IP address), the requested URL, the date and time of the transaction, and its status or error code (e.g., 200/success, 404/file not found, 301/permanent redirect). Since log files give rise to a number of uncertainties (e.g., Berendt & Spiliopoulou, 2000), the data needed to be cleaned and prepared before analysis.
First, hits needed to be transformed into page impressions by filtering out, from the Web server log files, all requests for GIF or other image files. Second, the participants had been instructed to turn off their proxies, so that their individual computers' current IP numbers appeared in the log files (the IP numbers used in the experiment were known). To segment the sequence of requests from one IP number into sessions of different participants, Web pages controlling the experiment were employed: these asked the participant to "log into" and "log out of" the experiment. Finally, the participants were asked to empty their browser's cache and turn off caching prior to starting the experiment, to ensure that the whole sequence of pages seen by the user, including revisits to previously requested pages, was recorded in the log file.
Session Filtering:
The session filtering process takes the session data as input and removes useless sessions, i.e. sessions of length one. Thus, we obtained the full sequence of pages visited on the School Web or the German Education Server for each participant. Intermediate visits to other servers could only be traced via the log file's referrer field if the return move was initiated by a hyperlink on the remote page or if a redirect was employed. This was only seldom observed, because the experimental tasks did not require searching for information outside the respective servers (when it did occur, it was not recorded, because participants generally returned via the back button). The following measures were used.

1. Extent of navigation. This was measured by the number of nodes and the number of edges.
2. Search strategies. We chose the branching factor of the home page as our indicator of how broadly a user searched, that is, of how much the overall strategy resembled breadth-first search.
The branching factor equals the number of different new nodes, that is, branches opened up directly after the home page. Repeat visits were not counted; that is, in a path [home page, X, Y, home page, Y], only X counts toward the branching factor, since this back-and-forth movement is interpreted as landmark use rather than as the examination of new ways of answering the question. In contrast, a depth-first navigation strategy explores a path that leads the user away from the start page until the end of that path is reached, or at least until it becomes obvious that it will not lead to the goal. We considered long paths from the home page to be indicative of such a strategy and chose the average depth of exploration as our measure of a depth-first strategy: the number of forward moves divided by the number of branches emanating from the home page.
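Both measures can be computed directly from a session path. The sketch below is an illustrative Python implementation of the definitions given above; the exact computation used in the original study may differ.

def navigation_measures(path, home):
    """Branching factor and average depth of exploration for one session path."""
    seen = {home}
    branches = 0        # new nodes opened directly after the home page
    forward_moves = 0   # moves to not-yet-visited pages
    for prev, page in zip(path, path[1:]):
        if page not in seen:
            forward_moves += 1
            if prev == home:
                branches += 1
            seen.add(page)
    avg_depth = forward_moves / branches if branches else 0.0
    return branches, avg_depth

# Example from the text: in [home page, X, Y, home page, Y] only X counts
# toward the branching factor; average depth is 2 forward moves / 1 branch.
print(navigation_measures(["home", "X", "Y", "home", "Y"], home="home"))   # -> (1, 2.0)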
Session Identification:
The session identification process takes the user identity file as input and, for each user id, extracts the pages accessed. If the time difference between subsequent page accesses exceeds the threshold time, a new session is opened; otherwise the access is added to the current session.
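The following Python sketch combines this timeout rule with the session filtering step described earlier. The 30-minute threshold and the (timestamp, url) record format are assumptions made for illustration; the actual threshold is a configurable parameter.

from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)   # assumed threshold

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Split one user's time-ordered (timestamp, url) requests into sessions.

    A new session is opened whenever the gap between consecutive requests
    exceeds the timeout; sessions of length one are then discarded, as in
    the session filtering step.
    """
    sessions, current, last_time = [], [], None
    for timestamp, url in requests:
        if last_time is not None and timestamp - last_time > timeout:
            sessions.append(current)
            current = []
        current.append(url)
        last_time = timestamp
    if current:
        sessions.append(current)
    return [s for s in sessions if len(s) > 1]

requests = [(datetime(2012, 1, 29, 10, 0), "/home"),
            (datetime(2012, 1, 29, 10, 5), "/courses"),
            (datetime(2012, 1, 29, 11, 30), "/exams")]   # > 30 min gap -> new session
print(sessionize(requests))                              # -> [['/home', '/courses']]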

Data Summarization:
The data summarization process extracts the fields of the session data and displays the results graphically. It shows visits per day, visits per hour, the top browsers, operating systems, referrers, and so on.
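A minimal Python sketch of this summarization step is given below, assuming each session record carries a start time, browser, operating system and referrer field (hypothetical names produced by the earlier modules); the resulting counters could then be passed to any charting component for graphical display.

from collections import Counter

def summarize(sessions):
    """Aggregate simple usage statistics from a list of session records."""
    return {
        "visits/day":    Counter(s["start"].date() for s in sessions),
        "visits/hour":   Counter(s["start"].hour for s in sessions),
        "top browsers":  Counter(s["browser"] for s in sessions).most_common(5),
        "top OS":        Counter(s["os"] for s in sessions).most_common(5),
        "top referrers": Counter(s["referrer"] for s in sessions).most_common(5),
    }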
                               Web usage data is collected in various ways, each mechanism collecting attributes relevant for its purpose. There is a need to pre-process the data to make it easier to mine for knowledge. Specifically, we believe that issues such as instrumentation and data collection, data integration and transaction identification need to be addressed.
                                      
Clearly, improved data quality can improve the quality of any analysis performed on the data. A problem in the Web domain is the inherent conflict between the analysis needs of the analysts (who want more detailed usage data collected) and the privacy needs of users (who want as little data collected as possible).
                                                                                                                          
This has led to the development of cookie files on one side and cache busting on the other. The emerging OPS standard on collecting profile data may be a compromise on what can and will be collected. However, it is not clear how much compliance with it can be expected. Hence, there will be a continual need to develop better instrumentation and data collection techniques, based on whatever is possible and allowable at any point in time. Portions of Web usage data exist in sources as diverse as Web server logs, referral logs, registration files, and index server logs. Intelligent integration and correlation of information from these diverse sources can reveal usage information which may not be evident from any one of them.


SYSTEM REQUIREMENTS:
Hardware Interfaces:
   

Processor Type               : Pentium IV
Speed                        : 2.4 GHz
RAM                          : 256 MB
Hard Disk                    : 20 GB


Software Interfaces:
         

Operating System             : Windows 2000
Programming Package          : .NET
Server                       : IIS Server
Tools                        : Flowchart .NET

