doi:10.3850/978-981-08-7300-4_0993


A Performance based Search Engine Technique for Web Mining


Ram Kumar Sahu and Yogendra Kumar Jain

Department of Computer Science & Engineering, S.A.T.I, Vidisha, M.P., India

ABSTRACT

A Web Search Engine maintains and catalogs the content of Web pages in order to make them easier to find and browse. Although the various Search Engines are similar, each differs from the others in its methods for scouring, storing, and retrieving information from the Web. Search Engines usually search Web pages for specified keywords and return a list of the documents that contain them. This list is sorted by a relevance criterion that tries to place the documents that best match the user’s query in the first positions. For most people, the usefulness of a Search Engine depends on the relevance of the results it returns. This thesis addresses some of the major challenges faced by Search Engines. In particular, because the Web is growing rapidly, the main issues concern high-performance algorithms for information management. Furthermore, Web Search Engines now receive more than 200 million searches per day over a collection of several billion indexed Web pages. These figures explain why, in such environments, the efficiency as well as the effectiveness of search and indexing algorithms have become issues. For these reasons, this thesis proposes three novel techniques aimed at enhancing the performance of a Web Search Engine from three different angles. We implement a novel caching policy that obtains high hit-ratio values; the aim of caching is to increase the throughput of a Web Search Engine (i.e., the number of queries answered per second). We also address a novel problem related to the compression of Web Search Engine indexes: the more compressed an index is, the better the memory hierarchy of the system is exploited.

Keywords: SDC policy, Indexing web search engines, K-means algorithm.
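
The abstract states that the caching policy targets a high hit ratio but gives no details, and the keyword "SDC" is not expanded here. The sketch below is therefore only an illustration under stated assumptions: it assumes SDC denotes a static-dynamic cache, in which a read-only static part holds the most frequent queries of a past log and a dynamic part is managed with LRU replacement. The class name, method names, and sample queries are hypothetical and are not taken from the paper.

```python
from collections import Counter, OrderedDict


class StaticDynamicCache:
    """Toy query cache: a read-only static set plus an LRU-managed dynamic part.

    Assumption: this mirrors one common reading of "SDC"; it is not the
    authors' implementation.
    """

    def __init__(self, past_queries, static_size, dynamic_size):
        # Static part: the most frequent queries of a past log, never evicted.
        top = Counter(past_queries).most_common(static_size)
        self.static = {query for query, _ in top}
        # Dynamic part: plain LRU over queries not held in the static part.
        self.dynamic = OrderedDict()
        self.dynamic_size = dynamic_size
        self.hits = 0
        self.lookups = 0

    def lookup(self, query):
        """Return True on a hit; on a miss, admit the query to the dynamic part."""
        self.lookups += 1
        if query in self.static:
            self.hits += 1
            return True
        if query in self.dynamic:
            self.dynamic.move_to_end(query)  # mark as most recently used
            self.hits += 1
            return True
        if self.dynamic_size > 0:
            self.dynamic[query] = True
            if len(self.dynamic) > self.dynamic_size:
                self.dynamic.popitem(last=False)  # evict least recently used
        return False

    def hit_ratio(self):
        """Fraction of lookups answered from the cache."""
        return self.hits / self.lookups if self.lookups else 0.0


def run_example():
    # Hypothetical query logs, used only to exercise the sketch.
    training_log = ["news", "weather", "news", "maps", "news", "weather"]
    test_stream = ["news", "mail", "weather", "mail", "maps", "news"]
    cache = StaticDynamicCache(training_log, static_size=2, dynamic_size=1)
    outcomes = [cache.lookup(q) for q in test_stream]
    return outcomes, cache.hit_ratio()
```

Calling `run_example()` replays the test stream against the cache and returns the per-query hit/miss outcomes together with the overall hit ratio, which is the quantity the abstract ties to throughput.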

