Explain the role of HITS in web mining.

- January 01, 2022

The HITS Algorithm

The HITS algorithm is based on the idea that if the creator of page p provides a link to page q, then p confers some authority on q. One must realize, however, that not all links confer authority: some may be paid advertisements and others may exist only for navigation, for example, links to the homepage within a large website.
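One common way to discount purely navigational links is to ignore links whose source and target are on the same host. The sketch below is only an illustration of that idea, not part of the HITS description above; the function name `drop_same_host_links` and the example URLs are hypothetical, and comparing host names is a simplifying assumption (it does not detect paid advertisements).

```python
from urllib.parse import urlparse


def drop_same_host_links(links):
    """Keep only (source, target) links that cross from one host to another."""
    kept = []
    for source, target in links:
        # Links inside one site are treated as navigational (an assumption).
        if urlparse(source).hostname != urlparse(target).hostname:
            kept.append((source, target))
    return kept


# Hypothetical example links.
links = [
    ("http://a.example/page1", "http://a.example/"),         # back to homepage
    ("http://a.example/page1", "http://b.example/article"),  # cross-site link
]

print(drop_same_host_links(links))
```

Only the cross-site link survives the filter, so only that link would be counted as conferring authority.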

A search often retrieves a very large number of items. One way to order these items may be by the number of in-links, but it has been found that this approach does not always work. We present some definitions before describing the HITS algorithm. As noted earlier, a graph may be defined as a set of vertices and edges (V, E). The vertices correspond to the pages and the edges correspond to the links. A directed edge (p, q) indicates a link from page p to page q. The HITS algorithm has two major steps:

1. Sampling step: To collect a set of relevant Web pages for a given topic.

2. Iterative step: To find hubs and authorities using the information collected during sampling.

To use HITS for a query such as "Web mining", the sampling step is carried out by querying a search engine and then using the most highly ranked Web pages retrieved to determine the authorities and hubs. As noted earlier, posing a query to a search engine often returns an abundance of results. For example, "Java programming" might retrieve millions of pages, and finding authority pages among these millions is likely to be very expensive. Also, in some cases the search engine may not retrieve all pages relevant to the query. A query for Java, for example, may not retrieve pages on object-oriented programming, some of which may also be relevant.
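The two steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the update rules are the standard HITS formulation (a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to), which the text above does not spell out. The function names, the tiny link graph, and the fixed iteration count are all hypothetical, and the sampling step here simply adds every neighbour of a root page without the size limits a real system would apply.

```python
import math


def expand_root_set(root, links):
    """Sampling step: add pages that link to, or are linked from, a root page."""
    base = set(root)
    for p, q in links:
        if p in root or q in root:
            base.update((p, q))
    return base


def normalize(scores):
    """Scale scores so that the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(pages, links, iterations=20):
    """Iterative step: compute hub and authority scores for the given pages."""
    edges = [(p, q) for p, q in links if p in pages and q in pages]
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority of q: sum of hub scores of pages p with an edge (p, q).
        new_auth = {page: 0.0 for page in pages}
        for p, q in edges:
            new_auth[q] += hub[p]
        auth = normalize(new_auth)
        # Hub of p: sum of authority scores of pages q with an edge (p, q).
        new_hub = {page: 0.0 for page in pages}
        for p, q in edges:
            new_hub[p] += auth[q]
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical link graph: (p, q) means page p links to page q.
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("x", "h1")]
root = {"h1", "a1"}  # stands in for the top-ranked search results

base = expand_root_set(root, links)
hub, auth = hits(base, links)
print("best authority:", max(auth, key=auth.get))  # a1
print("best hub:", max(hub, key=hub.get))          # h1
```

In this toy graph, a1 is linked from both h1 and h2, so it ends up as the strongest authority, while h1 links to both a1 and a2 and ends up as the strongest hub. Restricting the computation to the base set is what keeps the iterative step affordable compared with scoring the millions of pages a query might match.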
