A STATE-OF-THE-ART SURVEY: FOCUSED WEB CRAWLING USING NAMED ENTITY RECOGNITION FOR NARROW DOMAINS

Lakshman Jayaratne

Abstract

In recent years the World Wide Web (WWW) has grown to an extent that generic web crawlers can no longer keep up with it. As a result, focused web crawlers, which restrict themselves to a particular domain, have gained popularity. However, these crawlers rely on lexical terms and ignore the information carried by named entities, which can be a very good source of information when crawling narrow domains. In this paper we discuss a new approach to focused crawling based on named entities for narrow domains. We conducted focused crawling experiments in three narrow domains: baseball, football and American politics. The crawler is guided by a classifier based on the centroid algorithm, trained on web pages collected manually from online news articles for each domain. Our results show that at any time during the crawl, the collection built with our crawler is better, in terms of harvest ratio, than the collection built with a traditional focused crawler based on lexical terms. This held for all three domains considered.
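The two technical ideas named in the abstract, a centroid classifier over named entities and the harvest ratio, can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes each page has already been reduced to a list of named-entity strings by some external recognizer, that similarity is measured by cosine, and that a page counts as relevant above a hypothetical threshold of 0.5. The function names and the threshold are illustrative choices, and harvest ratio is taken in its usual sense as the fraction of crawled pages judged relevant.

```python
import math
from collections import Counter


def centroid(training_docs):
    """Average the length-normalised entity-frequency vectors of the training pages."""
    total = Counter()
    used = 0
    for entities in training_docs:
        tf = Counter(entities)
        norm = math.sqrt(sum(v * v for v in tf.values()))
        if norm == 0:
            continue  # skip pages with no entities
        used += 1
        for entity, v in tf.items():
            total[entity] += v / norm
    return {e: v / used for e, v in total.items()} if used else {}


def cosine(entities, centroid_vec):
    """Cosine similarity between a page's entity vector and the domain centroid."""
    tf = Counter(entities)
    dot = sum(v * centroid_vec.get(e, 0.0) for e, v in tf.items())
    n_page = math.sqrt(sum(v * v for v in tf.values()))
    n_cent = math.sqrt(sum(v * v for v in centroid_vec.values()))
    return dot / (n_page * n_cent) if n_page and n_cent else 0.0


def is_relevant(entities, centroid_vec, threshold=0.5):
    """Hypothetical decision rule: relevant if similarity reaches the threshold."""
    return cosine(entities, centroid_vec) >= threshold


def harvest_ratio(relevance_flags):
    """Fraction of crawled pages that were judged relevant."""
    return sum(relevance_flags) / len(relevance_flags) if relevance_flags else 0.0


# Toy baseball domain: entities assumed to come from an external NER step.
training = [["Yankees", "Red Sox"], ["Yankees", "Derek Jeter"]]
domain = centroid(training)

crawled = [["Yankees"], ["Barack Obama"], ["Red Sox", "Yankees"], ["Derek Jeter", "Yankees"]]
flags = [is_relevant(page, domain) for page in crawled]
ratio = harvest_ratio(flags)
```

In a crawler, the same similarity score could also be used to prioritise the links found on a page, so that links from pages close to the centroid are fetched first.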

Keywords

web mining, focused crawling, named entity, classification


GSTF Journal on Computing (JoC)