Current Project - Corpora Development
- Effects of Web Document Evolution on Genre Classification
CIKM 2005, Bremen, Germany, Oct 31 - Nov 5
- Abstract:
The World Wide Web is a massive corpus that constantly evolves.
Classification experiments usually grab a snapshot (temporally and
spatially) of the Web for a corpus. In this paper, we examine the
effects of page evolution on genre classification of Web pages.
Web genre refers to the type of the page characterized by features
such as style, form or presentation layout, and meta-content; Web
genre can be used to tune spider crawling re-visits and inform
relevance judgments for search engines. We found that pages in some
genres change rarely if at all and can be used in present-day research
experiments without requiring an updated version. We show that an old
corpus can be used for training when testing on new Web pages, with
only a marginal drop in accuracy rates on genre classification. We
also show that features found to be useful in one corpus do not
transfer well to other corpora with different genres.
Current Project - Classification of similar genres
- Analysis of similar genres in classifying Web documents
- Genre Classification of Web Documents
Poster for AAAI 2005, Pittsburgh, July 10 - 13
- Abstract:
Retrieving relevant documents over the Web
is an overwhelming task when search engines return thousands of Web
documents. Sifting through these documents is time-consuming and
sometimes leads to an unsuccessful search. One problem is that most
search engines rely on matching a query to documents based solely
on topical keywords. However, many users of search engines have
a particular genre in mind for the desired documents. The genre
of a document concerns aspects of the document such as the style or
readability, presentation layout, and meta-content such as words in the
title or the existence of graphs or photos. By including genre in Web
searches, I hypothesize that Web document retrieval could greatly improve
accuracy by better matching documents to the user's information needs.
Before implementing a search engine capable of discriminating on both
genre and topic, a feasibility analysis of genre classification is
needed. Our previous research achieved 91% classification accuracy
across ten genres, whereas similar research reports accuracies between 60%
and 85%. However, the ten genres used in our research were mostly
distinct and only exemplar Web documents (consisting of only one genre)
were chosen. This paper discusses our current work, an in-depth
analysis of how to maintain high accuracy rates among genres that
are very similar.
Master's Thesis
- Stereotyping the Web: Genre Classification of Web Documents
- Abstract:
Retrieving relevant documents over the Web is a difficult
task. Currently, search engines rely on keywords for matching
documents to user queries. This paper explores the potential for
discriminating documents based on the genre of the document. I
define genre as a taxonomy that incorporates the style, form, and
content of a document, is orthogonal to topic, and allows fuzzy
classification into multiple genres. I explore how to automate the
classification of Web documents according to their genres. More than 1,600
genre features are identified, and feature selection methods are examined for
distinguishing documents among ten genre types. Classification
of documents using Bayes Net on a subset of 75 features achieved
91% accuracy.
BibTeX entry
@mastersthesis{boese05thesis,
  author  = {Elizabeth Sugar Boese},
  title   = {Stereotyping the Web: Genre Classification of Web Documents},
  school  = {Computer Science Department, Colorado State University},
  address = {Fort Collins, CO},
  month   = mar,
  year    = {2005}
}
Other Information
Copyright © 2005 Colorado State University. All rights reserved.