Studies on Web Corpora


Definitions

Web document: any entity available on the World Wide Web, including Microsoft Word documents, PDF, PostScript (ps), plain text, HTML, etc.
Web genre: the type of a page, characterized by features such as style, form or presentation layout, and meta-content.
Style: as reported by the UNIX command "style"; includes readability measures.
Internet address: the full host name, for example "cs.colostate.edu", as opposed to the domain name, which is "colostate.edu".
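
The distinction between Internet address and domain name can be illustrated with a minimal Python sketch. The function names are hypothetical, and the "last two labels" rule is an assumption that fits the colostate.edu example but would be wrong for suffixes such as "co.uk".

```python
from urllib.parse import urlsplit


def internet_address(url: str) -> str:
    """Return the host part of a URL in lower case, e.g. "cs.colostate.edu"."""
    return (urlsplit(url).hostname or "").lower()


def domain_name(host: str, labels: int = 2) -> str:
    """Naive domain name: keep the last `labels` dot-separated labels.

    Assumption: two labels are enough, which holds for "colostate.edu"
    but not for every top-level suffix.
    """
    return ".".join(host.split(".")[-labels:])


host = internet_address("http://cs.colostate.edu/")
print(host, domain_name(host))
```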

Access to old versions of Web documents

Via the Internet Archive, which has maintained copies of on-line digital documents, including their updates, since 1996.
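
As a rough illustration of how an archived copy can be addressed, the sketch below builds a Wayback Machine style URL from a page URL and a date. The URL pattern is that of the public Wayback Machine and is not taken from these notes; the function name and the example date are assumptions. The archive serves the capture closest to the requested timestamp, so the copy returned may be older or newer than the date given.

```python
from datetime import datetime

# Public Wayback Machine prefix (assumed here; not stated in the notes).
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(url: str, when: datetime) -> str:
    """Build the archive URL for `url` near the moment `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14-digit YYYYMMDDhhmmss
    return f"{WAYBACK_PREFIX}/{timestamp}/{url}"


print(wayback_url("http://www.cs.colostate.edu/", datetime(2005, 1, 1)))
```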

Limitations of the archive:
- Pages blocked with robots.txt
- Pages removed at the owner's request
- Server-side image maps
- JavaScript issues
- Pages not crawled by Alexa Internet


Downloading Web Documents

- 30-second time-out
- HTTP status code recorded (however, a "200" response could still be a custom error message)
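
The two download rules above could be sketched as follows. This is a minimal illustration using Python's standard library, not the tool that was actually used; the function name, the `opener` parameter (there only so the function can be exercised without a network), and the choice to return None when no HTTP response arrives are all assumptions.

```python
import urllib.error
import urllib.request

TIMEOUT_SECONDS = 30  # time-out from the notes above


def fetch(url, opener=urllib.request.urlopen):
    """Return (status_code, body_bytes); status is None if no response arrived."""
    try:
        with opener(url, timeout=TIMEOUT_SECONDS) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx answers still carry a status code worth recording.
        return err.code, b""
    except OSError:
        # URLError and socket time-outs are OSError subclasses.
        return None, b""
```

Because a "200" can hide a custom error page, the recorded status is only a first filter; the page content still has to be inspected.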

Removed Files from Corpora

Files were removed if any of the following applied:
- file size of zero
- not a proper URL (URL not specified, local files, ftp files)
- illegal codings (found: <!WA0, <!WA1, <!WA2, etc.)
- file size of zero after document conversion to text
- HTML frame page
- Macromedia Flash page
- 404 Not Found and "soft 404s"
- permission errors
- server errors
- redirects (except in cases of news page refresh)

or if the file was small (< 8500) and contained any of the following (not case-sensitive):
not found
404
redirect
has moved
permanently moved
moved permanently
refresh
frameset
embed
document.location
swf
Your request failed to connect to our servers
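
The small-file check above can be sketched as a simple case-insensitive substring test. This is only an illustration of the rule as listed: the notes do not give a unit for the 8500 threshold, so bytes are assumed here, and the function and constant names are hypothetical.

```python
SIZE_LIMIT = 8500  # threshold from the notes; unit assumed to be bytes

# Phrases copied from the list above, stored in lower case.
ERROR_PHRASES = (
    "not found", "404", "redirect", "has moved", "permanently moved",
    "moved permanently", "refresh", "frameset", "embed",
    "document.location", "swf",
    "your request failed to connect to our servers",
)


def looks_like_error_page(raw: bytes) -> bool:
    """True if the file is small and contains any of the listed phrases."""
    if len(raw) >= SIZE_LIMIT:
        return False
    # latin-1 maps every byte to a character, so decoding cannot fail.
    text = raw.decode("latin-1").lower()
    return any(phrase in text for phrase in ERROR_PHRASES)


print(looks_like_error_page(b"<html><body>Page Not Found</body></html>"))
```

Plain substring matching is deliberately crude: a short legitimate page that mentions "refresh" or "404" would also be dropped, which is the trade-off the size limit is meant to contain.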

CIKM '05 URLs Used

WebKb
WebKb New URLs used
WebKb Old URLs used
WebKb
Meyer zu Eissen and Stein New URLs used
Meyer zu Eissen and Stein Old URLs used

Features

Note: some of the features overlap between style, form and content.

Style: all 45 data points from the UNIX command style
    includes readability measures and part of speech statistics

Form:

Content:
    BOW (bag of words), converted to lower case

control characters
numbers:  as percent, as phone, as time, as ISBN, as date, as year, as [default] number
hypertext link
HTML heading: as separated entity, grouped together as emphasized text
HTML list item
image as type [gif|jpg|png] and as image
punctuation: colon, slash, comma, at, period, exclamation, number, and, open paren,
  close paren, double quote, single quote, plus, minus, question, tilde, percent,
  open squiggly, close squiggly, open bracket, close bracket, semicolon, backslash, caret,
  equal, dollar
   REMOVED: less than, greater than
salutations grouped: mister/mr[.]/mrs[.]/dr[.]
email
seasons grouped: fall/spring/winter/summer
days of week grouped: monday/tuesday/wednesday/thursday/friday/saturday/
	including stem of word above
months grouped: january/february/march/april/may/june/july/august/september/
	october/november/december
	including stem of word above
	including abbreviation: jan/feb/mar/apr/jun/jul/aug/sep/sept/oct/nov/dec
bi-grams and other stemmed phrases: work experi, last updat, last modifi, all right reserv,
	frequent ask question, faq
HTML features:
	script, link, anchor, bgcolor, fgcolor, backgroundimage,
	image, style,
	TAGS: p, br, center, hr, table, tr, ul, ol, li,
		dd, dt, dl, embed, font, form, tt, code, u
	GROUPED TAGS for emphasized: i, small, em, b, h1, h2, h3, h4
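
Counting the HTML tags listed above, with the emphasis tags collapsed into one group, could look roughly like the sketch below. It uses Python's standard html.parser and is not the extraction code behind these notes; the class name, the function name, and the "emphasized" counter key are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags counted individually (from the TAGS line above).
COUNTED_TAGS = {"p", "br", "center", "hr", "table", "tr", "ul", "ol", "li",
                "dd", "dt", "dl", "embed", "font", "form", "tt", "code", "u"}
# Tags pooled into a single "emphasized" feature.
EMPHASIS_TAGS = {"i", "small", "em", "b", "h1", "h2", "h3", "h4"}


class TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in EMPHASIS_TAGS:
            self.counts["emphasized"] += 1
        elif tag in COUNTED_TAGS:
            self.counts[tag] += 1


def count_tags(html: str) -> Counter:
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


print(count_tags("<p>Hi <B>there</B><BR><h1>Title</h1></p>"))
```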

Additionally analyzed in the Boese corpus: link text
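
The grouped content features listed above (seasons, days of the week, months, salutations) amount to replacing each member word with a shared token after lower-casing. A minimal sketch follows; the group token names and the function name are hypothetical, stemming is left out, and the day list mirrors the line above, which ends at saturday.

```python
import re

GROUPS = {
    "SEASON": {"fall", "spring", "winter", "summer"},
    # Day list as given in the notes above.
    "WEEKDAY": {"monday", "tuesday", "wednesday", "thursday", "friday",
                "saturday"},
    "MONTH": {"january", "february", "march", "april", "may", "june", "july",
              "august", "september", "october", "november", "december",
              "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "sept",
              "oct", "nov", "dec"},
    "SALUTATION": {"mister", "mr", "mrs", "dr"},
}
# Invert to a word -> group lookup table.
TOKEN_TO_GROUP = {word: name for name, words in GROUPS.items() for word in words}


def group_tokens(text: str) -> list:
    """Lower-case the text, split into alphabetic tokens, and map group members."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [TOKEN_TO_GROUP.get(token, token) for token in tokens]


print(group_tokens("Dr. Smith visits in Fall, on Monday 3 Sept"))
```

A lookup like this cannot tell the month "may" from the verb, or the season "fall" from the verb; such collisions are accepted in this sketch.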


Web Corpora


© 2005 by E.S.Boese. All Rights Reserved.