style; includes readability
Limitations of archive:
- Pages blocked with robots.txt
- Pages requested to be removed by owners
- Server-side image maps
- JavaScript issues
- Pages not crawled by Alexa Internet
Style: all 45 data points from the UNIX style command,
including readability measures and part-of-speech statistics
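To make "readability measure" concrete, here is a minimal sketch of one widely used measure, the Automated Readability Index (ARI). It is an illustration only: the function name, the naive word and sentence splitting, and the choice of ARI as the example are assumptions, and the style command's own counting rules are not reproduced here.

```python
import re


def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    # Naive tokenization (assumption): a word is a run of letters or digits.
    words = re.findall(r"[A-Za-z0-9]+", text)
    # Naive sentence split (assumption): a run of '.', '!' or '?' ends a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text needs at least one word and one sentence")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


if __name__ == "__main__":
    print(round(automated_readability_index("The cat sat. The dog ran."), 2))
```

Each such score would contribute one value to a page's style feature vector; abbreviations and decimal points will confuse the sentence splitter in this sketch.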
Form:
Content:
BOW
converted to lower case
- control characters
- numbers: as percent, as phone, as time, as ISBN, as date, as year, as [default] number
- hypertext link
- HTML heading: as separate entity, grouped together as emphasized text
- HTML list item
- image: as type [gif|jpg|png] and as image
- punctuation: colon, slash, comma, at, period, exclamation, number, and, open paren, close paren, double quote, single quote, plus, minus, question, tilde, percent, open squiggly, close squiggly, open bracket, close bracket, semicolon, backslash, caret, equal, dollar
- REMOVED: less than, greater than
- salutations: grouped all mister/mr[.]/mrs[.]/dr[.]
- email
- seasons grouped: fall/spring/winter/summer
- days of week grouped: monday/tuesday/wednesday/thursday/friday/saturday/
  - including stem of word above
- months grouped: january/february/march/april/may/june/july/august/september/october/november/december
  - including stem of word above
  - including abbreviation: jan/feb/mar/apr/jun/jul/aug/sep/sept/oct/nov/dec
- bi-grams: work experi, last updat, last modifi, all right reserv, frequent ask question, faq
- HTML features: script, link, anchor, bgcolor, fgcolor, backgroundimage, image, style
- TAGS: p, BR, center, hr, table, tr, ul, ol, li, dd, dt, dl, embed, font, form, tt, code, u
- GROUPED TAGS for emphasized: I, small, em, B, H1, H2, h3, h4
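The grouping rules above can be sketched as a small token normalizer feeding a bag of words. This is a rough illustration, not the procedure actually used: the placeholder names (_SEASON_, _MONTH_, and so on), the tokenizer regex, and the four-digit year heuristic are all assumptions, stemming is left out, and only some of the groups are shown.

```python
import re
from collections import Counter

# Word groups copied from the list above.
SEASONS = {"fall", "spring", "winter", "summer"}
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
    "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "sept",
    "oct", "nov", "dec",
}
SALUTATIONS = {"mister", "mr", "mrs", "dr"}

# Assumed tokenizer: percentages, integers, or runs of lower-case letters.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?%|\d+|[a-z]+")


def normalize(token: str) -> str:
    """Map a lower-cased token to its group placeholder, if it has one."""
    if token.endswith("%"):
        return "_PERCENT_"
    if token.isdigit():
        # Assumed heuristic: four digits between 1000 and 2999 count as a year.
        if len(token) == 4 and 1000 <= int(token) <= 2999:
            return "_YEAR_"
        return "_NUMBER_"
    if token in SEASONS:
        return "_SEASON_"
    if token in MONTHS:  # note: "may" the verb is grouped here too
        return "_MONTH_"
    if token in SALUTATIONS:
        return "_SALUTATION_"
    return token


def bag_of_words(text: str) -> Counter:
    """Lower-case the text, tokenize it, and count the normalized tokens."""
    return Counter(normalize(t) for t in TOKEN_RE.findall(text.lower()))


if __name__ == "__main__":
    print(bag_of_words("Dr. Smith updated this page in March 2004, up 15% since Winter."))
```

Grouping collapses many surface forms into one feature, so a page mentioning "March" and a page mentioning "Sept" share the same month feature. Phone, time, ISBN, and date patterns would need their own rules before the default number case.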
additionally analyzed in Boese corpus: link text
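As an illustration of how tag counts and link text might be collected, the sketch below uses Python's standard html.parser module. Treating "link text" as the text inside anchor elements is an assumption, as are the class and function names and the "emphasized" bucket label; the tag grouping follows the GROUPED TAGS entry above.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that the notes group together as "emphasized".
EMPHASIS_TAGS = {"i", "small", "em", "b", "h1", "h2", "h3", "h4"}


class FormFeatureParser(HTMLParser):
    """Counts start tags and collects the text found inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.link_text = []
        self._in_anchor = False
        self._current = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already converted to lower case.
        name = "emphasized" if tag in EMPHASIS_TAGS else tag
        self.tag_counts[name] += 1
        if tag == "a":
            self._in_anchor = True
            self._current = []

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            # Collapse internal whitespace in the collected anchor text.
            text = " ".join("".join(self._current).split())
            if text:
                self.link_text.append(text)
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor:
            self._current.append(data)


def extract_form_features(html: str):
    """Return (tag counts, list of link texts) for one HTML document."""
    parser = FormFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.tag_counts, parser.link_text


if __name__ == "__main__":
    sample = '<p>See <a href="x.html">my <B>home</B> page</a><BR><h1>Title</h1></p>'
    counts, links = extract_form_features(sample)
    print(counts, links)
```

The link texts could then be passed through the same bag-of-words step as the page body; attribute-level features such as bgcolor would need the attrs argument, which this sketch ignores.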