Programming Assignment 2: Web Search

For this assignment, you will write an application that searches the web for pages that contain a keyword. Your solution must be contained entirely in one Java file, WebSearch.java. It takes three command-line arguments: the URL of the web page to start the search from, the word to search for, and the maximum number of pages to search. An ArrayList of WebPage objects will be maintained in descending order of keyword count. You will have to implement a bubble-sort algorithm on the ArrayList.
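As an illustration of the three command-line arguments, here is a minimal sketch of argument handling. The class name ArgsSketch and the method names are hypothetical, and treating a non-numeric or non-positive page limit as a usage error is an assumption; the assignment only shows the usage message for a run with no arguments.

```java
// Hypothetical sketch: ArgsSketch and its methods are not part of the assignment.
class ArgsSketch {
    // Returns the page limit from args[2], or -1 if the arguments are unusable.
    // Assumption: a non-numeric or non-positive limit counts as a usage error.
    static int parseMaxPages(String[] args) {
        if (args.length != 3) {
            return -1;
        }
        try {
            int max = Integer.parseInt(args[2]);
            return max > 0 ? max : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // Prints the usage text shown in the Correct Output section.
    static void printUsage() {
        System.out.println("Usage:   java WebSearch <url> <word to find> <maximum number of pages to search>");
        System.out.println("Example: java WebSearch http://www.cs.colostate.edu research 20");
    }

    public static void main(String[] args) {
        int maxPages = parseMaxPages(args);
        if (maxPages < 0) {
            printUsage();
            return;
        }
        String initialURL = args[0];
        String keyword = args[1];
        // Placeholder line: the real program would start the search here.
        System.out.println("Would search up to " + maxPages + " pages for '" + keyword + "' from " + initialURL);
    }
}
```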

Method

You must use the WebPage and JSoup code described below. The main idea is to maintain an ArrayList of WebPage objects sorted in descending order of the number of keyword occurrences found in each page. At every iteration of the search, start from the front of this list and find the first page whose links have not been extracted. Extract them, get the document for each link, count the keyword occurrences in each, add the new pages to the ArrayList, and re-sort it. Continue until links have been extracted from the maximum number of pages to search.
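The re-sort step can be sketched as a bubble sort over an ArrayList, largest count first. This is only one possible shape, not the required implementation. Entry is a hypothetical stand-in for WebPage so that the example runs without JSoup; in WebSearch.java the comparison would presumably read the keywordCount field, which is protected and therefore accessible from a class in the same package.

```java
import java.util.ArrayList;

// Hypothetical sketch: Entry stands in for WebPage and holds only a URL and a count.
class BubbleSortSketch {
    static class Entry {
        final String url;
        final int count;

        Entry(String url, int count) {
            this.url = url;
            this.count = count;
        }
    }

    // Bubble sort in descending order of count.
    // Each pass pushes the smallest remaining entry toward the end of the list.
    static void bubbleSortDescending(ArrayList<Entry> list) {
        boolean swapped = true;
        for (int pass = 0; pass < list.size() - 1 && swapped; pass++) {
            swapped = false;
            for (int i = 0; i < list.size() - 1 - pass; i++) {
                // Strict comparison keeps entries with equal counts in their original order.
                if (list.get(i).count < list.get(i + 1).count) {
                    Entry tmp = list.get(i);
                    list.set(i, list.get(i + 1));
                    list.set(i + 1, tmp);
                    swapped = true;
                }
            }
        }
    }

    public static void main(String[] args) {
        ArrayList<Entry> list = new ArrayList<Entry>();
        list.add(new Entry("p1", 3));
        list.add(new Entry("p2", 9));
        list.add(new Entry("p3", 5));
        list.add(new Entry("p4", 9));
        bubbleSortDescending(list);
        for (Entry e : list) {
            System.out.println(e.count + " " + e.url);
        }
    }
}
```

The early exit when a pass makes no swaps is optional; it simply avoids extra passes once the list is already in order, which is the common case after adding a few low-count pages.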

As a step-by-step algorithm, implement the following.

  WebPage page = new WebPage(initialURL);
  Get the initial document, using page.getDocument().
  If this returns false, print out the error message shown in the next section and exit.
  Count the number of keywords in page, using page.countKeyword(keyword).
  Create a new ArrayList, called Links, to hold WebPage objects.
  Add page to Links.

  While new pages have been added to Links and the maximum number of pages has not been reached:
    Step through Links, using index i:
      Get the ith page from Links.
      If it hasn't been searched yet,
        Extract the links from this page.
        For each of these new links,
          If the new link is not already in Links,
            If getDocument succeeds on this link,
              Count the keywords and add the link to Links.
        Sort Links using a bubble-sort algorithm that you implement.
        Break out of "Step through Links, using index i".

  Print the top 10 pages in Links.
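The control flow of the steps above can be sketched against an in-memory stand-in for the web, so that it runs without JSoup or a network connection. Everything here is hypothetical: Page imitates the WebPage methods, and WEB and COUNTS are made-up data. The loop condition is one reading of "while new pages have been added": this sketch continues as long as an unsearched page was found on the previous iteration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: Page is an in-memory stand-in for WebPage, not the real class.
class SearchLoopSketch {
    // Fake "web": url -> outgoing links, and url -> keyword count.
    static final Map<String, String[]> WEB = new HashMap<String, String[]>();
    static final Map<String, Integer> COUNTS = new HashMap<String, Integer>();
    static {
        WEB.put("a", new String[] {"b", "c", "x"});  COUNTS.put("a", 1);
        WEB.put("b", new String[] {"a", "d"});       COUNTS.put("b", 5);
        WEB.put("c", new String[] {});               COUNTS.put("c", 3);
        WEB.put("d", new String[] {"a"});            COUNTS.put("d", 9);
        // "x" is deliberately missing, so getDocument() fails for it.
    }

    static class Page {
        final String url;
        int keywordCount;
        boolean searched;

        Page(String url) { this.url = url; }

        boolean getDocument() { return WEB.containsKey(url); }

        void countKeyword() { keywordCount = COUNTS.get(url); }

        ArrayList<Page> extractLinks() {
            ArrayList<Page> found = new ArrayList<Page>();
            for (String target : WEB.get(url)) {
                found.add(new Page(target));
            }
            searched = true;
            return found;
        }

        // ArrayList.contains relies on equals, so pages are compared by URL.
        @Override
        public boolean equals(Object other) {
            return other instanceof Page && url.equals(((Page) other).url);
        }

        @Override
        public int hashCode() { return url.hashCode(); }
    }

    // Bubble sort, largest keyword count first.
    static void sortDescending(ArrayList<Page> list) {
        for (int pass = 0; pass < list.size() - 1; pass++) {
            for (int i = 0; i < list.size() - 1 - pass; i++) {
                if (list.get(i).keywordCount < list.get(i + 1).keywordCount) {
                    Page tmp = list.get(i);
                    list.set(i, list.get(i + 1));
                    list.set(i + 1, tmp);
                }
            }
        }
    }

    static ArrayList<Page> search(String start, int maxPages) {
        ArrayList<Page> links = new ArrayList<Page>();
        Page first = new Page(start);
        if (!first.getDocument()) {
            return links;  // empty list: the caller reports the download failure
        }
        first.countKeyword();
        links.add(first);

        int pagesSearched = 0;
        boolean progressed = true;
        while (progressed && pagesSearched < maxPages) {
            progressed = false;
            for (int i = 0; i < links.size(); i++) {
                Page page = links.get(i);
                if (!page.searched) {
                    for (Page candidate : page.extractLinks()) {
                        if (!links.contains(candidate) && candidate.getDocument()) {
                            candidate.countKeyword();
                            links.add(candidate);
                        }
                    }
                    sortDescending(links);
                    pagesSearched++;
                    progressed = true;
                    break;  // restart from the front of the re-sorted list
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        ArrayList<Page> links = search("a", 2);
        for (int i = 0; i < Math.min(10, links.size()); i++) {
            Page p = links.get(i);
            System.out.println(String.format("%5d: %c %s", p.keywordCount, p.searched ? '*' : ' ', p.url));
        }
    }
}
```

With a limit of 2, links are extracted from "a" and then from "b", so the list ends up ordered d, b, c, a, with only "b" and "a" marked as searched.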

Correct Output

Your code should produce the following output. The search results may differ a little if the web pages change. The asterisk marks pages whose links have been extracted.

> java WebSearch
Usage:   java WebSearch <url> <word to find> <maximum number of pages to search>
Example: java WebSearch http://www.cs.colostate.edu research 20

> java WebSearch http://nowhere research 20
First URL http://nowhere could not be downloaded.

> java WebSearch http://www.cs.colostate.edu research 20
   30: * http://www.chem.colostate.edu/
   20:   http://www.bmb.colostate.edu/
   15:   http://www.biology.colostate.edu/
    9:   http://www.physics.colostate.edu/
    9:   http://www.chem.colostate.edu/employment/
    7:   http://www.chem.colostate.edu/department-announces-symposium-and-campaign-for-new-williams-endowed-chair/
    6:   http://www.chem.colostate.edu/csu-cu-chemistry-team-lands-4-4m-grant-for-sustainable-chemical-synthesis/
    5:   http://www.chem.colostate.edu/faculty-materials/
    5:   http://www.chem.colostate.edu/graduates/chemical-biology-program/
    5:   http://www.chem.colostate.edu/professor-amber-krummel-lands-prestigious-nsf-career-award/

Code You Must Use

Download JSoup from http://jsoup.org. I recommend downloading http://jsoup.org/packages/jsoup-1.7.2-sources.jar and extracting it in the directory where you are developing your code for this assignment.

jar xf jsoup-1.7.2-sources.jar

You will then have a subdirectory with this structure:

org
`---jsoup
    |---examples
    |---helper
    |---nodes
    |---parser
    |---safety
    `---select

We will use Java import statements to access this code.

Here is the result of running main in the WebPage class.

> java WebPage
    0:   http://www.cs.colostate.edu/
Result of searching for the word 'research' starting from http://www.cs.colostate.edu/
    3: * http://www.cs.colostate.edu/
Links found:
http://www.natsci.colostate.edu/
http://www.colostate.edu/
http://www.cs.colostate.edu/BMAC/
http://www.cs.colostate.edu/TechReports/
http://www.cs.colostate.edu/directory/directory.htm/
http://www.cs.colostate.edu/cgi-bin/webmaster/userlist.cgi/
http://www.cs.colostate.edu/~acm/
http://events.colostate.edu/day_default.asp?ID=7/
http://www.bmb.colostate.edu/
http://www.biology.colostate.edu/
http://www.chem.colostate.edu/
http://www.cs.colostate.edu/
http://www.math.colostate.edu/
http://www.physics.colostate.edu/
http://www.colostate.edu/Depts/Psychology/
http://www.stat.colostate.edu/
http://www.natsci.colostate.edu/college/centers.cfm/
https://advancing.colostate.edu/CNS/MAIN/
http://admissions.colostate.edu/
http://search.colostate.edu/
http://www.colostate.edu/info-contact.aspx/
http://www.colostate.edu/info-disclaimer.aspx/
http://www.colostate.edu/info-equalop.aspx/
http://www.colostate.edu/info-privacy.aspx/

Here is the WebPage class.

WebPage.java
import java.util.ArrayList;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
 
public class WebPage {
    protected String url;
    protected Document document;
    protected boolean searched;
    protected int keywordCount;
 
    public WebPage(String urlArg) {
	int n = urlArg.length();
	if (urlArg.charAt(n-1) != '/')
	    url = urlArg + "/";
	else
	    url = urlArg;
	searched = false;
    }
 
    public boolean isSearched() {
	return searched;
    }
 
    public boolean getDocument() {
	try {
	    document = Jsoup.connect(url).get();
	    return true;
	} catch (IOException e) {
	    return false;
	}
    }
 
    public boolean equals(Object otherObject) {
	if (otherObject instanceof WebPage) {
	    WebPage other = (WebPage) otherObject;
	    return url.equals(other.url);
	} else {
	    return false;
	}
    }
 
    public ArrayList<WebPage> extractLinks() {
	Elements linkElements = document.select("a[href]");
	ArrayList<WebPage> links = new ArrayList<WebPage>(linkElements.size());
	for (Element link : linkElements) {
	    String href = link.attr("href");
	    if (href.indexOf("http") > -1) {
		WebPage newPage = new WebPage(href);
		if (! links.contains(newPage)) 
		    links.add(newPage);
	    }
	}
	searched = true;
	return links;
    }
 
    public int countKeyword(String word) {
	String text = document.text();
	keywordCount = 0;
	int index = text.indexOf(word);
	int wordLength = word.length();
	while (index != -1) {
	    ++keywordCount;
	    index = text.indexOf(word, index+wordLength);
	}
	return keywordCount;  //in case caller wants to know.
    }
 
    public String toString() {
	char searchedMarker = ' ';
	if (searched)
	    searchedMarker = '*';
	return String.format("%5d: %c %s", keywordCount, searchedMarker, url);
    }
 
    //**********************************************************************
    //  Test WebPage
    //**********************************************************************
 
    public static void main (String [] args) {
	WebPage p = new WebPage("http://www.cs.colostate.edu");
	System.out.println(p);
 
	p.getDocument();
	p.countKeyword("research");
	ArrayList<WebPage> links = p.extractLinks();
	System.out.println("Result of searching for the word 'research' starting from " + p.url);
	System.out.println(p);  // using WebPage toString()
	System.out.println("Links found:");
	for (WebPage link : links)
	    System.out.println(link.url);
    }
 
}
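To make the behavior of countKeyword concrete, here is the same loop applied to a plain String, with no JSoup dependency. The class name CountSketch and the sample strings are hypothetical. The loop counts substring matches, so it is case-sensitive, it also matches inside longer words, and it does not count overlapping occurrences.

```java
// Hypothetical sketch: the loop from WebPage.countKeyword applied to a plain String.
class CountSketch {
    static int countOccurrences(String text, String word) {
        int count = 0;
        int index = text.indexOf(word);
        while (index != -1) {
            ++count;
            // Resume after the end of the current match, so matches never overlap.
            index = text.indexOf(word, index + word.length());
        }
        return count;
    }

    public static void main(String[] args) {
        // "Research" (capital R) is not counted; "research," and "researchers" are.
        System.out.println(countOccurrences("Research and research, researchers", "research")); // prints 2
        // Non-overlapping: "aaaa" contains "aa" at index 0 and index 2.
        System.out.println(countOccurrences("aaaa", "aa")); // prints 2
    }
}
```

Note that an empty search word would keep this loop from terminating, because indexOf then returns the same index on every call, so it is worth making sure the keyword argument is not empty.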

Submit to RamCT

Submit your WebSearch.java file via RamCT.
