Assignment 2 FAQ

I'll post the questions related to assignment 2 as they come in.


  • A possible design for the thread-pool (which was discussed during the help session) is shown below.

[Figure: 455-a2.jpg]
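
Since the figure may not be visible here, the following is a rough sketch of one common thread-pool layout: a fixed set of worker threads pulling tasks from a shared queue guarded by wait/notify. It is not necessarily the design shown in the figure or discussed in the help session; the class name SimpleThreadPool and its methods are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch: fixed worker threads sharing one task queue.
class SimpleThreadPool {
    private final Queue<Runnable> tasks = new ArrayDeque<>();
    private final Thread[] workers;
    private boolean running = true; // guarded by the lock on 'tasks'

    SimpleThreadPool(int size) {
        workers = new Thread[size];
        for (int i = 0; i < size; i++) {
            workers[i] = new Thread(this::workLoop, "worker-" + i);
            workers[i].start();
        }
    }

    // Adds a task and wakes up one waiting worker.
    void submit(Runnable task) {
        synchronized (tasks) {
            tasks.add(task);
            tasks.notify();
        }
    }

    // Stops the workers; tasks still in the queue are dropped in this sketch.
    void shutdown() {
        synchronized (tasks) {
            running = false;
            tasks.notifyAll();
        }
    }

    private void workLoop() {
        while (true) {
            Runnable task;
            synchronized (tasks) {
                // Sleep until there is work to do or the pool is shut down.
                while (running && tasks.isEmpty()) {
                    try {
                        tasks.wait();
                    } catch (InterruptedException e) {
                        return;
                    }
                }
                if (!running) {
                    return;
                }
                task = tasks.poll();
            }
            try {
                task.run(); // run outside the lock so other workers can proceed
            } catch (RuntimeException e) {
                // A failing task should not take the worker thread down with it.
                System.err.println("Task failed: " + e);
            }
        }
    }
}
```

A crawler would call submit() once per page to fetch and shutdown() when the crawl is finished.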


  • Certain web page URLs are redirected to different URLs. How do I resolve relative links against the redirected URL?

Jericho follows redirects, but it does not expose the URL of the page it was redirected to, which is needed to resolve relative URLs. To work around this, you can implement a simple check based on the HTTP response code, as follows.

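    // Requires java.io.IOException, java.net.HttpURLConnection and java.net.URL.
    public String resolveRedirects(String url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        // Do not follow the redirect automatically, so the Location header can be inspected.
        con.setInstanceFollowRedirects(false);
        con.connect();
        try {
            int responseCode = con.getResponseCode();
            String location = con.getHeaderField("Location");
            // Treat any 3xx response with a Location header as a redirect
            // (301, 302, 303, 307, 308), not only 301.
            if (responseCode >= 300 && responseCode < 400 && location != null) {
                // The Location header may be relative; resolve it against the original URL.
                return new URL(new URL(url), location).toString();
            }
            return url;
        } finally {
            con.disconnect();
        }
    }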

This method will return the redirected URL if there is a redirect from the web server. Otherwise it will return the original URL. Resolve the URL using this method before creating the URL object to parse the page with Jericho.

For example:

String pageUrl = resolveRedirects("http://www.chm.colostate.edu");
Source source = new Source(new URL(pageUrl));

  • When a task is handed off to another crawler, the corresponding URL may already have been crawled, possibly at a different recursion level. Is it required to crawl it again?

No, it is not required. Just decide if it is a duplicate task based entirely on the URL.
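
One way to apply the rule above is to keep a set of URLs that have already been accepted and to consult it whenever a task arrives, whether it was produced locally or handed off by a peer. The sketch below is only an illustration; the class name CrawledUrlRegistry and the method markIfNew are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: duplicate detection keyed on the URL alone.
class CrawledUrlRegistry {
    private final Set<String> seen = new HashSet<>();

    // Returns true only the first time a URL is offered. The recursion depth is
    // deliberately not part of the key, so the same URL at another depth is a duplicate.
    synchronized boolean markIfNew(String url) {
        return seen.add(url);
    }
}
```

The method is synchronized because several worker threads may offer URLs at the same time. It works best when URLs are normalized before they are offered.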


  • How do I make sure a particular page URL belongs to a particular domain?
public static boolean checkDomain(String pageUrl, String rootUrl) throws MalformedURLException {
    return new URL(pageUrl).getHost().equals(new URL(rootUrl).getHost());
}

Note: This approach does not work for the Psychology department (http://www.colostate.edu/Depts/Psychology/), because its pages live under a path on the shared www.colostate.edu host rather than on a host of their own. You may handle URLs from the Psychology department as a special case to keep the implementation simple.
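
One possible way to cover the Psychology department is to compare the path prefix in addition to the host whenever the root URL has a path of its own. This is only a sketch under that assumption, not a required approach; the class name DomainCheck is made up, and the path comparison is case-sensitive.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch: host check extended with a path-prefix check.
class DomainCheck {
    static boolean checkDomain(String pageUrl, String rootUrl) throws MalformedURLException {
        URL page = new URL(pageUrl);
        URL root = new URL(rootUrl);
        if (!page.getHost().equalsIgnoreCase(root.getHost())) {
            return false;
        }
        String rootPath = root.getPath();
        // A root without a path (or just "/") covers the whole host.
        if (rootPath.isEmpty() || rootPath.equals("/")) {
            return true;
        }
        // Otherwise the page must live under the root's path, e.g. /Depts/Psychology/.
        String prefix = rootPath.endsWith("/") ? rootPath : rootPath + "/";
        String pagePath = page.getPath();
        return pagePath.startsWith(prefix) || (pagePath + "/").equals(prefix);
    }
}
```

For root URLs without a path, this behaves like the host-only check above.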


  • When a task is handed off to a peer crawler, the corresponding URL may already have been crawled, possibly at a different recursion depth. Do I have to crawl it again?

No, it is not required. If the page is already crawled, consider it as a duplicate task irrespective of the recursion depth.


Use http://www.bmb.colostate.edu instead.


  • It is possible to check for URL redirection more efficiently. Instead of using the resolveRedirects code given above, you can use a single HTTP connection both to detect the redirect and to retrieve the content.
  HttpURLConnection con = (HttpURLConnection)(new URL(url).openConnection());
  con.connect();
  InputStream is = con.getInputStream();
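  // After the response is read, getURL() returns the URL the page was actually served from,
  // i.e. the redirect target if there was a redirect.
  String redirectedUrl = con.getURL().toString();
  // Pass the input stream to Jericho instead of the URL, so the page is fetched only once.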
  Source source = new Source(is);

  • Is it required to process HTTPS URLs?

No, you can skip them (as well as any 'mailto' and 'ftp' URLs). You only need to process HTTP URLs.
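
A simple way to apply this filter is to test the scheme prefix of each absolute URL after relative links have been resolved. The sketch below is one possibility; the names SchemeFilter and isHttpUrl are hypothetical.

```java
// Hypothetical sketch: keep only absolute URLs that use the http scheme.
class SchemeFilter {
    static boolean isHttpUrl(String url) {
        // regionMatches with ignoreCase=true accepts "HTTP://" as well as "http://".
        return url != null && url.trim().regionMatches(true, 0, "http://", 0, 7);
    }
}
```

With this check, https, mailto and ftp links are all rejected because none of them starts with "http://".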


  • Is it required to consider HTTP query parameters (the part of a URL that follows a '?')?

No, it is not required. Just remove those query parameters from the URLs.
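
Removing the query parameters can be as simple as cutting the URL at the first '?'. A minimal sketch, with hypothetical names QueryStripper and stripQuery; it does not treat '#' fragments specially.

```java
// Hypothetical sketch: drop everything from the first '?' onwards.
class QueryStripper {
    static String stripQuery(String url) {
        int q = url.indexOf('?');
        return q >= 0 ? url.substring(0, q) : url;
    }
}
```

Stripping the query before the duplicate check means that pages differing only in their parameters are treated as the same URL.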


  • Is there a reliable way to normalize the URLs?

Normalizing makes sure that two URLs which are logically the same are treated as equal despite slight syntactic differences. The Java URI class has a normalize method, but it does not handle certain types of URLs. The following code is borrowed and modified from the Apache Commons HttpClient library.

/**
* Licensed under http://www.apache.org/licenses/LICENSE-2.0
*/
public static String normalize(String normalized) {
 
        if (normalized == null) {
            return null;
        }
 
        // If the buffer begins with "./" or "../", the "." or ".." is removed.
        if (normalized.startsWith("./")) {
            normalized = normalized.substring(1);
        } else if (normalized.startsWith("../")) {
            normalized = normalized.substring(2);
        } else if (normalized.startsWith("..")) {
            normalized = normalized.substring(2);
        }
 
        // All occurrences of "/./" in the buffer are replaced with "/"
        int index = -1;
        while ((index = normalized.indexOf("/./")) != -1) {
            normalized = normalized.substring(0, index) + normalized.substring(index + 2);
        }
 
        // If the buffer ends with "/.", the "." is removed.
        if (normalized.endsWith("/.")) {
            normalized = normalized.substring(0, normalized.length() - 1);
        }
 
        int startIndex = 0;
 
        // All occurrences of "/<segment>/../" in the buffer, where ".."
        // and <segment> are complete path segments, are iteratively replaced
        // with "/" in order from left to right until no matching pattern remains.
        // If the buffer ends with "/<segment>/..", that is also replaced
        // with "/".  Note that <segment> may be empty.
        while ((index = normalized.indexOf("/../", startIndex)) != -1) {
            int slashIndex = normalized.lastIndexOf('/', index - 1);
            if (slashIndex >= 0) {
                normalized = normalized.substring(0, slashIndex) + normalized.substring(index + 3);
            } else {
                startIndex = index + 3;
            }
        }
        if (normalized.endsWith("/..")) {
            int slashIndex = normalized.lastIndexOf('/', normalized.length() - 4);
            if (slashIndex >= 0) {
                normalized = normalized.substring(0, slashIndex + 1);
            }
        }
 
        // All prefixes of "<segment>/../" in the buffer, where ".."
        // and <segment> are complete path segments, are iteratively replaced
        // with "/" in order from left to right until no matching pattern remains.
        // If the buffer ends with "<segment>/..", that is also replaced
        // with "/".  Note that <segment> may be empty.
        while ((index = normalized.indexOf("/../")) != -1) {
            int slashIndex = normalized.lastIndexOf('/', index - 1);
            if (slashIndex >= 0) {
                break;
            } else {
                normalized = normalized.substring(index + 3);
            }
        }
        if (normalized.endsWith("/..")) {
            int slashIndex = normalized.lastIndexOf('/', normalized.length() - 4);
            if (slashIndex < 0) {
                normalized = "/";
            }
        }
 
        return normalized;
    }
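
For comparison with the method above, the sketch below shows the java.net.URI route mentioned earlier. The wrapper names UriNormalizeDemo and normalizeWithUri are made up for illustration. Two limitations are worth knowing: URI.create throws IllegalArgumentException for strings that are not valid URIs (for example, ones containing spaces), and according to the URI documentation, ".." segments that have no preceding segment to cancel are left in place.

```java
import java.net.URI;

// Hypothetical sketch: normalization using only java.net.URI.
class UriNormalizeDemo {
    static String normalizeWithUri(String url) {
        // normalize() removes "." segments and collapses "<segment>/.." pairs in the path.
        return URI.create(url).normalize().toString();
    }
}
```

For a URL such as http://www.bmb.colostate.edu/a/./b/../c.html this collapses the path to /a/c.html.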
 
faq/hw2.txt · Last modified: 2015/03/05 23:51 by thilinab