CS430 Assignment 4
Alex Faaborg

 

Extra Features

My submission of Assignment 4 includes two extra features that were not required:

1.  The web parser uses 10 concurrent threads to increase the speed at which it can parse

3.  The search application provides a graphical user interface inspired by everyone's favorite search engine

 

Algorithms and Data Structures used in Assignment4_index

After the user specifies the start URL, this URL is places in a LinkedList called toParse, and then 10 threads are instantiated.  These threads (class coreThreadGet) are programmed to to use class core to check toParse for URLs.  If no URLs are in toParse they wait half a second and check again.  In addition to the ability to check toParse, the class core exposes several other synchronized methods to these threads:

core.checkContinue() --> returns a boolean variable representing if the web parse has been completed yet

core.getParseCount() --> returns the number of web pages that have been parsed

core.getFile() --> takes in a URL String and returns a String of the HTML

core.savePage() --> adds the parsed pages information to a LinkedList in the class IndexMemory.  Also takes the hyperlinks on the page being saved and adds them to toParse if the URL is not blocked by the sites robots.txt file and it has not already been parsed.

Auto saving parse: as an initial parameter, the user can enter how many pages should be parsed in between each autosave of the index.xml file.  The class IndexMemory checks how many web pages it has received and then writes its memory to the hard drive when the pages exceed the interval specified.

Extracting information from web pages:  Regular expressions in the class Parse are used to find the title and hyperlinks on each web page.  The words on each page are found using a finite state machine which is in class getWords. This class also supports using a stop list, but this was commented out to allow for exact phrase searches.

the applications index and index_prep have been separated since the index application was used for hours at a time to index the web, and it was never clear when it would finish.

Web pages are represented in the file index.xml using the following XML:

<page>
    <
url>http://www.cornell.edu</url>
    <
title>Welcome to Cornell University!</title>
    <
titleWords>welcome,to,cornell,university,</titleWords>
    <
text>welcome,to,cornell,university,esponse,to,9,11,events,diversity,inclusiveness,administration,feedback,site,index,copyright,complaints,.........</text>
    <
link>http://www.cornell.edu/CUHomePage/Welcome.html</link>
    <
link>http://www.cornell.edu/CUHomePage/Tour.html</link>
    <
link>http://www.cornell.edu/clickthrough.php3?url=http://www.admissions.cornell.edu/AtoZ.cfm</link>
<
/page>

 

Algorithms and Data Structures used in Assignment4_index_prep

The application Assignment4_index_prep takes the file index.xml created by Assignment4_index and creates several inverted indexes.  The XML manipulation is done using Crimson by Apache.

The urlList.xml file is used to abbreviate URLs in the files titleWords.xml and words.xml.  URLs are found in the page elements of index.xml and added to urlList.xml using the format:

<urlList>
    <
p0>http://www.cornell.edu</p0>
    <
p1>http://www.cornell.edu/CUHomePage/Welcome.html</p1>
    <
p2>http://www.cornell.edu/CUHomePage/Tour.html</p2>
    <
p3>http://www.cornell.edu/CUFACTS/</p3>
    <
p4>http://www.cornell.edu/CUHomePage/FAQ/default.html</p4>
    <
p5>http://www.cornell.edu/CUHomePage/CornellStream.html</p5>

The inverted word list is created by adding words from each page in index.xml to a java.util.Dictionary, which extends Hashtable.  The keys are the words themselves, and the value is a Set of hits.  The data structure Set is used to eliminate duplicate postings.  To sort the dictionary, an enumeration of keys is placed in an array, then sorted using the Arrays.sort() method in java, which performs a merge sort.  The words and their postings are encoded in XML using the following format:

<word>
    <
text>engineering</text>
    <
hits>p1, p101, p2, p21, p28, p31, p38, p39, p43, p51</hits>
</
word>
<
word>
    <
text>engineers</text>
    <
hits>p107, p99</hits>
</
word>
<
word>
    <
text>english</text>
    <
hits>p24, p4, p51, p55</hits>
</
word>
<
word>
    <
text>enhance</text>
    <
hits>p136, p146, p24</hits>
</
word>

Forward links are determined by simply accessing the index.xml file and extracting the links on each page.  Backwards links are generated by searching the entire index.xml file for pages that link to each URL. The file backwardsLinks.xml is encoded with the following format:

<page>
    <
url>http://www.cornell.edu</url>
    <
backLink>http://www.cornell.edu/clickthrough.php3?url=http://www.store.cornell.edu</backLink>
    <
backLink>http://www.cornell.edu/clickthrough.php3?url=http://www.chimes.cornell.edu</backLink>
    <
backLink>http://www.cornell.edu/clickthrough.php3?url=http://www.einaudi.cornell.edu</backLink>
</
page>

 

Algorithms and Data Structures used in Assignment4_discoverClient

The 5 inverted index files and index.xml are read in when first initializing the search application using Crimson by Apache.  Unfortunately loading an wordsList.xml file containing 17,000 words takes about 20 minutes.

the elements in wordsList.xml and titleList.xml are stored in a java.util.Dictionary.  Results to searches for pages that contain any of the search terms entered are generated by merging the LinkedList's returned by accessing the dictionary.  Results for searches for pages that contain all of the terms are generated by finding the results for pages that have any of the terms, and then find the terms that appear the same number of times as they are pages (since there are no duplicated).  Results for exact phrase searches are generated by find the pages that contain all of the terms, and then checking each page to see if it contains these terms in an exact phrase by using java.util.regex.