COMP3506/7505 Assignment 2 textual search application

$35.00

Category: You will Instantly receive a download link for .zip solution file upon Payment

Description

5/5 - (3 votes)

Searching is a common process in many applications, with textual searches being a very common
type of search. Search engines, query engines and many data mining applications are examples of
textual search applications. These applications make use of sophisticated algorithms and data
structures to provide efficient search behaviour.
In this assignment you will implement a general textual search application. The application is to be
optimised to perform many complex searches on a single document or collection of documents.
Details
You are provided with three files representing the document to be searched, an index of sections in
that document and a set of stop words. The document to be searched (shakespeare.txt) is a
large text file containing the complete works of William Shakespeare. The index of sections
(shakespeare-index.txt) contains the titles of the works included in the document and the line
number on which they start. The set of stop words (stop-words.txt) contains words that are to
be ignored in a logical search. All three files are encoded in UTF-8 format.
When searches are performed on the document, the results need to indicate the position at which
the search term(s) appear in the document. The position is indicated by the line number on which
the term appears and the column number of the first character of the term. Searches will only be
for English words and will ignore the case of the words in either the document or search term. (e.g.
If the document contained the text “Well, come again tomorrow.”, a search for the phrase “well
come again tomorrow” would match it.)
When determining the text that makes up a single word in the document, most punctuation marks
can be ignored. The exception is for an apostrophe in the middle of a word, such as for a possessive
noun or a contraction. (e.g. “I’ll” would be stored as the word “i’ll”, and “honour’s” as “honour’s”.)
Apostrophes, or other punctuation, at the start or end of a word can be ignored. (e.g. “’tis” would
be stored as the word “tis”, and “years’” as “years”.)
The file shakespeare-index.txt contains the titles and line numbers of each of the works in the
shakespeare.txt file. The titles and line numbers are separated by a comma. Your application needs
to use this index file to determine if a search term appears in a particular work of Shakespeare. For
example, searching for “to be or not to be” in “the tragedy of hamlet”, should return that it is on
line 25,779 and column 1. Whereas, searching for “to be or not to be” in “the tragedy of romeo and
juliet”,should return that no match was found. The titles should not have punctuation removed and
the full title, exactly as it is in the index file, is needed when identifying a section in a search.
The stop-words.txt file contains a list of words that are used too commonly in English text to be
useful as parts of search terms. This file contains one stop word per line in the file. Your application
needs to store all of the stop words found in the document, as there may be situations when a
search needs to include them. But, most searches you implement should ignore any of the stop
words that appear in the search term. The only search that will not ignore the stop words is the
search for a phraseOccurrence. For example, the famous phrase “to be or not to be” in Hamlet
would not be able to be found if you ignored the stop words, as almost the entire phrase consists
of stop words.
COMP3506/7505 Semester 2, 2018
Your application needs to store the data in appropriate data structures allowing efficient searching.
Part of your mark will be based on the data structures you use and your justification of your choices.
Your application should be designed with the assumption that the document will be searched many
times and that searching needs to return a result as fast as possible. You should process the
document’s text to optimise search performance. Searching is expected to occur in main memory.
Your application should be designed so that it will work with any text document that has an
associated index in the format described above. The application should also work with any list of
stop words. Testing of your application may use a different text document and index file, or a
different set of stop words.
Your application needs to implement the following searches:
1. Count the number of occurrences of a word.
2. Search for a single word.
3. Search for a prefix of a word. (e.g. Searching for “obscure” would find “obscure”, “obscured”,
“obscures” and “obscurely”.
4. Search for a phrase, where a phrase is a sequence of words that are contiguous (e.g. “to be or
not to be”).
5. Implement boolean search logic for the operators AND, OR and NOT.
There are two parts to the fifth search. One part is to find all lines which match the search criteria.
The second part is to find all occurrences of the terms in the search criteria within one or more
sections of the document.
So that you do not need to implement parsing of search strings, you are provided with an interface
called Search and a stub class AutoTester that implements it. These will be used to automate
the testing of your application’s functionality. Your AutoTester class will need to implement the
constructor stub and override the interface’s methods. These methods should call the related
functionality in your assignment and return the results defined in the Search interface’s JavaDoc.
If you do not implement some searches, the methods related to these should not be implemented
in your AutoTester class. You are provided with two simple entity classes called Pair and
Triple that are used by Search to return results from various methods.
You may implement a user interface for the assignment if you would like to do so. This will not be
executed by markers and will not be considered in assessing your mark for the assignment.
Hints
The intention of the assignment is to determine not just your ability to implement sophisticated
data structures and their related algorithms. It is to determine your ability to select data structures
and algorithms that are suitable for complex tasks. This may involve using a combination of data
structures and algorithms to provide a good solution.
Section 13.3.4 in the textbook describes an approach to implementing a search engine. You may
find it helpful in providing ideas for this assignment.
Constraints
You may not move any provided classes or interfaces to different packages. You may not add,
remove or change anything in the Search interface. You may not add, remove or change anything
in Pair or Triple. You may not move the text files out of the files directory and you may not
change the names of these files.
COMP3506/7505 Semester 2, 2018
You may not use anything from the Java Collections Framework in the JDK, or any other Java data
structures library, to implement the Abstract Data Types (ADTs) or Concrete Data Types (CDTs) in
your assignment. You may implement the Comparable, Comparator, Iterator or Iterable interfaces
for use in your ADTs or CDTs. Your implementation of the data structures in the assignment must
be based on basic Java language constructs and not libraries. See:
 https://docs.oracle.com/javase/8/docs/technotes/guides/collections/overview.html
The restriction of not using anything from the Java Collections Framework (JCF) does not apply to
your implementation of AutoTester. Some of its methods return results that are types from the
JCF. Your implementation of these methods will need to put the results found from your data
structure searches into an appropriate JCF concrete type.
Analysis
In the JavaDoc comments for your CDT classes indicate the run-time efficiency of each public
method. Use big-O notation to indicate efficiency (e.g. O(n log n)). This should be on a separate line
of the method’s JavaDoc comment, as the last line of the description and before any @param or
@return tags. When determining run-time efficiency of a method, take into account any methods
that it may call.
In the JavaDoc comments for your CDT classes indicate the memory usage efficiency of the data
structure. This should be in the JavaDoc comment for the class and should be on a separate line,
which is the last line of the description and before any @author, @see or any other tags.
You must provide an explanation as to how you determined the run-time and memory usage
efficiencies.
In a file called readme.pdf, provide a justification for the algorithms and data structures that you
used in implementing the application. The justification should explain why the algorithms and data
structures that you implemented were appropriate in the context of efficiently performing many
different searches on the same document in memory.
COMP7505 Extension
Students who are studying COMP7505 must complete a further research component to this
assignment. This extension is worth an additional 10 marks. Your final mark displayed on
BlackBoard’s Grade Centre will be scaled to be out of 30. If a student studying COMP3506 completes
this extension it will not be marked.
Pre-processing the text in the document to optimise searching is an expensive process. You are to
research options of how to minimise this cost. For example, can the results of the pre-processing be
saved to a file that can then be loaded directly into the data structures used for searching? You are
to research possible approaches and write a report describing one of these approaches. The
description should briefly summarise why the approach was an appropriate selection in comparison
to other approaches. Your report should analyse the run-time efficiency of the selected approach
and compare this to the run-time efficiency of pre-processing the document. You should draw a
conclusion as to whether the selected approach is a better alternative than pre-processing the
document or not.
You may attempt to implement the approach in your application. If you do so, the AutoTester
constructor should check the documentFileName to determine if it has been previously preprocessed. If it has been pre-processed then the constructor should apply the optimisation strategy
to create the application’s data structures. If the file has not been previously pre-processed, then it
COMP3506/7505 Semester 2, 2018
should be pre-processed as it is loaded and then the optimisation strategy should be able to be
applied to that file in future uses of AutoTester.
You may use Java’s serialisation functionality for the implementation. Just making your data
structures implement the Serializable interface will gain minimal marks for this functionality.
If you use Java’ serialisation functionality, your report must provide a detailed description of its
implementation, not just say that the objects’ state is saved to the file system.
The report must be less than 1000 words, not including any references. Your report should follow
the IEEE standard for in-text citations and bibliographic referencing. Failure to do this may lead to
allegations of plagiarism.
This report is to be in a file called report.pdf. Failure to submit this report will cap your final mark
for this assignment to be a maximum of 13 out of 30. Failure to implement the approach described
in your report, will result in only losing the marks specified in the criteria for this functionality.
Submission
You must submit your completed assignment electronically through Blackboard. You should submit
your assignment as a single zip file called assign2.zip (use this name – all lower case). This zip
file must contain the code that was provided for the assignment in the same directory structure as
was provided in the provided_code.zip. You may put your own implementation code in
whatever packages you believe are appropriate, as long as AutoTester can be executed in its
existing package. The readme.pdf file, and report.pdf file if you are a COMP7505 student, must
be in the root directory of assign2.zip. Submissions that do not conform to these requirements
(e.g. different directory structures, packages or classes) may not be marked.
You may submit your assignment multiple times before the deadline – only the last submission will
be marked.
Late submission of the assignment will not be marked. In the event of exceptional personal or
medical circumstances that prevent you from handing in the assignment on time, you may submit
a request for an extension.
See the course profile for details of how to apply for an extension:
http://www.courses.uq.edu.au/student_section_loader.php?section=5&profileId=94208
Requests for extensions must be made no later than 48 hours prior to the submission deadline. The
application and supporting documentation (e.g. medical certificate) must be submitted via my.UQ.
You must retain the original documentation for a minimum period of six months to provide as
verification should you be requested to do so.
Please read the section in the course profile about plagiarism. Submitted assignments will be
electronically checked for potential plagiarism.
Completeness
The marking criteria below are based on the assumption that the entire assignment is completed. A
partially complete assignment will be accepted and marked but the final mark for the assignment
will be scaled based on the amount of work completed. For example, completing only search 1
would result in your final mark being reduced by 50%. Completing only searches 1 and 2 would
result in your final mark being reduced by 40%.
COMP3506/7505 Semester 2, 2018
Marking Criteria
Functionality Marks
Each of the functional points below is worth ⅓ of a mark. There is a bonus of 1 mark for passing at
least 26 of these tests, and another bonus of 1 mark for passing at least 28 of these tests.
 wordCount correctly counts the occurrences of a single word that has no edge cases (e.g. no
punctuation adjacent to the word or word starting or ending a line).
 wordCount correctly counts the occurrences of a single word.
 phraseOccurrence correctly locates all occurrences of a single word that has no edge cases (e.g.
no punctuation adjacent to the word or word starting or ending a line).
 phraseOccurrence correctly locates all occurrences of a single word.
 phraseOccurrence correctly locates all occurrences of a phrase that has no edge cases.
 phraseOccurrence correctly locates all occurrences of a phrase that occurs on a single line.
 phraseOccurrence correctly locates all occurrences of a phrase.
 prefixOccurrence correctly locates all prefixes of the word, except the word itself.
 prefixOccurrence correctly locates all prefixes of the word.
 wordsOnLine correctly finds lines with more than one of the words that have no edge cases (e.g. no
punctuation adjacent to or within the phrase).
 wordsOnLine correctly finds lines with all the words that have no edge cases.
 wordsOnLine correctly finds lines with all the words.
 someWordsOnLine correctly finds lines with at least one of the words that have no edge cases.
 someWordsOnLine correctly finds lines with some of the words that have no edge cases.
 someWordsOnLine correctly implements “or” logic.
 wordsNotOnLine correctly finds lines with at least one of the required words and at least not one
of the excluded words that have no edge cases.
 wordsNotOnLine correctly finds lines with at least all the required words and none of the excluded
words that have no edge cases.
 wordsNotOnLine correctly implements “and/not” logic.
 simpleAndSearch correctly finds position of all words, in one section, that have no edge cases.
 simpleAndSearch correctly finds position of all words, in multiple sections, that have no edge cases.
 simpleAndSearch correctly finds position of all words, in multiple sections.
 simpleOrSearch correctly finds position of words, in one section, that have no edge cases.
 simpleOrSearch correctly finds position of words, in multiple sections, that have no edge cases.
 simpleOrSearch correctly finds position of words, in multiple sections.
 simpleNotSearch correctly finds position of all required words, in one section that does not have
any of the excluded words; where the words have no edge cases.
 simpleNotSearch correctly finds position of all required words, in multiple sections that do not
have any of the excluded words; where the words have no edge cases.
 simpleNotSearch correctly finds position of all required words, in multiple sections that do not
have any of the excluded words.
 compoundAndOrSearch correctly finds position of all required words, in one section that has at
least one of the orWords; where the words have no edge cases.
 compoundAndOrSearch correctly finds position of all required words, in multiple sections that
have at least one of the orWords; where the words have no edge cases.
 compoundAndOrSearch correctly finds position of all required words, in multiple sections that
have at least one of the orWords.
COMP3506/7505 Semester 2, 2018
Criteria Excellent Good Satisfactory Poor
Code Quality  Code is well structured with
excellent readability and clearly
conforms to a coding standard.
 Comments are clear, concise and
comprehensive, enhancing
comprehension of method usage
and implementation details.
 Code is well structured with good
readability and conforms to a coding
standard.
 Comments are clear and
comprehensive, mostly enhancing
comprehension of method usage
and implementation details.
 Code is well structured with good
readability and mostly conforms to a
coding standard.
 Comments are fairly clear and
comprehensive, providing some
useful information about method
usage or implementation details.
 Parts of the code are not well
structured or are difficult to read,
not conforming to a coding standard.
 Comments are often not clear or
comprehensive; or provide little
useful information about method
usage or implementation details.
4 3.5 3 2.5 2 1.5 1 0.5 0
Design  Data structures and algorithms are
very good choices in terms of
efficiency and are implemented in an
exemplary manner.
 Justification of design is valid, clear,
concise, articulate and well informed.
 Data structures and algorithms are
good choices in terms of efficiency
and are implemented with very few
inefficiencies or minor errors.
 Justification of design is valid, clear
and well informed.
 Data structures and algorithms are
appropriate choices in terms of
efficiency and are implemented with
few inefficiencies or minor errors.
 Justification of design is valid and
clear.
 Some data structures and algorithms
are inappropriate choices in terms of
efficiency or are implemented with
several inefficiencies or some errors.
 Justification of design is invalid or
unclear.
10 9 8 7 6 5 4 3 2 1 0
Analysis  Correctly determined and clearly
explained run-time and space
efficiency of every method and CDT.
 Correctly determined and clearly
explained run-time and space
efficiency of almost all methods and
all CDTs.
 Correctly determined and fairly
clearly explained run-time and space
efficiency of most methods and
CDTs.
 Correctly determined run-time and
space efficiency of some methods
and CDTs; or poorly explained how
efficiencies were determined.
4 3.5 3 2.5 2 1.5 1 0.5 0
COMP7505
Report
 Correct, clear and concise
description of approach.
 Valid, clear, concise and wellreasoned summary of alternatives.
 Correct run-time analysis of
approach with an insightful
comparison to pre-processing
leading to an informed conclusion.
 Correct and clear description of
approach.
 Valid and well-reasoned summary of
alternatives.
 Correct run-time analysis of
approach with an insightful
comparison to pre-processing.
 Correct and reasonably clear
description of approach.
 Valid and moderately well-reasoned
summary of alternatives.
 Correct run-time analysis of
approach with a fairly clear and valid
comparison to pre-processing.
 Incorrect or unclear description of
approach.
 Invalid or poorly reasoned summary
of alternatives.
 Incorrect run-time analysis of
approach, or an unclear or invalid
comparison to pre-processing.
8 7 6 5 4 3 2 1 0
COMP7505
Implemented
 Implementation of approach works
and is well designed.
 Implementation of approach works.  Implementation of approach works
but is awkward or inefficient.
 Implementation relies solely on
Serializable interface, or does not
work, or was not attempted.
2 1.5 1 0.5 0
Total: