Assignment 2: Retrieval Algorithm and Evaluation

Z534: Search
Task 1: Implement your first search algorithm
Based on the Lucene index, we can start to design and implement efficient retrieval
algorithms. Let's start with an easy one. Please implement the following ranking
function using the Lucene index we provided through Canvas (index.zip):
$$f(q, doc) = \sum_{t \in q} \frac{c(t, doc)}{length(doc)} \cdot \log\left(1 + \frac{N}{df(t)}\right)$$

where $q$ is the user query, $doc$ is the target (candidate) document in AP89, $t$ is a query
term, $c(t, doc)$ is the count of term $t$ in document $doc$, $N$ is the total number of documents in
AP89, and $df(t)$ is the number of documents that contain the term $t$. Please use the Lucene
API to get this information. From a retrieval viewpoint, $c(t, doc)/length(doc)$ is called the normalized TF
(term frequency), while $\log(1 + N/df(t))$ is the IDF (inverse document frequency).
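For intuition, here is a worked example of a single term's contribution, with made-up numbers (illustrative only, not AP89 statistics) and the natural logarithm: if $c(t, doc) = 3$, $length(doc) = 100$, $N = 10000$, and $df(t) = 100$, then

$$\frac{3}{100} \cdot \log\left(1 + \frac{10000}{100}\right) = 0.03 \cdot \log(101) \approx 0.14$$

Summing such contributions over all query terms gives $f(q, doc)$.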
The following code (using the Lucene API) may help you implement the ranking
function:
import java.io.File;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

// Get the preprocessed query terms (queryString is assumed to be defined elsewhere)
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser("TEXT", analyzer);
Query query = parser.parse(queryString);
Set<Term> queryTerms = new LinkedHashSet<Term>();
query.extractTerms(queryTerms);
for (Term t : queryTerms) {
    System.out.println(t.text());
}

// Open the index (pathToIndex is assumed to be defined elsewhere)
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(pathToIndex)));

// Use DefaultSimilarity.decodeNormValue(...) to decode the normalized document length
DefaultSimilarity dSimi = new DefaultSimilarity();

// Get the segments of the index
List<AtomicReaderContext> leafContexts = reader.getContext().reader().leaves();
for (int i = 0; i < leafContexts.size(); i++) {
    AtomicReaderContext leafContext = leafContexts.get(i);
    int startDocNo = leafContext.docBase;
    int numberOfDoc = leafContext.reader().maxDoc();
    for (int docId = startDocNo; docId < startDocNo + numberOfDoc; docId++) {
        // Get the normalized length for each document
        float normDocLeng = dSimi.decodeNormValue(
                leafContext.reader().getNormValues("TEXT").get(docId - startDocNo));
        System.out.println("Normalized length for doc(" + docId + ") is " + normDocLeng);
    }

    // Get the term frequency of "new" within each document containing it for the field TEXT
    DocsEnum de = MultiFields.getTermDocsEnum(leafContext.reader(),
            MultiFields.getLiveDocs(leafContext.reader()),
            "TEXT", new BytesRef("new"));
    if (de != null) { // the term may be absent from this segment
        int doc;
        while ((doc = de.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
            System.out.println("\"new\" occurs " + de.freq() + " times in doc("
                    + (de.docID() + startDocNo) + ") for the field TEXT");
        }
    }
}
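The snippet above covers the normalized document length and the term frequency; the IDF component additionally needs $N$ and $df(t)$. A minimal sketch of obtaining them with the same Lucene 4.x API (continuing from the reader above, using the query term "new" as an example):

// Total number of documents in the index (N in the ranking function)
int N = reader.maxDoc();
// Number of documents containing the term (df(t) in the ranking function)
int df = reader.docFreq(new Term("TEXT", "new"));
// IDF component of the ranking function
double idf = Math.log(1 + (double) N / df);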
For each given query, your code should be able to:
1. Parse the query using StandardAnalyzer (important: we need to use the SAME Analyzer for parsing the query that we used for indexing);
2. Calculate the relevance score for each query term; and
3. Calculate the overall relevance score $f(q, doc)$.
The code for this task should be saved in a Java class: easySearch.java
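Below is a minimal, hedged sketch of how these three steps might fit together in easySearch.java. It reuses the Lucene 4.x calls from the snippet above (plus java.util.HashMap/Map and org.apache.lucene.index.NumericDocValues), follows the handout in treating decodeNormValue(...) as the normalized document length, and is a starting point rather than a definitive implementation:

// Sketch: returns a map from global document ID to f(q, doc).
public static Map<Integer, Float> score(IndexReader reader, String queryString)
        throws Exception {
    Map<Integer, Float> scores = new HashMap<Integer, Float>();
    // Step 1: parse the query with the SAME analyzer used at indexing time
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser("TEXT", analyzer);
    Query query = parser.parse(queryString);
    Set<Term> queryTerms = new LinkedHashSet<Term>();
    query.extractTerms(queryTerms);

    int N = reader.maxDoc();
    DefaultSimilarity dSimi = new DefaultSimilarity();

    for (Term t : queryTerms) {
        // Step 2: per-term score = normalized TF * IDF
        int df = reader.docFreq(t);
        if (df == 0) continue; // term does not occur in the collection
        double idf = Math.log(1 + (double) N / df);
        for (AtomicReaderContext leaf : reader.leaves()) {
            DocsEnum de = MultiFields.getTermDocsEnum(leaf.reader(),
                    MultiFields.getLiveDocs(leaf.reader()), "TEXT", t.bytes());
            if (de == null) continue; // term absent from this segment
            NumericDocValues norms = leaf.reader().getNormValues("TEXT");
            int d;
            while ((d = de.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
                float docLen = dSimi.decodeNormValue(norms.get(de.docID()));
                float termScore = (float) ((de.freq() / docLen) * idf);
                // Step 3: accumulate per-term scores into f(q, doc)
                int globalId = leaf.docBase + de.docID();
                Float old = scores.get(globalId);
                scores.put(globalId, (old == null ? 0f : old) + termScore);
            }
        }
    }
    return scores;
}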
Task 2: Test your search function with TREC topics
Next, we need to test the search performance with the standardized TREC topic
collections. You can download the query test topics from Canvas (topics.51-100).
In this collection, TREC provides 50 topics, which can be used as candidate
queries for search tasks. For example, one TREC topic is:

<top>
<head> Tipster Topic Description
<num> Number: 054
<dom> Domain: International Economics
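A minimal sketch for extracting topic numbers and titles from topics.51-100 to use as queries (hedged: this assumes the standard TREC SGML layout in which the number and title each appear on a single line tagged <num> and <title>; verify the tag spellings against the downloaded file):

import java.io.BufferedReader;
import java.io.FileReader;

// Read topics.51-100 and print (topic number, title) pairs.
BufferedReader br = new BufferedReader(new FileReader("topics.51-100"));
String line;
String num = null;
while ((line = br.readLine()) != null) {
    if (line.startsWith("<num>")) {
        num = line.replace("<num>", "").replace("Number:", "").trim();
    } else if (line.startsWith("<title>")) {
        String title = line.replace("<title>", "").replace("Topic:", "").trim();
        System.out.println(num + " -> " + title); // the title becomes the query string
    }
}
br.close();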