Assignment I – Part 2 CS6370: Natural Language Processing


The goal of the assignment is to build a search engine from scratch, which is an example
of an information retrieval system. In class, we have seen the various modules that serve
as the building blocks of a search engine. The first part of the assignment involved building
a basic text processing module that implements sentence segmentation, tokenization,
stemming/lemmatization and stopword removal. This part involves implementing an
Information Retrieval system using the Vector Space Model. The same dataset as in Part 1
(the Cranfield dataset) will be used for this purpose.
1. Now that the Cranfield documents are pre-processed, our search engine needs a data
structure to facilitate 'matching' a query to its relevant documents. Let's work out a
simple example. Consider the following three sentences:
S1 Herbivores are typically plant eaters and not meat eaters
S2 Carnivores are typically meat eaters and not plant eaters
S3 Deers eat grass and leaves
Assuming {are, and, not} as stop words, arrive at an inverted index representation for the
above documents (treat each sentence as a separate document).
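As a concrete illustration (a worked-example sketch, not the required template code), an inverted index for such a toy corpus can be built with a plain Python dictionary mapping each term to the set of documents containing it:

```python
# Minimal sketch: an inverted index for the three toy documents above,
# with {are, and, not} removed as stop words.
STOPWORDS = {"are", "and", "not"}

docs = {
    "S1": "Herbivores are typically plant eaters and not meat eaters",
    "S2": "Carnivores are typically meat eaters and not plant eaters",
    "S3": "Deers eat grass and leaves",
}

inverted_index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        if term not in STOPWORDS:
            inverted_index.setdefault(term, set()).add(doc_id)

for term in sorted(inverted_index):
    print(term, "->", sorted(inverted_index[term]))
```

For question 3 below, the postings sets for 'plant' and 'eaters' can then be intersected (Boolean AND) or unioned (Boolean OR), depending on the retrieval semantics you assume.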
2. Next, we proceed to finding a vector representation for the text documents. In class,
we saw the TF-IDF measure. What would be the TF-IDF vector representations for the
above documents? State the formula used.
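For reference, one standard weighting scheme (variants with log-scaled term frequency or a different log base are equally acceptable) is tf-idf(t, d) = tf(t, d) x log(N / df(t)). A minimal sketch computing it for the toy corpus:

```python
import math

# Hedged sketch of one common TF-IDF variant: raw term count times
# log(N / document frequency). Other tf/idf variants are also valid.
STOPWORDS = {"are", "and", "not"}
docs = {
    "S1": "Herbivores are typically plant eaters and not meat eaters",
    "S2": "Carnivores are typically meat eaters and not plant eaters",
    "S3": "Deers eat grass and leaves",
}

tokenized = {d: [t for t in text.lower().split() if t not in STOPWORDS]
             for d, text in docs.items()}
vocab = sorted({t for tokens in tokenized.values() for t in tokens})
N = len(docs)
df = {t: sum(t in tokens for tokens in tokenized.values()) for t in vocab}

tfidf = {d: [tokens.count(t) * math.log(N / df[t]) for t in vocab]
         for d, tokens in tokenized.items()}
```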
3. Suppose the query is “plant eaters”, which documents would be retrieved based on the
inverted index constructed before?
4. Find the cosine similarity between the query and each of the retrieved documents. Rank
them in descending order.
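Cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms. A minimal self-contained sketch (the zero-norm guard matters when a query shares no terms with the vocabulary):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:  # guard: empty/all-zero vector
        return 0.0
    return dot / (norm_u * norm_v)
```

The query "plant eaters" would be embedded in the same vocabulary space (e.g. with the same TF-IDF weighting) before being compared against each retrieved document's vector.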
5. Is the ranking given above the best?
6. Now, you are set to build a real-world retrieval system. Implement an Information
Retrieval System for the Cranfield Dataset using the Vector Space Model.
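One possible shape for the retrieval core (a sketch under the assumption that scikit-learn is acceptable; the provided template may prescribe its own interface, which takes precedence):

```python
# Hedged sketch of a vector-space retrieval core using scikit-learn.
# `documents` is a list of strings and `query` a plain string; the
# assignment template's interface may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_documents(documents, query):
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(documents)      # docs x terms
    query_vec = vectorizer.transform([query])             # 1 x terms
    scores = cosine_similarity(query_vec, doc_matrix)[0]  # one score per doc
    # Indices of documents, ranked by descending similarity to the query.
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
```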
7. (a) What is the IDF of a term that occurs in every document?
(b) Is the IDF of a term always finite? If not, how can the formula for IDF be modified to
make it finite?
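For reference, a standard IDF formulation and one common smoothed variant are (conventions differ across textbooks):

```latex
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{idf}_{\mathrm{smoothed}}(t) = \log \frac{N}{1 + \mathrm{df}(t)}
```

where N is the total number of documents and df(t) is the number of documents containing term t.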
8. Can you think of any similarity/distance measure other than cosine similarity that can
be used to compare vectors? Justify why it is a better or worse choice than cosine
similarity for IR.
9. Why is accuracy not used as a metric to evaluate information retrieval systems?
10. For what values of α does the Fα-measure give more weightage to recall than to precision?
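For reference, under van Rijsbergen's parameterization (other texts use the equivalent F_beta form, related by beta^2 = (1 - alpha)/alpha):

```latex
F_{\alpha} = \frac{1}{\frac{\alpha}{P} + \frac{1-\alpha}{R}}
```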
11. What is a shortcoming of the Precision @ k metric that is addressed by Average
Precision @ k?
12. What is Mean Average Precision (MAP) @ k? How is it different from Average Precision
(AP) @ k?
13. For the Cranfield dataset, which of the following two evaluation measures is more
appropriate, and why? (a) AP (b) nDCG
14. Implement the following evaluation metrics for the IR system:
(a) Precision @ k
(b) Recall @ k
(c) F-Score @ k
(d) Average Precision @ k
(e) nDCG @ k
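As one illustration of the expected shape (a sketch, not the template's prescribed signatures: here `retrieved` is assumed to be a ranked list of document IDs and `relevant` a set of IDs):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def average_precision_at_k(retrieved, relevant, k):
    """Average of the precision values at each rank (within the top k)
    where a relevant document appears. Normalization conventions vary;
    dividing by the number of relevant documents, capped at k, is one
    common choice."""
    hits, precision_sum = 0, 0.0
    for rank, d in enumerate(retrieved[:k], start=1):
        if d in relevant:
            hits += 1
            precision_sum += hits / rank
    if not relevant:
        return 0.0
    return precision_sum / min(len(relevant), k)
```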
15. Assume that for a given query, the set of relevant documents is as listed in cran_qrels.json.
Any document with a relevance score of 1 to 4 is considered relevant. For each query
in the Cranfield dataset, find the Precision, Recall, F-score, Average Precision and nDCG
scores for k = 1 to 10. Average each measure over all queries and plot it as a function of
k. Code for plotting is part of the given template; you are expected to use it.
Report the graph along with your observations based on it.
16. Analyse the results of your search engine. Are there some queries for which the search
engine’s performance is not as expected? Report your observations.
17. Do you find any shortcoming(s) in using a Vector Space Model for IR? If yes, report them.
18. While working with the Cranfield dataset, we ignored the titles of the documents. However,
titles can be extremely informative in information retrieval, sometimes even more so
than the body. State a way to include the title while representing the document as
a vector. What if we want to weigh the contribution of the title three times that of the
body?
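One simple option worth knowing here (a sketch of one possibility, not the only acceptable answer): concatenate the title with the body before vectorization; under term-frequency-based weighting, repeating the title k times multiplies its term counts, and hence its contribution, by k:

```python
def document_text(title, body, title_weight=1):
    """Merge title and body into one string for vectorization.
    Repeating the title `title_weight` times scales its term counts
    (and hence its tf contribution) by that factor; IDF is unaffected."""
    return " ".join([title] * title_weight + [body])

# Weigh the title three times as much as the body:
# text = document_text(doc_title, doc_body, title_weight=3)
```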
19. Suppose we use bigrams instead of unigrams to index the documents. What would be
the advantage(s) and/or disadvantage(s)?
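For concreteness, bigram indexing units are simply pairs of adjacent tokens:

```python
def bigrams(tokens):
    """Adjacent token pairs from a token list."""
    return list(zip(tokens, tokens[1:]))

# bigrams(['herbivores', 'typically', 'plant', 'eaters'])
# -> [('herbivores', 'typically'), ('typically', 'plant'), ('plant', 'eaters')]
```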
20. In the Cranfield dataset, we have relevance judgements given by domain experts.
In the absence of such relevance judgements, can you think of a way in which we can
get relevance feedback from the users themselves? Ideally, we would like to keep the
feedback process non-intrusive to the user. Hence, think of an 'implicit' way of
recording feedback from the users.
Submission Instructions:
1. The template for the code (in Python) is provided in a separate zip file, and you are
expected to fill in the template wherever instructed to. Note that any Python library, such
as nltk, stanfordcorenlp, spacy, etc., can be used.
2. A zip file named 'RollNo1_RollNo2.zip' that contains the code folder (named 'code')
and a PDF of the answers to the above questions must be uploaded on Moodle by
02/04/2021.
3. Please include the names and roll numbers of each member of the team in the document.
4. All sources of material must be cited. The institute’s academic code of conduct will be
strictly enforced.