ASSIGNMENT 4 COMP 550 Multi-document Summarization

Question 1: Multi-document Summarization (60 points)
This question asks you to implement a simple, but surprisingly effective algorithm for multi-document
summarization, SumBasic:
Ani Nenkova and Lucy Vanderwende. The Impact of Frequency on Summarization. Microsoft Research,
Redmond, Washington, Tech. Rep. MSR-TR-2005-101. 2005.
http://www.cs.bgu.ac.il/~elhadad/nlp09/sumbasic.pdf
a) Use a news aggregator tool such as Google News to find four clusters of articles on the same event or
topic. Each cluster should contain at least three articles, and each article should be of sufficient length
to generate an interesting summary from (at least 3–4 paragraphs).
You should clean the article text by removing all hyperlinks, formatting, titles and other items that are
not the textual body of the articles. Use any method to do this (including by hand). You may have to
deal with non-ASCII characters. You can handle them any way you like, including just replacing them
by a similar-looking ASCII character. Save your input into text files called docA-B.txt, where A is a
positive integer corresponding to the cluster number, and B is another positive integer corresponding to
the article number within that cluster. For example doc1-2.txt is the second article in the first cluster.
Put all of your documents inside a subfolder called /docs.
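One possible way to handle the non-ASCII characters and the docA-B.txt naming scheme is sketched below. The helper names (to_ascii, group_by_cluster) and the small punctuation table are illustrative assumptions, not part of the assignment; any other cleaning method, including cleaning by hand, is equally acceptable.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical replacement table for common typographic characters.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})

# Matches names such as doc1-2.txt (cluster 1, article 2).
DOC_NAME = re.compile(r"doc(\d+)-(\d+)\.txt$")


def to_ascii(text):
    """Map typographic punctuation to ASCII, strip accents, drop the rest."""
    text = text.translate(PUNCT_MAP)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def group_by_cluster(paths):
    """Group file paths by cluster number, ordered by article number."""
    clusters = defaultdict(list)
    for path in paths:
        match = DOC_NAME.search(path)
        if match:
            cluster, article = int(match.group(1)), int(match.group(2))
            clusters[cluster].append((article, path))
    return {c: [p for _, p in sorted(items)] for c, items in clusters.items()}
```

Characters that have no ASCII decomposition and are not in the table are silently dropped, so it may be worth inspecting the cleaned files before summarizing them.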
b) Implement SumBasic, as it is described in the lecture notes, in order to generate 100-word summaries
for each of your document clusters. Compare these two versions of the system:
1. orig: The original version, including the non-redundancy update of the word scores.
2. simplified: A simplified version of the system that holds the word scores constant and does not
incorporate the non-redundancy update.
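The difference between the two versions can be illustrated with a minimal sketch. It follows the steps in the Nenkova and Vanderwende report (unigram probabilities, average-probability sentence scores, squaring the probabilities of words in the chosen sentence); the lecture notes may differ in details, and they take precedence. The function name, the (text, tokens) input format, the alphabetical tie-breaking, and the rule of discarding a sentence that would exceed the limit are all assumptions made here for illustration.

```python
from collections import Counter


def sumbasic(sentences, limit=100, update=True):
    """Select sentences SumBasic-style.

    sentences: list of (original_text, content_tokens) pairs, where the
    tokens are already lemmatized, lowercased, and stopword-free.
    update=True mimics 'orig'; update=False mimics 'simplified'.
    """
    counts = Counter(tok for _, toks in sentences for tok in toks)
    total = sum(counts.values())
    prob = {word: count / total for word, count in counts.items()}

    def score(tokens):
        # Sentence weight: average probability of its content words.
        return sum(prob[t] for t in tokens) / len(tokens)

    remaining = [s for s in sentences if s[1]]
    summary, length = [], 0
    while remaining and length < limit:
        # Highest-probability word still present in an unused sentence
        # (sorted() makes tie-breaking deterministic).
        vocab = sorted({t for _, toks in remaining for t in toks})
        top_word = max(vocab, key=lambda w: prob[w])
        pool = [s for s in remaining if top_word in s[1]]
        best = max(pool, key=lambda s: score(s[1]))
        remaining.remove(best)
        n_words = len(best[0].split())
        if length + n_words > limit:
            continue  # assumption: skip sentences that do not fit
        summary.append(best[0])
        length += n_words
        if update:
            # Non-redundancy update: square the probability of used words.
            for tok in set(best[1]):
                prob[tok] = prob[tok] ** 2
    return summary
```

With update=False the word probabilities never change, so sentences that share the most frequent words keep being selected, which is the redundancy that the update is meant to reduce.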
Compare these versions against a third method, leading, which takes the leading sentences of one of
the articles, up until the word length limit is reached. You may decide on how to select the article
arbitrarily.
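A minimal sketch of the leading baseline follows. It assumes the chosen article has already been split into sentences, and it stops before the first sentence that would exceed the limit; whether to truncate that sentence instead is not specified in the assignment, so this is one possible reading.

```python
def leading(sentences, limit=100):
    """Take sentences from the start of one article until the limit is hit."""
    summary, length = [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if length + n_words > limit:
            break  # assumption: never cut a sentence in the middle
        summary.append(sentence)
        length += n_words
    return " ".join(summary)
```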
You should apply the standard preprocessing steps on your input documents, including sentence segmentation, lemmatization, and ignoring stopwords and case distinctions. The main method that runs
your code should be in a file called sumbasic.py. Your code should be run using the following command
structure:
python sumbasic.py *
And it should print the output summary to standard output.
For example, running
python ./sumbasic.py simplified ./docs/doc1-*.txt > simplified-1.txt
should run the simplified version of the summarizer on the first cluster, writing the output to a text file
called simplified-1.txt.
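Judging from the example above, the script receives a method name followed by one or more file paths (the shell expands the wildcard before Python sees it). The argument handling could be sketched as follows; the function names and the use of argparse are assumptions, and the method names are the three used in this assignment.

```python
import argparse

METHODS = ("orig", "simplified", "leading")


def parse_args(argv=None):
    """Parse: sumbasic.py METHOD FILE [FILE ...]"""
    parser = argparse.ArgumentParser(description="Multi-document summarizer")
    parser.add_argument("method", choices=METHODS, help="summarization method")
    parser.add_argument("files", nargs="+", help="cleaned article text files")
    return parser.parse_args(argv)


def read_documents(paths):
    """Return the text of each input file, in the order given."""
    texts = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            texts.append(handle.read())
    return texts
```

The summary itself would then be written with print, so that redirecting standard output (as in the example) produces the output file.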
c) Discuss the quality of each of the three methods. Does the non-redundancy update work as expected?
How are the methods successful or not successful? How would you order the summary sentences with
the SumBasic methods, or another extractive summarization approach? Be sure to cover all aspects of
summary quality that we discussed in class.
Question 2: Reading Assignment — Abstractive Summarization (40 points)
Read the following paper:
Trevor Cohn and Mirella Lapata. Sentence Compression Beyond Word Deletion. COLING 2008.
http://www.aclweb.org/anthology/C08-1018.pdf
Write a max. one-page (c. 500 words) discussion on this paper, including the following points:
1. A brief summary of the contents of the paper, including the theoretical framework and the experiments.
2. The limitations of the approach. What do you suggest to address these limitations?
3. Relate this paper to the following concepts that we have discussed throughout the term: context-free
grammar, natural language generation and extractive summarization.
4. Three questions related to the paper. These can be clarification questions, or questions about
potential extensions of the paper, or its relationship to other work.
What To Submit
Before the deadline: You are encouraged to do a preliminary pass of the reading before November 30th,
when we will discuss the paper in class.
Electronically: Submit the written portions of the assignment in a single pdf file called ‘a4-written.pdf’.
For the programming part of Question 1, you should submit one zip file called ‘a4-q1.zip’ with your source
code, input document clusters, and output summaries. Both should be submitted to MyCourses under
Assignment 4.