BBM 204: Software Laboratory II ASSIGNMENT 3




5/5 - (5 votes)

Subject : Minimum Spanning Trees
1 Introduction
Graphs are widely used in Computer Science for a wide range of subjects. These subjects may
include Natural Language Processing, Image Processing, Pattern Recognition, Speech Recognition,
Bioinformatics etc.
Graphs provide a natural way of constructing relationships between objects; thereby it is a natural
way of learning by using this relationship between objects. These objects may be genes in a dna,
phonemes in a speech signal, pixels in an image, words in a text (see Figure 1), and so on.
Depending on the type of the problem, everything can be modelled by using the graph theory.
There are many operations defined on graphs. You will practice Cosine Similarity and Minimum
Spanning Trees in this assignment:
The cosine similarity between two vectors (or two documents on the Vector Space) is a measure
that calculates the cosine of the angle between them.
A Minimum Spanning Tree is a subgraph of a given graph which has the minimum sum of the
weights on the edges in the spanning tree. Minimum Spanning Trees have a wide range of application fields. For example, they are widely used in the design of the computer networks or
telecommunication networks, handwriting recognition, image segmentation, clustering (see Figure
2), and so on.
Figure 1: A word graph [3]
2 Problem Definition
In this programming assignment, you will practice on one application field of graphs. The field
that you will apply graphs is Natural Language Processing. You are expected to measure semantic
similarity between words by using the word vectors. Subsequently, you will find word clusters that
bear semantically similar words in each of them.
Page 1 of 6
BBM 204: Software Laboratory II
Semantic similarity is a concept that measure how similar two words/documents/terms/concepts/senses
are in meaning. For example, words ’blue’, ’red’, ’yellow’ are semantically similar, whereas ’book’
is no semantically similar to these words. However, book is semantically similar to ’notebook’. In
Figure 3, a graph which is constructed by the Flickr tags is given (see Figure 4 for visual images
for the given tags). You can see that semantically similar words are located close to each other on
the graph, whereas semantically different words are quite distant from each other.
Figure 2: An example of minimum spanning tree clustering [4]
Another point that you need to observe on the graph is that, vertices in different colours refer to
different clusters. Moreover, these clusters bear semantically similar words.
In this assignment, using the word vectors provided in assignment you should build a graph. Movie
review text is taken as input and word vector is produced. For each word you should assign edge
weights equal to the cosine similarity. You can calculate cosine similarity by using formula which
is shown below.
Figure 3: Disambiguation of Flickr Tags
Once the initial graph is constructed as described, you will use your graph for practicing two
operations: a) measuring semantic similarity, b) finding clusters.
2.1 Measuring the Semantic Similarity Between Words
You will use your graph to measure semantic similarity between words. There are several ways to
measure semantic similarity between words. However, you will use the cosine similarity:
Page 2 of 6
BBM 204: Software Laboratory II
cos_sim(w1, w2) = ~w1. ~w2
||w1||.||w2|| (1)
In this equation w1 and w2 represents different word vectors. In the word vector file, each line
represents a different word vector.
Figure 4: Disambiguation of Flickr Tags
2.2 Clustering
In this part of the assignment, you will use minimum spanning trees on your graph to find word
clusters (see Figure 2). As indicated before, minimum spanning trees can be used for clustering
data. Once the minimum spanning tree on a given graph is found,k clusters can be obtained by
removing k-1 edges that have the minimum weights on the minimum spanning tree. Therefore, k
clusters are obtained. Each of these clusters are expected to have semantically similar words.
Since your initial graph is not weighted, you will construct a weighted graph before finding the
minimum spanning tree. In order to create a weighted graph, you will remove all the edges from
your initial graph, and you will add only the edges between the word pairs that you measured
the semantic similarities in the first step. Once you have a weighted graph, you can find the
minimum spanning tree and cut k-1 edges that have the minimum weights on the graph to obtain
k clusters.
The number of clusters will be provided as a parameter. Therefore, the number of clusters to be
obtained will be fixed initially.
3 Input Output And Testing
You will have two input files and one output files in this assignment. All of the file names will be
provided to the program as command line arguments.
• word vector file: A word vector is given in this file. In this word vector file, each line
represents a vector of unique word in the text. That is every word and it’s vector are given
in different lines. A sample word vectors are given in Figure 5.
• word pairs file: A word pairs list is given in this file. Your program is expected to measure
the semantic similarities between the word pairs. A sample word-pairs file is given in Figure
Page 3 of 6
BBM 204: Software Laboratory II
Figure 5: A sample word vectors
Figure 6: A sample of word pairs file
Figure 7: A sample cluster for k=2
• cluster file: Finally, your program will produce the minimum spanning tree and produce
the clusters out of this spanning tree. Contents of each cluster will be written to this file.
Cluster members will be written in alphabetical order (in increasing order) and clusters
will be written according to the number of members in each cluster in increasing order (i.e.
cluster that has the minimum number of members will be written first). Cluster members
must be delimited by commas. A sample clusters file is given in Figure 7.
4 Execution of the Program
Your implementation should run with following command:
java -java wordVec wordpairs clusters
• wordVec is the name of the file that has the dictionary,
• wordpairs is the name of the file that has word pairs that your program will measure how
similar they are,
Page 4 of 6
BBM 204: Software Laboratory II
• clusters is the name of the file that the contents of each cluster will be written to,
• numberofclusters is the number of clusters that your program has to find by using the
minimum spanning tree.
Please keep in mind that all of the names of the files will be taken as command line arguments
and will not be fixed names, otherwise your program will not be able to run with other file
A screenshot of the terminal that shows how to run your program is given above.
You can test your program by creating your own word_pairs.
• Do not miss the deadline.
• Save all your work until the assignment is graded.
• The assignment must be original, individual work. Duplicate or very similar assignments
are both going to be considered as cheating.
• You can ask your questions via Piazza (
and you are supposed to be aware of everything discussed in Piazza.
• You will submit your work from with the file
hierarchy as below:
This file hierarchy must be zipped before submitted (Not .rar, only .zip files are supported by the

→ src
→ *.java
→ readme.txt (optional)
Page 5 of 6
BBM 204: Software Laboratory II
All work on assignments must be done individually unless stated otherwise. You are encouraged
to discuss with your classmates about the given assignments, but these discussions should be
carried out in an abstract way. That is, discussions related to a particular solution to a specific
problem (either in actual code or in the pseudocode) will not be tolerated. In short, turning
in someone else’s work (from internet), in whole or in part, as your own will be considered as
a violation of academic integrity. Please note that the former condition also holds for the
material found on the web as everything on the web has been written by someone else.
[1] Hyunjong Cho and Viet-An Nguyen. Flickr tags disambiguation, February 2011.
[2] Jay J. Jiang and David W. Conrath. Semantic similarity based on corpus statistics and lexical
taxonomy. In in Proceedings of International Conference Research on Computational Linguistics
(ROCLING X), 1997.
[3] Hoang Vu-Thien, Katia Hormigos, Gaelle Corbineau, Brigitte Fauroux, Harriet Corvol, Didier
Moissenet, Gilles Vergnaud, and Christine Pourcel. Longitudinal survey of staphylococcus aureus in cystic fibrosis patients using a multiple-locus variable-number of tandem-repeats analysis
method. BMC Microbiology, 10(1), 2010.
Page 6 of 6