CX 4010 / CSE 6010 Assignment 3: Graph Analytics

$30.00

Category: Tags: , , , , , You will Instantly receive a download link for .zip solution file upon Payment || To Order Original Work Click Custom Order?

Description

5/5 - (2 votes)

1. Background
Graphs arise in many applications in science, engineering, and the social sciences. Here, we are
particularly interested in a class of graphs known as scale-free networks. An important
characteristic of scale-free network is they include some number of hub nodes that contain a
relatively large number of links, but the vast majority of nodes contain relatively few links. The
number of links attached to a node is referred to as the node’s degree. Technically, a scale-free
network is defined as a network whose node degree distribution follows a power-law. Scale free
networks are interesting because there is empirical evidence that they arise in many applications
of current interest.
This assignment involves generating scale free network graphs and performing an analysis of the
maximum path length through the graph. The problem is divided into two parts. The first part
involves writing a program to generate a graph, storing it in a file, and considering if the
generated topology is scale-free. The second involves writing another separate program that
reads the graph file created by the graph generation program and computing the diameter of the
graph to evaluate how the maximum distance between nodes increases with network size.
You will work in teams of two persons each to attack this problem. For each team one person
will develop a program to generate a network graph and completing associated tasks. The second
person will develop a program that reads the graph file and performs an analysis of the graph.
The two programs should be completely separate “stand-alone” programs. The interface between
the two programs will be the format of the file specifying the graph. You and your partner must
define and document a common file format for the graph, and write up all of your results into a
single, combined report.
2. Part 1: Random Graph Generation
To complete this part of the assignment, write a program that creates a scale-free network graph,
stores the result into a file, and generates a histogram of the node degree distribution. Your
program should take three command line parameters. The first is the number of nodes in the
network to be generated. The second is the name of the file where the resulting topology is to be
written. The third is the name of another file where the histogram of the node degree distribution
is to be written. For example, the command line (in Unix):
% graphgen 2000 topology –h histogram
executes the program called “graphgen” (which you will write) to create a scale-free network
topology containing 2000 nodes, and writes the resulting topology into a file called topology.
The “-h histogram” part of the command line is optional. If specified, a histogram is created
and written into a file called histogram.
2
The network topology is constructed using a well-known principle called preferential attachment
[1]. The key idea behind preferential attachment is when a node is added to an existing network,
it is more likely to establish a link to another node that already has a high degree (a large number
of edges) compared to nodes with fewer edges. This construction mechanism is also referred to
as “the rich get richer” in the sense that the “richness” of a node is based on its node degree.
More precisely, when a new node is added to an existing network, it is attached via an edge to a
randomly selected node i with probability p(i). p(i) is defined as: d(i) / D where d(i) is the degree
of node i, and D is the sum of node degrees of all nodes in the existing network prior to adding
the new node.
The algorithm to generate the topology with N nodes is as follows:
1. Construct a fully connected topology containing n0 nodes, where n0 << N. A fully
connected network means there is a single edge between every pair of nodes (so a fully
connected network with n0 nodes has n0*(n0-1)/2 edges). For this assignment, assume n0
is equal to 3 nodes.
2. Assuming the network now contains k nodes (labeled 0, 1, 2, … k-1) add one additional
node k and one edge between k and one randomly selected node j with probability p(j) as
defined above using the preferential attachment rule.
3. Repeat the previous step until the network contains N nodes.
Your program must include at least three functions:
1. A function that creates the network and stores it in memory
2. A function that outputs a histogram of the node degree distribution for the created
network to a file; each line of the file should contain three values: i, num, list,
where i is the node degree (1, 2, 3, …), num is the number of nodes in your network
with degree i, and list is a list of nodes with degree i. The first line of the generated
file should have information for nodes of degree 1, the second line has information for
nodes of degree 2, etc.
3. A function that writes the network topology stored in memory into a file in the format
agreed to with your partner.
You should complete an analysis of the topologies generated by your program and argue that the
created network is indeed scale-free. In this plot, the horizontal axis is the node degree, and the
vertical axis indicates the number of times a node with that degree appeared in the network you
generated. An easy way to qualitatively determine if the node-degree histogram follows a power
law is to create a log-log plot of the node degree histogram, i.e., plot log(d) on the horizontal
axis, and log(c) on the vertical axis, where d is the degree, and c is a count of the number of
nodes with degree d. If the distribution follows a power law, the plot should produce
(approximately) a straight line. Do this for as large a network as your program can generate in a
reasonable amount of time, say several minutes.
In addition to documenting your software you will need to define the format of the file
containing the graph. This file format defines the interface between the graph generation and the
graph analysis program. The file format should be clearly documented in your report.
3
3. Part 2: Network Analysis
We are interested in understanding how the path length between vertices grows as the size of the
scale-free network grows. The metric of interest here is called the network diameter. Diameter is
defined as follows. A path from a source to a destination vertex is a sequence of edges leading
from the source to the destination. The length of this path is the sum of the weights assigned to
the edges making up the path. The distance between any two vertices in a graph is the length of
the shortest path between those two vertices. The diameter of the graph is defined as the
maximum distance among all pairs of vertices in the graph. The objective of this part of the
assignment is to understand how the graph diameter grows as N is increased in a scale free
network with N vertices. Here, you will assume that each edge of the network is bidirectional,
and has weight (length) of 1.
The graph analysis program reads a file containing the graph topology (produced by the graph
generation program) and computes the diameter of the graph. To compute the diameter, use a
breadth-first search algorithm to compute the minimum length path between a source vertex s
and every other vertex in the network. Write your program to be as efficient as possible. You
may reuse code, e.g., a FIFO queue, that you or your partner developed in the previous
assignment, but be sure to cite the source of this software in your report and through a comment
in the software itself.
Use the following command line to execute your program:
% analysis topology –o output
The first argument specifies the name of the file that holds the topology of the network you are
analyzing. The second (-o output) is optional. If specified, it indicates the name of the file
your program will create to hold the output. The output file should provide the following
information:
• The first line indicates the diameter of the network that was analyzed
• The next N lines contain information on each of the N nodes. Specifically, each line
should contain (1) the node number s, and (2) the maximum distance from node s to any
other node in the network, and (3) a list of the nodes that are this maximum distance from
node s.
If the output file is not specified in the command line via the –o flag, the program should simply
print the computer network diameter to the screen.
The report for this part of the assignment should plot the size of the network diameter as a
function of N. Use as large a value of N as your program can process in a reasonable amount of
time (e.g., a few minutes).
4. CX 4010 Students
Students enrolled in CX 4010 must complete all of the tasks described above to create the two
programs. A report should describe both the software as well as the results produced by the
software, as discussed above, as well as your analyses. Although there are two distinct parts to
this assignment, your report should be written as one document, not two separate documents. Be
sure to include proper citations and a bibliography in your report.
4
5. CSE 6010 Students
For graduate students, in addition to the above:
1. Perform a literature search on scale free networks, focusing on the applications where
scale-free networks are believed to arise. The network need not exactly match the
preferential attachment model, but there should be empirical evidence suggesting the
topology has properties such as hub and leaf nodes. What do the nodes and links
represent, and what kinds of questions might a graph analysis answer? Write up your
findings in the report. You should discuss at least three different applications involving
scale-free networks and provide some intuition as to why they might be scale-free. This
part should be completed by the individual who wrote the graph generation program.
2. Search the literature for information on the expected diameter of scale free networks.
Compare your results with what is expected, as described in the literature. Write up your
findings in the report.
6. CX 4010 Students (Extra Credit)
Complete the assignment required for CSE 6010 students described above.
7. Reminder: Collaboration Policy
A reminder you must adhere to the Georgia Tech honor code, and the collaboration policy stated
in the course syllabus. Specifically, you are encouraged to discuss the problem and possible
solutions with other students (as well as the TA/instructor), however, all code that you turn in
must be completely your own work, except as noted above (in which case you must clearly
document the source of the code). Disseminating your code to other students (not in your team)
is strictly prohibited. Further, downloading code from the web or other sources other than
examples provided in the class is also prohibited.
References
1. Barabasi, A.L. and R. Albert, Emergence of Scaling in Random Networks. Science, 1999.
286(5439): p. 509-512.