Description
Identify which type of database/data processing system you would choose (Key-Value store, Column-oriented store, Document-oriented store, Graph database, Relational database, Streaming engine) in each scenario below.
- Highly structured multi-table data that requires enforcing data constraints.
- Stock market ticker data where decisions must be made in real time.
- LinkedIn-type data with interconnected nodes, where much of the information resides in the links between nodes.
- An image storage system that allows looking up images by file name.
- A collection of JSON objects (e.g., tweets).
- Data that is stored in large sparse tables that are continuously growing (new rows/columns).
- Consider the following graph:
Compute the page rank for the nodes in this graph. If you are multiplying matrices manually, you may stop computing after 6 steps. If you use a tool (e.g., Matlab, Python) for the matrix multiplication, you should get your answer to converge. (A minimal power-iteration sketch in Python follows these page rank questions.)
- Now consider a graph with dead-end nodes Q and P:
What is the page rank of Q?
What is the page rank of P?
- Exercise 5.1.6 from Mining of Massive Datasets
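If you choose to use a tool for the matrix multiplication, the minimal Python/NumPy sketch below shows the power-iteration form of page rank with teleportation. The 3-node transition matrix M is a made-up example, not the graph from this assignment; substitute the matrix for your own graph (for a dead-end node such as Q or P, the corresponding column of M would be all zeros).

# Page rank power-iteration sketch. NOTE: the matrix M below is a hypothetical
# 3-node example, NOT the graph from this assignment.
import numpy as np

beta = 0.8                       # damping factor, i.e., probability of following a link
M = np.array([[0.0, 0.5, 1.0],   # column j holds the out-link probabilities of node j
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
n = M.shape[0]
v = np.ones(n) / n               # start from a uniform rank vector

for step in range(50):           # 6 steps if done by hand; more if you want convergence
    v = beta * M.dot(v) + (1 - beta) / n
    print(step + 1, v)

Each iteration follows a link with probability beta and teleports to a random node with probability 1 - beta, which is the formulation used in Mining of Massive Datasets.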
- Given the input data [(1pm, $6), (2pm, $16), (3pm, $17), (4pm, $28), (5pm, $10), (6pm, $20), (7pm, $21), (8pm, $22), (9pm, $23), (10pm, $28), (11pm, $26), (12am, $30)].
- What will the Hive query “compute average price” return? (Yes, this question is as obvious as it seems; it is asked for comparison with parts b and c.)
- What will a Storm streaming query “compute average price per each 3 hour window” return? (Tumbling, i.e., non-overlapping windows of tuples.) For example, the first window would be 1pm-4pm and the second window would be 4pm-7pm. If you are wondering about overlap, I recommend treating the windows as half-open intervals, [1pm-4pm) and [4pm-7pm) (i.e., 1pm-3:59pm).
- What will a Storm query “compute average price per each 3 hour window” return? (Sliding, i.e., overlapping windows of tuples, moving the window forward 2 hours each time.) The first window is 1pm-4pm, the second window is 3pm-6pm, and so on.
NOTE: when Storm does not have a full window, you cannot output anything until the window fills with data.
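To make the window boundaries concrete, here is a small Python sketch of the windowing arithmetic (it does not use the Storm API, and the hourly prices in it are hypothetical rather than the data from this question).

# Windowed-average sketch; the prices below are made up, not the assignment data.
prices = [10, 12, 14, 16, 18, 20, 22]   # hypothetical hourly prices starting at 1pm
window = 3                              # 3-hour (i.e., 3-tuple) window

# Tumbling (non-overlapping) windows: advance by the full window size.
for start in range(0, len(prices) - window + 1, window):
    w = prices[start:start + window]
    print('tumbling window starting at index', start, 'average =', sum(w) / len(w))

# Sliding (overlapping) windows: advance by 2 hours each time.
slide = 2
for start in range(0, len(prices) - window + 1, slide):
    w = prices[start:start + window]
    print('sliding window starting at index', start, 'average =', sum(w) / len(w))

In both loops a window is only emitted once it is full, matching the note above.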
- Run another custom MapReduce job, implementing a solution for the following query:
SELECT lo_quantity, MAX(lo_revenue)
FROM (SELECT lo_revenue, MIN(lo_quantity) AS lo_quantity, MIN(lo_discount) AS lo_discount
      FROM lineorder
      WHERE lo_orderpriority LIKE '%URGENT'
      GROUP BY lo_revenue)
WHERE lo_discount BETWEEN 6 AND 8
GROUP BY lo_quantity;
This requires running two different MapReduce jobs. First, you would write a job that executes the subquery and produces an output in HDFS. Then you would write a second job that uses the output of the first job as its input. A skeleton of this two-job approach is sketched below.
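One way to implement the two jobs is with Hadoop Streaming, as in earlier assignments. The sketch below is only a skeleton, not a complete solution: it assumes the standard pipe-delimited SSB lineorder layout (lo_orderpriority in field 7, lo_quantity in field 9, lo_discount in field 12, lo_revenue in field 13, counting from 1), so verify those column positions against your own data file before relying on them.

# mapper1.py -- job 1 (the subquery): keep only URGENT orders and emit
# lo_revenue as the key, with lo_quantity and lo_discount as the value.
import sys
for line in sys.stdin:
    f = line.strip().split('|')                      # assumed pipe-delimited lineorder row
    if f[6].endswith('URGENT'):                      # lo_orderpriority LIKE '%URGENT'
        print('%s\t%s\t%s' % (f[12], f[8], f[11]))   # lo_revenue, lo_quantity, lo_discount

# reducer1.py -- job 1: for each lo_revenue key, output MIN(lo_quantity) and MIN(lo_discount).
import sys
cur, min_qty, min_disc = None, None, None
for line in sys.stdin:
    key, qty, disc = line.strip().split('\t')
    qty, disc = int(qty), int(disc)
    if key != cur:
        if cur is not None:
            print('%s\t%d\t%d' % (cur, min_qty, min_disc))
        cur, min_qty, min_disc = key, qty, disc
    else:
        min_qty, min_disc = min(min_qty, qty), min(min_disc, disc)
if cur is not None:
    print('%s\t%d\t%d' % (cur, min_qty, min_disc))

# Job 2 reads job 1's HDFS output: its mapper keeps only rows with lo_discount
# between 6 and 8 and re-keys them on lo_quantity; its reducer keeps the running
# MAX(lo_revenue) per lo_quantity, following the same pattern as reducer1.py.

You would launch the two jobs with two separate hadoop-streaming commands, pointing the second job's -input at the first job's -output directory.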
- In this section you will run an implementation of the page rank algorithm. Unfortunately, newer versions of Mahout (a machine learning library that runs on Hadoop and Spark, which we will use for a couple of other examples later) removed their MapReduce page rank implementation. So we will run a publicly available implementation from GitHub – an opportunity to build a custom Java job (while it is written in Java, you do not need to know Java to run it).
Download the Stanford graph dataset (this is a graph of page links from Stanford.edu, originally from here – https://snap.stanford.edu/data/)
wget http://cdmgcsarprd01.dpu.depaul.edu/CSC555/web-Stanford.txt.gz
gunzip web-Stanford.txt.gz
- Take a look at the file and report how many nodes and edges the web-Stanford.txt contains (it’s on the third line).
Download the PageRank implementation from here: https://github.com/danielepantaleone/hadoop-pagerank, as follows:
Install git (Git is a version control system for tracking changes in source code).
sudo yum install git
git clone https://github.com/danielepantaleone/hadoop-pagerank
cd hadoop-pagerank/
Edit the source file (that is the reducer of the preprocessing MapReduce job)
nano src/it/uniroma1/hadoop/pagerank/job1/PageRankJob1Reducer.java
to replace PageRank.NODES.size() with 281903 (the number of nodes), as shown in the screenshot.
// is a comment in Java (the equivalent of # in Python), so you do not need the line shown in blue.
This code is meant to count the total number of nodes, but it is not working correctly (it returns 0), and I have not yet figured out why. If you do not make this change, all of the initial values will be computed as Infinity.
Set the classpath environment variable (you do not have to put it in .bashrc because we will only be using it to compile Java code in the next command). As always, note that this is a single line with no line breaks:
export CLASSPATH="/home/ec2-user/hadoop-2.6.4/share/hadoop/common/*:/home/ec2-user/hadoop-2.6.4/share/hadoop/mapreduce/lib/*:/home/ec2-user/hadoop-2.6.4/share/hadoop/mapreduce/*"
Compile the code:
javac -sourcepath src/ src/it/uniroma1/hadoop/pagerank/PageRank.java
Build the new jar file with your custom compiled code.
cd src
jar cvf PageRank.jar it
Congratulations, you have built a new custom PageRank.jar from Java code, similar to the hadoop-streaming and hadoop-examples jar files that we have used previously.
Now, load the web Stanford data into HDFS:
hadoop fs -mkdir /data/
hadoop fs -mkdir /data/webStanford
hadoop fs -put ~/web-Stanford.txt /data/webStanford
And, finally, run the new jar to execute page rank evaluation.
time hadoop jar PageRank.jar it.uniroma1.hadoop.pagerank.PageRank --input /data/webStanford --output /data/prOutput --damping 90 --count 8
Note that --damping 90 means a 10% chance of teleporting, and --count 8 means 8 iterations are computed.
- Report the runtime (it took about 5 minutes to run when I tested it).
- Submit a screenshot of the first page of nodes, e.g., by running
hadoop fs -cat /data/prOutput/result/part-r-00000 | more
Submit a single document containing your written answers. Be sure that this document contains your name and “CSC 555 Assignment 5” at the top.