Description
This project consists of two independent components: 1. Query planning in PostgresQL using a DVD store database. 2. Exporting FakeBook Project 2 data from Oracle to MongoDB, and then writing some MongoDB queries. You may find the details of each component below. You will need to be on the University of Michigan network to do the assignment, since the database servers are only accessible from within the University. If you are off-campus, you must first connect to the University of Michigan VPN network. See http://www.itcom.itd.umich.edu/vpn/ for instructions on connecting to the VPN. You can do 1. and 2. in either order — they are totally independent activities. We have talked about MongoDB in the lecture on 11/21. So, you may consider starting on that first. Within MongoDB project, you can do Part I or Part 2 in either order as well. Part I is about writing a Java program to convert Oracle data from Project 2 to MongoDB format (JSON) and then importing it into MongoDB. Part 2 is about writing 7 queries on the imported data. We have given you a JSON file that you can use for Part 2, if you wish to do Part 2 first. Our advice for F16: Do MongoDB part of of the project first. That is 80% of the project and is completely auto-graded. Do Part I or Part 2 of that in either order (or in parallel!). It is worth 8% of the overall grade. Postgres Portion does not require coding and there is no auto-grader — it requires you to use the Postgres database, read its manual, following the instructions, and fill out a Google form based on the observations. It is almost like doing a homework (and since it is worth 20% of this project, it ends up being worth the same as other homeworks – 2% of the overall grade). PostgreSQL (20 points) We will have a couple of exercises using PostgreSQL to help you understand query planning and query optimization. The spec and questions can be found in the following Google form link. You need to fill in your answer in the Google form. You don’t need to submit any files for this exercise. https://goo.gl/forms/mqrz6jxjYmh0LAvr1 Following is an introduction on how to connect to Postgres, load data and some helpful links, before you answer the questions at the above google form. Logging into Postgres To login to postgres, login to a CAEN machine. Alternatively, download a postgres client program such as psql to your machine (first check – you may already have it if you have a Mac or Linux machine). % psql -h eecs484.eecs.umich.edu -U uniquename Your default Postgres password is postgres484student. You need to connect from either a University machine, while you are connected to the University network via UM Wireless (not guest), or be on the University of Michigan VPN. Once you login, you can change your password as follows on the postgres => prompt: => \password Remember your new password. Resetting it is possible but will incur a delay (possibly even 24-48 hours) as it would be a manual process. If you need to reset it, then you should email the teaching staff and allow us 24-48 hours to respond. CAEN or ITD do not support this database system. Initializing the Database This project comes with a zip file (dvd_store_data.zip) that contains relevant commands to initialize the database. Download and unzip the file and you will find a file called setup_db.sql. You will use this file to populate the database. Specifically, you can do the following steps on CAEN to initialize the database: % unzip dvd_store_data.zip % cd dvd_store_data % psql -h eecs484.eecs.umich.edu -U uniquename Password for user uniqname: You are now connected to the Postgres interactive terminal. Run the following command (note the backslash) to execute commands from setup_db.sql: \i setup_db.sql It can take a few minutes for the database to be initialized. Here is what you may see when initializing (or re-initializing). The error messages about permission being denied on triggers can be ignored. uniquename=> \i setup_db.sql psql:pgsqlds2_delete_all.sql:2: ERROR: permission denied: “RI_ConstraintTrigger_19357” is a system trigger psql:pgsqlds2_delete_all.sql:3: ERROR: permission denied: “RI_ConstraintTrigger_19359” is a system trigger psql:pgsqlds2_delete_all.sql:4: ERROR: permission denied: “RI_ConstraintTrigger_19409” is a system trigger psql:pgsqlds2_delete_all.sql:5: ERROR: permission denied: “RI_ConstraintTrigger_19391” is a system trigger psql:pgsqlds2_delete_all.sql:6: ERROR: permission denied: “RI_ConstraintTrigger_19424” is a system trigger psql:pgsqlds2_delete_all.sql:7: ERROR: permission denied: “RI_ConstraintTrigger_19383” is a system trigger DROP TABLE DROP TABLE psql:pgsqlds2_create_tbl2.sql:30: NOTICE: CREATE TABLE will create implicit sequence “customers_customerid_seq” for serial column “customers.customerid” psql:pgsqlds2_create_tbl2.sql:30: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index “customers_pkey” for table “customers” CREATE TABLE psql:pgsqlds2_create_tbl2.sql:43: NOTICE: CREATE TABLE will create implicit sequence “orders_orderid_seq” for serial column “orders.orderid” psql:pgsqlds2_create_tbl2.sql:43: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index “orders_pkey” for table “orders” CREATE TABLE psql:pgsqlds2_create_tbl2.sql:51: NOTICE: CREATE TABLE will create implicit sequence “categories_category_seq” for serial column “categories.category” psql:pgsqlds2_create_tbl2.sql:51: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index “categories_pkey” for table “categories” CREATE TABLE INSERT 0 1 INSERT 0 1 INSERT 0 1 INSERT 0 1 INSERT 0 1 INSERT 0 1 INSERT 0 1 INSERT 0 1 INSERT 0 1 INSERT 0 1 INSERT 0 1 INSERT 0 1 INSERT 0 1 INSERT 0 1 INSERT 0 1 INSERT 0 1 psql:pgsqlds2_create_tbl2.sql:82: NOTICE: CREATE TABLE will create implicit sequence “products_prod_id_seq” for serial column “products.prod_id” psql:pgsqlds2_create_tbl2.sql:82: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index “products_pkey” for table “products” CREATE TABLE CREATE TABLE CREATE TABLE psql:pgsqlds2_create_tbl2.sql:115: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index “inventory_pkey” for table “inventory” CREATE TABLE CREATE TABLE psql:pgsqlds2_load_orderlines.sql:1: ERROR: permission denied: “RI_ConstraintTrigger_20643” is a system trigger psql:pgsqlds2_load_orderlines.sql:16: ERROR: permission denied: “RI_ConstraintTrigger_20643” is a system trigger Helpful Links Now, you should proceed to do the quiz at the Google Forms link given at the beginning. When you do the quiz, you may have to refer to Postgres documentation and run postgres commands to find the answers to the quiz questions. Some of the tables that you will be using during the course are system catalog pages. In most databases, there are tables about tables that you create. For example, one of the tables in Postgres is pg_class that has general information about all the relations, indexes, etc., including their column names. Another table is pg_stats, which contains approximate number of tuples in each table, number of distinct values in each column, etc. These two tables are very useful in query optimization. The data in these tables helps the query optimizer estimate the cost of different ways of evaluating a query (e.g., whether to use an index or to ignore it). For example, ignoring an index and just doing a regular file scan may be more efficient in some cases (e.g., see Lecture Notes where examples of SELECT queries on age being 18 for UM students were discussed). The following are links to Postgres documentations relevant to this assignment. Be prepared to look up the documentation as you work on the exercises. ● Full PostgreSQL 8.4.16 documentation: http://www.postgresql.org/docs/8.4/static/index.html ● System catalogs that give you information about tables: http://www.postgresql.org/docs/8.4/static/catalogs.html ● Statistics used by the query planner http://www.postgresql.org/docs/8.4/static/planner-stats.html ● How to manipulate the query planner (such as disabling the use of a join algorithm) http://www.postgresql.org/docs/8.4/static/runtime-config-query.html ● Syntax of EXPLAIN command http://www.postgresql.org/docs/8.4/static/sql-explain.html ● How to use EXPLAIN command and interpret its output http://www.postgresql.org/docs/8.4/static/using-explain.html ● Creating an index http://www.postgresql.org/docs/8.4/static/sql-createindex.html ● Creating a clustered index http://www.postgresql.org/docs/8.4/interactive/sql-cluster.html MongoDB (80 points) Introduction In this project, you will learn how to transfer data from SQL to MongoDB and learn to perform MongoDB queries. There are two parts in this project. In part 1, you will need to write a Java program to export a small portion of Fakebook data stored in Project 2 tables into one JSON file which serves as the input to part 2. In part 2, you will need to import this JSON file into MongoDB and perform a couple of queries in the form of JavaScript functions. (Note: we have provided you the JSON file that should result from your part 1, to allow you to work on part 1 and part 2 in either order, and to help you check the correctness of part 1. See more on this later in this document.) JSON objects and arrays MongoDB uses a format called JSON (Javascript Object Notation) extensively. JSON is a key-value representation, in which values can also be JSON objects. Here are some examples of objects in JSON notation: Example 1 (JSON object): {“firstName”:”John”, “lastName”:”Doe”} Here, “firstName” is a key, and the corresponding value is “John”. Similarly, “lastName” is a key and “Doe” is a value. Think of it like a map. Here is some Javascript code that uses the above: var employee = {“firstName”:”John”, “lastName”:”Doe”}; employee[“firstName”]; // displays John One can also have a JSON array, which are an array of JSON objects. For example, variable employees is an array of JSON objects. var employees = [ {“firstName”:”John”, “lastName”:”Doe”}, {“firstName”:”Atul”, “lastName”:”Prakash”}, {“firstName”:”Barzan”,”lastName”: “Mozafari”} ]; employees[1][“firstname”]; // prints Atul Nesting is possible. For example, in a key-value pair, the value can be a JSON array, another JSON object, or just a simple string or number. MongoDB does not use SQL. But there are some similarities. Here are a few simple examples. See the following more examples: https://docs.mongodb.org/manual/reference/sql-comparison/ SQL MongoDB Table Collection. Initialized using a JSON array. Tuple or row of a table Document. Corresponds to a JSON object SELECT * FROM users; db.users.find(); SELECT * FROM users WHERE name = ‘John’ AND age = 50; db.users.find({name : “John”, age : 50}); SELECT user_id, addr FROM users WHERE name = ‘John’; db.users.find({name : “John”}, {user_id : 1, addr : 1, _id : 0}); Install mongodb on your local machine: We encourage you to install mongodb locally in your laptop to have a more pleasant development environment and also to explore mongdb’s functionality by yourself. You can follow instructions in the following links. Post on piazza if you get stuck and feel free to help other students get mongo installed. A great way is to reply to any questions related to installation on Piazza, based on your experience. Google search on installation error messages can also help. Install mongodb on Mac: https://docs.mongodb.com/v3.2/tutorial/install-mongodb-on-os-x/ (Note: You may need to use sudo command at times to temporarily become root if you get Permission Denied errors). Install mongodb on Windows: https://docs.mongodb.com/v3.2/tutorial/install-mongodb-on-windows/ Install mongodb on Linux https://docs.mongodb.com/v3.2/administration/install-on-linux/ (Note: You may need to use sudo command at times to temporarily become root if you get Permission Denied errors.) When you have it successfully installed, the following commands should work from a Terminal: % mongo % mongoimport If you have trouble installing locally, you can also use the mongo server on eecs484.eecs.umich.edu directly by supplying a userid and password. See Part 2 below. Part 1 – Export SQL data to JSON using JDBC This part does not really make use of MongoDB. It instead relies on your knowledge from Project 2 (SQL!). You will be retrieving data from the public tables of Project 2 and outputting a subset of the information as a JSON array. We have given you the output file as well that we are expecting as output so that you can check your answer. You are provided with 3 files: GetData.java, Main.java, and Makefile for this part. You will write code inside GetData.java, resulting a file output.json . Also, you are provided sample.json, which is one of the possible correct outputs. You should get an equivalent JSON array in output.json as in sample.json. 1) Main.java This file contains the main function to run the program. The only modification you need to make to this file is to provide your SqlPlus username and password that you used in Project 2 as well. Please refer to your project 2 files in case you forgot what it is, since it is probably embedded in one of the files there. public class Main { static String dataType = “PUBLIC”; static String oracleUserName = “username”; //replace with your Oracle account name static String password = “password”; //replace with your Oracle password … 2) GetData.java This file contains the function you need to write to export data from SqlPlus to a JSON file. The function will output a JSON file called “output.json” which you will need to submit. public JSONArray toJSON() throws S QLException{ // Your implementation goes here…. // This is an example usage of JSONArray and JSONObject // The array contains a list of objects // All user information should be stored in the JSONArray object: users_info // You will need to DELETE this stuff. This is just an example. // A JSONObject is an unordered collection of name/value pairs. Add a few name/value pairs. JSONObject test = new JS ONObject(); / / declare a new JSONObject // A JSONArray consists of multiple JSONObjects. JSONArray users_info = new JS ONArray(); test.put(“user_id”, “testid”) ; / / populate the JSONObject test.put(“first_name”, “testname”); JSONObject test2 = new JS ONObject(); test2.put(“user_id”, “test2id”); test2.put(“first_name”, “test2name”); users_info.add(test); / / add the JSONObject to JSONArray users_info.add(test2); / / add the JSONObject to JSONArray return users_info; } You need to use JDBC to query relevant project 2 tables to get information about users and store the information in a JSONArray called users_info. It is OK to use multiple queries — in fact, that may be more convenient to do. The users_info object is a JSONArray and contains an array of JSONObjects. Each JSONObject should contain information about one user. For each user (stored as a JSONObject), you will need to retrieve the following information: ● user_id ● first_name ● last_name ● gender ● YOB ● MOB ● DOB ● hometown (JSONObject). Hometown contains city, state and country. ● friends (JSONArray). Friends contains an array of user_ids which are greater than the user_id of that user. Following is an example of one user JSONObject (order of key/value pairs in JSONObject doesn’t matter). See the file sample.json for an example valid output. 3) Makefile We provide a simple Makefile to compile the Java files and run them. You may make changes to this file if necessary. To compile the code, do: $ make To run the code, do: $ make run An output file output.json should result. 4) output.json Since the order of attributes inside a jsonObject is not fixed, there are a lot of correct answers for output.json . However, when you import them into database, they should be all identical for queries. For your convenience, we have provided a sample.json, which is one of the correct answers. To test whether your output.json is correct, you could do Part 2 with your output.json as the input instead of sample.json . If you get the same answers, that is a good sign (though not a proof). If you get different answers, then something is definitely wrong your output.json . Part 2 – Query MongoDB The first step is to import output.json file from part 1 to MongoDB as a collection. Alternatively, you can use sample.json as your input file. If you are working on CAEN machine, use the following command to input sample.json into the database, for example: $ module load mongodb $ mongoimport –host eecs484.eecs.umich.edu –username –password –collection users –db –file sample.json –jsonArray You can also do this by modifying the Makefile and then doing: $ make setupsampledb Alternatively, to use your output.json, you can do: $ make setupmydb On eecs484 server, we have set up Mongo databases for each student. The database name is your uniquename, and the username is also your uniquename. Password is eecs484class for all student, you can change your password for your database, see(https://docs.mongodb.org/manual/tutorial/manage-users-and-roles/). What you need to do is db.updateUser(, {pwd : “newpassword”}) You can do it either in mongo shell or run it as script (If you are using a private, local copy of mongodb on your personal machine, you can instead just omit the –host, –username, and –password flags). When importing a json file, please use “users” as your collection name, and do not modify this collection. The above command does that. You can create additional collections besides if you want as helper collections to answer the queries. For the second step of this part, you will need to write 5 queries in the form of JavaScript functions to query MongoDB. MongoDB can load JavaScript files. You can find more information via the following link. https://docs.mongodb.org/manual/tutorial/write-scripts-for-the-mongo-shell/ If a collection is created in a query, you may reuse that collection in subsequent queries to save time. Note: Since only hometown information is retrieved in part 1. We assume that the city in queries means hometown city. Query 1: Find users who live in a certain city This query should return a javascript array of user_ids who live in the specified city. City is passed as a parameter of function find_user. Hint: A possible answer would start out like: var result = db.users.find(….); // Read MongoDB tutorials on the find command. In addition, you may find the following useful. https://docs.mongodb.org/v3.0/reference/method/cursor.forEach/ Instead of using forEach, you can also iterate over the cursor result that is returned by result, and push the user_id from the result into a Javascript array variable. function find_user(city, dbname){ db = db.getSiblingDB(dbname) //implementation goes here // returns a Javascript array. See test.js for a partial correctness check. This will be // an array of integers. The order does not matter. } Note: Query 2-5 assume that the variable db has been initialized from Query 1 above. Do not drop the db database. Query 2: Unwind friends Each document in the collection represents one user’s information, including a list of friends’ id. In this query, you need to unwind the friends list such that the resulting collection contains document which represents a friend pair. You don’t need to return anything. The new collection must be named flat_users. You may find this link on MongoDB unwind helpful. https://docs.mongodb.org/manual/reference/operator/aggregation/unwind/ function unwind_friends(dbname){ db = db.getSiblingDB(dbname) //implementation goes here // returns nothing. It creates a collection instead as specified above. } You may also find the following useful: https://docs.mongodb.org/manual/reference/operator/aggregation/ In particular, besides $unwind, $project and $out can also be useful. $out can create a collection directly. Instead of $out, you can also iterate over the resulting cursor from the query and use insert operator to insert the tuples into flat_users. See the documentation link in the query below as well. Query 3: Create a city to user_id mapping Create a new collection. Documents in the collection should contain two fields: _id field holds the city name, users field holds an array of user_id who live in that city. You don’t need to return anything. The new collection must be named cities. You may find the following link helpful. https://docs.mongodb.org/manual/reference/operator/aggregation/out/ function cities_table(dbname) { db = db.getSiblingDB(dbname) //implementation goes here // Returns nothing. Instead, it creates a collection inside the database. } Query 4: Recommend friends Find user pairs such that, one is male, the other is female, their year difference is less than year_diff, and they live in same city and they are not friends with each other. Store each friend pair as an array (male first, female second), and store all pairs as an array of arrays, and return the array at the end of the function. function suggest_friends(year_diff, dbname) { db = db.getSiblingDB(dbname) //implementation goes here // Return an array of arrays. } Query 5: Find the oldest friend Find the oldest friend for each user who has friends. For simplicity, use only year of birth to determine age. If there is a tie, use the one with smallest user_id. Notice that in the collection, each user only has the information of friends whose user_id is greater than that user, due to the requirement in Fakebook database. You need to find the information of friends who have smaller user_ids than the user. You should find the idea of query 2 and 3 useful. Return a javascript object with keys as user_ids and the value of keys is the oldest friend’s id. The number of keys of the object should equal to the number of users who has friends. function oldest_friend(dbname){ db = db.getSiblingDB(dbname) //implementation goes here //return an javascript object described above } Query 6: Find the Average friend count for users We define the `friend count` as the number of friends of a user. The average friend count is the average `friend count` towards a collection of users. In this function we ask you to find the `average friend count` for the users collection. function find_avg_friend_count(dbname){ db = db.getSiblingDB(dbname) //implementation goes here //return an javascript object described above } Return a decimal as the average user friend count of all users in the “users” collection. Query 7: Find the city average friend count using MapReduce MapReduce is a very powerful yet simple parallel data processing paradigm. Please refer to the discussion 8 (will be released on 11/30) for more information. In the question we are asking you to use MapReduce to find the “average friend count” at the city level (i.e., average friend count per user where the users belong to the same city). We have set up the MapReduce calling point in the test.js and we are asking to write the mapper, reducer and finalizer (if needed) to find the average friend count per user for each city. You may find the following link helpful. https://docs.mongodb.com/v3.2/core/map-reduce/ var city_average_friendcount_mapper = function() { // implem ction of average friend count }; var city_average_friendcount_reducer = function(key, values) { // implement the reduce function of average friend count }; var city_average_friendcount_finalizer = function(key, reduceValue) { // We’ve implemented a simple forwarding finalize function. This implementation is naive: it just forwards the reduceVal to the output collection. // Feel free to change it if needed. However, please keep this unchanged: the var ret should be the average friend count per user of each city. var ret = reduceVal; return ret; }; Sanity check: Note that after running the test.js, running db.friend_city_population.find() in mongo shell, you should find the documentation friend_city_population have the records in the similar form as following: { “_id” : “some_city”, “value” : 15.23} Where the _id is the name of the city, and the value is the average friend count per user. Sample test We offer a test.js which will call all 7 query javascript files and will print “Query x is correct!” if you query passes the simplistic test in test.js. In test.js, you need to put your database name in the dbname variable. Please make sure that all 7 query javascript files are in the same directory as test.js. Also, please note that the autograder will use a similar program as test.js but is more exhaustive. In particular, we compare the output of your queries against a reference output in more depth. You are free to make changes on test.js for more exhaustive testing. To run the tests from command-line, you can do: $ module load mongodb # This is only required one-time per login. $ mongo -u -p –host eecs484.eecs.umich.edu < test.js This will access to your database(created as your uniquename) on eecs484 mongodb server and run the script test.js. Alternatively, you can use the Makefile as well. $ make mongoquerytest will basically do the above for you. If you want to open the mongodb shell, you will do: $mongo -u -p –host eecs484.eecs.umich.edu Again, if you are using a local mongodb on your personal computer, just do the following instead: $ mongo dname < test.js Note: it may take some time to run query 4 and query 5. However, query 4 should take less than 3 minutes and query 5 should take less than 6 minutes. You will receive 0 on that query if it exceeds this time limit. What to submit You should submit a zip file named p4.zip. The zip file should include: GetData.java — Your java program to create the json file. Do not change the code for writing to output.json. query1.js — return a javascript array of user_ids. query2.js — create a new collection called flat_users, nothing to return. query3.js — create a new collection called cities, nothing to return. query4.js — return a javascript array of user pairs. query5.js — return an javascript object, keys are user_ids, value for each key is the oldest friend id for that user_id query6.js — return a decimal which is the “average friend count” for all users query7.js finish the mapper, reducer and finalizer Where to submit For Mongodb part of the project, we have made an autograder available at grader484.eecs.umich.edu. You will submit the mongodb part of the project online. (Postgres portion is to be submitted via Google forms on the link given earlier in the specs.)