Description
In this homework, we will explore the metadata stored in the namenode of HDFS. You can obtain such metadata by using the Offline Image Viewer (oiv) tool provided by Hadoop (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html). For example,

    /bin/hdfs oiv -i /tmp/hadoop-ec2-user/dfs/name/current/fsimage_0000000000000000564 -o fsimage564.xml -p XML

will export the metadata stored in the specified fsimage (file system image) to an XML file called fsimage564.xml.

The fsimage has an INodeSection listing metadata about each inode and an INodeDirectorySection describing the directory structure, as shown above. Note that the id of an inode is its inumber, and the directory nodes are represented by their inumbers, e.g., 16385.

DSCI 551 – Spring 2022

Your task is to implement a Python program stats.py that takes an fsimage file in XML and outputs a JSON file that contains the following statistics about the file system.

    {"number of files": 5,
     "number of directories": 10,
     "maximum depth of directory tree": 4,
     "file size": {"max": 3518, "min": 16}}

Note that the maximum depth of the directory tree is the number of levels of the tree, e.g., the maximum depth of the following directory tree is 4.

Note that if the file system does not contain any files, you should not output the statistics about the "file size".

Permitted libraries: lxml.

Execution format: python3 stats.py followed by the input fsimage XML file and the output JSON file, e.g.,

    python3 stats.py fsimage564.xml stats.json

Submission: submit stats.py

Checklist for Submission:

1. DO NOT output anything we didn't ask for, such as a prompt like "please enter XXX: ___". We have given you the EXACT output format. Please follow it.
2. Use ONLY relative paths. You should assume your scripts will be run in the directory where the scripts are located. For example, no 'C:\\homework1\…' or '/Users/blabla/….'
3. Make sure that you are able to run the code according to the execution format given above.
4. Double-check your files before submitting them. Please use python3 to complete the homework and try to keep the Python version at 3.7. Do not use any libraries other than the ones specified in the handout. You can use EC2 to test your code; Python 3.7 is preinstalled on EC2.
5. You can submit multiple times on DEN, but only the latest attempt will be graded.
6. Please submit only the stats.py file, and do not include your output file in the submission.
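
To make the four statistics concrete, here is a minimal sketch of how they might be computed from an exported fsimage. It is an illustration under stated assumptions, not a reference solution. The element names (inode, id, type, blocks, numBytes, directory, parent, child) are assumed from typical oiv XML output and should be checked against your own export. The size of a file is assumed to be the sum of its block sizes. Depth is assumed to count directory levels only, with the root at level 1. The helper name compute_stats and the sample XML are hypothetical. The sketch uses the standard-library xml.etree.ElementTree so that it runs anywhere; lxml.etree provides a largely compatible API.

```python
import json
import xml.etree.ElementTree as ET  # stand-in here; lxml.etree is largely API-compatible


def compute_stats(root):
    """Compute the statistics from a parsed fsimage XML root (hypothetical helper)."""
    inode_type = {}  # inumber -> inode type, e.g. "FILE" or "DIRECTORY"
    file_sizes = []  # one entry per file: the sum of its block sizes (assumption)
    for inode in root.findall("./INodeSection/inode"):
        kind = inode.findtext("type")
        inode_type[inode.findtext("id")] = kind
        if kind == "FILE":
            file_sizes.append(sum(int(b.text) for b in inode.iter("numBytes")))

    children = {}        # parent inumber -> list of child inumbers
    has_parent = set()   # every inumber that appears as some directory's child
    for d in root.findall("./INodeDirectorySection/directory"):
        kids = [c.text for c in d.findall("child")]  # tag name is an assumption
        children.setdefault(d.findtext("parent"), []).extend(kids)
        has_parent.update(kids)

    dirs = [i for i, k in inode_type.items() if k == "DIRECTORY"]

    # Iterative depth-first walk over directories only, root at level 1 (assumption).
    max_depth = 0
    stack = [(i, 1) for i in dirs if i not in has_parent]
    while stack:
        node, depth = stack.pop()
        max_depth = max(max_depth, depth)
        for kid in children.get(node, []):
            if inode_type.get(kid) == "DIRECTORY":
                stack.append((kid, depth + 1))

    stats = {
        "number of files": len(file_sizes),
        "number of directories": len(dirs),
        "maximum depth of directory tree": max_depth,
    }
    if file_sizes:  # omit "file size" when the file system contains no files
        stats["file size"] = {"max": max(file_sizes), "min": min(file_sizes)}
    return stats


# Tiny made-up fsimage fragment: two directories and two files.
SAMPLE = """<fsimage>
<INodeSection>
<inode><id>16385</id><type>DIRECTORY</type><name></name></inode>
<inode><id>16386</id><type>DIRECTORY</type><name>data</name></inode>
<inode><id>16387</id><type>FILE</type><name>a.txt</name>
  <blocks><block><id>1</id><numBytes>16</numBytes></block></blocks></inode>
<inode><id>16388</id><type>FILE</type><name>b.txt</name>
  <blocks><block><id>2</id><numBytes>3000</numBytes></block>
          <block><id>3</id><numBytes>518</numBytes></block></blocks></inode>
</INodeSection>
<INodeDirectorySection>
<directory><parent>16385</parent><child>16386</child><child>16387</child></directory>
<directory><parent>16386</parent><child>16388</child></directory>
</INodeDirectorySection>
</fsimage>"""

stats = compute_stats(ET.fromstring(SAMPLE))
print(json.dumps(stats))
```

In a real stats.py, the root would instead come from ET.parse(path).getroot(), with the input and output paths read from sys.argv, and the result would be written with json.dump. If your interpretation of depth also counts files as a level, the walk would need to push file children onto the stack as well.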