Description
In this assignment, we’re interested in the main topics discussed on the /r/mcgill subreddit vs. the /r/concordia subreddit. We’ll do this using human annotation … and you’re the annotator 😊

Task 1: Data collection (10 pts)

First, collect the 100 newest posts from each subreddit. Use the /new endpoint, not the /hot endpoint. Write a script, collect_newest.py, that collects the 100 newest posts in the subreddit specified. It should run as follows:

python3 collect_newest.py -o -s

Collect two data files: one for the mcgill subreddit and one for the concordia subreddit. This means running your script twice. Each output file must contain exactly one post (in JSON format) per line. Do not indent the JSON output. Name the files concordia.json and mcgill.json and place them in the root folder of the submission template. Please read the README.md file in the repository for further instructions.

Task 2: Prep for coding (10 pts)

Write a script, extract_to_tsv.py, that accepts one of the files you collected from Reddit and outputs a random selection of posts from that file to a TSV (tab-separated values) file. It should run as follows:

python3 extract_to_tsv.py -o

If the requested number of posts (num_posts_to_output, the parameter you passed to the script) is greater than the number of lines in the file, the script should output all lines. If the file contains more lines than that (which is likely the case), the script should randomly select num_posts_to_output of them and output only those.

The output format (written to out_file) is:

Name title coding
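
The selection rule in Task 2 can be sketched as follows. This is a minimal illustration, not the required implementation: the helper names select_lines and write_tsv are hypothetical, and the sketch assumes that each input line is a JSON object whose "name" and "title" fields sit either at the top level or under a "data" key, with the three output columns separated by tabs and the coding column left empty for the annotator.

```python
import csv
import io
import json
import random


def select_lines(lines, num_posts_to_output, rng=random):
    """Return every line if the request covers the whole file, else a random sample."""
    if num_posts_to_output >= len(lines):
        return list(lines)
    # random.sample draws without replacement, so no post is selected twice.
    return rng.sample(lines, num_posts_to_output)


def write_tsv(lines, out):
    """Write the header row, then one tab-separated row per selected post."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Name", "title", "coding"])
    for line in lines:
        post = json.loads(line)
        # Assumption: the fields may be nested under "data" or stored at the top level.
        fields = post.get("data", post)
        # The coding column stays empty; it is filled in during annotation.
        writer.writerow([fields["name"], fields["title"], ""])


# Small in-memory demonstration using made-up posts instead of a collected file.
sample_lines = [
    json.dumps({"data": {"name": f"t3_{i}", "title": f"Post {i}"}}) for i in range(5)
]
buffer = io.StringIO()
write_tsv(select_lines(sample_lines, 3, random.Random(0)), buffer)
tsv_text = buffer.getvalue()
```

Passing a seeded random.Random instance, as in the demonstration, makes the selection reproducible while testing. In the actual script the lines would come from the input file and the output would go to out_file.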