Description
Overview

Natural Language Processing (NLP) is a subset of AI that focuses on the understanding and generation of written and spoken language. This involves a series of tasks, from low-level speech recognition on audio signals up to high-level semantic understanding and inferencing on parsed sentences.

One task within this spectrum is Part-Of-Speech (POS) tagging. Every word and punctuation symbol is understood to have a syntactic role in its sentence, such as noun (denoting people, places or things), verb (denoting actions), adjective (describing nouns) and adverb (describing verbs), to name a few. Each word in a piece of text is therefore associated with a part-of-speech tag (usually assigned by hand), where the total number of tags can depend on the organization tagging the text. A list of all the part-of-speech tags can be found here.

While this task falls under the domain of NLP, having prior language experience doesn’t offer any particular advantage. In the end, the main task is to create an HMM that can figure out a sequence of underlying states given a sequence of observations.

What You Need To Do:

Your task for this assignment is to create a Hidden Markov Model (HMM) for POS tagging, including:

1. Training probability tables (i.e., initial, transition and emission) for the HMM from training files containing text-tag pairs
2. Performing inference with your trained HMM to predict appropriate POS tags for untagged text.

Your solution will be graded based on the learned probability tables and the accuracy on our test files, as well as the efficiency of your algorithm. See Mark Breakdown for more details.

Starter Code & Validation Program

The starter code contains one Python starter file, a validation program and several training and test files. You can download the code and supporting files as a zip file starter-code.zip.
In that archive, you will find the following files:

Project File (the file you will edit and submit on Markus):
tagger.py
    The file where you will implement your POS tagger; this is the only file that will be submitted and graded.

Training Files (don’t modify):
data/training1.txt – training5.txt
    Training files (in text format) containing large texts with POS tags on each word.

Testing Files (don’t modify):
data/test1.txt – test5.txt
    Test files (in text format), identical to the training files but without the POS tags.

Validation Files:
validation/tagger-validate.py
    The public validation script for testing your solution with a set of provided files.

Running the Code

You can run the POS tagger by typing the following at a command line:

$ python3 tagger.py -d -t -o
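
To make the two steps under What You Need To Do more concrete, here is a minimal sketch of one way to build the three probability tables by counting and then decode with the Viterbi algorithm. It is an illustration only, not the starter code or the expected solution. Several things are assumptions made for the example: the function names train_hmm and viterbi are hypothetical, the training data is assumed to be already parsed into a list of sentences, each a list of (word, tag) pairs, and unseen words or transitions are given a small floor probability (the floor parameter) instead of a proper smoothing scheme.

```python
import math
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Estimate initial, transition and emission tables by relative frequency.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    This in-memory format is an assumption, not the layout of the data files.
    """
    init_counts = Counter()
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)

    for sentence in tagged_sentences:
        prev_tag = None
        for word, tag in sentence:
            if prev_tag is None:
                init_counts[tag] += 1            # tag of the first word
            else:
                trans_counts[prev_tag][tag] += 1  # prev_tag -> tag
            emit_counts[tag][word] += 1           # tag emits word
            prev_tag = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    initial = normalize(init_counts)
    transition = {tag: normalize(c) for tag, c in trans_counts.items()}
    emission = {tag: normalize(c) for tag, c in emit_counts.items()}
    return initial, transition, emission


def viterbi(words, initial, transition, emission, floor=1e-10):
    """Return the most likely tag sequence for words under the HMM.

    Works in log space to avoid underflow. The floor value stands in for
    any probability missing from the tables (an assumed, crude smoothing).
    """
    if not words:
        return []
    tags = list(emission)

    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[i][tag] = (log prob of best path ending in tag at i, previous tag)
    best = [{
        t: (logp(initial, t) + logp(emission[t], words[0]), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + logp(transition.get(p, {}), t), p)
                for p in tags
            )
            column[t] = (score + logp(emission[t], words[i]), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    path.reverse()
    return path
```

With tables trained on a few tagged sentences, calling viterbi(words, initial, transition, emission) returns one tag per input word. The decoder runs in time proportional to the number of words times the square of the number of tags, which matters because efficiency is part of the grading.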