Description
The goal of this project is to create a script that analyzes various features of a bacterial genome.
You are given the following types of files to analyze:
Sequence file is a FASTA file containing the DNA sequence of a bacterial species. The DNA is organized into 1 or
more chromosomes.
Annotation file is a text file containing tab-delimited data for each gene. There should be a header line
containing the following five columns: GeneName, Chromosome, Strand, Start, Stop. Each subsequent line should
contain information for a single gene. Assume the coordinate system is 1 based.
Your script should take the following arguments:
Positional arguments
Sequence file – required, must be a string
Annotation file – required, must be a string
Optional arguments
Codon analysis flag – optional, should not take a value
Gene sequence flag – optional, should take 1 or more names of genes whose sequences should be returned
Your script should do the following things:
A. Using argparse, take in all the above arguments and store them appropriately into a single object.
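As a minimal sketch of step A, the parser below stores all four arguments in one argparse namespace. The
option names (-c/--codon, -g/--genes) and the attribute names are assumptions chosen for illustration; the
template script may use different ones.

```python
import argparse


def build_parser():
    """Build a parser for the two positional and two optional arguments."""
    parser = argparse.ArgumentParser(
        description="Analyze various features of a bacterial genome."
    )
    # Positional arguments: both are required strings.
    parser.add_argument("sequence_file", type=str,
                        help="FASTA file containing the genome sequence")
    parser.add_argument("annotation_file", type=str,
                        help="tab-delimited gene annotation file")
    # Flag that takes no value; it is False unless given on the command line.
    parser.add_argument("-c", "--codon", action="store_true",
                        help="report amino acid and codon usage")
    # Option that takes one or more gene names; it is None unless given.
    parser.add_argument("-g", "--genes", nargs="+", metavar="GENE",
                        help="print the protein sequence of these genes")
    return parser


# Parsing an explicit list here keeps the example independent of sys.argv.
args = build_parser().parse_args(["genome.fasta", "genes.txt", "--codon"])
print(args)
```

Because -g accepts a variable number of values, it is safest to place it after the two positional arguments
on the command line so that it does not consume the file names.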
B. Read in and perform error checking on the sequence and annotation file.
For the sequence file, use the pyfaidx module to read in the data. Verify that:
1. The file exists
2. It is in valid FASTA format
3. All nucleotides are A, C, G, or T (uppercase or lowercase are allowed)
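The three sequence checks could be sketched as follows. The function names and the error wording are
hypothetical, and the broad except clause is an assumption made to keep the sketch short: pyfaidx raises its
own exception types for malformed files, and you may prefer to catch those specifically. Note that pyfaidx
writes a .fai index file next to the FASTA file the first time it is opened.

```python
import os

VALID_BASES = set("ACGTacgt")


def find_invalid_bases(sequences):
    """Return one error message per sequence containing non-ACGT characters.

    `sequences` maps a chromosome name to its sequence string.
    """
    errors = []
    for name, seq in sequences.items():
        bad = sorted(set(seq) - VALID_BASES)
        if bad:
            errors.append(
                f"Sequence '{name}' contains invalid characters: {', '.join(bad)}"
            )
    return errors


def check_sequence_file(path):
    """Return (sequences, errors) for the FASTA file at `path`."""
    if not os.path.isfile(path):
        return {}, [f"Sequence file not found: {path}"]
    import pyfaidx  # imported here so the helper above needs no third-party module
    try:
        fasta = pyfaidx.Fasta(path)
        # record[:].seq gives the full sequence of each record as a string.
        sequences = {record.name: record[:].seq for record in fasta}
    except Exception as exc:  # assumption: any failure here means bad FASTA
        return {}, [f"Sequence file is not valid FASTA: {exc}"]
    return sequences, find_invalid_bases(sequences)
```

Returning a list of messages instead of exiting at the first problem makes it possible to report every
violation together before quitting.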
For the annotation file, you should use pandas to read in the data. Verify that:
1. The file exists
2. It contains five columns
3. The headers of the columns are named: GeneName, Chromosome, Strand, Start, Stop
4. None of the genes have the same name
5. Strand equals ‘+’ or ‘-’
6. Start is less than stop
7. The length of the gene is divisible by 3
If any of these conditions are violated, the program should print an informative statement of all of the
violations and quit the program.
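One way to collect every annotation violation before quitting is sketched below. The function names and
messages are hypothetical. Two assumptions are made: Stop is treated as inclusive, so a gene's length is
Stop - Start + 1, and non-numeric coordinates are reported as an extra guard that the list above does not
strictly require.

```python
import os

import pandas as pd

EXPECTED_COLUMNS = ["GeneName", "Chromosome", "Strand", "Start", "Stop"]


def validate_annotation(df):
    """Return a list of messages, one per violation found in `df`."""
    errors = []
    columns = list(df.columns)
    if len(columns) != 5:
        errors.append(f"Expected 5 columns, found {len(columns)}")
    missing = [c for c in EXPECTED_COLUMNS if c not in columns]
    if missing:
        errors.append(f"Missing column(s): {', '.join(missing)}")
        return errors  # the remaining checks rely on these column names
    for name in df.loc[df["GeneName"].duplicated(), "GeneName"].unique():
        errors.append(f"Duplicate gene name: {name}")
    for name in df.loc[~df["Strand"].isin(["+", "-"]), "GeneName"]:
        errors.append(f"Gene {name}: Strand must be '+' or '-'")
    # Coerce coordinates to numbers; unparseable values become NaN.
    start = pd.to_numeric(df["Start"], errors="coerce")
    stop = pd.to_numeric(df["Stop"], errors="coerce")
    numeric = start.notna() & stop.notna()
    for name in df.loc[~numeric, "GeneName"]:
        errors.append(f"Gene {name}: Start and Stop must be numbers")
    ordered = numeric & (start < stop)
    for name in df.loc[numeric & ~ordered, "GeneName"]:
        errors.append(f"Gene {name}: Start must be less than Stop")
    length = stop - start + 1  # assumption: Stop is inclusive (1-based)
    for name in df.loc[ordered & (length % 3 != 0), "GeneName"]:
        errors.append(f"Gene {name}: length is not divisible by 3")
    return errors


def load_annotation(path):
    """Return (dataframe or None, errors) for the annotation file at `path`."""
    if not os.path.isfile(path):
        return None, [f"Annotation file not found: {path}"]
    df = pd.read_csv(path, sep="\t")
    return df, validate_annotation(df)
```

The caller can then print every message and exit, for example with sys.exit(1), whenever the returned list
is not empty.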
C. If no optional arguments are given, your script should report: name, length, number of genes, and GC content
for each of the chromosomes.
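The base-case report can be sketched with plain Python once the data has been read. The function names and
the dictionary keys are hypothetical, and the inputs are assumed to be a name-to-sequence mapping plus the
Chromosome column of the annotation file.

```python
from collections import Counter


def gc_content(sequence):
    """Return the fraction of bases that are G or C (case-insensitive)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def summarize_chromosomes(sequences, gene_chromosomes):
    """Return one summary dict per chromosome.

    `sequences` maps chromosome name to sequence string;
    `gene_chromosomes` lists the chromosome of every annotated gene.
    """
    genes_per_chromosome = Counter(gene_chromosomes)
    return [
        {
            "name": name,
            "length": len(seq),
            "genes": genes_per_chromosome[name],  # 0 if no genes annotated
            "gc": gc_content(seq),
        }
        for name, seq in sequences.items()
    ]


for row in summarize_chromosomes({"chr1": "GGCCAATT"}, ["chr1", "chr1"]):
    print(f"{row['name']}\t{row['length']}\t{row['genes']}\t{row['gc']:.1%}")
```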
D. If the codon analysis option is used, your script should report the amino acid and codon usage for the
entire genome (i.e. how often each amino acid is used within all of the proteins and how often each codon is
used for a given amino acid):
A 5.5% – GCA: 23%; GCC: 37%; GCG: 21%; GCT: 19%
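The two percentages in that example line can be computed as sketched below. The function names are
hypothetical. The sketch assumes the standard codon assignments, that each coding sequence has already been
extracted and reverse-complemented where needed, and that stop codons are left out of the totals; check the
example outputs to see whether stop codons should be counted instead.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codon_usage(coding_sequences):
    """Return {amino acid: (fraction of all residues, {codon: fraction})}."""
    codon_counts = Counter()
    for cds in coding_sequences:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if CODON_TABLE[codon] != "*":  # assumption: skip stop codons
                codon_counts[codon] += 1
    aa_counts = Counter()
    for codon, count in codon_counts.items():
        aa_counts[CODON_TABLE[codon]] += count
    total = sum(aa_counts.values())
    report = {}
    for aa in sorted(aa_counts):
        per_codon = {codon: count / aa_counts[aa]
                     for codon, count in sorted(codon_counts.items())
                     if CODON_TABLE[codon] == aa}
        report[aa] = (aa_counts[aa] / total, per_codon)
    return report


def format_line(aa, fraction, per_codon):
    """Format one amino acid in the layout of the example line."""
    codons = "; ".join(f"{codon}: {share:.0%}"
                       for codon, share in per_codon.items())
    return f"{aa} {fraction:.1%} – {codons}"


usage = codon_usage(["ATGGCAGCCTAA", "atgGCCGCCtga"])
for aa, (fraction, per_codon) in usage.items():
    print(format_line(aa, fraction, per_codon))
```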
E. If the gene sequence option is used, your script should print the protein sequence for each of the
requested genes in FASTA format.
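Going from an annotation row to a protein record could look like the sketch below. The function names are
hypothetical. It assumes 1-based coordinates with an inclusive Stop, that genes on the '-' strand must be
reverse-complemented before translation, that the standard codon assignments apply, and that a trailing stop
codon is dropped from the printed protein.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def gene_dna(chromosome_seq, start, stop, strand):
    """Extract a gene's coding DNA from 1-based, inclusive coordinates."""
    dna = chromosome_seq[start - 1:stop]  # 1-based inclusive -> Python slice
    return reverse_complement(dna) if strand == "-" else dna


def translate(dna):
    """Translate coding DNA, dropping a trailing stop codon if present."""
    dna = dna.upper()
    protein = "".join(CODON_TABLE[dna[i:i + 3]]
                      for i in range(0, len(dna) - 2, 3))
    return protein.rstrip("*")


def protein_fasta(name, protein, width=60):
    """Return a FASTA record with the sequence wrapped at `width` characters."""
    lines = [f">{name}"]
    lines += [protein[i:i + width] for i in range(0, len(protein), width)]
    return "\n".join(lines)


dna = gene_dna("CCATGGCATAACC", 3, 11, "+")
print(protein_fasta("exampleGene", translate(dna)))
```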
We’ve provided a template script for you to use. Some example outputs for this script are given below:
OUTPUTS
General Usage
Help documentation:
Base case:
Codon flag:
Gene flag:
Error Handling
Missing command inputs:
Missing files:
Bad input files:
Bad input files (duplicate gene):