Files you need to read:
ecoli.fasta: The data file for your submission.
short.fasta: A data file to debug your code. Do not use for submission.
Files you need to submit:
Your filenames need to start with 'problem1' or 'problem2'.
For each problem, submit:
1. A properly commented and PEP8-compliant Python file for the module.
2. A properly commented and PEP8-compliant Python file for the script.
3. A PDF of the terminal output using the print icon in the lower left corner in PyCharm.
4. One screenshot of your terminal showing the git commands you typed.
5. One screenshot of your internet browser showing your Python files in GitHub.
Problem 1. Count the Bases (50 points)
Write a module with a function that takes a filename and tests
whether the file extension is .fasta. If it is a fasta file, the function
prints the number of As, Cs, Gs, and Ts found in the sequence. If it is not
a fasta file, the function raises an exception. Use a single space between
the upper case letter and the count.
Write a script that uses this module and shows the number of
bases in ecoli.fasta in the terminal. If you process short.fasta instead,
you should see the following in your terminal.
A 47
C 27
G 31
T 49
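The counting step can be sketched as follows. This is a minimal sketch, not a reference solution: the names `count_bases` and `print_base_counts` are hypothetical, and it assumes that header lines start with '>' (the usual fasta convention) and that a `ValueError` is an acceptable exception for a wrong extension.

```python
"""Sketch of a base-counting module (all names are illustrative)."""
import os


def count_bases(filename):
    """Return a dict mapping A, C, G, and T to their counts in a fasta file.

    Raises ValueError if the file extension is not .fasta.
    """
    extension = os.path.splitext(filename)[1]
    if extension != ".fasta":
        raise ValueError(f"Expected a .fasta file, got: {filename}")

    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    with open(filename) as fasta_file:
        for line in fasta_file:
            # Assumption: header lines start with '>' and hold no sequence.
            if not line.startswith(">"):
                for base in line.strip().upper():
                    if base in counts:
                        counts[base] += 1
    return counts


def print_base_counts(filename):
    """Print each base and its count, separated by a single space."""
    for base, count in count_bases(filename).items():
        print(base, count)
```

A script would import this module and call `print_base_counts` with the data file name, producing one line per base in the format shown above.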
Problem 2. Count Bigrams (50 points)
Without reusing the code you wrote for Problem 1, write a
new module and a new script. Your code needs to show the number of pairs
of DNA bases in ecoli.fasta in the
terminal using the function printDigrams provided in the lecture slides.
You might want to practice using short.fasta, but
your final submission needs to use ecoli.fasta. The second line of the file is
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
which has the pairs AG, GC, CT, TT, TT, TT, TC in order, and
so on.
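To make the idea of overlapping pairs concrete, here is a small sketch that lists the pairs of a single line. The function name `adjacent_pairs` is hypothetical. The sketch handles one string only; if the sequence spans several lines, the lines would presumably need to be joined first so that pairs straddling a line break are not missed.

```python
def adjacent_pairs(sequence):
    """Return the overlapping two-base pairs of a sequence, in order."""
    return [first + second for first, second in zip(sequence, sequence[1:])]


line = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
print(adjacent_pairs(line)[:7])
# prints ['AG', 'GC', 'CT', 'TT', 'TT', 'TT', 'TC']
```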
Create a dictionary that maps strings to the number of times
the string appears. AA appears 338006 times in the file ecoli.fasta.
If you count the pairs in the sample file short.fasta and
print the result, you would get the following: AA appears 18 times, AG appears
9 times, and so on.
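One possible way to build such a dictionary is sketched below. The name `count_pairs` is an assumption, and the sketch works on a sequence string that has already been read from the file. It does not use printDigrams, which is only available in the lecture slides.

```python
def count_pairs(sequence):
    """Map each two-base string to the number of times it appears."""
    pair_counts = {}
    for index in range(len(sequence) - 1):
        pair = sequence[index:index + 2]
        # dict.get supplies 0 the first time a pair is seen.
        pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return pair_counts


print("The result is:", count_pairs("AGCTTTTC"))
# prints The result is: {'AG': 1, 'GC': 1, 'CT': 1, 'TT': 3, 'TC': 1}
```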
In your terminal, include a short description with each
output, e.g. "The result is:". The output should not exceed 100
lines.
    A   G   C   T
A  18   9   8  12
G   7   6   9   9
C   7   3   3  13
T  14  13   7  15
Rubric (50 points per problem):
- (10 points) Your submission includes all files listed under
'Files you need to submit'. Files have meaningful names and the content
matches the filename.
- (10 points) The code reads the correct data, not the
sample data, and generates the correct results in the terminal.
- (10 points) Exceptions and explicit error messages are
used to cover common error cases, e.g. trying to read a file that doesn't exist
or trying to read a file with the wrong extension.
- (10 points) Code is commented and PEP8 compliant.
Variable names are meaningful. Every module and function has a meaningful
docstring.
- (10 points) Concepts covered in the last lecture are
used. Unnecessary structures, global variables, hard-coded values, break,
and continue are not used. Code, results, and test results are easy to
read.
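The error cases named in the rubric (a missing file, a wrong extension) could be handled along these lines. This is only a sketch under assumptions: the function names `check_fasta_path` and `main` are hypothetical, and the choice of `ValueError` and `FileNotFoundError` is one reasonable option, not a requirement stated in the assignment.

```python
import os


def check_fasta_path(filename):
    """Raise an explicit error if filename is not an existing .fasta file."""
    if os.path.splitext(filename)[1] != ".fasta":
        raise ValueError(f"'{filename}' does not have the .fasta extension.")
    if not os.path.isfile(filename):
        raise FileNotFoundError(f"'{filename}' does not exist.")


def main(filename):
    """Print a readable message instead of a traceback on bad input."""
    try:
        check_fasta_path(filename)
    except (ValueError, FileNotFoundError) as error:
        print(f"Error: {error}")
        return 1
    print(f"{filename} looks like a readable fasta file.")
    return 0
```

Returning a status code from `main` keeps the error path free of break and continue, and the message tells the user which check failed.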