Grasping the basics of a programming language is one thing, seeing how to apply that to solving problems is quite another. This process of working a problem is usually gained over a long period of time as a programmer. Different programmers will tackle things in different ways. Part of the job of learning to program is developing your way of solving problems.
It is therefore useful to see how experienced programmers tackle particular problems. Not so you can mimic their approach, but so you can see one way a person with experience goes about the task. This observation can help shape your approach to problem solving.
With that motivation, I'll walk through how I would use Python to tackle a problem from Project Rosalind. This is a well documented set of problems from the world of biology. The main goal is to manipulate strings of genetic information using Python. The site gives you plenty of background on the science, so no prior knowledge is needed. Furthermore, the problems are very well described, so you can just approach them as string manipulation problems and forget the biology if you like. (Though that wouldn't be making the most of the exercise.)
The first problem is nice and simple, counting nucleotides. DNA sequences are made up of four letters A
, C
, G
, and T
, representing the four different types of amino acid. Given a string of at most 1000 such letters, write a function to count how many times each one occurs.
Let's create a new Python file and set it up as follows.
# Function definitions.
# Test code.
if __name__ == "__main__":
# Test each function here.
We will define a function at the top of the file, then write a series of statements to test it's behaviour under the main guard.
So, to the first function. From the problem description we can see our function needs to accept a single string s
that cannot exceed 1000
characters. Lets implement that now.
# Function definitions.
def count_nucleotides(s):
""" Count the number of occurrences of each nucleotide in string s. Return
a string of four integers representing the count of each one.
http://rosalind.info/problems/dna/
"""
if len(s) > 1000:
raise ValueError("String cannot exceed 1000 characters.")
Our function doesn't actually count anything yet, but it does two important things. First it has a doc string (the lines between the triple quote marks """
). This describes what the function does. You can see the doc string for a Python function by passing its name to the built in function help()
.
>>> help(count_nucleotides)
Help on function count_nucleotides in module __main__:
count_nucleotides(s)
Count the number of occurrences of each nucleotide in string s. Return
a string of four integers representing the count of each one.
http://rosalind.info/problems/dna/
Second our function validates the length of the supplied string. The problem specification stated that strings may not exceed 1000 characters in length. We use the built in len()
function to get the length of the variable s
. If it is greater than our limit of 1000, we raise
an exception. Raising an exception stops execution of the function and passes the exception up to the caller. The caller can chose to handle the exception in some way, or ignore it and let it stop the program entirely.
Before we add any more code to our function, let's test the behaviour we have so far.
# Test code.
if __name__ == "__main__":
# Test each function here.
test_name = "Test 1: rejects strings longer than 1000 characters."
print(test_name)
test_value = "a" * 1001
try:
count_nucleotides(test_value)
print("failed - {0}".format(test_name))
except ValueError as ex:
print("passed - {0}".format(test_name))
This code calls our count_nucleotides()
function with a string of 1001 letter A
characters. We use try
and except
to tell Python to expect a ValueError
exception to be thrown by the function. If this happens, our test can be considered to have passed, otherwise it has failed in some way.
A note on testing.
There are libraries that formalise the writing of tests such as these, known as unit tests, in Python. I've gone for plain Python code for our tests here, so as not to complicate things with unfamiliar library code.
The final thing we need to do with our input is to normalise it. We assume the string is valid DNA, only containing letters A
, C
, T
, and G
. However it will make counting easier if we can know that the letters are all upper case or all lower case. Let's normalise the string by making all the letters upper case.
# Function definitions.
def count_nucleotides(s):
""" Count the number of occurrences of each nucleotide in string s. Return
a string of four integers representing the count of each one.
http://rosalind.info/problems/dna/
"""
if len(s) > 1000:
raise ValueError("String cannot exceed 1000 characters.")
ns = s.upper()
It's good practice to not modify the variables given to the function as it's arguments unless absolutely necessary. So here we are assigning the result of calling s.upper()
to a new variable called ns
(short for normalised s).
It would be nice to test this behaviour, but we are not in a position to do so just yet. Once we implement the counting logic we can test that it handles both upper and lower case letters as input, thus implicitly testing our normalisation logic.
Right, lets count some characters. We will need to keep a running total of four numbers, one for each letter. This could be done with four separate variables, but I'm going to use a dictionary, keyed by the four letters A
, C
, T
, and G
.
totals = {'A':0, 'C':0, 'T':0, 'G':0}
Now I can simply loop over each character in the string ns
, and use that character to increment the relevant dictionary entry.
for char in ns:
totals[char] += 1
That's all there is to it. The problem definition asked us to return a string of the four totals separated by a space. We can do this neatly as follows.
" ".join([str(v) for v in totals.values()])
Putting it all together our count_nucleotides()
function now looks as follows.
# Function definitions.
def count_nucleotides(s):
""" Count the number of occurrences of each nucleotide in string s. Return
a string of four integers representing the count of each one.
http://rosalind.info/problems/dna/
"""
if len(s) > 1000:
raise ValueError("String cannot exceed 1000 characters.")
ns = s.upper()
totals = {'A':0, 'C':0, 'T':0, 'G':0}
for char in ns:
totals[char] += 1
return " ".join([str(v) for v in totals.values()])
If we run our module using the Python interactive shell, we can call our count_nucleotides()
method with a variety of different input and see how it behaves.
>>> count_nucleotides('ACTG')
'1 1 1 1'
>>> count_nucleotides('actg')
'1 1 1 1'
>>> count_nucleotides('AAAG')
'3 0 0 1'
>>>
So far so good. It seems to be working correctly, and handles both upper and lower case input thanks to our normalisation. What we want to do now is to formalise this testing, and add some more unit tests to our test code. It is seldom possible to prove a program works in 100% of circumstances, but a well chosen set of tests can give us a high confidence that the program works as expected.
A good place to start is to test what happens when we give our function an empty string. We would expect to get back a string of four zeroes, as there were no letters of any kind to count. Lets add this test to our program.
test_name = "Test 2: empty string gives four zero counts."
print(test_name)
test_value = ""
expected_result = "0 0 0 0"
actual_result = count_nucleotides(test_value)
if (expected_result == actual_result):
print("passed - {0}".format(test_name))
else:
print("failed - {0}".format(test_name))
Next it would be good to test it can handle a string of exactly one of each character.
test_name = "Test 3: string of each character gives four one counts."
print(test_name)
test_value = "ACTG"
expected_result = "1 1 1 1"
actual_result = count_nucleotides(test_value)
if (expected_result == actual_result):
print("passed - {0}".format(test_name))
else:
print("failed - {0}".format(test_name))
Running this test again but with mixed upper and lower case letters will be good test of our normalisation logic.
test_name = "Test 4: mixed case strings can be counted."
print(test_name)
test_value = "ActG"
expected_result = "1 1 1 1"
actual_result = count_nucleotides(test_value)
if (expected_result == actual_result):
print("passed - {0}".format(test_name))
else:
print("failed - {0}".format(test_name))
Next lets test a repeated set of each letter.
test_name = "Test 5: repeated characters can be counted."
print(test_name)
test_value = "AACCCTTTTGGGGG"
expected_result = "2 3 4 5"
actual_result = count_nucleotides(test_value)
if (expected_result == actual_result):
print("passed - {0}".format(test_name))
else:
print("failed - {0}".format(test_name))
Finally we can test a randomly constructed string with different counts of each letter.
test_name = "Test 6: random sequence can be counted."
print(test_name)
test_value = "GACAATTATTTG"
expected_result = "4 1 5 2"
actual_result = count_nucleotides(test_value)
if (expected_result == actual_result):
print("passed - {0}".format(test_name))
else:
print("failed - {0}".format(test_name))
This is by no means an exhaustive list of what you could test, but it is enough for a reasonable level of confidence in the code. As a programmer you learn to get a feel for these things, and know when enough is enough.
That's a good question. Reading our problem definition we can see it makes no mention of what to do when the string s
contains letters other than A
, C
, T
, or G
. At present our program will error, as it would try to access a non-existent key in the totals
dictionary. (Go ahead and try it to see the specific error you get.)
Perhaps validating the input can safely be considered outside the scope of our problem. There is a limit to how much one function should do. Maybe adding a function to check a given string is in fact a DNA sequence would be a useful addition to our program.
In real life programming issues such as this crop up all the time. The best course of action is to highlight these issues with the people who asked for the work to be done. Perhaps they have a plan for handling this issues. Perhaps they never thought of it as a problem.
Our finished program is given in the listing below. Note how much more code is given over to testing than code that actually solves the problem. This is a common pattern in programming. We need to have confidence in the behaviour of the code we write, so test code will naturally grow quite large. The more complex the problem the more test code we need.
# Function definitions.
def count_nucleotides(s):
""" Count the number of occurrences of each nucleotide in string s. Return
a string of four integers representing the count of each one.
http://rosalind.info/problems/dna/
"""
if len(s) > 1000:
raise ValueError("String cannot exceed 1000 characters.")
ns = s.upper()
totals = {'A':0, 'C':0, 'T':0, 'G':0}
for char in ns:
totals[char] += 1
return " ".join([str(v) for v in totals.values()])
# Test code.
if __name__ == "__main__":
# Test each function here.
test_name = "Test 1: rejects strings longer than 1000 characters."
print(test_name)
test_value = "a" * 1001
try:
count_nucleotides(test_value)
print("failed - {0}".format(test_name))
except ValueError as ex:
print("passed - {0}".format(test_name))
test_name = "Test 2: empty string gives four zero counts."
print(test_name)
test_value = ""
expected_result = "0 0 0 0"
actual_result = count_nucleotides(test_value)
if (expected_result == actual_result):
print("passed - {0}".format(test_name))
else:
print("failed - {0}".format(test_name))
test_name = "Test 3: string of each character gives four one counts."
print(test_name)
test_value = "ACTG"
expected_result = "1 1 1 1"
actual_result = count_nucleotides(test_value)
if (expected_result == actual_result):
print("passed - {0}".format(test_name))
else:
print("failed - {0}".format(test_name))
test_name = "Test 4: mixed case strings can be counted."
print(test_name)
test_value = "ActG"
expected_result = "1 1 1 1"
actual_result = count_nucleotides(test_value)
if (expected_result == actual_result):
print("passed - {0}".format(test_name))
else:
print("failed - {0}".format(test_name))
test_name = "Test 5: repeated characters can be counted."
print(test_name)
test_value = "AACCCTTTTGGGGG"
expected_result = "2 3 4 5"
actual_result = count_nucleotides(test_value)
if (expected_result == actual_result):
print("passed - {0}".format(test_name))
else:
print("failed - {0}".format(test_name))
test_name = "Test 6: random sequence can be counted."
print(test_name)
test_value = "GACAATTATTTG"
expected_result = "4 1 5 2"
actual_result = count_nucleotides(test_value)
if (expected_result == actual_result):
print("passed - {0}".format(test_name))
else:
print("failed - {0}".format(test_name))