A Crash Course on BioPython
Spring 2004
On th elab machines,
Biopython package is residing in “/usr/local/lib/python2.3/site-packages”. Please add the following line to your
“.bashrc” file:
export
PYTHONPATH=$PYTHONPATH’:/usr/local/lib/python2.3/site-packages’
The main biopython releases have
lots of functionality, including:
• The ability to parse
bioinformatics files into python utilizable data structures, including support
for
the following formats:
– Blast
output – both from standalone and WWW Blast
–
Clustalw
– FASTA
– GenBank
– PubMed
and Medline
– Expasy
files, like Enzyme, Prodoc and Prosite
–
SwissProt
Files in
the supported formats can be iterated over record by record or indexed and
accessed via a
Dictionary interface.
• Code to deal with popular on-line
bioinformatics destinations such as:
– NCBI
– Blast,
Entrez and PubMed services
–
Expasy, Prodoc, Prosite, and SwissProt
• Interfaces to common
bioinformatics programs such as:
–
Standalone Blast from NCBI
–
Clustalw alignment program.
• A standard sequence class that
deals with sequences, ids on sequences, and sequence features;
tools for performing common
operations on sequences, such as translation, transcription and weight
calculations.
• Code to perform classification of
data using k Nearest Neighbors, Naive Bayes or Support Vector
Machines.
• Code for dealing with alignments,
including a standard way to create and deal with substitution
matrices.
• Integration with other languages,
including the Bioperl and Biojava projects, using the BioCorba
interface standard (available with
the biopython-corba module).
A biopython Seq object has two important attributes:
1. data – the actual sequence data string of the
sequence.
2. alphabet – an object describing what the individual
characters making up the string “mean” and how
they
should be interpreted. We’ll use the IUPAC
alphabets (http://www.chem.qmw.ac.uk/iupac/) here to deal with some of our favorite objects,
DNA, RNA and Proteins: IUPACProtein, ExtendedIUPACProtein,
IUPACUnambiguousDNA, IUPACAmbiguousDNA.
>>> from
Bio.Alphabet import IUPAC
>>> my_alpha =
IUPAC.unambiguous_dna
>>> from Bio.Seq
import Seq
>>> my_seq =
Seq(’GATCGATGGGCCTATATAGGATCGAAAATCGC’, my_alpha)
>>> print my_seq
Seq(’GATCGATGGGCCTATATAGGATCGAAAATCGC’,
IUPACUnambiguousDNA())
>>> my_seq[4:12]
Seq(’GATGGGCC’,
IUPACUnambiguousDNA())
>>> protein_seq =
Seq(’EVRNAK’, IUPAC.protein)
>>>
dna_seq = Seq(’ACGT’, IUPAC.unambiguous_dna)
>>> protein_seq +
dna_seq
Traceback (most recent call
last):
File
"<stdin>", line 1, in ?
File
"/usr/local/lib/python1.6/site-packages/Bio/Seq.py", line 42, in
__add__
raise TypeError,
("incompatable alphabets", str(self.alphabet),
TypeError: (’incompatable
alphabets’, ’IUPACProtein()’, ’IUPACUnambiguousDNA()’)
>>> my_seq.tostring()
’GATCGATGGGCCTATATAGGATCGAAAATCGC’
>>> my_seq[5] =
’G’ #
Not mutable!!!
Traceback (most recent call
last):
File
"<stdin>", line 1, in ?
AttributeError: ’Seq’
instance has no attribute ’__setitem__’
>>> mutable_seq =
my_seq.tomutable() # make it mutable if
you have to
>>> print
mutable_seq
MutableSeq(’GATCGATGGGCCTATATAGGATCGAAAATCGC’,
IUPACUnambiguousDNA())
>>> mutable_seq[5]
= ’T’
>>> print
mutable_seq
MutableSeq(’GATCGTTGGGCCTATATAGGATCGAAAATCGC’,
IUPACUnambiguousDNA())
>>>
mutable_seq.remove(’T’)
>>> print
mutable_seq
MutableSeq(’GACGTTGGGCCTATATAGGATCGAAAATCGC’,
IUPACUnambiguousDNA())
>>>
mutable_seq.reverse()
>>> print
mutable_seq
MutableSeq(’CGCTAAAAGCTAGGATATATCCGGGTTGCAG’,
IUPACUnambiguousDNA())
>>> from Bio import
Transcribe
>>> transcriber =
Transcribe.unambiguous_transcriber
>>>
my_rna_seq = transcriber.transcribe(my_seq)
>>> print
my_rna_seq
Seq(’GAUCGAUGGGCCUAUAUAGGAUCGAAAAUCGC’,
IUPACUnambiguousRNA())
>>>
transcriber.back_transcribe(my_rna_seq)
Seq(’GATCGATGGGCCTATATAGGATCGAAAATCGC’,
IUPACUnambiguousDNA())
>>> from Bio import
Translate
>>>
standard_translator = Translate.unambiguous_dna_by_id[1] #standard
translation
>>>
mito_translator = Translate.unambiguous_dna_by_id[2] #Mitochondriall DNA
>>> my_protein =
Seq(’AVMGRWKGGRAAG’, IUPAC.protein)
>>>
standard_translator.back_translate(my_protein)
Seq(’GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGT’,
IUPACUnambiguousDNA())
All of the parsers have two
components:
1. Scanner - The part of the
parser that actually does the work or going through the file and extracting
useful information. This useful
information is converted into Events.
2. Consumer - The consumer
does the job of processing the useful information and spitting it out ina
format that the programmer can use.
The consumer does this by receiving the events created by the
scanner.
>gi|6273290|gb|AF191664.1|AF191664
Opuntia clavata rpl16 gene; chloroplast gene for...
TCTAAATGATATAGGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAATTAATAAA
AAA ...
As a scanner moved through a file
containing this entry, it would create the following events:
begin sequence (as
soon as we notice we’ve got a new >)
title >gi|6273290|gb|AF191664.1|AF191664
Opuntia clavata...
sequence TATACATTAAAGGAGGGGGATGCGGAT...
sequence TCTAAATGATATAGGATTCCACTATGTAA...
sequence AAA... (and so on – you’ve got
the idea!)
end sequence (as soon as we reach a blank line
after the sequence data)
Processing:
1. Register
itself with the scanner to let it know that it wants to receive those events
that are being
generated.
2. Do
something with the events (and the information associated with them).
# Writing your own consumer
import
string
from
Bio.ParserSupport import AbstractConsumer
class
SpeciesExtractor(AbstractConsumer):
def __init__(self):
self.species_list = [] # create a species_list
def title(self, title_info):
title_atoms =
string.split(title_info)
new_species = title_atoms[1]
if new_species not in
self.species_list:
self.species_list.append(new_species)
So, when the Scanner comes to the first line in our example
FASTA file:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum
5.8S rRNA gene and ITS1 and ITS2 DNA
It will recognize this as a title, and call the title
function with the title_info argument set as the
value of the title.
from Bio import Fasta
def extract_organisms(file,
num_records):
scanner
= Fasta._Scanner()
consumer
= SpeciesExtractor()
file_to_parse
= open(file, ’r’)
for
fasta_record in range(num_records):
scanner.feed(file_to_parse,
consumer)
file_to_parse.close()
return
consumer.species_list
# the use of the function
all_species =
extract_organisms("ls_orchid.fasta", 94) #Lady Slipper Orchids
print "number
of species:", len(all_species)
print ’species names:’,
all_species
One big problem with the parser we
used above was that we were required to know the number of elements
in the file. To help deal with this problem, we can use
the Iterator interface for FASTA files. This interface allows you to not
worry about consumers and scanners, and just march through a file one record at
a time. However, doing this requires that the Iterator return some kind of
object that you can manipulate to extract the info you want - a very useful
class was written into many of the Biopython parsers – a RecordParser,
which parses a file into a python class that represents it. For instance, for
FASTA files, the RecordParser class is just an object with title and sequence
attributes.
>>> from Bio import
Fasta
>>> parser =
Fasta.RecordParser()
>>> file =
open("ls_orchid.fasta")
>>> iterator =
Fasta.Iterator(file, parser)
>>> cur_record =
iterator.next()
>>> print
cur_record.title
gi|2765658|emb|Z78533.1|CIZ78533
C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
>>> print
cur_record.sequence
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGA
. . .
>>> print
cur_record
>gi|2765658|emb|Z78533.1|CIZ78533
C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA
. . .
We can use all of this to do a
rewrite of our extract_organisms function:
def
extract_organisms(file_to_parse):
parser
= Fasta.RecordParser()
file = open(file_to_parse, ’r’)
iterator
= Fasta.Iterator(file, parser)
all_species
= []
while
1:
cur_record
= iterator.next()
if
cur_record is None:
break
#
extract the info from the title
title_atoms
= string.split(cur_record.title)
new_species
= title_atoms[1]
#
append the new species to the list if it isn’t there
if
new_species not in all_species:
all_species.append(new_species)
return
all_species
Let’s index our record via their
GenBank accession number.
import string
def
get_accession_num(fasta_record):
title_atoms
= string.split(fasta_record.title)
#
all of the accession number information is stuck in the first element
#
and separated by ’|’s
accession_atoms
= string.split(title_atoms[0], ’|’)
#
the accession number is the 4th element
gb_name
= accession_atoms[3]
#
strip the version info before returning
return
gb_name[:-2]
Now we need to create an index from
this file, which will show us why writing this function was important.
The general format for indexing a
file is:
index_file(file_to_index,
index_file_to_create, function_to_get_index_key)
The tricky argument is the
function_to_get_index_key. Basically, this is a reference to a function that
can get called and return an element
to use as a key. The idea here is that the index_file should be general
purpose, and there isn’t any good
way for the Biopython developers to read your mind and know how you
want to index your files. So now the
get_accession_num function above makes a lot of sense!
Using all of this, we can now create
an index file and see how it works. First, let’s create the index file:
>>> from Bio import
Fasta
>>>
Fasta.index_file("ls_orchid.fasta", "my_orchid_dict.idx",
get_accession_num)
This creates an index file
my_orchid_dict.idx based on the input from ls_orchid.fasta and indexed
by the values returned by our get_accession_number
function.
Now that we’ve got the index, we can
create an in-memory dictionary to access the file contents using
the index:
>>> from
Bio.Alphabet import IUPAC
>>> dna_parser =
Fasta.SequenceParser(IUPAC.ambiguous_dna)
>>> orchid_dict =
Fasta.Dictionary("my_orchid_dict.idx", dna_parser)
The type of objects returned by the
index file are based on the second argument passed. This argument should be a
parser that the index file will pass the
information through before returning it. If no parser is passed, then the
dictionary will just return the raw object (i. e. exactly as it appears in the
file). The parser that we pass here is
a SequenceParser which converts FASTA files into SeqRecord
objects. The SequenceParser takes an
argument of the alphabet we should be using for the sequences. Since the parser
can’t be smart enough to know exactly what it will be parsing, we need to tell
it.
Since
this is a dictionary, we can look at all of the keys we have available:
>>> print
orchid_dict.keys()
[’Z78475’, ’Z78519’,
’Z78516’, ’Z78517’, ’Z78514’, ’Z78515’, ’Z78512’,
’Z78513’, ’Z78510’, ’Z78511’, ’Z78457’, ’Z78456’, ’Z78455’,
’Z78454’,
...
>>> seq_record =
orchid_dict[’Z78475’]
>>> print
seq_record
<Bio.SeqRecord.SeqRecord
instance at 0x102c1f2c>
>>> print
seq_record.description
gi|2765600|emb|Z78475.1|PSZ78475
P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA
>>> print
seq_record.seq
Seq(’CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCAC
...’, IUPACAmbiguousDNA())
The sys module: The sys module provides an interface with the Python interpreter:
%
python -i prog.py myseq.fasta
>>>
import sys
>>>
sys.argv
[’prog.py’,
’myseq.fasta’]
>>> sys.stdout.write("a string\n")
a
string
>>>
sys.stdin.read()
a line
another
line <Ctrl-D> # need a ctrl-d to
end the input
’a
line\nanother line\n’
The os module: This module is very helpful to handle files, directories,
and processes.
>>>
import os.path
>>>
os.path.exists(’myseq.fasta’)
1
>>> os.path.isfile(’myseq.fasta’)
1
>>> os.path.isdir(’myseq.fasta’)
0
>>>
os.path.basename(’/local/bin/perl’)
’perl’
The simplest way to run a program is by using the
system of the os module.
>>>
import os
>>>
cmd="golden swissprot:malk_ecoli"
>>> status = os.system(cmd)
>>>
print "Status: ", status
The
first step in interacting with clustalw is to set up a command line you want to
pass to the program.
>>> import os
>>> from
Bio.Clustalw import MultipleAlignCL
>>> command_line =
MultipleAlignCL(os.path.join(os.curdir, ’sars.fasta’))
>>>
command_line.set_output(’test.aln’)
# CLUSTALW_OUTPUT_FILE =
"test.aln"
# CLUSTALW_OUTPUT_TYPE =
"NEXUS"
#
command_line.set_output(CLUSTALW_OUTPUT_FILE, output_type=CLUSTALW_OUTPUT_TYPE)
>>> str( command_line
)
clustalw ./sars.fasta
-OUTFILE=test.aln
>>> from Bio import
Clustalw
>>> alignment =
Clustalw.do_alignment( command_line) # a ClustalAlignment type object
>>> all_records =
alignment.get_all_seqs() # get a list of SeqRecord objects
>>> print ’description:’,
all_records[0].description
>>> print
’sequence:’, all_records[0].seq
This prints out the description and
sequence object for the first sequence in the alignment:
description:
gi|6273285|gb|AF191659.1|AF191
sequence:
Seq(’TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT
...’, IUPACAmbiguousDNA())
>>> length =
alignment.get_alignment_length() # calculate the maximum length of the
alignment
>>> print alignment
CLUSTAL X (1.81) multiple
sequence alignment
gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273284|gb|AF191658.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
...
>>> from Bio.Align
import AlignInfo
>>> summary_align =
AlignInfo.SummaryInfo( alignment )
Assuming we’ve got a SummaryInfo
object called summary_align we can calculate a consensus by doing:
>>> consensus =
summary_align.dumb_consensus() # a Seq
object
>>> print consensus
consensus
Seq(’TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT
...’,
IUPACAmbiguousDNA())
# the default ambiguous
character is ‘N’
First, let’s say we have just parsed
an alignment from clustal format into a ClustalAlignment object:
>>> import os
>>> from Bio import
Clustalw
>>> alignment =
Clustalw.parse_file(os.path.join(os.curdir, ’test.aln’))
Now, let’s convert this
alignment into FASTA format. First, we create a converter object:
>>> from
Bio.Align.FormatConvert import FormatConverter
>>>
converter = FormatConverter(alignment)
We then pass the converter the
alignment that we want to convert. Now, to get this in FASTA alignment
format, we simply do the following:
>>> fasta_align =
converter.to_fasta()
>>> print
fasta_align
>gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAATATATA----
------ATATATTTCAAATTTCCTTATATACCCAAATATAAAAATATCTAATAAATTAGA
...
• ExPASy – http://www.expasy.ch
• Entrez from NCBI – http://www.ncbi.nlm.nih.gov/Entrez/
• PubMed from NCBI – http://www.ncbi.nlm.nih.gov/PubMed/
In this example, we’ll just show how
to connect, get the results, and display them in a web browser.
First,
we’ll start by defining our search and how to display the results:
search_command = ’Search’
search_database =
’Nucleotide’
return_format = ’FASTA’
search_term =
’Cypripedioideae’
my_browser = ’lynx’
The first four terms
define the search we are going to do. To use the Entrez module, you’ll need to
know a bit about how the
remote CGI scripts at NCBI work, and you can find out more about this at
http://www.ncbi.nlm.nih.gov/entrez/query/static/linking.html. The final term just
describes the
browser to display the
results in.
from Bio.WWW import NCBI # it has all the interface with NCBI
result_handle =
NCBI.query(search_command, search_database, term = search_term,
doptcmdl
= return_format) # Entrez query
import os
result_file_name =
os.path.join(os.getcwd(), ’results.html’)
result_file =
open(result_file_name, ’w’)
result_file.write(result_handle.read())
result_file.close()
if my_browser == ’lynx’:
os.system(’lynx
-force_html ’ + result_file_name)
elif my_browser ==
’netscape’:
os.system(’netscape
file:’ + result_file_name)
The GenBank record
format is a very popular method of holding information about sequences,
sequence
features, and other
associated sequence information. The format is a good way to get information
from the
NCBI databases at http://www.ncbi.nlm.nih.gov/.
We can do quick search and get back
the Gis (GenBank identifiers) for all of the corresponding records:
>>> from Bio import
GenBank
>>> gi_list =
GenBank.search_for("Opuntia AND rpl16")
gi_list will be a list of all of the
GenBank identifiers that match our query:
[’6273291’, ’6273290’,
’6273289’, ’6273287’, ’6273286’, ’6273285’, ’6273284’]
Now that we’ve got the
GIs, we can use these to access the NCBI database through a dictionary
interface:
>>>
ncbi_dict = GenBank.NCBIDictionary()
Now
that we’ve got this, we do the retrieval:
>>> gb_record =
ncbi_dict[gi_list[0]]
In this case, gb_record will be
GenBank formatted record:
LOCUS AF191665 902
bp DNA PLN 07-NOV-1999
DEFINITION Opuntia marenae rpl16 gene; chloroplast gene
for chloroplast
product,
partial intron sequence.
ACCESSION AF191665
VERSION AF191665.1 GI:6273291
...
In this case, we are just getting
the raw records. We can also pass these records directly into a parser
and return the parsed record. For
instance, if we wanted to get back SeqRecord objects with the GenBank
file parsed into SeqFeature objects
we would need to create the dictionary with the GenBank FeatureParser:
>>> record_parser =
GenBank.FeatureParser()
>>> ncbi_dict =
GenBank.NCBIDictionary(parser = record_parser)
Now
retrieving a record will give you a SeqRecord object instead of the raw record:
>>> gb_seqrecord =
ncbi_dict[gi_list[0]]
>>> print
gb_seqrecord
<Bio.SeqRecord.SeqRecord
instance at 0x102f9404>
Right now the GenBank
module provides the following parsers:
1. RecordParser –
This parses the raw record into a GenBank specific Record object. This object
models
the information in a raw
record very closely, so this is good to use if you are just interested in
GenBank
records themselves.
2. FeatureParser –
This parses the raw record in a SeqRecord object with all of the feature table
information
represented in
SeqFeatures. This is best to use if you are interested in getting things in a
more standard format.
Either way
you chose to go, the most common usage of these will be creating an iterator
and parsing
through a file on GenBank
records.
>>> from Bio import
GenBank
>>> gb_file =
"my_file.gb"
>>> gb_handle =
open(gb_file, ’r’)
>>> feature_parser
= GenBank.FeatureParser()
>>> gb_iterator =
GenBank.Iterator(gb_handle, feature_parser)
>>> while 1:
... cur_record = gb_iterator.next()
... if cur_record is None:
... break
... # now do something with the record
... print cur_record.seq
This just iterates over a GenBank
file, parsing it into SeqRecord and SeqFeature objects, and prints out
the Seq objects representing the
sequences in the record.
The first step in automating BLASTing is to make
everything accessable from python scripts. So, biopython
contains code that allows you to run the WWW
version of BLAST (http://www.ncbi.nlm.nih.gov/BLAST/)
directly from your python scripts.
The code to deal with the WWW version of BLAST is found in
the Bio.Blast.NCBIWWW module, and the
blast function. Let’s say we want to BLAST info we have in
a FASTA formatted file against the database.
First, we need to get the info from the FASTA file. The
easiest way to do this is to use the Fasta iterator to get
a record. In this example, we’ll just grab the first FASTA
record from a file, using an Iterator:
>>> from Bio import
Fasta
>>>
file_for_blast = open(’m_cold.fasta’, ’r’)
>>>
f_iterator = Fasta.Iterator(file_for_blast)
>>>
f_record = f_iterator.next()
Now that
we’ve got the FASTA record, we are ready to blast it.
>>> from Bio.Blast
import NCBIWWW
>>> result_handle =
NCBIWWW.blast(’blastn’, ’nr’, f_record) # returns a file-like handle
The blast
function also take a number of other option arguments which are basically
analogous to the different parameters you can set on the basic BLAST page. The first argument is the blast program to use for the
search. The options and descriptions of the programs are available at http://www.ncbi.nlm.nih.gov/BLAST/blast_program.html. The second argument specifies the databases to search
against. Again, the options for this are available on the NCBI web pages at http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.html.
Before parsing the results, it is
often useful to save them into a file so that you can use them later
without having to go back and
re-blast everything.
>>> save_file =
open(’my_blast.out’, ’w’)
# save the results for later, in case we
want to look at it
>>>
blast_results = result_handle.read() # load the handle into a string
>>>
save_file.write(blast_results) # write the string to the file
>>> save_file.close() #
don’t forget to close the file
You can get information
to parse with the WWW BLAST parser in one of the two ways.
1. The first is to grab
the info using the biopython blast function for dealing with blast, described
above.
2. The second way is to
manually do the BLAST on the NCBI site, and then save the results. In order to
have
your output in the right
format, you need to save the final BLAST page you get (you know, the one with
all
of the interesting
results!) as HTML (i. e. do not save as text!) to a file. The important point
is that you
do not have to use
biopython scripts to fetch the data in order to be able to parse it.
If
you followed the code above for interacting with BLAST through a script, then
you already have result_handle, which is great for passing to a parser. If you
saved the results in a file, then opening up the file with code like the
following will get you what you need:
>>> result_handle = open(’my_blast_result.out’, ’r’) # what you
received using either 1 or 2
Now that we’ve got a handle, we are
ready to parse the output. The code to parse it is really quite small:
>>> from Bio.Blast
import NCBIWWW
>>> blast_parser =
NCBIWWW.BlastParser() # get
the parser
>>> blast_record =
blast_parser.parse(result_handle) # parse
the handle
What this
accomplishes is parsing the BLAST output into a Blast Record class, which has a
rich structure. Now let’s just print
out some summary info about all hits in our blast report greater than a
particular threshold. The following code does this:
E_VALUE_THRESH = 0.04
for alignment in
blast_record.alignments:
for hsp in alignment.hsps:
if
hsp.expect < E_VALUE_THRESH:
print
’****Alignment****’
print
’sequence:’, alignment.title
print
’length:’, alignment.length
print
’e value:’, hsp.expect
print
hsp.query[0:75] + ’...’
print
hsp.match[0:75] + ’...’
print
hsp.sbjct[0:75] + ’...’
This will print out summary reports
like the following:
****Alignment****
sequence:
>gb|AF283004.1|AF283004 Arabidopsis thaliana cold acclimation protein
WCOR413-like protein
alpha form mRNA, complete
cds
length: 783
e value: 0.034
tacttgttgatattggatcgaacaaactggagaaccaacatgctcacgtcacttttagtcccttacatattcctc...
||||||||| | ||||||||||| ||
|||| || || |||||||| |||||| | | |||||||| ||| ||...
tacttgttggtgttggatcgaaccaattggaagacgaatatgctcacatcacttctcattccttacatcttcttc...