A Crash Course on BioPython


Spring 2004


On th elab machines, Biopython package is residing in “/usr/local/lib/python2.3/site-packages”.  Please add the following line to your “.bashrc” file:

export PYTHONPATH=$PYTHONPATH’:/usr/local/lib/python2.3/site-packages’



The main biopython releases have lots of functionality, including:


• The ability to parse bioinformatics files into python utilizable data structures, including support for

the following formats:

– Blast output – both from standalone and WWW Blast

– Clustalw


– GenBank

– PubMed and Medline

– Expasy files, like Enzyme, Prodoc and Prosite

– SwissProt


Files in the supported formats can be iterated over record by record or indexed and accessed via a

Dictionary interface.


• Code to deal with popular on-line bioinformatics destinations such as:


– Blast, Entrez and PubMed services

– Expasy,  Prodoc, Prosite, and SwissProt


• Interfaces to common bioinformatics programs such as:

– Standalone Blast from NCBI

– Clustalw alignment program.


• A standard sequence class that deals with sequences, ids on sequences, and sequence features;

tools for performing common operations on sequences, such as translation, transcription and weight



• Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector



• Code for dealing with alignments, including a standard way to create and deal with substitution



• Integration with other languages, including the Bioperl and Biojava projects, using the BioCorba

interface standard (available with the biopython-corba module).


Seq objects


A biopython Seq object has two important attributes:

1. data – the actual sequence data string of the sequence.

2. alphabet – an object describing what the individual characters making up the string “mean” and how

they should be interpreted.  We’ll use the  IUPAC alphabets (http://www.chem.qmw.ac.uk/iupac/) here to deal with some of our favorite objects, DNA, RNA and Proteins:  IUPACProtein, ExtendedIUPACProtein, IUPACUnambiguousDNA,  IUPACAmbiguousDNA.


>>> from Bio.Alphabet import IUPAC

>>> my_alpha = IUPAC.unambiguous_dna

>>> from Bio.Seq import Seq


>>> print my_seq


>>> my_seq[4:12]

Seq(’GATGGGCC’, IUPACUnambiguousDNA())



>>> protein_seq = Seq(’EVRNAK’, IUPAC.protein)

>>> dna_seq = Seq(’ACGT’, IUPAC.unambiguous_dna)

>>> protein_seq + dna_seq

Traceback (most recent call last):

File "<stdin>", line 1, in ?

File "/usr/local/lib/python1.6/site-packages/Bio/Seq.py", line 42, in __add__

raise TypeError, ("incompatable alphabets", str(self.alphabet),

TypeError: (’incompatable alphabets’, ’IUPACProtein()’, ’IUPACUnambiguousDNA()’)


>>> my_seq.tostring()



>>> my_seq[5] = ’G’  #  Not mutable!!!

Traceback (most recent call last):

File "<stdin>", line 1, in ?

AttributeError: ’Seq’ instance has no attribute ’__setitem__’


>>> mutable_seq = my_seq.tomutable()  # make it mutable if you have to

>>> print mutable_seq


>>> mutable_seq[5] = ’T’

>>> print mutable_seq


>>> mutable_seq.remove(’T’)

>>> print mutable_seq


>>> mutable_seq.reverse()

>>> print mutable_seq



>>> from Bio import Transcribe

>>> transcriber = Transcribe.unambiguous_transcriber

>>> my_rna_seq = transcriber.transcribe(my_seq)

>>> print my_rna_seq



>>> transcriber.back_transcribe(my_rna_seq)



>>> from Bio import Translate

>>> standard_translator = Translate.unambiguous_dna_by_id[1] #standard translation

>>> mito_translator = Translate.unambiguous_dna_by_id[2] #Mitochondriall DNA


>>> my_protein = Seq(’AVMGRWKGGRAAG’, IUPAC.protein)

>>> standard_translator.back_translate(my_protein)



Parsing biological file formats


All of the parsers have two components:

1. Scanner - The part of the parser that actually does the work or going through the file and extracting

useful information. This useful information is converted into Events.

2. Consumer - The consumer does the job of processing the useful information and spitting it out ina

format that the programmer can use. The consumer does this by receiving the events created by the



>gi|6273290|gb|AF191664.1|AF191664 Opuntia clavata rpl16 gene; chloroplast gene for...



AAA ...


As a scanner moved through a file containing this entry, it would create the following events:


begin sequence         (as soon as we notice we’ve got a new >)

title                   >gi|6273290|gb|AF191664.1|AF191664 Opuntia clavata...

sequence                TATACATTAAAGGAGGGGGATGCGGAT...


sequence                AAA... (and so on – you’ve got the idea!)

end sequence            (as soon as we reach a blank line after the sequence data)



1. Register itself with the scanner to let it know that it wants to receive those events that are being


2. Do something with the events (and the information associated with them).


# Writing your own consumer

import string

from Bio.ParserSupport import AbstractConsumer

class SpeciesExtractor(AbstractConsumer):

def __init__(self):

self.species_list = []  # create a species_list

def title(self, title_info):

title_atoms = string.split(title_info)

new_species = title_atoms[1]

if new_species not in self.species_list:



So, when the Scanner comes to the first line in our example FASTA file:


>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA


It will recognize this as a title, and call the title function with the title_info argument set as the

value of the title.


from Bio import Fasta

def extract_organisms(file, num_records):

scanner = Fasta._Scanner()

consumer = SpeciesExtractor()

file_to_parse = open(file, ’r’)

for fasta_record in range(num_records):

scanner.feed(file_to_parse, consumer)


return consumer.species_list


# the use of the function

all_species = extract_organisms("ls_orchid.fasta", 94) #Lady Slipper Orchids

print "number of species:", len(all_species)

print ’species names:’, all_species


Biopython parsers


One big problem with the parser we used above was that we were required to know the number of elements

in the file.  To help deal with this problem, we can use the Iterator interface for FASTA files. This interface allows you to not worry about consumers and scanners, and just march through a file one record at a time. However, doing this requires that the Iterator return some kind of object that you can manipulate to extract the info you want - a very useful class was written into many of the Biopython parsers – a RecordParser, which parses a file into a python class that represents it. For instance, for FASTA files, the RecordParser class is just an object with title and sequence attributes.


>>> from Bio import Fasta

>>> parser = Fasta.RecordParser()

>>> file = open("ls_orchid.fasta")

>>> iterator = Fasta.Iterator(file, parser)

>>> cur_record = iterator.next()

>>> print cur_record.title

gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA

>>> print cur_record.sequence


. . .


>>> print cur_record

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA



We can use all of this to do a rewrite of our extract_organisms function:


def extract_organisms(file_to_parse):

parser = Fasta.RecordParser()

file = open(file_to_parse, ’r’)

iterator = Fasta.Iterator(file, parser)

all_species = []

while 1:

cur_record = iterator.next()

if cur_record is None:


# extract the info from the title

title_atoms = string.split(cur_record.title)

new_species = title_atoms[1]

# append the new species to the list if it isn’t there

if new_species not in all_species:


return all_species


FASTA files as Dictionaries


Let’s index our record via their GenBank accession number.


import string

def get_accession_num(fasta_record):

title_atoms = string.split(fasta_record.title)

# all of the accession number information is stuck in the first element

# and separated by ’|’s

accession_atoms = string.split(title_atoms[0], ’|’)

# the accession number is the 4th element

gb_name = accession_atoms[3]

# strip the version info before returning

return gb_name[:-2]


Now we need to create an index from this file, which will show us why writing this function was important.

The general format for indexing a file is:


index_file(file_to_index, index_file_to_create, function_to_get_index_key)


The tricky argument is the function_to_get_index_key. Basically, this is a reference to a function that

can get called and return an element to use as a key. The idea here is that the index_file should be general

purpose, and there isn’t any good way for the Biopython developers to read your mind and know how you

want to index your files. So now the get_accession_num function above makes a lot of sense!

Using all of this, we can now create an index file and see how it works. First, let’s create the index file:


>>> from Bio import Fasta

>>> Fasta.index_file("ls_orchid.fasta", "my_orchid_dict.idx", get_accession_num)


This creates an index file my_orchid_dict.idx based on the input from ls_orchid.fasta and indexed

by the values returned by our get_accession_number function.

Now that we’ve got the index, we can create an in-memory dictionary to access the file contents using

the index:


>>> from Bio.Alphabet import IUPAC

>>> dna_parser = Fasta.SequenceParser(IUPAC.ambiguous_dna)

>>> orchid_dict = Fasta.Dictionary("my_orchid_dict.idx", dna_parser)


The type of objects returned by the index file are based on the second argument passed. This argument should be a

parser that the index file will pass the information through before returning it. If no parser is passed, then the dictionary will just return the raw object (i. e. exactly as it appears in the file).  The parser that we pass here is a SequenceParser which converts FASTA files into SeqRecord objects.  The SequenceParser takes an argument of the alphabet we should be using for the sequences. Since the parser can’t be smart enough to know exactly what it will be parsing, we need to tell it.


Since this is a dictionary, we can look at all of the keys we have available:


>>> print orchid_dict.keys()

[’Z78475’, ’Z78519’, ’Z78516’, ’Z78517’, ’Z78514’, ’Z78515’, ’Z78512’,

 ’Z78513’, ’Z78510’, ’Z78511’, ’Z78457’, ’Z78456’, ’Z78455’, ’Z78454’,


>>> seq_record = orchid_dict[’Z78475’]

>>> print seq_record

<Bio.SeqRecord.SeqRecord instance at 0x102c1f2c>

>>> print seq_record.description

gi|2765600|emb|Z78475.1|PSZ78475 P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA

>>> print seq_record.seq



Using the system environment: os and sys modules


The sys module: The sys module provides an interface with the Python interpreter:


% python -i prog.py myseq.fasta

>>> import sys

>>> sys.argv

[’prog.py’, ’myseq.fasta’]


>>> sys.stdout.write("a string\n")

a string

>>> sys.stdin.read()

a line

another line <Ctrl-D>  # need a ctrl-d to end the input

’a line\nanother line\n’


The os module:  This module is very helpful to handle files, directories, and  processes.


>>> import os.path

>>> os.path.exists(’myseq.fasta’)


>>> os.path.isfile(’myseq.fasta’)


>>> os.path.isdir(’myseq.fasta’)


>>> os.path.basename(’/local/bin/perl’)



The simplest way to run a program is by using the system of the os module.


>>> import os

>>> cmd="golden swissprot:malk_ecoli"

>>> status = os.system(cmd)

>>> print "Status: ", status


Dealing with alignments: Clustalw


The first step in interacting with clustalw is to set up a command line you want to pass to the program.


>>> import os

>>> from Bio.Clustalw import MultipleAlignCL

>>> command_line = MultipleAlignCL(os.path.join(os.curdir, ’sars.fasta’))

>>> command_line.set_output(’test.aln’)




# command_line.set_output(CLUSTALW_OUTPUT_FILE, output_type=CLUSTALW_OUTPUT_TYPE)


>>> str( command_line )

clustalw ./sars.fasta -OUTFILE=test.aln


>>> from Bio import Clustalw

>>> alignment = Clustalw.do_alignment( command_line) # a ClustalAlignment type object

>>> all_records = alignment.get_all_seqs() # get a list of SeqRecord objects


>>> print ’description:’, all_records[0].description

>>> print ’sequence:’, all_records[0].seq


This prints out the description and sequence object for the first sequence in the alignment:


description: gi|6273285|gb|AF191659.1|AF191


...’, IUPACAmbiguousDNA())

>>> length = alignment.get_alignment_length() # calculate the maximum length of the alignment

>>> print alignment

CLUSTAL X (1.81) multiple sequence alignment






Calculating summary information


>>> from Bio.Align import AlignInfo

>>> summary_align = AlignInfo.SummaryInfo( alignment )


Assuming we’ve got a SummaryInfo object called summary_align we can calculate a consensus by doing:


>>> consensus = summary_align.dumb_consensus()  # a Seq object

>>> print consensus


...’, IUPACAmbiguousDNA())

# the default ambiguous character is ‘N’


Translating between Alignment formats


First, let’s say we have just parsed an alignment from clustal format into a ClustalAlignment object:

>>> import os

>>> from Bio import Clustalw

>>> alignment = Clustalw.parse_file(os.path.join(os.curdir, ’test.aln’))


Now, let’s convert this alignment into FASTA format. First, we create a converter object:

>>> from Bio.Align.FormatConvert import FormatConverter

>>> converter = FormatConverter(alignment)


We then pass the converter the alignment that we want to convert. Now, to get this in FASTA alignment

format, we simply do the following:


>>> fasta_align = converter.to_fasta()

>>> print fasta_align






Connecting with biological databases


• ExPASy                     http://www.expasy.ch

• Entrez from NCBI         http://www.ncbi.nlm.nih.gov/Entrez/

• PubMed from NCBI      http://www.ncbi.nlm.nih.gov/PubMed/


In this example, we’ll just show how to connect, get the results, and display them in a web browser.

First, we’ll start by defining our search and how to display the results:


search_command = ’Search’

search_database = ’Nucleotide’

return_format = ’FASTA’

search_term = ’Cypripedioideae’

my_browser = ’lynx’


The first four terms define the search we are going to do. To use the Entrez module, you’ll need to

know a bit about how the remote CGI scripts at NCBI work, and you can find out more about this at

http://www.ncbi.nlm.nih.gov/entrez/query/static/linking.html. The final term just describes the

browser to display the results in.


from Bio.WWW import NCBI      # it has all the interface with NCBI

result_handle = NCBI.query(search_command, search_database, term = search_term,

doptcmdl = return_format)  # Entrez query

import os

result_file_name = os.path.join(os.getcwd(), ’results.html’)

result_file = open(result_file_name, ’w’)



if my_browser == ’lynx’:

os.system(’lynx -force_html ’ + result_file_name)

elif my_browser == ’netscape’:

os.system(’netscape file:’ + result_file_name)




The GenBank record format is a very popular method of holding information about sequences, sequence

features, and other associated sequence information. The format is a good way to get information from the

NCBI databases at http://www.ncbi.nlm.nih.gov/.


Retrieving GenBank entries from NCBI


We can do quick search and get back the Gis (GenBank identifiers) for all of the corresponding records:

>>> from Bio import GenBank

>>> gi_list = GenBank.search_for("Opuntia AND rpl16")


gi_list will be a list of all of the GenBank identifiers that match our query:

[’6273291’, ’6273290’, ’6273289’, ’6273287’, ’6273286’, ’6273285’, ’6273284’]


Now that we’ve got the GIs, we can use these to access the NCBI database through a dictionary interface:

>>> ncbi_dict = GenBank.NCBIDictionary()


Now that we’ve got this, we do the retrieval:

>>> gb_record = ncbi_dict[gi_list[0]]


In this case, gb_record will be GenBank formatted record:


LOCUS       AF191665    902 bp      DNA   PLN   07-NOV-1999

DEFINITION Opuntia marenae rpl16 gene; chloroplast gene for chloroplast

product, partial intron sequence.


VERSION     AF191665.1 GI:6273291



In this case, we are just getting the raw records. We can also pass these records directly into a parser

and return the parsed record. For instance, if we wanted to get back SeqRecord objects with the GenBank

file parsed into SeqFeature objects we would need to create the dictionary with the GenBank FeatureParser:


>>> record_parser = GenBank.FeatureParser()

>>> ncbi_dict = GenBank.NCBIDictionary(parser = record_parser)


Now retrieving a record will give you a SeqRecord object instead of the raw record:


>>> gb_seqrecord = ncbi_dict[gi_list[0]]

>>> print gb_seqrecord

<Bio.SeqRecord.SeqRecord instance at 0x102f9404>


Parsing GenBank records


Right now the GenBank module provides the following parsers:

1. RecordParser – This parses the raw record into a GenBank specific Record object. This object models

the information in a raw record very closely, so this is good to use if you are just interested in GenBank

records themselves.

2. FeatureParser – This parses the raw record in a SeqRecord object with all of the feature table information

represented in SeqFeatures. This is best to use if you are interested in getting things in a more standard format.


Either way you chose to go, the most common usage of these will be creating an iterator and parsing

through a file on GenBank records.


>>> from Bio import GenBank

>>> gb_file = "my_file.gb"

>>> gb_handle = open(gb_file, ’r’)

>>> feature_parser = GenBank.FeatureParser()

>>> gb_iterator = GenBank.Iterator(gb_handle, feature_parser)

>>> while 1:

...   cur_record = gb_iterator.next()

...   if cur_record is None:

...         break

...   # now do something with the record

...   print cur_record.seq


This just iterates over a GenBank file, parsing it into SeqRecord and SeqFeature objects, and prints out

the Seq objects representing the sequences in the record.




Running BLAST over the internet


The first step in automating BLASTing is to make everything accessable from python scripts. So, biopython

contains code that allows you to run the WWW version of BLAST (http://www.ncbi.nlm.nih.gov/BLAST/) directly from your python scripts.

The code to deal with the WWW version of BLAST is found in the Bio.Blast.NCBIWWW module, and the

blast function. Let’s say we want to BLAST info we have in a FASTA formatted file against the database.

First, we need to get the info from the FASTA file. The easiest way to do this is to use the Fasta iterator to get

a record. In this example, we’ll just grab the first FASTA record from a file, using an Iterator:


>>> from Bio import Fasta

>>> file_for_blast = open(’m_cold.fasta’, ’r’)

>>> f_iterator = Fasta.Iterator(file_for_blast)

>>> f_record = f_iterator.next()


Now that we’ve got the FASTA record, we are ready to blast it.


>>> from Bio.Blast import NCBIWWW

>>> result_handle = NCBIWWW.blast(’blastn’, ’nr’, f_record) # returns a file-like handle


The blast function also take a number of other option arguments which are basically analogous to the different parameters you can set on the basic BLAST page. The first argument is the blast program to use for the search. The options and descriptions of the programs are available at http://www.ncbi.nlm.nih.gov/BLAST/blast_program.html. The second argument specifies the databases to search against. Again, the options for this are available on the NCBI web pages at http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.html.


Before parsing the results, it is often useful to save them into a file so that you can use them later

without having to go back and re-blast everything.


>>> save_file = open(’my_blast.out’, ’w’) # save the results for later, in case we want to look at it

>>> blast_results = result_handle.read()    # load the handle into a string

>>> save_file.write(blast_results)              # write the string to the file

>>> save_file.close()                                          # don’t forget to close the file


Parsing the output from the WWW version of BLAST


You can get information to parse with the WWW BLAST parser in one of the two ways.

1. The first is to grab the info using the biopython blast function for dealing with blast, described above.

2. The second way is to manually do the BLAST on the NCBI site, and then save the results. In order to have

your output in the right format, you need to save the final BLAST page you get (you know, the one with all

of the interesting results!) as HTML (i. e. do not save as text!) to a file. The important point is that you

do not have to use biopython scripts to fetch the data in order to be able to parse it.


If you followed the code above for interacting with BLAST through a script, then you already have result_handle, which is great for passing to a parser. If you saved the results in a file, then opening up the file with code like the following will get you what you need:


>>> result_handle = open(’my_blast_result.out’, ’r’)    # what you received using either 1 or 2


Now that we’ve got a handle, we are ready to parse the output. The code to parse it is really quite small:


>>> from Bio.Blast import NCBIWWW

>>> blast_parser = NCBIWWW.BlastParser()              # get the parser

>>> blast_record = blast_parser.parse(result_handle)  # parse the handle


What this accomplishes is parsing the BLAST output into a Blast Record class, which has a rich structure.  Now let’s just print out some summary info about all hits in our blast report greater than a particular threshold. The following code does this:




for alignment in blast_record.alignments:

for hsp in alignment.hsps:

if hsp.expect < E_VALUE_THRESH:

print ’****Alignment****’

print ’sequence:’, alignment.title

print ’length:’, alignment.length

print ’e value:’, hsp.expect

print hsp.query[0:75] + ’...’

print hsp.match[0:75] + ’...’

print hsp.sbjct[0:75] + ’...’


This will print out summary reports like the following:



sequence: >gb|AF283004.1|AF283004 Arabidopsis thaliana cold acclimation protein WCOR413-like protein

alpha form mRNA, complete cds

length: 783

e value: 0.034


||||||||| | ||||||||||| || |||| || || |||||||| |||||| | | |||||||| ||| ||...