InfoPlatter: November 2013

Thursday, 7 November 2013

Python Modules: Expand your reach in Bioinformatics! (Part#2: Hybrid Programming)

A classic question in bioinformatics is which programming language is best for a bioinformatician. Discussions like this never reach a conclusive answer. Interestingly, people treat the question as a piece of cake and jump at it with whatever they have at hand! The result is a nice rainbow of choices, all the way from "C" to "PHP"!

Each programming language has its own perks and disadvantages. For example, C executes very fast, but even a simple program takes a lot of code to write. Python and Perl, on the other hand, keep the same program short but run it more slowly. Apart from these performance issues, every language comes with a different range of third-party modules and libraries.

Python provides interfaces to many system calls and libraries, giving direct access to the operating system and its shell (modules like os and subprocess let you run Unix commands directly from a Python session). Python is also usable as an extension language for applications written in other languages that need easy-to-use scripting or automation interfaces. More than 15 coding projects have been started to build platforms where Python can be integrated with other programming languages such as C, Java, Perl, PHP, R and Fortran.
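As a minimal sketch of the os and subprocess modules mentioned above: the helper name run_command is made up for this illustration, and the example assumes a Unix-like system where the echo command is on the PATH.

```python
import os
import subprocess


def run_command(args):
    """Run an external command and return its standard output as text.

    run_command is a hypothetical helper for this example, not a library function.
    """
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout


# os exposes operating-system services directly, e.g. the current directory.
print("Working directory:", os.getcwd())

# subprocess launches an external program; "echo" is assumed to be available.
output = run_command(["echo", "ACGT"])
print("Command output:", output.strip())
```

Passing the command as a list avoids going through a shell, which keeps quoting problems out of the picture when arguments come from user data such as file names.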

These hybrid platforms are available either as Python modules that can be imported like any general-purpose module (numpy, math, random, etc.), or from a parent language (e.g. Jython, a Python implementation in Java).

A detailed list of these hybrid platforms is accessible from here.

Some fascinating platforms I couldn't resist mentioning here are:

Elmer: Allows developers to write code in Python and execute it in C or Tcl.
JPype: Allows Python programs to fully access Java class libraries.
PyPerlish: Allows the use of Perl idioms in Python.
RPy: Simple and efficient access to R from Python.

It is interesting to note that every platform mentioned here was somebody's dream. Shifting to a new language may deliver exciting new features, but at the same time it takes away what you loved most about the previous one. Here are the words of the creator of PyPerlish:

"I've used perl for several years, and been very impressed with its ease of use. When you need to do something new, chances are there is an idiom which lets you do it in a few keystrokes. I didn't want to lose that in moving to python. Somehow I wanted to get the benefits of perl's idioms with the robust scalability and maintainability of python. So the idea is to emulate perl idioms, no matter how we implement the python code under the covers." -- Harry George

Sunday, 3 November 2013

Python Modules: Expand your reach in Bioinformatics! (Part#1: Phyloinformatics)

Python is becoming increasingly popular among bioinformaticians, not just for its simple yet powerful structure but also for the third-party modules that add domain-specific advantages. This series is dedicated to compiling such modules, domain by domain.

In this section, the most popular Python modules in phyloinformatics are introduced.

E.T.E: a Python Environment for phylogenetic Tree Exploration

"ETE is a python programming toolkit that assists in the automated manipulation, analysis and visualization of phylogenetic and other type of trees. It provides a wide range of tree handling methods, node annotation features, programmatic access to the phylomeDB database, and automatic orthology and paralogy prediction methods. In addition, an interactive tree visualization program, as well as a highly customizable tree drawing engine, is included." -- ETE website

ETE examples: Tree with multiple sequence alignment, Bar chart and Pie chart

ETE is very well documented and fairly easy to use. Traversing the tree in either direction (from root to leaves, or from leaves to root), adding or removing custom features on individual nodes, creating graphics-rich plots, integrating multiple sequence alignments, testing evolutionary hypotheses and much more can be achieved easily with this module.
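To make the two traversal directions and the idea of custom node features concrete, here is a small pure-Python sketch. It deliberately does not use ETE's own API: the Node class, the preorder and postorder functions, and the n_leaves feature are all names invented for this illustration.

```python
class Node:
    """A bare-bones tree node (hypothetical; not ETE's tree class)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.features = {}  # custom annotations attached to this node

    def is_leaf(self):
        return not self.children


def preorder(node):
    """Visit a node before its descendants (root to leaves)."""
    yield node
    for child in node.children:
        yield from preorder(child)


def postorder(node):
    """Visit a node after its descendants (leaves to root)."""
    for child in node.children:
        yield from postorder(child)
    yield node


# The tree ((A,B)AB,C)root built by hand.
tree = Node("root", [Node("AB", [Node("A"), Node("B")]), Node("C")])

# Leaves-to-root pass: annotate every node with the number of leaves below it.
for node in postorder(tree):
    if node.is_leaf():
        node.features["n_leaves"] = 1
    else:
        node.features["n_leaves"] = sum(c.features["n_leaves"] for c in node.children)

# Root-to-leaves pass: report the annotation.
for node in preorder(tree):
    print(node.name, node.features["n_leaves"])
```

The leaves-to-root order matters for the annotation step, because each internal node needs its children's values to be ready before its own can be computed.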

DendroPy

"DendroPy is a Python library for phylogenetic computing. It provides classes and functions for the simulation, processing, and manipulation of phylogenetic trees and character matrices, and supports the reading and writing of phylogenetic data in a range of formats, such as NEXUS, NEWICK, NeXML, Phylip, FASTA etc. Application scripts for performing some useful phylogenetic operations, such as data conversion and tree posterior distribution summarization, are also distributed and installed as part of the libary. DendroPy can thus function as a stand-alone library, a component of more complex multi-library phyloinformatic pipelines, or as a scripting “glue” that assembles and drives such pipelines." -- DendroPy Website

Compared to ETE, DendroPy focuses more on the computational side of phyloinformatics, including the simulation of birth-death process trees, population genetic trees, coalescent trees, etc. DendroPy also calculates general tree statistics such as tree length, node ages, probability under the coalescent model and tree distances. Unlike ETE, DendroPy also supports a variety of character matrices (DNA, RNA, protein, and any continuous or discrete-valued data), and, given a tree and a continuous character matrix, it allows Phylogenetic Independent Contrasts (PIC) analysis (as described by Felsenstein 1985).
CAUTION: The current release (3.2.0) does not support Python 3.
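For readers unfamiliar with independent contrasts, the calculation itself can be sketched in a few lines of plain Python. This is not DendroPy's API; the function name, the nested-tuple tree encoding and the trait values are assumptions made up for the example, and the tree must be strictly bifurcating with positive branch lengths below the root.

```python
import math


def independent_contrasts(node, traits):
    """Return (value, branch_length, contrasts) for a bifurcating tree.

    Hypothetical encoding: a leaf is (name, branch_length) and an internal
    node is (left, right, branch_length).
    """
    if len(node) == 2:  # leaf: look up its observed trait value
        name, length = node
        return traits[name], length, []

    left, right, length = node
    x_l, v_l, c_l = independent_contrasts(left, traits)
    x_r, v_r, c_r = independent_contrasts(right, traits)

    # Standardized contrast between the two daughter lineages.
    contrast = (x_l - x_r) / math.sqrt(v_l + v_r)
    # Ancestral estimate: daughter values weighted by inverse branch length.
    value = (x_l / v_l + x_r / v_r) / (1 / v_l + 1 / v_r)
    # Lengthen the branch above this node to account for estimation error.
    adjusted = length + (v_l * v_r) / (v_l + v_r)
    return value, adjusted, c_l + c_r + [contrast]


# The tree ((A:1,B:1):1,C:2) with an invented continuous trait.
tree = ((("A", 1.0), ("B", 1.0), 1.0), ("C", 2.0), 0.0)
traits = {"A": 1.0, "B": 3.0, "C": 5.0}

root_value, _, contrasts = independent_contrasts(tree, traits)
print("Contrasts:", contrasts)
print("Root estimate:", root_value)
```

A tree with n leaves yields n - 1 contrasts, one per internal node, which is why the example above produces two values.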

Bio.Phylo (BioPython)

The Bio.Phylo module was introduced in BioPython 1.54. The module is simple but covers all the necessary functionality, including parsing and writing various tree formats, displaying trees in different color palettes, search and traversal methods, and extracting or modifying clade- or node-specific information. Bio.Phylo also allows integration of third-party applications such as PAML for phylogenetic analysis by maximum likelihood. Likewise, BioPython wrappers are available for PhyML, RAxML and FastTree.

All three modules are well documented, and none can replace another given how much their functionality differs. There are also a couple of other modules that are highly function-specific and might just fit your requirements. These are:

P4: a python package for phylogenetics

For maximum likelihood and Bayesian phylogenetic analysis on molecular sequences

Mavric: a python toolkit for phylogenetics

Fully interactive editing of phylogenetic trees