Using biomaRt in Python - a quick rpy2 tutorial

Nancy Dong

Aug. 15, 2018


Recently, as part of my analysis of a transcriptome, I wanted to retrieve the corresponding gene IDs and biotypes for a list of Drosophila transcript IDs, using the R package biomaRt. As I had done the rest of my data wrangling in Python, I wanted to run everything in Spyder rather than having to switch to RStudio just for this.

I had been able to use both Python and R in the same Jupyter Notebook using either the polyglot SoS Notebook or rmagic, but the Jupyter Notebook structure is just not suited for first-round analysis (more on that in a later post). I had also tried Python interfaces to Biomart, such as pybiomart and biomartpy, but neither of them was as easy to use as the original biomaRt R package.

So, I decided to get the Python-R interface rpy2 to work in my Python script in Spyder. It took some tinkering, but after some Googling, I got something cooking.


1. Import libraries

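# StrVector builds R character vectors from Python strings
from rpy2.robjects.vectors import StrVector
# pandas2ri converts between pandas and R data frames
from rpy2.robjects import pandas2ri
# `R` exposes the embedded R session; R functions are reached as attributes
from rpy2.robjects import r as R
# Turn on automatic pandas <-> R conversion
pandas2ri.activate()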
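
A version caveat for the steps that follow: the `py2ri` and `ri2py` calls in this post are the rpy2 2.x names. In rpy2 3.0 they were renamed to `py2rpy` and `rpy2py`, so on a newer install the original names raise an AttributeError. Below is a minimal sketch of one way to cope, by picking whichever pair of names the module exposes. The helper `pick_converters` and the stand-in objects are hypothetical and only illustrate the lookup; this has not been checked against every rpy2 release.

```python
from types import SimpleNamespace


def pick_converters(pandas2ri_module):
    """Return (to_r, to_python) using whichever naming scheme the module has."""
    # Try the rpy2 3.x names first, then fall back to the 2.x names.
    for to_r_name, to_py_name in (("py2rpy", "rpy2py"), ("py2ri", "ri2py")):
        if hasattr(pandas2ri_module, to_r_name) and hasattr(pandas2ri_module, to_py_name):
            return (getattr(pandas2ri_module, to_r_name),
                    getattr(pandas2ri_module, to_py_name))
    raise AttributeError("no known pandas2ri conversion functions found")


# Stand-ins for the two API generations, so the sketch runs without rpy2.
old_api = SimpleNamespace(py2ri=lambda obj: ("to_r", obj),
                          ri2py=lambda obj: ("to_py", obj))
new_api = SimpleNamespace(py2rpy=lambda obj: ("to_r", obj),
                          rpy2py=lambda obj: ("to_py", obj))

to_r, to_py = pick_converters(old_api)
```

In a real session one would pass the imported `pandas2ri` module instead of a stand-in. Newer rpy2 documentation also leans toward explicit conversion contexts rather than `pandas2ri.activate()`, so it is worth checking the docs for the installed version.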


2. Import Pandas dataframe containing the Drosophila transcript IDs into R

r_DM_tIDs = pandas2ri.py2ri(DM_tIDs)
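
Before handing the IDs to R, it can help to tidy them on the Python side, since stray whitespace, blanks, or duplicates only add noise to the query. The sketch below assumes the IDs are available as a plain iterable of strings and relies on the FlyBase convention that transcript IDs are `FBtr` followed by seven digits. The helper name `clean_transcript_ids` and the example IDs are made up for illustration and were not checked against FlyBase.

```python
import re

# FlyBase transcript IDs look like "FBtr" followed by seven digits.
FBTR_PATTERN = re.compile(r"FBtr\d{7}")


def clean_transcript_ids(raw_ids):
    """Strip whitespace, drop malformed or empty entries, de-duplicate in order."""
    seen = set()
    cleaned = []
    for raw in raw_ids:
        candidate = str(raw).strip()
        if FBTR_PATTERN.fullmatch(candidate) and candidate not in seen:
            seen.add(candidate)
            cleaned.append(candidate)
    return cleaned


# Illustrative input only: a duplicate, a blank, a trailing newline, and junk.
example_ids = [" FBtr0070000", "FBtr0070000", "", "FBtr0070001\n", "not_an_id"]
tidy_ids = clean_transcript_ids(example_ids)
```

A cleaned list like `tidy_ids` could then be wrapped as `StrVector(tidy_ids)` and passed as `values` in the query, instead of converting the whole dataframe.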


3. Biomart analysis

R.library("biomaRt")

mart = R.useMart(biomart="ensembl", dataset="dmelanogaster_gene_ensembl", host="www.ensembl.org")

DM_BM = R.getBM(attributes = StrVector(("flybase_transcript_id", "ensembl_gene_id", "external_gene_name", "transcript_biotype", "gene_biotype")),
             filters = "flybase_transcript_id",
             values = r_DM_tIDs,
             mart = mart)

Note: It was super helpful to figure out that the StrVector object was needed to convert the list of attributes into a string vector usable in R, thanks to this StackOverflow post.


4. Export result back into a Pandas dataframe

DM_BM_py = pandas2ri.ri2py(DM_BM)
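
One thing worth checking once the result is back in Python: getBM returns rows only for filter values it can match, so transcript IDs with no hit simply drop out of the table. A minimal sketch of such a check on plain Python lists follows. The helper name `unmatched_ids` and the example IDs are hypothetical.

```python
def unmatched_ids(queried_ids, returned_ids):
    """Return queried IDs absent from the result, preserving query order."""
    returned = set(returned_ids)
    return [tid for tid in queried_ids if tid not in returned]


# Illustrative values: one queried ID is missing from the returned table,
# and one returned ID appears twice (one transcript can yield several rows).
queried = ["FBtr0070000", "FBtr0070001", "FBtr0070002"]
returned = ["FBtr0070000", "FBtr0070002", "FBtr0070002"]
missing = unmatched_ids(queried, returned)
```

With the dataframe from this step, `returned_ids` would presumably be the `flybase_transcript_id` column of `DM_BM_py`, since getBM names its columns after the requested attributes.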


This is a super simple example of using rpy2 to integrate Python and R, but one that I have not seen posted anywhere. Since Biomart is an important part of many bioinformatics workflows, I think it could be useful to see it worked out in at least one way.

Here are two more in-depth tutorials on using rpy2 (not specific to bioinformatics), which I plan to study: