In Bioinformatics, all tasks can be done using one of 2 programming languages:
Python is a high-level programming language that is known for its “Easy to read and learn” format. Python can do tasks from automation, to data science and machine learning.
On the other hand, R is a programming language that was made for statistics and data processing.
Through-out the past few years, R was adopted by the bioinformatics community as the number one programming language for the release of new packages, partially because of Bioconductor (a collection of mature libraries for next-generation sequencing analysis) and the ggplot2 library for advanced plotting.
But from personal experience with Bioinformatics, most of my data wrangling and manipulation is done in python and endpoint analysis, and plotting is done in R. But R can do almost anything that Python can in terms of statistics and even more. My only problem with R is the sometimes — unintuitive syntax (<-, %>%, variable$attribute), but there is a way around this using the rpy2 library in python.
In this post, I want to try to show an example of using this library for bioinformatics.
For this demonstration, we are going to use data from the 1000 genome project. To download the data for the project use the wget command below:
Linux:
# Get the data
!wget https://raw.githubusercontent.com/VarunSendilraj/Bioinformatics/main/rpy2_tutorial/1000-genomes_other_sample_info_sample_info.csv
If you are on windows or mac, the wget command won't work so use the urlib library instead :
Windows/Mac:
import urllib.request
url = 'https://raw.githubusercontent.com/VarunSendilraj/Bioinformatics/main/rpy2_tutorial/1000-genomes_other_sample_info_sample_info.csv'
filename = 'genomes_other_sample_info_sample_info.csv'
urllib.request.urlretrieve(url, filename)
Let’s start by importing the necessary libraries to open and convert the python DataFrame to an R DataFrame:
#Import Libraries
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
Now let's open the file that we have downloaded using pandas read_csv function:
df = pd.read_csv('1000-genomes_other_sample_info_sample_info.csv')
df.head()
Using the .head() method we can view the data of the first few rows. Lets Double check that this is a pandas DataFrame before we continue on:
print(type(df))
#expected output: <class 'pandas.core.frame.DataFrame'>
The conversion of the pandas DataFrame to R DataFrame has trouble encoding float and integer values so let’s start by converting the entire DataFrame into a string.
#converts values in df into string
df = df.applymap(str)
# Double check that values have changed
print(type(df['In_Final_Phase_Variant_Calling'][0]))
#Expected output: <class 'str'>
Next, let’s convert the Pandas DataFrame to an R DataFrame using localcoverter from the rpy2 library:
#conversion
with localconverter(ro.default_converter + pandas2ri.converter):
r_df = ro.conversion.py2rpy(df)
#Check if conversion Worked
print(type(r_df))
#Expected output: rpy2.robjects.vectors.DataFrame
Before we move forward, let’s get a better understanding of our data:
print(f'This dataframe has {r_df.ncol} columns and {r_df.nrow} rows\n')
print(r_df.colnames)
Here we are just printing out the different columns in our DataFrame. The output should look like this:
This dataframe has 62 columns and 3500 rows
[1] "Sample" "Family_ID"
[3] "Population" "Population_Description"
[5] "Gender" "Relationship"
[7] "Unexpected_Parent_Child" "Non_Paternity"
[9] "Siblings" "Grandparents"
[11] "Avuncular" "Half_Siblings"
[13] "Unknown_Second_Order" "Third_Order"
[15] "In_Low_Coverage_Pilot" "LC_Pilot_Platforms"
[17] "LC_Pilot_Centers" "In_High_Coverage_Pilot"
[19] "HC_Pilot_Platforms" "HC_Pilot_Centers"
[21] "In_Exon_Targetted_Pilot" "ET_Pilot_Platforms"
[23] "ET_Pilot_Centers" "Has_Sequence_in_Phase1"
[25] "Phase1_LC_Platform" "Phase1_LC_Centers"
[27] "Phase1_E_Platform" "Phase1_E_Centers"
[29] "In_Phase1_Integrated_Variant_Set" "Has_Phase1_chrY_SNPS"
[31] "Has_phase1_chrY_Deletions" "Has_phase1_chrMT_SNPs"
[33] "Main_project_LC_Centers" "Main_project_LC_platform"
[35] "Total_LC_Sequence" "LC_Non_Duplicated_Aligned_Coverage"
[37] "Main_Project_E_Centers" "Main_Project_E_Platform"
[39] "Total_Exome_Sequence" "X_Targets_Covered_to_20x_or_greater"
[41] "VerifyBam_E_Omni_Free" "VerifyBam_E_Affy_Free"
[43] "VerifyBam_E_Omni_Chip" "VerifyBam_E_Affy_Chip"
[45] "VerifyBam_LC_Omni_Free" "VerifyBam_LC_Affy_Free"
[47] "VerifyBam_LC_Omni_Chip" "VerifyBam_LC_Affy_Chip"
[49] "LC_Indel_Ratio" "E_Indel_Ratio"
[51] "LC_Passed_QC" "E_Passed_QC"
[53] "In_Final_Phase_Variant_Calling" "Has_Omni_Genotypes"
[55] "Has_Axiom_Genotypes" "Has_Affy_6_0_Genotypes"
[57] "Has_Exome_LOF_Genotypes" "EBV_Coverage"
[59] "DNA_Source_from_Coriell" "Has_Sequence_from_Blood_in_Index"
[61] "Super_Population" "Super_Population_Description"
Now, we need to perform some data cleanup. For example, some columns should be interpreted as Integers or Floats, but they are read as strings:
Let’s start by defining the as_numeric and match functions:
#define the as_numeric function
as_numeric = ro.r('as.numeric')
#define the Match function
match = ro.r.match
Functions:
Since there are a lot of columns that need to be converted into numbers, let's store those column names in a list and iterate through it:
#columns
numCol = ['Has_Sequence_in_Phase1','In_Phase1_Integrated_Variant_Set','Has_Phase1_chrY_SNPS','Has_phase1_chrY_Deletions','Has_phase1_chrMT_SNPs','LC_Passed_QC','E_Passed_QC','In_Final_Phase_Variant_Calling','Has_Omni_Genotypes', 'Has_Axiom_Genotypes','Has_Affy_6_0_Genotypes','Has_Exome_LOF_Genotypes', 'Has_Sequence_from_Blood_in_Index', 'Total_LC_Sequence', 'LC_Non_Duplicated_Aligned_Coverage','Total_Exome_Sequence', 'X_Targets_Covered_to_20x_or_greater', 'VerifyBam_E_Omni_Free', 'VerifyBam_E_Affy_Free', 'VerifyBam_E_Omni_Chip', 'VerifyBam_E_Affy_Chip', 'VerifyBam_LC_Omni_Free', 'VerifyBam_LC_Affy_Free','VerifyBam_LC_Omni_Chip','VerifyBam_LC_Affy_Chip', 'LC_Indel_Ratio', 'E_Indel_Ratio', 'EBV_Coverage']
#loop
for col in numCol:
my_col = match(col, r_df.colnames)[0] #returned as a vector
print('Type of read count before as.numeric: %s' % r_df[my_col - 1].rclass[0])
r_df[my_col - 1] = as_numeric(r_df[my_col - 1])
print('Type of read count after as.numeric: %s' % r_df[my_col - 1].rclass[0])
The results should look like this (make sure all columns are converted to numbers):
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
...
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Type of read count before as.numeric: character
Type of read count after as.numeric: numeric
Now let's use ggplot2 from the rpy2 library to plot our data.
Let's start by making a bar graph plotting the count of people per country that participated in the 1000 genome project:
#import ggplot2
import rpy2.robjects.lib.ggplot2 as ggplot2
from rpy2.robjects.functions import SignatureTranslatedFunction
#set theme
ggplot2.theme = SignatureTranslatedFunction(ggplot2.theme, init_prm_translate = {'axis_text_x': 'axis.text.x'})
#plot
bar = ggplot2.ggplot(r_df) + ggplot2.geom_bar() + ggplot2.aes_string(x='Population') + ggplot2.theme(axis_text_x=ggplot2.element_text(angle=90, hjust=1))
#save to img
ro.r.png('out.png', type='cairo-png')
bar.plot()
dev_off = ro.r('dev.off')
dev_off()
This may look a little bit intimidating but its mainly just boilerplate Code:
bar = ggplot2.ggplot(seq_data) + ggplot2.geom_bar() + ggplot2.aes_string(x=’CENTER_NAME’) + ggplot2.theme(axis_text_x=ggplot2.element_text(angle=90, hjust=1))
variableName = ggplot2.ggplot(**DATAFRAME**) + ggplot.**GRAPH_TYPE** + ggplot2.aes_string(x=’**X-AXIS**’) + ggplot2.theme(**ADJUST THE THEME HOWEVER NEEDED**)
In the end, the graph will be stored in a pdf file, but you can also view the file in your Jupyter Notebook:
from IPython.display import Image
Image(filename='out.png')
Graph:
Similarly, let's create a scatterplot to compare the Total_LC_Sequence and the LC_Non_Duplicated_Aligned_Coverage with relation to the population and gender:
#plot
pp = ggplot2.ggplot(r_df) + ggplot2.aes_string(x='Total_LC_Sequence', y='LC_Non_Duplicated_Aligned_Coverage', col='Population', shape='Gender') + ggplot2.geom_point()
#save img
ro.r.png('scatter.png', type='cairo-png')
pp.plot()
dev_off = ro.r('dev.off')
dev_off()
#veiw img
Image(filename='scatter.png')
Graph:
We can also plot a Box Plot using the same data above:
#plot
bp = ggplot2.ggplot(r_df) + ggplot2.aes_string(x='Total_LC_Sequence', y='LC_Non_Duplicated_Aligned_Coverage', fill='Population') + ggplot2.geom_boxplot()
#save img
ro.r.png('box.png', type='cairo-png')
bp.plot()
dev_off = ro.r('dev.off')
dev_off()
#veiw img
Image(filename='box.png')
Graph:
Completed Jupyter Notebook:
Bioinformatics/rpy2.ipynb at main · VarunSendilraj/Bioinformatics (github.com)
Also published at https://varunsendilraj.medium.com/interfacing-r-using-python-for-bioinformatics-9387c17344bd