Python vs R: The Great Data Science Debate

Python and R are popular in data science. Which is best?

Ben Hayes · 14 min read


Choosing a Dataset

To resolve a subjective question (one that will remain subjective even after this post), I will use a concrete example with the famous wine quality data set. The data can be found at https://archive.ics.uci.edu/ml/datasets/Wine+Quality.

Defining a Problem

After selecting our data set, we’ll look at answering a handful of common questions that apply to exploring and analyzing any data set. The wine quality data set is popular and one of a handful of data sets that beginners use (others include Titanic, iris, and cars). We will explore this data set in both languages and attempt to predict whether a wine is red or white.

1. External Libraries

It may seem strange to begin a battle between two programming languages by focusing on their go-to external libraries, but these languages are high-level and come equipped with a broad and deep arsenal of extensions. The first step in applying data science in Python or R is often loading libraries (after you’ve thought about the problem and its context and formed hypotheses).

Note: Each programming language can be augmented with hundreds of packages. A small subset of packages is discussed below. The list is not intended to be exhaustive.

Python

import scipy as scpy # For the SciPy stack
import numpy as np # For multi-dimensional arrays and matrices
import pandas as pd # For series and dataframe structures
from matplotlib import pyplot as plt # For plotting
import seaborn as sns # For statistical plotting
import bs4 # For HTML parsing (Beautiful Soup 4)
import selenium # For web-scraping
import sklearn # For machine learning
import spacy # For natural language processing
import nltk # Also for natural language processing
import gensim # For topic modeling
import pyspark # For using the Python API for Spark
import keras # For deep learning (sits on top of TensorFlow)
import torch # For deep learning with PyTorch
  • Python comes equipped with a similar stack to R's tidyverse. Many of the features of the tidyverse are available in the pandas library.
  • Plotting is not as intuitive as using ggplot2 but as I discuss below, similar plots can be achieved using matplotlib and seaborn.
  • Python deep learning is made simple with pytorch and keras.

R

library(feather) # For working with faster-than-csv feather files
library(plyr) # Predecessor of dplyr; load before dplyr to avoid masking
library(readr) # For reading a variety of file types; (tidyverse)
library(dplyr) # For data grouping, filtering, etc.; (tidyverse)
library(tidyr) # For reshaping data; (tidyverse)
library(purrr) # For mapping functions on nested data; (tidyverse)
library(DT) # For outputting pretty tables
library(ggplot2) # For plotting graphs, charts
library(ggmap) # For plotting maps
library(leaflet) # For plotting interactive maps
library(plotly) # For enhancing plots
library(rpart) # For decision trees
library(glmnet) # For cross-validation, regularization
library(neuralnet) # For deep learning
library(keras) # For deep learning (interfaces python)
library(rnn) # For recurrent neural networks
library(sparklyr) # For interfacing R with Apache Spark
  • The tidyverse is associated with Hadley Wickham, who has helped R cement its place in this debate. The collection simplifies data manipulation and exploratory data analysis while cooperating well with other packages (plotting, modeling, etc.).
  • For plotting, ggplot2 provides flexible, easily customizable plots with intuitive commands. Using 'geoms' simplifies plotting and quickly becomes second nature.
  • For deep learning, keras offers an interface to the Python version.

Result: Draw

Using either R or Python will require familiarity with its packages, which takes time. Each package may have slightly different naming conventions, notation, etc., but both communities provide support to help those starting out. While R’s dataframe type is native, in most cases you will end up using tibbles, which require loading the tidyverse; that requirement is no different from having to import numpy and pandas in Python. Neither language separates itself in this category, but we’ll continue to see new libraries in the rounds ahead.


2. Exploratory Data Analysis

Exploratory data analysis (EDA) is perhaps the most important step in the data science process. At this juncture, you can develop a deeper understanding of the data set, refine your questions and hypotheses, and identify potential shortcuts that will save you time. How do your features affect your outcome variable? Are your features related? Can you reduce the dimensionality of your data?

Python

import pandas as pd
import numpy as np
import seaborn as sns

# Use pandas to read in the CSV data, specifying separator as ';'
reds = pd.read_csv('./data/winequality-red.csv', sep=";")
whites = pd.read_csv('./data/winequality-white.csv', sep=";")

# Print the size and shapes of the dataframes
reds.ndim # 2 dimensions (row x column)
reds.shape # 1599 rows, 12 columns
whites.ndim # 2 dimensions
whites.shape # 4898 rows, 12 columns

R

library(tidyverse)
library(ggplot2)


# Use readr:: to read in the CSV data, specifying separator as ';'
reds <- read_delim('./data/winequality-red.csv', delim=';')
whites <- read_delim('./data/winequality-white.csv', delim=';')

# Print the size and shapes of the dataframes
dim(reds)
dim(whites)
# In R, 2-dimensional data is standard so printing
# number of dimensions is unnecessary

Python

# Print the first 5 rows of the dataframes
reds.head()
whites.head()

# Print summary statistics for the dataframes
reds.describe()
whites.describe()

R

# Print the first 6 rows of the dataframes (R's head() defaults to 6)
head(reds)
head(whites)

# Print summary statistics for the dataframes
summary(reds)
summary(whites)

Python

# Generate pairs plot to see how our features are related
rpp = sns.pairplot(reds)
wpp = sns.pairplot(whites)
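With a dozen features, pair plots get dense quickly. As a compact companion for the "are your features related?" question, here is a minimal seaborn sketch of a correlation heatmap (assuming the reds frame loaded above):

from matplotlib import pyplot as plt

# Pairwise correlations of the red wine features, as an annotated heatmap
sns.heatmap(reds.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Red Wine Feature Correlations")
plt.show()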

R

# Generate pairs plot to see how features are related
pairs(reds)
pairs(whites)

Python

# Generate new column before joining to specify red vs white
reds['type'] = 'red'
whites['type'] = 'white'

# Combine the dataframes using union-like method: concat()
wines = pd.concat([reds, whites])

# Check the data after adding the column and combining
wines.sample(10)

R

# Generate new column before joining to specify red vs white
reds <- reds %>% mutate(type="red")
whites <- whites %>% mutate(type="white")

# Combine the dataframes using the union-like function rbind()
wines <- rbind(reds, whites)

# Check the data after adding the column and combining
wines %>% sample_n(10)
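One question posed at the start of this section was whether the dimensionality of the data can be reduced. As a minimal sketch of one way to check (shown in Python with sklearn, assuming the combined wines frame built above):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the numeric features, then fit a 3-component PCA
X = StandardScaler().fit_transform(wines.drop('type', axis=1))
pca = PCA(n_components=3).fit(X)

# Share of total variance captured by each component
print(pca.explained_variance_ratio_)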

Result: Winner R

Here, R provides the advantage because of the pipe (%>%) operator and the ability to filter, select, mutate, and group data. Python offers the same features, but the methods required are not as well integrated. Using the core dplyr verbs with the pipe (%>%) feels almost like writing sentences. Additionally, the R shiny framework provides a fast way to develop interactive data visualizations; unfortunately, the analogous Python tool, Bokeh, does not provide as much support for a beginner.
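To make the comparison concrete, here is a minimal pandas sketch of what a dplyr-style pipeline looks like as method chaining (the threshold and summary are for illustration only):

# Roughly: wines %>% filter(alcohol > 10) %>%
#            group_by(type) %>% summarise(avg_quality = mean(quality))
avg_quality = (wines
               .query('alcohol > 10')       # filter()
               .groupby('type')['quality']  # group_by() + select
               .mean())                     # summarise()
print(avg_quality)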


3. Data Manipulation

In this section, I evaluate each language’s ability to read and transform data. How easy is it to import data and to engineer new features?

Note: For readability, I avoid using nesting with purrr in R or with lists/dictionaries in Python. In both languages, working with objects of varying length is non-trivial but possible.
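As one small Python illustration of the varying-length case, pandas can un-nest list-valued columns with explode(); the data below is hypothetical:

import pandas as pd

# Hypothetical ragged data: each wine carries a varying number of notes
tastings = pd.DataFrame({
    'wine': ['A', 'B'],
    'notes': [['oak', 'cherry'], ['citrus']],
})

# explode() un-nests the list column into one row per element
print(tastings.explode('notes'))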

Python

# Add a new column to the data (example purposes only)
wines['new_feature'] = (wines['citric acid'] * 
                        wines['density'] - 
                        wines['sulphates'])

# Drop a column from the data (example purposes only)
wines = wines.drop('new_feature', axis = 1)

### Create train/test data
# Shuffle the data and reset the index
wines = wines.sample(frac = 1.0).reset_index(drop=True)

# Split the data into halves; the first half is the train set
half = len(wines) // 2 # Integer division avoids float-label indexing errors
wines_train = wines.iloc[:half]

# Assign remaining rows as test set
wines_test = wines.iloc[half:]

# Verify sizes of the train/test sets
wines_train.shape
wines_test.shape

R

# Add a new column to the data (example purposes only)
wines <- wines %>% mutate(
    new_feature = `citric acid` * density - sulphates
  )

# Drop a column from the data (example purposes only)
wines <- wines %>% select(-new_feature)

### Create train/test data
# Shuffle the data and reset the index
wines <- wines[sample(1:nrow(wines)),]

# Split the data into halves, first half is train set
wines_train <- wines[1:(nrow(wines)/2),]

# Assign remaining rows as test set
wines_test <- wines[-(1:(nrow(wines)/2)),]
  
# Verify sizes of the train/test sets
print(paste0("Train Set: ", dim(wines_train)))
print(paste0("Test Set: ", dim(wines_test)))

Result: Winner Python

While Python can at times be verbose and difficult to read due to nested boolean masks, its support for method chaining mirrors piping in R (although piping in R can involve multiple pipe operators, each with unique features). The ability to specify axes in pandas gives you finer control over your data manipulation. This round is very close but goes to Python: its support for simple transformations via list comprehensions, lambda expressions, and the like is too powerful to ignore.
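For reference, the manual split above collapses to a single call in sklearn, and a lambda shows the kind of lightweight transformation just mentioned (the quality threshold of 7 is arbitrary, for illustration):

from sklearn.model_selection import train_test_split

# One-call alternative to the manual shuffle-and-slice above
wines_train, wines_test = train_test_split(wines, test_size=0.5)

# A lambda-driven transformation: bucket quality into coarse labels
wines['quality_label'] = wines['quality'].apply(
    lambda q: 'high' if q >= 7 else 'low')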

Note: In R, the dplyr library can be extended with dbplyr which translates your piped statements into SQL and optimizes statement calls to improve performance. From personal experience, dbplyr works well enough but does not support the full lexicon of SQL statements.


4. Data Visualization

Data visualization is a critical part of exploratory data analysis, but it’s also important for communicating results. Don’t think of data visualization as a distinct step; it’s one part of an iterative process. Python and R both provide visualization packages, but this is an area where R really shines. Here, ggplot2 and plotly are reviewed for R; matplotlib, seaborn, and joypy are reviewed for Python.

Python

import joypy # For ridge (joy) plots
from matplotlib import pyplot as plt # For figure titles and display

# Create FacetGrid for multiple subplots
g = sns.FacetGrid(wines_train, col="type", hue="quality", 
                  palette="Set1", height=6)

# Map plot settings to the subplots
g = (g.map(plt.scatter, 
           "free sulfur dioxide", 
           "total sulfur dioxide").add_legend())

# Create ridge plot for red wines
f, a = (joypy.joyplot(wines_train[wines_train.type == "red"], 
                      by="quality", column="alcohol", 
                      figsize=(5,8)))
plt.title("Red Wines (Quality vs Alcohol)")
plt.xlabel("Alcohol")
plt.show()

# Create ridge plot for white wines
f, a2 = (joypy.joyplot(wines_train[wines_train.type == "white"], 
                       by="quality", column="alcohol",
                       figsize=(5,8)))
plt.title("White Wines (Quality vs Alcohol)")
plt.xlabel("Alcohol")
plt.show()

R

library(ggridges) # For geom_density_ridges()

# Plot free sulfur dioxide vs total sulfur dioxide
ggplot(data = wines_train, aes(x = `free sulfur dioxide`, 
                               y = `total sulfur dioxide`, 
                               color = as.factor(quality))) +
  geom_point(size = 2, alpha = 0.2) +
  facet_wrap(~type) +
  labs(title = "Free vs Total Sulfur Dioxide", 
       x = "Free Sulfur Dioxide", 
       y = "Total Sulfur Dioxide")

# Plot quality vs alcohol as a ridge plot
ggplot(data = wines_train, aes(x = alcohol, 
                               y = as.factor(quality),
                               fill = as.factor(quality))) + 
  geom_density_ridges() +
  facet_wrap(~type) +
  labs(title = 'Wine Quality vs Alcohol', 
       x = "Alcohol", 
       y = "Quality")

Result: Winner R

The simplicity of ggplot2 gives R the edge in this case. While matplotlib and seaborn provide functional alternatives in Python, the ease of use, abundance of online examples, and quality of documentation favor R. Additionally, the ability to convert most ggplot2 plots into interactive plots by wrapping a single function call (plotly::ggplotly()) around the plot is astonishing. Notice how the ggplot2 examples look better and were generated with less code.

Note: The data visualization options in both Python and R are extensive and comprehensive. In either language you will be able to produce insightful, eye-opening, and even beautiful plots. If you are starting out, ggplot2 will be more intuitive, but the skill unfortunately is not directly transferable to Python (there are ggplot implementations in Python, but you’d be better off in the long run learning seaborn).
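For the curious, plotnine is one of the ggplot implementations mentioned above; a minimal sketch (assuming the wines_train frame from earlier, with illustrative aesthetics):

from plotnine import ggplot, aes, geom_point, labs

# ggplot2-style grammar of graphics, in Python
(ggplot(wines_train, aes(x='alcohol', y='density', color='type'))
 + geom_point(alpha=0.2)
 + labs(title='Density vs Alcohol'))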


5. Modeling/Machine Learning

Now that your data is imported, cleaned, explored, transformed, and feature-engineered, you want to figure out how to model it and then predict, classify, or forecast. Both R and Python provide tools for this step.

Python

# Prepare data for machine learning
wines_train_y = wines_train['type'] # A 1-D target avoids sklearn shape warnings
wines_train_X = wines_train.drop('type', axis=1)

## Logistic Regression
from sklearn.linear_model import LogisticRegression

# Fit and predict
logreg = LogisticRegression().fit(wines_train_X, wines_train_y)
logreg_predictions = logreg.predict(wines_test.drop('type', axis=1))
logreg.score(wines_test.drop('type', axis=1), wines_test['type'])

## Random Forest
from sklearn.ensemble import RandomForestClassifier

# Fit and predict
rf = RandomForestClassifier().fit(wines_train_X, wines_train_y)
rf_predictions = rf.predict(wines_test.drop('type', axis=1))
rf.score(wines_test.drop('type', axis=1), wines_test['type'])

## AdaBoost
from sklearn.ensemble import AdaBoostClassifier

# Fit and predict
abf = AdaBoostClassifier().fit(wines_train_X, wines_train_y)
abf_predictions = abf.predict(wines_test.drop('type', axis=1))
abf.score(wines_test.drop('type', axis=1), wines_test['type'])

## Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier

# Fit and predict
gbm = GradientBoostingClassifier().fit(wines_train_X, wines_train_y)
gbm_predictions = gbm.predict(wines_test.drop('type', axis=1))
gbm.score(wines_test.drop('type', axis=1), wines_test['type'])
  • Python's sklearn is an excellent tool for working with machine learning algorithms.
  • Relying on one package makes learning easier and more consistent.
  • Pythonic, pandas-friendly coding conventions make working with sklearn fun!

R

## Logistic Regression

# Fit using base R's stats::glm() and generate predictions
logreg <- glm(data = wines_train, 
              type == 'white' ~ .,
              family = binomial("logit"))
logreg_predictions <- predict(logreg, wines_test, type = "response")

## Random Forest
library(randomForest)

# Fit using randomForest::randomForest() and generate predictions
wines_train_rf <- wines_train %>%
  mutate(
    `total sulfur dioxide` = ifelse(is.na(`total sulfur dioxide`), 
                                    25, 
                                    `total sulfur dioxide`))

rf <- randomForest(type == 'white' ~ ., data = wines_train_rf)
rf_predictions <- predict(rf, wines_test)

## Adaboost
library(ada)

# Fit using ada::ada() and generate predictions
abf <- ada(type == 'white' ~ ., data = wines_train)
abf_predictions <- predict(abf, wines_test, type = "probs")

## Gradient Boosted Tree
library(gbm)

# Fit gbm::gbm() and generate predictions
gb <- gbm(type == 'white' ~ ., data = wines_train)
gb_predictions <- predict(gb, wines_test, n.trees = gb$n.trees)
  • R provides a good range of functionality for classification tasks.
  • The packages, however, are not as uniform as Python's sklearn.
  • Not shown here, but I have found time-series data easier to work with in R.

Result: Winner Python

Where R impressed with the ease of use and seamless integration of ggplot2 for plotting, Python wins this round because of the consistency of sklearn. Instead of importing a separate library for each model (each with its own syntax, method parameters, etc.), sklearn provides all four example classifiers: LogisticRegression, RandomForestClassifier, AdaBoostClassifier, and GradientBoostingClassifier. This code is not intended to demonstrate all of the functionality of either language, but to give an impression of using both. Not shown in this post are neural networks (stay tuned for an upcoming post on deep learning with torch #pytorch). Python further separates itself with its collection of deep learning packages.
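That consistency is easy to demonstrate: because every sklearn estimator exposes the same fit()/score() interface, the four classifiers above can be compared in a single loop. A minimal sketch, reusing the frames defined earlier:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)

X_test = wines_test.drop('type', axis=1)
y_test = wines_test['type']

# Every estimator shares the same fit()/score() contract
for model in [LogisticRegression(), RandomForestClassifier(),
              AdaBoostClassifier(), GradientBoostingClassifier()]:
    model.fit(wines_train_X, wines_train_y)
    print(type(model).__name__, model.score(X_test, y_test))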


6. Support/Documentation

In this section, instead of looking at code examples, we will focus on availability of resources for when you start out or get stuck.

Searching for Answers

Python

  • A search for 'pandas add a new column' returns over 1M results.
  • Video tutorials may be more common for Python but I cannot vouch for their depth.

R

  • A search for 'dplyr add a new column' returns over 100K results.
  • The search results for R (at least those on Stack Overflow) are fewer in number but tend to be more recent than those for Python.

Both Python and R have a considerable amount of reference material online: tutorials, documentation, videos, etc. These are valuable resources for newcomers and seasoned veterans alike. While Python has more search results, keep in mind that quantity does not always mean quality. Both languages have large bodies of information and answers on websites like Stack Overflow.

Digging through Documentation

Python

  • The pandas documentation is thorough and helpful when stuck on a coding problem.
  • Generally, the documentation for data science related Python libraries is highly detailed and available.

R

  • The dplyr documentation is robust and well managed.
  • Generally, the documentation for critical R packages is highly detailed and available.

One noticeable advantage of R is that the format and style of its documentation are more consistent. This simple design choice makes searching for an answer easier because you know what to expect (package, function/method, arguments). Python documentation is often maintained in a less centralized manner, with each library's owner organizing, formatting, and styling it.

Evaluating the Communities

Most Loved Languages

  • Stack Overflow indicates that as of 2018, Python is a more loved language than R.
  • R is loved by less than half of its users.

Salary by Language

  • R and Python both appear near the bottom of the top languages by salary.
  • There is only a small $2K gap in average salary between Python and R users. It's unknown how this gap changes if limited to data scientists.

GitHub

  • Python is one of the top 10 fastest-growing languages according to GitHub.
  • R does not make the top 10, and it is unclear where R falls on the list.

Kaggle

  • According to Kaggle, over the course of 2016, Python-based kernels quadrupled while R-based kernels remained constant.
  • Kaggle has not published comparable data since 2016, but it's clear that Kaggle is dominated by Python users as XGBoost, LightGBM, and neural networks become more popular.

As communities, Python and R both have breadth and depth of knowledge. Python may gain an advantage in the future as it continues to outgrow R, but for now it's still a two-horse race. Only on Kaggle do you really find a large disparity between Python and R users.

Result: Draw

Both languages have strong community support and active presences on GitHub, Stack Overflow, and other coding hotspots. If you are entirely new to programming, R may be easier to start with based on the documentation (although I have found the R help docs can be cryptic) and the lack of ‘object-oriented’ style programming (see the function-vs-method contrast below). If you are familiar with programming, Python will be more intuitive: you will not have to learn the tidyverse/magrittr pipe syntax, and you already know how to look up methods and attributes in documentation. Neither language earns a victory in this round.
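The function-vs-method contrast in a nutshell (the R calls appear as comments for comparison):

# R is function-first: head(wines), summary(wines)
# Python is object-first: the dataframe owns its methods
wines.head()
wines.describe()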


Conclusion

Result: It's a draw!

After all of that, it ends in a tie! Both languages have advantages, but also enough flaws that preference becomes the primary separator. As the field becomes more entranced with neural networks and deep learning, Python may develop an edge, as its community seems to include a higher proportion of computer scientists.