Similarities between R and Python for data analysis

Scott Spencer (https://ssp3nc3r.github.io), Columbia University (https://sps.columbia.edu/faculty/scott-spencer)
2020 February 03

R and Python are both useful languages for data science and can be similarly applied in a data analysis. Here, without explaining either language per se, we demonstrate similarities between them for importing, transforming, and visualizing a small, example dataset.

1 A generic, data-analysis workflow

In a typical data analysis, we iteratively import data, clean or tidy it, transform its measures into new variables, model and visualize the data, and finally communicate our findings. Grolemund and Wickham (2016) illustrate this generic workflow in simplified form: import, tidy, then cycle through transform, visualize, and model, and finally communicate.

In this paper, we focus on similarities between R and Python for importing, transforming, and visualizing data, shown through a few examples.

2 Data used for this tutorial

The diamonds dataset we’ll be working with [1] has 53,940 observations of 10 measures. The first few rows of observed measures follow:

Table 1: diamonds dataset, first few observations.
carat  cut        color  clarity  depth  table  price  x     y     z
0.23   Ideal      E      SI2      61.5   55     326    3.95  3.98  2.43
0.21   Premium    E      SI1      59.8   61     326    3.89  3.84  2.31
0.23   Good       E      VS1      56.9   65     327    4.05  4.07  2.31
0.29   Premium    I      VS2      62.4   58     334    4.20  4.23  2.63
0.31   Good       J      SI2      63.3   58     335    4.34  4.35  2.75
0.24   Very Good  J      VVS2     62.8   57     336    3.94  3.96  2.48

where the variables are defined as:

3 Software packages in R and Python

Today, most data analysts and scientists do not use only base R or base Python in their work. Instead, they commonly rely on software packages that extend their chosen language. In R, several broadly useful packages have been collected into an overall package called tidyverse [2]. Loading this package, in turn, loads individual packages including readr, dplyr, purrr, tibble, tidyr, stringr, forcats, and ggplot2 [3]. readr provides functions for importing data. dplyr, purrr, and tidyr provide numerous functions for transforming, summarizing, and merging data (among other things). And ggplot2 provides functions for drawing graphics from data, theoretically grounded in the grammar of graphics (hence, the gg in ggplot2), an important concept we’ll use later.

You may load whatever you need from individual packages as you do not usually need them all in a given analysis. The R package ecosystem, of course, includes many additional packages that can be very useful, depending on your intended analysis.

As in R, several Python packages are widely used in data analyses. These include numpy, pandas [4], matplotlib, seaborn, plotnine [5], and many more. numpy (Numerical Python) provides functions for working with n-dimensional arrays: element-wise operations between them, reading and writing them, linear algebra operations, and integrating code written in other languages like C++ or Fortran. pandas provides functions for creating and manipulating data structures, including the important DataFrame, a two-dimensional, tabular structure of rows and columns that you might think of as similar to an Excel or Google worksheet. In R, by contrast, data frames (data.frame) are native to the language. plotnine is intended to imitate R’s ggplot2. matplotlib may currently be more popular in Python (it’s older), but unlike plotnine, it lacks a theoretical basis in the grammar of graphics and has a steeper learning curve.

4 R and Python similarities for import, tidy, transform, and visualize

4.1 Loading software for analysis

For this tutorial, we’ll just load tidyverse in R, like so:

library(tidyverse)

And in a Python environment, we can similarly load its packages, relevant for our tutorial:

import pandas as pd
from plotnine import *

where import is a command that loads a package, and as ___ makes that package available under our chosen name or abbreviation (e.g., pd). Of note, from ___ import ___ makes the chosen functions and other objects (or everything, if *) directly available without a prefix.
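As a minimal sketch of these import forms, the following uses Python's standard-library math module. That module is chosen here only for illustration (it is an assumption of this sketch, not part of the tutorial's workflow), so the example runs without installing anything.

```python
# Three import styles, shown with the standard-library math module
# (an illustrative stand-in for pandas and plotnine).
import math              # names are reached through the module: math.sqrt
import math as m         # the same module under a shorter alias: m.sqrt
from math import sqrt    # one name brought directly into scope: sqrt

root_via_module = math.sqrt(16)
root_via_alias = m.sqrt(16)
root_direct = sqrt(16)

# All three calls reach the same function, so the results agree.
print(root_via_module, root_via_alias, root_direct)
```

The alias form keeps the origin of each function visible (pd.read_csv), while the star form used for plotnine trades that visibility for shorter plotting code.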

4.2 Importing data

Both R and Python have many ways to import data for our analyses. Because csv files are a common way to store data, we show a relevant import function in each language.

In R, one way we can load the csv data is like this:

df_r <- read_csv("data/diamonds.csv")

Or in Python,

df_py = pd.read_csv("data/diamonds.csv")

For either R or Python, this assumes, of course, that we have a csv file called diamonds.csv in the data subdirectory of our project directory. Notice in this case the functions (names and parameters) seem almost identical. In R, the function read_csv is directly available because we loaded the tidyverse above. Had we not loaded the package, we could still access the function through its package using double colons ::, like so: readr::read_csv.

And in Python the same function name (read_csv) is available within pd. In either case, we end up with an R data frame (more precisely, a tibble) or a Pandas DataFrame, respectively. Both have a rectangular structure where rows are observations and columns are variables, as shown above in Table 1.
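To try the Python import without a file on disk, one option is to hand read_csv an in-memory text buffer. The sketch below assumes only the first three rows of Table 1; the names csv_text and sample are ours, chosen for illustration.

```python
import io

import pandas as pd

# First three rows of Table 1 as CSV text, so the sketch needs no file on disk.
csv_text = """carat,cut,color,clarity,depth,table,price,x,y,z
0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65,327,4.05,4.07,2.31
"""

# read_csv accepts a path or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # column types inferred from the text
```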

Next, we demonstrate a few ways to modify this dataframe (df_r and df_py, respectively), showing the similarities between R and Python functions.

4.3 Tidying and transforming data

Data structures in both R and Python have equivalents or near equivalents. A few of these include:

R                       Python             Examples
Single-element vector   Scalar             1, 1L, TRUE, "foo"
Multi-element vector    List               c(1.0, 2.0, 3.0), c(1L, 2L, 3L)
List of multiple types  Tuple              list(1L, TRUE, "foo")
Named list              Dict               list(a = 1L, b = 2.0), dict(x = x_data)
Matrix/Array            NumPy ndarray      matrix(c(1,2,3,4), nrow = 2, ncol = 2)
Data Frame              Pandas DataFrame   data.frame(x = c(1,2,3), y = c("a", "b", "c"))
Function                Python function    function(x) x + 1
NULL, TRUE, FALSE       None, True, False  NULL, TRUE, FALSE
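For readers coming from R, the Python side of several rows in this table might look like the following sketch. It uses only built-in types, the variable names are ours, and the R expressions in the comments are copied from the table.

```python
scalar = 1.0                    # R: single-element vector, e.g. 1
values = [1.0, 2.0, 3.0]        # R: c(1.0, 2.0, 3.0)
mixed = (1, True, "foo")        # R: list(1L, TRUE, "foo")
named = {"a": 1, "b": 2.0}      # R: list(a = 1L, b = 2.0)
nothing = None                  # R: NULL


def add_one(x):                 # R: function(x) x + 1
    return x + 1


print(type(values).__name__, type(mixed).__name__, type(named).__name__)
print(add_one(scalar))
```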

We can transform these data structures using functions. Some powerful functions are similar in R and Python. Here’s a short cross-reference between the languages for a few of those that operate on data frames [6]:

R / dplyr     Python / pandas  Comment
mutate        assign           creates or modifies a variable
select        filter           specifies variables to keep
rename        rename           renames variables
filter        query            specifies observations to keep
arrange       sort_values      specifies ordering of the observations
group_by      groupby          specifies grouping of observations
summarise     agg              specifies some summary of the data
pivot_longer  melt             convert from wide to long format
pivot_wider   pivot            convert from long to wide format

Notice the names of these functions are verbs, which describe what’s being done to something.
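The last two rows of the cross-reference can be sketched as a round trip in pandas. The small frame below is our own construction for illustration: an id column we add, plus x and z values taken from the first two rows of Table 1.

```python
import pandas as pd

# A tiny wide frame: one row per diamond, one column per dimension.
wide = pd.DataFrame({"id": [1, 2], "x": [3.95, 3.89], "z": [2.43, 2.31]})

# Wide to long (like pivot_longer in R): one row per (id, dimension) pair.
long = wide.melt(id_vars=["id"], var_name="dim", value_name="mm")

# Long back to wide (like pivot_wider in R): dimensions become columns again.
wide_again = long.pivot(index="id", columns="dim", values="mm").reset_index()

print(long)
print(wide_again)
```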

Before jumping into our comparisons, let’s talk briefly about the anatomy of a function. Functions let us group a series of (R or Python) program statements together to perform a specific task. We call a function by typing its name and supplying it with the information necessary to perform the task (by setting its parameters). The function then performs the task and, in most cases, returns something. Visually, these components typically look like:

returned_object = function_name(parameter1, parameter2)

where the actual number of parameters may range from none to many. The data analysis functions we use here accept a data frame as a parameter (along with, perhaps, other parameters). We supply our data in that structure; the function modifies the data frame in some way and returns the modified data frame so that we can store it for later use. Above, the modified data frame would be assigned to returned_object.

Now we can of course apply just one function to our data object. Let’s call our fictitious function f(parameter) where both parameter and the return value are data frames. So, we might do something like this:

final_object = f(data)

And if we need to apply several functions, say, f(parameter), g(parameter), and h(parameter) in that order, again where parameter and return values are data frames, we might nest them so that the return information of one function is supplied to a parameter of the next function, something like this:

final_object = h( g( f(data) ) )

Notice that we have to read this code inside-out for order of application! Alternatively, we might store each function’s return information, something like this:

temp1 = f(data)
temp2 = g(temp1)
final_object = h(temp2)

Now, at least, we can read the operations in order of our intended application, but we end up saving intermediate variables (e.g., temp1, temp2) that we may not care about. There’s generally a better way in both R and Python. In R, we’ll use something called a pipe operator: %>%. And in Python, we can use something called method chaining.
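To make the nested and stepwise styles above concrete, here is a small sketch in plain Python. The functions f, g, and h are stand-ins we define on a list of numbers rather than on a data frame, purely so the example stays self-contained.

```python
def f(values):
    """Keep only the positive values."""
    return [v for v in values if v > 0]


def g(values):
    """Double every value."""
    return [v * 2 for v in values]


def h(values):
    """Order the values from largest to smallest."""
    return sorted(values, reverse=True)


data = [3, -1, 2]

# Nested: read inside-out, f is applied first and h last.
nested = h(g(f(data)))

# Stepwise: reads top to bottom, but leaves temp1 and temp2 behind.
temp1 = f(data)
temp2 = g(temp1)
stepwise = h(temp2)

print(nested, stepwise)
```

Both forms compute the same result; they differ only in how easily a reader can follow the order of operations.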

4.3.1 R functions and the pipe operator

Using the pipe operator %>% in R, we can chain together operations in the natural order of application without saving intermediate objects. This pipe operator performs a very simple task. It takes the thing to its left, and supplies that thing as the first argument of the function to the right. Now, our generic example may look like this:

final_object <- 
  initial_object %>%
  f() %>%
  g() %>%
  h()
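Python has no built-in pipe operator, but the idea of feeding the left-hand value into the next function can be imitated in a few lines. The helper pipe below is hypothetical, written for this sketch only, and f, g, and h are again our own stand-ins operating on a list.

```python
from functools import reduce


def pipe(value, *functions):
    """Pass value through each function in turn, left to right."""
    return reduce(lambda result, fn: fn(result), functions, value)


def f(values):
    """Keep the even values."""
    return [v for v in values if v % 2 == 0]


def g(values):
    """Square every value."""
    return [v ** 2 for v in values]


def h(values):
    """Add the values up."""
    return sum(values)


initial_object = [1, 2, 3, 4]

# Reads in order of application, like initial_object %>% f() %>% g() %>% h() in R.
final_object = pipe(initial_object, f, g, h)

print(final_object)
```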

4.3.2 Python member functions, chained together

In Python, we can do something very similar. Instead of using a pipe operator between functions, we create an object, like a Pandas DataFrame. That object has member functions that can operate on the object. We can chain them together using the accessor dot . operator. With our above functions as member functions in Python, it may look like this:

final_object = (
  initial_object
    .f()
    .g()
    .h()
)
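Method chaining works whenever each method returns an object that itself has methods. The class Measurements below is a hypothetical toy, not part of pandas; it only illustrates the mechanism that a DataFrame relies on when we chain its methods.

```python
class Measurements:
    """A toy container whose methods each return a new Measurements."""

    def __init__(self, values):
        self.values = list(values)

    def keep_below(self, limit):
        return Measurements(v for v in self.values if v < limit)

    def scale(self, factor):
        return Measurements(v * factor for v in self.values)

    def largest_first(self):
        return Measurements(sorted(self.values, reverse=True))


initial_object = Measurements([2, 12, 3])

# Each call returns a new object, so the next method can be chained onto it.
final_object = (
    initial_object
    .keep_below(10)
    .scale(2)
    .largest_first()
)

print(final_object.values)
```

Because every method returns a new object rather than changing the old one, initial_object is left untouched by the chain.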

Now that we have some understanding of piping or chaining together functions that transform our data in some way, let’s see them in action on our example diamonds data.

4.3.3 Example one

For our first example, let’s pipe several operations together in R. For this example, we select a couple of variables from our data frame, filter or keep only the observations where the diamond color was measured as grade E, and display the first three observations of the final object:

df_r %>%
  select(carat, color) %>%
  filter(color == 'E') %>%
  head(3)
# A tibble: 3 × 2
  carat color
  <dbl> <chr>
1  0.23 E    
2  0.21 E    
3  0.23 E    

And do the same thing in Python using objects and method chaining:

(
  df_py
  .filter(['carat', 'color'])
  .query('color == "E"')
  .head(3)
)
   carat color
0   0.23     E
1   0.21     E
2   0.23     E

Nice — we have a one-to-one correspondence between the languages!
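One naming difference is worth a small check: in pandas, filter selects columns by label, while query selects rows. The sketch below rebuilds three columns of the six rows in Table 1 by hand (the name table1 is ours) and applies the same chain.

```python
import pandas as pd

# Three columns from the six rows shown in Table 1.
table1 = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31, 0.24],
    "color": ["E", "E", "E", "I", "J", "J"],
    "price": [326, 326, 327, 334, 335, 336],
})

result = (
    table1
    .filter(["carat", "color"])   # keeps columns, like select in dplyr
    .query('color == "E"')        # keeps rows, like filter in dplyr
    .head(3)
)

print(result)
```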

4.3.4 Example two

Next, we try a slightly more complex example. Here, we first select only variables from our diamonds dataset that begin with the letter c, filter or keep observations where the diamond cut is graded as Ideal or Premium, group the observations by cut, within each group, subgroup by color, and within those subgroups, subgroup again by clarity. Once grouped, we summarise those subgroups, calculating the average (mean) diamond carat, and arrange those diamond groups in descending order of average carat size. Finally, we keep the first few (head of the) observations of our modified data frame.

In R, we can perform these operations, piped one after the other like so:

df_r %>%
  select(starts_with('c')) %>%
  filter(cut %in% c('Ideal', 'Premium')) %>%
  group_by(cut, color, clarity) %>%
  summarise(avgcarat = mean(carat, na.rm = TRUE),
            n = n()) %>%
  arrange(desc(avgcarat)) %>%
  head()
# A tibble: 6 × 5
# Groups:   cut, color [4]
  cut     color clarity avgcarat     n
  <chr>   <chr> <chr>      <dbl> <int>
1 Ideal   J     I1          1.99     2
2 Premium I     I1          1.61    24
3 Premium J     I1          1.58    13
4 Premium J     SI2         1.55   161
5 Ideal   H     I1          1.48    38
6 Premium I     SI2         1.42   312

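Or similarly, in Python using method chaining. Note that this version skips the column selection and summarises every remaining column, so its output is wider than R’s: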

(
  df_py
    .query('cut in ["Ideal", "Premium"]')
    .groupby(['cut', 'color', 'clarity'])
    .agg(['mean', 'size'])
    .sort_values(by=('carat', 'mean'), ascending = False)
    .head()
)
                         Unnamed: 0          carat  ...    y         z     
                               mean size      mean  ... size      mean size
cut     color clarity                               ...                    
Ideal   J     I1       39076.000000    2  1.990000  ...    2  4.875000    2
Premium I     I1       24202.916667   24  1.605833  ...   24  4.482083   24
        J     I1       28924.307692   13  1.578462  ...   13  4.443846   13
              SI2      18268.689441  161  1.554534  ...  161  4.482422  161
Ideal   H     I1       10319.921053   38  1.475526  ...   38  4.464211   38

[5 rows x 16 columns]
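A chain that tracks the R version more closely might select columns with a regular expression and name each summary explicitly. This is a sketch under assumptions: it runs on the six rows of Table 1 rather than the full dataset, so every group contains a single diamond, and the name table1 is ours.

```python
import pandas as pd

# Five columns from the six rows shown in Table 1.
table1 = pd.DataFrame({
    "carat":   [0.23, 0.21, 0.23, 0.29, 0.31, 0.24],
    "cut":     ["Ideal", "Premium", "Good", "Premium", "Good", "Very Good"],
    "color":   ["E", "E", "E", "I", "J", "J"],
    "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2", "VVS2"],
    "price":   [326, 326, 327, 334, 335, 336],
})

summary = (
    table1
    .filter(regex="^c")                                     # like select(starts_with('c'))
    .query('cut in ["Ideal", "Premium"]')                   # like filter(cut %in% c(...))
    .groupby(["cut", "color", "clarity"], as_index=False)   # like group_by(...)
    .agg(avgcarat=("carat", "mean"), n=("carat", "size"))   # like summarise(...)
    .sort_values("avgcarat", ascending=False)               # like arrange(desc(avgcarat))
    .head()
)

print(summary)
```

Named aggregation (avgcarat=..., n=...) yields flat column names, which keeps the result close in shape to the tibble that summarise returns.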

4.3.5 Example three

For our third example of tidying and transforming our data, we first create a new variable from those we have, categorizing price into “low,” “medium,” or “high.” Second, we select our new variable (pricecat) and two dimension variables (x, z). Third, we rename those dimensions as width and depth, respectively. Fourth, we change the structure of the data from “wide” to “long” format. Finally, we filter or keep only observations that have a dimension less than 10 mm, arrange them by pricecat and dim, and keep the first few (head of the) observations.

In R, here are our piped functions:

df_r %>%
  mutate(pricecat = cut(price, breaks = 3, labels = c('low', 'med', 'high'))) %>%
  select(pricecat, x, z) %>%
  rename(width = x, depth = z) %>%
  pivot_longer(cols = -pricecat, names_to = "dim", values_to = "mm") %>%
  filter(mm < 10) %>%
  arrange(pricecat, dim) %>%
  head()
# A tibble: 6 × 3
  pricecat dim      mm
  <fct>    <chr> <dbl>
1 low      depth  2.43
2 low      depth  2.31
3 low      depth  2.31
4 low      depth  2.63
5 low      depth  2.75
6 low      depth  2.48

And in Python, here are our chained methods:

(
  df_py
    .assign(pricecat = pd.cut(df_py['price'], bins = 3, labels = ['low', 'med', 'high']))
    .filter(['pricecat', 'x', 'z'])
    .rename(columns = {'x': 'width', 'z': 'depth'})
    .melt(id_vars = ['pricecat'], value_vars = ['width', 'depth'],
          var_name = 'dim', value_name = 'mm')
    .query('mm < 10')
    .sort_values(['pricecat', 'dim'])
    .head()
)
      pricecat    dim    mm
53940      low  depth  2.43
53941      low  depth  2.31
53942      low  depth  2.31
53943      low  depth  2.63
53944      low  depth  2.75

Again, while the syntax between the languages differs slightly, we have a one-to-one correspondence in functional operations between R and Python!
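The binning step deserves a closer look: given a whole number of bins, pd.cut (like R's cut with breaks = 3) splits the observed range into equal-width intervals, not into groups of equal size. The prices in this sketch are illustrative values we chose, not a sample drawn from the dataset.

```python
import pandas as pd

# Illustrative prices (assumed values, chosen to span a wide range).
prices = pd.Series([326, 5000, 10000, 18823])

# Three equal-width bins between the smallest and largest price.
pricecat = pd.cut(prices, bins=3, labels=["low", "med", "high"])

print(pricecat.tolist())
```

Here the range from 326 to 18823 is divided into three intervals of about 6166 each, so 5000 falls in the first interval and 10000 in the second.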

4.4 Visualizing data

To demonstrate similarities between R’s ggplot2 and Python’s plotnine for visualizing data, let’s repeat the third transformation above and save the result into R and Python objects:

In R,

df_r <- 
  df_r %>%
  mutate(pricecat = cut(price, breaks = 3, labels = c('low', 'med', 'high'))) %>%
  select(pricecat, x, z) %>%
  rename(width = x, depth = z) %>%
  pivot_longer(cols = -pricecat, names_to = "dim", values_to = "mm") %>%
  arrange(pricecat, dim) %>%
  filter(mm < 10)

In Python:

df_py = (
  df_py
    .assign(pricecat = pd.cut(df_py['price'], bins = 3, labels = ['low', 'med', 'high']))
    .filter(['pricecat', 'x', 'z'])
    .rename(columns = {'x': 'width', 'z': 'depth'})
    .melt(id_vars = ['pricecat'], value_vars = ['width', 'depth'],
          var_name = 'dim', value_name = 'mm')
    .sort_values(['pricecat', 'dim'])
    .query('mm < 10')
)

To visualize these, in R, we’ll use ggplot2 to code a faceted graphic where each facet or small multiple is a segment of the data based on our new pricecat variable, and within each facet, we map mm to the x-axis, map a statistic, density of observed mm to the y-axis, and map our dim variable to the fill color of those densities.

In R (ggplot2),

ggplot(
  data = df_r,
  mapping = aes(x = mm, fill = dim)
  ) +
  geom_density(alpha = 0.5) +
  facet_wrap(~pricecat) +
  ylab('') 

Notice that, for historical reasons, we use + instead of %>% after the function ggplot to link together each layer or component of the graphic. Python’s graphing package plotnine intentionally provides almost identical syntax:

(
  ggplot(
    data = df_py,
    mapping = aes(x = 'mm', fill = 'dim')
    ) +
  geom_density(alpha = 0.5) +
  facet_wrap('~pricecat') +
  ylab('')
)
<ggplot: (8786865555917)>

ggplot2 and its imitator, plotnine, are, unlike some charting libraries, based on a solid and influential theory for describing and constructing nearly any graphic from data. That the two packages share similar syntax is a bonus for those learning the basics of both R and Python. To learn more about this implementation of the grammar of graphics, see Wickham, Navarro, and Pedersen (2021).

5 Bonus — mixing R and Python

Speaking of bonuses, we can even mix objects and functions between R and Python! It’s almost trivial using the RStudio IDE [7], which uses the package reticulate [8] for this functionality. In an R environment, we can access an object in the Python environment through the syntax py$ (for example, py$df_py). And, conversely, in a Python environment, we can access an R object through the syntax r. (for example, r.df_r). I’ll leave the details for another day.

I hope this tutorial helps us focus on the basic steps of importing, tidying, transforming, and visualizing data in a way that demonstrates similar approaches using either R or Python. While there are definitely use cases where we would prefer one language over the other, in many cases, it’s just personal preference or fitting in with your team. To go more in-depth for data analysis [9], many resources exist to guide your learning from the beginning. For R, try the aforementioned Grolemund and Wickham (2016) or Mailund (2017); and for Python, try McKinney (2017).

Happy coding!

References

Grolemund, Garrett, and Hadley Wickham. 2016. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.
Mailund, Thomas. 2017. Beginning Data Science in R. Apress.
McKinney, Wes. 2017. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. 2nd ed. O’Reilly.
Wickham, Hadley, Danielle Navarro, and Thomas Lin Pedersen. 2021. ggplot2: Elegant Graphics for Data Analysis. 3rd ed. Springer. https://ggplot2-book.org/.

  1. For this tutorial, we’ll use a common dataset available in both R and Python, but write it to disk (into a subdirectory of our project called data) so that we can demonstrate starting from a common data storage format, the comma separated value (csv) file.

  2. For an introduction to the tidyverse package, browse: https://www.tidyverse.org.

  3. ggplot2 includes a thorough, online reference guide: https://ggplot2.tidyverse.org.

  4. Here’s more information on pandas: https://pandas.pydata.org.

  5. Here’s more information on plotnine: https://plotnine.readthedocs.io/en/stable/.

  6. More cross-references are available online: e.g., https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html.

  7. For information on RStudio: https://www.rstudio.com/products/rstudio/.

  8. You can read more information on the package reticulate, an R interface to Python, online: https://rstudio.github.io/reticulate/.

  9. To learn programming in these languages, other resources provide a better guide.