# Similarities between R and Python for data analysis

Scott Spencer (https://ssp3nc3r.github.io), Columbia University (https://sps.columbia.edu/faculty/scott-spencer)
2020 February 03

R and Python are both useful languages for data science and can be similarly applied in a data analysis. Here, without explaining either language per se, we demonstrate similarities between them for importing, transforming, and visualizing a small, example dataset.

# 1 A generic, data-analysis workflow

In a typical data analysis, we iteratively import data, clean or tidy it, transform those measures into new variables, model and visualize the data, and finally communicate our findings. An illustration in Grolemund and Wickham (2016) shows this generic workflow in simplified form: import, tidy, then cycle among transform, visualize, and model, and finally communicate.

In this paper, we focus on similarities between R and Python for importing, transforming, and visualizing data, shown through a few examples.

# 2 Data used for this tutorial

The `diamonds` dataset we’ll be working with¹ has 53,940 observations of 10 measures. The first few rows of observed measures follow:

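Table 1: diamonds dataset, first few observations.

| carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|
| 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
| 0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
| 0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
| 0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
| 0.24 | Very Good | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 |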

where the variables are defined as:

• `carat`, weight of the diamond
• `cut`, quality of the cut
• `color`, diamond color
• `clarity`, diamond clarity
• `depth`, total depth percentage
• `table`, width of diamond top relative to widest point
• `price`, US dollars
• `x`, `y`, `z`, length, width, and depth in mm

# 3 Software packages in R and Python

Today, most data analysts and scientists do not just use base R or base Python in their work. Instead, they commonly use various software packages that work in their chosen language. In R, several broadly useful packages have been collected into an overall package called `tidyverse`². Loading this package, in turn, loads the individual packages `readr`, `dplyr`, `purrr`, `tibble`, `tidyr`, `forcats`, and `ggplot2`³. `readr` provides functions for importing data. `dplyr`, `purrr`, and `tidyr` provide numerous functions for transforming, summarizing, and merging data (among other things). And `ggplot2` provides functions for drawing graphics from data, theoretically grounded in the grammar of graphics (hence, the `gg` in `ggplot2`), an important concept we’ll use later.

You may load whatever you need from individual packages as you do not usually need them all in a given analysis. The R package ecosystem, of course, includes many additional packages that can be very useful, depending on your intended analysis.

As in R, in Python there are several packages widely used in data analyses. These include `numpy`, `pandas`⁴, `matplotlib`, `seaborn`, `plotnine`⁵, and many more. `numpy` (Numerical Python) provides functions to work with n-dimensional arrays, element-wise operations between them, reading and writing them, linear algebra operations, and integrating other languages like C++ or Fortran. `pandas` provides functions for creating and manipulating data structures, including the important `DataFrame`, a two-dimensional, tabular structure with rows and columns that you might think of as similar to an Excel or Google worksheet. (`data.frame`s, by contrast, are native to R.) `plotnine` is intended to imitate R’s `ggplot2`. `matplotlib` may currently be more popular in Python (it’s older), but unlike `plotnine`, it lacks a theoretical basis in the grammar of graphics and has a steeper learning curve.

# 4 R and Python similarities for import, tidy, transform, and visualize

For this tutorial, we’ll just load `tidyverse` in R, like so:

``````library(tidyverse)
``````

And in a Python environment, we can similarly load the packages relevant for our tutorial:

``````import pandas as pd
from plotnine import *``````

where `import` is a command, and `as ___` makes the package available under a name or abbreviation we choose (e.g., `pd`). Of note, `from ___ import ___` makes the chosen functions and other objects (or everything, if `*`) directly available, without a prefix.
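As a minimal sketch of these two import forms, here is an example using only modules from Python’s standard library. The `statistics` and `math` modules and the alias `st` are stand-ins chosen for illustration; they are not part of this tutorial’s workflow.

```python
# Form 1: import a module under a shorter alias, then reach its
# functions through that alias.
import statistics as st

# Form 2: import selected names directly, so no prefix is needed.
from math import sqrt

prices = [326, 326, 327, 334, 335, 336]  # the six prices shown in Table 1

mean_price = st.mean(prices)  # accessed through the alias
root = sqrt(16)               # accessed directly, no prefix
print(mean_price, root)
```

The alias keeps it visible where a function comes from, which is one reason `pd.read_csv` is often preferred over a bare `read_csv` in Python code.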

## 4.2 Importing data

Both R and Python have many ways to import data for our analyses. Because `csv` files are a common way to store data, we show one relevant import function in each language.

In R, one way we can load the `csv` data is like this:

``````df_r <- read_csv("data/diamonds.csv")
``````

Or in Python,

``df_py = pd.read_csv("data/diamonds.csv")``

For either R or Python, this assumes, of course, that we have a `csv` file called `diamonds.csv` in the `data` subdirectory of our project directory. Notice in this case the functions (names and parameters) seem almost identical. In R, the function `read_csv` is available in the global environment. Had we not loaded the function into the global environment as above, we could still access it through the package using double colons `::`, like so: `readr::read_csv`.

And in Python the same function name (`read_csv`) is available within `pd`. In either case, we end up with an R `data.frame` or Pandas `DataFrame`, respectively. Both have a rectangular structure where rows are observations and columns are variables, as shown above in Table 1.
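To see the rectangular structure without needing a file on disk, here is a small sketch that reads the first three rows of Table 1 from an in-memory string. The names `csv_text` and `df_demo` are hypothetical, used only for this illustration.

```python
import io

import pandas as pd

# First three rows of Table 1, written out as CSV text (a stand-in for a file).
csv_text = """carat,cut,color,clarity,depth,table,price,x,y,z
0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65,327,4.05,4.07,2.31
"""

# read_csv accepts a file path or any file-like object.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.shape)   # (number of rows, number of columns)
print(df_demo.dtypes)  # column types inferred from the text
```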

Next, we demonstrate a few ways to modify this dataframe (`df_r` and `df_py`, respectively), showing the similarities between R and Python functions.

## 4.3 Tidying and transforming data

Data structures in both R and Python have equivalents or near equivalents. A few of these include:

| `R` | `Python` | Examples |
|---|---|---|
| Single-element vector | Scalar | `1`, `1L`, `TRUE`, `"foo"` |
| Multi-element vector | List | `c(1.0, 2.0, 3.0)`, `c(1L, 2L, 3L)` |
| List of multiple types | Tuple | `list(1L, TRUE, "foo")` |
| Named list | Dict | `list(a = 1L, b = 2.0)`, `dict(x = x_data)` |
| Matrix/Array | NumPy ndarray | `matrix(c(1,2,3,4), nrow = 2, ncol = 2)` |
| Data Frame | Pandas DataFrame | `data.frame(x = c(1,2,3), y = c("a", "b", "c"))` |
| Function | Python function | `function(x) x + 1` |
| NULL, TRUE, FALSE | None, True, False | `NULL`, `TRUE`, `FALSE` |
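The examples in that cross-reference are written in R. As a rough sketch, their built-in Python counterparts might look like this (all variable names are hypothetical, chosen only for illustration):

```python
# Python counterparts of the R values in the Examples column.
scalar = 1                   # R: 1L (single-element vector)
numbers = [1.0, 2.0, 3.0]    # R: c(1.0, 2.0, 3.0)
mixed = (1, True, "foo")     # R: list(1L, TRUE, "foo")
named = {"a": 1, "b": 2.0}   # R: list(a = 1L, b = 2.0)
nothing = None               # R: NULL


def add_one(x):              # R: function(x) x + 1
    return x + 1


print(type(numbers).__name__, type(mixed).__name__, type(named).__name__)
print(add_one(scalar))
```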

We can perform transformations of these data structures using software functions. Some powerful functions are similar between both R and Python. Here’s a short cross-reference between the languages for a few of those that operate on data frames⁶:

| `R` / `dplyr` | `Python` / `pandas` | Comment |
|---|---|---|
| `mutate` | `assign` | creates or modifies a variable |
| `select` | `filter` | specifies variables to keep |
| `rename` | `rename` | renames variables |
| `filter` | `query` | specifies observations to keep |
| `arrange` | `sort_values` | specifies ordering of the observations |
| `group_by` | `groupby` | specifies grouping of observations |
| `summarise` | `agg` | specifies some summary of the data |
| `pivot_longer` | `melt` | converts from wide to long format |
| `pivot_wider` | `pivot` | converts from long to wide format |

Notice the names of these functions are verbs, which describe what’s being done to something.
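To make a few of these verbs concrete on the Python side, here is a small sketch that applies them one at a time to three rows from Table 1. The variable names and the derived column `price_per_carat` are hypothetical, introduced only for this illustration.

```python
import pandas as pd

# Three rows taken from Table 1 (a subset of the columns).
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": ["Ideal", "Premium", "Premium"],
    "price": [326, 326, 334],
})

# assign: create a new variable (like dplyr's mutate).
with_ppc = toy.assign(price_per_carat=toy["price"] / toy["carat"])

# rename: rename a variable.
renamed = with_ppc.rename(columns={"cut": "cut_quality"})

# query: keep only some observations (like dplyr's filter).
premium = renamed.query('cut_quality == "Premium"')

# sort_values: order the observations (like dplyr's arrange).
ordered = premium.sort_values("carat", ascending=False)

# filter: keep only some variables (like dplyr's select).
selected = ordered.filter(["carat", "price_per_carat"])
print(selected)
```

Note the naming trap: `filter` keeps columns in `pandas` but keeps rows in `dplyr`.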

Before jumping into our comparisons, let’s talk briefly about the anatomy of a function. Functions let us group a series of (R or Python) program statements together to perform a specific task. We call a function by typing its name and supplying it with the information it needs to perform the task (by setting its parameters). The function then performs the task and, in most cases, returns something. Visually, these components typically look like:

``````returned_object = function_name(parameter1, parameter2)
``````

where the actual number of parameters may range from none to many. The data-analysis functions we use here accept a data frame as a parameter (perhaps along with other parameters), modify that data frame in some way, and return the modified data frame so that we can store it for later use. Above, the modified data frame would be assigned to `returned_object`.

Now we can of course apply just one function to our data object. Let’s call our fictitious function `f(parameter)` where both parameter and the return value are data frames. So, we might do something like this:

``````final_object = f(data)
``````

And if we need to apply several functions, say, `f(parameter)`, `g(parameter)`, and `h(parameter)` in that order, again where parameter and return values are data frames, we might nest them so that the return information of one function is supplied to a parameter of the next function, something like this:

``````final_object = h( g( f(data) ) )
``````

Notice that we have to read this code inside-out for order of application! Alternatively, we might store each function’s return information, something like this:

``````temp1 = f(data)
temp2 = g(temp1)
final_object = h(temp2)
``````

Now, at least, we can read the operations in order of our intended application, but we end up saving intermediate variables (e.g., `temp1`, `temp2`) that we may not care about. There’s generally a better way in both R and Python. In R, we’ll use something called a pipe operator: `%>%`. And in Python, we can use something called method chaining.

### 4.3.1 R functions and the pipe operator

Using the pipe operator `%>%` in R, we can chain together operations in the natural order of application without saving intermediate objects. This pipe operator performs a very simple task. It takes the thing to its left, and supplies that thing as the first argument of the function to the right. Now, our generic example may look like this:

``````final_object <-
initial_object %>%
f() %>%
g() %>%
h()
``````
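Python has no built-in pipe operator, but the idea is easy to sketch. The `pipe` helper below is hypothetical (it is not part of any package used in this tutorial), and `f`, `g`, and `h` are made-up stand-ins that operate on plain lists rather than data frames.

```python
from functools import reduce


def pipe(value, *functions):
    """Feed value through each function in turn, left to right."""
    return reduce(lambda acc, fn: fn(acc), functions, value)


# Hypothetical stand-ins for f, g, and h; each takes and returns a list.
def f(xs):
    return [x for x in xs if x > 0]   # keep positive values


def g(xs):
    return [x * 2 for x in xs]        # double each value


def h(xs):
    return sorted(xs, reverse=True)   # order descending


data = [3, -1, 2]
piped = pipe(data, f, g, h)   # reads in order of application
nested = h(g(f(data)))        # reads inside-out
print(piped, nested)
```

Both forms compute the same result; only the reading order differs.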

### 4.3.2 Python member functions, chained together

In Python, we can do something very similar. Instead of using a pipe operator between functions, we create an object, like a `Pandas DataFrame`. That object has member functions that can operate on the object. We can chain them together using the accessor dot `.` operator. With our above functions as member functions in Python, it may look like this:

``````final_object = (
initial_object
.f()
.g()
.h()
)``````
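Method chaining works whenever each method returns an object that has the next method. As a minimal sketch, the hypothetical `Values` class below (not a `pandas` class) wraps a list and returns a new `Values` object from each of its made-up methods `f`, `g`, and `h`:

```python
class Values:
    """Hypothetical wrapper whose methods each return a new Values object."""

    def __init__(self, items):
        self.items = list(items)

    def f(self):
        # keep positive values
        return Values(x for x in self.items if x > 0)

    def g(self):
        # double each value
        return Values(x * 2 for x in self.items)

    def h(self):
        # order descending
        return Values(sorted(self.items, reverse=True))


final_object = (
    Values([3, -1, 2])
    .f()
    .g()
    .h()
)
print(final_object.items)
```

The enclosing parentheses let us put each method call on its own line, the same layout used with `pandas` below.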

Now that we have some understanding of piping or chaining together functions that transform our data in some way, let’s see them in action on our example `diamonds` data.

### 4.3.3 Example one

For our first example, let’s pipe several operations together in R. For this example, we select a couple of variables from our data frame, filter or keep only the observations where the diamond color was measured as grade `E`, and display the first three observations of the final object:

``````df_r %>%
  select(carat, color) %>%     # keep two variables
  filter(color == 'E') %>%     # keep observations graded E
  head(3)                      # show the first three observations
``````
``````# A tibble: 3 × 2
carat color
<dbl> <chr>
1  0.23 E
2  0.21 E
3  0.23 E    ``````

And do the same thing in Python using objects and method chaining:

``````(
  df_py
  .filter(['carat', 'color'])   # keep two variables
  .query('color == "E"')        # keep observations graded E
  .head(3)                      # show the first three observations
)``````
``````   carat color
0   0.23     E
1   0.21     E
2   0.23     E``````

Nice — we have a one-to-one correspondence between the languages!

### 4.3.4 Example two

Next, we try a slightly more complex example. Here, we first select only variables from our `diamonds` dataset that begin with the letter `c`, filter or keep observations where the diamond `cut` is graded as `Ideal` or `Premium`, group the observations by `cut`, within each group, subgroup by `color`, and within those subgroups, subgroup again by `clarity`. Once grouped, we summarise those subgroups, calculating the average (mean) diamond `carat`, and arrange those diamond groups in descending order of average carat size. Finally, we keep the first few (head of the) observations of our modified data frame.

In R, we can perform these operations, piped one after the other like so:

``````df_r %>%
  select(starts_with('c')) %>%
  filter(cut %in% c('Ideal', 'Premium')) %>%
  group_by(cut, color, clarity) %>%
  summarise(avgcarat = mean(carat, na.rm = TRUE),
            n = n()) %>%
  arrange(desc(avgcarat)) %>%
  head()
``````
``````# A tibble: 6 × 5
# Groups:   cut, color [4]
cut     color clarity avgcarat     n
<chr>   <chr> <chr>      <dbl> <int>
1 Ideal   J     I1          1.99     2
2 Premium I     I1          1.61    24
3 Premium J     I1          1.58    13
4 Premium J     SI2         1.55   161
5 Ideal   H     I1          1.48    38
6 Premium I     SI2         1.42   312``````

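Or, in Python using method chaining (this version skips the column selection, so every numeric column is summarised),

``````(
  df_py
  .query('cut in ["Ideal", "Premium"]')
  .groupby(['cut', 'color', 'clarity'])
  .agg(['mean', 'size'])
  .sort_values(by=('carat', 'mean'), ascending = False)
  .head()
)``````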
``````                         Unnamed: 0          carat  ...    y         z
mean size      mean  ... size      mean size
cut     color clarity                               ...
Ideal   J     I1       39076.000000    2  1.990000  ...    2  4.875000    2
Premium I     I1       24202.916667   24  1.605833  ...   24  4.482083   24
J     I1       28924.307692   13  1.578462  ...   13  4.443846   13
SI2      18268.689441  161  1.554534  ...  161  4.482422  161
Ideal   H     I1       10319.921053   38  1.475526  ...   38  4.464211   38

[5 rows x 16 columns]``````
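The Python result above carries two-level column names, unlike the flat `avgcarat` and `n` columns in R. One way to get closer to the R output is `pandas` named aggregation, sketched here on a tiny data frame with made-up values (not drawn from the diamonds data); the names `toy` and `summary` are hypothetical.

```python
import pandas as pd

# Toy rows with made-up values, only to show the shape of the result.
toy = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Premium", "Premium"],
    "color": ["E", "E", "E", "J"],
    "carat": [0.25, 0.75, 0.25, 1.0],
})

summary = (
    toy
    .groupby(["cut", "color"])
    # named aggregation: new_column = (source column, aggregation)
    .agg(avgcarat=("carat", "mean"), n=("carat", "size"))
    .reset_index()
    .sort_values("avgcarat", ascending=False)
)
print(summary)
```

Here each keyword names an output column, much like `summarise(avgcarat = mean(carat), n = n())` in `dplyr`.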

### 4.3.5 Example three

For our third example of tidying and transforming our data, we’ll create a new variable from those we have, categorizing price into “low,” “medium,” or “high.” Then, we select our new variable (`pricecat`) and two dimension variables (`x`, `z`), and rename those dimensions as `width` and `depth`, respectively. Next, we change the structure of the data from “wide” to “long” format, filter or keep only observations that have a dimension less than 10 mm, arrange them by `pricecat` and `dim`, and keep the first few (head of the) observations.

In R, here are our piped functions:

``````df_r %>%
  mutate(pricecat = cut(price, breaks = 3, labels = c('low', 'med', 'high'))) %>%
  select(pricecat, x, z) %>%
  rename(width = x, depth = z) %>%
  pivot_longer(cols = -pricecat, names_to = "dim", values_to = "mm") %>%
  filter(mm < 10) %>%
  arrange(pricecat, dim) %>%
  head()
``````
``````# A tibble: 6 × 3
pricecat dim      mm
<fct>    <chr> <dbl>
1 low      depth  2.43
2 low      depth  2.31
3 low      depth  2.31
4 low      depth  2.63
5 low      depth  2.75
6 low      depth  2.48``````

And in Python, here are our chained methods:

``````(
  df_py
  .assign(pricecat = pd.cut(df_py['price'], bins = 3, labels = ['low', 'med', 'high']))
  .filter(['pricecat', 'x', 'z'])
  .rename(columns = {'x': 'width', 'z': 'depth'})
  .melt(id_vars = ['pricecat'], value_vars = ['width', 'depth'],
        var_name = 'dim', value_name = 'mm')
  .query('mm < 10')
  .sort_values(['pricecat', 'dim'])
  .head()
)``````
``````      pricecat    dim    mm
53940      low  depth  2.43
53941      low  depth  2.31
53942      low  depth  2.31
53943      low  depth  2.63
53944      low  depth  2.75``````

Again, while the syntax between the languages differs slightly, we have a one-to-one correspondence in functional operations between R and Python!
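Two steps in this example may deserve a closer look: binning with `pd.cut` and reshaping with `melt`. The sketch below uses a three-row data frame with made-up values (not drawn from the diamonds data) so that each result is easy to inspect; the names `toy` and `long_df` are hypothetical.

```python
import pandas as pd

# Made-up values chosen so that each price lands in a different bin.
toy = pd.DataFrame({
    "price": [0, 50, 100],
    "width": [4.0, 5.0, 6.0],
    "depth": [2.0, 3.0, 4.0],
})

# bins=3 splits the observed price range into three equal-width intervals.
toy["pricecat"] = pd.cut(toy["price"], bins=3, labels=["low", "med", "high"])

# melt stacks the value columns one after another: all width rows first,
# then all depth rows, which is why long data has more rows than wide data.
long_df = toy.melt(id_vars=["pricecat"], value_vars=["width", "depth"],
                   var_name="dim", value_name="mm")
print(long_df)
```

This stacking also explains the row labels in the Python output above: the `depth` rows begin at index 53940, right after the 53,940 `width` rows.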

## 4.4 Visualizing data

To demonstrate similarities between R’s `ggplot2` and Python’s `plotnine` for visualizing data, let’s redo the above third transformation of data and save it into R and Python objects:

In R,

``````df_r <-
df_r %>%
mutate(pricecat = cut(price, breaks = 3, labels = c('low', 'med', 'high'))) %>%
select(pricecat, x, z) %>%
rename(width = x, depth = z) %>%
pivot_longer(cols = -pricecat, names_to = "dim", values_to = "mm") %>%
arrange(pricecat, dim) %>%
filter(mm < 10)
``````

In Python:

``````df_py = (
df_py
.assign(pricecat = pd.cut(df_py['price'], bins = 3, labels = ['low', 'med', 'high']))
.filter(['pricecat', 'x', 'z'])
.rename(columns = {'x': 'width', 'z': 'depth'})
.melt(id_vars = ['pricecat'], value_vars = ['width', 'depth'],
var_name = 'dim', value_name = 'mm')
.sort_values(['pricecat', 'dim'])
.query('mm < 10')
)``````

To visualize these, in R, we’ll use `ggplot2` to code a faceted graphic where each facet or small multiple is a segment of the data based on our new `pricecat` variable, and within each facet, we map `mm` to the x-axis, map a statistic, density of observed `mm` to the y-axis, and map our `dim` variable to the `fill` color of those densities.

In R (`ggplot2`),

``````ggplot(
data = df_r,
mapping = aes(x = mm, fill = dim)
) +
geom_density(alpha = 0.5) +
facet_wrap(~pricecat) +
ylab('')
``````

Notice, for historical reasons, we use `+` instead of `%>%` after the function `ggplot` to link together each layer or component of the graphic. In Python, the graphing package `plotnine` intentionally provides almost identical syntax:

``````(
ggplot(
data = df_py,
mapping = aes(x = 'mm', fill = 'dim')
) +
geom_density(alpha = 0.5) +
facet_wrap('~pricecat') +
ylab('')
)``````
``<ggplot: (8786865555917)>``

`ggplot2` and its imitator, `plotnine`, are, unlike some charting libraries, based on a solid and influential theory for describing and constructing nearly any graphic from data. That the two packages share similar syntax is a bonus for those learning the basics in both R and Python. To learn more about this implementation of the grammar of graphics, see Wickham, Navarro, and Lin (2021).

# 5 Bonus — mixing R and Python

Speaking of bonuses, we can even mix objects and functions between R and Python! It’s almost trivial using the RStudio IDE⁷, which uses the package `reticulate`⁸ for this functionality. In an R environment, we can access an object in the Python environment through the syntax `py$`. And, conversely, in a Python environment, we can access an R object through the syntax `r.`. I’ll leave the details for another day.

I hope this tutorial helps us focus on the basic steps of importing, tidying, transforming, and visualizing data in a way that demonstrates similar approaches using either R or Python. While there are definitely use cases where we would prefer one language over the other, in many cases it’s just personal preference, or fitting in with your team. To go more in-depth for data analysis⁹, many resources exist to guide your learning from the beginning. For R, try the aforementioned Grolemund and Wickham (2016) or Mailund (2017); and for Python, try McKinney (2017).

Happy coding!

Grolemund, Garrett, and Hadley Wickham. 2016. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.
Mailund, Thomas. 2017. Beginning Data Science in R. Apress.
McKinney, Wes. 2017. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. Second. O’Reilly.
Wickham, Hadley, Danielle Navarro, and Thomas Lin. 2021. ggplot2: Elegant Graphics for Data Analysis. Third. Springer. https://ggplot2-book.org/.

1. For this tutorial, we’ll use a common dataset available in both R and Python, but write it to disk (into a subdirectory of our project called `data`) so that we can demonstrate starting from a common data storage format, the comma separated value (csv) file.

2. For an introduction to the `tidyverse` package, browse: https://www.tidyverse.org.

3. `ggplot2` includes a thorough, online reference guide: https://ggplot2.tidyverse.org.

4. Here’s more information on `pandas`: https://pandas.pydata.org.

5. Here’s more information on `plotnine`: https://plotnine.readthedocs.io/en/stable/.

6. More cross-references are available online: e.g., https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html.

7. For information on RStudio: https://www.rstudio.com/products/rstudio/.

8. You can read more information on the package `reticulate` (an R interface to Python) online: https://rstudio.github.io/reticulate/.

9. To learn programming in these languages, other resources provide a better guide.