R and Python are both useful languages for data science, and they can be applied in similar ways in a data analysis. Here, without explaining either language per se, we demonstrate similarities between them for importing, transforming, and visualizing a small example dataset.

In a typical data analysis, we iteratively import data, clean or tidy it, transform those measures into new variables, model and visualize the data, and finally communicate our findings. Grolemund and Wickham (2016) illustrate this generic workflow in simplified form: import, tidy, then cycle among transform, visualize, and model, and finally communicate.

In this paper, we focus on the importing, transforming, and visualizing stages of that workflow, shown through a few examples.
The `diamonds` dataset we'll be working with¹ has 53,940 observations of 10 measures. The first few rows of observed measures follow:
carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|
0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
0.24 | Very Good | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 |
where the variables are defined as:

- `carat`, weight of the diamond
- `cut`, quality of the cut
- `color`, diamond color
- `clarity`, diamond clarity
- `depth`, total depth percentage
- `table`, width of diamond top relative to widest point
- `price`, US dollars
- `x`, `y`, `z`, length, width, and depth in mm

Today, most data analysts and scientists do not just use base R or base Python in their work. Instead, they commonly use various software packages that work in their chosen language. In R, several broadly useful packages have been collected into an overall package called `tidyverse`². Loading this package, in turn, loads the individual packages `readr`, `dplyr`, `purrr`, `tibble`, `tidyr`, `forcats`, and `ggplot2`³. `readr` provides functions for importing data. `dplyr`, `purrr`, and `tidyr` provide numerous functions for transforming, summarizing, and merging data (among other things). And `ggplot2` provides functions for drawing graphics from data, theoretically grounded in the grammar of graphics (hence, the `gg` in `ggplot2`), an important concept we'll use later.
You may instead load only the individual packages you need, as you do not usually need them all in a given analysis. The R package ecosystem, of course, includes many additional packages that can be very useful, depending on your intended analysis.
As in R, in Python there are several packages widely used in data analyses. These include `numpy`, `pandas`⁴, `matplotlib`, `seaborn`, `plotnine`⁵, and many more. `numpy` (Numerical Python) provides functions to work with n-dimensional arrays: element-wise operations between them, reading and writing them, linear algebra operations, and integration with other languages like C++ or Fortran. `pandas` provides functions for creating and manipulating data structures, including the important `DataFrame`, a two-dimensional, tabular structure with rows and columns that you might think of as similar to an Excel or Google Sheets worksheet (its counterpart, the `data.frame`, is native to R). `plotnine` is intended to imitate R's `ggplot2`. `matplotlib` may currently be more popular in Python (it's older), but unlike `plotnine`, it lacks a theoretical basis in the grammar of graphics and has a steeper learning curve.
For this tutorial, we'll just load `tidyverse` in R, like so:
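library(tidyverse)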
And in a Python environment, we can similarly load the packages relevant for our tutorial:
import pandas as pd
from plotnine import *
where `import` is a command and `as ___` makes those software packages available through our chosen name or acronym (e.g., `pd`). Of note, `from ___ import ___` makes the chosen functions and such (or everything, if `*`) directly available.
Both R and Python have many ways to import data for our analyses. As `csv` files are one common data storage format, here is a relevant function in each language for importing them.

In R, one way we can load the `csv` data is like this:
df_r <- read_csv("data/diamonds.csv")
Or in Python,
df_py = pd.read_csv("data/diamonds.csv")
For either R or Python, this assumes, of course, that we have a `csv` file called `diamonds.csv` in the `data` subdirectory of our project directory. Notice in this case the functions (names and parameters) seem almost identical. In R, the function `read_csv` is available in the global environment. Had we not loaded it into the global environment as above, we could still access it through its package using double colons (`::`), like so: `readr::read_csv`.
And in Python, the same function name (`read_csv`) is available within `pd`. In either case, we end up with an R `data.frame` or a pandas `DataFrame`, respectively. Both have a rectangular structure where rows are observations and columns are variables, as shown above in Table 1.
Next, we demonstrate a few ways to modify these dataframes (`df_r` and `df_py`, respectively), showing the similarities between R and Python functions.
Data structures in both R and Python have equivalents or near equivalents. A few of these include:
R | Python | Examples |
---|---|---|
Single-element vector | Scalar | `1`, `1L`, `TRUE`, `"foo"` |
Multi-element vector | List | `c(1.0, 2.0, 3.0)`, `c(1L, 2L, 3L)` |
List of multiple types | Tuple | `list(1L, TRUE, "foo")` |
Named list | Dict | `list(a = 1L, b = 2.0)`, `dict(x = x_data)` |
Matrix/Array | NumPy ndarray | `matrix(c(1,2,3,4), nrow = 2, ncol = 2)` |
Data frame | Pandas DataFrame | `data.frame(x = c(1,2,3), y = c("a", "b", "c"))` |
Function | Function | `function(x) x + 1` |
NULL, TRUE, FALSE | None, True, False | `NULL`, `TRUE`, `FALSE` |
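To make a few of these equivalents concrete, here is a minimal sketch of the Python side of this table (the object names are ours, purely for illustration):

import numpy as np
import pandas as pd

scalar = 1                                      # single value (R: single-element vector)
a_list = [1.0, 2.0, 3.0]                        # list (R: multi-element vector)
a_tuple = (1, True, "foo")                      # tuple (R: list of multiple types)
a_dict = {"a": 1, "b": 2.0}                     # dict (R: named list)
an_array = np.array([[1, 2], [3, 4]])           # ndarray (R: matrix/array)
a_frame = pd.DataFrame({"x": [1, 2, 3],
                        "y": ["a", "b", "c"]})  # DataFrame (R: data.frame)

def a_function(x):                              # function (R: function(x) x + 1)
    return x + 1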
We can perform transformations of these data structures using software functions. Some powerful functions are similar between R and Python. Here's a short cross-reference between the languages for a few of those that operate on data frames⁶:
R / dplyr | Python / pandas | Comment |
---|---|---|
`mutate` | `assign` | creates or modifies a variable |
`select` | `filter` | specifies variables to keep |
`rename` | `rename` | renames variables |
`filter` | `query` | specifies observations to keep |
`arrange` | `sort_values` | specifies ordering of the observations |
`group_by` | `groupby` | specifies grouping of observations |
`summarise` | `agg` | specifies some summary of the data |
`pivot_longer` | `melt` | converts from wide to long format |
`pivot_wider` | `pivot` | converts from long to wide format |
Notice the names of these functions are verbs, which describe what’s being done to something.
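To make the first row of this cross-reference concrete, here is a minimal pandas sketch using the `df_py` DataFrame we imported above (the `volume` variable is ours, purely for illustration):

# assign() returns a new DataFrame with an added column,
# much as dplyr's mutate() adds a variable to a data frame
df_vol = df_py.assign(volume = df_py['x'] * df_py['y'] * df_py['z'])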
Before jumping into our comparisons, let's talk briefly about the anatomy of a function. Functions let us group a series of (R or Python) program statements together to perform a specific task. We call a function by invoking or typing its name and supplying it with the information necessary to perform the task (by setting its parameters). The function then performs the task and, in most cases, returns something. Visually, these components typically look like:

returned_object = function_name(parameter1, parameter2)

where the actual number of parameters may range from none to many. The data analysis functions here, in one way or another, accept a data frame as a parameter (along with, perhaps, other parameters); the function modifies the data frame in some way and returns the modified data frame so that we can store it for some use. Above, the modified data frame would be assigned to `returned_object`.
Now we can, of course, apply just one function to our data object. Let's call our fictitious function `f(parameter)`, where both the parameter and the return value are data frames. So, we might do something like this:
final_object = f(data)
And if we need to apply several functions, say `f(parameter)`, `g(parameter)`, and `h(parameter)` in that order, again where parameters and return values are data frames, we might nest them so that the return information of one function is supplied to a parameter of the next function, something like this:
final_object = h( g( f(data) ) )
Notice that we have to read this code inside-out for order of application! Alternatively, we might store each function’s return information, something like this:
temp1 = f(data)
temp2 = g(temp1)
final_object = h(temp2)
Now, at least, we can read the operations in order of our intended application, but we end up saving intermediate variables (e.g., `temp1`, `temp2`) that we may not care about. There's generally a better way in both R and Python. In R, we'll use something called a pipe operator: `%>%`. And in Python, we can use something called method chaining.
Using the pipe operator `%>%` in R, we can chain together operations in the natural order of application without saving intermediate objects. This pipe operator performs a very simple task: it takes the thing on its left and supplies that thing as the first argument of the function on its right. Now, our generic example may look like this:
final_object <-
initial_object %>%
f() %>%
g() %>%
h()
In Python, we can do something very similar. Instead of using a pipe operator between functions, we create an object, like a pandas `DataFrame`. That object has member functions that can operate on the object, and we can chain them together using the dot (`.`) accessor operator. With our above functions as member functions in Python, it may look like this:
final_object = (
  initial_object
  .f()
  .g()
  .h()
)
Now that we have some understanding of piping (or chaining) together functions that transform our data in some way, let's see them in action on our example `diamonds` data.
For our first example, let's pipe several operations together in R. For this example, we select a couple of variables from our data frame, filter or keep only the observations where the diamond color was measured as grade `E`, and display the first three observations of the final object:
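df_r %>%
  select(carat, color) %>%
  filter(color == 'E') %>%
  head(3)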
# A tibble: 3 × 2
carat color
<dbl> <chr>
1 0.23 E
2 0.21 E
3 0.23 E
And do the same thing in Python using objects and method chaining:
(
  df_py
  .filter(['carat', 'color'])
  .query('color == "E"')
  .head(3)
)
carat color
0 0.23 E
1 0.21 E
2 0.23 E
Nice — we have a one-to-one correspondence between the languages!
Next, we try a slightly more complex example. Here, we first select only the variables from our `diamonds` dataset that begin with the letter `c`; filter or keep observations where the diamond `cut` is graded as `Ideal` or `Premium`; group the observations by `cut`; within each group, subgroup by `color`; and within those subgroups, subgroup again by `clarity`. Once grouped, we summarise those subgroups, calculating the average (mean) diamond `carat`, and arrange those diamond groups in descending order of average carat size. Finally, we keep the first few (the head of the) observations of our modified data frame.
In R, we can perform these operations, piped one after the other like so:
df_r %>%
select(starts_with('c')) %>%
filter(cut %in% c('Ideal', 'Premium')) %>%
group_by(cut, color, clarity) %>%
summarise(avgcarat = mean(carat, na.rm = TRUE),
n = n()) %>%
arrange(desc(avgcarat)) %>%
head()
# A tibble: 6 × 5
# Groups: cut, color [4]
cut color clarity avgcarat n
<chr> <chr> <chr> <dbl> <int>
1 Ideal J I1 1.99 2
2 Premium I I1 1.61 24
3 Premium J I1 1.58 13
4 Premium J SI2 1.55 161
5 Ideal H I1 1.48 38
6 Premium I SI2 1.42 312
Or equivalently, in Python using method chaining,
(
  df_py
  .query('cut in ["Ideal", "Premium"]')
  .groupby(['cut', 'color', 'clarity'])
  .agg(['mean', 'size'])
  .sort_values(by = ('carat', 'mean'), ascending = False)
  .head()
)
Unnamed: 0 carat ... y z
mean size mean ... size mean size
cut color clarity ...
Ideal J I1 39076.000000 2 1.990000 ... 2 4.875000 2
Premium I I1 24202.916667 24 1.605833 ... 24 4.482083 24
J I1 28924.307692 13 1.578462 ... 13 4.443846 13
SI2 18268.689441 161 1.554534 ... 161 4.482422 161
Ideal H I1 10319.921053 38 1.475526 ... 38 4.464211 38
[5 rows x 16 columns]
For our third example of tidying and transforming our data, we'll create a new variable from those we have, categorizing `price` as "low," "medium," or "high." Second, we select our new variable (`pricecat`) and two dimension variables (`x`, `z`). Third, we rename those dimensions `width` and `depth`, respectively. Fourth, we change the structure of the data from "wide" to "long" format. Then we filter or keep only observations that have a dimension less than 10 mm; arrange them by `pricecat` and `dim`; and keep the first few (the head of the) observations.
In R, here are our piped functions:
df_r %>%
mutate(pricecat = cut(price, breaks = 3, labels = c('low', 'med', 'high'))) %>%
select(pricecat, x, z) %>%
rename(width = x, depth = z) %>%
pivot_longer(cols = -pricecat, names_to = "dim", values_to = "mm") %>%
filter(mm < 10) %>%
arrange(pricecat, dim) %>%
head()
# A tibble: 6 × 3
pricecat dim mm
<fct> <chr> <dbl>
1 low depth 2.43
2 low depth 2.31
3 low depth 2.31
4 low depth 2.63
5 low depth 2.75
6 low depth 2.48
And in Python, here are our chained methods:
(
  df_py
  .assign(pricecat = pd.cut(df_py['price'], bins = 3, labels = ['low', 'med', 'high']))
  .filter(['pricecat', 'x', 'z'])
  .rename(columns = {'x': 'width', 'z': 'depth'})
  .melt(id_vars = ['pricecat'], value_vars = ['width', 'depth'],
        var_name = 'dim', value_name = 'mm')
  .query('mm < 10')
  .sort_values(['pricecat', 'dim'])
  .head()
)
pricecat dim mm
53940 low depth 2.43
53941 low depth 2.31
53942 low depth 2.31
53943 low depth 2.63
53944 low depth 2.75
Again, while the syntax between the languages differs slightly, we have a one-to-one correspondence in functional operations between R and Python!
To demonstrate similarities between R's `ggplot2` and Python's `plotnine` for visualizing data, let's redo the above third transformation of the data and save it into R and Python objects.
In R,
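df_r <- df_r %>%
  mutate(pricecat = cut(price, breaks = 3, labels = c('low', 'med', 'high'))) %>%
  select(pricecat, x, z) %>%
  rename(width = x, depth = z) %>%
  pivot_longer(cols = -pricecat, names_to = "dim", values_to = "mm") %>%
  filter(mm < 10) %>%
  arrange(pricecat, dim)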
In Python:
df_py = (
  df_py
  .assign(pricecat = pd.cut(df_py['price'], bins = 3, labels = ['low', 'med', 'high']))
  .filter(['pricecat', 'x', 'z'])
  .rename(columns = {'x': 'width', 'z': 'depth'})
  .melt(id_vars = ['pricecat'], value_vars = ['width', 'depth'],
        var_name = 'dim', value_name = 'mm')
  .sort_values(['pricecat', 'dim'])
  .query('mm < 10')
)
To visualize these, in R we'll use `ggplot2` to code a faceted graphic where each facet, or small multiple, is a segment of the data based on our new `pricecat` variable. Within each facet, we map `mm` to the x-axis, map a statistic (the density of observed `mm`) to the y-axis, and map our `dim` variable to the `fill` color of those densities.

In R (`ggplot2`),
ggplot(
data = df_r,
mapping = aes(x = mm, fill = dim)
) +
geom_density(alpha = 0.5) +
facet_wrap(~pricecat) +
ylab('')
Notice, for historical reasons, we use `+` instead of `%>%` after the function `ggplot` to link together each layer or component of the graphic. In Python, the graphing package `plotnine` intentionally provides almost identical syntax:
(
  ggplot(
    data = df_py,
    mapping = aes(x = 'mm', fill = 'dim')
  ) +
  geom_density(alpha = 0.5) +
  facet_wrap('~pricecat') +
  ylab('')
)
`ggplot2` (and its imitator, `plotnine`) is, unlike some charting libraries, based on a solid and influential theory for describing and constructing nearly any graphic from data. That the two packages share similar syntax is a bonus for those learning the basics in both R and Python. To learn more about this implementation of the grammar of graphics, see Wickham, Navarro, and Lin (2021).
Speaking of bonuses, we can even mix objects and functions between R and Python! It's almost trivial using the RStudio IDE⁷, which uses the package `reticulate`⁸ for this functionality. In an R environment, we can access an object in the Python environment through the syntax `py$`. And, conversely, in a Python environment, we can access an R object through the syntax `r.`. I'll leave the details for another day.
I hope this tutorial helps us focus on the basic steps of importing, tidying, transforming, and visualizing data in a way that demonstrates similar approaches using either R or Python. While there are definitely use cases where we would prefer one language over the other, in many cases it's just personal preference, or a matter of melding with your team. To go more in-depth with data analysis⁹, many resources exist to guide your learning from the beginning. For R, try the aforementioned Grolemund and Wickham (2016) or Mailund (2017); and for Python, try McKinney (2017).
Happy coding!
¹ For this tutorial, we'll use a common dataset available in both R and Python, but write it to disk (into a subdirectory of our project called `data`) so that we can demonstrate starting from a common data storage format, the comma-separated value (`csv`) file.

² For an introduction to the `tidyverse` package, browse: https://www.tidyverse.org.

³ `ggplot2` includes a thorough, online reference guide: https://ggplot2.tidyverse.org.

⁴ Here's more information on `pandas`: https://pandas.pydata.org.

⁵ Here's more information on `plotnine`: https://plotnine.readthedocs.io/en/stable/.

⁶ More cross-references are available online: e.g., https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html.

⁷ For information on RStudio: https://www.rstudio.com/products/rstudio/.

⁸ You can read more about the package `reticulate` (an R interface to Python) online: https://rstudio.github.io/reticulate/.

⁹ To learn programming in these languages, other resources provide a better guide.