Example code used in class discussion.
In R, load libraries,
then import data:
df_r <- read_csv("data/201901-citibike-tripdata.csv")
Let’s look at the beginning of the dataframe:
Rows: 967,287
Columns: 15
$ tripduration <dbl> 320, 316, 591, 2719, 303, 535, 280…
$ starttime <dttm> 2019-01-01 00:01:47, 2019-01-01 0…
$ stoptime <dttm> 2019-01-01 00:07:07, 2019-01-01 0…
$ `start station id` <dbl> 3160, 519, 3171, 504, 229, 3630, 3…
$ `start station name` <chr> "Central Park West & W 76 St", "Pe…
$ `start station latitude` <dbl> 40.77897, 40.75187, 40.78525, 40.7…
$ `start station longitude` <dbl> -73.97375, -73.97771, -73.97667, -…
$ `end station id` <dbl> 3283, 518, 3154, 3709, 503, 3529, …
$ `end station name` <chr> "W 89 St & Columbus Ave", "E 39 St…
$ `end station latitude` <dbl> 40.78822, 40.74780, 40.77314, 40.7…
$ `end station longitude` <dbl> -73.97042, -73.97344, -73.95856, -…
$ bikeid <dbl> 15839, 32723, 27451, 21579, 35379,…
$ usertype <chr> "Subscriber", "Subscriber", "Subsc…
$ `birth year` <dbl> 1971, 1964, 1987, 1990, 1979, 1989…
$ gender <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2…
In Python, load libraries,
import pandas as pd
from plotnine import *
from datar.all import *
[2021-11-30 12:05:18][datar][WARNING] Builtin name "min" has been overriden by datar.
[2021-11-30 12:05:18][datar][WARNING] Builtin name "max" has been overriden by datar.
[2021-11-30 12:05:18][datar][WARNING] Builtin name "sum" has been overriden by datar.
[2021-11-30 12:05:18][datar][WARNING] Builtin name "abs" has been overriden by datar.
[2021-11-30 12:05:18][datar][WARNING] Builtin name "round" has been overriden by datar.
[2021-11-30 12:05:18][datar][WARNING] Builtin name "all" has been overriden by datar.
[2021-11-30 12:05:18][datar][WARNING] Builtin name "any" has been overriden by datar.
[2021-11-30 12:05:18][datar][WARNING] Builtin name "re" has been overriden by datar.
[2021-11-30 12:05:18][datar][WARNING] Builtin name "filter" has been overriden by datar.
[2021-11-30 12:05:18][datar][WARNING] Builtin name "slice" has been overriden by datar.
from pipda import options
options.assume_all_piping = True
then load our data:
df_py = pd.read_csv("data/201901-citibike-tripdata.csv")
As with R, let’s get some information on the data frame:
df_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 967287 entries, 0 to 967286
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tripduration 967287 non-null int64
1 starttime 967287 non-null object
2 stoptime 967287 non-null object
3 start station id 967269 non-null float64
4 start station name 967269 non-null object
5 start station latitude 967287 non-null float64
6 start station longitude 967287 non-null float64
7 end station id 967269 non-null float64
8 end station name 967269 non-null object
9 end station latitude 967287 non-null float64
10 end station longitude 967287 non-null float64
11 bikeid 967287 non-null int64
12 usertype 967287 non-null object
13 birth year 967287 non-null int64
14 gender 967287 non-null int64
dtypes: float64(6), int64(4), object(5)
memory usage: 110.7+ MB
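The two summaries differ in ways that matter later: pandas reports starttime and stoptime as object (plain strings) where R parsed them as <dttm>, and 18 rows (967287 - 967269) have no start station id. As a minimal sketch of how one might check and adjust for both, the block below uses a made-up two-row CSV that only borrows the column names; the `toy` data frame and its values are assumptions for illustration, not part of the class data.

```python
import io
import pandas as pd

# Made-up two-row CSV that reuses three of the real column names.
# The second row has an empty start station id, which pandas reads as NaN.
csv_text = (
    "tripduration,starttime,start station id\n"
    "320,2019-01-01 00:01:47,3160\n"
    "316,2019-01-01 00:04:43,\n"
)
toy = pd.read_csv(io.StringIO(csv_text))

# Convert the timestamp strings to a datetime column.
toy["starttime"] = pd.to_datetime(toy["starttime"])

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(toy.dtypes)
print(missing_per_column)
```

The same two calls, pd.to_datetime() and isna().sum(), should apply to df_py unchanged, though the timestamp format in the full file may need checking first.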
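In R, let’s create a new variable that flags whether the bike was rebalanced, that is, whether a ride starts at a station other than the one where the same bike’s previous ride ended.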
df_r <- df_r %>% rename_all(function(x) gsub(" ", "_", x))
df_r <- df_r %>%
  filter(!is.na(start_station_id)) %>%
  arrange(starttime) %>%
  group_by(bikeid) %>%
  mutate(
    rebalanced =
      if_else(row_number() > 1 &
                start_station_id != lag(end_station_id),
              TRUE, FALSE)
  ) %>%
  ungroup()
df_r %>% pull(rebalanced) %>% table()
.
FALSE TRUE
937908 29361
In Python,
df_py = df_py.rename(lambda x: x.replace(' ', '_'), axis = 1)
df_py = df_py >> \
  filter( f.start_station_id.notnull() ) >> \
  arrange(f.starttime) >> \
  group_by(f.bikeid) >> \
  mutate(
    rebalanced =
      if_else((row_number() > 1) &
              (f.start_station_id != lag(f.end_station_id)),
              True, False)
  ) >> \
  ungroup()
df_py.rebalanced.value_counts()
False 937908
True 29361
Name: rebalanced, dtype: int64
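The grouped lag is the heart of this step: within each bike's rides, ordered by start time, a ride counts as rebalanced when it starts somewhere other than where the previous ride ended. For comparison, here is a minimal sketch of the same logic in plain pandas, with groupby().shift() standing in for lag(). The four trips in `trips` are made up for illustration and are not taken from the Citi Bike file.

```python
import pandas as pd

# Made-up trips: bike 1 ends its first ride at station 11 but starts its
# second ride at station 30, so it must have been moved in between.
# Bike 2 starts its second ride where its first ride ended (station 21).
trips = pd.DataFrame({
    "bikeid": [1, 2, 1, 2],
    "starttime": ["2019-01-01 08:00:00", "2019-01-01 08:05:00",
                  "2019-01-01 09:00:00", "2019-01-01 09:30:00"],
    "start_station_id": [10, 20, 30, 21],
    "end_station_id": [11, 21, 31, 22],
})

# Order rides chronologically, as arrange(starttime) does.
trips = trips.sort_values("starttime").reset_index(drop=True)

# Previous ride's end station for the same bike (NaN for a bike's first ride).
prev_end = trips.groupby("bikeid")["end_station_id"].shift(1)

# Flag rides that have a previous ride and start at a different station.
trips["rebalanced"] = prev_end.notna() & (trips["start_station_id"] != prev_end)
print(trips[["bikeid", "starttime", "rebalanced"]])
```

The prev_end.notna() term plays the role of row_number() > 1: a bike's first ride has no earlier ride to compare against, so it is never flagged.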
In R,
In Python,
ggplot(df_py) + \
  geom_bar(
    mapping = aes(x = 'rebalanced'),
    stat = 'count'
  )
<ggplot: (8792502577595)>