The goal of the model is to predict the demand for bike share.This algorithm has a significant meaning in real-world cases. With the prediction model, bike share companies can make better decision of allocating new stations or bike resources to areas of high demand on a just-in-time basis.
The model is built based on weather, time and spatial data to predict the demand for bike share of stations in Philly. Since the bike share trips data has strong temporal patterns, temporalfeatures were explored in the model. Compared to the observed data, the model can better predict no matter higher or lower demands, with nearly 0.7-0.9 MAE. Besides, the result shows that the model performs well in the areas and times with less bike trips. What’s more, the model need to be improved the generalizability in predicting regions with different income, preference of public transit, ratio of white residents.
Philadelphia is chosen as the target city to analyze its bike share data. Indego’s bike trips data from 1 April, 2019 to 30 June, 2019 is gathered; 2017 census data is also used in the model, which includes information of median income, white population and travel preferences; besides, weather data from philadelphia international airport (1 April, 2019 to 31 May, 2019) is also gathered (which contains temperature, wind speed, precipitation on an hourly basis)
dat <- read.csv("indego-trips-2019-q2.csv")
dat2 <- dat %>%
mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE))
rideinfo <- read.csv("bikeshare.csv")
kable(rideinfo,
caption = "Rideshare data") %>%
kable_styling("striped", full_width = F)
Variable_Name | Description |
---|---|
trip_id | Locally unique integer that identifies the trip |
start_time | The date/time when the trip began, presented in ISO 8601 format in local time |
end_time | The date/time when the trip began, presented in ISO 8601 format in local time |
start_station | The station ID where the trip originated |
start_lat | The latitude of the station where the trip originated |
start_lon | The longitude of the station where the trip originated |
end_station | The station ID where the trip terminated |
end_lat | The latitude of the station where the trip terminated |
end_lon | The longitude of the station where the trip terminated |
bike_id | Locally unique integer that identifies the bike |
plan_duration | The number of days that the plan the passholder is using entitles them to ride; 0 is used for a single ride plan (Walk-up) |
trip_route_category | “Round Trip” for trips starting and ending at the same station or “One Way” for all other trips |
passholder_type | The name of the passholder’s plan |
bike_type | The kind of bike used on the trip, including standard pedal-powered bikes or electric assist bikes |
The weather data by time is plotted below:
weather.Panel <-
riem_measures(station = "PHL", date_start = "2019-04-01", date_end = "2019-07-01") %>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
dplyr::summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() +
labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
top="Weather Data - Philadelphia PHL - April~June, 2019")
The plot below shows the study area tracts of Philadelphia
phillyCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2017,
state = "PA",
geometry = TRUE,
county=c("Philadelphia"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
phillyTracts <-
phillyCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
dplyr::select(GEOID, geometry) %>%
st_sf
dat_census <- st_join(dat2 %>%
filter(is.na(start_lon) == FALSE &
is.na(start_lat) == FALSE &
is.na(end_lat) == FALSE &
is.na(end_lon) == FALSE) %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
phillyTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(from_longitude = unlist(map(geometry, 1)),
from_latitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
dplyr::select(-geometry)%>%
st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
st_join(., phillyTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>%
mutate(to_longitude = unlist(map(geometry, 1)),
to_latitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
dplyr::select(-geometry)
phillyCensus%>%
ggplot(.)+
geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
labs(title="Study Area Tracts")+
mapTheme
The maps below show the distribution of bike demands in Philly. Although the demands change during a day and week, the highest demands concentrate in central city.
ggplot()+
geom_sf(data = phillyTracts %>%
st_transform(crs=4326))+
geom_point(data = dat_census %>%
mutate(hour = hour(start_time),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(start_station, from_latitude, from_longitude, weekend, time_of_day) %>%
tally(),
aes(x=from_longitude, y = from_latitude, color = n),
fill = "transparent", alpha = 0.5, size = 1)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
facet_grid(weekend ~ time_of_day)+
labs(title="Bike share trips per hr by station. Philly, April~June, 2019")+
mapTheme
The following plot shows the overall time pattern of bike share trips.There is clearly a daily periodicity and there are lull periods on weekends. Also, the peaks of bike share trips in June are higher than those in April and May, whcih indicates tempreture may influence the number of bike share trips.
ggplot(dat_census %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n))+
labs(title="Bike share trips per hr. Philly, April~June, 2019",
x="Date",
y="Number of trips")+
plotTheme
The following plots show the distribution of trip volume by station for different times of the day. From the temporal pattern, we can find that there are more high volume stations in PM rush hours. Thus, temporal pattern may influence the number of bike share.
dat_census %>%
mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(interval60, start_station, time_of_day) %>%
tally()%>%
group_by(start_station, time_of_day)%>%
dplyr::summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips), binwidth = 1)+
labs(title="Mean Number of Hourly Trips Per Station. Philly, April~June, 2019",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day)+
plotTheme
The plot below shows the difference of bike share trips in weekdays and weekends. It shows that the peaks of demand in weekdays are in rush hours, while the peak of demand in weekends is at mid-day.
ggplot(dat_census %>%
mutate(hour = hour(start_time),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
labs(title="Bike share trips in Philly - weekend vs weekday, April~June, 2019",
x="Hour",
y="Trip Counts")+
plotTheme
dat_census %>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"))
dat_census$weekend <- ifelse(dat_census$dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")
Now a plot is drawn to determine the number of bike share trips in weeks by day of the weeks. It indicates that the demands in weekdays are slightly different—Wednesday has highest demands. The demands in weekends are also different— Saturday has apparently higher demands.
ggplot(dat_census %>% mutate(hour = hour(start_time)))+
geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
labs(title="Bike share trips in Philly, by day of the week, April~June, 2019",
x="Hour",
y="Trip Counts")+
plotTheme
The following code evaluates the correlations in the time lag variables. The result shows that there are strong correlations between lag variables (especially lagHour) and the number of bike trips.
study.panel <-
expand.grid(interval60=unique(dat_census$interval60),
start_station = unique(dat_census$start_station)) %>%
left_join(., dat_census %>%
select(start_station, Origin.Tract, from_longitude, from_latitude )%>%
distinct() %>%
group_by(start_station) %>%
slice(1))
ride.panel <-
dat_census %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60, start_station, Origin.Tract, from_longitude, from_latitude) %>%
dplyr::summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.Panel) %>%
ungroup() %>%
filter(is.na(start_station) == FALSE) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
filter(is.na(Origin.Tract) == FALSE)
ride.panel <-
left_join(ride.panel, phillyCensus %>%
as.data.frame() %>%
select(-geometry), by = c("Origin.Tract" = "GEOID"))
ride.panel <-
ride.panel %>%
arrange(start_station, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
lag7day = dplyr::lag(Trip_Count,24*7),
holiday = ifelse(yday(interval60) == 147,1,0)) %>%
mutate(day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
holidayLag = replace_na(holidayLag, 0))
as.data.frame(ride.panel) %>%
group_by(interval60) %>%
summarise_at(c(vars(starts_with("lag"), "Trip_Count")), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day","lag7day")))%>%
group_by(Variable) %>%
dplyr::summarize(correlation = round(cor(Value, Trip_Count),2))
## # A tibble: 7 x 2
## Variable correlation
## <fct> <dbl>
## 1 lagHour 0.84
## 2 lag2Hours 0.62
## 3 lag3Hours 0.42
## 4 lag4Hours 0.25
## 5 lag12Hours -0.43
## 6 lag1day 0.78
## 7 lag7day 0.64
The plot below shows that demand increases when temperature increases.
ggplot(data= ride.panel)+
geom_smooth(aes(Temperature,Trip_Count), method = 'lm',color = "green")+
labs(title="Relationship of Temperature and Bike share trips , April~June, 2019")
The plot below shows that demand increases during windy days
ggplot(data= ride.panel)+
geom_smooth(aes(Wind_Speed,Trip_Count), method = 'lm',color = "blue")+
labs(title="Relationship of Wind Speed and Bike share trips , April~June, 2019")
The plot below shows that demand decreases during rainy days
ggplot(data= ride.panel)+
geom_smooth(aes(Precipitation,Trip_Count), method = 'lm',color = "red")+
labs(title="Relationship of Precipitation and Bike share trips , April~June, 2019")
The following code builds an animated gif map of rideshare trips over space and time:
week14 <-
dat2 %>%
filter(week == 14 & dotw == "Mon")
week14.panel <-
expand.grid(
interval15=unique(week14$interval15),
start_station = unique(week14$start_station))
ride.animation.data <-
week14 %>%
mutate(Trip_Counter = 1) %>%
right_join(week14.panel) %>%
na.omit(ride.animation.data)%>%
group_by(interval15, start_station,start_lat,start_lon) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326)
ride.animation.data <- st_join(phillyTracts %>%
st_transform(crs=4326),
ride.animation.data,
join=st_intersects,
left = TRUE)
ride.animation.data$Trip_Count = ifelse(is.na(ride.animation.data$Trip_Count)==TRUE,
0, ride.animation.data$Trip_Count)
ride.animation.data<-
ride.animation.data %>%
mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
Trip_Count == 1 ~ "1 trips",
Trip_Count == 2 ~ "2 trips",
Trip_Count == 3 ~ "3 trips",
Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips")) %>%
mutate(Trips = factor(Trips, levels=c("0 trips","1 trips","2 trips",
"3 trips","4-6 trips")))
rideshare_animation <-
ride.animation.data %>%
ggplot(.)+
geom_sf(data=ride.animation.data, aes(fill=Trips)) +
scale_fill_manual(values = palette5) +
labs(title = "Rideshare pickups for one day in April 2019, Philly",
subtitle = "15 minute intervals: {current_frame}", caption = "Zirui Chen") +
transition_manual(interval15) +
mapTheme
animate(rideshare_animation, duration=20)
The resulting animated map provides further evidence of the space/time dependencies in the data.
Three models were created.
(1)The first model “ATime_FE” only contains station data, hour data and weather feature.
(2)The second model “BTime_Space_FE_timeLags” contains station data, hour data, weather data and time lag data.
(3)The third model “CTime_Space_FE_timeLags_holidayLags” contains station data, hour data, weather data, time lag data, holiday data and new features. In this model, “lag7day” is created because the demand of bike have a weekly pattern; also feature “join_hour_center” is created, which indicates whether the trip was originated in the high demand tracts and the hour it started.
ride.panel$start_station <- as.factor(ride.panel$start_station)
ride.panel$start_stationf <- as.factor(ride.panel$start_station)
ride.panel$center <- as.factor(ifelse(ride.panel$Origin.Tract== 42101012500|
ride.panel$Origin.Tract==42101000402,1,0))
ride.panel$join_hour_center <- paste(as.factor(hour(ride.panel$interval60)),
ride.panel$center, sep = "-")
ride.Train <- filter(ride.panel, week < 23)
ride.Test <- filter(ride.panel, week >= 23)
reg1 <-
lm(Trip_Count ~ start_stationf + hour(interval60) + dotw + Temperature + Precipitation,
data=ride.Train)
reg2 <-
lm(Trip_Count ~ start_stationf + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day,
data=ride.Train)
reg3 <-
lm(Trip_Count ~ start_stationf + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day +
Wind_Speed + holidayLag + holiday+ lag7day+join_hour_center,
data=ride.Train)
ride.Test.weekNest <-
ride.Test %>%
nest(-week)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
week_predictions <-
ride.Test.weekNest %>%
mutate(ATime_Space_weather = map(.x = data, fit = reg1, .f = model_pred),
BTime_Space_FE_timeLags = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg3, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
The plot below shows that the third model, which has new features, has lower MAE than the first two model. In general, when the week is closer to the week in training data (week 13-22), it will has a lower MAE.
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week") +
plotTheme
The following plot shows the the predicted and observed bike share of the three models. The first model did a bad job in predicting the highest and lowest demands. The second model did better in predicting the lowest demand but still couldn’t predit the highest demands well; the last model better predicted the lowest and highest demands, but some of the highest demands were still underpredicted.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station)) %>%
dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start_station) %>%
group_by(Regression, Variable, interval60) %>%
dplyr::summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed bike share time series", subtitle = "Philly; A test set of 4 weeks", x = "Hour", y= "Station Trips") +
plotTheme
The errors are shown on a map by weekend/weekday and time of day. The map indicates that the distribution of the big errors have a spatial pattern: in weekdays, highest errors concentrate in the downdown area and apprently the errors are higher during the rush hours; however, in weekends, highest errors still concentrate in the downdown area and the errors are higher during mid day and PM rush.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw) ) %>%
select(interval60, start_station, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "CTime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(start_station, weekend, time_of_day, from_longitude, from_latitude) %>%
dplyr::summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = from_longitude, y = from_latitude, color = MAE),
fill = "transparent", size = 1, alpha = 0.4)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
facet_grid(weekend~time_of_day)+
labs(title="Mean Absolute Errors, Test Set")+
mapTheme
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude)) %>%
dplyr::select(interval60, start_station, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "CTime_Space_FE_timeLags_holidayLags") %>%
group_by(start_station, from_longitude, from_latitude) %>%
dplyr::summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = from_longitude, y = from_latitude, color = MAE),
fill = "transparent", alpha = 0.4)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
labs(title="Mean Abs Error, Test Set, Model 3")+
mapTheme
The plots below show the errors as a function of median income, percentage of people taking public transit and percentage of white population. The first plot implies that the model does poorly in predicting the bike share trips in the tracts with higher median income. The second plot implies that the model does well in predicting the bike share trips in the tracts with higher percentage of people taking public transit. The last plot implies that the model does poorly in predicting the bike share trips in the tracts with higher percentage of white population.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw),
Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
Med_Inc = map(data, pull, Med_Inc),
Percent_White = map(data, pull, Percent_White)) %>%
select(interval60, start_station, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
unnest() %>%
filter(Regression == "CTime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
filter(time_of_day == "AM Rush") %>%
group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
dplyr::summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-start_station, -MAE, key = "variable", value = "value")%>%
ggplot(.)+
#geom_sf(data = chicagoCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = value, y = MAE), alpha = 0.4)+
geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
facet_wrap(~variable, scales = "free")+
labs(title="Errors as a function of socio-economic variables",
y="Mean Absolute Error (Trips)")+
plotTheme
In general, it is noteworthy that when the model is applied to the areas and times that more trips obviously occur, more errors will be generated. The model do better in the non-rush hours and the periphery of the downtown. Besides, the model may lack generalizability in regions with different income, preference of public transit, ratio of white residents.