Space-Time Prediction of Bike Share Demand

Introduction

The goal of the model is to predict the demand for bike share.This algorithm has a significant meaning in real-world cases. With the prediction model, bike share companies can make better decision of allocating new stations or bike resources to areas of high demand on a just-in-time basis.

The model is built based on weather, time and spatial data to predict the demand for bike share of stations in Philly. Since the bike share trips data has strong temporal patterns, temporalfeatures were explored in the model. Compared to the observed data, the model can better predict no matter higher or lower demands, with nearly 0.7-0.9 MAE. Besides, the result shows that the model performs well in the areas and times with less bike trips. What’s more, the model need to be improved the generalizability in predicting regions with different income, preference of public transit, ratio of white residents.

1 Data

Philadelphia is chosen as the target city to analyze its bike share data. Indego’s bike trips data from 1 April, 2019 to 30 June, 2019 is gathered; 2017 census data is also used in the model, which includes information of median income, white population and travel preferences; besides, weather data from philadelphia international airport (1 April, 2019 to 31 May, 2019) is also gathered (which contains temperature, wind speed, precipitation on an hourly basis)

1.1 Rideship data

dat <- read.csv("indego-trips-2019-q2.csv")

dat2 <- dat %>%
  mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
         interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))

rideinfo <- read.csv("bikeshare.csv")
kable(rideinfo,
      caption = "Rideshare data") %>%
  kable_styling("striped", full_width = F)

Rideshare data
Variable_Name	Description
trip_id	Locally unique integer that identifies the trip
start_time	The date/time when the trip began, presented in ISO 8601 format in local time
end_time	The date/time when the trip began, presented in ISO 8601 format in local time
start_station	The station ID where the trip originated
start_lat	The latitude of the station where the trip originated
start_lon	The longitude of the station where the trip originated
end_station	The station ID where the trip terminated
end_lat	The latitude of the station where the trip terminated
end_lon	The longitude of the station where the trip terminated
bike_id	Locally unique integer that identifies the bike
plan_duration	The number of days that the plan the passholder is using entitles them to ride; 0 is used for a single ride plan (Walk-up)
trip_route_category	“Round Trip” for trips starting and ending at the same station or “One Way” for all other trips
passholder_type	The name of the passholder’s plan
bike_type	The kind of bike used on the trip, including standard pedal-powered bikes or electric assist bikes

1.2 Weather data

The weather data by time is plotted below:

weather.Panel <- 
  riem_measures(station = "PHL", date_start = "2019-04-01", date_end = "2019-07-01") %>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    dplyr::summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))


grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - Philadelphia PHL - April~June, 2019")

1.3 Census data

The plot below shows the study area tracts of Philadelphia

phillyCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2017, 
          state = "PA", 
          geometry = TRUE, 
          county=c("Philadelphia"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

phillyTracts <- 
  phillyCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  dplyr::select(GEOID, geometry) %>% 
  st_sf

dat_census <- st_join(dat2 %>% 
          filter(is.na(start_lon) == FALSE &
                   is.na(start_lat) == FALSE &
                   is.na(end_lat) == FALSE &
                   is.na(end_lon) == FALSE) %>%
          st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
        phillyTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(from_longitude = unlist(map(geometry, 1)),
         from_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., phillyTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(to_longitude = unlist(map(geometry, 1)),
         to_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry)

phillyCensus%>%
ggplot(.)+
  geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title="Study Area Tracts")+
  mapTheme

2 Exploring the data

2.1 Spatital pattern

The maps below show the distribution of bike demands in Philly. Although the demands change during a day and week, the highest demands concentrate in central city.

ggplot()+
  geom_sf(data = phillyTracts %>%
          st_transform(crs=4326))+
  geom_point(data = dat_census %>% 
            mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station, from_latitude, from_longitude, weekend, time_of_day) %>%
              tally(),
            aes(x=from_longitude, y = from_latitude, color = n), 
            fill = "transparent", alpha = 0.5, size = 1)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station. Philly, April~June, 2019")+
  mapTheme

2.2 temporal pattern

The following plot shows the overall time pattern of bike share trips.There is clearly a daily periodicity and there are lull periods on weekends. Also, the peaks of bike share trips in June are higher than those in April and May, whcih indicates tempreture may influence the number of bike share trips.

ggplot(dat_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per hr. Philly, April~June, 2019",
       x="Date", 
       y="Number of trips")+
  plotTheme

The following plots show the distribution of trip volume by station for different times of the day. From the temporal pattern, we can find that there are more high volume stations in PM rush hours. Thus, temporal pattern may influence the number of bike share.

dat_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station, time_of_day) %>%
         tally()%>%
  group_by(start_station, time_of_day)%>%
  dplyr::summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. Philly, April~June, 2019",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme

The plot below shows the difference of bike share trips in weekdays and weekends. It shows that the peaks of demand in weekdays are in rush hours, while the peak of demand in weekends is at mid-day.

ggplot(dat_census %>% 
         mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in Philly - weekend vs weekday, April~June, 2019",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

dat_census %>% 
         mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"))
dat_census$weekend <- ifelse(dat_census$dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")

Now a plot is drawn to determine the number of bike share trips in weeks by day of the weeks. It indicates that the demands in weekdays are slightly different—Wednesday has highest demands. The demands in weekends are also different— Saturday has apparently higher demands.

ggplot(dat_census %>% mutate(hour = hour(start_time)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in Philly, by day of the week, April~June, 2019",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

2.3 Relationship of dependent variables

The following code evaluates the correlations in the time lag variables. The result shows that there are strong correlations between lag variables (especially lagHour) and the number of bike trips.

study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), 
              start_station = unique(dat_census$start_station)) %>%
  left_join(., dat_census %>%
              select(start_station, Origin.Tract, from_longitude, from_latitude )%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))

ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station, Origin.Tract, from_longitude, from_latitude) %>%
  dplyr::summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, phillyCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

ride.panel <- 
  ride.panel %>% 
  arrange(start_station, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         lag7day = dplyr::lag(Trip_Count,24*7),
         holiday = ifelse(yday(interval60) == 147,1,0)) %>%
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = replace_na(holidayLag, 0))

as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
  summarise_at(c(vars(starts_with("lag"), "Trip_Count")), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day","lag7day")))%>%
    group_by(Variable) %>%  
    dplyr::summarize(correlation = round(cor(Value, Trip_Count),2))

## # A tibble: 7 x 2
##   Variable   correlation
##   <fct>            <dbl>
## 1 lagHour           0.84
## 2 lag2Hours         0.62
## 3 lag3Hours         0.42
## 4 lag4Hours         0.25
## 5 lag12Hours       -0.43
## 6 lag1day           0.78
## 7 lag7day           0.64

The plot below shows that demand increases when temperature increases.

ggplot(data= ride.panel)+
  geom_smooth(aes(Temperature,Trip_Count), method = 'lm',color = "green")+
  labs(title="Relationship of Temperature and Bike share trips , April~June, 2019")

The plot below shows that demand increases during windy days

ggplot(data= ride.panel)+
  geom_smooth(aes(Wind_Speed,Trip_Count), method = 'lm',color = "blue")+
  labs(title="Relationship of Wind Speed and Bike share trips , April~June, 2019")

The plot below shows that demand decreases during rainy days

ggplot(data= ride.panel)+
  geom_smooth(aes(Precipitation,Trip_Count), method = 'lm',color = "red")+
  labs(title="Relationship of Precipitation and Bike share trips , April~June, 2019")

2.4 Space/time animation

The following code builds an animated gif map of rideshare trips over space and time:

week14 <-
  dat2 %>%
  filter(week == 14 & dotw == "Mon")

week14.panel <-
  expand.grid(
    interval15=unique(week14$interval15),
    start_station = unique(week14$start_station))

ride.animation.data <-
  week14 %>%
    mutate(Trip_Counter = 1) %>%
    right_join(week14.panel) %>% 
  na.omit(ride.animation.data)%>%
    group_by(interval15, start_station,start_lat,start_lon) %>%
    summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
   st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326)


ride.animation.data <- st_join(phillyTracts %>%
          st_transform(crs=4326),
          ride.animation.data,
        join=st_intersects,
              left = TRUE) 
  
  ride.animation.data$Trip_Count = ifelse(is.na(ride.animation.data$Trip_Count)==TRUE, 
                                     0, ride.animation.data$Trip_Count)

ride.animation.data<-
  ride.animation.data %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                             Trip_Count == 1 ~ "1 trips",
                             Trip_Count == 2 ~ "2 trips",
                             Trip_Count == 3 ~ "3 trips",
                             Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips")) %>%
           mutate(Trips  = factor(Trips, levels=c("0 trips","1 trips","2 trips",
                                                  "3 trips","4-6 trips")))

rideshare_animation <-
  ride.animation.data %>%
  ggplot(.)+
  geom_sf(data=ride.animation.data, aes(fill=Trips)) +
  scale_fill_manual(values = palette5) +
  labs(title = "Rideshare pickups for one day in April 2019, Philly",
       subtitle = "15 minute intervals: {current_frame}", caption = "Zirui Chen") +
  transition_manual(interval15) +
  mapTheme

animate(rideshare_animation, duration=20)

The resulting animated map provides further evidence of the space/time dependencies in the data.

3 Modeling

Three models were created.

(1)The first model “ATime_FE” only contains station data, hour data and weather feature.

(2)The second model “BTime_Space_FE_timeLags” contains station data, hour data, weather data and time lag data.

(3)The third model “CTime_Space_FE_timeLags_holidayLags” contains station data, hour data, weather data, time lag data, holiday data and new features. In this model, “lag7day” is created because the demand of bike have a weekly pattern; also feature “join_hour_center” is created, which indicates whether the trip was originated in the high demand tracts and the hour it started.

ride.panel$start_station <- as.factor(ride.panel$start_station)
ride.panel$start_stationf <- as.factor(ride.panel$start_station)
ride.panel$center <- as.factor(ifelse(ride.panel$Origin.Tract== 42101012500|
                              ride.panel$Origin.Tract==42101000402,1,0))
ride.panel$join_hour_center <- paste(as.factor(hour(ride.panel$interval60)),
                                     ride.panel$center, sep = "-")

ride.Train <- filter(ride.panel, week < 23)
ride.Test <- filter(ride.panel, week >= 23)

reg1 <- 
  lm(Trip_Count ~  start_stationf + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_stationf +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_stationf + hour(interval60) + dotw + Temperature + Precipitation +
       lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + 
       Wind_Speed + holidayLag + holiday+ lag7day+join_hour_center, 
     data=ride.Train)

4 Time-series predictions versus observed and MAE

ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_Space_weather = map(.x = data, fit = reg1, .f = model_pred),
           BTime_Space_FE_timeLags = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg3, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

The plot below shows that the third model, which has new features, has lower MAE than the first two model. In general, when the week is closer to the week in training data (week 13-22), it will has a lower MAE.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme

The following plot shows the the predicted and observed bike share of the three models. The first model did a bad job in predicting the highest and lowest demands. The second model did better in predicting the lowest demand but still couldn’t predit the highest demands well; the last model better predicted the lowest and highest demands, but some of the highest demands were still underpredicted.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station)) %>%
    dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station) %>%
    group_by(Regression, Variable, interval60) %>%
    dplyr::summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "Philly; A test set of 4 weeks",  x = "Hour", y= "Station Trips") +
      plotTheme

5 Critique and endorse

5.1 Temporal and spatial dimensions of the error

The errors are shown on a map by weekend/weekday and time of day. The map indicates that the distribution of the big errors have a spatial pattern: in weekdays, highest errors concentrate in the downdown area and apprently the errors are higher during the rush hours; however, in weekends, highest errors still concentrate in the downdown area and the errors are higher during mid day and PM rush.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "CTime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, from_longitude, from_latitude) %>%
  dplyr::summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", size = 1, alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude)) %>%
    dplyr::select(interval60, start_station, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "CTime_Space_FE_timeLags_holidayLags") %>%
  group_by(start_station, from_longitude, from_latitude) %>%
  dplyr::summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title="Mean Abs Error, Test Set, Model 3")+
  mapTheme

5.2 Demographic dimensions of the error

The plots below show the errors as a function of median income, percentage of people taking public transit and percentage of white population. The first plot implies that the model does poorly in predicting the bike share trips in the tracts with higher median income. The second plot implies that the model does well in predicting the bike share trips in the tracts with higher percentage of people taking public transit. The last plot implies that the model does poorly in predicting the bike share trips in the tracts with higher percentage of white population.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, start_station, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "CTime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  dplyr::summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = chicagoCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme

In general, it is noteworthy that when the model is applied to the areas and times that more trips obviously occur, more errors will be generated. The model do better in the non-rush hours and the periphery of the downtown. Besides, the model may lack generalizability in regions with different income, preference of public transit, ratio of white residents.