Bus Mapping with GeoPandas
Bus Mapping and Basic GIS with GeoPandas¶
GeoPandas wraps several of Python GIS tools into a set of convenient functions for storing and operating on shapefiles as DataFrames, and makes working with shapefiles look similar to working with Pandas. In this demo, I'm going to demonstrate some of the basic geopandas functionality for getting started with working with shapefiles, and compute some summary statistics from the data.
The shapefiles I'm using are from the DC open data portal http://opendata.dc.gov/datasets and contain WMTA ( Washington Metro Transit Authority ) Bus stops, and Bus Routes. I download the .zip files, and unzip them on my local machine.
The main steps that are covered in this demo are:
- Load Packages
- Load Shapefiles
- Examine the Data, and make some basic visualizations
- Combine the Datsets Using Spatial Joins
- Create a Full Crosswalk File
- Summary statistics about stops and routes
- What is the average number of bus stops are on each route in the WMTA system?
- How many bus lines serve each block group in DC?
- Starting to think about other questions that we can answer with additional geospatial operations.
Other topics that will be addressed separately on using geospatial data in python
- Coordinate Systems and Map projections
- Constructing an adjacency matrix from a shapefiles
- Constructing attributes using spatial operations such as distance, driving distance, and other spatial operations.
1. Load Packages¶
# !pip install geopandas
# !pip install rtree
# an rtree dependency
# !brew install Spatialindex ( For the mac )
import pandas as pd
import numpy as np
# geospatial tools
import geopandas
import rtree
# plotting tools
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
2. Loading Shapefiles¶
Shapefiles are a widely used geospatial data format ( although not the only one ).
I've obtained shapefiles from _ containing WMTA ( Washington Metro Transit Authority ) Bus stops, and Bus Routes. I download the .zip files, and unzip them on my local machine.
The "geopandas.read_file" command must point to the whole folder that is obtained from the downloaded .zip file. Some of the file extensions inside of the folder you'll see include ".shp" , ".shx" and ".prj". The read_file command will need to be able to locate each of these in order to complete successfully.
The result of the command is a geospatial datframe. It's generally similar to a normal pandas dataframe, except that the geospatial dataframe expects there to be a column named 'geometry', each row defining a polygon, point or line. Other columns in the dataframe describe the geometry in that row. For example in the Bus Stops dataset, each bus stop also has a cross street, and a flag for whether or not the bus stop is covered.
# Source: DC open data portal http://opendata.dc.gov/datasets files about WMTA bus lines and stops
# http://opendata.dc.gov/datasets/metro-bus-stops
bus_stops = geopandas.read_file('data/Metro_Bus_Stops/')
# http://opendata.dc.gov/datasets/metro-bus-lines
bus_routes = geopandas.read_file('data/Metro_Bus_Lines/')
Without going too deeply into this, for now we want to make sure that all of the shapefiles we are working with are in the same projection
bus_stops.crs
bus_routes.crs
# Source: US Census ACS 2015 5-year estimates
acs = geopandas.read_file('data/ACS_2015_5YR_BG_11_DISTRICT_OF_COLUMBIA.gdb')
acs.crs
Convert this file into the same projection as the others
acs = acs.to_crs({'init': 'epsg:4326'})
acs.crs
3. Examine the Data, and make some basic visualizations¶
Bus stops are represented as points
Geopandas has a convenience .plot() method ( similar to pandas ) which makes it very simple to create a basic visualization of the geometry. The dataframe needs to be a 'geopandas.GeoSeries' or a 'geopandas.GeoDataframe' in order for it to work.
bus_stops.geometry[0:4]
# The Geopandas dataframe is aware of the POINT object within the column
# and knows how to plot it
bus_stops.geometry[0:4].plot()
A normal pandas Dataframe can't do this, since it doesn't know how to deal with a POINT object
pd.DataFrame(bus_stops.geometry[0:4]).geometry
# This won't work, this will throw TypeError
# pd.DataFrame(bus_stops.geometry[0:4]).plot()
Bus routes are represented by lines
bus_routes.geometry.head()
bus_routes.sample(2).plot(color='gray')
Census Tracts are represented as polygons
acs.geometry.head()
acs.plot(color = 'red', alpha=0.3)
The objects overlap with each other spatially
The dataframes loaded don't have a linking key between them, but we can create the crosswalks that link the objects together using spatial joins.
# Create the plotting area
fig, ax = plt.subplots(1, figsize=(7,7))
# Plot bus routes
base = bus_routes.plot(ax=ax, color='gray')
# Plot bus stops
bus_stops.plot(ax=base, marker="o", alpha=0.5)
# Plot bus stops
acs.plot(ax=base, alpha=0.2, color = 'red')
_ = ax.axis('off')
4. Combine datsets through spatial joins¶
- Bus stops located in each Census Blockgroup
- Bus lines that intersect Bus stops
- Bus lines that intersect Census Blockgroup
a. Identify stops located in each Census Blockgroup¶
spatially joined, we are looking for points within the census shape
crosswalk_stops = geopandas.tools.sjoin(
geopandas.GeoDataFrame(bus_stops[['OBJECTID','geometry']]),
geopandas.GeoDataFrame(acs[['GEOID','geometry']]),
# Operation we are using is 'within'
how="inner", op='within')
crosswalk_stops.head()
Now each bus_stop record will be associated with an ACS GEOID ( a block-group id ). Since each bus stop will be contained within one block-group, we can just add it to the bus_stops dataframe as a column.
bus_stops = bus_stops.merge(crosswalk_stops[['OBJECTID','GEOID']], on='OBJECTID', how='left')
sample_bg = acs.GEOID.sample().tolist()[0]
fig, ax = plt.subplots(1, figsize=(10,10))
# Plot bus stops within the block group
base = geopandas.GeoDataFrame(bus_stops[bus_stops.GEOID == sample_bg ]).\
plot(ax=ax ,alpha=0.7)
# Plot the block group polygon
acs[acs.GEOID == sample_bg].plot(ax=base, alpha=0.5, color = 'gray')
#fig.suptitle(acs[acs.GEOID == sample_bg].NAMELSAD.tolist()[0])
x, y = acs[acs.GEOID == sample_bg][['TRACTCE','NAMELSAD']].to_dict('records')[0].values()
fig.suptitle("Tract: {0}\n{1}".format(x,y))
_ = ax.axis('off')
b. Bus Stops intersecting lines¶
Since "Points" don't have any area, we need to convert the bus stops into into polygons so that we can detect when the line overlaps with the point's area.
So first we have to 'buffer' the points so that they have some area. 'buffer' is a method on the point.
Other geometric manipulations that are available can be found here: http://geopandas.org/geometric_manipulations.html
bus_stops['geom2']= bus_stops.geometry.map(lambda x: x.buffer(0.0005))
# For now lets not think too hard about the size of the buffer, since we would need to delve into the coordinate system
geopandas.GeoSeries(bus_stops[0:2].geom2).plot(color ='black')
geopandas.GeoSeries(bus_stops.geom2).plot()
# Plot only the stops that have a GEOID ( ACS block-group in DC )
geopandas.GeoSeries(bus_stops[~bus_stops.GEOID.isnull()].geom2).plot()
Now the join , this time we use intersects to detect where a single bus route line, and bus stop circle cross one another.
For testing, we start with one sample bus route
# Start by looking at one bus route
sample_route = bus_routes.ROUTE.sample(1).tolist()[0]
oneroute = bus_routes[bus_routes.ROUTE == sample_route].sample(1)
oneroute
# Perform the join just using the one bus route
sample_crosswalk = geopandas.tools.sjoin(
geopandas.GeoDataFrame(oneroute[['OBJECTID','geometry']]),
geopandas.GeoDataFrame(bus_stops[['OBJECTID','geom2']].rename(columns={'geom2': 'geometry'})),
how="inner", op='intersects',
lsuffix='_route',
rsuffix = '_stop')
# Join the stops and routes using the crosswalk
stops_on_route = sample_crosswalk[['OBJECTID__route','OBJECTID__stop']].\
merge(oneroute[['OBJECTID']],
left_on = 'OBJECTID__route',
right_on = 'OBJECTID',
how = 'inner')
fig, ax = plt.subplots(1, figsize=(11,8))
# Plot bus stops along the route
base = geopandas.GeoDataFrame(
bus_stops.merge(stops_on_route,
left_on = 'OBJECTID',
right_on = 'OBJECTID__stop',
how = 'inner'),
geometry = 'geom2').\
plot(ax=ax ,alpha=0.7)
# Plot the single bus route
oneroute.plot(ax=base, alpha=0.5, color = 'green')
name = oneroute.LINENAME.tolist()[0]
fig.suptitle("Route Name: {0}".format(name))
_ = ax.axis('off')
c. Bus lines that intersect Census Blockgroups¶
Use intersects to detect where a bus route line, and census shape intersect. Bus lines don't need to be completely in the shape
crosswalk_acs = geopandas.tools.sjoin(
geopandas.GeoDataFrame(oneroute[['OBJECTID','geometry']]),
geopandas.GeoDataFrame(acs[['GEOID','geometry']]),
how="inner", op='intersects')
crosswalk_acs.head()
fig, ax = plt.subplots(1, figsize=(10,10))
# The block groups that intersect that bus line
base = acs.merge(crosswalk_acs[['GEOID']], how = 'inner').geometry.\
plot(ax = ax, color='gray', alpha=0.3)
# The one bus line
oneroute.\
plot(ax = base, color = 'green')
# Bus stops within the ACS block groups that intersect that bus line
geopandas.GeoSeries(
bus_stops.merge(crosswalk_acs[['GEOID']], how = 'inner').geom2).\
plot(ax = base,alpha=0.5)
name = oneroute.LINENAME.tolist()[0]
fig.suptitle("Route Name: {0}".format(name))
_ = ax.axis('off')
d. Subset the lines to fit within the DC boundary¶
First, create a full DC shape from the ACS block-groups
dc_shape = acs.unary_union
dc_shape
Then, a simple intersection of the lines with the shape returns the bus route boundaries within DC
lines_in_dc = bus_routes.intersection(dc_shape)
lines_in_dc.plot()
# The difference shows bus route boundaries that are outside of DC
line_diffs = bus_routes.difference(dc_shape)
line_diffs.plot()
# Percentage of the bus line mileage within DC boundaries.
100*round(lines_in_dc.length.sum()/bus_routes.geometry.length.sum(),2)
5. Create a full crosswalk¶
In the previous section, I focused on finding the points along a single bus route, but the same spatial join will return all of the stops on each bus line. The resulting crosswalk dataframe enables joining the routes and stops. It is a many-to-many crosswalk, since each stop can serve multiple routes, and each route has multiple stops.
crosswalk_route_stop = geopandas.tools.sjoin(
geopandas.GeoDataFrame(bus_routes[['OBJECTID','geometry']]),
geopandas.GeoDataFrame(bus_stops[['OBJECTID','geom2']].rename(columns={'geom2': 'geometry'})),
how="inner", op='intersects',
lsuffix='_route',
rsuffix = '_stop')
crosswalk_route_stop[['OBJECTID__route', 'OBJECTID__stop']].sample(5)
We can also include the ACS block group that the bus stop is located in.
crosswalk_stop_acs = geopandas.tools.sjoin(
geopandas.GeoDataFrame(bus_stops[['OBJECTID','geometry']]),
geopandas.GeoDataFrame(acs[['GEOID','geometry']]),
how="inner", op='intersects')
crosswalk_stop_acs[['OBJECTID', 'GEOID']].head()
crosswalk_route_stop = crosswalk_route_stop.merge(crosswalk_stop_acs,
left_on = 'OBJECTID__stop',
right_on = 'OBJECTID',
# Some stops are in DC , and some are not
how = 'left')[['OBJECTID__route', 'OBJECTID__stop', 'GEOID']]
crosswalk_route_stop.sample(10)
# Write to CSV for use in another step
crosswalk_route_stop.to_csv('data/crosswalk_route_stop.csv', index=False)
The crosswalk is many-to-many for stops and routes
# Each stop has multiple OBJECTID__route values associated with it
crosswalk_route_stop.groupby('OBJECTID__stop').\
OBJECTID__route.\
apply(lambda x: x.unique().tolist()).head()
# Each route has multiple OBJECTID__stop values associated with it
crosswalk_route_stop.groupby('OBJECTID__route').\
OBJECTID__stop.\
apply(lambda x: x.unique().tolist()).head()
6. Summary statistics about stops and routes¶
Now that we have a crosswalk that links our datasets together, we can compute some summary statistics like:
- What is the average number of bus stops are on each route?
- How many lines serve each block group in DC?
- What is the block-group with the most/least bus lines running through it?
What is the average number of bus stops that are along each route?
bus_routes.merge(crosswalk_route_stop, left_on ='OBJECTID', right_on = 'OBJECTID__route').\
groupby('ROUTE').OBJECTID__stop.nunique().\
agg({'Average Stops per Route':np.mean,
'Minimum Stops on a Route': np.min,
'Maximum Stops on a Route': np.max,
'Median Stops': np.median}).apply(round).to_frame()
How many lines serve each block group in DC?
bus_routes.merge(crosswalk_route_stop, left_on ='OBJECTID', right_on = 'OBJECTID__route').\
groupby('GEOID').ROUTE.nunique().\
agg({'Average Routes':np.mean,
'Minimum Routes': np.min,
'Maximum Routes': np.max,
'Median Routes': np.median}).apply(round).to_frame()
fig, ax = plt.subplots(1, figsize=(11,10))
# The block groups that intersect that bus line
base = acs.merge(bus_routes.merge(crosswalk_route_stop, left_on ='OBJECTID', right_on = 'OBJECTID__route').\
groupby('GEOID').ROUTE.nunique().to_frame().reset_index())[['geometry','ROUTE']].\
plot(column='ROUTE', ax=ax,
cmap=plt.cm.BrBG, legend=True, k = 8)
fig.suptitle("Number of Bus routes that cross through the block-group")
_ = ax.axis('off')
7. Answering more complex questions using other spatial operations¶
Now that we can combine the datasets, we can use other spatial operations to answer some more interesting questions in a future post.
- For each block group, what's the farthest away that a rider can get on a single bus line?
- What's the average distance between stops on a bus route?
- What's the correlation between number of routes in the block group, and the population of the block-group?
- What's the correlation between the number of routes in the block group and and number of cars estimated to be in the block group?