Bus Mapping with GeoPandas

Bus Mapping and Basic GIS with GeoPandas

image GeoPandas wraps several of Python GIS tools into a set of convenient functions for storing and operating on shapefiles as DataFrames, and makes working with shapefiles look similar to working with Pandas. In this demo, I'm going to demonstrate some of the basic geopandas functionality for getting started with working with shapefiles, and compute some summary statistics from the data.

The shapefiles I'm using are from the DC open data portal http://opendata.dc.gov/datasets and contain WMTA ( Washington Metro Transit Authority ) Bus stops, and Bus Routes. I download the .zip files, and unzip them on my local machine.

The main steps that are covered in this demo are:

  1. Load Packages
  2. Load Shapefiles
  3. Examine the Data, and make some basic visualizations
  4. Combine the Datsets Using Spatial Joins
  5. Create a Full Crosswalk File
  6. Summary statistics about stops and routes
    • What is the average number of bus stops are on each route in the WMTA system?
    • How many bus lines serve each block group in DC?
  7. Starting to think about other questions that we can answer with additional geospatial operations.

Other topics that will be addressed separately on using geospatial data in python

  • Coordinate Systems and Map projections
  • Constructing an adjacency matrix from a shapefiles
  • Constructing attributes using spatial operations such as distance, driving distance, and other spatial operations.

1. Load Packages

In [1]:
# !pip install geopandas 
# !pip install rtree

# an rtree dependency 
# !brew install Spatialindex ( For the mac )
In [2]:
import pandas as pd 
import numpy as np

# geospatial tools
import geopandas
import rtree

# plotting tools
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

2. Loading Shapefiles

Shapefiles are a widely used geospatial data format ( although not the only one ).

I've obtained shapefiles from _ containing WMTA ( Washington Metro Transit Authority ) Bus stops, and Bus Routes. I download the .zip files, and unzip them on my local machine.

The "geopandas.read_file" command must point to the whole folder that is obtained from the downloaded .zip file. Some of the file extensions inside of the folder you'll see include ".shp" , ".shx" and ".prj". The read_file command will need to be able to locate each of these in order to complete successfully.

The result of the command is a geospatial datframe. It's generally similar to a normal pandas dataframe, except that the geospatial dataframe expects there to be a column named 'geometry', each row defining a polygon, point or line. Other columns in the dataframe describe the geometry in that row. For example in the Bus Stops dataset, each bus stop also has a cross street, and a flag for whether or not the bus stop is covered.

In [3]:
# Source: DC open data portal http://opendata.dc.gov/datasets files about WMTA bus lines and stops 

# http://opendata.dc.gov/datasets/metro-bus-stops
bus_stops = geopandas.read_file('data/Metro_Bus_Stops/')

# http://opendata.dc.gov/datasets/metro-bus-lines
bus_routes = geopandas.read_file('data/Metro_Bus_Lines/')

Without going too deeply into this, for now we want to make sure that all of the shapefiles we are working with are in the same projection

http://geopandas.org/projections.html

In [4]:
bus_stops.crs
Out[4]:
{'init': 'epsg:4326'}
In [5]:
bus_routes.crs
Out[5]:
{'init': 'epsg:4326'}
In [6]:
# Source: US Census ACS 2015 5-year estimates

acs = geopandas.read_file('data/ACS_2015_5YR_BG_11_DISTRICT_OF_COLUMBIA.gdb')
In [7]:
acs.crs
Out[7]:
{'init': 'epsg:4269'}

Convert this file into the same projection as the others

In [8]:
acs = acs.to_crs({'init': 'epsg:4326'})
acs.crs
Out[8]:
{'init': 'epsg:4326'}

3. Examine the Data, and make some basic visualizations

Bus stops are represented as points

Geopandas has a convenience .plot() method ( similar to pandas ) which makes it very simple to create a basic visualization of the geometry. The dataframe needs to be a 'geopandas.GeoSeries' or a 'geopandas.GeoDataframe' in order for it to work.

In [9]:
bus_stops.geometry[0:4]
Out[9]:
0    POINT (-76.97413927339112 38.80478277297256)
1    POINT (-76.99163827882485 38.83910177500602)
2    POINT (-77.09605132403074 38.99227279837678)
3    POINT (-77.02202529441263 38.91432379310785)
Name: geometry, dtype: object
In [10]:
# The Geopandas dataframe is aware of the POINT object within the column
# and knows how to plot it

bus_stops.geometry[0:4].plot()
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x118bdeda0>

A normal pandas Dataframe can't do this, since it doesn't know how to deal with a POINT object

In [11]:
pd.DataFrame(bus_stops.geometry[0:4]).geometry
Out[11]:
0    POINT (-76.97413927339112 38.80478277297256)
1    POINT (-76.99163827882485 38.83910177500602)
2    POINT (-77.09605132403074 38.99227279837678)
3    POINT (-77.02202529441263 38.91432379310785)
Name: geometry, dtype: object
In [12]:
# This won't work, this will throw TypeError
# pd.DataFrame(bus_stops.geometry[0:4]).plot()

Bus routes are represented by lines

In [13]:
bus_routes.geometry.head()
Out[13]:
0    (LINESTRING (-76.89642225040846 38.88515778874...
1    LINESTRING (-77.13612674065163 38.818093313739...
2    LINESTRING (-77.10837641792699 38.856279711876...
3    LINESTRING (-77.10458510913409 38.857844060906...
4    (LINESTRING (-77.04491229360494 38.87324778473...
Name: geometry, dtype: object
In [14]:
bus_routes.sample(2).plot(color='gray')
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x118cf66a0>

Census Tracts are represented as polygons

In [15]:
acs.geometry.head()
Out[15]:
0    (POLYGON ((-77.04629699980666 38.9163110000526...
1    (POLYGON ((-77.05062299965169 38.9149029998736...
2    (POLYGON ((-77.04599200013064 38.9145000001749...
3    (POLYGON ((-77.04166999957079 38.9115340000940...
4    (POLYGON ((-77.02962299980157 38.9072399999261...
Name: geometry, dtype: object
In [16]:
acs.plot(color = 'red', alpha=0.3)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x118ed4ef0>

The objects overlap with each other spatially

The dataframes loaded don't have a linking key between them, but we can create the crosswalks that link the objects together using spatial joins.

In [17]:
# Create the plotting area 
fig, ax = plt.subplots(1, figsize=(7,7))

# Plot bus routes
base = bus_routes.plot(ax=ax, color='gray')

# Plot bus stops
bus_stops.plot(ax=base, marker="o", alpha=0.5)

# Plot bus stops
acs.plot(ax=base, alpha=0.2, color = 'red')
_ = ax.axis('off')

4. Combine datsets through spatial joins

  • Bus stops located in each Census Blockgroup
  • Bus lines that intersect Bus stops
  • Bus lines that intersect Census Blockgroup

a. Identify stops located in each Census Blockgroup

spatially joined, we are looking for points within the census shape

In [18]:
crosswalk_stops = geopandas.tools.sjoin(
    geopandas.GeoDataFrame(bus_stops[['OBJECTID','geometry']]),
    geopandas.GeoDataFrame(acs[['GEOID','geometry']]), 
                                        # Operation we are using is 'within'
                           how="inner", op='within')

crosswalk_stops.head()
Out[18]:
OBJECTID geometry index_right GEOID
1 1002 POINT (-76.99163827882485 38.83910177500602) 63 110010073041
465 1466 POINT (-76.98843827148823 38.84215477239719) 63 110010073041
1226 227 POINT (-76.98338628012409 38.84475477538278) 63 110010073041
2046 3047 POINT (-76.98656827956077 38.83577577621708) 63 110010073041
3154 4155 POINT (-76.98534127691076 38.84389777548228) 63 110010073041

Now each bus_stop record will be associated with an ACS GEOID ( a block-group id ). Since each bus stop will be contained within one block-group, we can just add it to the bus_stops dataframe as a column.

In [19]:
bus_stops = bus_stops.merge(crosswalk_stops[['OBJECTID','GEOID']], on='OBJECTID', how='left')
In [20]:
sample_bg = acs.GEOID.sample().tolist()[0]

fig, ax = plt.subplots(1, figsize=(10,10))

# Plot bus stops within the block group
base = geopandas.GeoDataFrame(bus_stops[bus_stops.GEOID == sample_bg ]).\
            plot(ax=ax ,alpha=0.7)

# Plot the block group polygon
acs[acs.GEOID == sample_bg].plot(ax=base, alpha=0.5, color = 'gray')

#fig.suptitle(acs[acs.GEOID == sample_bg].NAMELSAD.tolist()[0])
x, y = acs[acs.GEOID == sample_bg][['TRACTCE','NAMELSAD']].to_dict('records')[0].values()
fig.suptitle("Tract: {0}\n{1}".format(x,y))

_ = ax.axis('off')

b. Bus Stops intersecting lines

Since "Points" don't have any area, we need to convert the bus stops into into polygons so that we can detect when the line overlaps with the point's area.

So first we have to 'buffer' the points so that they have some area. 'buffer' is a method on the point.

Other geometric manipulations that are available can be found here: http://geopandas.org/geometric_manipulations.html

In [21]:
bus_stops['geom2']= bus_stops.geometry.map(lambda x: x.buffer(0.0005))
# For now lets not think too hard about the size of the buffer, since we would need to delve into the coordinate system
In [22]:
geopandas.GeoSeries(bus_stops[0:2].geom2).plot(color ='black')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x119b79358>
In [23]:
geopandas.GeoSeries(bus_stops.geom2).plot()
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x119b885f8>
In [24]:
# Plot only the stops that have a GEOID ( ACS block-group in DC )
geopandas.GeoSeries(bus_stops[~bus_stops.GEOID.isnull()].geom2).plot()
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x119c6df98>

Now the join , this time we use intersects to detect where a single bus route line, and bus stop circle cross one another.

For testing, we start with one sample bus route

In [25]:
# Start by looking at one bus route
sample_route = bus_routes.ROUTE.sample(1).tolist()[0]
oneroute = bus_routes[bus_routes.ROUTE == sample_route].sample(1)
oneroute
Out[25]:
OBJECTID LINEABBR LINENAME Shape_Leng RT_D ROUTE DIRECTION TYPE ORIGIN DESTINATIO ... BPLN_SEC_1 CORRIDOR_I CORRIDOR_D ROUTES_P_1 SERVICE_PL REV_TYPE STATUS COMMENTS GEOMETRYSO geometry
745 746 12 Ballston - Farragut Square Line 9340.755033 38B_W 38B W R FARRAGUT SQUARE BALLSTON-MU STATION ... Western NOVA 35 Ballston / Pentagon 25B; 38B Full Service Plan Rev Active Jun09 Schedule NAVTEQ 2015, QUARTER 4 LINESTRING (-77.03851229509766 38.901525784743...

1 rows ร— 37 columns

In [26]:
# Perform the join just using the one bus route

sample_crosswalk = geopandas.tools.sjoin(
    geopandas.GeoDataFrame(oneroute[['OBJECTID','geometry']]),
    geopandas.GeoDataFrame(bus_stops[['OBJECTID','geom2']].rename(columns={'geom2': 'geometry'})), 
                           how="inner", op='intersects', 
                           lsuffix='_route',
                        rsuffix = '_stop')
In [45]:
# Join the stops and routes using the crosswalk

stops_on_route = sample_crosswalk[['OBJECTID__route','OBJECTID__stop']].\
                        merge(oneroute[['OBJECTID']], 
                                 left_on = 'OBJECTID__route',
                                 right_on = 'OBJECTID',
                                 how = 'inner')


fig, ax = plt.subplots(1, figsize=(11,8))

# Plot bus stops along the route      
base = geopandas.GeoDataFrame(
            bus_stops.merge(stops_on_route,
                         left_on = 'OBJECTID',
                         right_on = 'OBJECTID__stop',
                         how = 'inner'),
            geometry = 'geom2').\
            plot(ax=ax ,alpha=0.7)    
    

# Plot the single bus route
oneroute.plot(ax=base, alpha=0.5, color = 'green')

name = oneroute.LINENAME.tolist()[0]
fig.suptitle("Route Name: {0}".format(name))

_ = ax.axis('off')

c. Bus lines that intersect Census Blockgroups

Use intersects to detect where a bus route line, and census shape intersect. Bus lines don't need to be completely in the shape

In [28]:
crosswalk_acs = geopandas.tools.sjoin(
    geopandas.GeoDataFrame(oneroute[['OBJECTID','geometry']]),
    geopandas.GeoDataFrame(acs[['GEOID','geometry']]), 
                           how="inner", op='intersects')

crosswalk_acs.head()
Out[28]:
OBJECTID geometry index_right GEOID
745 746 LINESTRING (-77.03851229509766 38.901525784743... 80 110010001004
745 746 LINESTRING (-77.03851229509766 38.901525784743... 357 110010107001
745 746 LINESTRING (-77.03851229509766 38.901525784743... 20 110010055003
745 746 LINESTRING (-77.03851229509766 38.901525784743... 418 110010002024
745 746 LINESTRING (-77.03851229509766 38.901525784743... 19 110010055002
In [29]:
fig, ax = plt.subplots(1, figsize=(10,10))


# The block groups that intersect that bus line
base = acs.merge(crosswalk_acs[['GEOID']], how = 'inner').geometry.\
    plot(ax = ax,  color='gray', alpha=0.3)
    
# The one bus line
oneroute.\
    plot(ax = base, color = 'green')
    
# Bus stops within the ACS block groups that intersect that bus line
geopandas.GeoSeries(
    bus_stops.merge(crosswalk_acs[['GEOID']], how = 'inner').geom2).\
    plot(ax = base,alpha=0.5)

name = oneroute.LINENAME.tolist()[0]
fig.suptitle("Route Name: {0}".format(name))
_ = ax.axis('off')

d. Subset the lines to fit within the DC boundary

First, create a full DC shape from the ACS block-groups

In [30]:
dc_shape = acs.unary_union
dc_shape
Out[30]:

Then, a simple intersection of the lines with the shape returns the bus route boundaries within DC

In [31]:
lines_in_dc = bus_routes.intersection(dc_shape)
lines_in_dc.plot()
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b7aa390>
In [32]:
# The difference shows  bus route boundaries that are outside of DC
line_diffs = bus_routes.difference(dc_shape)
line_diffs.plot()
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x119efe048>
In [33]:
# Percentage of the bus line mileage within DC boundaries.
100*round(lines_in_dc.length.sum()/bus_routes.geometry.length.sum(),2)
Out[33]:
27.0

5. Create a full crosswalk

In the previous section, I focused on finding the points along a single bus route, but the same spatial join will return all of the stops on each bus line. The resulting crosswalk dataframe enables joining the routes and stops. It is a many-to-many crosswalk, since each stop can serve multiple routes, and each route has multiple stops.

In [34]:
crosswalk_route_stop = geopandas.tools.sjoin(
    geopandas.GeoDataFrame(bus_routes[['OBJECTID','geometry']]),
    geopandas.GeoDataFrame(bus_stops[['OBJECTID','geom2']].rename(columns={'geom2': 'geometry'})), 
                           how="inner", op='intersects', 
                           lsuffix='_route',
                        rsuffix = '_stop')
In [35]:
crosswalk_route_stop[['OBJECTID__route', 'OBJECTID__stop']].sample(5)
Out[35]:
OBJECTID__route OBJECTID__stop
688 689 4686
642 643 4586
73 74 6311
60 61 7979
186 187 7951

We can also include the ACS block group that the bus stop is located in.

In [36]:
crosswalk_stop_acs = geopandas.tools.sjoin(
    geopandas.GeoDataFrame(bus_stops[['OBJECTID','geometry']]),
    geopandas.GeoDataFrame(acs[['GEOID','geometry']]), 
                           how="inner", op='intersects')

crosswalk_stop_acs[['OBJECTID', 'GEOID']].head()
Out[36]:
OBJECTID GEOID
1 1002 110010073041
465 1466 110010073041
1226 227 110010073041
2046 3047 110010073041
3154 4155 110010073041
In [37]:
crosswalk_route_stop = crosswalk_route_stop.merge(crosswalk_stop_acs, 
                           left_on = 'OBJECTID__stop', 
                           right_on = 'OBJECTID',
                           # Some stops are in DC , and some are not
                           how = 'left')[['OBJECTID__route', 'OBJECTID__stop', 'GEOID']]

crosswalk_route_stop.sample(10)
Out[37]:
OBJECTID__route OBJECTID__stop GEOID
71998 570 8664 110010099022
15400 218 259 NaN
28589 124 2308 110010014021
69950 815 9177 110010033022
14956 420 7454 110010091022
73162 600 6206 NaN
48039 666 3291 NaN
8995 102 6221 NaN
70095 427 969 NaN
33597 499 3862 NaN
In [38]:
# Write to CSV for use in another step
crosswalk_route_stop.to_csv('data/crosswalk_route_stop.csv', index=False)

The crosswalk is many-to-many for stops and routes

In [39]:
# Each stop has multiple OBJECTID__route values associated with it
crosswalk_route_stop.groupby('OBJECTID__stop').\
        OBJECTID__route.\
        apply(lambda x: x.unique().tolist()).head()
Out[39]:
OBJECTID__stop
1                                           [796, 812]
2             [123, 151, 157, 245, 252, 503, 706, 719]
3    [46, 106, 128, 192, 216, 262, 351, 352, 353, 4...
4    [219, 220, 490, 511, 514, 654, 655, 660, 661, ...
5                   [23, 120, 479, 480, 481, 573, 577]
Name: OBJECTID__route, dtype: object
In [40]:
# Each route has multiple OBJECTID__stop values associated with it
crosswalk_route_stop.groupby('OBJECTID__route').\
        OBJECTID__stop.\
        apply(lambda x: x.unique().tolist()).head()
Out[40]:
OBJECTID__route
1    [3802, 3235, 1918, 8732, 430, 8561, 2462, 6668...
2    [7452, 9902, 9795, 9722, 9711, 7534, 9732, 741...
3    [10168, 589, 4221, 4181, 7886, 6497, 6486, 710...
4    [10168, 589, 4221, 4181, 7886, 6497, 6486, 710...
5    [10037, 9713, 1446, 8733, 10029, 8901, 10036, ...
Name: OBJECTID__stop, dtype: object

6. Summary statistics about stops and routes

Now that we have a crosswalk that links our datasets together, we can compute some summary statistics like:

  • What is the average number of bus stops are on each route?
  • How many lines serve each block group in DC?
  • What is the block-group with the most/least bus lines running through it?

What is the average number of bus stops that are along each route?

In [41]:
bus_routes.merge(crosswalk_route_stop, left_on  ='OBJECTID', right_on = 'OBJECTID__route').\
    groupby('ROUTE').OBJECTID__stop.nunique().\
    agg({'Average Stops per Route':np.mean, 
         'Minimum Stops on a Route': np.min, 
         'Maximum Stops on a Route': np.max,
         'Median Stops': np.median}).apply(round).to_frame()
Out[41]:
OBJECTID__stop
Average Stops per Route 106
Minimum Stops on a Route 6
Median Stops 104
Maximum Stops on a Route 248

How many lines serve each block group in DC?

In [42]:
bus_routes.merge(crosswalk_route_stop, left_on  ='OBJECTID', right_on = 'OBJECTID__route').\
    groupby('GEOID').ROUTE.nunique().\
    agg({'Average Routes':np.mean, 
         'Minimum Routes': np.min, 
         'Maximum Routes': np.max,
         'Median Routes': np.median}).apply(round).to_frame()
Out[42]:
ROUTE
Maximum Routes 42
Minimum Routes 1
Average Routes 6
Median Routes 5
In [43]:
fig, ax = plt.subplots(1, figsize=(11,10))


# The block groups that intersect that bus line
base = acs.merge(bus_routes.merge(crosswalk_route_stop, left_on  ='OBJECTID', right_on = 'OBJECTID__route').\
    groupby('GEOID').ROUTE.nunique().to_frame().reset_index())[['geometry','ROUTE']].\
    plot(column='ROUTE', ax=ax,
         cmap=plt.cm.BrBG, legend=True, k = 8)
    
fig.suptitle("Number of Bus routes that cross through the block-group")
_ = ax.axis('off')

7. Answering more complex questions using other spatial operations

Now that we can combine the datasets, we can use other spatial operations to answer some more interesting questions in a future post.

  • For each block group, what's the farthest away that a rider can get on a single bus line?
  • What's the average distance between stops on a bus route?
  • What's the correlation between number of routes in the block group, and the population of the block-group?
  • What's the correlation between the number of routes in the block group and and number of cars estimated to be in the block group?

Written by sideprojects in Posts on Fri 20 January 2017. Tags: data analysis, GIS, python, cities,