Bus Mapping and Basic GIS with GeoPandas¶

GeoPandas wraps several of Python GIS tools into a set of convenient functions for storing and operating on shapefiles as DataFrames, and makes working with shapefiles look similar to working with Pandas. In this demo, I'm going to demonstrate some of the basic geopandas functionality for getting started with working with shapefiles, and compute some summary statistics from the data.

The shapefiles I'm using are from the DC open data portal http://opendata.dc.gov/datasets and contain WMTA ( Washington Metro Transit Authority ) Bus stops, and Bus Routes. I download the .zip files, and unzip them on my local machine.

The main steps that are covered in this demo are:

Load Packages
Load Shapefiles
Examine the Data, and make some basic visualizations
Combine the Datsets Using Spatial Joins
Create a Full Crosswalk File
Summary statistics about stops and routes
- What is the average number of bus stops are on each route in the WMTA system?
- How many bus lines serve each block group in DC?
Starting to think about other questions that we can answer with additional geospatial operations.

Other topics that will be addressed separately on using geospatial data in python

Coordinate Systems and Map projections
Constructing an adjacency matrix from a shapefiles
Constructing attributes using spatial operations such as distance, driving distance, and other spatial operations.

1. Load Packages¶

In [1]:

# !pip install geopandas 
# !pip install rtree

# an rtree dependency 
# !brew install Spatialindex ( For the mac )

In [2]:

import pandas as pd 
import numpy as np

# geospatial tools
import geopandas
import rtree

# plotting tools
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

2. Loading Shapefiles¶

Shapefiles are a widely used geospatial data format ( although not the only one ).

I've obtained shapefiles from _ containing WMTA ( Washington Metro Transit Authority ) Bus stops, and Bus Routes. I download the .zip files, and unzip them on my local machine.

The "geopandas.read_file" command must point to the whole folder that is obtained from the downloaded .zip file. Some of the file extensions inside of the folder you'll see include ".shp" , ".shx" and ".prj". The read_file command will need to be able to locate each of these in order to complete successfully.

The result of the command is a geospatial datframe. It's generally similar to a normal pandas dataframe, except that the geospatial dataframe expects there to be a column named 'geometry', each row defining a polygon, point or line. Other columns in the dataframe describe the geometry in that row. For example in the Bus Stops dataset, each bus stop also has a cross street, and a flag for whether or not the bus stop is covered.

In [3]:

# Source: DC open data portal http://opendata.dc.gov/datasets files about WMTA bus lines and stops 

# http://opendata.dc.gov/datasets/metro-bus-stops
bus_stops = geopandas.read_file('data/Metro_Bus_Stops/')

# http://opendata.dc.gov/datasets/metro-bus-lines
bus_routes = geopandas.read_file('data/Metro_Bus_Lines/')

Without going too deeply into this, for now we want to make sure that all of the shapefiles we are working with are in the same projection

http://geopandas.org/projections.html

In [4]:

bus_stops.crs

Out[4]:

{'init': 'epsg:4326'}

In [5]:

bus_routes.crs

Out[5]:

{'init': 'epsg:4326'}

In [6]:

# Source: US Census ACS 2015 5-year estimates

acs = geopandas.read_file('data/ACS_2015_5YR_BG_11_DISTRICT_OF_COLUMBIA.gdb')

In [7]:

acs.crs

Out[7]:

{'init': 'epsg:4269'}

Convert this file into the same projection as the others

In [8]:

acs = acs.to_crs({'init': 'epsg:4326'})
acs.crs

Out[8]:

{'init': 'epsg:4326'}

3. Examine the Data, and make some basic visualizations¶

Bus stops are represented as points

Geopandas has a convenience .plot() method ( similar to pandas ) which makes it very simple to create a basic visualization of the geometry. The dataframe needs to be a 'geopandas.GeoSeries' or a 'geopandas.GeoDataframe' in order for it to work.

In [9]:

bus_stops.geometry[0:4]

Out[9]:

0    POINT (-76.97413927339112 38.80478277297256)
1    POINT (-76.99163827882485 38.83910177500602)
2    POINT (-77.09605132403074 38.99227279837678)
3    POINT (-77.02202529441263 38.91432379310785)
Name: geometry, dtype: object

In [10]:

# The Geopandas dataframe is aware of the POINT object within the column
# and knows how to plot it

bus_stops.geometry[0:4].plot()

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x118bdeda0>

A normal pandas Dataframe can't do this, since it doesn't know how to deal with a POINT object

In [11]:

pd.DataFrame(bus_stops.geometry[0:4]).geometry

Out[11]:

0    POINT (-76.97413927339112 38.80478277297256)
1    POINT (-76.99163827882485 38.83910177500602)
2    POINT (-77.09605132403074 38.99227279837678)
3    POINT (-77.02202529441263 38.91432379310785)
Name: geometry, dtype: object

In [12]:

# This won't work, this will throw TypeError
# pd.DataFrame(bus_stops.geometry[0:4]).plot()

Bus routes are represented by lines

In [13]:

bus_routes.geometry.head()

Out[13]:

0    (LINESTRING (-76.89642225040846 38.88515778874...
1    LINESTRING (-77.13612674065163 38.818093313739...
2    LINESTRING (-77.10837641792699 38.856279711876...
3    LINESTRING (-77.10458510913409 38.857844060906...
4    (LINESTRING (-77.04491229360494 38.87324778473...
Name: geometry, dtype: object

In [14]:

bus_routes.sample(2).plot(color='gray')

Out[14]:

<matplotlib.axes._subplots.AxesSubplot at 0x118cf66a0>

Census Tracts are represented as polygons

In [15]:

acs.geometry.head()

Out[15]:

0    (POLYGON ((-77.04629699980666 38.9163110000526...
1    (POLYGON ((-77.05062299965169 38.9149029998736...
2    (POLYGON ((-77.04599200013064 38.9145000001749...
3    (POLYGON ((-77.04166999957079 38.9115340000940...
4    (POLYGON ((-77.02962299980157 38.9072399999261...
Name: geometry, dtype: object

In [16]:

acs.plot(color = 'red', alpha=0.3)

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x118ed4ef0>

The objects overlap with each other spatially

The dataframes loaded don't have a linking key between them, but we can create the crosswalks that link the objects together using spatial joins.

In [17]:

# Create the plotting area 
fig, ax = plt.subplots(1, figsize=(7,7))

# Plot bus routes
base = bus_routes.plot(ax=ax, color='gray')

# Plot bus stops
bus_stops.plot(ax=base, marker="o", alpha=0.5)

# Plot bus stops
acs.plot(ax=base, alpha=0.2, color = 'red')
_ = ax.axis('off')

4. Combine datsets through spatial joins¶

Bus stops located in each Census Blockgroup
Bus lines that intersect Bus stops
Bus lines that intersect Census Blockgroup

a. Identify stops located in each Census Blockgroup¶

spatially joined, we are looking for points within the census shape

In [18]:

crosswalk_stops = geopandas.tools.sjoin(
    geopandas.GeoDataFrame(bus_stops[['OBJECTID','geometry']]),
    geopandas.GeoDataFrame(acs[['GEOID','geometry']]), 
                                        # Operation we are using is 'within'
                           how="inner", op='within')

crosswalk_stops.head()

Out[18]:

	OBJECTID	geometry	index_right	GEOID
1	1002	POINT (-76.99163827882485 38.83910177500602)	63	110010073041
465	1466	POINT (-76.98843827148823 38.84215477239719)	63	110010073041
1226	227	POINT (-76.98338628012409 38.84475477538278)	63	110010073041
2046	3047	POINT (-76.98656827956077 38.83577577621708)	63	110010073041
3154	4155	POINT (-76.98534127691076 38.84389777548228)	63	110010073041

Now each bus_stop record will be associated with an ACS GEOID ( a block-group id ). Since each bus stop will be contained within one block-group, we can just add it to the bus_stops dataframe as a column.

In [19]:

bus_stops = bus_stops.merge(crosswalk_stops[['OBJECTID','GEOID']], on='OBJECTID', how='left')

In [20]:

sample_bg = acs.GEOID.sample().tolist()[0]

fig, ax = plt.subplots(1, figsize=(10,10))

# Plot bus stops within the block group
base = geopandas.GeoDataFrame(bus_stops[bus_stops.GEOID == sample_bg ]).\
            plot(ax=ax ,alpha=0.7)

# Plot the block group polygon
acs[acs.GEOID == sample_bg].plot(ax=base, alpha=0.5, color = 'gray')

#fig.suptitle(acs[acs.GEOID == sample_bg].NAMELSAD.tolist()[0])
x, y = acs[acs.GEOID == sample_bg][['TRACTCE','NAMELSAD']].to_dict('records')[0].values()
fig.suptitle("Tract: {0}\n{1}".format(x,y))

_ = ax.axis('off')

b. Bus Stops intersecting lines¶

Since "Points" don't have any area, we need to convert the bus stops into into polygons so that we can detect when the line overlaps with the point's area.

So first we have to 'buffer' the points so that they have some area. 'buffer' is a method on the point.

Other geometric manipulations that are available can be found here: http://geopandas.org/geometric_manipulations.html

In [21]:

bus_stops['geom2']= bus_stops.geometry.map(lambda x: x.buffer(0.0005))
# For now lets not think too hard about the size of the buffer, since we would need to delve into the coordinate system

In [22]:

geopandas.GeoSeries(bus_stops[0:2].geom2).plot(color ='black')

Out[22]:

<matplotlib.axes._subplots.AxesSubplot at 0x119b79358>

In [23]:

geopandas.GeoSeries(bus_stops.geom2).plot()

Out[23]:

<matplotlib.axes._subplots.AxesSubplot at 0x119b885f8>

In [24]:

# Plot only the stops that have a GEOID ( ACS block-group in DC )
geopandas.GeoSeries(bus_stops[~bus_stops.GEOID.isnull()].geom2).plot()

Out[24]:

<matplotlib.axes._subplots.AxesSubplot at 0x119c6df98>

Now the join , this time we use intersects to detect where a single bus route line, and bus stop circle cross one another.

For testing, we start with one sample bus route

In [25]:

# Start by looking at one bus route
sample_route = bus_routes.ROUTE.sample(1).tolist()[0]
oneroute = bus_routes[bus_routes.ROUTE == sample_route].sample(1)
oneroute

Out[25]:

	OBJECTID	LINEABBR	LINENAME	Shape_Leng	RT_D	ROUTE	DIRECTION	TYPE	ORIGIN	DESTINATIO	...	BPLN_SEC_1	CORRIDOR_I	CORRIDOR_D	ROUTES_P_1	SERVICE_PL	REV_TYPE	STATUS	COMMENTS	GEOMETRYSO	geometry
745	746	12	Ballston - Farragut Square Line	9340.755033	38B_W	38B	W	R	FARRAGUT SQUARE	BALLSTON-MU STATION	...	Western NOVA	35	Ballston / Pentagon	25B; 38B	Full Service Plan	Rev	Active	Jun09 Schedule	NAVTEQ 2015, QUARTER 4	LINESTRING (-77.03851229509766 38.901525784743...

1 rows × 37 columns

In [26]:

# Perform the join just using the one bus route

sample_crosswalk = geopandas.tools.sjoin(
    geopandas.GeoDataFrame(oneroute[['OBJECTID','geometry']]),
    geopandas.GeoDataFrame(bus_stops[['OBJECTID','geom2']].rename(columns={'geom2': 'geometry'})), 
                           how="inner", op='intersects', 
                           lsuffix='_route',
                        rsuffix = '_stop')

In [45]:

# Join the stops and routes using the crosswalk

stops_on_route = sample_crosswalk[['OBJECTID__route','OBJECTID__stop']].\
                        merge(oneroute[['OBJECTID']], 
                                 left_on = 'OBJECTID__route',
                                 right_on = 'OBJECTID',
                                 how = 'inner')


fig, ax = plt.subplots(1, figsize=(11,8))

# Plot bus stops along the route      
base = geopandas.GeoDataFrame(
            bus_stops.merge(stops_on_route,
                         left_on = 'OBJECTID',
                         right_on = 'OBJECTID__stop',
                         how = 'inner'),
            geometry = 'geom2').\
            plot(ax=ax ,alpha=0.7)    
    

# Plot the single bus route
oneroute.plot(ax=base, alpha=0.5, color = 'green')

name = oneroute.LINENAME.tolist()[0]
fig.suptitle("Route Name: {0}".format(name))

_ = ax.axis('off')

c. Bus lines that intersect Census Blockgroups¶

Use intersects to detect where a bus route line, and census shape intersect. Bus lines don't need to be completely in the shape

In [28]:

crosswalk_acs = geopandas.tools.sjoin(
    geopandas.GeoDataFrame(oneroute[['OBJECTID','geometry']]),
    geopandas.GeoDataFrame(acs[['GEOID','geometry']]), 
                           how="inner", op='intersects')

crosswalk_acs.head()

Out[28]:

	OBJECTID	geometry	index_right	GEOID
745	746	LINESTRING (-77.03851229509766 38.901525784743...	80	110010001004
745	746	LINESTRING (-77.03851229509766 38.901525784743...	357	110010107001
745	746	LINESTRING (-77.03851229509766 38.901525784743...	20	110010055003
745	746	LINESTRING (-77.03851229509766 38.901525784743...	418	110010002024
745	746	LINESTRING (-77.03851229509766 38.901525784743...	19	110010055002

In [29]:

fig, ax = plt.subplots(1, figsize=(10,10))


# The block groups that intersect that bus line
base = acs.merge(crosswalk_acs[['GEOID']], how = 'inner').geometry.\
    plot(ax = ax,  color='gray', alpha=0.3)
    
# The one bus line
oneroute.\
    plot(ax = base, color = 'green')
    
# Bus stops within the ACS block groups that intersect that bus line
geopandas.GeoSeries(
    bus_stops.merge(crosswalk_acs[['GEOID']], how = 'inner').geom2).\
    plot(ax = base,alpha=0.5)

name = oneroute.LINENAME.tolist()[0]
fig.suptitle("Route Name: {0}".format(name))
_ = ax.axis('off')

d. Subset the lines to fit within the DC boundary¶

First, create a full DC shape from the ACS block-groups

In [30]:

dc_shape = acs.unary_union
dc_shape

Out[30]:

Then, a simple intersection of the lines with the shape returns the bus route boundaries within DC

In [31]:

lines_in_dc = bus_routes.intersection(dc_shape)
lines_in_dc.plot()

Out[31]:

<matplotlib.axes._subplots.AxesSubplot at 0x11b7aa390>

In [32]:

# The difference shows  bus route boundaries that are outside of DC
line_diffs = bus_routes.difference(dc_shape)
line_diffs.plot()

Out[32]:

<matplotlib.axes._subplots.AxesSubplot at 0x119efe048>

In [33]:

# Percentage of the bus line mileage within DC boundaries.
100*round(lines_in_dc.length.sum()/bus_routes.geometry.length.sum(),2)

Out[33]:

27.0

5. Create a full crosswalk¶

In the previous section, I focused on finding the points along a single bus route, but the same spatial join will return all of the stops on each bus line. The resulting crosswalk dataframe enables joining the routes and stops. It is a many-to-many crosswalk, since each stop can serve multiple routes, and each route has multiple stops.

In [34]:

crosswalk_route_stop = geopandas.tools.sjoin(
    geopandas.GeoDataFrame(bus_routes[['OBJECTID','geometry']]),
    geopandas.GeoDataFrame(bus_stops[['OBJECTID','geom2']].rename(columns={'geom2': 'geometry'})), 
                           how="inner", op='intersects', 
                           lsuffix='_route',
                        rsuffix = '_stop')

In [35]:

crosswalk_route_stop[['OBJECTID__route', 'OBJECTID__stop']].sample(5)

Out[35]:

	OBJECTID__route	OBJECTID__stop
688	689	4686
642	643	4586
73	74	6311
60	61	7979
186	187	7951

We can also include the ACS block group that the bus stop is located in.

In [36]:

crosswalk_stop_acs = geopandas.tools.sjoin(
    geopandas.GeoDataFrame(bus_stops[['OBJECTID','geometry']]),
    geopandas.GeoDataFrame(acs[['GEOID','geometry']]), 
                           how="inner", op='intersects')

crosswalk_stop_acs[['OBJECTID', 'GEOID']].head()

Out[36]:

	OBJECTID	GEOID
1	1002	110010073041
465	1466	110010073041
1226	227	110010073041
2046	3047	110010073041
3154	4155	110010073041

In [37]:

crosswalk_route_stop = crosswalk_route_stop.merge(crosswalk_stop_acs, 
                           left_on = 'OBJECTID__stop', 
                           right_on = 'OBJECTID',
                           # Some stops are in DC , and some are not
                           how = 'left')[['OBJECTID__route', 'OBJECTID__stop', 'GEOID']]

crosswalk_route_stop.sample(10)

Out[37]:

	OBJECTID__route	OBJECTID__stop	GEOID
71998	570	8664	110010099022
15400	218	259	NaN
28589	124	2308	110010014021
69950	815	9177	110010033022
14956	420	7454	110010091022
73162	600	6206	NaN
48039	666	3291	NaN
8995	102	6221	NaN
70095	427	969	NaN
33597	499	3862	NaN

In [38]:

# Write to CSV for use in another step
crosswalk_route_stop.to_csv('data/crosswalk_route_stop.csv', index=False)

The crosswalk is many-to-many for stops and routes

In [39]:

# Each stop has multiple OBJECTID__route values associated with it
crosswalk_route_stop.groupby('OBJECTID__stop').\
        OBJECTID__route.\
        apply(lambda x: x.unique().tolist()).head()

Out[39]:

OBJECTID__stop
1                                           [796, 812]
2             [123, 151, 157, 245, 252, 503, 706, 719]
3    [46, 106, 128, 192, 216, 262, 351, 352, 353, 4...
4    [219, 220, 490, 511, 514, 654, 655, 660, 661, ...
5                   [23, 120, 479, 480, 481, 573, 577]
Name: OBJECTID__route, dtype: object

In [40]:

# Each route has multiple OBJECTID__stop values associated with it
crosswalk_route_stop.groupby('OBJECTID__route').\
        OBJECTID__stop.\
        apply(lambda x: x.unique().tolist()).head()

Out[40]:

OBJECTID__route
1    [3802, 3235, 1918, 8732, 430, 8561, 2462, 6668...
2    [7452, 9902, 9795, 9722, 9711, 7534, 9732, 741...
3    [10168, 589, 4221, 4181, 7886, 6497, 6486, 710...
4    [10168, 589, 4221, 4181, 7886, 6497, 6486, 710...
5    [10037, 9713, 1446, 8733, 10029, 8901, 10036, ...
Name: OBJECTID__stop, dtype: object

6. Summary statistics about stops and routes¶

Now that we have a crosswalk that links our datasets together, we can compute some summary statistics like:

What is the average number of bus stops are on each route?
How many lines serve each block group in DC?
What is the block-group with the most/least bus lines running through it?

What is the average number of bus stops that are along each route?

In [41]:

bus_routes.merge(crosswalk_route_stop, left_on  ='OBJECTID', right_on = 'OBJECTID__route').\
    groupby('ROUTE').OBJECTID__stop.nunique().\
    agg({'Average Stops per Route':np.mean, 
         'Minimum Stops on a Route': np.min, 
         'Maximum Stops on a Route': np.max,
         'Median Stops': np.median}).apply(round).to_frame()

Out[41]:

	OBJECTID__stop
Average Stops per Route	106
Minimum Stops on a Route	6
Median Stops	104
Maximum Stops on a Route	248

How many lines serve each block group in DC?

In [42]:

bus_routes.merge(crosswalk_route_stop, left_on  ='OBJECTID', right_on = 'OBJECTID__route').\
    groupby('GEOID').ROUTE.nunique().\
    agg({'Average Routes':np.mean, 
         'Minimum Routes': np.min, 
         'Maximum Routes': np.max,
         'Median Routes': np.median}).apply(round).to_frame()

Out[42]:

	ROUTE
Maximum Routes	42
Minimum Routes	1
Average Routes	6
Median Routes	5

In [43]:

fig, ax = plt.subplots(1, figsize=(11,10))


# The block groups that intersect that bus line
base = acs.merge(bus_routes.merge(crosswalk_route_stop, left_on  ='OBJECTID', right_on = 'OBJECTID__route').\
    groupby('GEOID').ROUTE.nunique().to_frame().reset_index())[['geometry','ROUTE']].\
    plot(column='ROUTE', ax=ax,
         cmap=plt.cm.BrBG, legend=True, k = 8)
    
fig.suptitle("Number of Bus routes that cross through the block-group")
_ = ax.axis('off')

7. Answering more complex questions using other spatial operations¶

Now that we can combine the datasets, we can use other spatial operations to answer some more interesting questions in a future post.

For each block group, what's the farthest away that a rider can get on a single bus line?
What's the average distance between stops on a bus route?
What's the correlation between number of routes in the block group, and the population of the block-group?
What's the correlation between the number of routes in the block group and and number of cars estimated to be in the block group?

Bus Mapping with GeoPandas