Federal Contractor Misconduct Database (Part 1)

Intro

The Project on Government Oversight maintains a database of misconduct that details settlements and cases that government entities have brought against their contractors, going back to 1995. The database lets a user explore instances of misconduct one by one or by contractor. The information is mostly extracted and entered from official press releases. It's important to note that the organization is upfront about the limitations of its data, primarily that the data is not complete, as it has been pieced together from numerous public sources and FOIA requests.

With this key limitation in mind, I was curious about what patterns there might be in misconduct instances and penalties over time. The site is organized around specific instances per contractor, so it's hard to get a high-level view.

Next steps

Next, I'll go into some detail on how I scraped the data, parsed it into a CSV, and set up an analysis environment in IPython Notebook. If none of that interests you, feel free to skip right on to Part 2, where I will do some basic exploratory analysis on the instances of misconduct that I pulled.

Nuts and bolts

Scraping the Site

For simple scraping jobs, I usually turn to Kimonolabs first to get some data to work with. I set up a scraper for the misconduct instance pages, like this one, http://www.contractormisconduct.org/misconduct/2494/federal-and-state-environmental-violations-at-waterford-facility, to pull the relevant information from each page.

Then I found that all of the misconduct instance URLs are structured as http://www.contractormisconduct.org/misconduct/<id>, so generating the URLs is simply a matter of looping through a range of ids and trying each one.
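
As a rough sketch (the upper bound of the id range here is a guess on my part, not something from the site), the candidate URLs can be generated with a simple shell loop:

# Generate candidate misconduct instance URLs for a range of ids;
# ids that don't exist will simply return a 404 when fetched.
for id in $(seq 1 3000); do
  echo "http://www.contractormisconduct.org/misconduct/$id"
done > instance_urls.txt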

With the basic scraper set up, the data can be pulled and parsed. I use the httpie, jq, and json2csv command-line tools for pulling and parsing API data like this.

http https://www.kimonolabs.com/api/blv0q7fe > instance_details.json

header="agency,contracting party,contractors.href,contractors.name,court type,description,disposition,index,misconduct,penalty,type,url"
echo $header >instance_details.csv

cat instance_details.json| jq  -c '.results.collection1[] ' | json2csv -k $header >> instance_details.csv
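
As a quick sanity check on the output (using only standard tools, nothing specific to this dataset), you can count the rows and peek at the first record:

# Row count (including the header line)
wc -l instance_details.csv

# Header plus the first record
head -n 2 instance_details.csv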

IPython Notebook with Docker

To explore the tabular dataset that I now have, I prefer to use IPython Notebook. It's easy to set up with Docker.

  1. First, you need to have Docker installed and working on your computer.

  2. Create a directory called 'notebooks' within your current directory. This is where files will be synced into the Docker container from your local computer.

    mkdir -p $PWD/notebooks
    
  3. Now move or copy the CSV file into the directory that you just created.

  4. Run the ipython/scipyserver container with your local notebooks folder synced to the container's notebooks folder.

    docker run -d -p 443:8888 -e "PASSWORD=test" -v $PWD/notebooks:/notebooks ipython/scipyserver
    
  5. Check what address your Docker container is using (see the sketch after this list).

  6. Go to https://<docker-ip>/tree and log in with 'test' as the password. You should see nothing in the home folder except for the CSV that you copied in step 3.

  7. create & save notebooks - they will save locally to the directory that you created in step 1 and synced with docker in step 4.

What's next?

Next, I want to start with some basic exploratory data analysis on the instances of misconduct.


Written by sideprojects in Posts on Mon 28 December 2015. Tags: web scraping, docker, data analysis, fcmd