Basic observations of the Misconduct Database ( Part 2 )

Exploring the variables in the Misconduct database

For info on how I came by this data, see part1. I wanted to break it free from looking at one or a handful of entries at a time as you find on the main website and see what I can find by looking at the data a bit differently. Keep in mind all of the caveats about this data.

In this post, I'll make some basic observations about the misconduct data that I've pulled, continuing to update the charts as I explore.

In addition, I'm practicing plotting with the seaborn package.

In [2]:
import pandas as pd
import numpy as np
%matplotlib inline

import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="ticks")
In [3]:
data = pd.read_csv("instance_details.csv")
print(data.shape)
data.head()
(2131, 13)
Out[3]:
agency contracting party contractors.href contractors.text court type description disposition index misconduct penalty settlementDate type url
0 Non-Governmental Non-Governmental http://www.contractormisconduct.org/contractor... Alliant Techsystems Inc. Civil On March 12, 2002, a civil action was filed ag... Settlement 1 National Metal Technologies, Inc. v. Alliant T... $6,000,000 3/24/2005 Antitrust http://www.contractormisconduct.org/misconduct...
1 Defense – Navy Defense - Navy http://www.contractormisconduct.org/contractor... Alliant Techsystems Inc. Civil In March of 1997, Alliant Techsystems was char... Settlement 2 Pratt v. Alliant Techsystems, Inc. and Hercule... $2,250,000 3/30/1998 Cost/Labor Mischarge http://www.contractormisconduct.org/misconduct...
2 Non-Governmental None http://www.contractormisconduct.org/contractor... AmerisourceBergen Civil Fagan, at age 16, almost died after he receive... Settlement 3 Fagan v. AmerisourceBergen Corp. (Counterfeit ... Undisclosed 2/24/2006 Health http://www.contractormisconduct.org/misconduct...
3 Non-Governmental None http://www.contractormisconduct.org/contractor... AmerisourceBergen Civil “The allegations in the class action suit alle... Settlement 4 Helbig v. Interstate Pharmacy Corporation (Rei... $3,200,000 1/27/2005 Health http://www.contractormisconduct.org/misconduct...
4 Health and Human Services Non-Governmental http://www.contractormisconduct.org/contractor... AmerisourceBergen Administrative PharMerica, a subsidiary of AmerisourceBergen,... Settlement 5 Allegations of Illegal Kickbacks $5,975,000 3/29/2005 Ethics http://www.contractormisconduct.org/misconduct...

Clean Data

  • make penalty into a number ( instead of a string with commas and $ signs)
  • make disclosed / undisclosed a separate flag
  • make date into a date
  • make date into year and month_year ( eg: 2014_01 )
In [4]:
# Flag for Non-numeric penalty
data['penalty_reported'] = (data.penalty
                            .map(lambda x: 1 
                                 if x[0] =='$'
                                 else x  )
                           )

# Amount into a numeric
data['penalty_amount'] = (data.penalty
                        .str.replace(',', '')
                        .str.replace('$', '')
                        .map(lambda x: int(x) 
                             if x[0] != 'U' 
                             else None)
                         )
data['penalty_amount_millions'] = data['penalty_amount'].map(
    lambda x: float(x)/1000000 
    if x 
    else None) 

# Date parsing

data['settle_date'] = pd.to_datetime(
    data.settlementDate , infer_datetime_format =True)

data['settle_year'] = data.settle_date.dt.year
data['settle_month'] = data.settle_date.dt.month

data['settle_month_year'] = data.settle_date.map(
    lambda x: "{}_{}".format(
        x.year, x.month 
        if len(str(x.month)) > 1 
        else "0{}".format(x.month)
    )
)
In [5]:
data.to_csv("clean_instance_details.csv", index=False)
In [6]:
p  = sns.distplot(
        data.penalty_amount_millions.dropna() 
        , kde = False).set_title('histogram of penalty amounts before removing outliers')
In [7]:
# remove outliers in penalty amount
outliers = data[np.abs(data.penalty_amount_millions - data.penalty_amount_millions.mean())\
             >=(3*data.penalty_amount_millions.std())]
data = data[~data.index.isin(outliers.index)]

print(data.shape)
print(outliers.shape)
outliers[['contractors.text', 'settle_year','court type', 'type' , 'penalty_amount_millions']].sort_values('penalty_amount_millions', ascending = False)
(2124, 20)
(7, 20)
Out[7]:
contractors.text settle_year court type type penalty_amount_millions
1198 BP P.L.C. 2010 Civil Environment 18700
1454 BP P.L.C. 2012 Civil Environment 7800
618 Merck & Co., Inc. 2007 Civil Health 4850
1464 BP P.L.C. 2012 Criminal Environment 4525
547 GlaxoSmithKline 2006 Civil Tax 3400
1382 GlaxoSmithKline 2011 Civil Health 3000
984 Pfizer, Inc. 2009 Criminal Health 2300
In [8]:
outliers.url.values
Out[8]:
array(['http://www.contractormisconduct.org/misconduct/748/1',
       'http://www.contractormisconduct.org/misconduct/837/1',
       'http://www.contractormisconduct.org/misconduct/1289/1',
       'http://www.contractormisconduct.org/misconduct/1519/1',
       'http://www.contractormisconduct.org/misconduct/1718/1',
       'http://www.contractormisconduct.org/misconduct/1792/1',
       'http://www.contractormisconduct.org/misconduct/1802/1'], dtype=object)
In [9]:
p  = sns.distplot(
        data.penalty_amount_millions.dropna() 
        , kde = False
        , bins = 30).set_title('histogram of penalty amounts after outlier removal')
In [10]:
p  = sns.pointplot(data = data
    , y = 'penalty_amount_millions'
    , x = 'settle_year'
    , order = range(1995,2016)
    , estimator = np.median
).set_title('median penalty per year')
x = plt.xticks(rotation=90) 
/usr/local/lib/python3.4/dist-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

Helper functions

In [11]:
def cumulative_col(df, col):
    colname= 'cum_sum__{}'.format(col)
    df[colname] = df[col].cumsum()
    df['cum_perc__{}'.format(col)] = 100*df[colname]/df[col].sum()

Contractor instances

Of the 152 contractors referenced in the database, 50% of observed misconduct instances involve one of 20 contractors.

25% of misconduct instances involve one of 7 contractors

In [12]:
len(data['contractors.text'].unique())
Out[12]:
152
In [13]:
contractors = (data
               .groupby('contractors.text')
               .index
               .count()
               .sort_values(ascending=False)
               .reset_index()
              )
cumulative_col(contractors, 'index')
print(contractors.head(7))
cumulative = contractors.cum_perc__index.plot()
               contractors.text  index  cum_sum__index  cum_perc__index
0                     BP P.L.C.     98              98         4.613936
1                   Exxon Mobil     95             193         9.086629
2               Lockheed Martin     83             276        12.994350
3                Boeing Company     76             352        16.572505
4             FedEx Corporation     69             421        19.821092
5              General Electric     67             488        22.975518
6  Honeywell International Inc.     62             550        25.894539
In [14]:
myplot = sns.distplot(
    data.groupby('contractors.text').index.count().sort_values(ascending=False)
    , kde=False
    , bins=10
    , axlabel = 'number of contractors \n binned by count of misconduct instances'
    # , label = 'number of contractors'
    ).set_title("number of contractors with X infractions")
In [15]:
contractors = (data
               [data.settle_year >= 2000 ]
               .groupby('contractors.text')
               .index
               .count()
               .sort_values(ascending=False)
               .reset_index()
              )
cumulative_col(contractors, 'index')
print(contractors.head(7))
cumulative = contractors.cum_perc__index.plot()
               contractors.text  index  cum_sum__index  cum_perc__index
0                     BP P.L.C.     94              94         4.747475
1                   Exxon Mobil     94             188         9.494949
2                Boeing Company     70             258        13.030303
3             FedEx Corporation     67             325        16.414141
4               Lockheed Martin     61             386        19.494949
5  Honeywell International Inc.     59             445        22.474747
6               GlaxoSmithKline     56             501        25.303030

A Quick look at the 7 contractors

who are involved in 25% of all misconduct instances since 2000

First: mean penalty amount per year

To Do:

  1. Some between group testing
In [17]:
data['top'] = data['contractors.text'].isin(contractors.head(7)['contractors.text'])

fplot = sns.factorplot(
    data = data[data.top ==  True]
    , x = 'settle_year'
    , y = 'penalty_amount_millions'
    , estimator = np.mean
    , row = 'contractors.text'
    , ci = 90
    , aspect = 3.5
)

x = plt.xticks(rotation=90) 

The chart below displays the number of misconduct instances per year for the 7 contractors that account for 25% of all misconduct instances in the database.

In [18]:
fplot = sns.factorplot(
    data = (data
            [data.top ==  True]
            .groupby(['settle_year', 'contractors.text', 'penalty_reported'])
            .index.count()
            .reset_index()
           ).rename(columns ={'index':'instance_count'})
    , x = 'settle_year'
    , y = 'instance_count'
    , estimator = np.mean
    , row = 'contractors.text'
    , ci = 90
    , size = 3
    , aspect = 4
    , hue = 'penalty_reported'
)

x = plt.xticks(rotation=90)
In [42]:
sns.factorplot(
             data = data[data['type'].isin(['Environment','Health', 'Labor'])]
             ,  x = 'settle_year'
             ,  y = 'penalty_amount_millions'
             , hue = 'top'
             , row = 'type'
            , estimator = np.mean
            , size = 2
            , aspect = 4)

x = plt.xticks(rotation=90)
In [43]:
fplot = sns.factorplot(
    data = (data)
    , x = 'settle_year'
    , y = 'penalty_amount_millions'
    , estimator = np.mean
    , row = 'type'
    , ci = 90
    , size = 3
    , aspect = 4
    #, hue = 'top'
)

x = plt.xticks(rotation=90)

Contracting parties and enforcement agencies:

None and "non-govermental" are the largest contracting party / enforcement agency combination -- nearly 25% of all instances of misconduct in the dataset fall under that category.

50% of all instances fall under one of 7 agency/contracting party combinations, detailed below

In [29]:
cpa = ( data
        .groupby(['agency','contracting party'])
        .index.count()
        .reset_index()
        .sort_values('index', ascending=False)
      )
cumulative_col(cpa, 'index')
cpa.head(6)
Out[29]:
agency contracting party index cum_sum__index cum_perc__index
110 Non-Governmental None 530 530 24.952919
132 State/Local None 244 774 36.440678
32 Energy Energy 95 869 40.913371
30 EPA None 83 952 44.821092
109 Non-Governmental Non-Governmental 66 1018 47.928437
50 International None 64 1082 50.941620

"None" is the contracting party on more than 60% of instances in the dataset.

Maybe the 'None' category is made up of suits brought by individuals, or someone other than the agencies represented in the category.

In [31]:
plt.xticks(rotation=90) 

cp = data.groupby('contracting party').index.count().reset_index().sort_values('index',ascending=False)
cumulative_col(cp, 'index')
print(cp.head())
bp = (
    sns.barplot(data = cp 
            , order = cp['contracting party'] 
            , x ='contracting party'
            , y = 'index'
            , color ='steelblue')
        .set_title("instances of misconduct by contracting party \n")
)
    contracting party  index  cum_sum__index  cum_perc__index
23               None   1336            1336        62.900188
7              Energy    144            1480        69.679849
2      Defense - Army    120            1600        75.329567
3   Defense - General     90            1690        79.566855
22   Non-Governmental     71            1761        82.909605

"Non-Governmental" is the prosecuting agency on 35% of instances in the dataset.

In [33]:
plt.xticks(rotation=90) 

ca = data.groupby('agency').index.count().reset_index().sort_values('index',ascending=False)
cumulative_col(ca, 'index')
print(ca.head())

bp2 = (
    sns.barplot(data = ca 
            , order = ca['agency'] 
            , x ='agency'
            , y = 'index'
            , color ='steelblue')
        .set_title("instances of misconduct by contracting party")
)
               agency  index  cum_sum__index  cum_perc__index
32   Non-Governmental    760             760        35.781544
40        State/Local    282            1042        49.058380
28  Multiple Agencies    133            1175        55.320151
25            Justice    120            1295        60.969868
14             Energy     97            1392        65.536723

Later, take a look at the content of those misconduct instances, for now, observing that it makes up a large proportion of all instances

In [36]:
# Flag the subset for further exploration
data['flag'] = data.index.isin(
        data[(data.agency == 'Non-Governmental') & 
             (data['contracting party'] == 'None')]
        .index
        ).astype(int)
In [37]:
data.groupby('flag').index.count()
Out[37]:
flag
0    1594
1     530
Name: index, dtype: int64
In [39]:
(data
    .groupby(['settle_year', 'flag'])
     .index
     .count()
     .unstack('flag')
     .fillna(0)
).plot().set_title("number of settlements per year \n comparing flagged group to unflagged")
Out[39]:
<matplotlib.text.Text at 0x7f8a47b9ea58>
In [40]:
(
    (data
        .groupby(['settle_year', 'flag'])
        .index.count().unstack().fillna(0)
    ).apply(lambda x: [100*x[k]/sum(x) for k in x.keys() ], axis = 1)
    .plot()
).set_title("proportion of misconduct instances per year \n that are in the flagged group. \n Flagged = Non-Governmental/None\n")
Out[40]:
<matplotlib.text.Text at 0x7f8a479479b0>

Penalty amount and number of misconduct instances by month

In [53]:
plt.xticks(rotation=90) 

p = sns.barplot(data= data
            , x= 'settle_year'
            , y = 'penalty_amount_millions'
            , order = range(1995,2014)
            #, color = 'steelblue'
            , hue = 'flag'
           ).set_title("number of settlements with pentalty vs without penalty by year")
        
In [42]:
(
    data
    .groupby('settle_year')
    .penalty_amount_millions
    .agg([sum]).plot()
    .set_title(" sum of all settlemens in millions by year")
    )
Out[42]:
<matplotlib.text.Text at 0x7f8a47cb69e8>

Subset to look at one type of instance

Environment

In [327]:
env = data[data.type == 'Environment'][data.settle_year >= 2000 ]
envheat = ( pd.crosstab(index = env.settle_year, columns =  env.agency
            , values = env.penalty_amount_millions
            , aggfunc = [np.mean, len, sum]
         )  ).stack()['sum'].unstack('agency')
sns.heatmap(envheat)
Out[327]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcb4979af98>

Disclosed penalty vs undisclosed

In [55]:
(data
     .groupby('settle_year')
     [['index','penalty_amount_millions']]
     .count()
     .plot()
     .set_title("settlement count with payment vs \n without payment/ undiscolsed")
 )
Out[55]:
<matplotlib.text.Text at 0x7fcb481c4908>
In [329]:
plt.xticks(rotation=90) 
sns.boxplot(x="type"
            , y='index'
            , hue="penalty_reported",
            data = (data
                    [data.settle_year >= 2000]
                    .groupby(['type','penalty_reported','settle_year'])
                    .index.count()
                    .reset_index()  
                   )
           ).set_title('instances of misconduct per year by type')
Out[329]:
<matplotlib.text.Text at 0x7fcb3cd26198>

A higher proportion of instances have undisclosed penalties over time

In [150]:
(data
    .groupby('settle_year')
    [['index','penalty_amount_millions']].count()
   .apply(lambda x: 100*(1 - x['penalty_amount_millions']/x['index']), axis=1 ) 
).plot().set_title("percent of instances with undisclosed penalty by year ")
Out[150]:
<matplotlib.text.Text at 0x7ff9e294b438>
In [334]:
plt.xticks(rotation=90) 

# Used for sorting the bars
abt = data.groupby('type').penalty_amount_millions.mean().reset_index().sort('penalty_amount_millions',ascending=False)

ax = (sns.barplot(
                x="type"
                , y="penalty_amount_millions"
                , data=data
                , order = abt.type
                , color = 'steelblue')
            .set_title("mean settlement amount by type"))
In [100]:
plt.xticks(rotation=90) 

abtsum = (data.groupby('type')
            .penalty_amount_millions
            .agg([sum,np.mean, len])
            .reset_index()
            .sort('len', ascending=False)
          )

print(abtsum.head())
ax = sns.barplot(
        x="type"
        , y= 'mean' #"penalty_amount_millions"
      ,  data=abtsum
      , order = abtsum.sort('len', ascending=False).type #abt.type 
      , color = 'steelblue'
    ).set_title("mean settlement amount by type\n\
                sorted by number of instances\n ")
                         type          sum       mean  len
12                      Labor  2369.754940   8.555072  474
4                 Environment  7612.879808  25.806372  377
6   Government Contract Fraud  5301.919392  25.489997  294
8                      Health  6702.928853  87.051024  155
1            Consumer Affairs  2857.624373  35.279313  140
In [108]:
k = sns.heatmap(data = data.groupby(['type','settle_year'])
                .index.count().unstack()
                ).set_title("number of instances \n by type of misconduct and year")
In [331]:
k = sns.heatmap(data = data.groupby(['type','settle_year'])
                .penalty_amount_millions.sum().unstack()
                ).set_title("sum of settlement penalties in millions\n by type of misconduct and year")

Disposition

In [225]:
sns.heatmap(data = data.groupby(['disposition','settle_year']).index.count().unstack())
Out[225]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff9e1b70da0>
In [110]:
sns.heatmap(data = data.groupby(['disposition','type']).index.count().unstack())
Out[110]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcb3f171198>
In [112]:
plt.xticks(rotation=90) 

abt = data.groupby('disposition').index.count().reset_index().sort('index',ascending=False)
ax = sns.barplot(
        x="disposition"
        , y="index"
        , data=abt
        , order = abt.disposition
        , color = 'steelblue'
    ).set_title("total number of misconduct instances by disposition \n")

Written by sideprojects in Posts on Sun 03 January 2016. Tags: data analysis, fcmd,