Predicting Criminal Recidivism with R

Can data science indicate what factors affect the rate of criminal or violent recidivism? (Hint: Yes)

Mar 20, 2018 45 min read Ben Hayes

This post was co-authored by David Pinski, a graduate student at Carnegie Mellon University. Please reach out to me directly if you are interested in the R code used in this report.

1. Introduction

Around the United States, municipalities have turned to risk assessment instruments (RAIs) to help judges determine which individuals to release on bail and which ones to keep in custody. The risk assessment process varies with the specific instrument used, but many rely on criminal recidivism data sets. These data sets typically contain demographic indicators (age, race, gender, etc.) as well as criminal history (charges, juvenile record, etc.).

Broward County, Florida, uses one of the most popular RAIs today: COMPAS, the Correctional Offender Management Profiling for Alternative Sanctions tool. COMPAS assesses individuals based on criminal history and social profiling to categorize each individual as low, medium, or high risk. This tool, however, was not developed using the Broward County data set, which may lead to poorly performing predictions for individuals from Broward County.

In the following data analysis, we apply modern data mining techniques to:

Construct an RAI using the Broward County data set to predict two-year recidivism.

Construct an RAI using the Broward County data set to predict two-year violent recidivism.

Evaluate predictive quality for different ethnicities, ages, and genders.

Compare our custom RAI to the proprietary COMPAS RAI.

Before constructing the RAIs and comparing our results to COMPAS, we first explore, clean, describe, and interpret the Broward County data set.

2. Data Exploration

Scope of the Raw Data

The data set provided contains records of individuals from Broward County, Florida, who have been arrested for or charged with a crime. Columns provided in the data set include:

ID
Name
COMPAS Screening Date
Sex, Date of Birth, Age, Age Category, Race
Counts of Juvenile Felonies, Misdemeanors, Other offenses
Priors count
Days between Screening and Arrest
Dates in and out of jail
Charge Offense Date
Days from COMPAS screening
Charge Degree
Charge Description
Is Recidivist? (And other values related to the recidivism charge if applicable)
Is Violent Recidivist? (And other values related to the violent recidivism charge if applicable)
Dates in and out of custody
Two-year Recidivist?
Two-year Violent Recidivist? (This value was calculated manually by multiplying Is Violent Recidivist with Two-Year Recidivist)
COMPAS Decile Score
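
The two-year violent recidivism flag described in the list above can be sketched in base R as follows. This is a minimal illustration on toy data; the column names `is_violent_recid`, `two_year_recid`, and `two_year_violent_recid` are assumptions and may not match the names used in the actual analysis.

```r
# Toy data; the column names are assumed for illustration only.
df <- data.frame(
  is_violent_recid = c(0, 1, 1, 0),
  two_year_recid   = c(0, 1, 0, 1)
)

# The product of two 0/1 indicators is 1 only when both are 1,
# i.e. the person is a violent recidivist AND recidivated within two years.
df$two_year_violent_recid <- df$is_violent_recid * df$two_year_recid

print(df)
```

Multiplying two binary indicators is equivalent to a logical AND, which keeps the derived outcome consistent with both source columns.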

Data Cleaning

Columns

Overall, the data set contains 56 columns but some of these columns are unusable or provide little value for one or more of the following reasons:

Is a unique identification number (individual ID, case/charge number)
Directly relates to known recidivism (this data is unknown when assessing risk of future recidivism at a bail hearing)
Directly relates to known violent recidivism (this data is unknown when assessing risk of future recidivism at a bail hearing)
Reports the type of assessment performed (all cases are ‘Risk of Recidivism’ and ‘Risk of Violence’)

For these reasons, the columns have been ignored and filtered out of the analyzed data set (resulting in 24 columns).

Rows

The data set contains 7214 rows but, similar to the filtering performed by ProPublica, we have filtered out individuals who do not meet certain criteria:

Individuals whose COMPAS-scored charge date was not within 30 days of the arrest date were removed.
Individuals with no COMPAS case were removed.
Individuals with a charge degree of ‘O’ (instead of ‘F’ or ’M’) were removed. These individuals are not expected to serve time in jail.
Individuals with less than two years of time outside of the correctional facility were removed.

For these reasons, the corresponding rows have been removed (resulting in 6172 rows). For more information on the row filtering reasons listed above, please visit ProPublica's source page.
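
A minimal sketch of the first three row filters, applied to a toy data frame, might look like the following. The column names (`days_b_screening_arrest`, `is_recid`, `c_charge_degree`) and the use of `-1` to mark a missing COMPAS case are assumptions made for illustration. The two-year follow-up criterion is omitted because the post does not say how it was computed.

```r
# Toy records; column names and the -1 sentinel are assumptions.
raw <- data.frame(
  id                      = 1:5,
  days_b_screening_arrest = c(0, 45, -1, NA, 3),
  is_recid                = c(0, 1, -1, 0, 1),
  c_charge_degree         = c("F", "M", "F", "M", "O"),
  stringsAsFactors        = FALSE
)

keep <- !is.na(raw$days_b_screening_arrest) &
  abs(raw$days_b_screening_arrest) <= 30 &  # within 30 days of arrest
  raw$is_recid != -1 &                      # a COMPAS case exists
  raw$c_charge_degree != "O"                # keep only 'F' and 'M' charges

filtered <- raw[keep, ]
print(filtered)
```

In this toy example only the first record survives: the others fail the 30-day window, have no COMPAS case, have a missing value, or carry a charge degree of 'O'.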

Feature Engineering

The raw columns contain information that can be reshaped into additional features/variables that may improve the predictive performance of our models. Below is the list of new features, each with a brief explanation:

Days spent in jail

While we lose the exact dates on which an individual entered and exited jail, we gain the ability to see whether the length of the jail stay affects the recidivism rate. To calculate this value, we subtract the date the person entered jail from the date the person exited jail, which gives the number of days spent in jail.

Days spent in custody

Similarly, we compute days spent in custody to determine whether time in custody is important for predicting criminal recidivism. To calculate this value, we subtract the date the person entered custody from the date the person exited custody, which gives the number of days spent in custody.
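
The two date subtractions can be sketched in base R as below. The dates are made up, and the column names (`jail_in`, `jail_out`, `custody_in`, `custody_out`) are hypothetical placeholders rather than the names in the original data set.

```r
# Made-up dates; column names are hypothetical.
stays <- data.frame(
  jail_in     = as.Date(c("2013-08-13", "2013-01-26")),
  jail_out    = as.Date(c("2013-08-14", "2013-02-05")),
  custody_in  = as.Date(c("2014-07-07", "2013-01-26")),
  custody_out = as.Date(c("2014-07-14", "2013-02-05"))
)

# Subtracting two Date vectors yields a difftime in days;
# as.numeric() turns it into a plain number of days.
stays$days_in_jail    <- as.numeric(stays$jail_out - stays$jail_in)
stays$days_in_custody <- as.numeric(stays$custody_out - stays$custody_in)

print(stays[, c("days_in_jail", "days_in_custody")])
```

If the raw columns hold timestamps rather than dates, converting them with `as.Date()` first keeps the result in whole days.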

Number of juvenile charges (felony, misdemeanor, other)

The original data set provides the number of juvenile charges, separated by type: felony, misdemeanor or other. To analyze the impact of criminal activity from an individual's youth, we summed the counts together. While we may lose the severity of the crime(s), we gain insight into how much juvenile criminal activity, as a whole, feeds into criminal recidivism.
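
As a rough sketch, the summed juvenile count could be computed with `rowSums()`. The name `total_juv_count` comes from this post; the three input column names are assumptions.

```r
# Toy counts; the three input column names are assumed.
juv <- data.frame(
  juv_fel_count   = c(0, 1, 2),
  juv_misd_count  = c(0, 0, 1),
  juv_other_count = c(0, 3, 0)
)

# Add the three juvenile counts row by row.
juv$total_juv_count <- rowSums(
  juv[, c("juv_fel_count", "juv_misd_count", "juv_other_count")]
)

print(juv)
```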

Charge category

For each individual, a description of their charge was provided. This information, while useful to an individual reading a police report, is not well suited for data analysis. Data labeled as "Driving License Suspended" would not be considered the same as data labeled as "DWLS Susp/Cancel Revoked" or "Susp Drivers Lic 1st Offense". These may have subtle differences in length of sentence or the size of a fine but provide more value when considered together. In this example, these and other related offenses have been categorized as "Driving/DUI".

We used this opportunity to also further categorize drug-related crimes. For these offenses, we have two high-level groupings: one for cannabis-related offenses, and one for non-cannabis-related offenses (cocaine, methamphetamine, heroin, synthetic drugs, etc.). This grouping and others will allow us to determine whether the type of crime committed impacts the recidivism rate.

Other charge categories include: assault, battery, burglary, resisting, criminal mischief, tampering, and lewdness.
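
One way to approximate this kind of grouping is keyword matching on the charge description. The sketch below is not the mapping used in this analysis: the `categorize_charge()` helper, the keyword patterns, and every category label other than "Driving/DUI" are hypothetical, and the last three descriptions are illustrative.

```r
# Hypothetical helper: maps free-text charge descriptions to coarse categories.
categorize_charge <- function(x) {
  x <- tolower(x)
  out <- rep("Other", length(x))
  out[grepl("batt", x)]                <- "Battery"
  out[grepl("cocaine|heroin|meth", x)] <- "Drug (non-cannabis)"
  out[grepl("cannabis|marijuana", x)]  <- "Drug (cannabis)"
  out[grepl("driv|dwls|dui", x)]       <- "Driving/DUI"
  out
}

desc <- c(
  "Driving License Suspended",
  "DWLS Susp/Cancel Revoked",
  "Susp Drivers Lic 1st Offense",
  "Possession of Cannabis",
  "Possession of Cocaine",
  "Battery"
)

print(data.frame(description = desc, category = categorize_charge(desc)))
```

Later assignments overwrite earlier ones, so the order of the `grepl()` lines decides which category wins when a description matches more than one pattern.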

Involved firearm

Since we are tasked with predicting not only general recidivism but also violent recidivism, we chose to engineer a binary variable for 'firearm' or 'deadly weapon' related offenses. The suspicion is that individuals involved in a firearm-related crime will be more likely to recidivate in the future (particularly within two years). Any record with a description referring to 'firearm', 'deadly weapon', 'throwing missile into vehicle', or other related charges was labeled as '1'; all others were labeled as '0'.
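
A flag like this could be built with a case-insensitive pattern match. The sketch below covers only the three phrases named above, not the "other related charges"; the example descriptions and the variable name `involved_firearm` are illustrative assumptions.

```r
# Illustrative descriptions; not taken from the data set.
charge_desc <- c(
  "Possession of Firearm by Convicted Felon",
  "Aggravated Assault w/Deadly Weapon",
  "Throwing Missile Into Vehicle",
  "Petit Theft"
)

# Only the phrases named in the text; the real list may be longer.
pattern <- "firearm|deadly weapon|missile"

# grepl() returns TRUE/FALSE; as.integer() converts to 1/0.
involved_firearm <- as.integer(grepl(pattern, charge_desc, ignore.case = TRUE))

print(involved_firearm)
```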

Polynomial transformations

We included second- and third-degree terms for three continuous variables: "age", "priors_count", and "total_juv_count" (the variable created that sums up all juvenile offenses). These transformations will allow our models to capture any nonlinear effects that these variables have on recidivism and violent recidivism.
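
The polynomial terms can be added as explicit columns, as in the sketch below. The three variable names come from the text above; the `_sq` and `_cu` suffixes and the toy values are assumptions.

```r
# Toy values; only the three base column names come from the post.
d <- data.frame(
  age             = c(20, 30, 40),
  priors_count    = c(0, 2, 5),
  total_juv_count = c(0, 1, 3)
)

# Add squared and cubed versions of each continuous variable.
for (v in c("age", "priors_count", "total_juv_count")) {
  d[[paste0(v, "_sq")]] <- d[[v]]^2
  d[[paste0(v, "_cu")]] <- d[[v]]^3
}

print(names(d))
```

An alternative is to leave the data frame unchanged and write the terms in the model formula, for example `I(age^2)` or `poly(age, 3, raw = TRUE)`.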

Descriptive Statistics

To familiarize ourselves with the data set, we evaluate the key variables including the outcome variable. In the following section we describe the data and the distributions for each variable.

Age

The age variable is clearly right-skewed with the majority of individuals in the data set falling between the ages of 20 and 30. The average age is 34.5 years old.

Although older age is typically less associated with crime, the data set does include elderly individuals (60 years or older) from Broward County. At the other end of the age spectrum, there are no individuals below 18. We suspect this is because detailed juvenile records are inaccessible.

Questions that are outside of the scope of this analysis but possibly interesting to study include: Do individuals nearing age milestones commit more crimes? Do individuals nearing retirement commit more crimes (relative to individuals a few years further away from retirement)?

Gender

The gender variable also appears unevenly distributed between men and women. The majority of individuals, 81%, in the data set are men.

Sex	Count	Proportion
Female	1175	0.19
Male	4997	0.81
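
Count and proportion tables like the one above can be produced with `table()` and `prop.table()`. The sketch below rebuilds the sex variable from the counts reported in the table, since the row-level data is not shown in this post.

```r
# Rebuild the variable from the reported counts (1175 women, 4997 men).
sex <- rep(c("Female", "Male"), times = c(1175, 4997))

counts <- table(sex)

summary_tbl <- data.frame(
  Sex        = names(counts),
  Count      = as.vector(counts),
  Proportion = round(as.vector(prop.table(counts)), 2)
)

print(summary_tbl)
```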

Race

The race variable also appears unevenly distributed. African-American individuals account for 50% of the data set, while the share of Asian individuals rounds to 0%.

Priors Count

Similar to the age variable, the priors count variable is heavily right-skewed. The average number of prior convictions is 3.25 with 34% of individuals having 0 priors.

Juvenile Charges Count

Since prior convictions are right-skewed, we might expect juvenile charges to be right-skewed as well, and that is what we find in the Broward County data set. The average number of juvenile charges is 0.26, and 87% of individuals have 0 juvenile charges.

Days in Jail

The days in jail variable is also heavily right-skewed. The average number of days in jail is 15.11 with 11% of individuals having spent 0 days in jail. The maximum number of days spent in jail is 800.

Days in Jail	Count	Proportion
0 - 49 Days	5718	0.926
50 - 99 Days	211	0.034
100+ Days	243	0.039
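
The day ranges in the table above can be reproduced with `cut()`. This sketch uses a handful of made-up values and assumes that days are whole numbers.

```r
# Made-up values chosen to hit every bin boundary.
days_in_jail <- c(0, 12, 49, 50, 99, 100, 800)

# With the default right = TRUE, the intervals are
# (-Inf, 49], (49, 99], and (99, Inf].
bins <- cut(
  days_in_jail,
  breaks = c(-Inf, 49, 99, Inf),
  labels = c("0 - 49 Days", "50 - 99 Days", "100+ Days")
)

print(table(bins))
```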

Days in Custody

The days in custody variable is also heavily right-skewed. The average number of days in custody is 35.97 with 11% of individuals having spent 0 days in custody. The maximum number of days spent in custody is 6035.

Days in Custody	Count	Proportion
0 - 49 Days	5391	0.873
50 - 99 Days	326	0.053
100+ Days	455	0.074

Charge Degree

The charge degree variable indicates that the majority of individuals, 64%, in the data set are charged with a felony as opposed to a misdemeanor.

Charge Degree	Count	Proportion
Felony	3970	0.643
Misdemeanor	2202	0.357

Charge Category

When reviewing the charge category plot, we notice that a large proportion of individuals, 24%, are charged with battery and 13% are not charged at all (only arrested).

Firearm Involvement

Firearm is one of the features that we engineered to capture the nature of recidivism. In the Broward County data set, only 4.1% of individuals are charged with a crime that is described as involving a 'firearm' or 'deadly weapon'.

Involved Firearm	Count	Proportion
No Weapon/Firearm	5922	0.959
Weapon/Firearm	250	0.041

Days Between Screening and Arrest

For days between screening and arrest, we see that the majority of individuals are screened within 0 to 1 days of arrest. There are cases when an individual is screened prior to their arrest.

Days Between Screening and Arrest	Count	Proportion
Screened 1 Day Or More Before Arrest	69	0.011
Screened Same Day as Arrest	1379	0.223
Screened 1 Day After Arrest	3980	0.645
Screened 2 to 5 Days After Arrest	338	0.055
Screened 6 or More Days After Arrest	406	0.066

Outcome Variables (Two-Year Recidivism & Violent Two-Year Recidivism)

For the outcome variables, we find that the rate of recidivism is higher than the rate of violent recidivism. The prevalence (baseline) of recidivism is about 46%, and the prevalence of violent recidivism is about 11%. These figures are important to keep in mind as we evaluate each variable and the performance of our models. For example, a model that assumes no individual violently recidivates would be approximately 89% accurate, which is misleading because it would not identify a single violent recidivist.

Recidivism	Count	Proportion
0	3363	0.545
1	2809	0.455

Violent Recidivism	Count	Proportion
0	5520	0.894
1	652	0.106
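
The baseline accuracy can be checked directly from the counts in the violent recidivism table. In this sketch, `pred_none` stands for a hypothetical model that predicts "no violent recidivism" for everyone.

```r
# Rebuild the outcome from the reported counts (5520 zeros, 652 ones).
violent <- rep(c(0, 1), times = c(5520, 652))

# Hypothetical model: predict 0 for every individual.
pred_none <- rep(0, length(violent))

# Share of predictions that match the true outcome.
accuracy <- mean(pred_none == violent)

# Share of actual violent recidivists that the model catches.
sensitivity <- sum(pred_none == 1 & violent == 1) / sum(violent == 1)

print(round(accuracy, 3))
print(sensitivity)
```

High accuracy combined with zero sensitivity is why accuracy alone is a poor yardstick for an outcome this imbalanced.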

Variable Impact on Recidivism and Violent Recidivism

Now that we have described the data, we perform a cursory visual inspection of the relationship between each variable and our outcomes: general two-year recidivism and violent two-year recidivism.

Age

For the age variable, we observe a decrease in the rate of recidivism as age increases. This relationship holds for both general recidivism and violent recidivism. Ages over 66 were binned together because the number of individuals at those ages is low.
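
A rough sketch of the binning and the rate-by-age calculation is shown below. The data are made up, and collapsing ages above 66 into a single value of 67 is one assumed way to implement the bin.

```r
# Made-up ages and outcomes, for illustration only.
age   <- c(19, 19, 25, 25, 40, 67, 70, 83)
recid <- c(1, 1, 1, 0, 0, 0, 1, 0)

# Collapse every age above 66 into one bin, represented as 67.
age_binned <- pmin(age, 67)

# The mean of a 0/1 outcome within each age is the recidivism rate.
rate_by_age <- tapply(recid, age_binned, mean)

print(rate_by_age)
```

Pooling the sparse upper ages keeps a few individuals from producing rates of exactly 0 or 1 that would distort the plot.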