Data science can change how any organization operates: not just Facebook or Amazon, but even the American Civil Liberties Union, an organization 10 times older than Facebook. Continue reading to learn about my experience interning with the ACLU's data science team in New York.
Around the United States, municipalities have turned to risk assessment instruments (RAIs) to help judges determine which individuals to release on bail and which to keep in custody. The risk assessment process varies with the specific instrument used, but many rely on criminal recidivism data sets. These data sets typically contain demographic indicators (age, race, gender, etc.) and criminal history (charges, juvenile record, etc.).
Broward County, Florida, uses one of the most popular RAIs today: COMPAS, the Correctional Offender Management Profiling for Alternative Sanctions tool. COMPAS assesses individuals based on criminal history and social profiling and categorizes each as low, medium, or high risk. The tool, however, was not developed using the Broward County data set, which may lead to poorly performing predictions for individuals from Broward County. In this post we construct an RAI, compare it to COMPAS, and discuss our findings.
The issue of gender-income disparity is not new, but it is nuanced. To examine it, we rely on the National Longitudinal Survey of Youth, 1979 cohort, data set (abbreviated ‘NLSY79’). First, to draw strong conclusions, we must evaluate the data set provided: is it accurate, relevant, and useful for drawing statistical conclusions? Second, once we summarize the data, we discuss methodology: how should we approach the data, which variables should we consider, and which techniques are appropriate? Third, we openly discuss findings about the sampled individuals and attempt to infer how other factors relate to the income difference (if any) between men and women in the larger population. Lastly, we end with a discussion of the relevance and significance of our findings given the context of the survey data available and the methodology used.
Understanding the confusion matrix is an important step in statistics, machine learning, or any other field where predictions or classifications are common. The confusion matrix is a two-dimensional contingency table that reveals how well a predictive model performs when the true outcomes are known. Additionally, when the costs of false positives and false negatives differ, the trade-off between them can be optimized. Do you know the difference between Sensitivity, Specificity, Recall, Precision, True Positive Rate, and Positive Predictive Value?
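As a rough illustration of the terms above, here is a minimal Python sketch that tallies a binary confusion matrix and derives the rates from it. The labels are made up for the example, the function names (`confusion_counts`, `rates`, `expected_cost`) are hypothetical, and the costs of 1 and 5 are arbitrary assumptions. It also assumes every denominator is nonzero.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    counts = Counter(zip(y_true, y_pred))  # keys are (actual, predicted) pairs
    tp = counts[(1, 1)]  # actual positive, predicted positive
    fp = counts[(0, 1)]  # actual negative, predicted positive
    fn = counts[(1, 0)]  # actual positive, predicted negative
    tn = counts[(0, 0)]  # actual negative, predicted negative
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """Common rates derived from the four cells (assumes nonzero denominators)."""
    return {
        # Sensitivity, recall, and true positive rate are three names for one quantity.
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Precision and positive predictive value are two names for one quantity.
        "precision": tp / (tp + fp),
    }


def expected_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of the errors when the two kinds of mistake are priced differently."""
    return fp * cost_fp + fn * cost_fn


# Made-up labels, purely for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
metrics = rates(tp, fp, fn, tn)
cost = expected_cost(fp, fn, cost_fp=1, cost_fn=5)  # assumed costs

print(tp, fp, fn, tn)
print(metrics)
print(cost)
```

Comparing `expected_cost` across several candidate classification thresholds, and keeping the cheapest, is one simple way to make the trade-off between the two error types concrete.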