CNIT 58100-RDM

Graduate course on responsible data science

Purdue CNIT581-RDM: Responsible Data Management (archived)
Spring 2023

Computer & Information Technology
Purdue University

TL;DR: Interested in data management and/or machine learning?
Consider taking CNIT 58100-RDM in Spring 2023.

Questions?
Send an email to the instructor
at rpradhan@purdue.edu

COURSE OVERVIEW

Responsible data management (RDM) is a fast-growing research area focused on responsible data handling practices in data-driven decision-making systems. Research in this area is centered around the transparency of data, algorithms, and data science (DS) pipelines.

This course examines advanced topics in the algorithmic fairness, transparency, and interpretability of data-driven decision-making systems. We will study current issues related to the transparency and fairness of data-driven decisions, examine sources of unexpected and discriminatory behavior in data-driven decision-making systems, and both contrast existing methods and design novel techniques to mitigate undesired system decisions. Topics include algorithmic bias, fairness metrics, debugging and mitigating bias/errors, interpretability of algorithms, the data science lifecycle, and bias in DS pipelines.
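As a taste of the "fairness metrics" topic above, here is a minimal illustrative sketch (not course material): demographic parity, a standard fairness metric, compares the rate of positive decisions across demographic groups. The loan-decision data below is entirely hypothetical.

```python
def positive_rate(decisions):
    """Fraction of positive (1) decisions in a group."""
    return sum(decisions) / len(decisions)

# Hypothetical binary loan decisions (1 = approved) for two groups.
group_a = [1, 1, 0, 1, 0, 1, 1, 0]   # 5/8 approved
group_b = [0, 1, 0, 0, 1, 0, 0, 0]   # 2/8 approved

# Demographic parity difference: 0 means equal approval rates;
# larger absolute values indicate greater disparity between groups.
disparity = positive_rate(group_a) - positive_rate(group_b)
print(round(disparity, 3))  # 0.625 - 0.25 = 0.375
```

Much of the course concerns where such disparities come from in real DS pipelines and how to mitigate them.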

When: TR 3:00 - 4:15 PM

Where: KNOY Hall, Room B031

Lecture style: Classes will mix traditional lectures, paper readings/presentations, and practical problem solving, examining responsible data management from different angles. As a side goal, we will identify potential open problems for further research.

Prerequisites: Any undergraduate data management course and exposure to machine learning.

INSTRUCTOR

Romila Pradhan
Email: rpradhan@purdue.edu

EVALUATION

There will be 2-3 assignments, followed by a semester-long project on a topic chosen by the student. Each student is also expected to give 2-3 research presentations. The project will be a group project. Students will be evaluated as follows:
  • Project (40%): proposal + initial draft (5%), presentation (15%), report (20%)
  • Paper presentations (25%): 2-3 in-class presentations
  • Assignments (35%)

COURSE PROJECT

For the course project, you will work (individually or in teams of 2 or 3) to produce a research paper and present a research talk during the final weeks of the course. The project description will be provided in the second week of classes. There are four submissions for the class project: the initial project proposal, an intermediate draft of the paper, the final paper, and the final talk. The initial project proposal and the intermediate draft will be submitted primarily for feedback from the instructor.

ASSIGNMENTS

Assignments will consist of reviewing papers and summarizing student paper presentations. During each presentation, every student will submit 3 questions and 3 comments. One group will be responsible for submitting a written report containing a brief review of the paper along with summarized responses to the students' questions. This report will be shared with the entire class.

TENTATIVE SCHEDULE

  • Week 1: Introduction and background
  • Week 2: Algorithmic fairness
  • Week 3: Data science lifecycle and bias in data science pipelines
  • Weeks 4-5: Fairness metrics
  • Weeks 6-7: Bias mitigation techniques
  • Weeks 7-9: Explainability and interpretability of ML models
  • Week 10: SPRING BREAK
  • Weeks 11-15: Debugging ML models and pipelines
  • Week 16: Data management challenges in production ML; Project presentations

LIST OF PAPERS (tentative)

Week 1: Introduction and background
  • Julia Stoyanovich, Serge Abiteboul, Bill Howe, H. V. Jagadish, and Sebastian Schelter. 2022. Responsible Data Management. Communications of the ACM.
  • Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner. 2016. Machine Bias. Propublica.
Week 2: Algorithmic fairness
Week 3: Data science lifecycle and bias in data science pipelines
Weeks 4-5: Fairness metrics
Week 6: Bias mitigation techniques (pre-processing)
Week 7: Bias mitigation techniques (in-processing + post-processing)
Week 8: Explainability and interpretability of ML models
Week 9: Debugging ML models and pipelines (model performance)
Week 10: SPRING BREAK

Week 11: Debugging ML models and pipelines (model performance)
Week 12: Debugging ML models and pipelines (model performance, data acquisition)
Week 13: Debugging ML models and pipelines (impact of data preprocessing, data cleaning)
Week 14: Debugging ML models and pipelines (impact of data processing, data cleaning)
Week 15: Data preparation auto-learning, evaluation
Week 16: Project presentations