CNIT 58100-RDM | Responsible Data Science Lab at Purdue

Purdue CNIT581-RDM: Responsible Data Management (archived)
Spring 2023

Computer & Information Technology
Purdue University

TL;DR: Interested in data management and/or machine learning?
Consider taking CNIT 58100-RDM in Spring 2023.

Questions?
Send an email to the instructor
at rpradhan@purdue.edu

COURSE OVERVIEW

Responsible data management (RDM) is a fast-growing research area focused on responsible data handling practices in data-driven decision-making systems. Research in this area is centered around the transparency of data, algorithms, and data science (DS) pipelines.

This course examines advanced topics in algorithmic fairness, transparency, and interpretability of data-driven decision-making systems. We will study current issues related to the transparency and fairness of data-driven decisions, examine sources of unexpected and discriminatory behavior in such systems, contrast existing methods, and design novel techniques to mitigate undesired system decisions. Topics include algorithmic bias, fairness metrics, debugging and mitigating bias/errors, interpretability of algorithms, the data science lifecycle, and bias in DS pipelines.

When: TR 3:00 - 4:15 PM

Where: KNOY Hall, Room B031

Lecture style: The lectures will be a mix of traditional lectures, paper readings/presentations, and practical problem solving that examine responsible data management from different angles. As a side goal, we will identify potential open problems for further research.

Prerequisites: Any undergraduate data management course and exposure to machine learning.

INSTRUCTOR

Romila Pradhan
Email: rpradhan@purdue.edu

EVALUATION

There will be 2-3 assignments followed by a semester-long project chosen by the student. Each student is also expected to give 2-3 research paper presentations. The project will be a group project. Students will be evaluated as follows:

Project (40%): proposal + initial draft (5%), presentation (15%), report (20%)
Paper presentations (25%): 2-3 in-class presentations
Assignments (35%)
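
As a quick sanity check on the breakdown above, the weights can be combined as in the following minimal sketch. The weights come from the list above; the function name, the component keys, and the 0-100 score scale are assumptions made only for illustration and are not part of the official grading policy.

```python
# Hypothetical helper: weights mirror the evaluation breakdown above.
# Component names and the 0-100 score scale are illustrative assumptions.
WEIGHTS = {
    "project_proposal_and_draft": 5,
    "project_presentation": 15,
    "project_report": 20,
    "paper_presentations": 25,
    "assignments": 35,
}

# The five components should account for the whole grade.
assert sum(WEIGHTS.values()) == 100


def final_score(scores):
    """Return the weighted course score for per-component scores in [0, 100]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


example = {
    "project_proposal_and_draft": 90,
    "project_presentation": 90,
    "project_report": 90,
    "paper_presentations": 80,
    "assignments": 70,
}
print(final_score(example))  # 80.5
```

In this example the three project components (40% in total) contribute 36 points, the paper presentations 20, and the assignments 24.5.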

COURSE PROJECT

For the course project, you will work (individually or in teams of 2 or 3) to produce a research paper and present a research talk during the final weeks of the course. The project description will be provided in the second week of classes. There are four submissions for the class project: the initial project proposal, an intermediate draft of the paper, the final paper, and the final talk. The initial project proposal and the intermediate draft will be submitted primarily for feedback from the instructor.

ASSIGNMENTS

Assignments will consist of reviewing papers and summarizing student paper presentations. During each presentation, each student will provide 3 questions and 3 comments. One of the groups will be responsible for submitting a written report containing a brief review of the paper along with summarized responses to student questions. This written report will be shared with everyone in the class.

TENTATIVE SCHEDULE

Week 1: Introduction and background
Week 2: Algorithmic fairness
Week 3: Data science lifecycle and bias in data science pipelines
Weeks 4-5: Fairness metrics
Weeks 6-7: Bias mitigation techniques
Weeks 7-9: Explainability and interpretability of ML models
Week 10: SPRING BREAK
Weeks 11-15: Debugging ML models and pipelines
Week 16: Data management challenges in production ML; Project presentations

LIST OF PAPERS (tentative)

Week 1: Introduction and background

Julia Stoyanovich, Serge Abiteboul, Bill Howe, H. V. Jagadish, and Sebastian Schelter. 2022. Responsible Data Management. Communications of the ACM.
Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine Bias. ProPublica.

Week 2: Algorithmic fairness

Ramya Srinivasan and Ajay Chander. 2021. Biases in AI Systems: A survey for practitioners. ACM Queue.
Batya Friedman and Helen Nissenbaum. 1996. Bias in computer systems. ACM Transactions on Information Systems.

Week 3: Data science lifecycle and bias in data science pipelines

Jeannette M. Wing. 2019. The Data Life Cycle. Harvard Data Science Review.
Sumon Biswas, Mohammad Wardat, and Hridesh Rajan. 2022. The art and practice of data science pipelines: A comprehensive study of data science pipelines in theory, in-the-small, and in-the-large. In Proceedings of the 44th International Conference on Software Engineering (ICSE '22).

Weeks 4-5: Fairness metrics

Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A Survey on Bias and Fairness in Machine Learning. ACM Computing Surveys.
Sahil Verma and Julia Rubin. 2018. Fairness Definitions Explained. 2018 ACM/IEEE International Workshop on Software Fairness.
Dana Pessach and Erez Shmueli. 2022. A Review on Fairness in Machine Learning. ACM Computing Surveys. (Sections 1 through 3)

Week 6: Bias mitigation techniques (pre-processing)

Faisal Kamiran and Toon Calders. 2012. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems. (Presenter: Anuj; Summary: Xinning, Mensah)
Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. 2015. Certifying and Removing Disparate Impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). (Presenter: Shashank; Summary: Tejendra, Dairian)
Zemel, R., Wu, Y., Swersky, K., Pitassi, T. & Dwork, C. 2013. Learning Fair Representations. Proceedings of the 30th International Conference on Machine Learning, in Proceedings of Machine Learning Research. (Presenter: Kevin; Summary: Yuzhe, Meher)
Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. 2019. Interventional Fairness: Causal Database Repair for Algorithmic Fairness. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). (Presenter: Yi; Summary: Divya, Ekta)

Week 7: Bias mitigation techniques (in-processing + post-processing)

Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. 2017. Fairness Beyond Disparate Treatment & Disparate Impact: Learning Classification without Disparate Mistreatment. In Proceedings of the 26th International Conference on World Wide Web (WWW '17). (Presenter: Ekta; Summary: Yi, Kevin)
Maya Gupta, Andrew Cotter, Mahdi Milani Fard, and Serena Wang. 2018. Proxy Fairness. arXiv. (Presenter: Divya; Summary: Shashank, Anuj)
Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of opportunity in supervised learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). (Presenter: Meher; Summary: Xinning, Tejendra)
Dwork, C., Immorlica, N., Kalai, A.T. & Leiserson, M. 2018. Decoupled Classifiers for Group-Fair and Efficient Machine Learning. Proceedings of the 1st Conference on Fairness, Accountability and Transparency, in Proceedings of Machine Learning Research. (Presenter: Yuzhe; Summary: Mensah, Dairian)

Week 8: Explainability and interpretability of ML models

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). (Presenter: Dairian; Summary: Yuzhe, Divya)
Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). (Presenter: Tejendra; Summary: Meher, Ekta)
Koh, P.W. & Liang, P. 2017. Understanding Black-box Predictions via Influence Functions. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research. (Presenter: Mensah; Summary: Yi, Kevin)
Ghorbani, A. & Zou, J. 2019. Data Shapley: Equitable Valuation of Data for Machine Learning. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research. (Presenter: Xinning; Summary: Shashank, Anuj)

Week 9: Debugging ML models and pipelines (model performance)

Chung, Y., Kraska, T., Polyzotis, N., Tae, K. H., & Whang, S. E. 2019. Slice finder: Automated data slicing for model validation. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). (Presenter: Shashank; Summary: Yi, Mensah)
Weiyuan Wu, Lampros Flokas, Eugene Wu, and Jiannan Wang. 2020. Complaint-driven Training Data Debugging for Query 2.0. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). (Presenter: Anuj; Summary: Kevin, Ekta)
Laure Berti-Equille. 2019. Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation. In The World Wide Web Conference (WWW '19). (Presenter: Kevin; Summary: Anuj, Tejendra)
Yanhui Li, Linghan Meng, Lin Chen, Li Yu, Di Wu, Yuming Zhou, and Baowen Xu. 2022. Training data debugging for the fairness of machine learning software. In Proceedings of the 44th International Conference on Software Engineering (ICSE '22). (Presenter: Yi; Summary: Divya, Dairian)

Week 10: SPRING BREAK

Week 11: Debugging ML models and pipelines (model performance)

Junwen Yang, Yeye He, and Surajit Chaudhuri. 2021. Auto-pipeline: synthesizing complex data pipelines by-target using reinforcement learning and search. Proceedings of the VLDB Endowment. (Presenter: Ekta; Summary: Shashank, Yuzhe)
Robin Cugny, Julien Aligon, Max Chevalier, Geoffrey Roman Jimenez, and Olivier Teste. 2022. AutoXAI: A Framework to Automatically Select the Most Adapted XAI Solution. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management (CIKM '22). (Presenter: Divya; Summary: Meher, Xinning)
Raoni Lourenço, Juliana Freire, and Dennis Shasha. 2020. BugDoc: Algorithms to Debug Computational Processes. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). (Presenter: Meher; Summary: Kevin, Xinning)
Sainyam Galhotra, Anna Fariha, Raoni Lourenço, Juliana Freire, Alexandra Meliou, and Divesh Srivastava. 2022. DataPrism: Exposing Disconnect between Data and Systems. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22). (Presenter: Yuzhe; Summary: Ekta, Meher)

Week 12: Debugging ML models and pipelines (model performance, data acquisition)

Ki Hyun Tae and Steven Euijong Whang. 2021. Slice Tuner: A Selective Data Acquisition Framework for Accurate and Fair Machine Learning Models. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21). (Presenter: Dairian; Summary: Yi, Shashank)
A. Asudeh, Z. Jin and H. V. Jagadish. 2019. Assessing and Remedying Coverage for a Given Dataset. IEEE 35th International Conference on Data Engineering (ICDE). (Presenter: Tejendra; Summary: Mensah, Yuzhe)
Chengliang Chai, Jiabin Liu, Nan Tang, Guoliang Li, and Yuyu Luo. 2022. Selective data acquisition in the wild for model charging. Proceedings of the VLDB Endowment. (Presenter: Mensah; Summary: Anuj, Dairian)
Abolfazl Asudeh, Nima Shahbazi, Zhongjun Jin, and H. V. Jagadish. 2021. Identifying Insufficient Data Coverage for Ordinal Continuous-Valued Attributes. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21). (Presenter: Xinning; Summary: Tejendra, Divya)

Week 13: Debugging ML models and pipelines (impact of data preprocessing, data cleaning)

Sainyam Galhotra, Karthikeyan Shanmugam, Prasanna Sattigeri, and Kush R. Varshney. 2022. Causal Feature Selection for Algorithmic Fairness. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22).
Nianyun Li, Naman Goel, and Elliott Ash. 2022. Data-Centric Factors in Algorithmic Fairness. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society (AIES '22).
Yiqiao Liao and Parinaz Naghizadeh. 2023. Social Bias Meets Data Bias: The Impacts of Labeling and Measurement Errors on Fairness Criteria. To appear in Proceedings of AAAI 2023.
Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, and Eugene Wu. 2017. BoostClean: Automated Error Detection and Repair for Machine Learning. arXiv.

Week 14: Debugging ML models and pipelines (impact of data processing, data cleaning)

Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. 2016. ActiveClean: interactive data cleaning for statistical modeling. Proceedings of the VLDB Endowment.
Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Noack, Hendrik Patzlaff, Felix Naumann, and Hazar Harmouch. The Effects of Data Quality on Machine Learning Performance. arXiv.
Felix Neutatz, Binger Chen, Yazan Alkhatib, Jingwen Ye, and Ziawasch Abedjan. 2022. Data Cleaning and AutoML: Would an Optimizer Choose to Clean? Datenbank Spektrum.
Yejia Liu, Weiyuan Wu, Lampros Flokas, Jiannan Wang, and Eugene Wu. 2022. Enabling SQL-based training data debugging for federated learning. Proceedings of the VLDB Endowment.

Week 15: Data preparation auto-learning, evaluation

Cong Yan and Yeye He. 2020. Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20).
Junwen Yang, Yeye He, and Surajit Chaudhuri. 2021. Auto-pipeline: synthesizing complex data pipelines by-target using reinforcement learning and search. Proceedings of the VLDB Endowment.
Maliha Tashfia Islam, Anna Fariha, Alexandra Meliou, and Babak Salimi. 2022. Through the Data Management Lens: Experimental Analysis and Evaluation of Fair Classification. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22).
Sumon Biswas and Hridesh Rajan. 2021. Fair preprocessing: towards understanding compositional fairness of data transformers in machine learning pipeline. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2021).

Week 16: Project presentations