Purdue CNIT581-RDM: Responsible Data Management (archived)
Spring 2023
Computer & Information Technology
Purdue University
TL;DR: Interested in data management and/or machine learning?
Consider taking CNIT 58100-RDM in Spring 2023.
Questions? Send an email to the instructor at rpradhan@purdue.edu
COURSE OVERVIEW
Responsible data management (RDM) is a fast-growing research area focused on responsible data handling practices in data-driven decision-making systems. Research in this area is centered around the transparency of data, algorithms, and data science (DS) pipelines.
This course examines advanced topics relating to algorithmic fairness, transparency, and interpretability of data-driven decision-making systems. We will study current issues related to the transparency and fairness of data-driven decisions, examine sources of unexpected and discriminatory behavior in data-driven decision-making systems, contrast existing methods, and design novel techniques to mitigate undesired system decisions. Topics include algorithmic bias, fairness metrics, debugging and mitigating bias/errors, interpretability of algorithms, the data science lifecycle, and bias in DS pipelines.
When: TR 3:00 - 4:15 PM
Where: KNOY Hall, Room B031
Lecture style: The lectures will be a mix of traditional lectures, paper readings/presentations and practical problem solving, discussing responsible data management from different aspects. As a side goal, we will identify potential open problems for further research.
Prerequisites: Any undergraduate data management course and exposure to machine learning.
INSTRUCTOR
Romila Pradhan
Email: rpradhan@purdue.edu
EVALUATION
There will be 2-3 assignments followed by a semester-long project chosen by the student. Each student is also expected to give 2-3 research paper presentations. The project will be a group project. Students will be evaluated as follows:
- Project (40%): proposal + initial draft (5%), presentation (15%), report (20%)
- Paper presentations (25%): 2-3 in-class presentations
- Assignments (35%)
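As a rough illustration of how the weights above combine, the sketch below computes a final score from per-component scores. It assumes each component is scored on a 0-100 scale and that components are combined linearly; the syllabus does not state either of these, and the names `WEIGHTS`, `final_score`, and the example scores are hypothetical.

```python
# Weights (percent of the final grade) copied from the evaluation breakdown.
WEIGHTS = {
    "project_proposal_and_draft": 5,
    "project_presentation": 15,
    "project_report": 20,
    "paper_presentations": 25,
    "assignments": 35,
}


def final_score(scores):
    """Return the weighted score (0-100) given per-component scores (0-100).

    Assumption: components combine linearly; this is not stated in the syllabus.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the components in WEIGHTS")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical component scores, for illustration only.
example_scores = {
    "project_proposal_and_draft": 90,
    "project_presentation": 80,
    "project_report": 85,
    "paper_presentations": 90,
    "assignments": 88,
}

print(final_score(example_scores))  # 86.8
```

The three project components sum to the 40% project weight, and all five weights sum to 100%.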
COURSE PROJECT
For the course project, you will work (individually or in teams of 2 or 3) to produce a research paper and present a research talk during the final weeks of the course. The project description will be provided in the second week of classes. There are four submissions for the class project: the initial project proposal, an intermediate draft of the paper, the final paper, and the final talk. The initial project proposal and the intermediate draft will be submitted primarily for feedback from the instructor.
ASSIGNMENTS
Assignments will consist of reviewing papers and summarizing student paper presentations. During each presentation, each student will provide 3 questions and 3 comments. One of the groups will be responsible for submitting a written report containing a brief review of the paper along with summarized responses to student questions. This written report will be shared with everyone in the class.
TENTATIVE SCHEDULE
- Week 1: Introduction and background
- Week 2: Algorithmic fairness
- Week 3: Data science lifecycle and bias in data science pipelines
- Weeks 4-5: Fairness metrics
- Weeks 6-7: Bias mitigation techniques
- Weeks 7-9: Explainability and interpretability of ML models
- Week 10: SPRING BREAK
- Weeks 11-15: Debugging ML models and pipelines
- Week 16: Data management challenges in production ML; Project presentations
LIST OF PAPERS (tentative)
Week 1: Introduction and background
- Julia Stoyanovich, Serge Abiteboul, Bill Howe, H. V. Jagadish, and Sebastian Schelter. 2022. Responsible Data Management. Communications of the ACM.
- Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine Bias. ProPublica.
Week 2: Algorithmic fairness
Week 3: Data science lifecycle and bias in data science pipelines
Weeks 4-5: Fairness metrics
Week 6: Bias mitigation techniques (pre-processing)
- Faisal Kamiran and Toon Calders. 2012. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems. (Presenter: Anuj; Summary: Xinning, Mensah)
- Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. 2015. Certifying and Removing Disparate Impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). (Presenter: Shashank; Summary: Tejendra, Dairian)
- Zemel, R., Wu, Y., Swersky, K., Pitassi, T. & Dwork, C. 2013. Learning Fair Representations. Proceedings of the 30th International Conference on Machine Learning, in Proceedings of Machine Learning Research. (Presenter: Kevin; Summary: Yuzhe, Meher)
- Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. 2019. Interventional Fairness: Causal Database Repair for Algorithmic Fairness. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). (Presenter: Yi; Summary: Divya, Ekta)
Week 7: Bias mitigation techniques (in-processing + post-processing)
- Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. 2017. Fairness Beyond Disparate Treatment & Disparate Impact: Learning Classification without Disparate Mistreatment. In Proceedings of the 26th International Conference on World Wide Web (WWW '17). (Presenter: Ekta; Summary: Yi, Kevin)
- Maya Gupta, Andrew Cotter, Mahdi Milani Fard, and Serena Wang. 2018. Proxy Fairness. arXiv. (Presenter: Divya; Summary: Shashank, Anuj)
- Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of opportunity in supervised learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). (Presenter: Meher; Summary: Xinning, Tejendra)
- Dwork, C., Immorlica, N., Kalai, A.T. & Leiserson, M. 2018. Decoupled Classifiers for Group-Fair and Efficient Machine Learning. Proceedings of the 1st Conference on Fairness, Accountability and Transparency, in Proceedings of Machine Learning Research. (Presenter: Yuzhe; Summary: Mensah, Dairian)
Week 8: Explainability and interpretability of ML models
- Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). (Presenter: Dairian; Summary: Yuzhe, Divya)
- Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). (Presenter: Tejendra; Summary: Meher, Ekta)
- Koh, P.W. & Liang, P. 2017. Understanding Black-box Predictions via Influence Functions. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research. (Presenter: Mensah; Summary: Yi, Kevin)
- Ghorbani, A. & Zou, J. 2019. Data Shapley: Equitable Valuation of Data for Machine Learning. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research. (Presenter: Xinning; Summary: Shashank, Anuj)
Week 9: Debugging ML models and pipelines (model performance)
- Chung, Y., Kraska, T., Polyzotis, N., Tae, K. H., & Whang, S. E. 2019. Slice finder: Automated data slicing for model validation. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). (Presenter: Shashank; Summary: Yi, Mensah)
- Weiyuan Wu, Lampros Flokas, Eugene Wu, and Jiannan Wang. 2020. Complaint-driven Training Data Debugging for Query 2.0. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). (Presenter: Anuj; Summary: Kevin, Ekta)
- Laure Berti-Equille. 2019. Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation. In The World Wide Web Conference (WWW '19). (Presenter: Kevin; Summary: Anuj, Tejendra)
- Yanhui Li, Linghan Meng, Lin Chen, Li Yu, Di Wu, Yuming Zhou, and Baowen Xu. 2022. Training data debugging for the fairness of machine learning software. In Proceedings of the 44th International Conference on Software Engineering (ICSE '22). (Presenter: Yi; Summary: Divya, Dairian)
Week 10: SPRING BREAK
Week 11: Debugging ML models and pipelines (model performance)
- Junwen Yang, Yeye He, and Surajit Chaudhuri. 2021. Auto-pipeline: synthesizing complex data pipelines by-target using reinforcement learning and search. Proceedings of the VLDB Endowment. (Presenter: Ekta; Summary: Shashank, Yuzhe)
- Robin Cugny, Julien Aligon, Max Chevalier, Geoffrey Roman Jimenez, and Olivier Teste. 2022. AutoXAI: A Framework to Automatically Select the Most Adapted XAI Solution. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management (CIKM '22). (Presenter: Divya; Summary: Meher, Xinning)
- Raoni Lourenço, Juliana Freire, and Dennis Shasha. 2020. BugDoc: Algorithms to Debug Computational Processes. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). (Presenter: Meher; Summary: Kevin, Xinning)
- Sainyam Galhotra, Anna Fariha, Raoni Lourenço, Juliana Freire, Alexandra Meliou, and Divesh Srivastava. 2022. DataPrism: Exposing Disconnect between Data and Systems. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22). (Presenter: Yuzhe; Summary: Ekta, Meher)
Week 12: Debugging ML models and pipelines (model performance, data acquisition)
- Ki Hyun Tae and Steven Euijong Whang. 2021. Slice Tuner: A Selective Data Acquisition Framework for Accurate and Fair Machine Learning Models. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21). (Presenter: Dairian; Summary: Yi, Shashank)
- A. Asudeh, Z. Jin and H. V. Jagadish. 2019. Assessing and Remedying Coverage for a Given Dataset. IEEE 35th International Conference on Data Engineering (ICDE). (Presenter: Tejendra; Summary: Mensah, Yuzhe)
- Chengliang Chai, Jiabin Liu, Nan Tang, Guoliang Li, and Yuyu Luo. 2022. Selective data acquisition in the wild for model charging. Proceedings of the VLDB Endowment. (Presenter: Mensah; Summary: Anuj, Dairian)
- Abolfazl Asudeh, Nima Shahbazi, Zhongjun Jin, and H. V. Jagadish. 2021. Identifying Insufficient Data Coverage for Ordinal Continuous-Valued Attributes. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21). (Presenter: Xinning; Summary: Tejendra, Divya)
Week 13: Debugging ML models and pipelines (impact of data preprocessing, data cleaning)
- Sainyam Galhotra, Karthikeyan Shanmugam, Prasanna Sattigeri, and Kush R. Varshney. 2022. Causal Feature Selection for Algorithmic Fairness. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22).
- Nianyun Li, Naman Goel, and Elliott Ash. 2022. Data-Centric Factors in Algorithmic Fairness. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society (AIES '22).
- Yiqiao Liao and Parinaz Naghizadeh. 2023. Social Bias Meets Data Bias: The Impacts of Labeling and Measurement Errors on Fairness Criteria. To appear in Proceedings of AAAI 2023.
- Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, and Eugene Wu. 2017. BoostClean: Automated Error Detection and Repair for Machine Learning. arXiv.
Week 14: Debugging ML models and pipelines (impact of data processing, data cleaning)
- Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. 2016. ActiveClean: interactive data cleaning for statistical modeling. Proceedings of the VLDB Endowment.
- Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Noack, Hendrik Patzlaff, Felix Naumann, and Hazar Harmouch. The Effects of Data Quality on Machine Learning Performance. arXiv.
- Felix Neutatz, Binger Chen, Yazan Alkhatib, Jingwen Ye, and Ziawasch Abedjan. 2022. Data Cleaning and AutoML: Would an Optimizer Choose to Clean? Datenbank-Spektrum.
- Yejia Liu, Weiyuan Wu, Lampros Flokas, Jiannan Wang, and Eugene Wu. 2022. Enabling SQL-based training data debugging for federated learning. Proceedings of the VLDB Endowment.
Week 15: Data preparation auto-learning, evaluation
Week 16: Project presentations