Assignments consist of reviewing papers and summarizing student paper presentations. During each presentation, every student will provide three questions and three comments. One group will be responsible for submitting a written report that contains a brief review of the paper and a summary of the responses to student questions. This report will be shared with everyone in the class.
LIST OF PAPERS (tentative)
Week 1: Introduction and background
- Julia Stoyanovich, Serge Abiteboul, Bill Howe, H. V. Jagadish, and Sebastian Schelter. 2022. Responsible Data Management. Communications of the ACM.
- Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner. 2016. Machine Bias. ProPublica.
Week 2: Algorithmic fairness
Week 3: Data science lifecycle and bias in data science pipelines
Weeks 4-5: Fairness metrics
Week 6: Bias mitigation techniques (pre-processing)
- Faisal Kamiran and Toon Calders. 2012. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems. (Presenter: Anuj; Summary: Xinning, Mensah)
- Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. 2015. Certifying and Removing Disparate Impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). (Presenter: Shashank; Summary: Tejendra, Dairian)
- Zemel, R., Wu, Y., Swersky, K., Pitassi, T. & Dwork, C. 2013. Learning Fair Representations. Proceedings of the 30th International Conference on Machine Learning, in Proceedings of Machine Learning Research. (Presenter: Kevin; Summary: Yuzhe, Meher)
- Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. 2019. Interventional Fairness: Causal Database Repair for Algorithmic Fairness. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). (Presenter: Yi; Summary: Divya, Ekta)
Week 7: Bias mitigation techniques (in-processing + post-processing)
- Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. 2017. Fairness Beyond Disparate Treatment & Disparate Impact: Learning Classification without Disparate Mistreatment. In Proceedings of the 26th International Conference on World Wide Web (WWW '17). (Presenter: Ekta; Summary: Yi, Kevin)
- Maya Gupta, Andrew Cotter, Mahdi Milani Fard, and Serena Wang. 2018. Proxy Fairness. arXiv. (Presenter: Divya; Summary: Shashank, Anuj)
- Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of opportunity in supervised learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). (Presenter: Meher; Summary: Xinning, Tejendra)
- Dwork, C., Immorlica, N., Kalai, A.T. & Leiserson, M. (2018). Decoupled Classifiers for Group-Fair and Efficient Machine Learning. Proceedings of the 1st Conference on Fairness, Accountability and Transparency, in Proceedings of Machine Learning Research. (Presenter: Yuzhe; Summary: Mensah, Dairian)
Week 8: Explainability and interpretability of ML models
- Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). (Presenter: Dairian; Summary: Yuzhe, Divya)
- Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). (Presenter: Tejendra; Summary: Meher, Ekta)
- Koh, P.W. & Liang, P. (2017). Understanding Black-box Predictions via Influence Functions. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research. (Presenter: Mensah; Summary: Yi, Kevin)
- Ghorbani, A. & Zou, J. (2019). Data Shapley: Equitable Valuation of Data for Machine Learning. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research. (Presenter: Xinning; Summary: Shashank, Anuj)
Week 9: Debugging ML models and pipelines (model performance)
- Chung, Y., Kraska, T., Polyzotis, N., Tae, K. H., & Whang, S. E. 2019. Slice finder: Automated data slicing for model validation. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). (Presenter: Shashank; Summary: Yi, Mensah)
- Weiyuan Wu, Lampros Flokas, Eugene Wu, and Jiannan Wang. 2020. Complaint-driven Training Data Debugging for Query 2.0. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). (Presenter: Anuj; Summary: Kevin, Ekta)
- Laure Berti-Equille. 2019. Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation. In The World Wide Web Conference (WWW '19). (Presenter: Kevin; Summary: Anuj, Tejendra)
- Yanhui Li, Linghan Meng, Lin Chen, Li Yu, Di Wu, Yuming Zhou, and Baowen Xu. 2022. Training data debugging for the fairness of machine learning software. In Proceedings of the 44th International Conference on Software Engineering (ICSE '22). (Presenter: Yi; Summary: Divya, Dairian)
Week 10: SPRING BREAK
Week 11: Debugging ML models and pipelines (model performance)
- Junwen Yang, Yeye He, and Surajit Chaudhuri. 2021. Auto-pipeline: synthesizing complex data pipelines by-target using reinforcement learning and search. Proceedings of the VLDB Endowment. (Presenter: Ekta; Summary: Shashank, Yuzhe)
- Robin Cugny, Julien Aligon, Max Chevalier, Geoffrey Roman Jimenez, and Olivier Teste. 2022. AutoXAI: A Framework to Automatically Select the Most Adapted XAI Solution. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management (CIKM '22). (Presenter: Divya; Summary: Meher, Xinning)
- Raoni Lourenço, Juliana Freire, and Dennis Shasha. 2020. BugDoc: Algorithms to Debug Computational Processes. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). (Presenter: Meher; Summary: Kevin, Xinning)
- Sainyam Galhotra, Anna Fariha, Raoni Lourenço, Juliana Freire, Alexandra Meliou, and Divesh Srivastava. 2022. DataPrism: Exposing Disconnect between Data and Systems. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22). (Presenter: Yuzhe; Summary: Ekta, Meher)
Week 12: Debugging ML models and pipelines (model performance, data acquisition)
- Ki Hyun Tae and Steven Euijong Whang. 2021. Slice Tuner: A Selective Data Acquisition Framework for Accurate and Fair Machine Learning Models. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21). (Presenter: Dairian; Summary: Yi, Shashank)
- A. Asudeh, Z. Jin and H. V. Jagadish. 2019. Assessing and Remedying Coverage for a Given Dataset. IEEE 35th International Conference on Data Engineering (ICDE). (Presenter: Tejendra; Summary: Mensah, Yuzhe)
- Chengliang Chai, Jiabin Liu, Nan Tang, Guoliang Li, and Yuyu Luo. 2022. Selective data acquisition in the wild for model charging. Proceedings of the VLDB Endowment. (Presenter: Mensah; Summary: Anuj, Dairian)
- Abolfazl Asudeh, Nima Shahbazi, Zhongjun Jin, and H. V. Jagadish. 2021. Identifying Insufficient Data Coverage for Ordinal Continuous-Valued Attributes. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21). (Presenter: Xinning; Summary: Tejendra, Divya)
Week 13: Debugging ML models and pipelines (impact of data preprocessing, data cleaning)
- Sainyam Galhotra, Karthikeyan Shanmugam, Prasanna Sattigeri, and Kush R. Varshney. 2022. Causal Feature Selection for Algorithmic Fairness. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22).
- Nianyun Li, Naman Goel, and Elliott Ash. 2022. Data-Centric Factors in Algorithmic Fairness. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society (AIES '22).
- Yiqiao Liao and Parinaz Naghizadeh. 2023. Social Bias Meets Data Bias: The Impacts of Labeling and Measurement Errors on Fairness Criteria. To appear in Proceedings of AAAI 2023.
- Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, and Eugene Wu. 2017. BoostClean: Automated Error Detection and Repair for Machine Learning. arXiv.
Week 14: Debugging ML models and pipelines (impact of data processing, data cleaning)
- Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. 2016. ActiveClean: interactive data cleaning for statistical modeling. Proceedings of the VLDB Endowment.
- Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Noack, Hendrik Patzlaff, Felix Naumann, and Hazar Harmouch. The Effects of Data Quality on Machine Learning Performance. arXiv.
- Felix Neutatz, Binger Chen, Yazan Alkhatib, Jingwen Ye & Ziawasch Abedjan. 2022. Data Cleaning and AutoML: Would an Optimizer Choose to Clean? Datenbank-Spektrum.
- Yejia Liu, Weiyuan Wu, Lampros Flokas, Jiannan Wang, and Eugene Wu. 2022. Enabling SQL-based training data debugging for federated learning. Proceedings of the VLDB Endowment.
Week 15: Data preparation auto-learning, evaluation
Week 16: Project presentations