IST 597: Trustworthy Machine Learning [Spring 2026]

Overview

Announcements

Description

This is an intensive graduate seminar on trustworthy machine learning. The course covers topics from emerging research areas related to the broader study of security and privacy in machine learning. Students will learn about attacks against computer systems that leverage machine learning, as well as defense techniques that mitigate such attacks.

Prerequisites

The course assumes students already have a basic understanding of machine learning. Students will familiarize themselves with the emerging body of literature from different research communities investigating security and privacy questions in machine learning. The class is designed to help students explore new research directions and applications. Most of the course readings will come from seminal and recent papers in the field.

Grading Policy

Grades will be computed based on the following factors:

Final grade cutoff:

Assignments

Research Projects

Students are required to do a project in this class. The goal of the course project is to provide students with an opportunity to explore research directions in trustworthy machine learning. The project should be related to the course content. An expected project consists of:

Late Submission Policy

Tentative Schedule and Material

Date | Topic | Slides | Readings & links | Assignments
Overview of Trustworthy Machine Learning [lecture_0]
Machine learning basics part 1 [lecture_1]
Machine learning basics part 2 [lecture_2]
Machine learning basics part 3 [lecture_3]
Machine learning basics part 4 [lecture_4]
 Generative Models   
Test-time integrity (attack) [slides]
White-box attack:
Goodfellow et al., Explaining and Harnessing Adversarial Examples
Carlini and Wagner, Towards Evaluating the Robustness of Neural Networks
Moosavi-Dezfooli et al., Universal adversarial perturbations
Hard-label black-box attack:
Brendel et al., Decision-based adversarial attacks: reliable attacks against black-box machine learning models
Cheng et al., Query-efficient hard-label black-box attack: an optimization-based approach
Chen et al., HopSkipJumpAttack: A Query-Efficient Decision-Based Attack
 
Test-time integrity (defense) [slides]
Madry et al., Towards Deep Learning Models Resistant to Adversarial Attacks
Wong et al., Fast is better than Free: Revisiting Adversarial Training
Zhang et al., Theoretically Principled Trade-off between Robustness and Accuracy
 
Training-time integrity (backdoor attack) [slides]
Liu et al., Trojaning Attack on Neural Networks
Shafahi et al., Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks
Gu et al., BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
 
Training-time integrity (defense) [slides]
Wang et al., Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
Wang et al., Practical Detection of Trojan Neural Networks: Data-Limited and Data-Free Cases
 
Test-time integrity (verification) part 1 [slides]
Wong and Kolter, Provable defenses against adversarial examples via the convex outer adversarial polytope
Zhang et al., Efficient Neural Network Robustness Certification with General Activation Functions
Optional:
Zhang et al., General Cutting Planes for Bound-Propagation-Based Neural Network Verification
 
Test-time integrity (verification) part 2 [slides]
Cohen et al., Certified Adversarial Robustness via Randomized Smoothing

Training-time integrity (poisoning attack)
Koh and Liang, Understanding Black-box Predictions via Influence Functions
Carlini and Terzis, Poisoning and Backdooring Contrastive Learning
Carlini, Poisoning the Unlabeled Dataset of Semi-Supervised Learning
 
Confidentiality (data) attack [slides]
Carlini et al., Extracting Training Data from Large Language Models
Kahla et al., Label-Only Model Inversion Attacks via Boundary Repulsion
Shokri et al., Membership Inference Attacks against Machine Learning Models
Fredrikson et al., Model inversion attacks that exploit confidence information and basic countermeasures
Choquette-Choo et al., Label-Only Membership Inference Attacks
 
Confidentiality (model) [slides]
Jagielski et al., High Accuracy and High Fidelity Extraction of Neural Networks
Tramer et al., Stealing Machine Learning Models via Prediction APIs
 
Confidentiality defense [slides]
Huang et al., Unlearnable Examples: Making Personal Data Unexploitable
Maini et al., Dataset Inference: Ownership Resolution in Machine Learning
 
Fairness [slides]
Zhao et al., Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints
Dwork et al., Fairness Through Awareness
Caliskan et al., Semantics derived automatically from language corpora contain human-like biases
 
 Watermarking   
Differential privacy part I [slides]
Dwork et al., Calibrating Noise to Sensitivity in Private Data Analysis
Abadi et al., Deep Learning with Differential Privacy
 
Differential privacy part II [slides]
Papernot et al., Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data
Mironov, Renyi Differential Privacy
 
Interpretability (XAI) part 1 [slides]
Simonyan et al., Deep inside convolutional networks: Visualising image classification models and saliency maps
Selvaraju et al., Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
 
Interpretability (XAI) part 2 [slides]
Ribeiro et al., “Why Should I Trust You?”: Explaining the Predictions of Any Classifier
Lundberg and Lee, A unified approach to interpreting model predictions
 
 LLM security   
 Unlearning   
Uncertainty [slides]
Guo et al., On Calibration of Modern Neural Networks
Minderer et al., Revisiting the Calibration of Modern Neural Networks
 
 Project Presentation   

References

There is no required textbook for this course. Some recommended readings are: