IST 597: Trustworthy Machine Learning [Spring 2026]

Overview

Announcements

Description

This is an intensive graduate seminar on trustworthy machine learning. The course covers topics from emerging research areas related to the broader study of security and privacy in machine learning. Students will learn about attacks against computer systems that leverage machine learning, as well as defense techniques that mitigate such attacks.

Prerequisites

The course assumes students already have a basic understanding of machine learning. Students will familiarize themselves with the emerging body of literature from different research communities investigating security and privacy questions in machine learning. The class is designed to help students explore new research directions and applications. Most of the course readings will come from seminal and recent papers in the field.

Grading Policy

Grades will be computed based on the following factors:

Final grade cutoff:

Assignments

Research Projects

Students are required to do a project in this class. The goal of the course project is to provide students with an opportunity to explore research directions in trustworthy machine learning. The project should be related to the course content. An expected project consists of:

Late Submission Policy

Tentative Schedule and Material

Date | Topic | Slides | Readings & links | Assignments
Overview of Trustworthy Machine Learning [lecture_0]
Machine learning basics part 1 [lecture_1]
Machine learning basics part 2 [lecture_2]
Machine learning basics part 3 [lecture_3]
Machine learning basics part 4 [lecture_4]
 Generative Models   
Test-time integrity (attack) [slides]
White-box attack:
Goodfellow et al., Explaining and Harnessing Adversarial Examples
Carlini and Wagner, Towards Evaluating the Robustness of Neural Networks
Moosavi-Dezfooli et al., Universal adversarial perturbations
Hard-label black-box attack:
Brendel et al., Decision-based adversarial attacks: reliable attacks against black-box machine learning models
Cheng et al., Query-efficient hard-label black-box attack: an optimization-based approach
Chen et al., HopSkipJumpAttack: A Query-Efficient Decision-Based Attack
 
Test-time integrity (defense) [slides]
Madry et al., Towards Deep Learning Models Resistant to Adversarial Attacks
Wong et al., Fast is better than Free: Revisiting Adversarial Training
Zhang et al., Theoretically Principled Trade-off between Robustness and Accuracy
 
Training-time integrity (backdoor attack) [slides]
Liu et al., Trojaning Attack on Neural Networks
Shafahi et al., Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks
Gu et al., BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
 
Training-time integrity (defense) [slides]
Wang et al., Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
Wang et al., Practical Detection of Trojan Neural Networks: Data-Limited and Data-Free Cases
 
Test-time integrity (verification) part 1 [slides]
Wong and Kolter, Provable defenses against adversarial examples via the convex outer adversarial polytope
Zhang et al., Efficient Neural Network Robustness Certification with General Activation Functions
Optional:
Zhang et al., General Cutting Planes for Bound-Propagation-Based Neural Network Verification
 
Test-time integrity (verification) part 2 [slides]
Cohen et al., Certified Adversarial Robustness via Randomized Smoothing

Training-time integrity (poisoning attack)
Koh and Liang, Understanding Black-box Predictions via Influence Functions
Carlini and Terzis, Poisoning and Backdooring Contrastive Learning
Carlini, Poisoning the Unlabeled Dataset of Semi-Supervised Learning
 
Confidentiality (data) attack [slides]
Carlini et al., Extracting Training Data from Large Language Models
Kahla et al., Label-Only Model Inversion Attacks via Boundary Repulsion
Shokri et al., Membership Inference Attacks against Machine Learning Models
Fredrikson et al., Model inversion attacks that exploit confidence information and basic countermeasures
Choquette-Choo et al., Label-Only Membership Inference Attacks
 
Confidentiality (model) [slides]
Jagielski et al., High Accuracy and High Fidelity Extraction of Neural Networks
Tramer et al., Stealing Machine Learning Models via Prediction APIs
 
Confidentiality defense [slides]
Huang et al., Unlearnable Examples: Making Personal Data Unexploitable
Maini et al., Dataset Inference: Ownership Resolution in Machine Learning
 
Fairness [slides]
Zhao et al., Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints
Dwork et al., Fairness Through Awareness
Caliskan et al., Semantics derived automatically from language corpora contain human-like biases
 
 Watermarking   
Differential privacy part I [slides]
Dwork et al., Calibrating Noise to Sensitivity in Private Data Analysis
Abadi et al., Deep Learning with Differential Privacy
 
Differential privacy part II [slides]
Papernot et al., Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data
Mironov, Renyi Differential Privacy
 
Interpretability (XAI) part 1 [slides]
Simonyan et al., Deep inside convolutional networks: Visualising image classification models and saliency maps
Selvaraju et al., Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
 
Interpretability (XAI) part 2 [slides]
Ribeiro et al., “Why Should I Trust You?”: Explaining the Predictions of Any Classifier
Lundberg and Lee, A unified approach to interpreting model predictions
 
 LLM security   
 Unlearning   
Uncertainty [slides]
Guo et al., On Calibration of Modern Neural Networks
Minderer et al., Revisiting the Calibration of Modern Neural Networks
 
 Project Presentation   

References

There is no required textbook for this course. Some recommended readings are: