List of theory topics
This Google Drive Folder contains the slides for the 2024 lectures.
There are 11 theory sessions of 2 hours each. They will all take place face-to-face. Please bring your laptop.
Before each class, there are short videos you should watch. They run up to 20 minutes in total. Please set aside time in your schedule to watch them before coming to class, ideally the day before.
During class, I will present the contents using slides and we will do some exercises together using Nearpod or Google Spreadsheets. Please avoid distractions: place your phone in airplane mode, close all other windows on your computer, and try to stay focused. We will pause frequently during the session to help you regain focus. There will be a mid-term exam in one of the sessions and a final exam at the end of the course. The exam questions are based exclusively on the materials shown or discussed in the lectures during class.
After each session, there is some reading for you to do. These readings will be much easier after you have attended each lecture, will bring depth to what you learn in class, and will help you remember these contents better. Think of these readings as a form of continuous studying that will save you time and effort when preparing for the exams.
Session 1: Introduction
Before class
During class
- Lecture TT01: The Data Mining Process
- Course overview
- Lecture TT02: Data, Methods, and Scenarios
- Exercise: number of columns
- Exercise: confidence of a rule
- Exercise: which ones are data mining tasks? (pin board)
- Lecture TT03: Data Preparation - Data Types
- Exercise: compute binning (spreadsheet)
After class
Optional/additional material
- Strongly recommended to help you prepare for the practice sessions: tutorials 1 and 2 of the book by Tan et al. These tutorials introduce Python, which you should work through unless you are very comfortable with the language, and cover numpy and pandas, which will save you a lot of time in the practice sessions.
- Read the free chapter on data sources and biases from the Trustworthy Machine Learning book.
Session 2: Data cleaning
Before class
- Watch this 14-minute talk and demo on Cleaning Data by Mike Pound; his examples use R, but the same can be done in many languages.
During class
- Lecture TT04: Data Preparation - Integration and Cleaning
- Exercise: how to handle missing data
- Lecture TT05: Data Preparation - Reduction and Transformation
- Lecture TT06: Similarity - Numerical Data
- Exercise: compute Lp norm
After class
Optional/additional material
- Tutorials 3 and 4 of the book by Tan et al. cover data exploration and data pre-processing. The latter is quite similar to our first practice session, but uses a different dataset.
Session 3: Near duplicates
Before class
During class
- Lecture TT07: Similarity - Beyond Numerical Data
- Exercise: compute Jaccard similarity
- Lecture TT08: Finding Near-Duplicates
- Exercise: compute signature matrix
- Lecture TT09: Locality-Sensitive Hashing
After class
Optional/additional material
- Watch this 37-minute lecture by Ben Langmead on the Jaccard coefficient and min-hashing
- Watch Jeffrey D. Ullman, a famous computer scientist and co-author of both one of the books we use in the course and this method specifically, describe this near-duplicate finding method. Two 50-minute videos: part 1, part 2.
- See presentation TT10: Locality-Sensitive Hashing (Additional)
Session 4: Itemsets
Before class
During class
- Lecture TT11: Itemsets
- Exercise: compute itemset support (spreadsheet)
- Exercise: compute maximal itemsets
- Lecture TT12: Association Rules
- Exercise: compute support, confidence, and lift of a rule
After class
Session 5: Association rules mining
Before class
During class
- Lecture TT13: Association Rules Mining
- Exercise: prove confidence monotonicity
- Exercise: execute Apriori algorithm (spreadsheet)
- Lecture TT14: Improved Association Rules Mining
- Exercise: indicate which items are visited in a hash tree
After class
Optional/additional material
Session 6: Mid-term exam (Tue October 22nd, 2024 08:30-10:30)
Before class
Study TT01-TT09 and TT11-TT14 on your own, and try to solve exams from past years. Ask your questions in the forum.
The exam will not include TT10.
During class
We will have the mid-term exam. There is no class after the exam.
Session 7: Recommender systems
Before class
During class
- Lecture TT16: Recommender Systems
- Exercise: content-based single-user recommendations (spreadsheet)
- Lecture TT17: Recommender Systems - Interaction-Based
- Exercise: recommender based on user similarity (spreadsheet)
After class
Session 8: Recommender systems (cont.) + Outlier analysis
Before class
- Watch the 7-minute lecture on outlier analysis by Gourab Nath, discussing why outliers occur
During class
- We will watch this 8-minute presentation on how recommender systems work by Art of the Problem, which describes a factorization-based (latent-factor-based) method
- Lecture TT18: Recommender Systems - Latent Factors
- Lecture TT19: Outliers - Extreme Values
- Exercise: outliers using z-score (spreadsheet)
- Lecture TT20: Outliers - Probability Density Methods
- Exercise: clustering-based outlier detection (spreadsheet)
After class
Optional/additional material
Session 9: Outlier analysis (cont.) + Data streams
Before class
During class
- Lecture TT21: Outliers - Density- and Isolation-Based Methods
- Exercise: isolation forest example
- Lecture TT22: Data Streams
- Exercise: sampling at a fixed rate
- Lecture TT23: Data Streams - Reservoir Sampling
- Exercise: probabilities in a reservoir sample
After class
Optional/additional material
Session 10: Streams (cont.) + Time series mining
Before class
- Watch this quick 3-minute presentation on Bloom filters by Cube Drone
During class
- Lecture TT24: Data Streams - Bloom Filters
- Lecture TT25: Data Streams - Probabilistic Counting
- Exercise: ideas for simple probabilistic counting
- Lecture TT27: Time Series Analysis
- Exercise: smooth a time series (spreadsheet)
After class
Optional/additional material
- See presentation TT26: Data Streams - Estimating Moments (Additional)
Session 11: Time series mining (cont.)
Before class
During class
- Lecture TT28: Time Series - Similarity
- Example: dynamic time warping (spreadsheet)
- Lecture TT29: Time Series - Forecasting
- Example: simple auto-regressive model (spreadsheet)
After class
Final exam (December 11th, 09:30-11:30)
The date of the final exam is fixed by the School of Engineering. Please check their webpage for potential changes.
The final exam will include recommender systems, outlier analysis, data streams, and forecasting: topics TT16-TT25, TT27-TT29; it will not include topic TT26.
Notes
Session numbers are approximate and subject to change. Materials should not be considered final until the end of the course.
Slides are available under a Creative Commons license unless specified otherwise.
Main bibliography
Data Mining: The Textbook (2015) by Charu Aggarwal. ISBN 978-3-319-14142-8. Free Download
Mining of Massive Datasets SECOND EDITION (2014) by Leskovec et al. ISBN 978-1107077232. Online materials: http://www.mmds.org/. Free Download
Additional bibliography
Introduction to Data Mining SECOND EDITION (2019) by Tan et al. ISBN 978-0-13-312890-1. Online materials: https://www-users.cs.umn.edu/~kumar001/dmbook/index.php
Data Mining and Machine Learning SECOND EDITION (2020) by Zaki and Meira. ISBN 978-1108473989.
Data Mining: Concepts and Techniques THIRD EDITION (2011) by Han et al. ISBN 978-0123814791.