List of theory topics
These materials should not be considered final until the end of the course. Materials from previous editions can be found in other branches of the repository for the course.
There are 11 theory sessions of 2 hours each. They will all take place face-to-face. Please bring your laptop.
Before each class, there are short videos you should watch. They are up to 20 minutes in total, and watching them requires some preparation/scheduling on your part. Please set aside time in your schedule to watch these videos before coming to class, ideally on the day before.
During class, I will present the contents using slides, and we will do some exercises together using Nearpod or Google Spreadsheets. Please avoid distractions: place your phone in airplane mode, close all other windows on your computer, and try to stay focused. We will pause frequently during the session to help you regain focus. One of the sessions is devoted to a mid-term exam, and there is a final exam at the end of the course. The exam questions are based exclusively on the materials shown or discussed in the lectures during class.
After each session, there is some reading for you to do. These readings will be much easier after you have attended each lecture, will bring depth to what you learn in class, and will help you remember these contents better. Think of these readings as a form of continuous studying that will save you time and effort when preparing for the exams.
Session 1: Introduction
Before class
During class
- Lecture TT01: introduction ppt/pdf
- Course overview
- Lecture TT02: data, methods, and scenarios ppt/pdf
- Exercise: confidence of a rule
- Exercise: which ones are data mining tasks? (pin board)
- Lecture TT03: data preparation, data types ppt/pdf
- Exercise: compute binning (spreadsheet)
After class
Optional/additional material
- Strongly recommended to help you prepare for the practice sessions: tutorials 1 and 2 of the book by Tan et al. These tutorials introduce Python (do them unless you are very comfortable with this language) and the numpy and pandas libraries, which will save you a ton of time in the practice sessions.
- Read the free chapter on data sources and biases of the Trustworthy Machine Learning book.
Session 2: Data cleaning
Before class
- Watch this 14-minute talk and demo on Cleaning Data by Mike Pound; his examples use R, but the same can be done in many languages.
During class
- Lecture TT04: data integration and cleaning ppt/pdf
- Exercise: how to handle missing data
- Lecture TT05: data reduction and transformation ppt/pdf
- Lecture TT06: similarity computation on numerical data ppt/pdf
- Exercise: compute Lp norm
After class
Optional/additional material
- Tutorials 3 and 4 of the book by Tan et al. cover issues of data exploration and data pre-processing. The latter is quite similar to our first practice session, but uses a different dataset.
Session 3: Near duplicates
Before class
During class
- Lecture TT07: similarity computation beyond numerical data ppt/pdf
- Exercise: compute Jaccard similarity
- Lecture TT08: finding near-duplicates ppt/pdf
- Exercise: compute signature matrix
- Lecture TT09: locality-sensitive hashing ppt/pdf [SKIPPED IN 2022 – NOT TO BE INCLUDED IN EXAMS]
After class
Optional/additional material
- Watch this 37-minute lecture by Ben Langmead on the Jaccard coefficient and min-hashing
- Watch Jeffrey D. Ullman, a famous computer scientist and co-author of both one of the books we use in the course and this method, describe this near-duplicate finding method in two 50-minute videos: part 1, part 2.
- See presentation TT10: locality-sensitive hashing additional materials ppt/pdf
Session 4: Itemsets
Before class
During class
- Lecture TT11: itemsets ppt/pdf
- Exercise: compute itemset support (spreadsheet)
- Exercise: compute maximal itemsets
- Lecture TT12: association rules ppt/pdf
- Exercise: compute support, confidence, and lift of a rule
After class
Session 5: Association rules mining
Before class
During class
- Lecture TT13: association rules mining ppt/pdf
- Exercise: prove confidence monotonicity
- Exercise: execute Apriori algorithm (spreadsheet)
- Lecture TT14: speeding up association rules mining ppt/pdf
- Exercise: indicate which items are visited in a hash tree
After class
Optional/additional material
Session 6: Mid-term exam (Wednesday, October 25th, 2023)
Before class
Study TT01-TT08 and TT11-TT13 on your own, and try to solve exams from past years. Ask your questions in the forum.
The exam will not include TT09, TT10 or TT14.
During class
We will have the mid-term exam; there will be no class after it.
Session 7: Recommender systems
Before class
During class
- Lecture TT16: recommender systems ppt/pdf
- Exercise: content-based single-user recommendations (spreadsheet)
- Lecture TT17: interaction-based recommender systems ppt/pdf
- Exercise: recommender based on user similarity (spreadsheet)
After class
Session 8: Recommender systems (cont.) + Outlier analysis
Before class
- Watch the 7-minute lecture on outlier analysis by Gourab Nath, discussing why outliers occur
During class
- We will watch this 8-minute presentation on how recommender systems work by Art of the Problem, which describes a factorization-based (latent-factors based) method
- Lecture TT18: latent-factors based recommender systems ppt/pdf
- Lecture TT19: outliers introduction and extreme value analysis ppt/pdf
- Exercise: outliers using z-score (spreadsheet)
- Lecture TT20: probability and clustering-based methods ppt/pdf
- Exercise: clustering-based outlier detection (spreadsheet)
After class
Optional/additional material
Session 9: Outlier analysis (cont.) + Data streams
Before class
During class
- Lecture TT21: density- and isolation-based methods ppt/pdf
- Exercise: isolation forest example
- Lecture TT22: data streams ppt/pdf
- Exercise: sampling at a fixed rate
- Lecture TT23: reservoir sampling ppt/pdf
- Exercise: probabilities in a reservoir sample
Note for 2022: outlier detection, including isolation forest, needs to be finished in this class so that students have the necessary background for the practice sessions.
After class
Optional/additional material
Session 10: Streams (cont.) + Time series mining
Before class
- Watch this 3-minute quick presentation on Bloom filters by Cube Drone
During class
- Lecture TT24: bloom filters ppt/pdf
- Lecture TT25: probabilistic counting ppt/pdf
- Exercise: ideas for simple probabilistic counting
- Lecture TT27: time series ppt/pdf
- Exercise: smooth a time series (spreadsheet)
Note for 2022: probabilistic counting needs to be finished in this class so that students have the necessary background for the practice session on probabilistic counting.
After class
Optional/additional material
- See presentation TT26: moments estimation ppt/pdf
Session 11: Time series mining (cont.)
Before class
During class
- Lecture TT28: time series similarity ppt/pdf
- Example: dynamic time warping (spreadsheet)
- Lecture TT29: time series forecasting ppt/pdf
- Example: simple auto-regressive model (spreadsheet)
Note for 2023: forecasting needs to be finished in this class so that students have the necessary background for the practice session on temperature prediction.
After class
Final exam (Wednesday, December 13th, 2023)
The final exam will include recommender systems, outlier analysis, data streams, and forecasting: topics TT16-TT25, TT27-TT29; it will not include topic TT26.
Notes
Note that session numbers are approximate and subject to change.
Slides available under a Creative Commons license unless specified otherwise.
Main bibliography
Data Mining: The Textbook (2015) by Charu Aggarwal. ISBN 978-3-319-14142-8. Free Download
Mining of Massive Datasets SECOND EDITION (2014) by Leskovec et al. ISBN 978-1107077232. Online materials: http://www.mmds.org/. Free Download
Additional bibliography
Introduction to Data Mining SECOND EDITION (2019) by Tan et al. ISBN 978-0-13-312890-1. Online materials: https://www-users.cs.umn.edu/~kumar001/dmbook/index.php
Data Mining and Machine Learning SECOND EDITION (2020) by Zaki and Meira. ISBN 978-1108473989.
Data Mining: Concepts and Techniques THIRD EDITION (2011) by Han et al. ISBN 978-0123814791.