Henry Adams

The Geometry of Data

                        

AIMS Rwanda Training School, Spring 2023

This one-week course is part of the graduate training school Foundational Methods in Data Science in Kigali, Rwanda.

Instructor: Henry Adams
Email: henryhughadams at gmail dot com

Course topic: The geometry of a dataset often reflects important patterns within. For example, different clusters may represent different groups within a dataset that could be modeled separately. This course provides an introduction to geometric techniques for analyzing data, primarily in an unsupervised fashion. We will provide visual and mathematical introductions to clustering and dimensionality reduction. We will also implement these algorithms on real-world datasets, including (i) the conformation space of the cyclo-octane molecule, and (ii) a space of 3x3 pixel patches from optical images.
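As a rough illustration of the two themes named above, here is a minimal sketch that clusters a small synthetic dataset with k-means and then projects it to two dimensions with PCA. It assumes NumPy and scikit-learn are installed. The data, the variable names, and the parameter choices are made up for illustration and are not taken from the course notebooks.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Made-up data: two groups of 50 points each in 3 dimensions.
group_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=0.3, size=(50, 3))
group_b = rng.normal(loc=[5.0, 5.0, 0.0], scale=0.3, size=(50, 3))
points = np.vstack([group_a, group_b])

# Clustering: ask k-means for 2 clusters and get one label per point.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

# Dimensionality reduction: project the 3D points onto the top 2 principal components.
pca = PCA(n_components=2)
projected = pca.fit_transform(points)

print("cluster sizes:", np.bincount(labels))
print("projected shape:", projected.shape)
print("fraction of variance kept:", pca.explained_variance_ratio_.sum())
```

Because the two synthetic groups are far apart relative to their spread, k-means should recover them as separate clusters. Real datasets are rarely this clean, and the number of clusters usually has to be chosen rather than known in advance.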

Goals: Students will become fluent with the main ideas and the language of clustering and dimensionality reduction, and will be able to communicate these ideas to others.

Course notes: The main course notes are available at [Course notes PDF], but we will not use all of this material.

Course software: The course software is available at the following link: [Course GitHub Page].

Schedule

Date | Class Topic | Remark

March 20 (A) | Clustering: K-means clustering | [Notes, pages 19-65]
March 20 (B) | Clustering: Hierarchical clustering | [Notes, pages 66-97]
March 21 (A) | Clustering: Hands-on practice | [Exercises 1, Jupyter 1, Exercises 2, Jupyter 2]
March 21 (B) | Special topic: An introduction to applied topology | [Slides, Software tutorial]
March 22 | Dimensionality reduction: Principal component analysis (PCA) | [Notes, pages 98-116]
March 23 (A) | Dimensionality reduction: Hands-on practice (3-circle model) | [Exercises, Jupyter notebook]
March 23 (B) | Dimensionality reduction: Nonlinear techniques | [Notes, pages 117-121]
March 24 | Dimensionality reduction: Hands-on practice | [Jupyter notebook]

Installation instructions: We will be running a few Jupyter notebooks, and I recommend running them locally on your computer. First, install Anaconda. Afterwards, install scikit-learn using a terminal command such as "pip install -U scikit-learn". Then, download the course Jupyter notebooks (the .ipynb files) from the [Course GitHub Page]. In your terminal, go to the folder containing the downloaded files and type "jupyter notebook". This should open a browser window listing the notebooks, where you can select one to open and run in a graphical user interface.
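One way to check whether the local installation succeeded is a short Python script like the sketch below, which reports any packages that cannot be found in the current environment. The list of package names is an assumption about what the notebooks need ("sklearn" is the import name of scikit-learn, and "notebook" is the package behind the "jupyter notebook" command), so adjust it as needed.

```python
import importlib.util

# Assumed import names for this course; edit the list to match the notebooks.
REQUIRED = ["numpy", "sklearn", "notebook"]


def missing_packages(names):
    """Return the names that cannot be found in the current Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not found:", ", ".join(missing))
else:
    print("All listed packages were found.")
```

Run the script with the same Python that Anaconda installed. If it reports a missing package, installing that package and running the script again is a reasonable first step.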

If the above installation instructions don't work for you, please ask for help from somebody with more experience with Jupyter notebooks. As a possible backup plan, you can skip the local installation and instead run things directly in your browser using Colab; see for example the [K-means Colab] or the [PCA Colab]. However, Jupyter notebooks tend to work better when run locally, so alternatively you can work with somebody for whom the installation was successful.

YouTube video contributions: I help organize the Applied Algebraic Topology Research Network (AATRN), which has active weekly online seminars, and whose YouTube channel has 5,000 subscribers, 500 videos, and about 20 hours watched per day. We are interested in hosting 5- to 20-minute videos from any of you on data science (broadly interpreted); the topic does not need to be related to topology. These could be tutorial videos or research videos. Please see the "Video contributions" section at https://www.aatrn.net/participate for more details.