Data Science W261

Machine Learning at Scale

3 units

Course Description

This course builds on and goes beyond the collect-and-analyze phase of big data, focusing on how machine learning algorithms can be rewritten and extended to scale to petabytes of data, both structured and unstructured, and to generate sophisticated models used for real-time prediction. Conceptually, the course is divided into two parts. The first covers the fundamental concepts of MapReduce parallel computing through the lens of Hadoop, MrJob, and Spark, diving deep into Spark Core, DataFrames, the Spark shell, Spark Streaming, Spark SQL, MLlib, and more. The second focuses on hands-on algorithm design and development in parallel computing environments (Spark), covering supervised learning algorithms (decision tree learning), graph processing algorithms (PageRank, shortest path), gradient descent algorithms (support vector machines), and matrix factorization. Students will apply MapReduce parallel compute frameworks to industrial applications and deployments in fields including advertising, finance, healthcare, and search engines. Examples and exercises are provided in Python notebooks (Hadoop Streaming, MrJob, and PySpark).
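For a flavor of the MapReduce pattern the first part of the course covers, the sketch below shows a minimal PySpark word count: a map phase emits (word, 1) pairs and a reduce phase sums them per key. The input path "input.txt" is a placeholder; any plain-text file works.

    # Minimal MapReduce-style word count in PySpark (illustrative sketch;
    # "input.txt" is a placeholder path, not a course-supplied dataset).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("input.txt")                    # RDD of text lines
    counts = (lines.flatMap(lambda line: line.split())  # map: emit each word
                   .map(lambda word: (word, 1))         # map: (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b))    # reduce: sum per word

    for word, count in counts.take(10):                 # inspect a sample
        print(word, count)

    spark.stop()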

Skill Sets

Implementing machine learning algorithms on single machines and on clusters of machines / Amazon Web Services (AWS) / Working on problems with terabytes of data / Machine learning pipelines for petabyte-scale data / Algorithm design / Parallel computing

Tools

Apache Hadoop / Apache Spark

Prerequisites

Data Science W205 & W207. Intermediate programming skills in an object-oriented language (e.g., Python). Master of Information and Data Science students only.

Last updated: September 19, 2019