Info 290T

Data Engineering

4 units

Course Description

This class will cover the principles and practices of managing data at scale, with a focus on use cases in data analysis and machine learning. We will cover the entire life cycle of data management and science, ranging from data preparation to exploration, visualization and analysis, to machine learning and collaboration.

The class will balance foundational concerns with exposure to practical languages, tools, and real-world concerns. We will study the foundations of prevalent data models in use today, including relations, tensors, and dataframes, and mappings between them. We will study SQL as a means to query and manipulate data at scale, including performance concerns like views and indexes, query processing and optimization, and transactions, all from a user perspective. We will study the foundations and realities of data preparation, including hands-on work with real-world data using standard Python and SQL frameworks. We will explore data exploration modalities for non-programmers, including the fundamentals behind spreadsheet systems and interactive visual analytics packages. We will look at approaches for managing the machine learning lifecycle of data preparation, model selection and training, model serving and monitoring. Time permitting we will look at technologies for moving, sharing, and caching data including event streaming systems, key-value/document stores, log analytics, and search engines.

Also offered as CS194.

Prerequisites

COMPSCI C100/DATA C100/STAT C100 or COMPSCI189 or INFO 251 or DATA 144/INFO 254 or equivalent upper-div course in data science. COMPSCI 61A or COMPSCI 88 or INFO 206B or equivalent courses in programming.

Last updated:

October 7, 2020