Jumpstart Your Data Science Career

I teach a 10-week, part-time class that gives students a rapid introduction to the field of data science, designed to be equal parts practical, relevant, and hands-on.

Class Descriptions

Welcome to DAT! The first class will go over the essentials of the course and give students a quick overview of what's to come. We'll go over the syllabus and have everyone set up their GitHub repo for the class, which will be used to distribute class material. We'll also cover the most basic commands used to navigate git. We'll finish by going over the data science workflow and looking at a simple but complete data science script!
We'll begin the meat of our course where data scientists typically spend 50% of their time: cleaning, managing, and visualizing data! The key skill here is understanding how to structure and manipulate data so it serves your stated objectives. Today's class introduces students to the basics of the pandas library, going over its most common syntactical patterns and operations.
What's the best way to efficiently fill in missing values? What if we want to apply an operation to numeric data and nothing else? These operations and many more are cleanly built into the pandas API, and this class provides a crash course on the most succinct ways of implementing the dozen or so most common of them.
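As a taste of the kinds of patterns covered, here is a minimal sketch (the column names and data are made up for illustration) of filling missing values and applying an operation to numeric columns only:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [34, np.nan, 29, np.nan],
    "city": ["NYC", "LA", None, "NYC"],
    "income": [72000, 65000, np.nan, 58000],
})

# Fill missing numeric values with each column's median,
# and missing text values with a sentinel label.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
df["city"] = df["city"].fillna("unknown")

# Apply an operation to the numeric columns and nothing else.
df[numeric_cols] = df[numeric_cols].round(0)
```

`select_dtypes` is the idiomatic way to restrict an operation to columns of a given type without listing them by hand.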
More fun with cleaning data: today we go over common strategies for using pandas to summarize and merge data. We also cover the most common tools for extracting time-based data, which is often one of the most informative portions of your dataset.
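A small sketch of these three moves — summarizing with `groupby`, joining with `merge`, and extracting date parts with the `.dt` accessor (the store/sales data is invented for illustration):

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2023-01-02", "2023-02-15", "2023-01-09", "2023-03-01"]),
    "revenue": [100, 150, 80, 120],
})
stores = pd.DataFrame({"store": ["A", "B"], "region": ["East", "West"]})

# Summarize: total revenue per store.
totals = sales.groupby("store", as_index=False)["revenue"].sum()

# Merge store metadata onto the summary.
totals = totals.merge(stores, on="store", how="left")

# Extract time-based features with the .dt accessor.
sales["month"] = sales["date"].dt.month
sales["weekday"] = sales["date"].dt.day_name()
```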
Today we continue working with time-based data, adding a complicated twist: what if you want to compare data from today with data from last week? Last year? Capturing these comparisons is critical for adequately extracting the information contained within your dataset. We then add a second twist: how do you perform these operations when your data is structured in hierarchies? Hierarchical data is a common pattern in real-world datasets, and working with it is a competency every data scientist must have in their tool belt.
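One common way to express both ideas in pandas is a grouped `shift` for period-over-period comparisons, plus a grouped aggregation for hierarchical summaries (again with invented data):

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "week": [1, 2, 3, 1, 2, 3],
    "revenue": [100, 110, 121, 80, 88, 99],
})

# Within each store, compare this week's revenue to last week's.
df = df.sort_values(["store", "week"])
df["prev_week"] = df.groupby("store")["revenue"].shift(1)
df["wow_change"] = df["revenue"] - df["prev_week"]

# Hierarchical aggregation: per-store totals and averages.
summary = df.groupby("store")["revenue"].agg(["sum", "mean"])
```

The `groupby(...).shift(1)` pattern is key: it looks up "last period" within each group, so store A's weeks never leak into store B's.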
Today's class goes over the fundamentals of plotting and visualization with one of the most powerful Python graphing tools: Plotly. The class covers the basics of creating various kinds of charts, and later shows how the tool can make your charts interactive and suitable for dashboards.
In today's class students will break into groups to present their unit 2 homework projects. In the second half, we'll kick off unit 3 with a discussion of decision trees, one of the most common base learners in powerful ML models. We'll walk through their basic workings, as well as their largest strengths and weaknesses. We'll finish the class with our first look at the most common ML library in the Python ecosystem: scikit-learn.
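Part of what makes scikit-learn approachable is that every model follows the same fit/predict pattern — a minimal sketch on a built-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Every scikit-learn estimator follows the same fit/predict/score pattern.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
train_accuracy = tree.score(X, y)
predictions = tree.predict(X[:5])
```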
With our knowledge of decision trees behind us, we'll discuss the most powerful technique for analyzing structured data: gradient boosting. Gradient boosting is a powerful (and underused!) machine learning technique that effectively captures the primary benefits of decision trees while minimizing their weaknesses. Today's class goes over its main workings.
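The core idea — fitting many shallow trees in sequence, each correcting the residual errors of the ensemble so far — can be sketched with scikit-learn's implementation on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data, just for illustration.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 200 shallow trees is fit to the residuals of the
# ensemble built so far, scaled by the learning rate.
model = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)
```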
In today's class we'll go over some of the most important techniques for prepping data for machine learning models. This includes dealing with categorical (text-based) data, as well as how to treat data within training and test sets. The class includes an extended discussion of one of the most common issues with encoding text-based data: categorical columns with a large number of unique values. We'll wrap up by discussing how to use model pipelines to chain together multiple processing steps, making out-of-sample prediction much more feasible in production settings.
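A minimal sketch of the pipeline idea (column names invented): one-hot encode the categorical column, pass numeric columns through, and chain it all to the model so the exact same transforms run at prediction time:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "city": ["NYC", "LA", "NYC", "SF", "LA", "SF"] * 10,
    "sqft": [700, 900, 650, 800, 950, 720] * 10,
    "rent": [3200, 2800, 3000, 3400, 2900, 3300] * 10,
})
X, y = df[["city", "sqft"]], df["rent"]

# handle_unknown="ignore" keeps prediction from crashing when a
# category unseen during training shows up out of sample.
pipeline = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["city"])],
        remainder="passthrough",
    )),
    ("model", GradientBoostingRegressor(random_state=0)),
])
pipeline.fit(X, y)
preds = pipeline.predict(X)
```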
How do we know our results will generalize to out-of-sample data? What's the best way to determine whether a change to our model or dataset will provide an enduring improvement? These are critical questions to ask when embarking on a data science problem, and we'll discuss the framework to use for answering them. We'll start with an overview of the training-validation-test set framework, and wrap up with a look at a more thorough way of estimating out-of-sample performance: k-fold cross-validation.
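In scikit-learn, k-fold cross-validation is a few lines — the model is trained k times, each time holding out a different fold, yielding k out-of-sample scores:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5-fold CV: train 5 models, each scored on the fifth of the
# data it never saw during training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=cv, scoring="r2")
mean_r2 = scores.mean()
```

Averaging the fold scores gives a more stable performance estimate than any single train/validation split.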
In today's class we'll discuss the most powerful way of understanding causality in "black box" models: partial dependence. A large reason organizations tend to avoid more powerful machine learning techniques is the difficulty of understanding what causes their output. Partial dependence analysis allows a data scientist to clearly understand what drives the results they are seeing, and which factors are most responsible for the outcome being measured. It is indispensable for answering the "why" of whatever is being studied.
In today's class we'll go over an absolutely critical, but often undertaught, aspect of data science: putting machine learning models into production! Aspiring data scientists need to understand that business decisions are not typically made inside a Jupyter notebook. Instead, your analysis needs to reach people who have no familiarity with your code or techniques, yet can still use it to illuminate their decision making. We'll build a data application using a new framework called streamlit, and we'll finish the class by taking the exhilarating step of deploying our application onto the web! Many students mention that having a deployed application is the most gratifying part of the entire class, and the most useful skill they take with them to work.
Now we'll shift our attention to using gradient boosting for classification -- when you have to predict a category and not a number. We'll talk about one of the most common issues practitioners face when working with classification: imbalanced classes. In doing so we'll look at a more modern, specialized implementation of gradient boosting: xgboost! Xgboost is a high-performance implementation of gradient boosting that is better built for parallelization and large, messy datasets, and is the most common implementation of gradient boosting used in production ML models.
We'll continue our discussion of classification models by discussing how their interpretation and tuning often differ from regression models. We'll discuss the tradeoff between false positives and false negatives, and improve our model by re-weighting our classes so that each class contributes equally to the model's total loss.
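Both ideas can be sketched in a few lines of scikit-learn (shown here with logistic regression for brevity): `class_weight="balanced"` re-weights classes inversely to their frequency, and the confusion matrix makes the false positive / false negative tradeoff explicit:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: ~90% negatives.
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "balanced" weights each class inversely to its frequency, so the
# rare positives contribute as much total loss as the many negatives.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Rows are true classes, columns are predictions:
# [[true neg, false pos], [false neg, true pos]]
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
```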
We'll continue our work for this unit with a more immersive, in-depth lab, where students work through a classification dataset end-to-end and interpret the results.
Welcome to unit 4! With unit 3 behind us, we'll move past gradient boosting and take a close look at the primary method for unstructured data: deep learning. Deep learning is the application of neural networks with at least 2 hidden layers of weights. The first portion of this class is explanatory in nature, walking through the mechanics of how a neural network makes its calculations. The last portion introduces an NLP dataset, where we'll build our first neural network in Keras to predict text sentiment.
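The Keras model for text sentiment can be sketched as follows (assuming TensorFlow is installed; the random word-ID arrays stand in for a real tokenized dataset):

```python
import numpy as np
from tensorflow import keras

# Toy stand-in for tokenized text: 200 "sentences" of 20 word IDs
# each from a 1000-word vocabulary, with a binary sentiment label.
vocab_size, seq_len = 1000, 20
X = np.random.randint(0, vocab_size, size=(200, seq_len))
y = np.random.randint(0, 2, size=(200,))

model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 16),       # word IDs -> vectors
    keras.layers.GlobalAveragePooling1D(),        # average over the sentence
    keras.layers.Dense(16, activation="relu"),    # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),  # P(positive sentiment)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, verbose=0)
probs = model.predict(X, verbose=0)
```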
With our first taste of deep learning behind us, we'll spend the majority of this class digging into the important but often elusive details of how and why a neural network for NLP should be constructed in a particular way. We'll discuss word embeddings, which numerically capture the interplay of different words, as well as activation functions and optimizers, and wrap up class by making improvements to our initial NLP model from the previous class.
Neural networks work best with extremely large, multi-faceted datasets, and one of the largest challenges in getting a deep learning project off the ground is gathering enough training data for a neural network to actually be useful. In today's class we'll discuss the primary method for catalyzing model training on NLP datasets: transfer learning. Transfer learning allows you to re-use pretrained neural network layers that were exposed to data similar to your own. It dramatically reduces the amount of training data necessary to get useful results, and can often let you build workable models with thousands of samples instead of millions or billions.
The course will conclude with a class where we take our NLP neural network and deploy it on the web using heroku and streamlit. The application will allow users to search for a Twitter user and check their current sentiment based on their latest tweet. This final activity elegantly ties together many of the concepts discussed in class: querying an API for real-time data, deep learning, and streamlined model deployment to make our work interactive and usable by the world at large.

Unit Summaries

Unit 1

Cleaning And Visualizing Data

Objective: This module takes an extended deep dive into Pandas, the most commonly used tool for cleaning, analyzing, and visualizing data, and wraps up with an introduction to visualization with Plotly. It's designed to give students hands-on practice wrangling data and visualizing their results with interactive graphs.

Tools used: Pandas, Plotly

Homework Assignments

  • IMDB Movie Dataset
  • Chipotle Data Set HW
  • File Streaming and Large Data Sets

Unit 2

Machine Learning Fundamentals

Objective: Unit 2 covers the fundamental design principles of building high-performance machine learning models: What are the most powerful techniques to use, and how do we get them to perform at their best? Unit 2 trains students in the common problems and techniques practitioners need to master to build reliable data science projects.

Tools used: Scikit-Learn, Pandas, xgboost, streamlit

Homework Assignments

  • Model Building Challenge

Unit 3

XGBoost & Classification

Objective: Unit 3 is designed to build on and extend the main lessons from unit 2 and apply them with a more flexible, powerful machine learning framework: xgboost. We'll learn about its inner workings, how it allows you to scale your models more effectively, and in doing so we'll tackle a different type of machine learning problem: classification.

Tools used: XGBoost

Homework Assignments

  • Model Deployment Challenge

Unit 4

Deep Learning & NLP

Objective: Unit 4 gives students an introduction to using neural networks to derive statistical patterns in unstructured data, such as text and images. We'll start with the basics, and end with state-of-the-art models that use the latest deep learning architectures.

Tools used: Tensorflow, Keras

Homework Assignments

  • Independent Project