About This Course
This 5-day class combines our 3-day Python Foundations with materials on data access, data visualization, and machine learning essential to data scientists. This fast-paced class is intended for practicing data scientists, data analysts, and business intelligence experts interested in using Python for their day-to-day work. The primary focus is on learning to use Python tools for data science, data analysis, and machine learning efficiently and effectively.
Days 1–3: Python Foundations
- The class begins with a one-day introduction to the Python language focusing on standard data structures, control constructs, and code organization.
- After a brief overview of the Scientific Python ecosystem, we dive into techniques for numeric data processing, including efficiently manipulating and processing large data sets using NumPy arrays and data visualization with 2D plots using Matplotlib.
- Next up is an introduction to Pandas to efficiently load, clean, normalize, aggregate, transform, and visualize data.
Days 4–5: Data Access, Visual Exploration, and Machine Learning with scikit-learn
- Accessing data from common data file types and databases using Pandas and SQLAlchemy
- Basic and advanced visual data exploration with seaborn and Matplotlib
- Introduction to machine learning with scikit-learn
- Cleansing and normalizing data with Pandas and scikit-learn
- Using and evaluating regression models
- Using and evaluating classification models
- Using and evaluating clustering models
"Everyone in the class is very impressed by our instructor. He knows Python inside out. He encourages questions and answers our questions thoroughly. He gave us a lot to think about, such as things going on behind the Python syntax and pointers on how to write fast Python. He really made this class worthwhile, way above learning from examples, which is what I usually do."
"A very insightful course which was delivered by a true expert. I have left the course with hundreds of ideas upon which I can now act."
- Neal M.
Onsite corporate classes are also available. Discounts are available for 3 or more attendees and academics currently at a degree-granting institution. Contact us using the form on this page to learn more.
There are no classes scheduled at this time. To request one, please contact us using the form on this page.
Course Syllabus & Topics
The course assumes a working knowledge of key data science topics (statistics, machine learning, and general data analytic methods). Programming experience in some language (such as R, MATLAB, SAS, Mathematica, Java, C, C++, VB, or FORTRAN) is expected. In particular, participants need to be comfortable with general programming concepts like variables, loops, and functions. Experience with Python is helpful (but not required).
Introduction to Python
We kick off the class by exploring the functionality of the IPython Shell, an enhanced interactive science-centric console. Next we review the Jupyter Notebook, a cell-based environment that renders scripts, plots, and rich media in a web-like interface, making it ideal for sharing and publishing analysis with peers. You’ll leave with a mastery of these tools that will accelerate your productivity and facilitate collaboration.
- Data types (strings, lists, dictionaries, and more)
- Control flow (if/else statements, looping)
- Organizing code (functions, modules, packages)
- Reading and writing files
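As a rough illustration of how these four topics fit together, the sketch below counts words with a dictionary, wraps the logic in a function, and round-trips the result through a file. The function name `count_words`, the sample sentence, and the file name are hypothetical, chosen for this example only; they are not taken from the course materials.

```python
import os
import tempfile


def count_words(text):
    """Return a dict mapping each lowercase word to its number of occurrences."""
    counts = {}
    for word in text.lower().split():  # loop over a list of strings
        if word in counts:             # conditional branch
            counts[word] += 1
        else:
            counts[word] = 1
    return counts


counts = count_words("the quick fox jumps over the lazy dog")

# Write one "word count" pair per line, then read the file back.
path = os.path.join(tempfile.mkdtemp(), "counts.txt")  # hypothetical file name
with open(path, "w") as f:
    for word, n in counts.items():
        f.write(f"{word} {n}\n")

with open(path) as f:
    lines = f.read().splitlines()

print(lines[0])
```

Dictionaries preserve insertion order, so the first line written corresponds to the first word seen in the text.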
Introduction to NumPy and 2D plotting
- Plotting with Matplotlib
- Understanding the N-dimensional data structure
- Creating arrays
- Indexing arrays by slicing or more generally with indices or masks
- Basic operations and manipulations on N-dimensional arrays
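The three indexing styles named above (slices, index arrays, and masks) can be sketched on a small array as follows. The array contents and variable names are illustrative assumptions, not course material.

```python
import numpy as np

# A 3x4 array holding the integers 0..11, filled row by row.
a = np.arange(12).reshape(3, 4)

# Slicing: first two rows, every other column (a view, no data is copied).
sliced = a[:2, ::2]

# Index arrays ("fancy" indexing): pick rows 0 and 2 (returns a copy).
picked = a[[0, 2]]

# Boolean mask: keep only the elements greater than 6 (returns a 1D copy).
masked = a[a > 6]

# Element-wise arithmetic and a reduction along the row axis.
doubled = a * 2
column_sums = a.sum(axis=0)

print(sliced)
print(masked)
print(column_sums)
```

Because a slice is a view, assigning into `sliced` would also change `a`; the fancy-indexed and masked results are independent copies.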
Time series analysis and data manipulation with Pandas
Built on top of NumPy arrays, the Python Data Analysis Library (Pandas) is a powerful and convenient package for dealing with multi-dimensional data sets. Participants will learn about its data aggregation and reorganization capabilities for data set exploration, including support for labeling data along each dimension, missing values, and time series manipulations.
- Pandas I/O operations
- Pandas 1D and 2D data structures (Series and DataFrame)
- Data alignment, aggregation, and summarization
- Computation and analysis with Pandas
- Dealing with dates and times
- Querying SQL databases with Python DB-API
- Loading data from databases using Pandas and SQLAlchemy
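A minimal sketch of the last two items, assuming an in-memory SQLite database: it first queries through the standard-library `sqlite3` module (a DB-API implementation), then loads the same table into a DataFrame and aggregates it. The `sales` table and its rows are made up for illustration. To keep the example dependency-free it passes the `sqlite3` connection straight to Pandas; with SQLAlchemy, an engine would be passed in its place.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a small, made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "2024-01-01", 10.0),
        ("east", "2024-01-02", 15.0),
        ("west", "2024-01-01", 7.5),
    ],
)
conn.commit()

# Plain DB-API: run a parameterized query and fetch the rows as tuples.
rows = conn.execute(
    "SELECT day, amount FROM sales WHERE region = ? ORDER BY day", ("east",)
).fetchall()

# Pandas: load the whole table into a DataFrame and parse the date column.
df = pd.read_sql_query("SELECT * FROM sales", conn, parse_dates=["day"])

# Aggregate: total amount per region.
totals = df.groupby("region")["amount"].sum()

conn.close()
print(totals)
```

The `?` placeholders let the database driver handle quoting, which is safer than building SQL strings by hand.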
Visual Exploration with seaborn and Matplotlib
- Inspect feature distributions before applying transformations
- Spot correlations, non-linearities, and level combinations between features
- Identify interactions between features using faceted plots
Intro to machine learning with scikit-learn
- Input: 2D, samples, and features
- Estimator, predictor, transformer interfaces
- Pre-processing data
- Model selection
- Re-encoding features
- Dimensionality reduction
- Features common to regressors in scikit-learn
- Useful regression models
- Evaluating regression models
- Features common to classifiers in scikit-learn
- Useful classifier models
- Evaluating classifier models
- Loss functions in clustering models
- Types of clustering models in scikit-learn
- Evaluating clustering models
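To make the estimator, transformer, and predictor interfaces concrete, here is a minimal sketch of a classification workflow. The choice of the bundled iris data set, `StandardScaler`, and `LogisticRegression` is an assumption made for illustration; the course may use different data and models.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X is 2D with shape (n_samples, n_features); y holds one label per sample.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A transformer (fit/transform) chained with a predictor (fit/predict).
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on data the model has not seen during fitting.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"held-out accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training split only, which avoids leaking information from the test split.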
(Time permitting) Natural Language Processing
- Preparing and cleaning text
- Extracting features from text
- Classifying text
- Topic modeling
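The feature-extraction and classification steps listed above might look like the following sketch, which assumes scikit-learn's `TfidfVectorizer` and `MultinomialNB` as the tools. The four-document corpus and its "spam"/"ham" labels are invented for illustration and are far too small for real use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny, made-up corpus with one label per document.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes attached",
]
labels = ["spam", "spam", "ham", "ham"]

# Extract TF-IDF features from the text, then fit a naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)

# Classify two unseen snippets.
predictions = model.predict(["buy cheap pills", "monday meeting notes"])
print(list(predictions))
```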