New Pandas and scikit-learn Sheets

Author: Alexandre Chabot-Leclerc, Ph.D., Vice President, Digital Transformation Solutions

The Enthought training team has prepared a series of 8 quick-reference guides for Pandas (the Python Data Analysis library) and 3 quick-reference guides for scikit-learn (machine learning for Python). The topics were selected based on the idea that 20% of the functionality provides 80% of the usage. The sheets include simple illustrations of the different concepts; a summary of each follows in this blog post.

Pandas

  • Sheet 1: Reading and Writing Data with Pandas
  • Sheet 2: Pandas Data Structures: Series and DataFrames
  • Sheet 3: Plotting with Series and DataFrames 
  • Sheet 4: Computation with Series and DataFrames
  • Sheet 5: Manipulating Dates and Times Using Pandas 
  • Sheet 6: Combining Pandas DataFrames 
  • Sheet 7: Split/Apply/Combine with DataFrames 
  • Sheet 8: Reshaping Pandas DataFrames and Pivot Tables

Scikit-learn 

  • Sheet 1: Classification: Predict categorical data
  • Sheet 2: Clustering: Unsupervised Learning
  • Sheet 3: Regression: Predict Continuous Data

Pandas has recently released version 1.0.0. It includes a number of exciting new features, such as using Numba in rolling.apply, a new DataFrame method for converting to Markdown, a new scalar for missing values, and dedicated extension types for string and nullable boolean data. Visit https://pandas.pydata.org/ to learn more.
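As a quick, hypothetical illustration of a few of those features (to_markdown requires the tabulate package, and the Numba engine requires numba to be installed):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ada", "bob", None], "active": [True, None, False]})

# Dedicated extension types for string and nullable boolean data
df = df.astype({"name": "string", "active": "boolean"})

# pd.NA is the new scalar for missing values in these extension types
print(df["name"][2] is pd.NA)

# Convert a DataFrame to Markdown (requires the tabulate package)
print(df.to_markdown())

# Numba-accelerated rolling.apply (requires numba; only supported with raw=True)
s = pd.Series(range(100), dtype="float64")
rolling_mean = s.rolling(10).apply(lambda x: x.mean(), engine="numba", raw=True)
```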

 

Grow your coding skills with Pandas

If you would like more hands-on experience with Pandas or are looking for additional guidance, Enthought offers a number of Python training courses. Enthought’s Pandas Mastery Workshop, designed for experienced Python users, and Python for Data Analysis classes, for those newer to Python, are ideal for those who work heavily with data. Sign up for these training sessions through our website, or contact us to learn more about our on-site corporate classes.

 

Sheet 1: Reading and Writing Data with Pandas

This document presents common usage patterns when reading data from text files with read_table, from Excel documents with read_excel, from databases with read_sql, or when scraping web pages with read_html. It also introduces how to write data to disk as text files, into an HDF5 file, or into a database.
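For example, the core read/write patterns look something like this (the file names, URL, and table names below are placeholders; read_sql needs SQLAlchemy, to_hdf needs PyTables, and read_html needs an HTML parser such as lxml):

```python
import pandas as pd
from sqlalchemy import create_engine

# Text files and Excel workbooks
df = pd.read_table("measurements.txt")        # tab-delimited by default
df_xl = pd.read_excel("measurements.xlsx")    # first sheet by default

# Web pages: read_html returns a list of DataFrames, one per <table>
tables = pd.read_html("https://example.com/results.html")

# Databases, through a SQLAlchemy engine
engine = create_engine("sqlite:///measurements.db")
df_sql = pd.read_sql("SELECT * FROM samples", engine)

# Writing back out: text file, HDF5 file, or database table
df.to_csv("measurements_clean.csv", index=False)
df.to_hdf("measurements.h5", key="clean")
df.to_sql("samples_clean", engine, if_exists="replace")
```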

 

Sheet 2: Pandas Data Structures: Series and DataFrames

This reference sheet focuses on the two main data structures: the DataFrame and the Series. It explains how to think about them in terms of common Python data structures and how to create them. It gives guidelines for selecting subsets of rows and columns, with clear explanations of the difference between label-based indexing, with .loc, and position-based indexing, with .iloc.
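A minimal sketch of those ideas (the index labels and column names are made up for illustration):

```python
import pandas as pd

# A Series is like an ordered dict: one label per value
s = pd.Series([4.2, 3.1, 5.7], index=["a", "b", "c"])

# A DataFrame is like a dict of Series sharing the same index
df = pd.DataFrame(
    {"height": [1.7, 1.8, 1.6], "weight": [65, 80, 58]},
    index=["ann", "bob", "cid"],
)

# Label-based indexing with .loc (rows and columns selected by name)
print(df.loc["ann", "height"])
print(df.loc[["ann", "bob"], :])

# Position-based indexing with .iloc (rows and columns selected by integer position)
print(df.iloc[0, 0])
print(df.iloc[:2, :])
```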

 

Sheet 3: Plotting with Series and DataFrames 

This sheet presents some of the most common kinds of plots together with their arguments. It also explains the relationship between Pandas and matplotlib and how to use them effectively. It highlights the similarities and differences of plotting data stored in Series or DataFrames.
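For instance, with hypothetical daily data, plotting a single Series or a whole DataFrame looks like this:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: one year of daily values
index = pd.date_range("2020-01-01", periods=365, freq="D")
df = pd.DataFrame(
    {"temperature": np.random.randn(365).cumsum(),
     "pressure": np.random.randn(365).cumsum()},
    index=index,
)

# A Series plots as a single line; a DataFrame plots one line per column
df["temperature"].plot(title="Temperature")
df.plot(subplots=True, figsize=(8, 6))

# Pandas delegates to matplotlib, so the usual matplotlib calls still apply
plt.xlabel("Date")
plt.tight_layout()
plt.show()
```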

 

Sheet 4: Computation with Series and DataFrames 

This sheet codifies the behavior of DataFrames and Series as three rules: alignment first, element-by-element mathematical operations, and column-based reduction operations. It covers the built-in methods for the most common statistical operations, such as mean or sum. It also covers how missing values are handled by Pandas.
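A small sketch of the three rules in action (the labels and values are arbitrary):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Rule 1: alignment first -- indexes are matched before any math,
# and non-overlapping labels produce NaN
total = s1 + s2        # a and d -> NaN; b -> 12.0; c -> 23.0

# Rule 2: mathematical operations are element-by-element
scaled = s1 * 2

# Rule 3: reductions operate column by column, skipping NaN by default
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, 6.0]})
print(df.mean())       # one value per column
print(df.sum(skipna=True))
```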

 

Sheet 5: Manipulating Dates and Times Using Pandas 

The first part of this reference sheet describes how to create and manipulate time series data. Having a Series or DataFrame with a Datetime index allows for easy time-based indexing and slicing, as well as for powerful resampling and data alignment. The second part covers “vectorized” string operations, which is the ability to apply string transformations on each element of a column without having to explicitly write for-loops.
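For example, with a hypothetical hourly Series and a small column of names:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly measurements over ten days
index = pd.date_range("2020-01-01", periods=240, freq="H")
ts = pd.Series(np.random.randn(240), index=index)

# Time-based indexing: select one day with a partial date string
one_day = ts.loc["2020-01-02"]

# Resample the hourly data to daily means
daily = ts.resample("D").mean()

# Vectorized string operations: transform every element without a for-loop
names = pd.Series(["Ada Lovelace", "Grace Hopper", "Alan Turing"])
last_names = names.str.split().str[-1]
upper = names.str.upper()
```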

 

Sheet 6: Combining Pandas DataFrames 

The sixth reference sheet presents the tools for combining Series and DataFrames together, with SQL-type joins and concatenation. It then goes on to explain how to clean data with missing values, using different strategies to locate, remove, or replace them.
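A compact sketch of those tools (the keys and values are made up):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# SQL-style joins with merge (inner, left, right, or outer)
inner = pd.merge(left, right, on="key", how="inner")
outer = pd.merge(left, right, on="key", how="outer")

# Stacking DataFrames end-to-end with concat
stacked = pd.concat([left, left], ignore_index=True)

# Locating, removing, or replacing missing values
print(outer.isna().sum())    # count missing values per column
dropped = outer.dropna()     # remove rows with any missing value
filled = outer.fillna(0)     # replace missing values with a constant
```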

 

Sheet 7: Split/Apply/Combine with DataFrames 

“Group by” operations involve splitting the data based on some criteria, applying a function to each group to aggregate, transform, or filter them and then combining the results. It’s an incredibly powerful and expressive tool. The reference sheet also highlights the similarity between “group by” operations and window functions, such as resample, rolling, and ewm (exponentially weighted functions).
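For example (the city names and temperatures below are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Austin", "Austin", "Houston", "Houston"],
    "temperature": [30.1, 31.5, 33.0, 32.2],
})

# Split by city, apply an aggregation to each group, combine the results
means = df.groupby("city")["temperature"].mean()

# Transform keeps the original shape, e.g. de-meaning within each group
anomalies = df.groupby("city")["temperature"].transform(lambda x: x - x.mean())

# Window functions follow the same split/apply/combine pattern over time
ts = pd.Series(np.random.randn(100),
               index=pd.date_range("2020-01-01", periods=100, freq="D"))
weekly = ts.resample("W").mean()
rolling = ts.rolling(window=7).mean()
smoothed = ts.ewm(span=7).mean()
```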

 

Sheet 8: Reshaping Pandas DataFrames and Pivot Tables 

Finally, this sheet introduces the concept of “tidy data”, where each observation or sample is a row and each variable is a column. Tidy data is the optimal layout when working with Pandas. The sheet illustrates various tools, such as stack, unstack, melt, and pivot_table, to reshape data into a tidy form or into a “wide” form.
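A small sketch of reshaping between wide and tidy layouts (the cities, years, and values are placeholders):

```python
import pandas as pd

# "Wide" data: one row per city, one column per year
wide = pd.DataFrame(
    {"2019": [1.1, 2.2], "2020": [1.3, 2.0]},
    index=pd.Index(["Austin", "Houston"], name="city"),
)

# melt reshapes toward tidy form: one observation per row
tidy = wide.reset_index().melt(id_vars="city", var_name="year", value_name="rainfall")

# pivot_table goes back to a wide layout, aggregating duplicates if needed
back_to_wide = tidy.pivot_table(index="city", columns="year", values="rainfall")

# stack and unstack move index levels between rows and columns
stacked = wide.stack()
unstacked = stacked.unstack()
```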

 

Improve proficiency with Scikit-learn machine learning in Python 

These sheets are designed to accelerate learning or serve as a refresher while you’re using scikit-learn. To learn more, consult the excellent scikit-learn documentation, sign up for our Machine Learning Mastery Workshop if you’re already familiar with Python, or take our Python for Machine Learning class if you’d like to also solidify your Python knowledge.

 

Sheet 1: Classification: Predict Categorical Data

This sheet focuses on predicting the class, or label, of a sample based on its features, such as recognizing hand-written digits or marking email as spam. You’ll also learn how to select the appropriate performance metric for your classification problem. The rest of the sheet explores when and how to use logistic regression, decision trees, ensemble methods, support vector classifiers, and neighbor classifiers.
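As an illustrative sketch, using scikit-learn's built-in digits dataset and logistic regression (one of several classifiers the sheet covers):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Recognizing hand-written digits: features are pixel intensities, labels are 0-9
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Pick a metric appropriate to the problem (accuracy, precision, recall, ...)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```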

 

Sheet 2: Clustering: Unsupervised Learning

This sheet shows you how to predict the underlying structure in features without the use of targets or labels, and split samples into groups called “clusters.” With no targets, models are trained by minimizing some definition of “distance” within a cluster. Models can be used for prediction or for transformation, by reducing multiple features into one with a smaller set of unique values. The rest of the sheet explores when and how to use k-means, mean shift, affinity propagation, DBSCAN, agglomerative clustering, BIRCH, and discusses performance metrics for evaluating clustering algorithms. 
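A minimal sketch with k-means on synthetic data (the number of clusters is an assumption baked into the example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, unlabeled data: the model only ever sees the features
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# k-means minimizes the distance between samples and their cluster center
model = KMeans(n_clusters=4, random_state=0)
labels = model.fit_predict(X)

# Used as a transformation: distances to each cluster center (one column per cluster)
distances = model.transform(X)

# With no true labels, internal metrics such as the silhouette score help
print(silhouette_score(X, labels))
```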

 

Sheet 3: Regression: Predict Continuous Data

This sheet illustrates how to predict how a dependent variable (output) changes when any of the independent variables (inputs, or features) change. For example, how house prices change as a function of neighborhood and size, or how time spent on a web page varies as a function of the number of ads and content type. The rest of the sheet goes into detail regarding linear models, when and how to use ridge, lasso, non-linear transformations, support vector regressor, and a stochastic gradient descent regressor. 
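A brief sketch using ridge and lasso on scikit-learn's California housing dataset (downloaded on first use; the alpha values are arbitrary):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Predicting house prices (a continuous target) from neighborhood features
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ridge and lasso add L2/L1 regularization to a linear model;
# scaling the features first keeps the penalties comparable across columns
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_train, y_train)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X_train, y_train)

print(r2_score(y_test, ridge.predict(X_test)))
print(r2_score(y_test, lasso.predict(X_test)))
```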

 

About the Author

Alexandre Chabot-Leclerc, Ph.D., Vice President, Digital Transformation Solutions, holds a Ph.D. in electrical engineering and an M.Sc. in acoustics engineering from the Technical University of Denmark, and a B.Eng. in electrical engineering from the Université de Sherbrooke.
