New Pandas and scikit-learn Sheets

Author: Alexandre Chabot-Leclerc, Ph.D., Director, Training Solutions

The Enthought training team has prepared a series of 8 quick-reference guides for Pandas (the Python Data Analysis library) and 3 quick-reference guides for scikit-learn (machine learning for Python). The topics were selected based on the idea that 20% of the functionality provides 80% of the usage. They include simple illustrations of the different concepts, and summaries of each follow in this blog post. 

Pandas

  • Sheet 1: Reading and Writing Data with Pandas
  • Sheet 2: Pandas Data Structures: Series and DataFrames
  • Sheet 3: Plotting with Series and DataFrames 
  • Sheet 4: Computation with Series and DataFrames
  • Sheet 5: Manipulating Dates and Times Using Pandas 
  • Sheet 6: Combining Pandas DataFrames 
  • Sheet 7: Split/Apply/Combine with DataFrames 
  • Sheet 8: Reshaping Pandas DataFrames and Pivot Tables

Scikit-learn 

  • Sheet 1: Classification: Predict categorical data
  • Sheet 2: Clustering: Unsupervised Learning
  • Sheet 3: Regression: Predict Continuous Data

Pandas recently released version 1.0.0. It includes a number of exciting new features, such as using Numba in rolling.apply, a new DataFrame method for converting to Markdown, a new scalar for missing values, and dedicated extension types for string and nullable boolean data. Visit https://pandas.pydata.org/ to learn more.
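
As a rough illustration of three of these features, the sketch below uses the missing-value scalar pd.NA together with the "string" and "boolean" extension types. The sample values are made up. The Numba and Markdown features are left out because they rely on optional packages.

```python
import pandas as pd

# Dedicated string type: missing entries are tracked as missing, not as text.
names = pd.Series(["ada", None, "grace"], dtype="string")

# Nullable boolean type: True, False, and missing can live in one column.
flags = pd.Series([True, None, False], dtype="boolean")

# The missing entry in the boolean column is the pd.NA scalar.
middle_flag = flags[1]

# Logical operations follow three-valued logic: "missing OR True" is True.
either = flags | True

# String methods skip over the missing entry instead of failing.
upper = names.str.upper()
```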

 

Grow your coding skills with Pandas

If you would like more hands-on experience with Pandas or are looking for additional guidance, Enthought offers a number of Python training courses. Enthought’s Pandas Mastery Workshop, designed for experienced Python users, and Python for Data Analysis classes, for those newer to Python, are ideal for those who work heavily with data. Sign up for these training sessions through our website, or contact us to learn more about our on-site corporate classes.

 

Sheet 1: Reading and Writing Data with Pandas

This document presents common usage patterns when reading data from text files with read_table, from Excel documents with read_excel, from databases with read_sql, or when scraping web pages with read_html. It also introduces how to write data to disk as text files, into an HDF5 file, or into a database.
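
A minimal sketch of these reading and writing patterns follows. It is not taken from the sheet itself: the data is made up, and in-memory buffers and an in-memory SQLite database stand in for real files and a real database server.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical tab-separated data standing in for a text file on disk.
raw = "city\ttemp_c\nAustin\t31.5\nOslo\t12.0\n"

# read_table splits on tabs by default (read_csv splits on commas).
df = pd.read_table(io.StringIO(raw))

# Write the frame out as comma-separated text, here into a buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Round-trip the same frame through an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
try:
    df.to_sql("weather", conn, index=False)
    from_db = pd.read_sql("SELECT * FROM weather", conn)
finally:
    conn.close()
```

Passing a file path instead of a buffer to read_table or to_csv reads from and writes to disk in the same way.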

 

Sheet 2: Pandas Data Structures: Series and DataFrames

This reference sheet focuses on the two main data structures: the DataFrame and the Series. It explains how to think about them in terms of common Python data structures and how to create them. It gives guidelines on how to select subsets of rows and columns, with clear explanations of the difference between label-based indexing, with .loc, and position-based indexing, with .iloc.
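
The difference between the two indexers can be sketched as follows. The column names and values are invented for illustration. Note that label slices include their end point, while position slices do not.

```python
import pandas as pd

# A Series behaves much like an ordered dict: labels map to values.
s = pd.Series({"a": 10, "b": 20, "c": 30})

# A DataFrame can be thought of as a dict of Series sharing one index.
df = pd.DataFrame(
    {"height": [1.2, 3.4, 5.6], "width": [10, 20, 30]},
    index=["x", "y", "z"],
)

# Label-based selection: the slice end ("y") is included.
by_label = df.loc["x":"y", "width"]

# Position-based selection: the slice end (2) is excluded, as in plain Python.
by_position = df.iloc[0:2, 1]
```

Both selections return the first two values of the width column.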

 

Sheet 3: Plotting with Series and DataFrames 

This sheet presents some of the most common kinds of plots together with their arguments. It also explains the relationship between Pandas and matplotlib and how to use them together effectively. It highlights the similarities and differences between plotting data stored in a Series and in a DataFrame.
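
The sketch below shows one way the two libraries fit together, assuming matplotlib is installed. The data is made up, and the off-screen "Agg" backend is chosen only so that the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # off-screen backend, an assumption for this sketch

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"a": [1, 3, 2], "b": [2, 1, 4]}, index=[2018, 2019, 2020])

# Create the figure with matplotlib, then let Pandas draw into its axes.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# A Series draws a single line.
df["a"].plot(ax=ax_left, title="Series")

# A DataFrame draws one line per column and adds a legend.
df.plot(ax=ax_right, title="DataFrame")

# The axes are ordinary matplotlib objects, so matplotlib calls still apply.
ax_right.set_xlabel("year")
fig.tight_layout()
```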

 

Sheet 4: Computation with Series and DataFrames 

This sheet codifies the behavior of DataFrames and Series in three rules: alignment first, element-by-element mathematical operations, and column-based reduction operations. It covers the built-in methods for the most common statistical operations, such as mean or sum. It also covers how missing values are handled by Pandas.
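
The three rules can be seen in a small sketch with made-up values:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Rules 1 and 2: values are aligned on their labels, then added element by
# element. Label "x" has no partner in b, so the result there is missing.
total = a + b

df = pd.DataFrame({"p": [1.0, 2.0, None], "q": [4.0, 5.0, 6.0]})

# Rule 3: reductions work column by column, skipping missing values by default.
col_means = df.mean()
```

Here total holds NaN for "x", 12.0 for "y", and 23.0 for "z", and col_means holds 1.5 for "p" and 5.0 for "q".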

 

Sheet 5: Manipulating Dates and Times Using Pandas 

The first part of this reference sheet describes how to create and manipulate time series data. Having a Series or DataFrame with a Datetime index allows for easy time-based indexing and slicing, as well as for powerful resampling and data alignment. The second part covers “vectorized” string operations, which is the ability to apply string transformations on each element of a column without having to explicitly write for-loops.
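
Both parts can be sketched briefly. The dates, values, and names below are invented for illustration.

```python
import pandas as pd

# Six daily readings starting on 1 January 2020.
idx = pd.date_range("2020-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Time-based slicing with date strings; both end points are included.
two_days = ts["2020-01-02":"2020-01-03"]

# Resampling: sum the readings in consecutive two-day bins.
per_two_days = ts.resample("2D").sum()

# Vectorized string operations: no explicit for-loop over the elements.
names = pd.Series([" alice ", "BOB"])
cleaned = names.str.strip().str.title()
```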

 

Sheet 6: Combining Pandas DataFrames 

The sixth reference sheet presents the tools for combining Series and DataFrames together, with SQL-type joins and concatenation. It then goes on to explain how to clean data with missing values, using different strategies to locate, remove, or replace them.
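
A short sketch of joins, concatenation, and missing-value handling follows, using two small made-up tables that share a "key" column.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10.0, 20.0, 40.0]})

# SQL-type joins: "inner" keeps keys found in both tables,
# "outer" keeps every key and leaves gaps where one side has no match.
inner = pd.merge(left, right, on="key", how="inner")
outer = pd.merge(left, right, on="key", how="outer")

# Concatenation stacks tables on top of each other.
stacked = pd.concat([left, left], ignore_index=True)

# Missing values: locate them, remove the affected rows, or replace them.
missing_y = outer["y"].isna()
dropped = outer.dropna()
filled = outer.fillna({"x": 0, "y": 0.0})
```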

 

Sheet 7: Split/Apply/Combine with DataFrames 

“Group by” operations involve splitting the data based on some criteria, applying a function to each group to aggregate, transform, or filter them and then combining the results. It’s an incredibly powerful and expressive tool. The reference sheet also highlights the similarity between “group by” operations and window functions, such as resample, rolling, and ewm (exponentially weighted functions).
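
The three kinds of "apply" step, along with a window function for comparison, can be sketched on a small made-up table:

```python
import pandas as pd

df = pd.DataFrame(
    {"team": ["a", "a", "b", "b", "b"], "score": [1, 3, 10, 20, 30]}
)
grouped = df.groupby("team")["score"]

# Aggregate: one value per group.
means = grouped.mean()

# Transform: same length as the input, here each score minus its group mean.
centered = grouped.transform(lambda s: s - s.mean())

# Filter: keep only the rows of groups that pass a test (more than two rows).
big_teams = df.groupby("team").filter(lambda g: len(g) > 2)

# Window function: the "groups" are sliding windows of two consecutive rows.
rolling_mean = df["score"].rolling(window=2).mean()
```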

 

Sheet 8: Reshaping Pandas DataFrames and Pivot Tables 

Finally, this sheet introduces the concept of “tidy data”, where each observation or sample is a row and each variable is a column. Tidy data is the optimal layout when working with Pandas. The sheet illustrates various tools, such as stack, unstack, melt, and pivot_table, for reshaping data into a tidy form or into a “wide” form.
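
As a minimal sketch, the made-up table below starts in a "wide" form with one column per year, is melted into a tidy form, and is then reshaped back.

```python
import pandas as pd

# Wide form: one row per city, one column per year.
wide = pd.DataFrame(
    {"city": ["Austin", "Oslo"], "2019": [30.0, 10.0], "2020": [32.0, 11.0]}
)

# Tidy form: one row per (city, year) observation.
tidy = wide.melt(id_vars="city", var_name="year", value_name="temp")

# Back to a wide form, with cities as the index and years as columns.
back = tidy.pivot_table(index="city", columns="year", values="temp")

# stack moves the column labels into the index; unstack reverses it.
stacked = back.stack()
unstacked = stacked.unstack()
```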

 

Improve proficiency with Scikit-learn machine learning in Python 

These sheets are designed to accelerate learning or serve as a refresher while you’re using scikit-learn. To learn more, consult the excellent scikit-learn documentation, or sign up for our Machine Learning Mastery Workshop if you’re already familiar with Python, or our Python for Machine Learning if you’d like to also solidify your Python knowledge. 

 

Sheet 1: Classification: Predict categorical data

This sheet focuses on predicting the class, or label, of a sample based on its features, such as recognizing hand-written digits or marking email as spam. You’ll also learn how to select the appropriate performance metric for your classification problem. The rest of the sheet explores when and how to use logistic regression, decision trees, ensemble methods, support vector classifiers, and neighbor classifiers.
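
The workflow can be sketched on the hand-written digits dataset that ships with scikit-learn. The choice of a scaled logistic regression, the split size, and the two metrics are assumptions made for this example, not recommendations from the sheet.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Features are 8x8 pixel intensities; the target is the digit, 0 to 9.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
predicted = model.predict(X_test)

# Two common metrics; which one fits depends on the problem at hand.
accuracy = accuracy_score(y_test, predicted)
macro_f1 = f1_score(y_test, predicted, average="macro")
```

Swapping in another classifier, such as a decision tree or a support vector classifier, only changes the second step of the pipeline.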

 

Sheet 2: Clustering: Unsupervised Learning

This sheet shows you how to predict the underlying structure in features without the use of targets or labels, and split samples into groups called “clusters.” With no targets, models are trained by minimizing some definition of “distance” within a cluster. Models can be used for prediction or for transformation, by reducing multiple features into one with a smaller set of unique values. The rest of the sheet explores when and how to use k-means, mean shift, affinity propagation, DBSCAN, agglomerative clustering, and BIRCH, and discusses performance metrics for evaluating clustering algorithms.
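
A minimal k-means sketch follows, run on synthetic data. The blob centers, the number of clusters, and the use of the silhouette score as a metric are all assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs; the true labels are discarded.
centers = [[0, 0], [5, 5], [-5, 5]]
X, _ = make_blobs(n_samples=150, centers=centers, cluster_std=0.5, random_state=0)

# Fit without targets and assign each sample to a cluster. The two features
# are reduced to a single cluster id with three possible values.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)

# Prediction: assign a new sample to the nearest learned cluster.
new_label = model.predict([[4.8, 5.1]])[0]

# For k-means, transform returns the distance to each cluster center.
distances = model.transform(X)

# One possible performance metric that needs no true labels.
score = silhouette_score(X, labels)
```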

 

Sheet 3: Regression: Predict Continuous Data

This sheet illustrates how to predict how a dependent variable (output) changes when any of the independent variables (inputs, or features) change. Examples include how house prices change as a function of neighborhood and size, or how time spent on a web page varies as a function of the number of ads and content type. The rest of the sheet goes into detail on linear models and on when and how to use ridge, lasso, non-linear transformations, a support vector regressor, and a stochastic gradient descent regressor.
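
The sketch below fits three of these linear models to synthetic data. The data-generating formula and the alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: the output depends linearly on two inputs, plus some noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ordinary least squares, plus two regularized variants.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}

# score returns R^2 on the held-out data for each fitted model.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
```

The fitted coefficients, available as models["linear"].coef_, estimate how much the output changes per unit change in each input.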

 

About the Author

Alexandre Chabot-Leclerc, Ph.D., Director, Training Solutions, holds a Ph.D. in electrical engineering and an M.Sc. in acoustics engineering from the Technical University of Denmark and a B.Eng. in electrical engineering from the Université de Sherbrooke.
