New Pandas and scikit-learn Sheets

Author: Alexandre Chabot-Leclerc, Ph.D., Vice President, Digital Transformation Solutions

The Enthought training team has prepared a series of 8 quick-reference guides for Pandas (the Python Data Analysis library) and 3 quick-reference guides for scikit-learn (machine learning for Python). The topics were selected based on the idea that 20% of the functionality provides 80% of the usage. The sheets include simple illustrations of the different concepts; a summary of each follows in this blog post.

Pandas

  • Sheet 1: Reading and Writing Data with Pandas
  • Sheet 2: Pandas Data Structures: Series and DataFrames
  • Sheet 3: Plotting with Series and DataFrames 
  • Sheet 4: Computation with Series and DataFrames
  • Sheet 5: Manipulating Dates and Times Using Pandas 
  • Sheet 6: Combining Pandas DataFrames 
  • Sheet 7: Split/Apply/Combine with DataFrames 
  • Sheet 8: Reshaping Pandas DataFrames and Pivot Tables

Scikit-learn 

  • Sheet 1: Classification: Predict categorical data
  • Sheet 2: Clustering: Unsupervised Learning
  • Sheet 3: Regression: Predict Continuous Data

Pandas has recently released version 1.0.0. It includes a number of exciting new features, such as using Numba in rolling.apply, a new DataFrame method for converting to Markdown, a new scalar for missing values, and dedicated extension types for string and nullable boolean data. Visit https://pandas.pydata.org/ to learn more.
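As a quick, hypothetical illustration of a few of those features (to_markdown requires the tabulate package, and the Numba engine requires numba to be installed):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ada", "bob", None], "active": [True, None, False]})

# Dedicated extension types for string and nullable boolean data
df = df.astype({"name": "string", "active": "boolean"})

# pd.NA is the new scalar for missing values in these extension types
print(df["name"][2] is pd.NA)

# Convert a DataFrame to Markdown (requires the tabulate package)
print(df.to_markdown())

# Numba-accelerated rolling.apply (requires numba; only supported with raw=True)
s = pd.Series(range(100), dtype="float64")
rolling_mean = s.rolling(10).apply(lambda x: x.mean(), engine="numba", raw=True)
```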

 

Grow your coding skills with Pandas

If you would like more hands-on experience with Pandas or are looking for additional guidance, Enthought offers a number of Python training courses. Enthought’s Pandas Mastery Workshop, designed for experienced Python users, and Python for Data Analysis classes, for those newer to Python, are ideal for those who work heavily with data. Sign up for these training sessions through our website, or contact us to learn more about our on-site corporate classes.

 

Sheet 1: Reading and Writing Data with Pandas

This document presents common usage patterns when reading data from text files with read_table, from Excel documents with read_excel, from databases with read_sql, or when scraping web pages with read_html. It also introduces how to write data to disk as text files, into an HDF5 file, or into a database.
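For example, the core read/write patterns look something like this (the file names, URL, and table names below are placeholders; read_sql needs SQLAlchemy, to_hdf needs PyTables, and read_html needs an HTML parser such as lxml):

```python
import pandas as pd
from sqlalchemy import create_engine

# Text files and Excel workbooks
df = pd.read_table("measurements.txt")        # tab-delimited by default
df_xl = pd.read_excel("measurements.xlsx")    # first sheet by default

# Web pages: read_html returns a list of DataFrames, one per <table>
tables = pd.read_html("https://example.com/results.html")

# Databases, through a SQLAlchemy engine
engine = create_engine("sqlite:///measurements.db")
df_sql = pd.read_sql("SELECT * FROM samples", engine)

# Writing back out: text file, HDF5 file, or database table
df.to_csv("measurements_clean.csv", index=False)
df.to_hdf("measurements.h5", key="clean")
df.to_sql("samples_clean", engine, if_exists="replace")
```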

 

Sheet 2: Pandas Data Structures: Series and DataFrames

This reference sheet focuses on the two main data structures: the DataFrame and the Series. It explains how to think about them in terms of common Python data structures and how to create them. It gives guidelines for selecting subsets of rows and columns, with clear explanations of the difference between label-based indexing, with .loc, and position-based indexing, with .iloc.
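A minimal sketch of those ideas (the index labels and column names are made up for illustration):

```python
import pandas as pd

# A Series is like an ordered dict: one label per value
s = pd.Series([4.2, 3.1, 5.7], index=["a", "b", "c"])

# A DataFrame is like a dict of Series sharing the same index
df = pd.DataFrame(
    {"height": [1.7, 1.8, 1.6], "weight": [65, 80, 58]},
    index=["ann", "bob", "cid"],
)

# Label-based indexing with .loc (rows and columns selected by name)
print(df.loc["ann", "height"])
print(df.loc[["ann", "bob"], :])

# Position-based indexing with .iloc (rows and columns selected by integer position)
print(df.iloc[0, 0])
print(df.iloc[:2, :])
```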

 

Sheet 3: Plotting with Series and DataFrames 

This sheet presents some of the most common kinds of plots together with their arguments. It also explains the relationship between Pandas and matplotlib and how to use them effectively. It highlights the similarities and differences of plotting data stored in Series or DataFrames.
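For instance, with hypothetical daily data, plotting a single Series or a whole DataFrame looks like this:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: one year of daily values
index = pd.date_range("2020-01-01", periods=365, freq="D")
df = pd.DataFrame(
    {"temperature": np.random.randn(365).cumsum(),
     "pressure": np.random.randn(365).cumsum()},
    index=index,
)

# A Series plots as a single line; a DataFrame plots one line per column
df["temperature"].plot(title="Temperature")
df.plot(subplots=True, figsize=(8, 6))

# Pandas delegates to matplotlib, so the usual matplotlib calls still apply
plt.xlabel("Date")
plt.tight_layout()
plt.show()
```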

 

Sheet 4: Computation with Series and DataFrames 

This sheet codifies the behavior of DataFrames and Series as three rules: alignment first, element-by-element mathematical operations, and column-based reduction operations. It covers the built-in methods for the most common statistical operations, such as mean or sum. It also covers how missing values are handled by Pandas.
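A small sketch of the three rules in action (the labels and values are arbitrary):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Rule 1: alignment first -- indexes are matched before any math,
# and non-overlapping labels produce NaN
total = s1 + s2        # a and d -> NaN; b -> 12.0; c -> 23.0

# Rule 2: mathematical operations are element-by-element
scaled = s1 * 2

# Rule 3: reductions operate column by column, skipping NaN by default
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, 6.0]})
print(df.mean())       # one value per column
print(df.sum(skipna=True))
```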

 

Sheet 5: Manipulating Dates and Times Using Pandas 

The first part of this reference sheet describes how to create and manipulate time series data. Having a Series or DataFrame with a Datetime index allows for easy time-based indexing and slicing, as well as for powerful resampling and data alignment. The second part covers “vectorized” string operations, which is the ability to apply string transformations on each element of a column without having to explicitly write for-loops.
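For example, with a hypothetical hourly Series and a small column of names:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly measurements over ten days
index = pd.date_range("2020-01-01", periods=240, freq="H")
ts = pd.Series(np.random.randn(240), index=index)

# Time-based indexing: select one day with a partial date string
one_day = ts.loc["2020-01-02"]

# Resample the hourly data to daily means
daily = ts.resample("D").mean()

# Vectorized string operations: transform every element without a for-loop
names = pd.Series(["Ada Lovelace", "Grace Hopper", "Alan Turing"])
last_names = names.str.split().str[-1]
upper = names.str.upper()
```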

 

Sheet 6: Combining Pandas DataFrames 

The sixth reference sheet presents the tools for combining Series and DataFrames together, with SQL-type joins and concatenation. It then goes on to explain how to clean data with missing values, using different strategies to locate, remove, or replace them.
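A compact sketch of those tools (the keys and values are made up):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# SQL-style joins with merge (inner, left, right, or outer)
inner = pd.merge(left, right, on="key", how="inner")
outer = pd.merge(left, right, on="key", how="outer")

# Stacking DataFrames end-to-end with concat
stacked = pd.concat([left, left], ignore_index=True)

# Locating, removing, or replacing missing values
print(outer.isna().sum())    # count missing values per column
dropped = outer.dropna()     # remove rows with any missing value
filled = outer.fillna(0)     # replace missing values with a constant
```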

 

Sheet 7: Split/Apply/Combine with DataFrames 

“Group by” operations involve splitting the data based on some criteria, applying a function to each group to aggregate, transform, or filter them and then combining the results. It’s an incredibly powerful and expressive tool. The reference sheet also highlights the similarity between “group by” operations and window functions, such as resample, rolling, and ewm (exponentially weighted functions).
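For example (the city names and temperatures below are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Austin", "Austin", "Houston", "Houston"],
    "temperature": [30.1, 31.5, 33.0, 32.2],
})

# Split by city, apply an aggregation to each group, combine the results
means = df.groupby("city")["temperature"].mean()

# Transform keeps the original shape, e.g. de-meaning within each group
anomalies = df.groupby("city")["temperature"].transform(lambda x: x - x.mean())

# Window functions follow the same split/apply/combine pattern over time
ts = pd.Series(np.random.randn(100),
               index=pd.date_range("2020-01-01", periods=100, freq="D"))
weekly = ts.resample("W").mean()
rolling = ts.rolling(window=7).mean()
smoothed = ts.ewm(span=7).mean()
```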

 

Sheet 8: Reshaping Pandas DataFrames and Pivot Tables 

Finally, this sheet introduces the concept of “tidy data”, where each observation or sample is a row and each variable is a column. Tidy data is the optimal layout when working with Pandas. The sheet illustrates various tools, such as stack, unstack, melt, and pivot_table, to reshape data into a tidy form or into a “wide” form.
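A small sketch of reshaping between wide and tidy layouts (the cities, years, and values are placeholders):

```python
import pandas as pd

# "Wide" data: one row per city, one column per year
wide = pd.DataFrame(
    {"2019": [1.1, 2.2], "2020": [1.3, 2.0]},
    index=pd.Index(["Austin", "Houston"], name="city"),
)

# melt reshapes toward tidy form: one observation per row
tidy = wide.reset_index().melt(id_vars="city", var_name="year", value_name="rainfall")

# pivot_table goes back to a wide layout, aggregating duplicates if needed
back_to_wide = tidy.pivot_table(index="city", columns="year", values="rainfall")

# stack and unstack move index levels between rows and columns
stacked = wide.stack()
unstacked = stacked.unstack()
```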

 

Improve proficiency with Scikit-learn machine learning in Python 

These sheets are designed to accelerate learning or serve as a refresher while you’re using scikit-learn. To learn more, consult the excellent scikit-learn documentation, sign up for our Machine Learning Mastery Workshop if you’re already familiar with Python, or take our Python for Machine Learning class if you’d like to also solidify your Python knowledge.

 

Sheet 1: Classification: Predict Categorical Data

This sheet focuses on predicting the class, or label, of a sample based on its features, such as recognizing hand-written digits or marking email as spam. You’ll also learn how to select the appropriate performance metric for your classification problem. The rest of the sheet explores when and how to use logistic regression, decision trees, ensemble methods, support vector classifiers, and neighbor classifiers.
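As an illustrative sketch, using scikit-learn's built-in digits dataset and logistic regression (one of several classifiers the sheet covers):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Recognizing hand-written digits: features are pixel intensities, labels are 0-9
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Pick a metric appropriate to the problem (accuracy, precision, recall, ...)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```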

 

Sheet 2: Clustering: Unsupervised Learning

This sheet shows you how to predict the underlying structure in features without the use of targets or labels, and split samples into groups called “clusters.” With no targets, models are trained by minimizing some definition of “distance” within a cluster. Models can be used for prediction or for transformation, by reducing multiple features into one with a smaller set of unique values. The rest of the sheet explores when and how to use k-means, mean shift, affinity propagation, DBSCAN, agglomerative clustering, BIRCH, and discusses performance metrics for evaluating clustering algorithms. 
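A minimal sketch with k-means on synthetic data (the number of clusters is an assumption baked into the example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, unlabeled data: the model only ever sees the features
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# k-means minimizes the distance between samples and their cluster center
model = KMeans(n_clusters=4, random_state=0)
labels = model.fit_predict(X)

# Used as a transformation: distances to each cluster center (one column per cluster)
distances = model.transform(X)

# With no true labels, internal metrics such as the silhouette score help
print(silhouette_score(X, labels))
```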

 

Sheet 3: Regression: Predict Continuous Data

This sheet illustrates how to predict how a dependent variable (output) changes when any of the independent variables (inputs, or features) change. For example, how house prices change as a function of neighborhood and size, or how time spent on a web page varies as a function of the number of ads and content type. The rest of the sheet goes into detail regarding linear models, when and how to use ridge, lasso, non-linear transformations, support vector regressor, and a stochastic gradient descent regressor. 
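A brief sketch using ridge and lasso on scikit-learn's California housing dataset (downloaded on first use; the alpha values are arbitrary):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Predicting house prices (a continuous target) from neighborhood features
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ridge and lasso add L2/L1 regularization to a linear model;
# scaling the features first keeps the penalties comparable across columns
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_train, y_train)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X_train, y_train)

print(r2_score(y_test, ridge.predict(X_test)))
print(r2_score(y_test, lasso.predict(X_test)))
```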

 

About the Author

Alexandre Chabot-Leclerc, Ph.D., Vice President, Digital Transformation Solutions, holds a Ph.D. in electrical engineering and an M.Sc. in acoustics engineering from the Technical University of Denmark, and a B.Eng. in electrical engineering from the Université de Sherbrooke.
