Author: Alexandre Chabot-Leclerc, Ph.D., Vice President, Digital Transformation Solutions
The Enthought training team has prepared a series of 8 quick-reference guides for Pandas (the Python Data Analysis library) and 3 quick-reference guides for scikit-learn (machine learning for Python). The topics were selected based on the idea that 20% of the functionality provides 80% of the usage. They include simple illustrations of the different concepts, and summaries of each follow in this blog post.
Pandas
- Sheet 1: Reading and Writing Data with Pandas
- Sheet 2: Pandas Data Structures: Series and DataFrames
- Sheet 3: Plotting with Series and DataFrames
- Sheet 4: Computation with Series and DataFrames
- Sheet 5: Manipulating Dates and Times Using Pandas
- Sheet 6: Combining Pandas DataFrames
- Sheet 7: Split/Apply/Combine with DataFrames
- Sheet 8: Reshaping Pandas DataFrames and Pivot Tables
Scikit-learn
- Sheet 1: Classification: Predict categorical data
- Sheet 2: Clustering: Unsupervised Learning
- Sheet 3: Regression: Predict Continuous Data
Pandas has recently released version 1.0.0. It includes a new number of new exciting features, such as using Numba in rolling.apply, a new DataFrame method for converting to Markdown, a new scalar for missing values, and dedicated extension types for string and nullable boolean data. Visit https://pandas.pydata.org/ to learn more.
Grow your coding skills with Pandas
If you would like more hands-on experience with Pandas or are looking for additional guidance, Enthought offers a number of Python training courses. Enthought’s Pandas Mastery Workshop, designed for experienced Python users, and Python for Data Analysis classes, for those newer to Python, are ideal for those who work heavily with data. Sign up for these training sessions through our website, or contact us to learn more about our on-site corporate classes.
Sheet 1: Reading and Writing Data with Pandas
This document presents common usage patterns when reading data from text files with read_table, from Excel documents with read_excel, from databases with read_sql, or when scraping web pages with read_html. It also introduces how to write data to disk as text files, into an HDF5 file, or into a database.
Sheet 2: Pandas Data Structures: Series and DataFrames
This reference sheet focuses on the two main data structures: the DataFrame, and the Series. It explains how to think about them in terms of common Python data structure and how to create them. It gives guidelines about how to select subsets of rows and columns, with clear explanations of the difference between label-based indexing, with .loc, and position-based indexing, with .iloc.
Sheet 3: Plotting with Series and DataFrames
This presents some of the most common kinds of plots together with their arguments. It also explains the relationship between Pandas and matplotlib and how to use them effectively. It highlights the similarities and differences of plotting data stored in Series or DataFrames.
Sheet 4: Computation with Series and DataFrames
This codifies the behavior of DataFrames and Series as following three rules: alignment first, element-by-element mathematical operations, and column-based reduction operations. It covers the built-in methods for most common statistical operations, such as mean or sum. It also covers how missing values are handled by Pandas.
Sheet 5: Manipulating Dates and Times Using Pandas
The first part of this reference sheet describes how to create and manipulate time series data. Having a Series or DataFrame with a Datetime index allows for easy time-based indexing and slicing, as well as for powerful resampling and data alignment. The second part covers “vectorized” string operations, which is the ability to apply string transformations on each element of a column without having to explicitly write for-loops.
Sheet 6: Combining Pandas DataFrames
The sixth reference sheet presents the tools for combining Series and DataFrames together, with SQL-type joins and concatenation. It then goes on to explain how to clean data with missing values, using different strategies to locate, remove, or replace them.
Sheet 7: Split/Apply/Combine with DataFrames
“Group by” operations involve splitting the data based on some criteria, applying a function to each group to aggregate, transform, or filter them and then combining the results. It’s an incredibly powerful and expressive tool. The reference sheet also highlights the similarity between “group by” operations and window functions, such as resample, rolling, and ewm (exponentially weighted functions).
Sheet 8: Reshaping Pandas DataFrames and Pivot Tables
Finally, this introduces the concept of “tidy data”, where each observation or sample is a row and each variable is a column. Tidy data is the optimal layout when working with Pandas. It illustrates various tools, such as stack, unstack, melt, and pivot_table, to reshape data into a tidy form or to a “wide” form.
Improve proficiency with Scikit-learn machine learning in Python
These sheets are designed to accelerate learning or serve as a refresher while you’re using scikit-learn. To learn more, consult the excellent scikit-learn documentation, or sign up for our Machine Learning Mastery Workshop if you’re already familiar with Python, or our Python for Machine Learning if you’d like to also solidify your Python knowledge.
Sheet 1: Classification Predict categorical data
This sheet focuses on predicting the class, or label, of a sample based on its features, such as recognizing hand-written digits or marking email as spam. You’ll also learn how to select the appropriate performance metric for your classification problem. The rest of the sheet explores when and how to use logistic regression, decision trees, ensemble methods, support vector classifiers, and neighbor classifiers.
Sheet 2: Clustering: Unsupervised Learning
This sheet shows you how to predict the underlying structure in features without the use of targets or labels, and split samples into groups called “clusters.” With no targets, models are trained by minimizing some definition of “distance” within a cluster. Models can be used for prediction or for transformation, by reducing multiple features into one with a smaller set of unique values. The rest of the sheet explores when and how to use k-means, mean shift, affinity propagation, DBSCAN, agglomerative clustering, BIRCH, and discusses performance metrics for evaluating clustering algorithms.
Sheet 3: Regression: Predict Continuous Data
This sheet illustrates how to predict how a dependent variable (output) changes when any of the independent variables (inputs, or features) change. For example, how house prices change as a function of neighborhood and size, or how time spent on a web page varies as a function of the number of ads and content type. The rest of the sheet goes into detail regarding linear models, when and how to use ridge, lasso, non-linear transformations, support vector regressor, and a stochastic gradient descent regressor.
About the Author
Alexandre Chabot-Leclerc, Ph.D., Vice President, Digital Transformation Solutions holds a Ph.D. in electrical engineering and a M.Sc. in acoustics engineering from the Technical University of Denmark and a B.Eng. in electrical engineering from the Université de Sherbrooke.
Related Content
Digital Transformation vs. Digital Enhancement: A Starting Decision Framework for Technology Initiatives in R&D
Leveraging advanced technology like generative AI through digital transformation (not digital enhancement) is how to get the biggest returns in scientific R&D.
Digital Transformation in Practice
There is much more to digital transformation than technology, and a holistic strategy is crucial for the journey.
Leveraging AI for More Efficient Research in BioPharma
In the rapidly-evolving landscape of drug discovery and development, traditional approaches to R&D in biopharma are no longer sufficient. Artificial intelligence (AI) continues to be a...
Utilizing LLMs Today in Industrial Materials and Chemical R&D
Leveraging large language models (LLMs) in materials science and chemical R&D isn't just a speculative venture for some AI future. There are two primary use...
Top 10 AI Concepts Every Scientific R&D Leader Should Know
R&D leaders and scientists need a working understanding of key AI concepts so they can more effectively develop future-forward data strategies and lead the charge...
Why A Data Fabric is Essential for Modern R&D
Scattered and siloed data is one of the top challenges slowing down scientific discovery and innovation today. What every R&D organization needs is a data...
Jupyter AI Magics Are Not ✨Magic✨
It doesn’t take ✨magic✨ to integrate ChatGPT into your Jupyter workflow. Integrating ChatGPT into your Jupyter workflow doesn’t have to be magic. New tools are…
Top 5 Takeaways from the American Chemical Society (ACS) 2023 Fall Meeting: R&D Data, Generative AI and More
By Mike Heiber, Ph.D., Materials Informatics Manager Enthought, Materials Science Solutions The American Chemical Society (ACS) is a premier scientific organization with members all over…
Real Scientists Make Their Own Tools
There’s a long history of scientists who built new tools to enable their discoveries. Tycho Brahe built a quadrant that allowed him to observe the…
How IT Contributes to Successful Science
With the increasing importance of AI and machine learning in science and engineering, it is critical that the leadership of R&D and IT groups at...