4 Reasons to Learn Xarray and Awkward Array—for NumPy and Pandas Users

You know it. We know it.
NumPy is cool. Pandas is cool.

We can bend them to our will, but sometimes they’re not the right tools for the job.
Enter Xarray and Awkward Array.

Read on for four reasons to learn these Python packages.

Reason 1:  You need labeled arrays of more than two dimensions

NumPy arrays are great for handling multi-dimensional data. Referring to dimensions and positions by integer works, and it’s predictable and fast, but it can also be tedious and error-prone: you have to remember which axis is which.

Pandas’ labeled columns and rows are wonderfully expressive. You get the data you want without knowing where it sits in the array. Labels also enable neat, ergonomic operations like slicing and resampling time series. You can put multi-dimensional data in a DataFrame with a MultiIndex, but it’s not particularly natural.

Enter Xarray: multi-dimensional labeled arrays. It’s like a cross between NumPy and Pandas. Like NumPy, it has vectorized operations and broadcasting. And like Pandas, it can do GroupBy operations, labeled indexing, and database-like operations. Xarray is domain agnostic. It excels at gridded datasets in the geosciences but is also great for other physical sciences, genomics, and finance. Any time your multi-dimensional data has labels that encode information about how the array values map to locations in space or time, think about Xarray.

    >>> import numpy as np
    >>> import pandas as pd
    >>> import xarray as xr

    >>> rng = np.random.default_rng(seed=42)
    >>> times = pd.date_range('2023-01-01', periods=3)

    # A single experiment containing 3 trials
    # across 4 different specimens
    >>> xr_array = xr.DataArray(
    ...     data=rng.integers(100, size=(3,4)),
    ...     dims=('time', 'sample'),
    ...     coords={
    ...         'time': times,
    ...         'sample': list('abcd')
    ...     },
    ...     attrs={'created_by': 'alex'},
    ...     name='experiment1'
    ... )

    >>> xr_array
    <xarray.DataArray 'experiment1'
      (time: 3, sample: 4)>
    array([[ 8, 77, 65, 43],
           [43, 85,  8, 69],
           [20,  9, 52, 97]])
    Coordinates:
    * time     (time) datetime64[ns]
        2023-01-01 2023-01-02 2023-01-03
    * sample   (sample) <U1 'a' 'b' 'c' 'd'
    Attributes:
        created_by: alex
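
To see what the labels buy you, here is a minimal sketch that rebuilds the same array from its printed values (so it does not depend on the random generator) and compares positional NumPy access with label-based Xarray access. The variable names are illustrative, not part of either library.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same values as the array printed above, written out explicitly.
values = np.array([[8, 77, 65, 43],
                   [43, 85, 8, 69],
                   [20, 9, 52, 97]])

xr_array = xr.DataArray(
    data=values,
    dims=('time', 'sample'),
    coords={
        'time': pd.date_range('2023-01-01', periods=3),
        'sample': list('abcd'),
    },
    name='experiment1',
)

# NumPy: you must remember that axis 1 is 'sample' and that 'b' is column 1.
b_numpy = values[:, 1]

# Xarray: select by label, no positions needed.
b_xarray = xr_array.sel(sample='b')
second_day = xr_array.sel(time='2023-01-02')

# Reduce over a dimension by name instead of by axis number.
mean_over_time = xr_array.mean(dim='time')

print(b_xarray.values)                         # [77 85  9]
print(second_day.values)
print(mean_over_time.sel(sample='b').item())   # 57.0
```

The NumPy and Xarray selections return the same numbers. The difference is that the Xarray version still reads correctly if someone later reorders the dimensions or the samples.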

Reason 2:  Sometimes your arrays are not all the same length

NumPy arrays can be of any size and dimension, but NumPy requires that the number of elements in each dimension is constant. In other words, the arrays cannot be jagged or ragged. You can put sequences of different lengths in a NumPy array, but you end up with an array of objects, which negates the speed advantages of using NumPy. Also, NumPy requires that all the elements be of the same type. Even with structured arrays, each record must have the same structure.

Enter Awkward Array: nested, variable-sized, and mixed-type data using NumPy-like idioms. Like NumPy, it has powerful indexing capabilities and fast, vectorized computation (including support for NumPy’s universal functions). But Awkward Array generalizes to the tricky kinds of data that NumPy struggles to work with. For example, it can perform reductions over lists of varying length.

Like Xarray, Awkward Array is domain agnostic, but it came out of high-energy physics and is particularly well suited to that field’s use cases.

    >>> import awkward as ak
    >>> ak_array = ak.Array(
    ...     [
    ...         [1, 2, 3],
    ...         [4],
    ...         [5, 6]
    ...     ]
    ... )

    >>> ak_array.show()
    [[1, 2, 3],
     [4],
     [5, 6]]

    # Like NumPy, a reduction with no axis covers every element
    >>> ak.sum(ak_array)
    21

    # Across rows, one per column
    >>> print(ak.sum(ak_array, axis=0))
    [10, 8, 3]

    # Across columns, one per row
    >>> print(ak.sum(ak_array, axis=1))
    [6, 4, 11]

Reason 3:  You (sort of) already know them

Xarray and Awkward Array use NumPy and Pandas idioms for slicing, indexing, and reductions, so the learning curve is gentle. Xarray offers the familiar slicing by integer position (from NumPy) and by labels (from Pandas), but it also lets you refer to a dimension by name, which is nice. Awkward Array provides the usual positional indexing and adds awkward indexing to let you pull out a different number of items for each sublist.

Xarray exposes reductions and sorting as methods, like da.sum() and da.argsort(). Awkward Array uses functions to which you pass your array, like ak.sum(array), but otherwise uses the same names.

Reason 4:  Because they integrate with the tools you know (and love)

Xarray and Awkward Array have methods to convert or extract their data to NumPy and Pandas. Creating new objects from NumPy arrays and Pandas DataFrames is also easy when needed.

For plotting, Xarray integrates directly with Matplotlib through a thin wrapper. It’s another case of “you already know it.” Awkward Array doesn’t have such a tight integration but implements the __array__ protocol, so any library that expects NumPy arrays, such as Matplotlib, can use Awkward Arrays without any changes, provided the data is regular or has been flattened first.

Conclusion:  Don’t you want a free lunch?

NumPy and Pandas are powerful, general-purpose tools. Xarray and Awkward Array are slightly more specialized, yet they hit the sweet spot of solving problems that a large swath of scientists and engineers face in their data analysis work. As a result, they’re worth adding to your scientific computing toolbox. Learning them feels like an extension of what you already know rather than a brand-new thing.

Are you interested in learning about Xarray and Awkward Array but still trying to figure out Pandas? Then check out our upcoming course Data Analysis with Pandas for Scientists and Engineers. It’s a hands-on guide through the data analysis workflow with Pandas at its core.

 

Author: Alexandre Chabot-Leclerc, Vice President, Digital Transformation Solutions, holds a Ph.D. in electrical engineering and an M.Sc. in acoustics engineering from the Technical University of Denmark and a B.Eng. in electrical engineering from the Université de Sherbrooke. He is passionate about transforming people and the work they do. He has taught the scientific Python stack and machine learning to hundreds of scientists, engineers, and analysts at the world’s largest corporations and national laboratories. After seven years in Denmark, Alexandre is totally sold on commuting by bicycle. If you have any free time you’d like to fill, ask him for a book, music, podcast, or restaurant recommendation.
