Got Data?


So, you have data and want to get started with machine learning. You’ve heard that machine learning will help you make sense of that data; that it will help you find the hidden gold within.

Before you start sifting through your metaphorical gold mine, you realize you still have some unanswered questions:  How do I get started? Who do I need to hire? Do I need to hire — can we learn this ourselves? What is the best machine learning approach out there? Should we do deep learning with neural networks? Oh, and by the way, how much is it going to cost?

These are all good questions, but they are not the right ones when you are just getting started. Instead, ask yourself three key questions:

  1. Do I have a good problem defined?
  2. Do I have the right data for my problem?
  3. Do I have enough data?

Once you have answers for these key questions, you need to make one key decision: Am I going to be model-centric or data-centric?

Do I have a good problem defined?

One of the key insights of Machine Learning is that it works best in a narrow, focused domain. Overly broad questions like “What is the best formulation for paint?” do not work very well. To answer such a question, a human expert would respond that “it depends”. A machine learning algorithm will gamely make predictions for whatever problem and data you throw at it. But, before making any decisions, you will need to examine the results and will most likely find that it is hard to interpret the results of overly broad questions.  In the end, after all of the data collection, time spent, and computer resources expended, the real answer will still likely be “it depends”.

To get good results in machine learning, just like in science and engineering, you have to focus your field of inquiry as much as possible, especially in an area that is not already well studied and understood. You need a narrow question like “What is the best formulation for indoor paint on static metal surfaces in an industrial warehouse built from steel that stores non-volatile chemicals in tightly sealed containers, sees temperature swings ranging from -20°C to 50°C, and sees changes in relative humidity from 15% to 70%?”

As you develop your precise problem statement, remember that the goal of a machine learning project is to build a model that solves some type of business problem. Make sure that the business aspects of that problem are focused and clear.  What is the business problem you are trying to solve?  Is it increasing the life expectancy of the steel structure of the warehouse?  Is it reducing the time needed between repaintings?  Is it reducing cracks and peeling in the paint itself?  Is it improving the visibility of the structure to avoid damage from forklifts in dimly lit conditions?  Defining the goal of the machine learning project through business-appropriate metrics and objectives, before writing a single line of code, is an extremely valuable exercise. Articulating the business focus and the economic stakes involved will not only assist your data scientists in making the appropriate model, but it will help them measure success.  Quantifying success in this way will allow the business  to calculate an ROI on the machine learning project and help determine a project budget.
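As a rough illustration of quantifying success, the sketch below computes a simple return on investment as (benefit - cost) / cost. The function name `roi` and both dollar figures are hypothetical and chosen only to show the arithmetic; a real calculation would use your own estimates.

```python
def roi(benefit, cost):
    """Return on investment as a fraction: (benefit - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost


# Hypothetical figures, for illustration only.
repaint_savings_per_year = 60_000  # assumed yearly savings from fewer repaintings
project_cost = 40_000              # assumed cost of the machine learning project

first_year_roi = roi(repaint_savings_per_year, project_cost)
print(f"First-year ROI: {first_year_roi:.0%}")
```

Running the same calculation for each candidate business goal is one way to compare them before any modeling starts.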

In the future, if the project is successful, you can consider expanding your problem domain.  Now, you will have a much clearer idea of what is involved and what it takes to be successful with machine learning.  Start your problem small, focused, narrow, and practical.

To implement the above techniques, attend Enthought’s Machine Learning Mastery Workshop.


Do I have the right data for my problem?

After you have articulated a technically precise problem and the business stakes involved, it is time to consider the data that will be needed. With a clear problem statement, it is often easier to figure out where to pan for gold, that is, what data to use.

Start by thinking about the data needed to measure success. If I am interested in the life expectancy of my warehouse, how do I plan to measure that? Perhaps I can measure the number of rust spots of at least a certain size. This may mean identifying key locations in the structure and monitoring them. Or, if I am trying to improve visibility and avoid damage from everyday activities, perhaps I need a count of warehouse accidents over a period of time. Whatever you select, keep in mind that this will be what you want the machine learning model to predict: the target. Ideally, knowing this value, or having a good proxy for it, gives you something actionable that your business can use to make decisions. The important questions here are: Is the target measurable? Is it something meaningful to all of the stakeholders? Can I turn a prediction of the target value into a decision?

Once you have a target identified, you need to figure out what data can be used to predict that target.  These are your machine learning features.  If your interest is measuring rust, data elements like temperature, humidity, underlying materials, and other physical attributes are likely to be good predictors and therefore should be collected.  If your interest is structural visibility, perhaps you need to measure light intensity in various areas of the warehouse, a level of contrast between the paint color and the background, or some values related to how well people can see the building elements.  In all cases, think about how the features you plan to collect relate to the target of interest and how to measure them.  It is not enough to simply gather the data that happens to be available. Collecting the right data is a pivotal aspect of any machine learning project and the relationship of the features to the target variable must be considered.

Another important thing to think about is making sure you can gather data that includes all possible targets (the predicted outcomes you are interested in).  In the painted warehouse scenario, do you have enough warehouses (or locations inside the warehouses you do have) with enough variation in the data to actually distinguish between the possible outcomes?  If the variance in every feature you gather is within the margin of your measurement error, do you really have any data?  If all of the outcomes are mostly the same, with only a few outliers of real interest, can you really predict those outliers?
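The two checks described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library. The readings, the sensor accuracies in `measurement_error`, and the helper names are all hypothetical.

```python
from collections import Counter
from statistics import pstdev


def low_variation_features(rows, measurement_error):
    """Return feature names whose spread is within the measurement error."""
    flagged = []
    for name, error in measurement_error.items():
        values = [row[name] for row in rows]
        if pstdev(values) <= error:
            flagged.append(name)
    return flagged


def target_balance(targets):
    """Return the fraction of samples that fall in each target class."""
    counts = Counter(targets)
    total = len(targets)
    return {label: count / total for label, count in counts.items()}


# Hypothetical readings from four warehouse locations.
rows = [
    {"temperature_c": -5.0, "humidity_pct": 40.0},
    {"temperature_c": 12.0, "humidity_pct": 40.5},
    {"temperature_c": 30.0, "humidity_pct": 39.5},
    {"temperature_c": 45.0, "humidity_pct": 40.0},
]
# Assumed sensor accuracy for each feature, in the same units as the readings.
measurement_error = {"temperature_c": 0.5, "humidity_pct": 2.0}

flagged = low_variation_features(rows, measurement_error)
balance = target_balance(["no_rust", "no_rust", "no_rust", "rust"])
print("Features with no usable variation:", flagged)
print("Share of samples per outcome:", balance)
```

Here the humidity readings vary less than the assumed sensor accuracy, so that feature is flagged. The balance report shows how rare the outcome of interest is.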

Simply having data is not enough — you have to have the right data. Make sure each feature collected has at least some predictive power or notional relationship to the problem you are trying to solve.  If this is not the case, you may need to put your ML algorithms down and start collecting the right data (or at least supplementing the data set you already have).

Interested in the tools that can be used to collect, wrangle, and analyze your data? Consider signing up for Enthought’s Python for Data Analysis or Pandas Mastery Workshop.


Do I have enough data?

For the target and the set of features that constitute the right data, you are going to need a number of samples that can be fed into a machine learning model to tune and test it.  Each sample (or row of data) connects the features (the columns) to the target.  We need to have enough of these samples to effectively train our model to make good predictions.  So, the question becomes, how do I know when I have enough data?  Is fifty samples enough?  What about a hundred?
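To make the row-and-column picture concrete, here is a small sketch that separates features from the target and holds back some samples for testing. The column names, the values, and the 25% test fraction are assumptions made for illustration.

```python
import random


def split_features_target(rows, target):
    """Separate each sample into its features and its target value."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    targets = [row[target] for row in rows]
    return features, targets


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the samples and hold back a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical samples: each row links the features to the target (rust_spots).
samples = [
    {"temperature_c": -5.0, "humidity_pct": 20.0, "paint_age_years": 1, "rust_spots": 0},
    {"temperature_c": 12.0, "humidity_pct": 45.0, "paint_age_years": 3, "rust_spots": 2},
    {"temperature_c": 30.0, "humidity_pct": 60.0, "paint_age_years": 5, "rust_spots": 6},
    {"temperature_c": 45.0, "humidity_pct": 70.0, "paint_age_years": 8, "rust_spots": 11},
]

train_rows, test_rows = train_test_split(samples)
train_features, train_targets = split_features_target(train_rows, "rust_spots")
print(len(train_rows), "training samples,", len(test_rows), "test sample(s)")
```

With only four rows this split is purely illustrative. The point is that every sample used for testing is one fewer available for training, which is part of why sample counts matter.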

Here, there is no simple, one-size-fits-all answer. At the low end, for a traditional linear model, an often-quoted rule of thumb is that we need at least 100 samples for each feature used in our predictions. So, if the features we are collecting are temperature, humidity, surface material, and age of paint (a total of 4 features), we will need at least 400 samples. Without this minimum, our model is unlikely to generalize well enough to make predictions we can use reliably with new data. In this case, the potential relationships between the features and the target (rust spots of a certain size) are fairly direct and understandable. If we can collect a data set with enough variation that is balanced across the targets we are trying to predict, we will have a reasonable chance of creating a useful model. It may not be perfectly accurate, but it should be able to make predictions that are better than a simple guess, and therefore of value.
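The rule of thumb is simple arithmetic, sketched below. The function name `minimum_samples` and the count of samples already collected are hypothetical. The 100-samples-per-feature figure is the rule of thumb quoted above, not a guarantee.

```python
def minimum_samples(n_features, samples_per_feature=100):
    """Rule-of-thumb lower bound on samples for a traditional linear model."""
    if n_features < 1:
        raise ValueError("n_features must be at least 1")
    return n_features * samples_per_feature


features = ["temperature", "humidity", "surface_material", "paint_age"]
needed = minimum_samples(len(features))

collected = 150  # hypothetical number of samples gathered so far
shortfall = max(0, needed - collected)
print(f"Need at least {needed} samples, have {collected}, short by {shortfall}")
```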

As the number of features increases, the number of samples needed will increase as well. Furthermore, as the relationships between the features and the targets become less direct (and, presumably, more complex), the number of samples needed will increase even more. In this latter case, we may also need to increase the complexity of our model. We may need to move to a computationally intensive deep learning neural network architecture that can unravel the complexity of the non-linear relationships found in the data. This, however, comes with a catch: such a model can only do so with orders of magnitude more samples. To model a group of 4 features, you may need 40,000 or 400,000 samples, depending on the complexity of the underlying relationships.

Once you get into the realm of tens or hundreds of thousands of samples, you will need to consider the cost of collecting that data and the time needed to do so.  When costs and time are prohibitive, you still have options, but they will require more time, expertise, and resources to explore.  You may need to start with a core set of data and see if you can supplement that with data from simulations.  If you are working with a problem that has a small initial set of data that can be increased over time with further measurements, it is also possible to put a model into production to make predictions earlier than you might like, but then continually retrain the model as new data become available.
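The last option, deploying early and retraining as data arrives, can be sketched with a deliberately simple model. The example below refits a one-feature least-squares line every time a new batch of measurements is added. The class name, the single feature, and the data values are all hypothetical. A production system would use a real modeling library and would validate each retrained model before relying on it.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


class RetrainingModel:
    """Keeps every sample seen so far and refits whenever a batch arrives."""

    def __init__(self):
        self.xs, self.ys = [], []
        self.slope = self.intercept = None

    def add_batch(self, xs, ys):
        self.xs.extend(xs)
        self.ys.extend(ys)
        self.slope, self.intercept = fit_line(self.xs, self.ys)

    def predict(self, x):
        return self.slope * x + self.intercept


model = RetrainingModel()

# Hypothetical data: paint age in years versus rust spots counted.
model.add_batch([1, 2, 3], [2, 4, 7])
early = model.predict(5)   # prediction from the small initial data set

model.add_batch([4, 5, 6], [8, 10, 12])
later = model.predict(5)   # prediction after retraining on more data

print(f"Early prediction: {early:.2f}, after retraining: {later:.2f}")
```

The early prediction extrapolates from three points, while the later one is based on six. The trade-off of this approach is that early predictions are less trustworthy and should be treated accordingly.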

Unless you have a really simple problem (and probably don’t need machine learning at all), fifty or a hundred samples is not really enough data.  While you might start exploring the possibilities and building some skills, your best approach is to figure out how to build a larger data set related to your problem before you go any further.

Want to know more about deep learning and neural network architectures? Attend Enthought’s Practical Deep Learning for Scientists and Engineers course.


Am I going to be model-centric or data-centric?

Early on you will face a choice between being model-centric or data-centric.  A model-centric approach is often driven by the current state-of-the-art in machine learning, simply because its practitioners are excited and inspired by cutting edge approaches.  They are often eager to try new models and see how they work.  This is like having a tool you really want to use and attempting to find problems that it can solve.  While great for pure research, it can encourage you to ignore perfectly good, well-understood techniques for solving your problem.

Right now, deep learning approaches are extremely popular and desired by many organizations. However, these require extensive computational power and deep datasets (to go with that deep learning).  In some cases — for instance with a data set that has many samples, but is not well understood — this may be the only practical approach.  However, if you are not in that situation, it can be a detour away from a varied menu of possible, cheaper, and less data hungry solutions. 

Instead of being model-centric, be data-centric.  Focus on the data you have available and can collect; focus on the business problems you want to solve.  Then, when those are well-defined, select the appropriate machine learning modeling approach.  You will have more options, ranging from old but well-understood techniques, to new cutting-edge deep learning research barely out of the lab.

Put your initial efforts into understanding the data you have and what problems it might help you solve. Data can be expensive and time-consuming to collect, so getting started may be a matter of making use of the data at hand for a proof of concept and then expanding your data collection efforts as your problem statements evolve and your successes rack up.

If, on the other hand, you don’t already have a lot of data, being data-centric means getting very focused on a problem and figuring out the right data to collect for that problem. Focus your efforts on a narrow problem and on collecting data in a sustainable way. Any automation you can implement can improve both the quality and quantity of your data.

Finally, build curated data sets.  Take the data you have or are actively collecting and make sure that it is well understood. This means building out the metadata so that someone new can understand the nature of that data and its limitations. It also means doing feature engineering in a documented and consistent way. Having curated data sets will allow you to then focus on trying different models and seeing which machine learning approaches work best with the curated data you have. Good data sets can be used with both old, established algorithms and the latest deep learning architectures.
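One lightweight way to build out metadata is to keep a small, structured description next to the data itself. The sketch below is only an illustration. The class names, fields, and example columns are hypothetical and do not follow any particular metadata standard.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnInfo:
    name: str
    role: str               # "feature" or "target"
    unit: str
    description: str
    limitations: str = ""   # known caveats a newcomer should be aware of


@dataclass
class DatasetMetadata:
    title: str
    business_problem: str
    columns: list = field(default_factory=list)

    def undocumented(self, data_columns):
        """Return the data columns that have no metadata entry yet."""
        documented = {col.name for col in self.columns}
        return sorted(set(data_columns) - documented)


metadata = DatasetMetadata(
    title="Warehouse paint survey",
    business_problem="Reduce the time needed between repaintings",
    columns=[
        ColumnInfo("temperature_c", "feature", "degrees Celsius",
                   "Air temperature at the monitored location",
                   limitations="Sensor accuracy is assumed, not calibrated"),
        ColumnInfo("rust_spots", "target", "count",
                   "Rust spots above the agreed minimum size"),
    ],
)

missing = metadata.undocumented(["temperature_c", "humidity_pct", "rust_spots"])
print("Columns still lacking documentation:", missing)
```

A check like `undocumented` can run whenever the data set changes, so that new columns do not slip in without a description.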

Quality data can be used for years to come, with lots of different machine learning technologies. In contrast,  machine learning is a relatively young field that is changing rapidly. The techniques and tools used today are expected to change and evolve over time. Good data — like a curated and cleaned dataset collected for a specific business problem — can help us find gold for years to come. Got data?

About the Authors

Logan Thomas holds an M.S. in mechanical engineering with a minor in statistics from the University of Florida and a B.S. in mathematics from Palm Beach Atlantic University. Logan has worked as a data scientist and machine learning engineer in both the digital media and protective engineering industries. His experience includes discovering relationships in large datasets, synthesizing data to influence decision making, and creating/deploying machine learning models.

Eric Olsen holds a Ph.D. in history from the University of Pennsylvania, an M.S. in software engineering from Pennsylvania State University, and a B.A. in computer science from Utah State University. Eric spent three decades working in software development in a variety of fields, including atmospheric physics research, remote sensing and GIS, retail, and banking. In each of these fields, Eric focused on building software systems to automate and standardize the many repetitive, time-consuming, and unstable processes that he encountered.
