Feature Engineering: Secret to data science success

by | Nov 14, 2017

Prior to starting Feature Labs, I researched data science automation in the Data to AI Lab at MIT. Unlike most data scientists who work in a single domain, our group had sponsors from a wide range of industries. This gave us the unique opportunity to develop innovative solutions to the diverse problems we worked on.

Yet, regardless of the problem, we realized that our biggest challenge was cutting down the amount of time it took to reach a solution. To expedite the process, we utilized databases to store and query data, open-source machine learning libraries to train our AI systems, and scalable clouds to operate our platform effectively.

Although we used every tool at our disposal, it still took three months on average to develop a single end-to-end solution. As engineers, we knew one thing was certain: there must be a better way.

The Feature Engineering Bottleneck

The data we received was not only extremely detailed, but often distributed across multiple files or database tables. While we could easily describe its potential, we still had to manually prepare the data for the machine learning algorithms. These algorithms need data to be in a single table, with training examples in the rows and the explanatory variables (also known as features) in the columns.

This data representation for machine learning is called the “feature matrix.” And “feature engineering” is the process of identifying and extracting predictive features in the complex data that enterprises typically work with.
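To make the idea concrete, here is a minimal sketch in plain Python of turning two related tables into a feature matrix by hand. The `customers` and `transactions` tables, their column names, and the three aggregate features are all invented for illustration; they do not come from any dataset we worked with.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw data spread across two tables (illustrative values only).
customers = [
    {"customer_id": 1, "signup_year": 2015},
    {"customer_id": 2, "signup_year": 2016},
]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 15.0},
]


def build_feature_matrix(customers, transactions):
    """Return one row per customer with hand-written aggregate features."""
    # Group transaction amounts by the customer they belong to.
    amounts = defaultdict(list)
    for t in transactions:
        amounts[t["customer_id"]].append(t["amount"])

    rows = []
    for c in customers:
        a = amounts.get(c["customer_id"], [])
        rows.append({
            "customer_id": c["customer_id"],
            "signup_year": c["signup_year"],
            "num_transactions": len(a),
            "total_amount": sum(a),
            "mean_amount": mean(a) if a else 0.0,
        })
    return rows


# One training example per row, one feature per key.
feature_matrix = build_feature_matrix(customers, transactions)
for row in feature_matrix:
    print(row)
```

Every feature here had to be thought of and coded by a person, which is exactly the manual effort that makes this step slow.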

Feature engineering is challenging because it depends on leveraging human intuition to interpret implicit signals in datasets that machine learning algorithms use. Consequently, feature engineering is often the determining factor in whether a data science project is successful or not. Stanford Professor Andrew Ng accurately said, “…applied machine learning is basically feature engineering.”

The importance of feature engineering is clear. Unfortunately, it is frequently the bottleneck in the data science process because it requires both domain expertise to brainstorm ideas and the technical expertise to implement them.

This means that, in most enterprise organizations today, only a select group of people can develop the best predictive models.

Inventing Feature Engineering Automation

Renowned machine learning professor Pedro Domingos says, “one of the holy grails of machine learning is to automate more and more of the feature engineering process.” In 2014, we made it our goal to create an approach to automate feature engineering for real-world datasets.

Our work at MIT became the “Data Science Machine,” which was used to compete in key data science competitions. The results at the time showed the potential of what we were building — our system outperformed 615 of the 906 human teams that we competed against.

We accomplished this with an approach that took hours, instead of weeks, to run. While the system used many advanced techniques, the innovation integral to our success was Deep Feature Synthesis, our algorithm for automated feature engineering. We demonstrated that our system could reach human-level performance. But our goal was never to replace human data scientists. Rather, we sought to augment their work.
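The general flavor of automating this step can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the Deep Feature Synthesis algorithm itself: it only applies a fixed set of aggregation functions to every numeric column of one child table. The function `synthesize_features`, the `PRIMITIVES` set, the feature naming scheme, and the sample tables are all assumptions made for this example.

```python
from statistics import mean

# Aggregation primitives: reusable functions that summarize a list of values.
# This particular set is an assumption chosen for illustration.
PRIMITIVES = {"SUM": sum, "MEAN": mean, "MAX": max, "COUNT": len}


def synthesize_features(parents, children, key, child_name, numeric_columns):
    """Apply every primitive to every numeric child column, per parent row."""
    feature_matrix = []
    for parent in parents:
        # Collect the child rows that belong to this parent.
        related = [c for c in children if c[key] == parent[key]]
        row = dict(parent)
        for column in numeric_columns:
            values = [c[column] for c in related]
            for name, func in PRIMITIVES.items():
                feature = f"{name}({child_name}.{column})"
                # Leave the feature empty when a parent has no child rows.
                row[feature] = func(values) if values else None
        feature_matrix.append(row)
    return feature_matrix


# Hypothetical tables (illustrative values only).
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
transactions = [
    {"customer_id": 1, "amount": 20.0, "items": 2},
    {"customer_id": 1, "amount": 40.0, "items": 1},
    {"customer_id": 2, "amount": 15.0, "items": 5},
]

dfs_matrix = synthesize_features(
    customers, transactions, "customer_id", "transactions", ["amount", "items"]
)
for row in dfs_matrix:
    print(row)
```

In this sketch, adding one primitive or one column yields new candidate features without any new hand-written feature code, which is the kind of leverage automation is meant to provide.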

According to Gartner, even if an “organization already has a data science team … it may need to be enhanced with even more specialized data science skills specific to machine learning, such as feature engineering and feature extraction.” With Deep Feature Synthesis, we could make it easier for people to not only learn data science, but apply it too.

Feature Labs: making machine learning more accessible

Companies today have a growing number of machine learning needs, but they face a shortage of data scientists who can solve them. Paradoxically, more people than ever before are interested in learning the data science process, but they don’t have access to the tools that make it easy or even feasible to learn.

Feature Labs is committed to dramatically increasing access to machine learning by making automated feature engineering a core part of our product. In the last few months, we have:

Through these initiatives we want to lower the barrier to entry for those who want to help companies address their machine learning difficulties and become the data scientists the business world so desperately needs. With Feature Labs, building high-performance machine learning models to generate value for businesses is accessible to everyone.




Get in touch

Feature Labs is a predictive analytics platform created to make data science automation a strategic component of any organization. Contact us to learn how we can help you succeed with data science and predictive modeling endeavors.
