Data With Purpose | Fernando Carvalho

Empowering Decisions Through Data Science & Analysis

Key lessons from “Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists” by Alice Zheng and Amanda Casari

My Thoughts

Feature engineering is the art of selecting, transforming, and creating features that empower machine learning models to deliver remarkable results. “Feature Engineering for Machine Learning” by Alice Zheng and Amanda Casari is a guide through the intricacies of feature engineering and how it shapes the performance of machine learning models. This post highlights the key insights from the book: the importance of feature selection, scaling, and the role of features in enhancing model performance.

This book offers a rich tapestry of insights into the world of feature engineering. It underscores the importance of understanding the specific needs of different machine learning models, the role of research-driven ideas, and the need to make informed decisions about feature selection and scaling. By mastering feature engineering, you can optimize the performance of your machine learning models and unlock their true potential for solving real-world problems. The book invites practitioners to explore the ever-evolving landscape of the field, which demands adaptability and informed choices in the quest for superior machine learning outcomes.

After everything I read, it became clear to me that the models we run are just the cherry on top. Most of the work should go into the data cleaning and feature engineering steps. If good data comes in, any model you choose will give you good results: maybe not optimal, but good enough. If bad data comes in, no matter what you do or which model you use, garbage will always come out.


Key lessons

The Role of Research Ideas:

Feature engineering involves exploring various research ideas, such as random projections, complex text featurization models like word2vec and Brown clustering, and latent space models like Latent Dirichlet Allocation and matrix factorization. These concepts demonstrate the continuous evolution of feature engineering in response to the ever-changing landscape of machine learning.

Scaling for Specific Models:

The book emphasizes the importance of understanding which models benefit from scaled features. For models that rely on distance metrics like k-means clustering, nearest neighbors methods, radial basis function (RBF) kernels, and those using Euclidean distance, feature scaling is crucial. Normalizing the features ensures that the model’s output remains on an expected scale.
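The point above can be sketched in plain Python. Below is a minimal z-score standardization applied to two features on very different scales; the feature names and values are illustrative, not taken from the book:

```python
# Standardize two features measured on very different scales so that
# Euclidean distance treats them comparably.
import math

def standardize(column):
    """Scale values to zero mean and unit variance (z-scores)."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    return [(x - mean) / math.sqrt(var) for x in column]

income = [30_000, 45_000, 60_000, 90_000]  # dominates raw distances
age = [25, 32, 41, 58]                     # nearly invisible before scaling

income_z = standardize(income)
age_z = standardize(age)
# After standardization both features contribute on the same scale, so
# distance-based models (k-means, k-NN, RBF kernels) weigh them fairly.
```

Without this step, a raw Euclidean distance between two people would be driven almost entirely by the income column, simply because its numbers are larger.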

Space-Partitioning Trees:

Certain models, such as decision trees, gradient-boosted machines, and random forests, are not sensitive to feature scale. Understanding which models require scaled features and which do not is key to efficient feature engineering.
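One way to see why tree-based models are scale-insensitive: a split threshold is chosen by ordering the samples, and any monotonic rescaling preserves that ordering, so the resulting partition is identical. A toy one-feature stump (illustrative data, not from the book):

```python
# Sketch of scale invariance in space-partitioning trees: the best
# single-threshold split induces the same sample partition before and
# after a monotonic rescaling of the feature.

def best_split_partition(xs, ys):
    """Return the (left, right) label partition of the best single split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best, best_err = None, float("inf")
    for cut in range(1, len(xs)):
        left = [ys[order[i]] for i in range(cut)]
        right = [ys[order[i]] for i in range(cut, len(xs))]
        # Misclassification error if each side predicts its majority label.
        err = sum(y != max(set(left), key=left.count) for y in left) \
            + sum(y != max(set(right), key=right.count) for y in right)
        if err < best_err:
            best_err, best = err, (tuple(left), tuple(right))
    return best

xs = [1.0, 2.0, 3.0, 10.0]
ys = [0, 0, 1, 1]
raw = best_split_partition(xs, ys)
scaled = best_split_partition([x * 1000 for x in xs], ys)  # same partition
```

Multiplying the feature by 1000 changes the threshold value but not which samples fall on each side, which is exactly why scaling buys nothing for these models.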

Numeric Feature Distribution:

Feature engineering also involves considering the distribution of numeric features. The distribution summarizes the probability of a feature taking on a particular value, and this knowledge can influence feature selection and transformation.
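A common practical response to a heavy-tailed distribution is a log transform. The sketch below uses a crude mean-over-median ratio as a skew check; the data and the threshold of what counts as “skewed” are illustrative, not from the book:

```python
# Inspect the distribution of a skewed numeric feature and apply a log
# transform so heavy tails do not dominate downstream models.
import math

counts = [1, 2, 2, 3, 5, 8, 20, 150, 900]  # heavy right tail

def skew_ratio(values):
    """Crude skew check: how far the mean sits above the median."""
    s = sorted(values)
    median = s[len(s) // 2]
    mean = sum(values) / len(values)
    return mean / median

log_counts = [math.log1p(x) for x in counts]
# log1p compresses the tail: the mean/median ratio drops sharply,
# so the transformed feature is far closer to symmetric.
```

Here `log1p` (log of 1 + x) is used instead of a plain log so that zero counts remain valid inputs.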

Quantization for Feature Scaling:

To manage feature scale effectively, the book introduces the concept of quantization. Quantization involves grouping counts into bins, effectively mapping continuous numeric values to discrete ones. This technique can help represent data more efficiently and manage feature scale.
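Quantization can be as simple as binning counts by powers of ten, so wildly different magnitudes collapse into a handful of discrete levels. A minimal sketch (the bin scheme and data are illustrative):

```python
# Quantization: map continuous positive counts into a small number of
# discrete bins, here indexed by the power of ten of each count.
import math

def quantize_log10(count):
    """Map a positive count to the exponent of its power-of-ten bin."""
    return int(math.log10(count)) if count >= 1 else 0

clicks = [3, 17, 240, 1_050, 98_000]
bins = [quantize_log10(c) for c in clicks]
# 3 -> bin 0, 17 -> bin 1, 240 -> bin 2, 1_050 -> bin 3, 98_000 -> bin 4
```

A model now sees five well-behaved ordinal levels instead of raw counts spanning five orders of magnitude.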

Handling Sparse Features with Caution:

Sparse features, often encountered in real-world datasets, require careful consideration during scaling. The book advises caution when performing min-max scaling and standardization on sparse features to avoid unintended consequences.
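The unintended consequence is easy to demonstrate: subtracting the mean turns every zero into a nonzero value, destroying sparsity, whereas dividing by the maximum absolute value preserves the zeros. A sketch with toy data (not from the book):

```python
# Why mean-centering hurts sparse features, and a scale-only alternative.

sparse = [0, 0, 0, 4, 0, 0, 10, 0]

def mean_centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]      # every zero becomes -mean

def max_abs_scaled(values):
    peak = max(abs(v) for v in values) or 1
    return [v / peak for v in values]      # zeros stay zero

dense_result = mean_centered(sparse)       # sparsity destroyed
safe_result = max_abs_scaled(sparse)       # sparsity preserved
```

For a dataset stored in a sparse matrix format, densifying every zero this way can blow up memory by orders of magnitude, which is why scale-only approaches are the cautious default.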

Feature Selection and Model Scoring Time:

Feature selection is not solely about reducing training time but primarily aimed at reducing model scoring time. It helps ensure that the model can make predictions swiftly, which is crucial in many real-time applications.
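A minimal illustration of this idea is a variance-threshold filter: dropping near-constant columns before deployment means fewer values to compute per prediction, which is what cuts scoring time. The threshold and data below are illustrative, not from the book:

```python
# Variance-threshold feature selection: keep only columns that actually
# vary, so the deployed model computes fewer features per prediction.

def variance(column):
    mean = sum(column) / len(column)
    return sum((x - mean) ** 2 for x in column) / len(column)

def select_columns(rows, threshold=0.0):
    """Return indices of columns whose variance exceeds the threshold."""
    columns = list(zip(*rows))
    return [i for i, col in enumerate(columns) if variance(col) > threshold]

data = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 0.0],
    [3.0, 5.0, 0.0],
]
kept = select_columns(data)  # column 0 varies; columns 1 and 2 are constant
```

This is the simplest member of the family; wrapper and embedded methods make smarter choices at higher cost, but the deployment-time payoff is the same.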

The Broader Landscape of Feature Selection:

The book acknowledges that a comprehensive exploration of feature selection is beyond its scope. Interested readers are encouraged to refer to the survey paper by Guyon and Elisseeff (2003) for a more in-depth look at this critical aspect of feature engineering.

Importance of Semantics in Feature Engineering:

The book touches on the significance of using meaningful features. Individual pixels in images, for instance, do not carry enough semantic information for meaningful analysis. Feature engineering should focus on creating features that capture the essential characteristics of the data.

Next steps: Fourier Analysis, Eigen Analysis, and Random Features

The book doesn’t delve into Fourier analysis for audio data, but the connection highlights the importance of understanding the mathematical foundations that underlie feature engineering.
Random features are mentioned but not covered in depth. They are closely related to Fourier analysis, and together they emphasize that feature engineering is about making strategic choices when selecting features for a given problem.

This opens a new line of study: it seems I need to explore these topics further in order to develop into a better feature engineer.

Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists
