Feature Engineering


Feature Engineering Frameworks

Featuretools: Automated feature engineering with deep feature synthesis

Key Features
  • Deep feature synthesis (DFS) for automated feature generation
  • Entity relationship modeling
  • Temporal aggregations across entities
  • Feature primitives library
  • Parallel computation support
  • Custom primitive creation
Use Cases
  • Automated feature discovery
  • Relational data feature engineering
  • Time-series feature generation
  • Rapid prototyping and experimentation
Similar Technologies
tsfresh, Kats, Manual engineering, AutoML features
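Deep feature synthesis stacks aggregation and transform primitives across related tables. A minimal pandas sketch of the kind of depth-1 feature DFS generates automatically (toy customer/transaction tables assumed; the `SUM`/`MEAN`/`COUNT` names mirror Featuretools primitives):

```python
import pandas as pd

# Toy relational data: a parent table (customers) and a child (transactions)
customers = pd.DataFrame({"customer_id": [1, 2]})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 30.0, 5.0, 5.0, 20.0],
})

# Depth-1 DFS features: aggregation primitives applied to the child
# table, grouped by the relationship key, then joined to the parent.
agg = (transactions.groupby("customer_id")["amount"]
       .agg(SUM_amount="sum", MEAN_amount="mean", COUNT_amount="count")
       .reset_index())
feature_matrix = customers.merge(agg, on="customer_id")
```

Featuretools generalizes this pattern: it walks every relationship in the entity set and composes primitives to arbitrary depth, which is what makes it useful for exploration on multi-table data.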
tsfresh: Time-series feature extraction library

Key Features
  • 794 time-series features out-of-the-box
  • Statistical significance testing
  • Parallel execution with Dask
  • Feature selection based on relevance
  • Comprehensive statistical features
  • Pandas and Dask integration
Use Cases
  • Time-series ML problems
  • Sensor data analysis
  • Financial time-series
  • Anomaly detection features
Similar Technologies
Featuretools, Kats, Sktime, Manual feature engineering
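tsfresh's value is computing a large battery of per-series statistics in one call. A few representative features, hand-rolled in NumPy on a toy signal (names loosely follow tsfresh's, e.g. `abs_energy`; this is a sketch, not tsfresh's API):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 3.0, 5.0])  # toy time series

# A handful of the statistical features tsfresh extracts per series
features = {
    "mean": x.mean(),
    "standard_deviation": x.std(),
    "abs_energy": np.sum(x ** 2),                   # sum of squared values
    "mean_abs_change": np.mean(np.abs(np.diff(x))),  # average step size
    "count_above_mean": int(np.sum(x > x.mean())),
}
```

tsfresh computes hundreds of such features per series and then applies hypothesis tests to keep only the ones relevant to the target.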
Category Encoders: Scikit-learn compatible categorical encoders

Key Features
  • 15+ encoding methods
  • Target encoding with regularization
  • Leave-one-out encoding
  • CatBoost encoder
  • Bayesian target encoding
  • Prevents target leakage
Use Cases
  • High-cardinality categorical features
  • Scikit-learn ML pipelines
  • Preventing target leakage
  • Tree-based and linear models
Similar Technologies
Scikit-learn encoders, Pandas get_dummies, Manual encoding
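The core idea behind regularized target encoding can be shown in plain pandas: shrink each category's target mean toward the global mean, so rare categories don't leak their (noisy) targets into the feature. The smoothing weight `m` is a hypothetical hyperparameter choice:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "NY", "NY", "SF", "SF", "LA"],
    "target": [1, 1, 0, 0, 0, 1],
})

m = 2.0                                  # smoothing strength (hyperparameter)
global_mean = df["target"].mean()
stats = df.groupby("city")["target"].agg(["mean", "count"])

# Shrink each category mean toward the global mean; small categories
# get pulled harder, which limits leakage from tiny groups.
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["city_encoded"] = df["city"].map(smoothed)
```

In a real pipeline the encoding must be fit on the training fold only (or with leave-one-out / cross-fold schemes, as Category Encoders provides) to avoid target leakage.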
Sktime: Unified time-series ML framework

Key Features
  • Time-series forecasting algorithms
  • Classification and regression
  • Feature transformations
  • Composable pipelines
  • Probabilistic predictions
  • Model selection utilities
Use Cases
  • Time-series forecasting
  • Time-series feature engineering
  • Probabilistic predictions
  • Unified time-series workflows
Similar Technologies
Prophet, statsmodels, tsfresh, ARIMA libraries
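One technique sktime wraps is "reduction": recasting forecasting as tabular regression on lag features. A minimal NumPy sketch of that idea on a toy series (least squares standing in for any regressor):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])  # toy series

# Reduction to regression: predict y[t] from lags y[t-2], y[t-1]
n_lags = 2
X = np.column_stack([y[i:len(y) - n_lags + i] for i in range(n_lags)])
target = y[n_lags:]

# Fit ordinary least squares with an intercept column
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, target, rcond=None)

# One-step-ahead forecast from the last two observations
next_val = np.array([y[-2], y[-1], 1.0]) @ coef
```

sktime's forecasting API packages exactly this recipe (plus recursive multi-step strategies) behind a unified `fit`/`predict` interface so any tabular regressor can forecast.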

Common Feature Transformations

Technique | Description | Use Case | Implementation | Libraries
--- | --- | --- | --- | ---
Scaling/Normalization | StandardScaler, MinMaxScaler, RobustScaler | Gradient-based models, distance-based algorithms | (x - mean) / std | Scikit-learn
Encoding Categoricals | One-hot, target, ordinal, hash, embedding | Convert categories to numerical representations | Label → Vector | Category Encoders, Pandas
Binning/Discretization | Group continuous values into bins | Capture non-linear relationships, reduce noise | Age → Age_group | Pandas cut, qcut
Log Transformation | log(x), log1p(x) for skewed data | Reduce skewness, normalize distributions | np.log1p(x) | NumPy, Pandas
Polynomial Features | x1, x2 → x1², x1*x2, x2² | Capture non-linear relationships in linear models | Degree-2 expansion | Scikit-learn
Interaction Features | Combine 2+ features multiplicatively | Capture feature relationships | Price * Quantity | Manual, PolynomialFeatures
Aggregations | Sum, mean, count, std over groups | Summarize entity behavior patterns | User → Avg_purchase | Pandas groupby, SQL
Time-based Features | Hour, day, month, is_weekend, time_since | Capture temporal patterns and seasonality | Datetime extraction | Pandas dt accessor
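Several rows of the table above combined in one pandas snippet (toy frame; column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 200.0, 3000.0],
    "quantity": [5, 2, 1],
    "age": [15, 34, 72],
    "ts": pd.to_datetime(["2024-01-06", "2024-03-15", "2024-07-01"]),
})

# Scaling: (x - mean) / std
df["price_scaled"] = (df["price"] - df["price"].mean()) / df["price"].std()
# Log transform for right-skewed values
df["price_log"] = np.log1p(df["price"])
# Binning a continuous value
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 65, 120],
                         labels=["minor", "adult", "senior"])
# Interaction feature
df["revenue"] = df["price"] * df["quantity"]
# Time-based features via the dt accessor
df["month"] = df["ts"].dt.month
df["is_weekend"] = df["ts"].dt.dayofweek >= 5
```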

Feature Selection Methods

Filter Methods (Fast)

Statistical tests independent of model

  • Correlation analysis: Remove highly correlated features
  • Chi-squared test: Categorical features with categorical target
  • ANOVA F-test: Numerical features with categorical target
  • Mutual information: Non-linear relationships
  • Variance threshold: Remove low-variance features

Tools: Scikit-learn feature_selection
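A minimal correlation filter in pandas: drop one member of each highly correlated pair. The 0.9 threshold and column names are illustrative choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + 0.01 * rng.normal(size=200),  # nearly duplicates "a"
    "b": rng.normal(size=200),
})

# Upper triangle of the absolute correlation matrix
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column correlated above the threshold with an earlier one
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)
```

Because this never looks at the target or trains a model, it is cheap enough to run as a first pass before the slower wrapper or embedded methods below.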

Wrapper Methods (Accurate)

Search feature subsets using model performance

  • Recursive Feature Elimination (RFE): Iteratively remove features
  • Forward selection: Add features one by one
  • Backward elimination: Remove features one by one
  • Sequential Feature Selection: Scikit-learn implementation
  • Genetic algorithms: Evolutionary feature selection

Tools: Scikit-learn RFE, mlxtend
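Forward selection can be sketched with plain least squares as the wrapped model. This toy version scores by training R² on synthetic data (a real pipeline would use cross-validated scores); the 0.01 stopping tolerance is an assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 4))
# Only features 0 and 2 actually drive the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=n)

def r2(features):
    """Training R^2 of a least-squares fit on the given feature subset."""
    A = np.column_stack([X[:, features], np.ones(n)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    # Greedily add the feature that improves the score the most
    scores = {j: r2(selected + [j]) for j in remaining}
    best = max(scores, key=scores.get)
    if selected and scores[best] - r2(selected) < 0.01:  # stopping tolerance
        break
    selected.append(best)
    remaining.remove(best)
```

The loop recovers features 0 and 2 and stops. The cost profile is visible here: each round refits the model once per remaining candidate, which is why wrapper methods are accurate but expensive.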

Embedded Methods (Efficient)

Feature selection during model training

  • L1 regularization (Lasso): Sparse feature selection
  • Tree-based importance: Random Forest, XGBoost, LightGBM
  • Linear model coefficients: Feature weights
  • Permutation importance: Measure impact by shuffling
  • SHAP feature importance: Game theory-based

Tools: XGBoost, LightGBM, SHAP
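Permutation importance from scratch in NumPy: fit once, then measure how much shuffling each column degrades the fit. Synthetic data and training-set scoring keep the sketch short; held-out scoring is the norm in practice:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 3))
y = 4.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.normal(size=n)  # feature 2 is noise

# Fit a linear model once
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def mse(X_mat):
    """Mean squared error of the fixed model on a feature matrix."""
    return np.mean((y - np.column_stack([X_mat, np.ones(n)]) @ coef) ** 2)

baseline = mse(X)
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])   # break the feature-target link
    importance.append(mse(Xp) - baseline)  # error increase = importance
```

Unlike raw coefficients, this measures importance in units of model error, which makes it comparable across feature scales and model families.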


Feature Engineering Patterns

Domain-Driven Features: Leverage domain expertise for feature creation

Key Features
  • Expert-guided feature design
  • Interpretable and explainable
  • High signal-to-noise ratio
  • Domain-specific ratios and calculations
  • Business logic encoded as features
  • Collaboration with domain experts
Use Cases
  • Finance: debt-to-income ratio, credit utilization
  • Healthcare: BMI, risk scores, disease indicators
  • E-commerce: cart abandonment rate, customer lifetime value
  • Manufacturing: equipment efficiency, defect rates
Similar Technologies
Automated feature engineering, Statistical features
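The finance examples above as concrete pandas columns (toy data; column names are assumptions):

```python
import pandas as pd

loans = pd.DataFrame({
    "monthly_debt": [500.0, 2000.0],
    "monthly_income": [5000.0, 4000.0],
    "card_balance": [300.0, 4500.0],
    "card_limit": [3000.0, 5000.0],
})

# Domain-driven ratios a credit analyst would recognize on sight
loans["debt_to_income"] = loans["monthly_debt"] / loans["monthly_income"]
loans["credit_utilization"] = loans["card_balance"] / loans["card_limit"]
```

Two lines of code, but the signal comes from knowing which ratios lenders actually use, which no automated tool can supply.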
Temporal Aggregations: Rolling windows and time-based features

Key Features
  • Rolling mean/sum/std over time windows
  • Exponential weighted moving averages
  • Lag features (previous values)
  • Rate of change and trends
  • Seasonal features (day of week, month)
  • Time since last event
Use Cases
  • User behavior patterns over time
  • Trend detection and forecasting
  • Seasonality capture
  • Financial technical indicators
Similar Technologies
Static features, Snapshot features
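The bullet list above maps almost one-to-one onto pandas idioms (toy sales frame; window sizes assumed):

```python
import pandas as pd

s = pd.DataFrame({
    "day": pd.date_range("2024-01-01", periods=6, freq="D"),
    "sales": [10.0, 12.0, 9.0, 14.0, 20.0, 18.0],
})

s["sales_lag_1"] = s["sales"].shift(1)                 # previous value
s["rolling_mean_3"] = s["sales"].rolling(window=3).mean()
s["pct_change"] = s["sales"].pct_change()              # rate of change
s["dow"] = s["day"].dt.dayofweek                       # seasonal feature
s["ewm_mean"] = s["sales"].ewm(span=3).mean()          # exponential smoothing
```

Note the leading NaNs that lags and rolling windows introduce: rows without enough history must be dropped or imputed before training.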
Entity Embeddings: Neural network embeddings for categorical features

Key Features
  • Dense vector representations
  • Capture semantic similarity
  • Dimensionality reduction
  • Transfer learning capability
  • Pre-trained embeddings
  • Unsupervised learning of relationships
Use Cases
  • High-cardinality categoricals
  • NLP features (word embeddings)
  • User/item embeddings for recommendations
  • Product category embeddings
Similar Technologies
One-hot encoding, Target encoding, Hash encoding
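The lookup mechanics of an embedding layer in NumPy: each category id indexes a row of a matrix. Here the table is randomly initialized; in practice the rows are learned end-to-end by the network:

```python
import numpy as np

rng = np.random.default_rng(3)
n_categories, dim = 1000, 8          # e.g. 1000 product ids -> 8-d vectors

# Trainable embedding table (randomly initialized in this sketch)
embedding = rng.normal(scale=0.1, size=(n_categories, dim))

# A batch of categorical ids becomes a dense matrix by row lookup
ids = np.array([3, 42, 3, 999])
vectors = embedding[ids]             # shape (4, 8)
```

Contrast with one-hot encoding: 1000 categories would need 1000 sparse columns, while the embedding compresses them into 8 dense dimensions in which similar categories can end up close together.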
Feature Crosses: Combine features for interaction effects

Key Features
  • Polynomial combinations
  • Categorical feature crosses
  • Binned feature interactions
  • Capture non-linear relationships
  • Sparse feature spaces
  • Explicit interaction modeling
Use Cases
  • Linear models with interactions
  • Recommendation systems
  • CTR prediction
  • Wide & Deep learning architectures
Similar Technologies
Tree-based models (implicit), Neural networks
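Two flavors of crosses side by side: a degree-2 polynomial expansion of numeric features and a categorical cross built by concatenation (toy data; column names assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0],
    "x2": [4.0, 5.0, 6.0],
    "country": ["US", "US", "DE"],
    "device": ["ios", "android", "ios"],
})

# Degree-2 numeric expansion: x1^2, x1*x2, x2^2
df["x1_sq"] = df["x1"] ** 2
df["x1_x2"] = df["x1"] * df["x2"]
df["x2_sq"] = df["x2"] ** 2

# Categorical cross: one sparse indicator per (country, device) combination
df["country_x_device"] = df["country"] + "_" + df["device"]
crossed = pd.get_dummies(df["country_x_device"])
```

The categorical cross is the workhorse of CTR models: a linear model cannot learn that iOS users in the US behave differently unless that combination exists as its own feature.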
Ratio & Relative Features: Normalized and comparative features

Key Features
  • Ratios and percentages
  • Z-scores and standardization
  • Percentile ranks
  • Relative to group average
  • Year-over-year changes
  • Proportion of total
Use Cases
  • Cross-entity comparisons
  • Normalized metrics
  • Removing scale effects
  • Relative performance indicators
Similar Technologies
Absolute values, Raw features
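Group-relative features hinge on `groupby(...).transform`, which broadcasts a group statistic back to every row (toy store-sales frame assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B"],
    "sales": [10.0, 20.0, 30.0, 100.0, 300.0],
})

g = df.groupby("store")["sales"]
# Relative to group average: removes per-store scale effects
df["vs_store_mean"] = df["sales"] / g.transform("mean")
# Z-score within group
df["z_in_store"] = (df["sales"] - g.transform("mean")) / g.transform("std")
# Proportion of group total
df["share_of_store"] = df["sales"] / g.transform("sum")
# Percentile rank within group
df["pct_rank"] = g.rank(pct=True)
```

After these transforms, a 30-unit day at a small store and a 300-unit day at a large one can land on the same scale, which is exactly the cross-entity comparability the pattern is for.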
Automated Feature Engineering: Use AutoML tools for feature generation

Key Features
  • Deep feature synthesis
  • Genetic programming
  • AutoML pipeline features
  • Feature importance ranking
  • Rapid iteration
  • Baseline establishment
Use Cases
  • Rapid prototyping
  • Feature discovery
  • Baseline model creation
  • Large datasets with many tables
Similar Technologies
Manual engineering, Domain-driven features

Feature Engineering Best Practices

Principle | Description | Why It Matters | Example
--- | --- | --- | ---
Avoid Data Leakage | No future information in training data | Inflates metrics, model fails in production | Don't use target-derived features before train/test split
Train-Test Consistency | Same transformations on train and test | Prevents distribution shift and errors | Fit scaler on train only, transform both train and test
Handle Missing Values | Imputation or missing indicator features | Prevents errors, captures missingness signal | Mean imputation + is_missing boolean feature
Feature Versioning | Track feature definitions and code versions | Reproducibility, debugging, collaboration | Git commit hash in feature metadata
Feature Documentation | Describe calculation and business rationale | Team collaboration, maintenance, knowledge transfer | Docstrings, feature registry, wiki
Monitor Feature Drift | Track feature distributions over time | Detect data quality issues, concept drift | Statistical tests (KS), distribution plots
Validate Features | Check ranges, types, null rates in pipeline | Catch pipeline errors early before training | Assert age in [0, 120], null_rate < 5%
Consider Latency | Feature computation cost for inference | Real-time inference constraints, cost | Pre-compute expensive features, cache results
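The two leakage rows of the table can be sketched in a few lines of NumPy: statistics are fit on the training split only, then applied unchanged to test, with a lightweight validation check at the end (synthetic data; tolerances assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
X_train = rng.normal(loc=50, scale=10, size=(800, 1))
X_test = rng.normal(loc=50, scale=10, size=(200, 1))

# Fit the scaler on the TRAINING split only...
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

# ...then apply the same statistics to both splits. Fitting on the
# full dataset would leak test-set information into training.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

# Lightweight feature validation before training
assert np.isfinite(X_train_scaled).all(), "non-finite values in features"
# Train is exactly centered; test is only approximately centered,
# because it was scaled with train statistics -- that asymmetry is correct.
```

The same fit-on-train-only discipline applies to imputers, encoders, and feature selectors, not just scalers.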

Feature Engineering Approaches Comparison

Approach | Speed | Accuracy | Maintenance | Best For
--- | --- | --- | --- | ---
Manual Engineering | Medium | High (with expertise) | High | Domain-specific features, interpretability requirements, production systems
Library-Based (Featuretools) | High | Medium | Medium | Rapid prototyping, relational data, exploration phase
AutoML Features | Very High | Medium | Low | Baseline establishment, large datasets, time-constrained projects
Deep Learning Embeddings | Low (training) | High | Medium | High-cardinality categoricals, NLP, recommendation systems
Hybrid (Manual + Auto) | Medium | Very High | Medium | Production systems, competitive ML, best possible results