Feature Engineering
Feature Engineering Frameworks
Featuretools
Automated feature engineering with deep feature synthesis
Key Features
- Deep feature synthesis (DFS) for automated feature generation
- Entity relationship modeling
- Temporal aggregations across entities
- Feature primitives library
- Parallel computation support
- Custom primitive creation
Use Cases
- Automated feature discovery
- Relational data feature engineering
- Time-series feature generation
- Rapid prototyping and experimentation
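To make the idea behind deep feature synthesis concrete, the sketch below hand-computes the kind of features that aggregation primitives produce across a parent-child relationship. It uses plain pandas rather than the Featuretools API, and the `customers` and `transactions` tables and their column names are made up for illustration.

```python
import pandas as pd

# Hypothetical parent table: one row per customer.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})

# Hypothetical child table: many rows per customer.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 30.0, 5.0, 15.0, 40.0],
})

# Apply a few aggregation "primitives" to the child rows of each parent.
agg = (
    transactions.groupby("customer_id")["amount"]
    .agg(["count", "mean", "max"])
    .add_prefix("amount_")
    .reset_index()
)

# Left join so customers without any transactions are kept.
features = customers.merge(agg, on="customer_id", how="left")
features["amount_count"] = features["amount_count"].fillna(0).astype(int)
```

Customer 3 has no transactions, so its count becomes 0 while its mean and max stay missing. An automated tool would generate many such columns by stacking primitives across every relationship it knows about.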
tsfresh
Time-series feature extraction library
Key Features
- 794 time-series features out-of-the-box
- Statistical significance testing
- Parallel execution with Dask
- Feature selection based on relevance
- Comprehensive statistical features
- Pandas and Dask integration
Use Cases
- Time-series ML problems
- Sensor data analysis
- Financial time-series
- Anomaly detection features
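As a rough illustration of what time-series feature extraction means, the following sketch reduces each series in a long-format table to a few scalar features. It is hand-written NumPy and pandas, not the tsfresh API; the `readings` table and the feature names are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format sensor data: one row per (series id, time step).
readings = pd.DataFrame({
    "id": ["a"] * 4 + ["b"] * 4,
    "time": [0, 1, 2, 3] * 2,
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0, 5.0],
})

def summarize(x: np.ndarray) -> dict:
    """Reduce one ordered series to a handful of scalar features."""
    return {
        "mean": float(x.mean()),
        "std": float(x.std(ddof=0)),
        "abs_energy": float(np.sum(x ** 2)),                  # sum of squares
        "mean_abs_change": float(np.mean(np.abs(np.diff(x)))),
    }

# One feature row per series id, computed on time-ordered values.
rows = {
    series_id: summarize(group["value"].to_numpy())
    for series_id, group in readings.sort_values(["id", "time"]).groupby("id")
}
ts_features = pd.DataFrame.from_dict(rows, orient="index")
```

The result has one row per series and one column per feature, which is the shape a tabular model expects.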
Category Encoders
Scikit-learn compatible categorical encoders
Key Features
- 15+ encoding methods
- Target encoding with regularization
- Leave-one-out encoding
- CatBoost encoder
- Bayesian target encoding
- Encoders designed to limit target leakage (e.g., leave-one-out, CatBoost)
Use Cases
- High-cardinality categorical features
- Scikit-learn ML pipelines
- Preventing target leakage
- Tree-based and linear models
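Leave-one-out target encoding is easy to misread, so here is a minimal hand-rolled version in pandas. It is a sketch of the idea, not the Category Encoders implementation; the `city` and `target` columns are invented, and falling back to the global mean for single-row categories is an assumption made for this example.

```python
import pandas as pd

# Hypothetical training data with one categorical column.
train = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C"],
    "target": [1, 0, 1, 0, 0, 1],
})

global_mean = train["target"].mean()
grouped = train.groupby("city")["target"]
sums = grouped.transform("sum")
counts = grouped.transform("count")

# Exclude each row's own target from its category mean.
loo = (sums - train["target"]) / (counts - 1)

# Categories with a single row give 0/0 = NaN; fall back to the global mean.
train["city_loo"] = loo.fillna(global_mean)
```

Because a row never sees its own label, the encoded value leaks less target information than a plain per-category mean. For unseen data, the usual choice is the full per-category mean learned from the training set.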
Sktime
Unified time-series ML framework
Key Features
- Time-series forecasting algorithms
- Classification and regression
- Feature transformations
- Composable pipelines
- Probabilistic predictions
- Model selection utilities
Use Cases
- Time-series forecasting
- Time-series feature engineering
- Probabilistic predictions
- Unified time-series workflows
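One common way to connect forecasting with tabular feature engineering is a sliding-window reduction: each row holds the last few observations and the target is the next value. The sketch below shows that idea with NumPy only; it is not sktime's API, and `make_lag_matrix` is a hypothetical helper written for this example.

```python
import numpy as np

def make_lag_matrix(y: np.ndarray, window: int):
    """Turn a 1-D series into (X, target) pairs for tabular regression."""
    X = np.stack([y[i : i + window] for i in range(len(y) - window)])
    target = y[window:]
    return X, target

window = 3
series = np.arange(10, dtype=float)  # toy series 0, 1, ..., 9
X, target = make_lag_matrix(series, window)

# Fit ordinary least squares (with intercept) on the lagged values.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, target, rcond=None)

# One-step-ahead forecast from the last `window` observations.
next_value = np.append(series[-window:], 1.0) @ coef
```

On this perfectly linear toy series the forecast is 10 up to floating-point error. Any regressor could replace the least-squares fit once the series is in this tabular form.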
Common Feature Transformations
| Technique | Description | Use Case | Implementation | Libraries |
|---|---|---|---|---|
| Scaling/Normalization | StandardScaler, MinMaxScaler, RobustScaler | Gradient-based models, distance-based algorithms | (x - mean) / std for StandardScaler; (x - min) / (max - min) for MinMaxScaler | Scikit-learn |
| Encoding Categoricals | One-hot, target, ordinal, hash, embedding | Convert categories to numerical representations | Label → Vector | Category Encoders, Pandas |
| Binning/Discretization | Group continuous values into bins | Capture non-linear relationships, reduce noise | Age → Age_group | Pandas cut, qcut |
| Log Transformation | log(x), log1p(x) for skewed data | Reduce skewness, normalize distributions | np.log1p(x) | NumPy, Pandas |
| Polynomial Features | x1, x2 → x1², x1*x2, x2² | Capture non-linear relationships in linear models | Degree-2 expansion | Scikit-learn |
| Interaction Features | Combine 2+ features multiplicatively | Capture feature relationships | Price * Quantity | Manual, PolynomialFeatures |
| Aggregations | Sum, mean, count, std over groups | Summarize entity behavior patterns | User → Avg_purchase | Pandas groupby, SQL |
| Time-based Features | Hour, day, month, is_weekend, time_since | Capture temporal patterns and seasonality | Datetime extraction | Pandas dt accessor |
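Several rows of the table can be sketched in a few lines of pandas and NumPy. The DataFrame below and its column names are made up for illustration, and the bin edges are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data.
df = pd.DataFrame({
    "age": [22, 35, 58, 71],
    "income": [0.0, 30_000.0, 60_000.0, 1_000_000.0],
    "price": [2.0, 5.0, 1.5, 10.0],
    "quantity": [3, 1, 4, 2],
    "ts": pd.to_datetime([
        "2024-01-05 09:30", "2024-01-06 18:00",
        "2024-03-11 23:15", "2024-07-14 06:45",
    ]),
})

# Scaling: standardize to zero mean, unit variance (population std, ddof=0).
df["age_z"] = (df["age"] - df["age"].mean()) / df["age"].std(ddof=0)

# Log transformation: log1p handles zeros and compresses the long tail.
df["income_log"] = np.log1p(df["income"])

# Binning: right-closed intervals (0, 30], (30, 60], (60, 120].
df["age_group"] = pd.cut(
    df["age"], bins=[0, 30, 60, 120], labels=["young", "middle", "senior"]
)

# Interaction feature: multiply two columns.
df["revenue"] = df["price"] * df["quantity"]

# Time-based features from the datetime accessor.
df["hour"] = df["ts"].dt.hour
df["month"] = df["ts"].dt.month
df["is_weekend"] = df["ts"].dt.dayofweek >= 5  # Saturday=5, Sunday=6
```

In a real pipeline the scaling statistics would be computed on the training split only and reused for new data.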
Feature Selection Methods
Filter Methods (Fast)
Statistical tests independent of model
- Correlation analysis: Remove highly correlated features
- Chi-squared test: Categorical features with categorical target
- ANOVA F-test: Numerical features with categorical target
- Mutual information: Non-linear relationships
- Variance threshold: Remove low-variance features
Tools: Scikit-learn feature_selection
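Two of the filter methods above, variance thresholding and correlation filtering, might look like this in practice. The synthetic columns and the 0.95 correlation cutoff are assumptions for the example, not recommended defaults.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 200

# Synthetic features: one constant column and one near-duplicate column.
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "constant": np.ones(n),
})
X["signal_copy"] = X["signal"] * 2.0 + rng.normal(scale=0.01, size=n)

# 1) Variance threshold: drop columns whose variance is zero.
vt = VarianceThreshold(threshold=0.0)
vt.fit(X)
kept = X.columns[vt.get_support()].tolist()

# 2) Correlation filter: for each highly correlated pair, drop the later column.
corr = X[kept].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
selected = [col for col in kept if col not in to_drop]
```

Neither step looks at a model or at the target, which is what makes filter methods fast.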
Wrapper Methods (Accurate)
Search feature subsets using model performance
- Recursive Feature Elimination (RFE): Iteratively remove features
- Forward selection: Add features one by one
- Backward elimination: Remove features one by one
- Sequential Feature Selection: Scikit-learn implementation
- Genetic algorithms: Evolutionary feature selection
Tools: Scikit-learn RFE, mlxtend
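A minimal sketch of Recursive Feature Elimination with scikit-learn follows. The synthetic dataset, the logistic regression estimator, and the choice to keep three features are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 carry signal.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Refit the model repeatedly, removing the weakest feature each round.
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    step=1,
)
rfe.fit(X, y)

selected_mask = rfe.support_   # boolean mask of kept features
ranking = rfe.ranking_         # 1 = kept; larger = eliminated earlier
```

Each elimination round retrains the estimator, so the cost grows with the number of features, which is the usual trade-off of wrapper methods.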
Embedded Methods (Efficient)
Feature selection during model training
- L1 regularization (Lasso): Sparse feature selection
- Tree-based importance: Random Forest, XGBoost, LightGBM
- Linear model coefficients: Feature weights
- Permutation importance: Measure impact by shuffling (model-agnostic, computed after training)
- SHAP feature importance: Game theory-based attribution (computed after training)
Tools: XGBoost, LightGBM, SHAP
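L1 regularization can be sketched with scikit-learn's `Lasso`: features whose coefficients are driven to exactly zero are dropped as a side effect of training. The synthetic data and `alpha=0.1` are assumptions chosen so that the effect is easy to see.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: only the first two of five features influence y.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty shrinks weak coefficients to exactly zero.
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_)  # indices of features that survive
```

The surviving coefficients are shrunk toward zero as well, so a common follow-up is to refit an unpenalized model on the selected columns.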
Feature Engineering Patterns
Domain-Driven Features
Leverage domain expertise for feature creation
Key Features
- Expert-guided feature design
- Interpretable and explainable
- High signal-to-noise ratio
- Domain-specific ratios and calculations
- Business logic encoded as features
- Collaboration with domain experts
Use Cases
- Finance: debt-to-income ratio, credit utilization
- Healthcare: BMI, risk scores, disease indicators
- E-commerce: cart abandonment rate, customer lifetime value
- Manufacturing: equipment efficiency, defect rates
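The finance examples above can be sketched as simple ratio features. The `applicants` table, its column names, and the `safe_ratio` helper are hypothetical; returning NaN for a zero denominator is one possible convention, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical loan applicants.
applicants = pd.DataFrame({
    "monthly_debt": [500.0, 1200.0, 0.0],
    "monthly_income": [4000.0, 3000.0, 0.0],
    "card_balance": [1000.0, 4500.0, 0.0],
    "card_limit": [5000.0, 5000.0, 0.0],
})

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide element-wise, returning NaN where the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)

applicants["debt_to_income"] = safe_ratio(
    applicants["monthly_debt"], applicants["monthly_income"]
)
applicants["credit_utilization"] = safe_ratio(
    applicants["card_balance"], applicants["card_limit"]
)
```

Encoding the zero-denominator rule in one helper keeps the business logic in a single, reviewable place.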
Temporal Aggregations
Rolling windows and time-based features
Key Features
- Rolling mean/sum/std over time windows
- Exponential weighted moving averages
- Lag features (previous values)
- Rate of change and trends
- Seasonal features (day of week, month)
- Time since last event
Use Cases
- User behavior patterns over time
- Trend detection and forecasting
- Seasonality capture
- Financial technical indicators
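A short pandas sketch of lag, rolling, exponentially weighted, and time-since-last-event features follows. The `events` table is invented, and shifting by one step before aggregating is a design choice made here so that a row's features use only earlier rows.

```python
import pandas as pd

# Hypothetical daily spend for one user.
events = pd.DataFrame({
    "user": ["u1"] * 5,
    "day": pd.date_range("2024-01-01", periods=5, freq="D"),
    "spend": [10.0, 20.0, 30.0, 40.0, 50.0],
}).sort_values(["user", "day"])

spend = events.groupby("user")["spend"]

# Lag feature: previous value within each user.
events["spend_lag1"] = spend.shift(1)

# Rolling mean over the three previous rows (current row excluded).
events["spend_roll3_mean"] = spend.transform(
    lambda s: s.shift(1).rolling(3).mean()
)

# Exponentially weighted mean of previous rows.
events["spend_ewm"] = spend.transform(
    lambda s: s.shift(1).ewm(span=3, adjust=False).mean()
)

# Rate of change relative to the previous row.
events["spend_diff"] = events["spend"] - events["spend_lag1"]

# Time since the previous event, and a seasonal feature.
prev_day = events.groupby("user")["day"].shift(1)
events["days_since_prev"] = (events["day"] - prev_day).dt.days
events["dow"] = events["day"].dt.dayofweek  # Monday=0
```

The first rows of each user have missing values because no history exists yet; how to fill them depends on the model.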
Entity Embeddings
Neural network embeddings for categorical features
Key Features
- Dense vector representations
- Capture semantic similarity
- Dimensionality reduction
- Transfer learning capability
- Pre-trained embeddings
- Unsupervised learning of relationships
Use Cases
- High-cardinality categoricals
- NLP features (word embeddings)
- User/item embeddings for recommendations
- Product category embeddings
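Mechanically, an embedding is a lookup into a table with one dense row per category. The NumPy sketch below shows only that lookup: the table is randomly initialized and untrained, whereas a real model would learn it by backpropagation. The category names and `embedding_dim` are assumptions.

```python
import numpy as np

categories = ["books", "games", "music", "tools"]
index = {name: i for i, name in enumerate(categories)}

rng = np.random.default_rng(0)
embedding_dim = 2

# One dense row per category. Random here; learned during training in practice.
table = rng.normal(size=(len(categories), embedding_dim))

def embed(names):
    """Map category names to their dense vectors by row lookup."""
    ids = np.array([index[name] for name in names])
    return table[ids]

batch = embed(["music", "books", "music"])

# The lookup is equivalent to multiplying a one-hot vector by the table.
one_hot = np.eye(len(categories))[[index["music"]]]
same = bool(np.allclose(one_hot @ table, embed(["music"])))
```

With thousands of categories the table stays small (`n_categories x embedding_dim`), whereas a one-hot encoding would add one column per category.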
Feature Crosses
Combine features for interaction effects
Key Features
- Polynomial combinations
- Categorical feature crosses
- Binned feature interactions
- Capture non-linear relationships
- Sparse feature spaces
- Explicit interaction modeling
Use Cases
- Linear models with interactions
- Recommendation systems
- CTR prediction
- Wide & Deep learning architectures
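A categorical feature cross can be sketched as string concatenation followed by hashing into a fixed number of buckets. The `ads` table, the `_x_` separator, and the bucket count are assumptions; SHA-256 is used only because Python's built-in `hash` is not stable across processes.

```python
import hashlib
import pandas as pd

# Hypothetical ad impressions.
ads = pd.DataFrame({
    "device": ["mobile", "desktop", "mobile"],
    "country": ["US", "US", "DE"],
})

# Cross two categorical columns into one combined category.
ads["device_x_country"] = ads["device"] + "_x_" + ads["country"]

def hash_bucket(value: str, n_buckets: int = 1000) -> int:
    """Map a string to a stable bucket id in [0, n_buckets)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Hashing bounds the size of the sparse feature space.
ads["cross_bucket"] = ads["device_x_country"].map(hash_bucket)
```

Hash collisions merge unrelated crosses into one bucket, so the bucket count trades memory against accuracy.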
Ratio & Relative Features
Normalized and comparative features
Key Features
- Ratios and percentages
- Z-scores and standardization
- Percentile ranks
- Relative to group average
- Year-over-year changes
- Proportion of total
Use Cases
- Cross-entity comparisons
- Normalized metrics
- Removing scale effects
- Relative performance indicators
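Group-relative features are mostly a matter of `groupby(...).transform(...)` in pandas. The `sales` table below is invented for illustration.

```python
import pandas as pd

# Hypothetical store revenue by region.
sales = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south"],
    "store": ["n1", "n2", "n3", "s1", "s2"],
    "revenue": [100.0, 200.0, 300.0, 50.0, 150.0],
})

by_region = sales.groupby("region")["revenue"]

# Proportion of the regional total.
sales["share_of_region"] = sales["revenue"] / by_region.transform("sum")

# Ratio to the regional average.
sales["vs_region_mean"] = sales["revenue"] / by_region.transform("mean")

# Z-score within the region (sample standard deviation, ddof=1).
sales["z_in_region"] = (
    sales["revenue"] - by_region.transform("mean")
) / by_region.transform("std")

# Percentile rank across all stores.
sales["pct_rank"] = sales["revenue"].rank(pct=True)
```

`transform` returns a result aligned to the original rows, so each store can be compared with its own group without a join.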
Automated Feature Engineering
Use AutoML tools for feature generation
Key Features
- Deep feature synthesis
- Genetic programming
- AutoML pipeline features
- Feature importance ranking
- Rapid iteration
- Baseline establishment
Use Cases
- Rapid prototyping
- Feature discovery
- Baseline model creation
- Large datasets with many tables
Feature Engineering Best Practices
| Principle | Description | Why It Matters | Example |
|---|---|---|---|
| Avoid Data Leakage | No future or test-set information in training data | Inflates metrics, model fails in production | Split first, then compute target-derived features (e.g., target encoding) on the training set only |
| Train-Test Consistency | Same transformations on train and test | Prevents distribution shift and errors | Fit scaler on train only, transform both train and test |
| Handle Missing Values | Imputation or missing indicator features | Prevents errors, captures missingness signal | Mean imputation + is_missing boolean feature |
| Feature Versioning | Track feature definitions and code versions | Reproducibility, debugging, collaboration | Git commit hash in feature metadata |
| Feature Documentation | Describe calculation and business rationale | Team collaboration, maintenance, knowledge transfer | Docstrings, feature registry, wiki |
| Monitor Feature Drift | Track feature distributions over time | Detect data quality issues, concept drift | Statistical tests (KS), distribution plots |
| Validate Features | Check ranges, types, null rates in pipeline | Catch pipeline errors early before training | Assert age in [0, 120], null_rate < 5% |
| Consider Latency | Feature computation cost for inference | Real-time inference constraints, cost | Pre-compute expensive features, cache results |
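Three of these principles (validate features, handle missing values, train-test consistency) can be combined in one short sketch. The synthetic `age` column, the 5% null-rate limit taken from the table's example, and the 75/25 split are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data with a few injected missing values.
data = pd.DataFrame({"age": rng.integers(18, 90, size=100).astype(float)})
data.iloc[::25, 0] = np.nan

# Validate features: check ranges and null rate before training.
null_rate = data["age"].isna().mean()
assert data["age"].dropna().between(0, 120).all()
assert null_rate < 0.05

# Split before computing any statistics.
train, test = train_test_split(data, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

# Handle missing values: indicator column plus imputation with the TRAIN mean.
train_mean = train["age"].mean()
for part in (train, test):
    part["age_missing"] = part["age"].isna()
    part["age"] = part["age"].fillna(train_mean)

# Train-test consistency: fit the scaler on train only, transform both.
scaler = StandardScaler().fit(train[["age"]])
train["age_scaled"] = scaler.transform(train[["age"]]).ravel()
test["age_scaled"] = scaler.transform(test[["age"]]).ravel()
```

The test split never contributes to the imputation mean or the scaler statistics, which is the point of fitting on train only.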
Feature Engineering Approaches Comparison
| Approach | Speed | Accuracy | Maintenance | Best For |
|---|---|---|---|---|
| Manual Engineering | Medium | High (with expertise) | High | Domain-specific features, interpretability requirements, production systems |
| Library-Based (Featuretools) | High | Medium | Medium | Rapid prototyping, relational data, exploration phase |
| AutoML Features | Very High | Medium | Low | Baseline establishment, large datasets, time-constrained projects |
| Deep Learning Embeddings | Low (training) | High | Medium | High-cardinality categoricals, NLP, recommendation systems |
| Hybrid (Manual + Auto) | Medium | Very High | Medium | Production systems, competitive ML, best possible results |
