Link to our repo: https://github.com/NnennaN123/AI4ALL-Project
Link to Streamlit App: https://cranberryai4allproject.streamlit.app/
CRANBerry Team’s Project!
This project builds machine learning models that predict whether a location from NREL’s Wind Toolkit has a wind turbine, using site features such as usable area, capacity, wind speed, and capacity factor. The workflow includes data loading, spatial matching, feature engineering, model training, and evaluation.
This project demonstrates an iterative machine learning workflow, progressively improving model performance through algorithm selection, feature engineering, and hyperparameter optimization.
Location: logistic_regression/logistic_regression.ipynb
The initial baseline model uses Logistic Regression to establish a performance benchmark. This linear model was chosen for its interpretability and computational efficiency.
Model Configuration:
Features:
- fraction_of_usable_area: Fraction of grid cell usable for wind development
- capacity: Potential capacity of the site
- wind_speed: Average wind speed at the site
- capacity_factor: Expected capacity factor (efficiency)

Hyperparameters:
- max_iter=1000
- class_weight="balanced" (to handle class imbalance)
- n_jobs=-1 (parallel processing)

Performance Metrics:

| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| No Turbine | 0.779 | 0.591 | 0.672 |
| Turbine | 0.522 | 0.728 | 0.608 |
Key Finding: Good recall for detecting turbines (72.8%) but lower precision (52.2%), indicating many false positives. This established the baseline for comparison.
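The `class_weight="balanced"` setting can be illustrated with a small sketch. scikit-learn documents the "balanced" mode as weighting each class by `n_samples / (n_classes * class_count)`; the function name and the toy labels below are hypothetical and are not taken from the project's data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy labels: 6 grid cells without a turbine (0), 2 with one (1).
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_class_weights(toy_labels)

# The rarer class receives the larger weight, so its errors cost more in training.
print(weights)
```

Up-weighting the minority class in this way tends to raise recall for that class at the cost of precision, which is consistent with the turbine metrics reported above.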
Location: random_forest/random_forest.ipynb
The second iteration uses Random Forest, an ensemble method that combines multiple decision trees to capture non-linear relationships and improve prediction accuracy.
Model Configuration:
- n_estimators=500 (number of trees)
- max_leaf_nodes=16 (caps the number of leaves per tree)
- n_jobs=-1 (parallel processing)
- random_state=42 (reproducibility)

Performance Metrics:

| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| No Turbine | 0.717 | 0.860 | 0.782 |
| Turbine | 0.663 | 0.448 | 0.535 |
Improvement over Logistic Regression: ROC-AUC rose from 0.732 to 0.770 and accuracy from 0.643 to 0.703.
Key Finding: Better overall performance but more conservative in predicting turbines (44.8% recall vs 72.8%), suggesting the need for more sophisticated algorithms.
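The precision and recall trade-off between the first two models can be checked with a short calculation. This sketch recomputes the turbine-class F1-score from the reported precision and recall using the standard harmonic-mean formula; the helper name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Turbine-class (precision, recall) pairs as reported for each model.
reported = {
    "Logistic Regression": (0.522, 0.728),
    "Random Forest": (0.663, 0.448),
}

turbine_f1 = {name: round(f1_score(p, r), 3) for name, (p, r) in reported.items()}
print(turbine_f1)
```

Because F1 is a harmonic mean, Random Forest's drop in turbine recall outweighs its gain in precision, which is why its turbine F1 is lower than the baseline's even though its overall accuracy is higher.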
Location: xgboost/xgboost.ipynb
The third iteration uses XGBoost (Extreme Gradient Boosting), a gradient boosting framework that often performs well on structured (tabular) data.
Model Configuration:
- n_estimators=300 (number of boosting rounds)
- learning_rate=0.05 (step size shrinkage)
- max_depth=6 (maximum tree depth)
- subsample=0.8 (row subsampling ratio)
- colsample_bytree=0.8 (column subsampling ratio)
- objective='binary:logistic' (binary classification)
- eval_metric='logloss' (evaluation metric)

Performance Metrics:

| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| No Turbine | 0.796 | 0.839 | 0.817 |
| Turbine | 0.708 | 0.645 | 0.675 |
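Improvement over Previous Models: ROC-AUC reached 0.847 (vs 0.770 for Random Forest and 0.732 for Logistic Regression) and accuracy reached 0.766 (vs 0.703 and 0.643).
Key Finding: XGBoost performed best of the three algorithms tested, so it was selected for further work. Feature engineering still offered room for improvement.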
Location: feature_engineering.ipynb
Conducted systematic feature engineering to identify the optimal feature combination. Tested 12 different feature configurations (X_train_1 through X_train_12) including:
- fraction_of_usable_area, capacity_factor
- wind_speed, capacity
- wind_speed_category, capacity_category (converted to numeric)
- State (one-hot encoded, ~50+ columns) or Region (one-hot encoded, ~7 columns)
- combined_wind_rescource = wind_speed × capacity_factor
- potential_with_constraints = capacity × fraction_of_usable_area

Key Findings:
Documentation: See X_TRAIN_COMBINATIONS_README.md for complete analysis of all 12 feature combinations.
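The two engineered features and the one-hot encoding described above can be sketched as follows. This is a minimal, dependency-free illustration using plain dicts; the function names, the sample values, and the region labels are hypothetical, and the notebooks may compute these differently. The same arithmetic applies column-wise in a DataFrame.

```python
def add_engineered_features(site):
    """Return a copy of `site` with the two interaction features added."""
    enriched = dict(site)
    # Column name kept exactly as it is spelled in the project.
    enriched["combined_wind_rescource"] = site["wind_speed"] * site["capacity_factor"]
    enriched["potential_with_constraints"] = (
        site["capacity"] * site["fraction_of_usable_area"]
    )
    return enriched


def one_hot(value, categories, prefix):
    """Encode `value` as 0/1 indicator columns, one per known category."""
    return {f"{prefix}_{category}": int(category == value) for category in categories}


# Hypothetical site record and region list, for illustration only.
site = {
    "wind_speed": 8.0,
    "capacity_factor": 0.5,
    "capacity": 16.0,
    "fraction_of_usable_area": 0.25,
}
regions = ["Midwest", "Northeast", "South"]

features = add_engineered_features(site)
features.update(one_hot("Midwest", regions, prefix="Region"))
print(features)
```

One-hot encoding by Region produces far fewer columns than encoding by State, which is the trade-off the feature combinations explore.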
Location: xgboost/xgboost_hyperparameter_tuning.ipynb
Model File: xgboost/xgboost_tuned_feat_eng_wind_model.pkl
The final production model combines the best feature engineering (X_train_5) with systematic hyperparameter optimization using GridSearchCV.
Model Configuration:
Features:
- fraction_of_usable_area, capacity_factor
- wind_speed, capacity
- State (one-hot encoded, ~50+ columns)
- combined_wind_rescource, potential_with_constraints

Hyperparameters:
- max_depth: 8
- learning_rate: 0.1
- n_estimators: 300
- subsample: 0.7
- colsample_bytree: 1.0
- scale_pos_weight: 1.6201 (handles class imbalance)

Final Performance Metrics:

| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| No Turbine | 0.960 | 0.840 | 0.896 |
| Turbine | 0.780 | 0.943 | 0.854 |
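The `scale_pos_weight` value is worth a brief illustration. The XGBoost documentation suggests the ratio of negative to positive training examples as a typical starting value; the sketch below assumes the reported 1.6201 was derived this way, which the source does not state. The label counts are hypothetical, chosen only to reproduce that ratio, and are not the project's actual dataset sizes.

```python
def scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) examples in a binary label list."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("at least one positive example is required")
    return negatives / positives


# Hypothetical counts: 16,201 cells without a turbine, 10,000 with one.
toy_labels = [0] * 16201 + [1] * 10000
ratio = scale_pos_weight(toy_labels)
print(round(ratio, 4))
```

A value above 1 makes errors on the turbine class more costly during training, which fits the high turbine recall in the final metrics.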
Improvement Journey:
Key Achievements:
| Metric | Logistic Regression | Random Forest | XGBoost (Initial) | XGBoost (Final) | Winner |
|---|---|---|---|---|---|
| ROC-AUC | 0.732 | 0.770 | 0.847 | 0.9545 | Final XGBoost |
| Accuracy | 0.643 | 0.703 | 0.766 | 0.8783 | Final XGBoost |
| F1-Score | 0.640 | 0.658 | 0.746 | 0.8537 | Final XGBoost |
| Turbine Recall | 0.728 | 0.448 | 0.645 | 0.943 | Final XGBoost |
| Turbine Precision | 0.522 | 0.663 | 0.708 | 0.780 | Final XGBoost |
| No Turbine F1 | 0.672 | 0.782 | 0.817 | 0.896 | Final XGBoost |
Performance Evolution:
Key Insights:
The project uses:
Spatial matching is performed using geospatial joins to match turbines to NREL grid cells within a 25 km radius.
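The 25 km matching rule can be sketched with a great-circle distance check. This is a brute-force stand-in for the geospatial join the project uses, not the project's implementation; it assumes distance is measured from each grid cell's coordinates to each turbine, and all names and coordinates below are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def label_cells(cells, turbines, radius_km=25.0):
    """Return 1 for each cell with at least one turbine within radius_km, else 0."""
    return [
        int(any(haversine_km(clat, clon, tlat, tlon) <= radius_km
                for tlat, tlon in turbines))
        for clat, clon in cells
    ]


# Hypothetical coordinates: two grid cells and a single turbine.
cells = [(40.0, -100.0), (40.0, -99.0)]
turbines = [(40.1, -100.0)]
labels = label_cells(cells, turbines)
print(labels)
```

The loop compares every cell with every turbine, so it scales poorly; a spatial index or a library-level spatial join is the usual choice for full-size datasets.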
- logistic_regression/: Baseline Logistic Regression model
- random_forest/: Random Forest model for comparison
- xgboost/:
  - xgboost.ipynb: Initial XGBoost model (4 features)
  - xgboost_hyperparameter_tuning.ipynb: Final optimized model with feature engineering
  - xgboost_tuned_feat_eng_wind_model.pkl: Production model
  - xgboost_tuned_feat_eng_model_metrics.json: Final model metrics
- feature_engineering.ipynb: Systematic feature engineering exploration
- X_TRAIN_COMBINATIONS_README.md: Complete documentation of 12 feature combinations tested
- datasets/: Training and source data files
- streamlit/: Production Streamlit application

The final model is deployed in a Streamlit web application: