AI4ALL-Project

Link to our repo: https://github.com/NnennaN123/AI4ALL-Project

Link to Streamlit App: https://cranberryai4allproject.streamlit.app/

CRANBerry Team’s Project!

Overview

This project builds machine learning models that predict, from each location's features, whether a location in NREL’s Wind Toolkit has a wind turbine. The workflow includes data loading, spatial matching, feature engineering, model training, and evaluation.

Project Timeline & Model Evolution

This project demonstrates an iterative machine learning workflow, progressively improving model performance through algorithm selection, feature engineering, and hyperparameter optimization.

Phase 1: Baseline Models (Initial Exploration)

1. Logistic Regression Model

Location: logistic_regression/logistic_regression.ipynb

The initial baseline model uses Logistic Regression to establish a performance benchmark. This linear model was chosen for its interpretability and computational efficiency.

Model Configuration:

Performance Metrics:

Key Finding: Good recall for detecting turbines (72.8%) but lower precision (52.2%), meaning nearly half of the locations predicted to have a turbine are false positives. This established the baseline for comparison.


2. Random Forest Model

Location: random_forest/random_forest.ipynb

The second iteration uses Random Forest, an ensemble method that combines multiple decision trees to capture non-linear relationships and improve prediction accuracy.

Model Configuration:

Performance Metrics:

Improvement over Logistic Regression:

Key Finding: Better overall performance (ROC-AUC 0.770 vs 0.732, accuracy 0.703 vs 0.643) but far more conservative in predicting turbines (44.8% recall vs 72.8%), which motivated trying a gradient boosting model next.


3. XGBoost Model (Initial)

Location: xgboost/xgboost.ipynb

The third iteration uses XGBoost (Extreme Gradient Boosting), a gradient boosting framework that often performs well on structured (tabular) data.

Model Configuration:

Performance Metrics:

Improvement over Previous Models:

Key Finding: XGBoost performed best of the three algorithms tested (ROC-AUC 0.847, accuracy 0.766), so it was selected as the base algorithm. However, further improvements were possible through feature engineering.


Phase 2: Feature Engineering & Optimization

4. Feature Engineering Exploration

Location: feature_engineering.ipynb

We systematically tested 12 feature configurations (X_train_1 through X_train_12) to identify the best-performing feature combination.

Key Findings:

Documentation: See X_TRAIN_COMBINATIONS_README.md for complete analysis of all 12 feature combinations.


5. Final Model: Feature-Engineered XGBoost with Hyperparameter Tuning

Location: xgboost/xgboost_hyperparameter_tuning.ipynb

Model File: xgboost/xgboost_tuned_feat_eng_wind_model.pkl

The final production model combines the best-performing feature set (X_train_5) with systematic hyperparameter optimization using GridSearchCV.
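As a rough illustration of the grid search step, the sketch below runs scikit-learn's GridSearchCV over a small parameter grid. Everything specific here is an assumption made for the example: the data is synthetic (make_classification), the estimator is scikit-learn's GradientBoostingClassifier standing in for the XGBoost classifier, and the grid values, scoring choice, and fold count are hypothetical rather than the ones used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real project uses the X_train_5 feature set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Each candidate is scored by 3-fold cross-validated ROC-AUC on the training
# split; the best one is then refit on the whole training split.
search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated ROC-AUC:", round(search.best_score_, 3))
# search.score applies the same scorer (ROC-AUC) to the refit best model.
print("Held-out ROC-AUC:", round(search.score(X_test, y_test), 3))
```

Keeping the test split out of the search means the held-out score is not influenced by the parameter selection itself.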

Model Configuration:

Final Performance Metrics:

Improvement Journey:

Key Achievements:


Complete Model Comparison

| Metric | Logistic Regression | Random Forest | XGBoost (Initial) | XGBoost (Final) | Winner |
| --- | --- | --- | --- | --- | --- |
| ROC-AUC | 0.732 | 0.770 | 0.847 | 0.9545 | Final XGBoost |
| Accuracy | 0.643 | 0.703 | 0.766 | 0.8783 | Final XGBoost |
| F1-Score | 0.640 | 0.658 | 0.746 | 0.8537 | Final XGBoost |
| Turbine Recall | 0.728 | 0.448 | 0.645 | 0.943 | Final XGBoost |
| Turbine Precision | 0.522 | 0.663 | 0.708 | 0.780 | Final XGBoost |
| No Turbine F1 | 0.672 | 0.782 | 0.817 | 0.896 | Final XGBoost |
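To make the relationship between the precision, recall, and F1 rows concrete, here is a small sketch that computes a turbine-class F1 (the harmonic mean of precision and recall) from the Turbine Precision and Turbine Recall values in the table. The helper name f1_from_precision_recall is hypothetical, and the table does not say whether its F1-Score row is per-class or averaged across classes, so these numbers are not expected to reproduce that row for every model.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Turbine-class (precision, recall) pairs copied from the comparison table.
turbine_scores = {
    "Logistic Regression": (0.522, 0.728),
    "Random Forest": (0.663, 0.448),
    "XGBoost (Initial)": (0.708, 0.645),
    "XGBoost (Final)": (0.780, 0.943),
}

turbine_f1 = {
    name: f1_from_precision_recall(p, r)
    for name, (p, r) in turbine_scores.items()
}

for name, f1 in turbine_f1.items():
    print(f"{name}: turbine-class F1 = {f1:.3f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two inputs, which is why Random Forest's low recall drags its turbine-class F1 below the Logistic Regression baseline despite its higher precision.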

Performance Evolution:

Key Insights:


Technical Highlights

Feature Engineering Process

Hyperparameter Optimization

Model Selection Strategy

  1. Algorithm comparison: Logistic Regression → Random Forest → XGBoost
  2. Feature engineering: Systematic evaluation of 12 configurations
  3. Hyperparameter tuning: Grid search on best feature set
  4. Final validation: Test set performance confirms production readiness

Data

The project uses:

Spatial matching is performed using geospatial joins to match turbines to NREL grid cells within a 25 km radius.
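The radius test behind this matching can be sketched in plain Python. This is a simplified stand-in, not the project's geospatial join: it uses a brute-force haversine distance check, the function names and coordinates are hypothetical, and the labeling rule (a grid cell is positive if at least one turbine lies within the radius) is an assumption about how matches become labels.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def label_grid_cells(grid_cells, turbines, radius_km=25.0):
    """Return 1 for each grid cell with at least one turbine within radius_km, else 0."""
    labels = []
    for cell_lat, cell_lon in grid_cells:
        has_turbine = any(
            haversine_km(cell_lat, cell_lon, t_lat, t_lon) <= radius_km
            for t_lat, t_lon in turbines
        )
        labels.append(int(has_turbine))
    return labels


# Hypothetical coordinates, for illustration only.
grid_cells = [(41.00, -95.00), (41.00, -96.00)]
turbines = [(41.10, -95.05)]

labels = label_grid_cells(grid_cells, turbines)
print(labels)
```

The brute-force loop compares every grid cell with every turbine, which is fine for a small example; a spatial index or a library-level spatial join scales better on full datasets.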


Project Structure


Deployment

The final model is deployed in a Streamlit web application: https://cranberryai4allproject.streamlit.app/