GHG Emissions Data: Future Predictions Using Machine Learning
Introduction
In this blog, I discuss my machine learning capstone project, in which I apply a range of algorithms (for both classification and regression) to identify the most reliable algorithm of each type: the one that performs best on both the training and test data and can therefore be relied on for future datasets.
The original data file as well as the Jupyter notebooks have been made public on my GitHub here.
Dataset
The data file “GHG_Emission.csv” was retrieved from the Alberta Energy Regulator (AER) website; the well locations have been changed, and some key properties have been generated synthetically or heavily altered for confidentiality reasons.
The exploitation of oil and gas reserves leads to an increase in greenhouse gas emissions. Methane and sequestered CO2 can migrate from wellbores and leak into water aquifers, the ground surface, and the atmosphere, which can cause significant environmental issues. Regular field monitoring should be applied to detect serious leakage through existing oil and gas wells, and the leakiest wells should be prioritized for remediation. The Alberta Energy Regulator (AER) operates such field tests for energy wells in Alberta, Canada, measuring the leakage rate (m3/day) and classifying existing wells as serious or non-serious in terms of leakage.
The following figure depicts the locations of 1,500 hydrocarbon wells. Per the dataset provided, 49% of the wells have been classified as serious and the rest as non-serious.
Regression and Classification
Gathering Data
First, the dataset was imported and read using pandas. np.random.seed(42) was used to fix the state of the random number generator, the data was shuffled, and the index was reset.
import numpy as np
import pandas as pd

np.random.seed(42)  # Fix the random state for reproducibility
df = pd.read_csv('GHG_Emission.csv', na_values=['NA', '?', ' ', 'NaN'])
df = df.sample(frac=1).reset_index(drop=True)  # Shuffle rows and reset index
df[0:5]  # Display top five rows
Data Processing
Stratified sampling was performed so that the training and test sets preserve the class distribution, and the data was split into training and test sets on that basis, as sketched below.
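As a minimal sketch (assuming the serious/non-serious label lives in a column here called 'Class'; substitute the actual column name from GHG_Emission.csv), a stratified split with scikit-learn could look like this:

from sklearn.model_selection import train_test_split

# Stratify on the hypothetical 'Class' column so both splits keep
# the same serious/non-serious proportions
train_set, test_set = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df['Class'])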
Instances outside the range μ ± 2.5σ were removed as outliers, missing values were imputed with the median, text attributes were handled with one-hot encoding, and the numeric features were standardized; a sketch of these steps follows.
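The sketch below shows one way these steps could be chained together with scikit-learn; the 'Class' label name and the automatic numeric/categorical column detection are assumptions rather than the exact notebook code.

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Separate features from the hypothetical 'Class' label
features = train_set.drop(columns=['Class'])
num_cols = features.select_dtypes(include=np.number).columns
cat_cols = features.select_dtypes(exclude=np.number).columns

# Outlier removal: drop rows with any numeric value outside mu +/- 2.5*sigma
mu, sigma = features[num_cols].mean(), features[num_cols].std()
outlier = ((features[num_cols] - mu).abs() > 2.5 * sigma).any(axis=1)
train_set = train_set[~outlier]  # rows with NaNs are kept and imputed below

# Median imputation + standardization for numeric columns,
# one-hot encoding for text columns
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])
X_train = preprocess.fit_transform(train_set.drop(columns=['Class']))
y_train = train_set['Class']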
CLASSIFICATION
Model Training for Classification
Binary classification was applied using the following machine learning models:
- Dummy Classifier
- Stochastic Gradient Descent
- Logistic Regression
- Support Vector Machine: Linear
- Support Vector Machine: Polynomial Kernel
- Decision Trees
- Random Forest
- Adaptive Boosting with Linear SVM
- Adaptive Boosting
- Hard and Soft Voting
- Shallow Neural Network (with 3 layers)
- Deep Neural Network (with 6 layers)
The hyperparameters were fine-tuned using RandomizedSearchCV with accuracy as the scoring metric, and the optimized parameters were used to predict accuracy. K-fold cross-validation with 5 folds (cv=5) was applied, and the mean of the 5 accuracies was calculated for each classifier.
These optimized hyperparameters were then used to evaluate each of the above algorithms on the test dataset as well; a sketch of the workflow follows.
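As a rough illustration of this workflow (not the exact notebook code), here is how tuning and evaluation could look for Random Forest; the search space is hypothetical, and X_train, y_train, X_test, and y_test are assumed to come from the preprocessing steps above (with the test features transformed by the same fitted pipeline).

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Hypothetical search space, shown for Random Forest only; each classifier
# in the list above gets its own parameter distributions
param_dist = {'n_estimators': [50, 100, 200, 500],
              'max_depth': [None, 5, 10, 20]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_dist, n_iter=10, scoring='accuracy',
                            cv=5, random_state=42)
search.fit(X_train, y_train)

# Mean 5-fold cross-validated accuracy of the tuned model on training data
cv_acc = cross_val_score(search.best_estimator_, X_train, y_train,
                         cv=5, scoring='accuracy').mean()

# Accuracy on the held-out test set
test_acc = search.best_estimator_.score(X_test, y_test)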
Model Performance for Classification
TRAINING DATASET
Here are the summary table and graph depicting the performance on the training dataset.
Clearly, Random Forest is the best algorithm on the training dataset.
TESTING DATASET
Here are the summary table and graph depicting the performance on the testing dataset.
Clearly, Random Forest is the best algorithm on the testing dataset.
Conclusion for Classification
Random Forest should be used for future datasets as it gives the best performance on both testing and training data.
REGRESSION
Model Training for Regression
Similar to binary classification, regression was applied with the following machine learning models:
- Linear Regression
- Support Vector Machine: Polynomial Kernel
- Decision Trees
- Random Forest
- Gradient Boosting
- Shallow Neural Network (with 3 layers)
The hyperparameters were fine-tuned using RandomizedSearchCV with RMSE as the scoring metric, and the optimized parameters were used to predict RMSE. K-fold cross-validation with 5 folds (cv=5) was applied, and the mean of the 5 RMSEs was calculated for each regressor.
These optimized hyperparameters were then used to evaluate each of the above algorithms on the test dataset as well; a sketch follows.
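The regression workflow follows the same pattern with RMSE as the metric; here is a minimal sketch with Random Forest. Note that scikit-learn exposes RMSE as the negated 'neg_root_mean_squared_error' scorer (available in version 0.22+), and y_rate_train, the leakage-rate (m3/day) target, is a hypothetical name.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Hypothetical search space, shown for Random Forest only
param_dist = {'n_estimators': [50, 100, 200, 500],
              'max_depth': [None, 5, 10, 20]}
search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                            param_dist, n_iter=10,
                            scoring='neg_root_mean_squared_error',
                            cv=5, random_state=42)
search.fit(X_train, y_rate_train)

# Mean 5-fold cross-validated RMSE on the training data (negate the score)
cv_rmse = -cross_val_score(search.best_estimator_, X_train, y_rate_train,
                           cv=5,
                           scoring='neg_root_mean_squared_error').mean()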
Model Performance for Regression
TRAINING DATASET
Here are the summary table and graph depicting the performance on the training dataset.
Clearly, Random Forest is the best algorithm on the training dataset.
TESTING DATASET
Here are the summary table and graph depicting the performance on the testing dataset.
Clearly, Random Forest is the best algorithm on the testing dataset.
Conclusion for Regression
Random Forest should be used for future datasets as it gives the best performance on both testing and training data.
The dataset and code used in this project are publicly available on my GitHub. If you have any questions or comments, or need further clarification, please don’t hesitate to contact me at aditimukerjee33@gmail.com or reach me at 403–671–7296. If you are interested in collaborating on a project, feel free to reach out.