GHG Emissions Data: Future Predictions Using Machine Learning

Aditi Mukerjee
5 min read · Dec 16, 2020


Introduction

In this blog, I discuss my machine learning capstone project, in which I apply several algorithms (for both classification and regression) to the same dataset and identify, for each task, the algorithm that performs best on both the training and test data and can therefore be recommended for future datasets.

The original data file as well as the Jupyter notebooks have been made public on my GitHub here.

Dataset

The data file “GHG_Emission.csv” was retrieved from the Alberta Energy Regulator (AER) website; the locations of the wells have been changed, and some key properties were generated synthetically or heavily manipulated for confidentiality reasons.

The exploitation of oil and gas reserves leads to an increase in greenhouse gas emissions. Methane and sequestered CO2 can migrate and leak from wellbores into water aquifers, to the ground surface, and into the atmosphere, which can cause significant environmental issues. Regular field monitoring is needed to detect serious leakage through existing oil and gas wells, and the leakiest wells must be prioritized for remediation. The Alberta Energy Regulator (AER) operates such field tests for energy wells in Alberta, Canada: it measures the leakage rate (m3/day) and classifies existing wells as serious or non-serious in terms of leakage.

The following figure depicts the locations of the 1500 hydrocarbon wells. As per the dataset provided, 49% of the wells are classified as serious and the remaining 51% as non-serious.

Location of 1500 wells and classification of existing wells

Regression and Classification

Gathering Data

First, the dataset was read using pandas. np.random.seed(42) was called to fix the random state so the shuffle is reproducible, the data was shuffled, and the index was reset.

import numpy as np
import pandas as pd

np.random.seed(42)  # Fix the random state so the shuffle is reproducible
df = pd.read_csv('GHG_Emission.csv', na_values=['NA', '?', ' ', 'NaN'])
df = df.sample(frac=1)                   # Shuffle the rows
df.reset_index(inplace=True, drop=True)  # Reset index
df[0:5]  # Display top five rows

Data Processing

Stratified sampling was performed so that the class distribution is preserved, and the data was split into training and test sets on that basis.
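A stratified split like the one described can be sketched as follows. This is a minimal illustration on toy data: the column name 'Serious' and the 80/20 split ratio are assumptions, not necessarily what the project used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the wells DataFrame; 'Serious' is a hypothetical
# name for the binary target column (1 = serious leak, 0 = non-serious).
df = pd.DataFrame({'Serious': [1] * 49 + [0] * 51, 'depth': range(100)})

# stratify= keeps the 49/51 class balance in both the training
# and the test set, instead of leaving it to chance.
train_set, test_set = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df['Serious'])
```

Without `stratify`, a purely random split of a small dataset can leave the test set with a noticeably different class ratio than the training set.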

Outliers (instances outside the range 𝜇 ± 2.5𝜎) were removed, missing values were imputed with the median, text attributes were handled with one-hot encoding, and numeric attributes were standardized.
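These preprocessing steps can be sketched like this. The column names ('depth', 'well_type') and the toy data are placeholders, not the dataset's actual fields; the outlier rule and transformers follow the description above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def drop_outliers(frame, cols, k=2.5):
    """Keep only rows whose values in cols lie within mu +/- k*sigma."""
    mask = pd.Series(True, index=frame.index)
    for c in cols:
        mu, sigma = frame[c].mean(), frame[c].std()
        mask &= frame[c].between(mu - k * sigma, mu + k * sigma)
    return frame[mask]

# Toy data: the last 'depth' value is an obvious outlier.
df = pd.DataFrame({'depth': [1.0, 2, 3, 2, 1, 2, 3, 2, 1, 2, 100],
                   'well_type': ['gas'] * 6 + ['oil'] * 5})
clean = drop_outliers(df, ['depth'])

# Median imputation + standardization for numeric columns,
# one-hot encoding for text columns.
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), ['depth']),
    ('cat', OneHotEncoder(), ['well_type']),
])
X = preprocess.fit_transform(clean)
```

Bundling the imputation, scaling, and encoding into a `ColumnTransformer` ensures the same transformations fitted on the training set are applied unchanged to the test set.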

CLASSIFICATION

Model Training for Classification

Binary classification was performed using the following machine learning models:

  • Dummy Classifier
  • Stochastic Gradient Descent
  • Logistic Regression
  • Support Vector Machine: Linear
  • Support Vector Machine: Polynomial Kernel
  • Decision Trees
  • Random Forest
  • Adaptive Boosting with Linear SVM
  • Adaptive Boosting
  • Hard and Soft Voting
  • Shallow Neural Network (with 3 layers)
  • Deep Neural Network (with 6 layers)

The hyperparameters were fine-tuned using RandomizedSearchCV with accuracy as the scoring metric. The tuned models were then evaluated with 5-fold cross-validation (cv=5), and the mean of the five accuracies was reported for each classifier.

The optimized hyperparameters for each algorithm were also used to evaluate performance on the test dataset.
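For one classifier (Random Forest), the tuning-and-scoring loop described above can be sketched as follows. The synthetic data and the parameter ranges are illustrative assumptions, not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=200, random_state=42)

# Randomized search over an illustrative hyperparameter space,
# scored by accuracy.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={'n_estimators': [50, 100, 200],
                         'max_depth': [3, 5, 10, None]},
    n_iter=5, scoring='accuracy', cv=5, random_state=42)
search.fit(X, y)

# Re-score the tuned model with 5-fold CV and average the accuracies.
scores = cross_val_score(search.best_estimator_, X, y, cv=5,
                         scoring='accuracy')
mean_accuracy = scores.mean()
```

The same pattern repeats for each of the classifiers in the list above, swapping in the estimator and its parameter distributions.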

Model Performance for Classification

TRAINING DATASET

Here are the summary table and graph depicting the performance on the training dataset.

Performance on the training dataset.

Clearly, Random Forest is the best algorithm on the training dataset.

TESTING DATASET

Here are the summary table and graph depicting the performance on the testing dataset.

Performance on the test dataset

Clearly, Random Forest is the best algorithm on the testing dataset.

Conclusion for Classification

Random Forest should be used for future datasets as it gives the best performance on both testing and training data.

REGRESSION

Model Training for Regression

As with binary classification, regression was performed using the following machine learning models:

  • Linear Regression
  • Support Vector Machine: Polynomial Kernel
  • Decision Trees
  • Random Forest
  • Gradient Boosting
  • Shallow Neural Network (with 3 layers)

The hyperparameters were fine-tuned using RandomizedSearchCV with RMSE as the scoring metric. The tuned models were then evaluated with 5-fold cross-validation (cv=5), and the mean of the five RMSEs was reported for each regressor.

The optimized hyperparameters for each algorithm were also used to evaluate performance on the test dataset.
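The regression version of the loop is analogous, except that scikit-learn expresses RMSE as the negated scorer 'neg_root_mean_squared_error' (scorers are maximized, so errors are negated). Again, the synthetic data and parameter grid are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Synthetic stand-in for the preprocessed training data.
X, y = make_regression(n_samples=200, noise=10, random_state=42)

# Randomized search scored by (negated) RMSE.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={'n_estimators': [50, 100],
                         'max_depth': [3, 5, None]},
    n_iter=4, scoring='neg_root_mean_squared_error', cv=5, random_state=42)
search.fit(X, y)

# Negate the scores back to positive RMSEs and average over the 5 folds.
scores = -cross_val_score(search.best_estimator_, X, y, cv=5,
                          scoring='neg_root_mean_squared_error')
mean_rmse = scores.mean()
```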

Model Performance for Regression

TRAINING DATASET

Here are the summary table and graph depicting the performance on the training dataset.

Performance on the training dataset

Clearly, Random Forest is the best algorithm on the training dataset.

TESTING DATASET

Here are the summary table and graph depicting the performance on the testing dataset.

Performance on the testing dataset

Clearly, Random Forest is the best algorithm on the testing dataset.

Conclusion for Regression

Random Forest should be used for future datasets as it gives the best performance on both testing and training data.

The dataset used in this project, along with my code, is available on my GitHub for public use. If you have any questions or comments or need any further clarification, please don’t hesitate to contact me at aditimukerjee33@gmail.com or reach me at 403–671–7296. If you are interested in collaborating on any project, feel free to reach out without any hesitation.

If you enjoyed this story, please click the 👏 button and share to help others find it! Feel free to leave a comment below.


Written by Aditi Mukerjee

Engineer. Data Analyst. Machine Learning enthusiast
