Machine Learning Pipelines with Azure ML Studio

Aditi Mukerjee
Dec 16, 2020

Introduction

In this post, I build an ML pipeline using Microsoft Azure ML Studio.

Dataset

I used the Adult Census dataset available within Microsoft Azure ML Studio to predict whether a person's income exceeds $50K per year. This dataset comes from the UCI Machine Learning Repository.

Data selection and cleaning

  • The Adult Census dataset was imported into the dashboard.
  • To handle missing data, all missing values were replaced with 0 using the Clean Missing Data module.
  • Next, using the Select Columns in Dataset module, irrelevant and redundant columns were excluded from the data to reduce clutter during analysis.
  • Once the final set of features was ready, the Edit Metadata module was used to convert the relevant columns from string type to categorical features.
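The same three steps can be sketched outside the Studio in plain Python. This is only an illustration: the sample rows, the choice of excluded columns (fnlwgt and education), and the helper names are assumptions, not the exact configuration used in the experiment.

```python
# Hypothetical sample rows; None marks a missing value.
RAW_ROWS = [
    {"age": 39, "workclass": "State-gov", "fnlwgt": 77516,
     "education": "Bachelors", "education-num": 13, "income": "<=50K"},
    {"age": 50, "workclass": None, "fnlwgt": 83311,
     "education": "Bachelors", "education-num": 13, "income": ">50K"},
    {"age": 38, "workclass": "Private", "fnlwgt": 215646,
     "education": "HS-grad", "education-num": None, "income": "<=50K"},
]


def clean_missing(rows, fill_value=0):
    """Replace every missing value with fill_value (0, as in the post)."""
    return [{k: (fill_value if v is None else v) for k, v in row.items()}
            for row in rows]


def exclude_columns(rows, excluded):
    """Drop the columns named in `excluded` from every row."""
    return [{k: v for k, v in row.items() if k not in excluded}
            for row in rows]


def to_categorical(rows, column):
    """Map each distinct value of `column` to an integer category code."""
    categories = sorted({str(row[column]) for row in rows})
    codes = {category: index for index, category in enumerate(categories)}
    converted = [{**row, column: codes[str(row[column])]} for row in rows]
    return converted, categories


cleaned = clean_missing(RAW_ROWS)
# Assumption: these two columns stand in for the "irrelevant and redundant" ones.
selected = exclude_columns(cleaned, excluded={"fnlwgt", "education"})
prepared, workclass_categories = to_categorical(selected, "workclass")
```

Note that filling with 0 turns a missing string value into its own category, which is one side effect of using a single substitute value for every column.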

Accounting for Class Imbalance

  • Visualizing the data revealed a class imbalance in the dataset.
  • People earning less than $50K/yr outnumber those earning more than $50K/yr by more than two to one. Upsampling the minority class across the entire dataset was avoided, since synthetic observations in the test data would distort the estimate of how well the model generalizes. Since one of the primary goals of model validation is to estimate how the model will perform on unseen data, upsampling correctly is critical.
  • The right way is to first create the training and test sets and then upsample only the training data.
  • Two models were trained: one on the upsampled data and the other on just the original pre-processed data.
  • The performance of both models was then compared to judge the efficacy of creating synthetic observations by upsampling the minority class.
  • For the upsampled model, the SMOTE module was applied with the income column, which showed the class imbalance, as the label.
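The order of operations described above (split first, then upsample only the training portion) can be sketched as follows. This is a simplified, SMOTE-style interpolation written for illustration, not the Azure SMOTE module itself; the toy data, the function names, and the parameter values are all assumptions.

```python
import math
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy data (hypothetical): two numeric features, label 1 is the minority class.
rng = random.Random(42)
data = [((rng.gauss(35, 8), rng.gauss(38, 5)), 0) for _ in range(30)]
data += [((rng.gauss(45, 8), rng.gauss(46, 5)), 1) for _ in range(12)]

# 1) Split first, so the test set contains only real observations.
train_set, test_set = train_test_split(data, test_fraction=0.25, seed=1)

# 2) Upsample the minority class in the training set only.
majority = [x for x, y in train_set if y == 0]
minority = [x for x, y in train_set if y == 1]
synthetic = smote_like(minority, n_new=len(majority) - len(minority))
balanced_train = train_set + [(x, 1) for x in synthetic]
```

After this step the training set has equal numbers of both classes, while the test set is left exactly as it was split.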

Training a Two-Class Boosted Decision Tree Model and Hyperparameter Tuning

  • A Two-Class Boosted Decision Tree model was trained to predict income.
  • Hyperparameter tuning was done using the Tune Model Hyperparameters module.
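To illustrate what a hyperparameter sweep does conceptually, here is a minimal random-sweep sketch. The parameter names, the candidate values, and the placeholder scoring function are all hypothetical; in a real run the score would come from training a model with each setting and measuring it on validation data.

```python
import random


def random_sweep(param_space, score_fn, n_runs=10, seed=0):
    """Try n_runs random combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_runs):
        params = {name: rng.choice(values)
                  for name, values in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space for a boosted decision tree.
PARAM_SPACE = {
    "max_leaves": [8, 32, 128],
    "min_samples_per_leaf": [1, 10, 50],
    "learning_rate": [0.05, 0.1, 0.2, 0.4],
    "n_trees": [20, 100, 500],
}


def placeholder_score(params):
    """Stand-in for 'train with params and return a validation metric'."""
    return (1.0
            - abs(params["learning_rate"] - 0.2)
            - abs(params["max_leaves"] - 32) / 1000)


best_params, best_score = random_sweep(PARAM_SPACE, placeholder_score, n_runs=20)
```

A random sweep samples only some combinations, which keeps the cost fixed at n_runs trainings regardless of how large the search space is.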

Scoring and Evaluation

  • The two models were compared using the Score Model and Evaluate Model modules. ROC curves and the AUC (Area Under the Curve) metric were used to evaluate and diagnose the models.
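For readers unfamiliar with these metrics, the sketch below shows how an ROC curve and its AUC are computed from true labels and scored probabilities. The labels and scores are made-up values for illustration, not output from the models in this post.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs obtained by
    lowering the decision threshold across every distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = income above $50K) and scored probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
auc_value = auc(curve)  # 8/9 for this toy example
```

An AUC of 1.0 means every positive example is scored above every negative one, while 0.5 corresponds to random ranking.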

The performance of both original and upsampled data (shown as the blue curve) is depicted below:

The performance of the upsampled data (shown as the blue curve) is depicted below:

As the above ROC graph shows, the model trained on the upsampled data performs well, as reflected in its Area Under the Curve (AUC).

Here is the snapshot of the Azure ML dashboard created so far.

Predictions

  • Once the experiment run completed successfully, the next step was to create a scoring (prediction) experiment.
  • The prediction experiment is created automatically by clicking the Predictive Web Service option.
  • The predictions, along with the associated probability for each data point, can be viewed there.

Here is a screenshot of the prediction made by the software for the Adult Census data:

Prediction

We can say with relatively low probability that the salary of that person is more than 50k.

We can say with relatively high probability that the salary of that person is more than 50k.

Conclusions

Microsoft Azure makes it easy to build an ML pipeline for making predictions.

If you have any questions or comments or need any further clarifications please don’t hesitate to contact me at aditimukerjee33@gmail.com or reach me at 403–671–7296. If you are interested in collaborating on any project, feel free to reach out to me without any hesitation.

If you enjoyed this story, please click the 👏 button and share to help others find it! Feel free to leave a comment below.
