Exploring Wine Reviews using Pandas

Introduction

There are a lot of wine enthusiasts all over the world.

In this blog, I explored the wine review data using Pandas.

The code for the “Wine Review” analysis is public on my GitHub here.

I am a Professional Engineer (P.Eng) and a Project Management Professional (PMP) with a strong engineering background in Thermal Energy, Oil and Gas Processing, and the Water and Wastewater Treatment Industry, and over 8 years of experience working in engineering consultancies.

I am a self-motivated lifelong learner and critical thinker who is passionate about data, with skills in programming and statistical analysis. My greatest strength lies in squeezing information out of data to derive insight, come up with creative out-of-the-box solutions, and add value.

Data Set

For this blog, we use the Kaggle dataset Wine Reviews.

The dataset contains 130k wine reviews with variety, location, winery, price, and description.
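
A minimal sketch of loading the data, assuming the Kaggle CSV has been downloaded locally. The `load_reviews` helper and the sample rows are hypothetical and only stand in for the real file so the snippet runs on its own.

```python
import io
import pandas as pd

def load_reviews(source):
    """Read a wine-review CSV whose first column holds the row index."""
    return pd.read_csv(source, index_col=0)

# Tiny made-up sample standing in for the downloaded file.
sample_csv = io.StringIO(
    ",country,points,price,title\n"
    "0,Italy,87,,Wine A\n"
    "1,Portugal,87,15.0,Wine B\n"
)
reviews_sample = load_reviews(sample_csv)
print(reviews_sample.shape)  # (number of rows, number of columns)
```

With the real download, the same helper would be called with the path to the CSV (for example `load_reviews("winemag-data-130k-v2.csv")`, where the file name is an assumption about the Kaggle download).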

Wine — https://unsplash.com/photos/EQSPI11rf68

Data Analysis

Here is a glimpse of the dataset.

reviews.shape

Out[37]:
(129971, 13)

There are approximately 130k rows of data and 13 columns to analyze.

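Next, select the wines from Italy or France that scored at least 90 points. Because & binds more tightly than |, the two country conditions need their own parentheses.

reviews.loc[((reviews.country == 'Italy') | (reviews.country == 'France')) & (reviews.points >= 90)]

The result is a subset of the reviews with the same 13 columns.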
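
A filter like the one above mixes | and &, so a small sketch on made-up rows may help show how operator precedence changes the result, and how `isin` avoids the issue. The `toy` frame and its values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "France", "Spain"],
    "points": [85, 92, 95, 88, 93],
})

# Without extra parentheses, & is evaluated before |,
# so every Italian wine passes regardless of its score.
loose = toy.loc[(toy.country == "Italy") | (toy.country == "France") & (toy.points >= 90)]

# isin keeps the country test as a single condition, then the score cut applies to all rows.
strict = toy.loc[toy.country.isin(["Italy", "France"]) & (toy.points >= 90)]

print(len(loose), len(strict))
```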

For an economical wine buyer, what is the “best bargain”? Find the best bargain wine with the highest points-to-price ratio in the dataset.

bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
bargain_wine

Out[55]:

'Bandit NV Merlot (California)'
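
One detail worth knowing: `idxmax` returns only the first row with the top ratio and skips rows where the price is missing. A hedged sketch on made-up data showing how to list every tied wine instead:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C", "Wine D"],
    "points": [86, 86, 90, 88],
    "price": [4.0, 4.0, 20.0, None],
})

ratio = toy.points / toy.price  # NaN where the price is missing

# idxmax ignores NaN and returns the label of the first maximum.
best_first = toy.loc[ratio.idxmax(), "title"]

# Comparing against the maximum keeps every row that ties for the best ratio.
best_all = toy.loc[ratio == ratio.max(), "title"].tolist()

print(best_first, best_all)
```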

There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be “tropical” or “fruity”? Count the number of times each of these two words appears in the description column in the dataset.

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
descriptor_counts

Out[56]:
tropical 3607
fruity 9090
dtype: int64
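
The same count can be sketched with the vectorized `str.contains`, which also makes it easy to ignore case. Note that both approaches match substrings, not whole words. The sample descriptions below are invented.

```python
import pandas as pd

descriptions = pd.Series([
    "Tropical fruit aromas with a fruity finish",
    "Ripe and fruity, with tropical notes",
    "Dry and herbal",
])

# regex=False treats the pattern as a plain substring, like the `in` operator.
n_trop = descriptions.str.contains("tropical", regex=False).sum()
n_fruity = descriptions.str.contains("fruity", regex=False).sum()

# case=False also counts capitalized occurrences such as "Tropical".
n_trop_any_case = descriptions.str.contains("tropical", case=False, regex=False).sum()

print(n_trop, n_fruity, n_trop_any_case)
```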

Who are the most common wine reviewers in the dataset?

reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()

Out[58]:
taster_twitter_handle
@AnneInVino 3685
@JoeCz 5147
...
@winewchristina 6
@worldwineguys 1005
Name: taster_twitter_handle, Length: 15, dtype: int64
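
The groupby output is ordered alphabetically by handle, so the busiest reviewer is not obvious at a glance. As a hedged alternative, `value_counts` returns the counts already sorted from most to least frequent. The handles below are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "taster_twitter_handle": ["@a", "@b", "@a", None, "@a", "@b", "@c"],
})

# value_counts drops missing handles by default and sorts in descending order.
handle_counts = toy.taster_twitter_handle.value_counts()
top_handle = handle_counts.idxmax()

print(handle_counts)
print(top_handle)
```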

Which is the most expensive wine in the dataset and what is its price?

expensive_idx = reviews.price.idxmax()  # index label of the row with the highest price
expensive_wine = reviews.loc[expensive_idx, 'title']
expensive_wine

Out[10]:

'Château les Ormes Sorbet 2013  Médoc'

(reviews.price).max()

Out[11]:

3300.0
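
As a hedged alternative to calling `idxmax` and `max` separately, `nlargest` can return the priciest rows with their titles and prices side by side. The wines and prices in this sketch are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C", "Wine D"],
    "price": [20.0, 45.0, 3300.0, 150.0],
})

# The two most expensive wines, highest price first.
top2 = toy.nlargest(2, "price")[["title", "price"]]

print(top2)
```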

Re-checking the wine and its cost

reviews.loc[reviews.title == 'Château les Ormes Sorbet 2013  Médoc']

What is the lowest price recorded for each variety?

k = reviews.groupby(['variety']).price.agg(['min'])
k.sort_values(by=['min'], ascending=True)

What is the average review score given out by each reviewer?

reviews.groupby('taster_name').points.mean()

Out[27]:
taster_name
Alexander Peartree 85.855422
Anna Lee C. Iijima 88.415629
...
Susan Kostrzewa 86.609217
Virginie Boone 89.213379
Name: points, Length: 19, dtype: float64
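
A small sketch, on made-up reviewers, of how the mean score and the number of reviews could be computed in one pass and sorted so the most generous reviewer comes first:

```python
import pandas as pd

toy = pd.DataFrame({
    "taster_name": ["Ann", "Ann", "Bob", "Bob", "Bob"],
    "points": [90, 88, 85, 87, 86],
})

# One row per reviewer with the average score and the review count.
summary = (
    toy.groupby("taster_name").points
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

print(summary)
```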
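
Which combinations of country and variety are most common?

k = reviews.groupby(['country', 'variety']).variety.agg([len])
k.sort_values(by='len', ascending=False)

How many reviews did each taster write?

k = reviews.groupby(['taster_name']).taster_name.count()
k.sort_values(ascending=False)

Out[34]:
taster_name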
Roger Voss 25514
Michael Schachner 15134
...
Fiona Adams 27
Christina Pickard 6
Name: taster_name, Length: 19, dtype: int64

What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the region_1 field. This field is often missing data, so replace missing values with Unknown. Sort in descending order.

k=reviews.region_1.fillna("Unknown")
m=k.value_counts()
m.sort_values(ascending=False)

Out[35]:
Unknown 21247
Napa Valley 4480
...
Southern Highlands 1
Maury Sec 1
Name: region_1, Length: 1230, dtype: int64
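
Since `value_counts` already sorts from most to least frequent, the steps above can be sketched as a single chained expression. The regions in this toy Series are made up.

```python
import pandas as pd

regions = pd.Series(["Napa Valley", None, "Etna", "Napa Valley", None, None])

# Label missing regions, then count; the result is already in descending order.
region_counts = regions.fillna("Unknown").value_counts()

print(region_counts)
```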

Conclusion

Wine Reviews is an awesome dataset, and I learned a lot from analyzing it. I hope you give the dataset a try and analyze it yourself.

Please share your experience in the comments below.

If you have any questions or comments or need any further clarifications please don’t hesitate to contact me at aditimukerjee33@gmail.com or reach me at 403–671–7296. If you are interested in collaborating on any project, feel free to reach out to me without any hesitation.


Engineer. Data Analyst. Machine Learning enthusiast