Exploring Wine Reviews using Pandas
Introduction
There are a lot of wine enthusiasts all over the world.
In this blog post, I explore the wine review data using Pandas.
The code for the “Wine Review” data has been made public on my GitHub here.
I am a Professional Engineer (P.Eng) and a Project Management Professional (PMP) with a strong engineering background in Thermal Energy, Oil and Gas Processing, and the Water and Wastewater Treatment Industry, and over 8 years of experience working in engineering consultancies.
I am a self-motivated lifetime learner and critical thinker who is passionate about data, with skills in programming and statistical analysis. My greatest strength lies in squeezing information out of data to derive insight, come up with creative out-of-the-box solutions, and add value.
Data Set
For this blog, I used the Kaggle dataset Wine Reviews.
The dataset contains 130k wine reviews with variety, location, winery, price, and description.
Data Analysis
Looking at the dataset
Here is a glimpse of the dataset.
reviews.shape
Out[37]:
(129971, 13)
There are approximately 130k rows of data and 13 columns to analyze.
Focussing on Italian or French wines with 90+ points
reviews.loc[reviews.country.isin(['Italy', 'France']) & (reviews.points >= 90)]
This keeps only the wines from Italy or France that scored 90 points or more, with all 13 columns available to analyze.
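Combining conditions in pandas is easy to get wrong, because & binds more tightly than |, so the country test and the score test need to be grouped explicitly. The sketch below shows one way to express "Italy or France, and 90+ points" with isin(). The five-row frame named toy is made up for illustration and is not taken from the Kaggle file.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Kaggle file.
toy = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "France", "US"],
    "points": [92, 85, 90, 88, 95],
})

# isin() covers the "either country" part; & then applies the score
# cut-off to both countries. The comparison is wrapped in parentheses
# because & binds more tightly than >=.
mask = toy["country"].isin(["Italy", "France"]) & (toy["points"] >= 90)
top = toy.loc[mask]
print(top)
```

Only the Italian and French rows at 90 points or above survive the filter; the US row is excluded even though its score is the highest.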
Bargain Wine
For an economical wine buyer, what is the “best bargain”? Find the best bargain wine with the highest points-to-price ratio in the dataset.
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
bargain_wine
Out[55]:
'Bandit NV Merlot (California)'
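One thing worth knowing about idxmax() is that it returns only the first row holding the maximum, so other wines with the same ratio are not shown. The sketch below, on a made-up frame (toy and its wine names are hypothetical), shows how missing prices are skipped and how tied wines could be listed.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Kaggle file.
toy = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C", "Wine D", "Wine E"],
    "points": [88, 86, 90, 85, 43],
    "price": [22.0, 4.0, None, 10.0, 2.0],
})

# Points-to-price ratio; rows with a missing price become NaN.
ratio = toy["points"] / toy["price"]

# idxmax() skips NaN by default and returns the first index label
# at which the maximum occurs.
best_idx = ratio.idxmax()
best_title = toy.loc[best_idx, "title"]

# Every wine that shares the top ratio, not just the first one.
tied = toy.loc[ratio == ratio.max(), "title"].tolist()
print(best_title, tied)
```

If ties matter for the analysis, comparing against ratio.max() as above is one simple way to see all of them.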
How many times have wines been described as tropical or fruity
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be “tropical” or “fruity”? Count the number of times each of these two words appears in the description column in the dataset.
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
descriptor_counts
Out[56]:
tropical 3607
fruity 9090
dtype: int64
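The same kind of count can also be written with the vectorised string method str.contains(). The sketch below is a variation, not the code used for the numbers above: it is case-insensitive and treats missing descriptions as non-matches, so on the real data it may give different counts. The Series named desc is made up for illustration.

```python
import pandas as pd

# Hypothetical descriptions, not taken from the Kaggle file.
desc = pd.Series([
    "Ripe tropical fruit with a fruity finish",
    "Tropical notes of mango",   # capital T
    "Dry and mineral",
    None,                        # missing description
])

# case=False also matches "Tropical"; na=False counts a missing
# description as "no match" instead of propagating a missing value.
n_trop = desc.str.contains("tropical", case=False, na=False).sum()
n_fruity = desc.str.contains("fruity", case=False, na=False).sum()

descriptor_counts = pd.Series([n_trop, n_fruity], index=["tropical", "fruity"])
print(descriptor_counts)
```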
Most common wine reviewers
Who are the most common wine reviewers in the dataset?
reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
Out[58]:
taster_twitter_handle
@AnneInVino 3685
@JoeCz 5147
...
@winewchristina 6
@worldwineguys 1005
Name: taster_twitter_handle, Length: 15, dtype: int64
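The grouped count above is ordered alphabetically by handle, so the busiest reviewer still has to be picked out by eye. As a sketch of an alternative, value_counts() returns the same counts already sorted from most to least frequent. The handles below are made up for illustration.

```python
import pandas as pd

# Hypothetical handles, not taken from the Kaggle file.
handles = pd.Series(
    ["@a", "@b", "@a", None, "@a", "@b", "@c"],
    name="taster_twitter_handle",
)

# value_counts() drops missing values by default and sorts the
# result in descending order of frequency.
counts = handles.value_counts()

# The first label is therefore the most frequent handle.
top_handle = counts.index[0]
print(counts)
print(top_handle)
```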
Most expensive wine and its cost
Which is the most expensive wine in the dataset and what is its price?
# For a Series, idxmax() returns the index label of the first row holding the maximum value.
expensive_idx = reviews.price.idxmax()
expensive_wine = reviews.loc[expensive_idx, 'title']
expensive_wine
Out[10]:
'Château les Ormes Sorbet 2013 Médoc'
(reviews.price).max()
Out[11]:
3300.0
Re-checking the wine and its cost
reviews.loc[reviews.title == 'Château les Ormes Sorbet 2013 Médoc']
Least expensive wine variety
k = reviews.groupby('variety').price.agg(['min'])
k.sort_values(by=['min'], ascending = True)
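To make the grouping step concrete, here is a minimal sketch on a made-up frame (toy and its prices are hypothetical). It takes the minimum price per variety as a Series and sorts it, so the cheapest variety is the first label.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Kaggle file.
toy = pd.DataFrame({
    "variety": ["Merlot", "Merlot", "Riesling", "Riesling", "Syrah"],
    "price": [12.0, 9.0, 15.0, None, 7.0],
})

# min() ignores missing prices within each group.
min_price = toy.groupby("variety")["price"].min().sort_values()

# After an ascending sort, the first label is the cheapest variety.
cheapest = min_price.index[0]
print(min_price)
print(cheapest)
```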
Average reviewer score
What is the average review score given out by each reviewer?
reviews.groupby('taster_name').points.mean()
Out[27]:
taster_name
Alexander Peartree 85.855422
Anna Lee C. Iijima 88.415629
...
Susan Kostrzewa 86.609217
Virginie Boone 89.213379
Name: points, Length: 19, dtype: float64
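An average is easier to interpret next to the number of reviews behind it, since a reviewer with only a handful of reviews can have an extreme mean. The sketch below computes both in one pass with agg(). The frame named toy and its reviewer names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Kaggle file.
toy = pd.DataFrame({
    "taster_name": ["Ann", "Ann", "Bob", "Bob", "Bob", "Cy"],
    "points": [90, 88, 85, 87, 86, 95],
})

# One row per reviewer with the mean score and the review count,
# sorted so the most generous reviewer comes first.
summary = (
    toy.groupby("taster_name")["points"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Here the top average belongs to a reviewer with a single review, which is exactly the situation the count column helps to spot.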
Which combination of country and variety is the most common?
k=reviews.groupby(['country', 'variety']).variety.agg([len])
k.sort_values(by='len', ascending=False)
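Grouping on two columns produces a result indexed by (country, variety) pairs. The following sketch uses size() as an alternative way to count rows per pair on a made-up frame (toy is hypothetical) and then reads off the most common pair.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Kaggle file.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "US", "France", "France", "Italy"],
    "variety": ["Pinot Noir", "Pinot Noir", "Pinot Noir", "Merlot",
                "Merlot", "Merlot", "Sangiovese"],
})

# size() counts the rows in each (country, variety) group and returns
# a Series with a two-level index.
pair_counts = toy.groupby(["country", "variety"]).size()

# idxmax() returns the index label of the largest count as a tuple.
most_common = pair_counts.idxmax()
print(pair_counts.sort_values(ascending=False))
print(most_common)
```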
Reviewers ranked by number of reviews
k=reviews.groupby(['taster_name']).taster_name.count()
k.sort_values(ascending=False)
Out[34]:
taster_name
Roger Voss 25514
Michael Schachner 15134
...
Fiona Adams 27
Christina Pickard 6
Name: taster_name, Length: 19, dtype: int64
Most common wine producing region
What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the region_1 field. This field is often missing data, so replace missing values with Unknown. Sort in descending order.
k=reviews.region_1.fillna("Unknown")
m=k.value_counts()
m.sort_values(ascending=False)
Out[35]:
Unknown 21247
Napa Valley 4480
...
Southern Highlands 1
Maury Sec 1
Name: region_1, Length: 1230, dtype: int64
Conclusion
Wine Reviews is an awesome dataset, and I learned a lot from analyzing it. I hope you give the dataset a try and analyze it yourself.
Please share your experience in the comments below.
If you have any questions or comments or need any further clarifications please don’t hesitate to contact me at aditimukerjee33@gmail.com or reach me at 403–671–7296. If you are interested in collaborating on any project, feel free to reach out to me without any hesitation.