| Contents |
|---|
| Dataset Description |
| Columns Description |
| EDA Questions |
| Data Wrangling |
| Data Cleaning |
| Data Visualization |
| Conclusion |
| Built with |
There are two datasets that provide information on samples of red and white variants of the Portuguese "Vinho Verde" wine. Each sample of wine was rated for quality by wine experts and examined with physicochemical tests. Due to privacy and logistic issues, only data on these physicochemical properties and quality ratings are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). The data is originally from the UCI Machine Learning Repository.
- fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
- volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste.
- citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines.
- residual sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.
- chlorides: the amount of salt in the wine.
- free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
- total sulfur dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
- density: the density of wine is close to that of water, depending on the percent alcohol and sugar content.
- pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale.
- sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
- alcohol: the percent alcohol content of the wine.
- quality: score between 0 and 10.
- Q1: What chemical characteristics are most important in predicting the quality of wine?
- Q2: Is a certain type of wine (red or white) associated with higher quality?
- Q3: Do wines with higher alcoholic content receive better ratings?
- Q4: Do sweeter wines (more residual sugar) receive better ratings?
- Q5: What level of acidity (pH) is associated with the highest quality?
Our data can be found in the wineQualityReds.csv and wineQualityWhites.csv files provided in this repository,
downloaded from Kaggle
and originally from the UCI Machine Learning Repository.
- The red dataframe consists of 1599 records and 13 attributes, while the white dataframe consists of 4898 records and the same attributes.
- Neither dataframe has NaNs or duplicated values.
- We will combine the two dataframes and append a new categorical column to indicate the wine color for better analysis.
- Column data types are consistent.
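The wrangling steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the helper names `combine_wines` and `summarize` and the column name `color` are assumptions made for the example.

```python
import pandas as pd


def combine_wines(red: pd.DataFrame, white: pd.DataFrame) -> pd.DataFrame:
    """Stack the red and white frames and tag each row with its wine color.

    The column name "color" is an assumption made for this sketch.
    """
    red = red.assign(color="red")        # assign() returns a copy
    white = white.assign(color="white")
    return pd.concat([red, white], ignore_index=True)


def summarize(df: pd.DataFrame) -> dict:
    """Report the shape, total missing values, and duplicated rows."""
    return {
        "shape": df.shape,
        "missing": int(df.isna().sum().sum()),
        "duplicated": int(df.duplicated().sum()),
    }
```

With the repository files, usage would be along the lines of `red = pd.read_csv("wineQualityReds.csv")`, `white = pd.read_csv("wineQualityWhites.csv")`, then `summarize(red)`, `summarize(white)`, and `combine_wines(red, white)`.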
The `Unnamed: 0` column is dropped.
We ended up with 13 columns and 6497 rows to begin the analysis with.
A new CSV file containing our full data is saved as wine_full.csv.
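A hedged sketch of the cleaning step is shown below. The helper name `drop_index_column` is hypothetical, and the sketch assumes `Unnamed: 0` is a leftover row index from an earlier CSV export, which is how pandas names a header-less first column.

```python
import pandas as pd


def drop_index_column(df: pd.DataFrame, column: str = "Unnamed: 0") -> pd.DataFrame:
    """Return a copy of df without the leftover index column.

    errors="ignore" makes the call a no-op if the column is already gone.
    """
    return df.drop(columns=[column], errors="ignore")
```

The cleaned frame can then be written with `cleaned.to_csv("wine_full.csv", index=False)`. Passing `index=False` keeps pandas from writing the row index, so reloading the file does not produce a new `Unnamed: 0` column.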
Using Matplotlib and Seaborn, we made several charts to explore how the attributes in our dataset correlate with each other and with quality. The insights are discussed in the next section.
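As one illustration of this kind of chart, the sketch below ranks the numeric attributes by their Pearson correlation with quality and draws the full correlation matrix. It uses plain Matplotlib to stay dependency-light; the function names are hypothetical and the target column name `quality` is assumed.

```python
import matplotlib.pyplot as plt
import pandas as pd


def quality_correlations(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Pearson correlation of each numeric column with the target, highest first."""
    corr = df.select_dtypes(include="number").corr()
    return corr[target].drop(target).sort_values(ascending=False)


def plot_correlation_matrix(df: pd.DataFrame):
    """Draw the correlation matrix of the numeric columns as a heatmap."""
    corr = df.select_dtypes(include="number").corr()
    fig, ax = plt.subplots(figsize=(8, 6))
    # Fix the color scale to [-1, 1] so the neutral color means "no correlation".
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    fig.colorbar(image, ax=ax, label="Pearson r")
    fig.tight_layout()
    return ax
```

Seaborn's `sns.heatmap(corr, annot=True)` draws a comparable chart in a single call.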
These conclusions are derived from the data visualization phase.
- The vast majority of the wine has a `quality` of 6, while only a few samples have a `quality` of 9.
- Using a correlation plot, we can easily see whether certain attributes are correlated more strongly with wine `quality` than others.
  - Strongly correlated attributes: `alcohol` and `quality`; this is clearly the strongest relation affecting wine `quality`.
  - Weakly correlated attributes (do not depend on each other): `density` and `alcohol`. `free.sulphur.dioxide` and `citric.acid` have almost no correlation with `quality`.
  - `density` has a strong positive correlation with `residual.sugar` and a strong negative correlation with `alcohol`.
- There is a noticeable difference between the `white` and `red` wine counts. `white` wine forms the vast majority of our dataset, accounting for more than 75% of the samples.
- Most of the `white` wine has a `quality` of 6, while most of the `red` wine has a `quality` of 5.
- The mean `quality` of `red` and `white` wine is very close. `white` wine has the higher mean `quality`.
- The highest `alcohol` content in the dataset is 14.9.
- Most of the wine has an `alcohol` content around 10.4.
- Most of the wine with a `quality` of 6 has relatively low `alcohol` content, but it is still above the mean.
- High `alcohol` content only appears in our dataset with high `quality` wine.
- The highest `sugar` content is tied to a `quality` of 5, while lower `sugar` content tends to come with higher `quality`.
- Most of the wine in our dataset has a high `acidity level`.
- All four acidity levels have a close mean `quality`, but the `Low acidity` level has the highest `quality` in our dataset.
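The group comparisons behind these conclusions (mean quality by wine color and by acidity level) can be reproduced with a short sketch like the one below. Several details are assumptions made for illustration: the column names `pH`, `quality`, and `color`, the use of pH quartiles as bin edges, and every level name other than `Low`. Lower pH means higher acidity, so the lowest pH quartile is labeled `High`.

```python
import pandas as pd

# Hypothetical level names, ordered from the lowest pH quartile (most acidic)
# to the highest pH quartile (least acidic).
ACIDITY_LABELS = ["High", "Moderately High", "Medium", "Low"]


def add_acidity_level(df: pd.DataFrame, ph_col: str = "pH") -> pd.DataFrame:
    """Return a copy of df with an acidity_level column built from pH quartiles."""
    levels = pd.qcut(df[ph_col], q=4, labels=ACIDITY_LABELS)
    return df.assign(acidity_level=levels)


def mean_quality_by(df: pd.DataFrame, group_col: str, target: str = "quality") -> pd.Series:
    """Mean of the target column for each value of group_col."""
    return df.groupby(group_col, observed=True)[target].mean()
```

For example, `mean_quality_by(add_acidity_level(wine), "acidity_level")` gives one mean per acidity level, and `mean_quality_by(wine, "color")` compares red and white wine, where `wine` stands for the combined dataframe.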
- JupyterLab
- Python3
- Pandas
- NumPy
- Matplotlib
- Seaborn