Explore Red Wine Data by Huda Rezq

Univariate Analysis

Getting the data structure and then the distribution of each attribute.

## [1] 1599   12
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The red wine dataset contains 1,599 observations of 12 variables.
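The structure and summary output above can be reproduced with base R's dim(), str(), and summary(). This is a minimal sketch: `wine_sample` is a hypothetical name, and it holds only the first ten rows of four columns copied from the str() output, so its figures differ from the full 1,599-row results.

```r
# Hypothetical sample: first ten rows of four columns, copied from the str() output.
wine_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

sample_dim <- dim(wine_sample)   # number of rows, then number of columns
print(sample_dim)
str(wine_sample)                 # type and leading values of each column
print(summary(wine_sample))      # quartiles, mean, min, and max per column
```

On the full data frame, the same three calls would produce the dimensions, structure listing, and summary table shown above.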

Univariate Plots Section

Quality

Overall, the quality scores appear roughly normally distributed, with a mean of about 5.6 and a median of 6 on a scale from 0 (very bad) to 10 (very excellent). This indicates a collection of mostly average to fairly good wines. I chose a bar chart (geom_bar) to represent wine quality because quality is a discrete value.
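A bar chart of a discrete score could be drawn roughly as follows. This is a sketch, assuming ggplot2 is installed; `quality_sample` is a hypothetical data frame that contains only the first ten quality scores from the str() output, not the full dataset.

```r
library(ggplot2)

# Hypothetical sample: the first ten quality scores shown in the str() output.
quality_sample <- data.frame(quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L))

# Count how many wines received each score.
quality_counts <- table(quality_sample$quality)
print(quality_counts)

# Treat quality as a factor so that each score gets its own bar.
quality_plot <- ggplot(quality_sample, aes(x = factor(quality))) +
  geom_bar() +
  labs(x = "Quality score", y = "Number of wines")
print(quality_plot)
```

Converting quality to a factor keeps the x-axis limited to the scores that actually occur, which suits a discrete rating better than a continuous histogram.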

Fixed Acidity

Most acids involved with wine are fixed or nonvolatile (do not evaporate readily).

Fixed acidity values range between 4 and 16, with most values between 7 and 9. The distribution is slightly positively skewed. Transforming the x-axis to a log scale makes it closer to a normal distribution.

Volatile Acidity

Fixed acidity and volatile acidity both appear to be long tailed, and taking their logarithm brings them closer to a normal distribution. Since pH is a logarithmic measure and is approximately normal in our data set, it makes sense for the log of the acidity levels to be approximately normal as well. The log transform produces a meaningful decrease in variance for fixed acidity, but less so for volatile acidity.
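One way to check whether a log transform helps is to compare a skewness measure before and after. The sketch below uses a moment-based skewness helper, which is an assumption of this example and not necessarily the check used in the analysis, and it runs on only the first ten fixed acidity values from the str() output.

```r
# Moment-based sample skewness; a hypothetical helper, not part of base R.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten fixed acidity values from the str() output (not the full dataset).
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

skew_raw <- skewness(fixed_acidity)
skew_log <- skewness(log10(fixed_acidity))

cat("skewness, raw:  ", skew_raw, "\n")
cat("skewness, log10:", skew_log, "\n")
```

A positive value indicates a right tail; a smaller value after the transform means the log scale has pulled the tail in.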

Citric Acid

## [1] 132

Most red wines have a citric acid content, which adds ‘freshness’ and flavor to wines, between 0.1 and 0.5 g/dm^3. The mean is about 0.27 g/dm^3 and the median is about 0.26 g/dm^3, which is reasonable as citric acid is usually found in small quantities.

Residual Sugar

Most residual sugar values range between 1.5 and 2.5. There are a few outliers with large values. When we zoom in on values below 5, the distribution appears normal.
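Zooming in on a histogram can be sketched with ggplot2's coord_cartesian(), which changes the visible window without discarding rows. The data frame `sugar_sample` is hypothetical and holds only the first ten residual sugar values from the str() output; the bin width of 0.1 is likewise an assumption.

```r
library(ggplot2)

# Hypothetical sample: first ten residual sugar values from the str() output.
sugar_sample <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
)

# coord_cartesian() zooms the view without dropping rows,
# so the bins are still computed from every observation.
sugar_plot <- ggplot(sugar_sample, aes(x = residual.sugar)) +
  geom_histogram(binwidth = 0.1) +
  coord_cartesian(xlim = c(0, 5))
print(sugar_plot)

# Number of values that fall outside the zoomed window.
n_hidden <- sum(sugar_sample$residual.sugar >= 5)
cat("values at or above 5:", n_hidden, "\n")
```

Setting limits through scale_x_continuous() instead would remove the out-of-range rows before binning, which is a different operation from zooming.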

Chlorides

Most chloride values range between 0.05 and 0.1. The histogram is positively skewed, with a few outliers at large values. When we zoom in on values below 0.2, the distribution appears normal.

Free Sulfur Dioxide

The distribution of free sulfur dioxide is highly positively skewed.

Total Sulfur Dioxide

The distribution of total sulfur dioxide is highly positively skewed, and there are a few outliers with very large values. Transforming the x-axis to a log scale makes it closer to a normal distribution.

Density

Density values range between 0.990 and 1.004, with most values between 0.995 and 0.998. The distribution is symmetrical and centered around 0.997.

pH

Most pH values range between 3.15 and 3.45. The distribution is symmetrical and centered around 3.3.

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Most sulphates values range between 0.5 and 0.75. The distribution is positively skewed. There are a few outliers with large sulphates values. Transforming the x-axis to a log scale makes it closer to a normal distribution.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Alcohol values range between 8.4% and 14.9%. The mean (10.42%) and median (10.20%) are both just above 10%. The distribution is positively skewed.

What is the structure of your dataset?

There are 11 input attributes in the dataset plus one output attribute (quality), rated between 0 (very bad) and 10 (very excellent) by at least 3 wine experts. Each row corresponds to one particular wine, for a total of 1,599 different red wines in the data set.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the output attribute, quality. I am trying to figure out which of the 11 input attributes contribute to a high quality value.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Alcohol, volatile acidity, sulphates, and maybe density.

Did you create any new variables from existing variables in the dataset?

No.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The possible quality values are from 0 to 10, but our data set only has quality values from 3 to 8, which means there are no extremely bad or extremely good red wines in our data set. The vast majority of red wines in the data set have a quality value of either 5 or 6, with very few wines rated 3, 4, 7 or 8, which makes the data set unbalanced.

Bivariate Plots Section

I chose to show mainly the chemical features that may have a meaningful correlation with wine quality. According to the correlation matrix, quality correlates positively with alcohol, with a correlation coefficient of about 0.48. On the other hand, it correlates negatively with volatile acidity, with a coefficient of -0.39. Citric acid and volatile acidity also correlate negatively with each other (about -0.55).

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
##                         sulphates     alcohol
## fixed.acidity         0.183005664 -0.06166827
## volatile.acidity     -0.260986685 -0.20228803
## citric.acid           0.312770044  0.10990325
## residual.sugar        0.005527121  0.04207544
## chlorides             0.371260481 -0.22114054
## free.sulfur.dioxide   0.051657572 -0.06940835
## total.sulfur.dioxide  0.042946836 -0.20565394
## density               0.148506412 -0.49617977
## pH                   -0.196647602  0.20563251
## sulphates             1.000000000  0.09359475
## alcohol               0.093594750  1.00000000

Quality correlates most strongly with alcohol and volatile acidity (absolute correlation coefficient > 0.3), but there also seem to be interesting correlations among the supporting variables. Free sulfur dioxide correlates highly with total sulfur dioxide, fixed acidity with both pH and density, density with both alcohol and residual sugar, and sulphates with chlorides. Let me generate a correlation matrix to get better insight.
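A correlation matrix like the one above, along with the pairs that pass the 0.3 threshold, can be computed with base R's cor(). This is a sketch on a hypothetical `wine_sample` built from the first ten rows of six columns in the str() output, so the coefficients it prints will not match the full-data values quoted in the text.

```r
# Hypothetical sample: first ten rows of six columns, copied from the str() output.
wine_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation between every pair of columns (the default method).
cor_matrix <- cor(wine_sample)
print(round(cor_matrix, 2))

# List each pair once (upper triangle) whose absolute correlation exceeds 0.3.
strong <- which(abs(cor_matrix) > 0.3 & upper.tri(cor_matrix), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_matrix)[strong[, "row"]],
  var2 = colnames(cor_matrix)[strong[, "col"]],
  r    = cor_matrix[strong]
)
print(strong_pairs)
```

Filtering on the absolute value keeps strong negative relationships, such as the one between citric acid and volatile acidity, alongside the positive ones.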

Quality vs Fixed Acidity

There isn’t a clear trend between fixed acidity and quality.

Quality vs Volatile Acidity

The higher the quality, the lower the volatile acidity.

Quality vs Citric Acid

The higher the quality, the higher the citric acid.

Quality vs Residual Sugar

There isn’t a clear trend between residual sugar and quality.

Quality vs Chlorides

After zooming in, one can see that the higher the quality, the lower the chlorides.

Quality vs Free Sulfur Dioxide

There isn’t a clear trend between free sulfur dioxide and quality.

Quality vs Total Sulfur Dioxide

There isn’t a clear trend between total sulfur dioxide and quality.

Quality vs Density

The higher the quality, the lower the density.

Quality vs pH

The higher the quality, the lower the pH.

Quality vs Sulphates

The higher the quality, the higher the sulphates.

Quality vs Alcohol

The higher the quality, the higher the alcohol.
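Trends like these can be checked numerically by summarizing an attribute within each quality level. The sketch below uses base R's tapply() on the first ten alcohol and quality values from the str() output; with so few rows it only illustrates the mechanics and says nothing about the full dataset.

```r
# First ten alcohol and quality values from the str() output (not the full dataset).
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Median alcohol content within each quality score.
median_alcohol <- tapply(alcohol, quality, median)
print(median_alcohol)
```

The result is a named vector with one entry per quality score, which can be compared across scores or passed to a plotting function.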

As the citric acid level increases, the sulphates level tends to increase as well.

There’s an interesting negative correlation between citric acid and volatile acidity that is clearly shown using the geom_smooth function.

As density increases, the amount of residual sugar increases as well. geom_smooth helped in showing the positive correlation.
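A scatter plot with a fitted trend line can be sketched with geom_smooth(). The smoothing method behind the plots described here is not stated, so method = "lm" is an assumption of this example; `acid_sample` is a hypothetical data frame holding the first ten rows from the str() output.

```r
library(ggplot2)

# Hypothetical sample: first ten rows of two columns, copied from the str() output.
acid_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
)

# Scatter plot with a straight-line fit (assumed method) and no confidence band.
acid_plot <- ggplot(acid_sample, aes(x = citric.acid, y = volatile.acidity)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
print(acid_plot)

# Slope of the same linear fit, computed directly with lm().
acid_fit   <- lm(volatile.acidity ~ citric.acid, data = acid_sample)
acid_slope <- coef(acid_fit)[["citric.acid"]]
cat("slope:", acid_slope, "\n")
```

A negative slope corresponds to the downward trend line in the plot; swapping the two columns for density and residual sugar would give the positive case.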

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

A few attributes exhibit trends that look promising for predicting quality.

  • Quality is positively correlated with citric acid, sulphates, and alcohol.
  • Quality is negatively correlated with volatile acidity, chlorides, density, and pH.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • Fixed acidity and citric acid are positively correlated because fixed acidity includes citric acid.
  • Total sulfur dioxide and free sulfur dioxide are positively correlated because total sulfur dioxide includes free sulfur dioxide.
  • Fixed acidity and pH are negatively correlated because a higher concentration of fixed acids makes the wine more acidic, so the wine has a lower pH.
  • Citric acid and pH are negatively correlated because a higher concentration of citric acid, which is non-volatile, makes the wine more acidic, so the wine has a lower pH.
  • Density and alcohol are negatively correlated because alcohol has a lower density than water, so wines that contain more alcohol have a lower density.
  • Density and fixed acidity are positively correlated because the main fixed acid in wine, tartaric acid, has a higher density than water, so wines that contain more tartaric acid have a higher density.

What was the strongest relationship you found?

The quality of the wine is positively correlated with alcohol, and this is the strongest correlation involving quality (about 0.48). Alcohol, in turn, correlates positively but only weakly with the pH of the wine (about 0.21). On the other hand, the citric acid level correlates strongly and negatively with the volatile acidity level (about -0.55), which in turn correlates with wine quality as well.

Multivariate Plots Section

Plot I

It looks like the higher quality red wines tend to be concentrated in the top left of the plot. This tends to be where the wines with higher alcohol content (larger dots) are concentrated as well.

Plot II

Let’s try summarizing quality using a contour plot of alcohol and sulphate content. This shows that higher quality red wines are generally located near the upper right of the scatter plot (darker contour lines), whereas lower quality red wines are generally located in the bottom right.

Let’s make a similar plot, but this time quality will be visualized using color and density plots along the x and y axes:

Again, this clearly illustrates that higher quality wines are found near the top right of the plot.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

By combining the most promising attribute from the bivariate section, volatile acidity, with one of the other attributes (citric acid, sulphates, alcohol, chlorides, density and pH), one can further separate high quality wines from low quality wines.

Were there any interesting or surprising interactions between features?

By looking at density vs fixed acidity and alcohol, one can see that fixed acidity has a larger impact on the density of the wine than alcohol.

Final Plots

Plot I

The possible quality values range from 0 to 10; however, all red wines in the dataset have quality values between 3 and 8. There are no really bad wines with quality below 3 and no really good wines with quality above 8. Also, most of the red wines have a quality of 5 or 6, which makes the dataset poorly balanced.

The strongest correlation coefficient was found between alcohol and quality. Now let’s look at the alcohol content by red wine quality using a density plot function:

Plot II

We can clearly see that the density plots for higher quality red wines (shown in red) are shifted to the right, meaning they have a comparatively high alcohol content compared to the lower quality red wines. However, the main anomaly in this trend appears to be red wines with a quality ranking of 5.

Plot III

Plot IV

We see a clear trend where higher quality red wines (red dots) are concentrated in the upper right of the figure, and there also tend to be larger dots concentrated in this area.

Reflection

The above analysis considered the relationship of a number of red wine attributes with the quality rankings of different wines. Melting the data frame and using facet grids was really helpful for visualizing the distribution of each of the parameters with boxplots and histograms. Most of the parameters were found to be normally distributed, while citric acid, free sulfur dioxide, total sulfur dioxide, and alcohol had more of a lognormal distribution.

Using the insights from the correlation coefficients provided by the paired plots, it was interesting to explore quality using density plots with a different color for each quality. Once I had this plotted, it was interesting to build up the multivariate scatter plots to visualize the relationship of different variables with quality by also varying the point size and adding density plots along the x and y axes.

A next step would be to develop a statistical model to predict red wine quality based on the data in this dataset.