Getting the data structure and then each attribute's distribution.
## [1] 1599 12
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The red wine dataset contains 1,599 observations of 12 variables.
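The output above is what `dim()`, `str()` and `summary()` print for a data frame. As a minimal sketch, the same three calls can be run on a small stand-in data frame; `wine_head` is a hypothetical name, and it holds only the first ten rows of four columns as printed by `str()` above, so its numbers will not match the full 1,599-row summary.

```r
# Stand-in for the full data set: the first ten rows of four columns,
# copied from the str() output above. `wine_head` is a hypothetical name.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine_head)      # number of rows and columns
str(wine_head)      # type and first values of each column
summary(wine_head)  # min, quartiles, mean and max of each column
```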
Quality scores appear roughly normally distributed, with a mean of about 5.6 and a median of 6 on a scale from 0 (very bad) to 10 (very excellent), which suggests a collection of mid-range to fairly good wines. I chose a bar chart (geom_bar) to represent the wine quality data because quality is a discrete value.
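Because quality is discrete, counting the wines at each score is the natural summary behind a bar chart. The sketch below assumes only the first ten quality values shown in the `str()` output, so the counts are illustrative and not those of the full data set; the plotting call is guarded so that it only runs in an interactive session.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Count wines per score; fixing the levels to 0:10 keeps the
# unused scores visible as zero counts.
quality_counts <- table(factor(quality, levels = 0:10))
print(quality_counts)

# Base-R bar chart of the counts (only drawn in an interactive session).
if (interactive()) {
  barplot(quality_counts, xlab = "quality", ylab = "count")
}
```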
Most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
Fixed acidity values range between 4 and 16, with most values between 7 and 9. The distribution is slightly positively skewed. Transforming the x-axis to a log scale makes it closer to a normal distribution.
Fixed acidity and volatile acidity both appear to be long-tailed, and taking their logarithm appears to bring them closer to a normal distribution. Since pH is a logarithmic measure and is approximately normal in our data set, it makes sense for the log of the acidity levels to be approximately normal as well. The log transform reduces the variance noticeably for fixed acidity, but much less so for volatile acidity.
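One way to check the effect of a log transform numerically is to compare a skewness measure before and after. The sketch below is an illustration and not the check used in this report: it uses a simple moment-based skewness (the helper `skewness` is defined here, not taken from a package) and only the first ten fixed acidity values from the `str()` output.

```r
# First ten fixed acidity values, copied from the str() output above.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

raw_skew <- skewness(fixed_acidity)         # skewness on the original scale
log_skew <- skewness(log10(fixed_acidity))  # skewness after a log10 transform

c(raw = raw_skew, log10 = log_skew)
```

A positive value indicates a longer right tail; a smaller value after the transform means the log scale has pulled that tail in.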
## [1] 132
Most red wines have a citric acid content between 0.1 and 0.5 g/dm^3; citric acid adds ‘freshness’ and flavor to wines. The mean is about 0.27 g/dm^3 and the median is about 0.26 g/dm^3, which is reasonable because citric acid is usually found in small quantities.
Most residual sugar values range between 1.5 and 2.5. There are a few outliers with large values. When zooming in on values below 5, the distribution appears normal.
Most chloride values range between 0.05 and 0.1. The histogram is positively skewed. There are a few outliers with large values. When zooming in on values below 0.2, the distribution appears normal.
The distribution of free sulfur dioxide is highly positively skewed.
The distribution of total sulfur dioxide is highly positively skewed, and there are a few outliers with very large values. Transforming the x-axis to a log scale makes it closer to a normal distribution.
Density values range between 0.990 and 1.004, with most values between 0.995 and 0.998. The distribution of density values is symmetrical and centered around 0.9965.
Most pH values range between 3.15 and 3.45. The distribution of pH is symmetrical and centered around 3.3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Most sulphate values range between 0.5 and 0.75. The distribution is positively skewed. There are a few outliers with large sulphate values. Transforming the x-axis to a log scale makes it closer to a normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Alcohol values range from 8.4% to 14.9%. The mean (about 10.4%) and the median (about 10.2%) are both close to 10%. The distribution of alcohol values is positively skewed.
There are 11 input attributes in the dataset plus one output attribute (the quality rating), scored between 0 (very bad) and 10 (very excellent) by at least 3 wine experts. Each row corresponds to one particular wine, for a total of 1,599 different red wines in the data set.

### What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the output attribute, quality. I am trying to figure out which of the 11 input attributes contribute to a high quality value.

### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Alcohol, volatile acidity, sulphates, and maybe density.

### Did you create any new variables from existing variables in the dataset?

No.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The possible quality values range from 0 to 10, but our data set only has quality values from 3 to 8, which means there are no extremely bad or extremely good red wines in our data set. The vast majority of red wines in the data set have a quality value of either 5 or 6, with far fewer wines at quality values 3, 4, 7 or 8, which makes the data set unbalanced.
I chose to show mainly the chemical features that might have a meaningful correlation with wine quality. From the above correlation matrix, quality correlates positively with alcohol, with a correlation coefficient of about 0.48. On the other hand, it correlates negatively with volatile acidity, with a coefficient of about -0.39. Citric acid and volatile acidity also tend to correlate negatively with each other.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## sulphates alcohol
## fixed.acidity 0.183005664 -0.06166827
## volatile.acidity -0.260986685 -0.20228803
## citric.acid 0.312770044 0.10990325
## residual.sugar 0.005527121 0.04207544
## chlorides 0.371260481 -0.22114054
## free.sulfur.dioxide 0.051657572 -0.06940835
## total.sulfur.dioxide 0.042946836 -0.20565394
## density 0.148506412 -0.49617977
## pH -0.196647602 0.20563251
## sulphates 1.000000000 0.09359475
## alcohol 0.093594750 1.00000000
Quality correlates most strongly with alcohol and volatile acidity (absolute correlation coefficient > 0.3), but there also seem to be interesting correlations among some of the supporting variables: free sulfur dioxide with total sulfur dioxide, fixed acidity with both pH and density, density with both alcohol and residual sugar, and sulphates with chlorides. Let me generate a correlation matrix to get a better insight.
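A correlation matrix of this kind can be produced with base R's `cor()`. The sketch below assumes a hypothetical stand-in data frame, `wine_head`, built from the first ten rows shown in the `str()` output, so its coefficients will differ from the full-data values reported in this section.

```r
# First ten rows of four columns, copied from the str() output above.
# `wine_head` is a hypothetical stand-in for the full data set.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(wine_head, method = "pearson")
round(cor_matrix, 2)
```

Including `quality` as a column places its correlations with the chemical attributes in the same matrix.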
There isn’t a clear trend between fixed acidity and quality.
The higher the quality, the lower the volatile acidity.
The higher the quality, the higher the citric acid.
There isn’t a clear trend between residual sugar and quality.
After zooming in, one can see that the higher the quality, the lower the chlorides.
There isn’t a clear trend between free sulfur dioxide and quality.
There isn’t a clear trend between total sulfur dioxide and quality.
The higher the quality, the lower the density.
The higher the quality, the lower the pH.
The higher the quality, the higher the sulphates.
The higher the quality, the higher the alcohol.
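Trends like these can be checked numerically by summarizing an attribute within each quality level. The following is a minimal sketch using the median alcohol content per quality score, computed on only the first ten rows from the `str()` output; it illustrates the technique and is not a reproduction of the plots in this section.

```r
# First ten alcohol and quality values, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Median alcohol content within each quality score.
median_alcohol <- tapply(alcohol, quality, median)
print(median_alcohol)
```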
As the citric acid level increases, the sulphate level tends to increase as well.
There’s an interesting negative correlation between citric acid and volatile acidity, which geom_smooth shows clearly.
As density increases, the amount of residual sugar increases as well; geom_smooth helps show the positive correlation.
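A straight-line smoother on a scatter plot corresponds to an ordinary least-squares fit, so the sign of the fitted slope gives the direction of the trend. The sketch below assumes a hypothetical stand-in data frame `acids` with the first ten rows from the `str()` output, and assumes a linear smoother (`method = "lm"`); the smoothing method used for the original plots is not stated. The ggplot2 part only runs if that package is installed.

```r
# First ten citric acid and volatile acidity values from the str() output.
# `acids` is a hypothetical stand-in for the full data set.
acids <- data.frame(
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Least-squares line of volatile acidity on citric acid.
fit <- lm(volatile.acidity ~ citric.acid, data = acids)
slope <- unname(coef(fit)["citric.acid"])
slope  # a negative slope means volatile acidity falls as citric acid rises

# Equivalent plot with a linear smoother, if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(acids, ggplot2::aes(x = citric.acid, y = volatile.acidity)) +
    ggplot2::geom_point() +
    ggplot2::geom_smooth(method = "lm", formula = y ~ x)
  if (interactive()) print(p)
}
```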
A few attributes exhibit trends that look promising for predicting quality.

* Fixed acidity and citric acid are positively correlated because the fixed acidity includes citric acid.
* Total sulfur dioxide and free sulfur dioxide are positively correlated because total sulfur dioxide includes free sulfur dioxide.
* Fixed acidity and pH are negatively correlated because a higher concentration of fixed acids makes the wine more acidic, and therefore the wine has a lower pH.
* Citric acid and pH are negatively correlated because a higher concentration of citric acid, which is non-volatile, makes the wine more acidic, and therefore the wine has a lower pH.
* Density and alcohol are negatively correlated because alcohol has a lower density than water, so wines that contain more alcohol have a lower density.
* Density and fixed acidity are positively correlated because the main fixed acid in wine, tartaric acid, has a higher density than water, so wines that contain more tartaric acid have a higher density.

The quality of the wine is positively correlated with alcohol, which is its strongest correlate. Alcohol in turn correlates only weakly with the pH of the wine (about 0.21) and more strongly, and negatively, with density (about -0.50). On the other hand, citric acid correlates strongly and negatively with volatile acidity, which in turn correlates with wine quality as well.
It looks like the higher quality red wines tend to be concentrated in the top left of the plot. This tends to be where the wines with higher alcohol content (larger dots) are concentrated as well.
Let’s try summarizing quality using a contour plot of alcohol and sulphate content. This shows that higher quality red wines are generally located near the upper right of the scatter plot (darker contour lines), whereas lower quality red wines are generally located in the bottom right.
Let’s make a similar plot, but this time quality will be visualized using color and density plots along the x and y axes:
Again, this clearly illustrates that higher quality wines are found near the top right of the plot.
By combining the most promising attribute from the bivariate section, volatile acidity, with one of the other attributes (citric acid, sulphates, alcohol, chlorides, density and pH), one can further separate high quality wines from low quality wines.
By looking at density vs fixed acidity and alcohol, one can see that fixed acidity has a larger impact on the density of the wine than alcohol.
The possible quality values range from 0 to 10; however, all red wines in the dataset have quality values between 3 and 8. There are no really bad wines with quality below 3 and no really good wines with quality above 8. Also, most of the red wines have quality 5 or 6, which makes the dataset not well balanced.
The strongest correlation coefficient was found between alcohol and quality. Now let’s look at the alcohol content by red wine quality using a density plot function:
Clearly we see that the density plots for higher quality red wines (as indicated by the red plots) are shifted to the right, meaning they have a higher alcohol content than the lower quality red wines. However, the main anomaly in this trend appears to be red wines with a quality ranking of 5.
We see a clear trend where higher quality red wines (red dots) are concentrated in the upper right of the figure, and the larger dots also tend to be concentrated in this area.
The above analysis considered the relationship of a number of red wine attributes with the quality rankings of different wines. Melting the dataframe and using facet grids was really helpful for visualizing the distribution of each of the parameters with boxplots and histograms. Most of the parameters were found to be normally distributed, while citric acid, free sulfur dioxide, total sulfur dioxide and alcohol had more of a lognormal distribution.
Using the insights from the correlation coefficients provided by the paired plots, it was interesting to explore quality using density plots with a different color for each quality. Once I had this plotted, it was interesting to build up the multivariate scatter plots to visualize the relationship of different variables with quality by also varying the point size and adding density plots on the x and y axes.
A next step would be to develop a statistical model to predict red wine quality based on the data in this dataset.
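As a hedged illustration of what such a model could look like, the sketch below fits an ordinary linear regression of quality on alcohol and volatile acidity, the two attributes with the strongest correlations noted above. The choice of a linear model and of these two predictors is an assumption, not a result of this analysis, and the hypothetical stand-in data frame `wine_head` holds only the first ten rows from the `str()` output, far too few for meaningful estimates.

```r
# First ten rows of three columns, copied from the str() output above.
# `wine_head` is a hypothetical stand-in for the full data set.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Linear model: quality as a function of alcohol and volatile acidity.
quality_fit <- lm(quality ~ alcohol + volatile.acidity, data = wine_head)

coef(quality_fit)               # intercept and one slope per predictor
summary(quality_fit)$r.squared  # share of variance explained on this sample
```

Since quality is an ordered score, an ordinal regression might be a more appropriate choice on the full data; the linear model is used here only because it is available in base R.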