Project 4: What Similarities does Wine have?
- Will Chiappetta
- Nov 16, 2022
Updated: Nov 17, 2022
Introduction to the Problem
Many people around the world enjoy drinking wine. Wouldn't it be nice to learn more about how different features regarding wine correlate with each other? I am going to be looking into clustering different features of wine and learning how they relate to each other. Specifically, I will be looking at clustering the features of alcohol level and malic acid level in the wine.
What is clustering and how does it work?
Clustering is a technique used in machine learning to group data points. Given a set of data points, a clustering algorithm assigns each point to a group so that points in the same group have similar properties and features. We can use clustering to gain insight into the data by seeing which groups the data points fall into. Hierarchical clustering works in one of two ways: a top-down (divisive) approach or a bottom-up (agglomerative) approach. The bottom-up approach starts by treating each data point as its own cluster. It then repeatedly merges pairs of clusters until everything has been merged into a single cluster that holds all the data points. The result is represented as a tree in which the root is the single cluster that gathers all the data and the leaves are the starting clusters, each containing only one sample. At each step the algorithm measures the distance between every pair of clusters and combines the two that are closest, continuing until it reaches the root of the tree and all the data is in one cluster.
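The bottom-up procedure described above can be sketched in a few lines of plain Python. This is only an illustration, not the code used in this project: the function names are made up for the example, and it assumes single linkage (the distance between two clusters is the distance between their two closest points), which is just one of several possible linkage rules.

```python
from itertools import combinations
import math


def single_linkage_distance(a, b):
    """Smallest distance between any point in cluster a and any point in cluster b."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, n_clusters=1):
    """Merge the two closest clusters until only n_clusters remain."""
    # Every point starts as its own cluster (the leaves of the tree).
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters that are closest together.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage_distance(
                clusters[pair[0]], clusters[pair[1]]
            ),
        )
        merges.append((clusters[i], clusters[j]))
        # i < j, so removing cluster j leaves index i valid.
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
three_clusters, _ = agglomerate(points, n_clusters=3)
root, merge_history = agglomerate(points, n_clusters=1)
print(three_clusters)
print(len(merge_history), "merges to reach the root")
```

Stopping at `n_clusters=1` corresponds to building the whole tree. Stopping earlier corresponds to cutting the tree at a chosen number of clusters.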
Introduce the Data
The data that I chose to use for my project came from the website Kaggle. This website has many different datasets that are available for the public to use. This data came from a chemical analysis of wines grown in a region in Italy. The wines came from three different cultivars. The different variables that we will use are Alcohol, Malic_Acid, Ash, Ash_Alcanity, Magnesium, Total_Phenols, Flavonoids, Nonflavanoid_Phenols, Proanthocyanins, Color_Intensity, Hue, OD280, and Proline.
Pre-processing/Data Understanding and Visualization
The first thing I need to do is clean the data. I need to first check if there are any nulls in the data set.
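A minimal sketch of the null check is below. It is an assumption-laden stand-in rather than my actual notebook code: instead of the Kaggle CSV it loads scikit-learn's bundled copy of the wine data with `load_wine`, whose column names are lowercase (for example `alcohol` and `malic_acid`) and differ slightly from the Kaggle file.

```python
from sklearn.datasets import load_wine

# Stand-in for the Kaggle file: scikit-learn's bundled wine data.
df = load_wine(as_frame=True).data

# Count the missing values in every column.
null_counts = df.isnull().sum()
print(null_counts)
print("Total missing values:", int(null_counts.sum()))
```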

There are no nulls so we can continue to use this data. Next we need to standardize the data so it can be used in our clustering model.
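One common way to standardize is scikit-learn's `StandardScaler`, which rescales every column to a mean of 0 and a standard deviation of 1. The sketch below assumes that approach and again uses the bundled `load_wine` data as a stand-in for the Kaggle CSV.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Stand-in for the Kaggle file: scikit-learn's bundled wine data.
df = load_wine(as_frame=True).data

# Rescale each column to mean 0 and standard deviation 1.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled.describe().loc[["mean", "std"]].round(2))
```

Standardizing matters here because features such as proline are measured in the hundreds while hue is close to 1, and distance-based clustering would otherwise be dominated by the large-valued columns.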

Now that the data is standardized we can start to create visualizations with our data.
First, I want to look at all of the relationships between the data.

I want to look at clustering alcohol and malic acid because I think that the relationship between them could be interesting. My guess was that they might be related because both contribute to how a wine tastes, although ethanol itself is close to neutral rather than acidic like malic acid. First, I want to see how they look on a scatter plot.

There is no clear correlation between them but the clustering might help see the relationship. We can see specifically their correlation with a heat map.

It looks like their correlation coefficient is around 0.09, which is very weak. It will be good to see whether clustering reveals more about their relationship.
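The number behind a heat map cell is a Pearson correlation coefficient, which pandas can compute directly. This sketch assumes the bundled `load_wine` data as a stand-in for the Kaggle CSV, so the column names are `alcohol` and `malic_acid` rather than `Alcohol` and `Malic_Acid`.

```python
from sklearn.datasets import load_wine

# Stand-in for the Kaggle file: scikit-learn's bundled wine data.
df = load_wine(as_frame=True).data

# Pearson correlation between the two features of interest.
r = df["alcohol"].corr(df["malic_acid"])
print(f"alcohol vs malic_acid: r = {r:.3f}")

# The full matrix is what a heat map would display.
corr_matrix = df.corr()
print(corr_matrix.loc[["alcohol", "malic_acid"], ["alcohol", "malic_acid"]].round(3))
```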
Next, we need to make a PCA model to use for our clustering. We can store the variables we want to use as X and then put them into the model to prepare for clustering.
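A sketch of this step is below. It assumes that X holds the two standardized features and that PCA keeps two components. Both are guesses for illustration, and the data comes from the bundled `load_wine` stand-in rather than the Kaggle CSV.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the Kaggle file: scikit-learn's bundled wine data.
df = load_wine(as_frame=True).data

# X holds the standardized features we want to cluster on.
X = StandardScaler().fit_transform(df[["alcohol", "malic_acid"]])

# Project X onto its principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(X)

print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```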

Next, we can see what our PCA graph looks like.

There are no obvious clusters from our PCA. Since nothing stands out, we need to look at how it compares to other features in the data set.

It looks like the bottom, the left, and the right could all be different clusters.

It looks the same for this graph, too.
Now it is time to start modeling.
Modeling
First, we will start with k-means clustering. We need to pick our k value. This is the number of clusters, each of which is represented by one centroid.
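A common way to choose k is the elbow method: fit k-means for a range of k values and record the inertia (the within-cluster sum of squared distances) for each. The sketch below assumes that method, a range of 1 to 10, and the bundled `load_wine` stand-in data. The `random_state` and `n_init` values are arbitrary choices for reproducibility.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Stand-in for the Kaggle file: scikit-learn's bundled wine data.
df = load_wine(as_frame=True).data
X = StandardScaler().fit_transform(df[["alcohol", "malic_acid"]])

# Fit k-means for k = 1..10 and record the inertia each time.
k_values = list(range(1, 11))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `k_values` gives the elbow graph. The elbow is the point after which adding more clusters stops reducing the inertia by much.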


From this we can see that the optimal value for k is three, because that is where the elbow of the graph is. Next, we need to fit our model with three centroids.
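Fitting the model with k = 3 could look like the sketch below. It assumes scikit-learn's `KMeans` and the bundled `load_wine` stand-in data. The `random_state` is an arbitrary value chosen so the result is repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Stand-in for the Kaggle file: scikit-learn's bundled wine data.
df = load_wine(as_frame=True).data
X = StandardScaler().fit_transform(df[["alcohol", "malic_acid"]])

# Fit k-means with three clusters and get one label per wine.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Centroids (standardized units):")
print(kmeans.cluster_centers_.round(2))
```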


We can now start to visualize our cluster.

We can see that the cluster did work and that the model separated it into three groups of data points. The optimal number of centroids is probably three because the wine came from three different cultivars. We can also take a look at how other features look.

You can see that the model did group the other features a little, but the data is mostly all over the place. This could be because there is just not that much correlation between the different contents that make up wine.
Now let's add the cluster labels back to the original data frame so we can compare the other features against them.

I think we should make an interactive scatterplot to get a better look at the comparison. We have to concatenate the data frame with the PCA values to visualize the model properly.
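The sketch below shows one way to build that combined data frame with `pd.concat`. The column names `cluster`, `PC1`, and `PC2` are hypothetical names chosen for this example, and the data is the bundled `load_wine` stand-in rather than the Kaggle CSV.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the Kaggle file: scikit-learn's bundled wine data.
df = load_wine(as_frame=True).data.copy()
X = StandardScaler().fit_transform(df[["alcohol", "malic_acid"]])

# Attach the k-means label of each wine to the original data frame.
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Put the PCA coordinates in their own frame, then join column-wise.
pca_df = pd.DataFrame(PCA(n_components=2).fit_transform(X), columns=["PC1", "PC2"])
combined = pd.concat([df.reset_index(drop=True), pca_df], axis=1)

print(combined.head())
```

With everything in one frame, a plotting library can place `PC1` and `PC2` on the axes, color by `cluster`, and show any other column on hover.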

I do not know how to get the graph to be interactive on this website, but you can see it in the code linked below. From what I could tell from the data, the only feature that seemed to correlate at all with the clusters was Flavonoids. Flavonoids are phenolic compounds rather than a direct measure of flavor, so I cannot say for certain why they line up with alcohol and malic acid. One possibility is that all three vary with the cultivar.
Next, I want to make an agglomerative cluster.
First we need to find out how many clusters is optimal for this model. We can see this by making a dendrogram to find out.
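A dendrogram can be built with SciPy's `linkage` and `dendrogram` functions. The sketch below assumes Ward linkage and the bundled `load_wine` stand-in data. It passes `no_plot=True` so it runs without a plotting backend. Dropping that argument draws the tree with matplotlib.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Stand-in for the Kaggle file: scikit-learn's bundled wine data.
df = load_wine(as_frame=True).data
X = StandardScaler().fit_transform(df[["alcohol", "malic_acid"]])

# Each row of Z records one merge: the two clusters joined,
# the distance between them, and the size of the new cluster.
Z = linkage(X, method="ward")

# Compute the tree layout without drawing it.
tree = dendrogram(Z, no_plot=True)

# Cut the tree so that at most three clusters remain.
tree_labels = fcluster(Z, t=3, criterion="maxclust")
print("Last three merge distances:", Z[-3:, 2].round(2))
```

The tallest vertical gaps in the dendrogram correspond to the largest jumps in merge distance, and cutting the tree across such a gap gives the suggested number of clusters.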

You can clearly tell from this that the optimal number of clusters is three. This matches the k-means result. Now we can implement the agglomerative clustering.
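A sketch of the agglomerative model with three clusters is below. It assumes scikit-learn's `AgglomerativeClustering` with Ward linkage (the library default) and the bundled `load_wine` stand-in data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Stand-in for the Kaggle file: scikit-learn's bundled wine data.
df = load_wine(as_frame=True).data
X = StandardScaler().fit_transform(df[["alcohol", "malic_acid"]])

# Merge bottom-up until three clusters remain.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
agg_labels = agg.fit_predict(X)

print("Wines per cluster:", np.bincount(agg_labels))
```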

We have now created the clusters and now we can visualize using PCA again.

We can compare this to the k-means graph we created earlier.

To me it looks like the agglomerative cluster is better because cluster one and two both look tighter than they did in the k-means cluster. However, both models did well clustering the data. We can also look at specific features again like last time.
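Judging tightness by eye is subjective, so one option is to put a number on it. The sketch below is an addition for illustration, not part of my original analysis: it assumes the silhouette score as the measure of how tight and well separated the clusters are, and the adjusted Rand index as the measure of how much the two labelings agree. It uses the bundled `load_wine` stand-in data.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-in for the Kaggle file: scikit-learn's bundled wine data.
df = load_wine(as_frame=True).data
X = StandardScaler().fit_transform(df[["alcohol", "malic_acid"]])

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
km_score = silhouette_score(X, km_labels)
agg_score = silhouette_score(X, agg_labels)

# Adjusted Rand index: 1 means the two labelings are identical.
agreement = adjusted_rand_score(km_labels, agg_labels)

print(f"k-means silhouette:       {km_score:.3f}")
print(f"agglomerative silhouette: {agg_score:.3f}")
print(f"agreement (ARI):          {agreement:.3f}")
```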

We can also compare this to the k-means graph we made.

These two clusters look exactly the same to me and I see no differences.
Finally, we can use the interactive graph again to see better what correlations there are in these clusters.

There is nothing new that I can tell from the interactive data. It still seems that the only feature that correlates with the clusters is flavonoids. The agglomerative cluster just made the clusters tighter because it clusters by combining with the closest data points.
Impact
There could be a few impacts from my analysis. One benefit of my research could be informing people how alcohol percentage is related to malic acid in wine. You could learn about patterns between these two features. A negative impact could be that more people get good at making wine by learning from this research. More people making wine could hurt companies that have been making wine for a long time and learned by trial and error. On the other hand, it could also help lower the price of wine.
Conclusion
My results have shown me that you can cluster the features alcohol level and malic acid to explore how they are related. What I found was that not many other features of wine are related to alcohol level and malic acid. This makes me think that the clustering model was just guessing when it put together each cluster. It does make sense to me that three clusters were optimal for the model because the data came from three different wine cultivars in Italy. I think that this data shows us that there is no single way people make wine and that wines differ from each other in many ways.
This project has taught me more about Python and Jupyter Notebook. I have also learned how to cluster a data set to find relationships between different features. I learned more about analyzing data and how to write about it. I am excited to continue to learn more about Data Mining in this class.
References/Code
My Code