
Project 2: Will Milk Taste Good or Bad?

  • Writer: Will Chiappetta
  • Sep 30, 2022
  • 5 min read

Updated: Oct 2, 2022


Introduction to the Problem

People all over the world drink milk, and you never know whether yours will taste good or bad. Wouldn't it be nice if there were a way to predict that in advance? In this project I build a decision tree model to see whether milk taste can be predicted from measurable qualities.


Introduce the Data

The data set I used came from Kaggle, a website that hosts many data sets for anyone to use. The data set was collected manually and consists of many different observations of milk. The qualities observed were pH, temperature, taste, odor, fat, turbidity, color, and the overall grade of the milk (low, medium, or high).


Pre-Processing the Data and Data Understanding/Visualization

Now I need to find out whether my data needs any cleaning. First, I want to check if there are any NAs (missing values) in the data set.
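The original notebook isn't embedded here, but a check like this might look as follows (toy rows; column names assumed from the post):

```python
import pandas as pd

# Toy stand-in for the Kaggle milk data (column names assumed from the post)
milk = pd.DataFrame({
    "pH": [6.6, 6.8, 8.6, 3.0],
    "Temperature": [35, 45, 55, 40],
    "Taste": [1, 0, 1, 0],
    "Grade": ["high", "low", "medium", "low"],
})

# Count missing values per column; all zeros means the data is clean
print(milk.isna().sum())
```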

As you can see, there are no NAs in the data set; it is actually very clean and ready to use.


First, I want to create some visualizations of the data before I start modeling. I used seaborn to make my graphs.
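A rough sketch of that seaborn count plot, using toy taste labels (1 for good, 0 for bad) that are imbalanced the same way the post describes:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import pandas as pd
import seaborn as sns

# Toy taste labels (1 = good, 0 = bad); more good than bad, as in the post
milk = pd.DataFrame({"Taste": [1, 1, 1, 0, 0, 1, 1, 0]})

ax = sns.countplot(x="Taste", data=milk)  # bar chart of good vs. bad counts
print(milk["Taste"].value_counts())
```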

From this visualization you can see that the taste data is not completely balanced: more of the milks tasted good than bad.


Now I want to see how useful some of the other features might be for my predictive model.
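These comparisons can be sketched with pandas crosstabs (toy rows; column names assumed). In the toy data, grade is mixed while temperature is one-sided, mirroring what the tables below show:

```python
import pandas as pd

# Toy rows (column names assumed); a crosstab shows how decisive a feature is
milk = pd.DataFrame({
    "Grade":       ["high", "low", "high", "low", "medium", "medium"],
    "Temperature": [35, 35, 55, 55, 35, 55],
    "Taste":       [1, 1, 0, 0, 1, 0],
})

# Grade vs. Taste: every grade mixes good and bad -> weak signal
print(pd.crosstab(milk["Grade"], milk["Taste"]))

# Temperature vs. Taste: each temperature is one-sided -> strong signal
print(pd.crosstab(milk["Temperature"], milk["Taste"]))
```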

This table compares milk grade to taste. I do not think grade will be very useful, because both the high grade and the low grade contain more good-tasting milks than bad ones, even though they are opposite qualities.

This next table compares milk temperature to taste. Temperature looks very useful for classification, because many temperatures are one-sided toward a single taste. Since temperature is more decisive about the taste, it should be more useful for the model.


Last, I need to check the datatypes of my columns to make sure they are usable.

As you can see, the grade column is a string, so it needs to be converted to integers. I can do this with get_dummies, which gives each string value its own indicator column: a row gets a 1 in the column matching its grade and a 0 in the others.
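A minimal sketch of the dtype check and the get_dummies conversion (toy data, assumed column names):

```python
import pandas as pd

milk = pd.DataFrame({
    "pH": [6.6, 6.8, 3.0],
    "Grade": ["high", "low", "medium"],  # string column that needs encoding
})

print(milk.dtypes)  # Grade shows up as object (string)

# One indicator column per grade value, filled with 1s and 0s
milk = pd.get_dummies(milk, columns=["Grade"])
print(milk.columns.tolist())
```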

After that we are done with our pre-processing and visualizations and we can move on and start our modeling.


Modeling

To create our model, we first need to set our feature variables and our target variable.

Our feature variables (X) are everything in the data set except the taste column, and our target variable (y) is just the taste column.
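That split of features and target might look like this (toy data):

```python
import pandas as pd

milk = pd.DataFrame({
    "pH": [6.6, 6.8, 3.0],
    "Temperature": [35, 45, 55],
    "Taste": [1, 0, 1],
})

X = milk.drop(columns=["Taste"])  # every column except the target
y = milk["Taste"]                 # the target column
print(X.columns.tolist(), y.name)
```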


Now that we have our variables, we need to split the data into two sets: a training set and a testing set. We do this to measure how our model performs; without a held-out test set, we would not know how well the model does on data it has not seen.

I split the data 80/20: 80% for training and the remaining 20% for testing. That leaves a good amount of data to evaluate the model with.
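With scikit-learn's train_test_split, the 80/20 split could be sketched as follows (toy data; the random_state value is my own choice, just for reproducibility):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

milk = pd.DataFrame({"pH": range(10), "Taste": [1, 0] * 5})
X, y = milk.drop(columns=["Taste"]), milk["Taste"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```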


Next we need to train our model. We are going to use a decision tree to classify the data. I chose a decision tree because it handles the mostly binary features and the binary target well, and the resulting tree is easy to interpret.
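Training the tree might look like this (toy data; the entropy criterion is an assumption on my part, chosen because the evaluation section talks about node entropy):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy training data (assumed feature names)
X_train = pd.DataFrame({
    "pH": [6.6, 6.8, 3.0, 9.5],
    "Temperature": [35, 45, 55, 34],
})
y_train = pd.Series([1, 1, 0, 0])  # 1 = good taste, 0 = bad taste

# Fit a decision tree; entropy makes each split maximize information gain
tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree.fit(X_train, y_train)
print(tree.get_depth())
```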

Our model is now trained, and we can move on to observing how it performed.


Evaluation

Let's first take a look at the values the model predicts for the test set.
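Generating those predictions is one call on the fitted tree (toy data again):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Fit on toy data, then predict labels for the same feature values
X = pd.DataFrame({"pH": [6.6, 6.8, 3.0, 9.5]})
y = pd.Series([1, 1, 0, 0])
tree = DecisionTreeClassifier(random_state=42).fit(X, y)

y_pred = tree.predict(X)  # binary labels: 1 = good taste, 0 = bad taste
print(y_pred)
```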

As expected, all of the values are 1s and 0s.


Now let's see what our accuracy score is.
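The accuracy score is just the fraction of predictions that match the true labels; a toy example:

```python
from sklearn.metrics import accuracy_score

# Toy true labels and predictions: 3 of 4 match
y_test = [1, 0, 1, 1]
y_pred = [1, 0, 1, 0]
print(accuracy_score(y_test, y_pred))  # fraction of correct predictions
```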

This is a very high accuracy score, which is both a good and a bad thing. It is good because our model works and is very accurate. However, a score this high should make us question why: it could be a sign of overfitting or of highly correlated features. We will be able to inspect this on our decision tree.

From this tree I can see that the splits are spread fairly evenly across different features, though temperature and pH appear in a lot of nodes. Fat is the feature at the top of the tree, meaning its split gave the largest drop in entropy, i.e. the most information gain. I am going to create a feature importance plot to look at this more.
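A feature importance plot could be sketched like this (toy data with two assumed features; the real plot would cover all of them):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

X = pd.DataFrame({
    "pH": [6.6, 6.8, 3.0, 9.5],
    "Temperature": [35, 45, 55, 34],
})
y = pd.Series([1, 1, 0, 0])
tree = DecisionTreeClassifier(criterion="entropy", random_state=42).fit(X, y)

# Importances sum to 1; each is a feature's share of the total entropy reduction
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances.sort_values().plot.barh()  # horizontal bar plot of importances
print(importances)
```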

This confirms what we saw in the tree: almost all of the features are used, and most of them contribute meaningfully to the model. That is good, because it shows the model needed all of the features to classify the data. It is also interesting that fat is not the most important feature, even though it sits at the root of the tree, which usually suggests high importance; it comes in only fifth.


Now I am going to make a classification report to further analyze the data.
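scikit-learn's classification_report produces exactly this kind of per-class table; a toy example:

```python
from sklearn.metrics import classification_report

# Toy true labels and predictions for the two taste classes
y_test = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
print(classification_report(y_test, y_pred))  # precision/recall/f1 per class
```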

From this classification report you can see that the model was fairly even in its predictions. Precision, recall, and f1 are about 99% across the board, which is very high. This suggests the model was actually learning patterns and not just guessing.


Finally, I want to look at a confusion matrix to see where the model went wrong.
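A toy confusion matrix (scikit-learn lays it out with actual classes as rows and predicted classes as columns):

```python
from sklearn.metrics import confusion_matrix

y_test = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
# Rows = actual, columns = predicted: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_test, y_pred))
```

Here the off-diagonal cells are the mistakes: the top-right cell counts false positives and the bottom-left cell counts false negatives.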

It seems that the model made one false positive and one false negative. This shows that the model's errors were not one-sided; it predicted wrong in both directions. I will talk more about the effects of this in the impact section.


Impact

There are a few impacts that could come from my analysis. A positive impact is that people could predict whether their milk is going to taste good or bad, which could help someone avoid getting sick from drinking spoiled milk. A negative impact comes from false positives: someone could drink milk the model said was good but that actually tastes bad, and could get sick as a result. The effect can also go the other way. With a false negative, someone may accidentally throw out milk the model said was bad but that actually tastes good.


Conclusion

My results have shown how to classify whether milk will taste good or bad. We learned that a decision tree does this very well, reaching a 99% accuracy score. Now my model can be used to predict whether your milk will taste good or bad after a few tests.


This project has taught me more about Python and Jupyter Notebook. I learned how to make a classification model that had a high level of accuracy. I also learned how to better analyze data and understand it. I am excited to learn more about Data Mining through the rest of this class.


References/Code

My Code

 
 
 
