Wine Quality Prediction Using Machine Learning
This project predicts the quality of wine from physicochemical properties such as acidity, alcohol percentage, pH, density, and sulphate content. The goal is to understand which chemical factors influence wine quality and to build an accurate machine learning model using real-world data.
Wine quality datasets are widely used in the food industry and in research to ensure consistency, improve production processes, and evaluate wine without relying solely on human tasters.
Project Overview
The dataset contains numerical features and a target variable called quality, which is typically rated between 0 and 10.
The notebook demonstrates a complete workflow: exploratory analysis, feature correlation, data preprocessing, model building, and performance evaluation.
Libraries Used
The notebook uses the following libraries, exactly as imported in the code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
Data Preprocessing
The dataset was checked for missing values and all columns were confirmed to be numeric.
Since the data did not require categorical encoding, preprocessing focused on these steps:
- Verified feature types and structure
- Selected quality as the target variable
- Standardized the data where needed for scale-sensitive algorithms (Random Forest itself does not depend on feature scaling)
- Ensured all values were clean and ready for analysis
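The checks listed above can be sketched roughly as follows. This is a minimal illustration on a small synthetic DataFrame, not the notebook's actual code; the column names and value ranges are assumptions, since the real data file is not shown here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data (assumption: the real file and its
# exact column names are not shown in this README).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "alcohol": rng.uniform(8.0, 14.0, 200),
    "volatile acidity": rng.uniform(0.1, 1.2, 200),
    "sulphates": rng.uniform(0.3, 1.5, 200),
    "quality": rng.integers(3, 9, 200),
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = df.isnull().sum()

# Confirm every column is numeric, so no categorical encoding is required.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in df.dtypes)

# Separate the features from the target variable.
X = df.drop(columns="quality")
y = df["quality"]
```

If `missing_per_column` contained non-zero counts, an imputation or row-dropping step would be needed before modeling.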
Exploratory Data Analysis
Visualizations were used to understand patterns between wine features and quality.
Key observations include:
- Higher alcohol content is generally associated with higher wine quality
- Higher volatile acidity is associated with lower quality
- Sulphates show a mild positive influence
- Citric acid also contributes slightly to improved scores
Bar charts and distribution plots helped reveal how each feature is distributed and how it relates to the final rating.
Correlation Insights
A heatmap was plotted to reveal how strongly each feature relates to the quality score.
Important findings from the correlation analysis include:
- Alcohol has the strongest positive correlation
- Volatile acidity has a strong negative correlation
- Sulphates and citric acid show moderate positive effects
- Density shows a weak negative correlation
- pH, residual sugar, and chlorides have minimal impact
These insights helped guide feature selection and understanding of wine chemistry.
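A correlation ranking of this kind can be computed along the following lines. The sketch uses synthetic data deliberately constructed so that alcohol raises the score and volatile acidity lowers it; the coefficients and column names are assumptions chosen for illustration and are not results from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic features (assumed ranges, for illustration only).
alcohol = rng.uniform(8.0, 14.0, n)
volatile_acidity = rng.uniform(0.1, 1.2, n)
ph = rng.uniform(2.8, 4.0, n)

# Synthetic score: alcohol helps, volatile acidity hurts, pH is unrelated.
score = 0.8 * alcohol - 3.0 * volatile_acidity + rng.normal(0, 1, n)
quality = np.clip(np.round(score), 0, 10).astype(int)

df = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "pH": ph,
    "quality": quality,
})

# Pairwise Pearson correlations between all columns.
corr = df.corr()

# Correlation of each feature with the target, strongest positive first.
quality_corr = corr["quality"].drop("quality").sort_values(ascending=False)
print(quality_corr)
```

The `corr` matrix is what a heatmap would display, for example by passing it to `sns.heatmap(corr, annot=True)`.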
Feature Selection
Based on correlation values and overall relevance, the following features were selected as the most impactful predictors of wine quality:
- alcohol
- volatile acidity
- sulphates
- citric acid
- density
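Restricting the feature matrix to the selected columns might look like the sketch below. The rows are made-up illustrative values, and the assumption is that the DataFrame's column names match the feature names listed above.

```python
import pandas as pd

# Features chosen in the section above.
selected_features = [
    "alcohol",
    "volatile acidity",
    "sulphates",
    "citric acid",
    "density",
]

# Tiny hand-made frame with made-up values, used only to show the selection.
df = pd.DataFrame({
    "alcohol": [9.0, 10.5, 12.0],
    "volatile acidity": [0.80, 0.50, 0.30],
    "sulphates": [0.50, 0.65, 0.80],
    "citric acid": [0.05, 0.25, 0.45],
    "density": [0.998, 0.997, 0.995],
    "pH": [3.4, 3.3, 3.2],
    "quality": [4, 5, 7],
})

# Keep only the selected predictors; the target stays separate.
X = df[selected_features]
y = df["quality"]
```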
Model Training
This project frames the problem as classification and trains a Random Forest classifier rather than a regression model. The training flow follows these steps:
- Split data into training and test sets using train_test_split
- Instantiate RandomForestClassifier and fit on the training data
- Predict quality labels on the test set using the trained classifier
Example code snippet:
model = RandomForestClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Random Forest is used because it handles complex non-linear relationships and works well with numeric feature sets of this type.
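For readers who want to run the training flow end to end, here is a self-contained sketch on synthetic data. The `test_size`, `random_state`, `stratify`, and `n_estimators` settings are assumptions made for reproducibility; the notebook's actual split and hyperparameters are not stated in this README.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400

# Synthetic features standing in for the real wine measurements.
X = pd.DataFrame({
    "alcohol": rng.uniform(8.0, 14.0, n),
    "volatile acidity": rng.uniform(0.1, 1.2, n),
    "sulphates": rng.uniform(0.3, 1.5, n),
})

# Synthetic labels: three quality levels (5, 6, 7) driven mostly by alcohol.
noisy_alcohol = (X["alcohol"] + rng.normal(0, 0.5, n)).to_numpy()
y = pd.Series(np.digitize(noisy_alcohol, bins=[10.0, 12.0]) + 5, name="quality")

# Hold out 20% for testing; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the classifier with a fixed seed so runs are repeatable.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict quality labels for the held-out wines.
predictions = model.predict(X_test)
```

Fixing `random_state` in both the split and the model is a design choice that makes accuracy figures comparable between runs.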
Model Evaluation
Model performance is evaluated using accuracy, which is appropriate for classification tasks and provides a clear measure of correct label prediction. Additional evaluation steps can include precision, recall, and F1 score to account for class imbalance.
- accuracy_score was used to compute the overall prediction accuracy
The Random Forest classifier achieved strong classification accuracy compared to a simple baseline.
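One way to make the baseline comparison concrete is sketched below. It assumes a majority-class `DummyClassifier` as the simple baseline and uses an imbalanced synthetic dataset from `make_classification`; neither choice is confirmed by the notebook, and no accuracy figures from the real data are implied.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Imbalanced three-class synthetic data (assumption: stands in for wine data).
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_classes=3,
    weights=[0.6, 0.3, 0.1], class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Random Forest evaluated on the same held-out split.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
model_pred = model.predict(X_test)
model_acc = accuracy_score(y_test, model_pred)

# Per-class precision, recall, and F1 expose weak minority-class performance.
report = classification_report(y_test, model_pred, zero_division=0)

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"random forest accuracy: {model_acc:.3f}")
print(report)
```

Because quality ratings tend to cluster around the middle scores, a majority-class baseline can look deceptively good on accuracy alone, which is why the per-class report is worth reading alongside it.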
Key Results
- Alcohol content is the most powerful indicator of wine quality
- Higher volatile acidity tends to reduce quality
- Sulphates and citric acid show positive influence
- Random Forest classifier delivered the best predictive performance in this workflow
Conclusion
The Wine Quality Prediction project highlights which chemical characteristics matter most in determining wine quality. The Random Forest classifier proved to be an effective model for predicting quality labels, and the insights gained align with real-world wine chemistry. This project provides a strong foundation for predictive modeling in food science and quality control.
Future Improvements
- Test advanced classification models such as Gradient Boosting or XGBoost classifiers
- Apply cross-validation for more reliable scoring
- Consider converting quality into categorical bins such as good, average, and poor to simplify interpretation
- Optimize hyperparameters using GridSearchCV or RandomizedSearchCV
- Deploy the classifier with Flask or Streamlit for real-time use
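Several of these ideas can be prototyped together. The sketch below bins a synthetic quality score into three labels, scores a Random Forest with 5-fold cross-validation, and runs a small grid search. The bin edges (4 and 6), the parameter grid, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(7)
n = 300

# Synthetic features and a 0-10 quality score (illustrative only).
X = pd.DataFrame({
    "alcohol": rng.uniform(8.0, 14.0, n),
    "volatile acidity": rng.uniform(0.1, 1.2, n),
})
raw = 0.8 * X["alcohol"] - 3.0 * X["volatile acidity"] + rng.normal(0, 1, n) - 2
quality = np.clip(np.round(raw), 0, 10).astype(int)

# Bin scores: 0-4 poor, 5-6 average, 7-10 good (hypothetical thresholds).
y_binned = pd.cut(
    quality, bins=[-1, 4, 6, 10], labels=["poor", "average", "good"]
).astype(str)

# 5-fold cross-validated accuracy (stratified folds for classifiers).
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y_binned, cv=5, scoring="accuracy",
)

# Small grid search over two hyperparameters.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
    cv=3, scoring="accuracy",
)
search.fit(X, y_binned)

print("mean CV accuracy:", cv_scores.mean())
print("best parameters:", search.best_params_)
```

`RandomizedSearchCV` follows the same pattern and may be preferable when the grid grows large.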