Housing price prediction is a common machine learning application in real estate analytics.
This project analyzes location-based and property-based features to estimate the median house value in California districts.
Using Python and machine learning, the model learns relationships between geographic attributes, income levels, population, and housing characteristics.
This helps in understanding market trends and making informed pricing decisions.
Project Overview
- Machine-learning-based regression system for predicting house prices
- Uses Linear Regression, Decision Tree Regressor, and Random Forest Regressor
- Includes a full preprocessing pipeline with imputation, scaling, and encoding
- Built using the California Housing dataset and modern ML workflows
Libraries Used
- Pandas for data manipulation and exploration
- NumPy for numerical computation
- scikit-learn for preprocessing, transformation, model building, and evaluation
- SimpleImputer for handling missing values
- StandardScaler for feature scaling
- OneHotEncoder for categorical encoding
- ColumnTransformer for applying different transformations to numeric and categorical columns
- Cross-validation for performance measurement
Dataset Details
The dataset contains district-level housing data for California.
Features include:
- Longitude and latitude
- Housing median age
- Total rooms and total bedrooms
- Population and households
- Median income
- Ocean proximity (categorical attribute)
The target column is:
- Median house value
Preprocessing Steps
- Created income categories for stratified sampling, so that the train and test sets have similar income distributions
- Separated dataset into features and target label
- Identified numeric and categorical feature groups
- Built a pipeline for numeric attributes including median imputation and standard scaling
- Built a pipeline for categorical attributes using one-hot encoding
- Combined both pipelines using ColumnTransformer
- Transformed the dataset into a complete numerical and scaled feature set ready for modeling
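The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is a small synthetic stand-in, and the column names, income bin edges, and split ratio are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data (values are made up for illustration).
rng = np.random.default_rng(42)
n = 200
housing = pd.DataFrame({
    "median_income": rng.uniform(0.5, 10.0, n),
    "total_rooms": rng.integers(100, 5000, n).astype(float),
    "total_bedrooms": rng.integers(50, 1000, n).astype(float),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
    "median_house_value": rng.uniform(50_000, 500_000, n),
})
housing.loc[::25, "total_bedrooms"] = np.nan  # simulate missing values

# Income categories used only for stratified sampling (bin edges are assumed).
housing["income_cat"] = pd.cut(
    housing["median_income"], bins=[0.0, 3.0, 6.0, np.inf], labels=[1, 2, 3]
)
train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42
)
train_set = train_set.drop(columns="income_cat")
test_set = test_set.drop(columns="income_cat")

# Separate features from the target label.
X_train = train_set.drop(columns="median_house_value")
y_train = train_set["median_house_value"].copy()

# Numeric and categorical feature groups.
num_features = ["median_income", "total_rooms", "total_bedrooms"]
cat_features = ["ocean_proximity"]

# Numeric pipeline: median imputation followed by standard scaling.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
# Categorical pipeline: one-hot encoding.
cat_pipeline = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Apply each pipeline to its own column group.
preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_features),
    ("cat", cat_pipeline, cat_features),
])
X_train_prepared = preprocessor.fit_transform(X_train)
print(X_train_prepared.shape)
```

The preprocessor is fitted on the training set only. The test set would be passed through `preprocessor.transform`, so that imputation medians and scaling statistics do not leak from test data.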
Model Building
- Linear Regression
  - Learns linear relationships between features and house value
  - Provides baseline performance
- Decision Tree Regressor
  - Learns nonlinear feature interactions
  - Can overfit but gives insight into complex patterns
- Random Forest Regressor
  - Ensemble of multiple decision trees for higher accuracy and stability
  - Usually performs best for structured tabular datasets like housing data
Each model is trained on the processed dataset and evaluated with cross-validated RMSE so that the models can be compared reliably.
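A minimal sketch of such a comparison is shown here. It assumes an already-numeric feature matrix, so synthetic data stands in for the processed housing features, and the fold count and `n_estimators` value are illustrative choices rather than the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic numeric features and target standing in for the processed dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + np.sin(2.0 * X[:, 1]) + rng.normal(scale=0.3, size=300)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_rmse = {}
for name, model in models.items():
    # scikit-learn scorers are "higher is better", so the MSE comes back negated.
    neg_mse = cross_val_score(
        model, X, y, scoring="neg_mean_squared_error", cv=5
    )
    fold_rmse = np.sqrt(-neg_mse)
    cv_rmse[name] = fold_rmse.mean()
    print(f"{name}: mean CV RMSE = {cv_rmse[name]:.3f} (std {fold_rmse.std():.3f})")
```

Reporting the spread across folds alongside the mean helps judge whether a difference between two models is larger than the fold-to-fold variation.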
Performance and Accuracy
- Cross-validation is used to compute the root mean squared error (RMSE) for each model
- Linear Regression gives a baseline error
- Decision Tree Regressor may show very low training error but higher cross-validation error, which is a sign of overfitting
- Random Forest Regressor generally produces the lowest error due to ensemble averaging
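The gap between training error and cross-validation error for a decision tree can be illustrated with a short sketch. The data here is synthetic and the setup is an assumption for demonstration only; it does not reproduce the project's results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy regression problem (not the housing data).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=300)

# An unrestricted tree can memorize the training set.
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X, y)
train_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))

# Cross-validation scores the tree on data it was not fitted on.
neg_mse = cross_val_score(
    DecisionTreeRegressor(random_state=42), X, y,
    scoring="neg_mean_squared_error", cv=5,
)
cv_rmse = np.sqrt(-neg_mse).mean()

print(f"training RMSE: {train_rmse:.3f}")
print(f"cross-validation RMSE: {cv_rmse:.3f}")
```

A near-zero training error combined with a much larger cross-validation error indicates that the tree has fitted noise rather than a pattern that generalizes.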
Prediction Flow
1. The dataset is transformed using the preprocessing pipeline
2. A model is selected: Linear Regression, Decision Tree, or Random Forest
3. Features are passed into the model to generate the predicted median house value
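One way to sketch this flow is to chain the preprocessing and the chosen model in a single scikit-learn `Pipeline`, so that new inputs go through the same transformations before prediction. The tiny training table, the column names, and the choice of Random Forest below are illustrative assumptions, not the project's data or final configuration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows; a real run would use the full housing dataset.
train = pd.DataFrame({
    "median_income": [2.5, 8.3, 5.1, 3.9, 6.7, 1.8],
    "housing_median_age": [30.0, 12.0, 25.0, 40.0, 18.0, 52.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "NEAR OCEAN",
                        "INLAND", "NEAR BAY", "INLAND"],
})
target = pd.Series([120_000.0, 450_000.0, 300_000.0,
                    180_000.0, 380_000.0, 90_000.0])

# Step 1: preprocessing for numeric and categorical columns.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer([
    ("num", num_pipeline, ["median_income", "housing_median_age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

# Step 2: select a model (Random Forest here; any regressor could be swapped in).
model = Pipeline([
    ("preprocess", preprocessor),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=42)),
])
model.fit(train, target)

# Step 3: pass the features of a new district through the fitted pipeline.
new_district = pd.DataFrame({
    "median_income": [4.2],
    "housing_median_age": [28.0],
    "ocean_proximity": ["NEAR OCEAN"],
})
predicted_value = model.predict(new_district)[0]
print(f"Predicted median house value: {predicted_value:,.0f}")
```

Keeping the preprocessor inside the pipeline means the fitted object can be saved and reused as a single unit, which also simplifies later deployment.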
Deployment Possibilities
- Can be deployed using Flask or Streamlit for interactive prediction
- Useful for real estate companies, analysts, and housing market researchers
- Can be integrated into a full decision support dashboard
Key Takeaways
- Complete end-to-end regression workflow implemented successfully
- Demonstrates modern preprocessing using pipelines and column transformers
- Shows performance comparison between multiple regression models
- Highlights the effectiveness of Random Forest for house price prediction
Future Enhancements
- Apply hyperparameter tuning for Random Forest or Gradient Boosting models
- Implement advanced models such as XGBoost or LightGBM
- Add geospatial visualizations for deeper real estate insights
- Build a complete automated system for housing market analysis