California Housing Prediction

Introduction

Housing price prediction is a common application of machine learning in real estate analytics.
This project analyzes location-based and property-based features to estimate the median house value in California districts.

Using Python and machine learning, the models learn complex relationships between geographic attributes, income levels, population, and housing characteristics.
This helps in understanding market trends and making informed pricing decisions.

Project Overview

  • Machine-learning-based regression system for predicting house prices
  • Uses Linear Regression, Decision Tree Regressor, and Random Forest Regressor
  • Includes a full preprocessing pipeline with imputation, scaling, and encoding
  • Built on the California Housing dataset using scikit-learn pipelines

Libraries Used

  • Pandas for data manipulation and exploration
  • NumPy for numerical computation
  • scikit-learn for preprocessing, transformation, model building, and evaluation
  • SimpleImputer for handling missing values
  • StandardScaler for feature scaling
  • OneHotEncoder for categorical encoding
  • ColumnTransformer for applying different transformations to numeric and categorical columns
  • Cross-validation for performance measurement

Dataset Details

The dataset contains district-level housing data for California.
Features include:

  • Longitude and latitude
  • Housing median age
  • Total rooms and total bedrooms
  • Population and households
  • Median income
  • Ocean proximity (categorical attribute)

The target column is the median house value.

Preprocessing Steps

  • Created income categories for stratified sampling, so the train and test splits preserve the income distribution of the full dataset
  • Separated the dataset into features and the target label
  • Identified numeric and categorical feature groups
  • Built a pipeline for numeric attributes with median imputation and standard scaling
  • Built a pipeline for categorical attributes using one-hot encoding
  • Combined both pipelines using ColumnTransformer
  • Transformed the dataset into a fully numeric, scaled feature set ready for modeling
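
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, the column names (`median_income`, `ocean_proximity`, and so on) and the income bin edges are assumptions, and only a subset of the features is included to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data (assumed column names, made-up values).
rng = np.random.default_rng(42)
n = 500
housing = pd.DataFrame({
    "longitude": rng.uniform(-124.3, -114.3, n),
    "latitude": rng.uniform(32.5, 42.0, n),
    "housing_median_age": rng.integers(1, 53, n).astype(float),
    "total_rooms": rng.integers(100, 5000, n).astype(float),
    "total_bedrooms": rng.integers(20, 1000, n).astype(float),
    "median_income": rng.uniform(0.5, 15.0, n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})
housing["median_house_value"] = (
    40_000 * housing["median_income"] + rng.normal(0, 20_000, n)
)
# Inject some missing values so the imputer has work to do.
housing.loc[housing.index[::25], "total_bedrooms"] = np.nan

# Income categories used only for stratification (bin edges are an assumption).
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)
train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42
)

# Separate features from the target and drop the helper column.
X_train = train_set.drop(columns=["median_house_value", "income_cat"])
y_train = train_set["median_house_value"]

cat_cols = ["ocean_proximity"]
num_cols = [c for c in X_train.columns if c not in cat_cols]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [("num", num_pipeline, num_cols), ("cat", cat_pipeline, cat_cols)],
    sparse_threshold=0,
)
X_train_prepared = preprocess.fit_transform(X_train)
print(X_train_prepared.shape)
```

Fitting the transformer on the training split only, and reusing it with `preprocess.transform(...)` on the test split, keeps test-set statistics out of the imputer and scaler.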

Model Building

  • Linear Regression
    Learns linear relationships between features and house value
    Provides baseline performance
  • Decision Tree Regressor
    Learns nonlinear feature interactions
    Can overfit but gives insight into complex patterns
  • Random Forest Regressor
    Ensemble of multiple decision trees for higher accuracy and stability
    Often performs well on structured tabular datasets such as housing data

Each model is trained on the processed dataset and evaluated with cross-validated RMSE so the models can be compared reliably.
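
As a rough sketch of that comparison, the snippet below runs 5-fold cross-validation for the three models. The feature matrix `X_prepared` here is random synthetic data standing in for the processed dataset, and the fold count and forest size are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed features and the target.
rng = np.random.default_rng(0)
X_prepared = rng.normal(size=(300, 5))
y = (
    200_000
    + 50_000 * X_prepared[:, 0]
    + 30_000 * X_prepared[:, 1] * X_prepared[:, 2]  # nonlinear interaction
    + rng.normal(0, 10_000, 300)
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=30, random_state=42),
}

cv_rmse = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so MSE is reported negated.
    neg_mse = cross_val_score(
        model, X_prepared, y, scoring="neg_mean_squared_error", cv=5
    )
    cv_rmse[name] = np.sqrt(-neg_mse)
    print(f"{name}: mean RMSE {cv_rmse[name].mean():.0f} "
          f"(std {cv_rmse[name].std():.0f})")
```

Reporting the standard deviation across folds alongside the mean gives a sense of how stable each model's error estimate is.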

Performance and Accuracy

  • Cross-validation is used to compute the root mean squared error (RMSE) for each model
  • Linear Regression gives a baseline error
  • Decision Tree Regressor may show very low training error but higher cross-validation error, which is a sign of overfitting
  • Random Forest Regressor generally produces the lowest error because ensemble averaging reduces variance
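
The gap between training error and cross-validation error for a decision tree can be illustrated with a small experiment. The data below is synthetic and the numbers it prints say nothing about the real dataset; the point is only the contrast between the two error measures.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed features and the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 200_000 + 50_000 * X[:, 0] + rng.normal(0, 10_000, 300)

tree = DecisionTreeRegressor(random_state=42)

# Training error: fit and evaluate on the same rows.
tree.fit(X, y)
train_rmse = np.sqrt(np.mean((tree.predict(X) - y) ** 2))

# Cross-validation error: each fold is scored on rows the tree never saw.
neg_mse = cross_val_score(tree, X, y, scoring="neg_mean_squared_error", cv=5)
cv_rmse = np.sqrt(-neg_mse).mean()

print(f"training RMSE: {train_rmse:.1f}")
print(f"cross-validation RMSE: {cv_rmse:.1f}")
```

An unconstrained tree can memorize every training row, so its training RMSE says little about how it will behave on unseen districts.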

Prediction Flow

1. The dataset is transformed using the preprocessing pipeline
2. A model is selected: Linear Regression, Decision Tree, or Random Forest
3. The transformed features are passed into the model to generate the predicted median house value
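
One way to wire these three steps together is to chain the preprocessing and the chosen model in a single scikit-learn `Pipeline`. The sketch below assumes a reduced set of hypothetical columns and synthetic training data; `new_district` is a made-up row used only to show the call.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic training data (assumed column names, made-up values).
rng = np.random.default_rng(7)
n = 200
train = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, n),
    "housing_median_age": rng.integers(1, 53, n).astype(float),
    "total_bedrooms": rng.integers(20, 1000, n).astype(float),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})
labels = 40_000 * train["median_income"] + rng.normal(0, 20_000, n)

num_cols = ["median_income", "housing_median_age", "total_bedrooms"]
cat_cols = ["ocean_proximity"]

# Step 1: preprocessing (imputation, scaling, one-hot encoding).
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Step 2: pick a model; any of the three regressors could go here.
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=42)),
])
model.fit(train, labels)

# Step 3: raw features in, predicted median house value out.
new_district = pd.DataFrame([{
    "median_income": 5.0,
    "housing_median_age": 20.0,
    "total_bedrooms": np.nan,  # missing value is handled by the imputer
    "ocean_proximity": "INLAND",
}])
predicted_value = model.predict(new_district)
print(predicted_value)
```

Keeping preprocessing inside the pipeline means new data goes through exactly the same transformations that were fitted on the training set.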

Deployment Possibilities

  • Can be deployed using Flask or Streamlit for interactive prediction
  • Useful for real estate companies, analysts, and housing market researchers
  • Can be integrated into a full decision support dashboard

Key Takeaways

  • Implements a complete end-to-end regression workflow
  • Demonstrates modern preprocessing using pipelines and column transformers
  • Shows performance comparison between multiple regression models
  • Highlights the effectiveness of Random Forest for house price prediction

Future Enhancements

  • Apply hyperparameter tuning for Random Forest or Gradient Boosting models
  • Implement advanced models such as XGBoost or LightGBM
  • Add geospatial visualizations for deeper real estate insights
  • Build a complete automated system for housing market analysis
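
As a starting point for the hyperparameter tuning idea, here is a minimal grid search sketch for a Random Forest. It is not part of the current project: the parameter grid, fold count, and synthetic data are all assumptions, and a real search would use a wider grid on the processed housing features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed features and the target.
rng = np.random.default_rng(3)
X_prepared = rng.normal(size=(200, 6))
y = (
    200_000
    + 50_000 * X_prepared[:, 0]
    + 30_000 * X_prepared[:, 1] * X_prepared[:, 2]
    + rng.normal(0, 10_000, 200)
)

# Small illustrative grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [10, 30], "max_features": [2, 4]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_prepared, y)

best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, f"RMSE {best_rmse:.0f}")
```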