Student Project

HomeVision

Team members:
Dylan Jin

Problem & Motivation

This project aims to predict US property values from property-level features (including property images), regional-level features, and macroeconomic features using predictive machine learning models. Our motivation is twofold. First, the housing and real estate market is essential for home buyers, sellers, and investors, and it has a dramatic impact on the US economy. A good predictive model would benefit all participants in this market. Second, the problem is interesting and challenging from a machine learning perspective. Some researchers predict property value from property-level textual features, and some have integrated property images into the model. Little research has been done on integrating property-level textual features, images, and regional and macroeconomic data into a single model for predicting home value.

Data Source & Approach

Data source: Our primary data source is the Trulia property listing dataset (September - October 2019 and January 2020) from kaggle.com, the most extensive dataset with both textual features and images. Additionally, we took regional-level and macroeconomic data from Zillow, Redfin, the American Community Survey, and the Federal Reserve (FRED).

Approach: For standard modeling/regression, we used Gradient Boosting, LightGBM, and XGBoost. For image data, we downloaded Google Street View and satellite images through API calls, then extracted CNN features from them using EfficientNet.
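
As an illustration of the image-download step, the sketch below builds request URLs for Google's Street View Static API and Maps Static API. The project does not state which Google APIs or parameters it used, so the choice of endpoints, the helper names, the image size, and the zoom level are all assumptions.

```python
from urllib.parse import urlencode

# Assumed endpoints: Google's Street View Static API and Maps Static API.
STREET_VIEW_URL = "https://maps.googleapis.com/maps/api/streetview"
STATIC_MAP_URL = "https://maps.googleapis.com/maps/api/staticmap"


def street_view_url(address: str, api_key: str, size: str = "600x400") -> str:
    """Build a request URL for a street-level image of an address (hypothetical helper)."""
    params = {"location": address, "size": size, "key": api_key}
    return f"{STREET_VIEW_URL}?{urlencode(params)}"


def satellite_url(address: str, api_key: str, zoom: int = 19,
                  size: str = "600x400") -> str:
    """Build a request URL for a satellite image centered on an address (hypothetical helper)."""
    params = {
        "center": address,
        "zoom": zoom,            # assumed zoom level, not taken from the project
        "size": size,
        "maptype": "satellite",
        "key": api_key,
    }
    return f"{STATIC_MAP_URL}?{urlencode(params)}"
```

Each URL can then be fetched with any HTTP client and the returned image passed to the CNN feature extractor.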

Evaluation

We evaluate performance by splitting the data into training and test sets. Our performance metric is the median percentage error of predicted house prices, which lets us compare our results with Zillow's Zestimate, whose median percentage error is known.
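
The evaluation described above can be sketched as follows. This is a minimal illustration, assuming the percentage error is taken in absolute value relative to the actual price; the function names, the 20% test fraction, and the random seed are hypothetical and not taken from the project.

```python
import random
from statistics import median


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def median_percentage_error(actual, predicted):
    """Median of |predicted - actual| / actual, expressed in percent."""
    errors = [abs(p - a) / a for a, p in zip(actual, predicted)]
    return 100.0 * median(errors)
```

Because the median ignores the size of the largest misses, this metric is less sensitive to a few badly mispriced homes than a mean-based error would be.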

Key Learnings & Impact

Our best model achieves a median percentage error of around 9%, close to the Zestimate model’s 8%. We found that, overall, regional and macroeconomic features improve performance by four percentage points.

We found that missing data and outliers (especially homes with extremely high or low prices) can significantly affect model performance. We therefore filtered out outliers to improve data consistency when training our models.
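
One simple way to filter price outliers is a percentile cutoff, sketched below. The project does not state which rule or thresholds it used, so the 1st and 99th percentile bounds, the "price" field name, the function name, and the choice to drop rows with a missing price are all assumptions.

```python
from statistics import quantiles


def filter_price_outliers(listings, lower_pct=1, upper_pct=99):
    """Keep listings whose price lies within the given percentile bounds.

    Rows with a missing price are dropped (an assumption of this sketch).
    """
    priced = [row for row in listings if row.get("price") is not None]
    prices = [row["price"] for row in priced]
    # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(prices, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [row for row in priced if low <= row["price"] <= high]
```

The cutoffs are best computed on the training data only, so that the test set does not influence which homes are kept.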

Our best-performing image model used Google Street View images combined with GBM. That model performed better than the baseline GBM model when using a simplified set of features (15.9% vs. 17.0% error). The image models performed similarly when using all variables, with around 12% error. Feature importance confirms that, when all features were included, the other features overshadowed the image features.

Acknowledgments

We would like to give special thanks to Puya Vahabi, Severin Perez, and other participants of Section 04 for providing knowledge on the current research and ideas for the modeling.

Last updated:

August 11, 2023