House Price Predictions
Developing a product that quickly prices a consumer's house with an interpretable confidence range
We'd like to create a proof-of-concept model that demonstrates we can achieve low prediction error despite incomplete or missing data. The only dataset currently available is the Ames Housing Price dataset, so we must use the resources at hand to construct an accurate model to show to non-technical stakeholders within the company.
Our dataset is composed of ~3000 observations of 80 variables, plus our target variable: the final sale price of the house. The final evaluation of the model is based on a test set that has been stripped of final sale prices, so we cannot score ourselves against it directly. To measure performance on our own, we will set aside part of the labeled data as a holdout set and evaluate against that.
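A minimal sketch of carving out a holdout set, assuming pandas. The toy frame, the `SalePrice` column name, the 20% holdout fraction, and the random seed are all illustrative assumptions rather than choices documented in this project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the labeled data; the real frame would have ~80 columns.
df = pd.DataFrame({
    "GrLivArea": np.arange(100),
    "SalePrice": np.arange(100) * 1000,
})

# Randomly reserve 20% of the labeled rows for evaluation (assumed fraction).
holdout = df.sample(frac=0.2, random_state=42)

# Everything not in the holdout set is used for fitting.
train = df.drop(holdout.index)

print(len(train), len(holdout))
```

Fixing the seed keeps the split reproducible, so model comparisons are made against the same holdout rows each time.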
We will first reference the provided data dictionary to examine missing values in the data set and subjectively determine their most likely true value.
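The sketch below shows one way this step could look in pandas. The column names (`PoolQC`, `GarageType`, `GarageArea`, `LotFrontage`) and the fill rules are illustrative assumptions; the actual rules would be read off the data dictionary column by column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data; column names are illustrative.
df = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan],
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "GarageArea": [480.0, np.nan, 300.0],
    "LotFrontage": [60.0, np.nan, 80.0],
})

# Assumed rule: a missing category means the house lacks that feature.
absent_categoricals = ["PoolQC", "GarageType"]
# Assumed rule: a missing size for an absent feature is zero.
absent_numerics = ["GarageArea"]

cleaned = df.copy()
cleaned[absent_categoricals] = cleaned[absent_categoricals].fillna("None")
cleaned[absent_numerics] = cleaned[absent_numerics].fillna(0)

# Where no such rule applies, fall back to the column median (a judgment call).
cleaned["LotFrontage"] = cleaned["LotFrontage"].fillna(cleaned["LotFrontage"].median())

print(cleaned)
```

Keeping the rules in explicit lists makes each subjective decision easy to review and change later.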
Next, we will tidy up variables that have small numbers of extreme observations (outliers) by binning the outliers with nearby values. We will also drop some features which we believe are too sparse (e.g. 2997 identical observations and 3 distinct ones) for models to interpret sensibly.
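As a rough illustration of both clean-up steps, the sketch below drops near-constant columns and folds extreme values into the nearest ordinary bin. The column names, the 85% threshold (chosen to suit the ten-row toy frame), and the caps are hypothetical; real cut-offs would be chosen per variable.

```python
import pandas as pd

# Toy frame; column names and values are illustrative.
df = pd.DataFrame({
    "Utilities": ["AllPub"] * 9 + ["NoSewr"],
    "Fireplaces": [0, 1, 1, 0, 2, 1, 0, 1, 2, 4],
    "LotArea": [8000, 9000, 9500, 10000, 8500, 9200, 8700, 9100, 9900, 200000],
})

def drop_near_constant(frame, max_share):
    """Drop columns whose most common value covers more than max_share of rows."""
    keep = [
        col for col in frame.columns
        if frame[col].value_counts(normalize=True, dropna=False).iloc[0] <= max_share
    ]
    return frame[keep]

# "Utilities" is 90% one value, so it is dropped at this threshold.
reduced = drop_near_constant(df, max_share=0.85)

# Bin rare extremes with nearby values: 3+ fireplaces are treated as 2,
# and very large lots are capped at the 90th percentile.
binned = reduced.assign(
    Fireplaces=reduced["Fireplaces"].clip(upper=2),
    LotArea=reduced["LotArea"].clip(upper=reduced["LotArea"].quantile(0.90)),
)

print(binned)
```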
Third, we will conduct feature engineering to create combinations of our variables based on subject-matter knowledge. For example, the total interior square footage of the house may be more meaningful than the square footage of the basement and of the non-basement interior areas on their own. We will then isolate the variables that have high predictive power on their own and emphasize them to our models by adding copies raised to the 2nd, 3rd, and 1/2 powers.
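A small sketch of both ideas in pandas. The column names (`TotalBsmtSF`, `GrLivArea`, `TotalSF`) and the choice of `TotalSF` as a strong predictor are assumptions made for illustration; in practice the list would come from examining each variable's relationship with sale price.

```python
import pandas as pd

# Toy frame; column names are illustrative.
df = pd.DataFrame({
    "TotalBsmtSF": [800.0, 0.0, 1200.0],
    "GrLivArea": [1500.0, 900.0, 2100.0],
})

# Combined feature: total interior square footage.
features = df.assign(TotalSF=df["TotalBsmtSF"] + df["GrLivArea"])

# Hypothetical list of variables judged to be strong predictors on their own.
strong_predictors = ["TotalSF"]

# Add each strong predictor raised to the 2nd, 3rd, and 1/2 powers.
for col in strong_predictors:
    features[f"{col}_sq"] = features[col] ** 2
    features[f"{col}_cube"] = features[col] ** 3
    features[f"{col}_sqrt"] = features[col] ** 0.5

print(features.columns.tolist())
```

The square-root term assumes the underlying values are non-negative, which holds for areas and counts.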
Finally, we will select appropriate models to test, plus a null model, and compare each of them individually against a weighted ensemble model, which predicts the average of all the underlying models' predictions for each data point.
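The comparison can be sketched with NumPy alone. The prediction arrays below are made-up numbers standing in for the output of fitted models, the model names are placeholders, and the equal weights are an assumption; the null model here simply predicts the mean training price.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

y_train = np.array([150_000.0, 200_000.0, 250_000.0])
y_holdout = np.array([160_000.0, 210_000.0, 240_000.0])

# Made-up holdout predictions from three hypothetical fitted models.
predictions = {
    "model_a": np.array([155_000.0, 215_000.0, 230_000.0]),
    "model_b": np.array([170_000.0, 200_000.0, 245_000.0]),
    "model_c": np.array([158_000.0, 212_000.0, 248_000.0]),
}

# Null model: always predict the mean sale price seen in training.
null_pred = np.full(y_holdout.shape, y_train.mean())

# Ensemble: per-house weighted average of the underlying predictions.
# Equal weights (assumed here) reduce to the plain average.
stacked = np.stack(list(predictions.values()))
weights = np.ones(len(predictions))
ensemble_pred = np.average(stacked, axis=0, weights=weights)

scores = {name: rmse(y_holdout, pred) for name, pred in predictions.items()}
scores["null"] = rmse(y_holdout, null_pred)
scores["ensemble"] = rmse(y_holdout, ensemble_pred)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:>8}: {score:,.0f}")
```

In this toy data the individual models miss in different directions, so averaging cancels part of their error. That is the effect an ensemble relies on, though it is not guaranteed on real data.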
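Our model evaluation metric is the Root Mean Squared Error (RMSE). Because RMSE is expressed in the same units as the target, it supports statements such as "Our model's typical error is about $______ per prediction." Strictly speaking, RMSE is the square root of the average squared error rather than a simple average of errors, so a few large misses raise it more than many small ones. It is a concise and useful metric that is easy to present to stakeholders or our fictional company's customers.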
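To make the interpretation concrete, the sketch below computes RMSE next to the mean absolute error (MAE) on two made-up sets of predictions. MAE is included only for contrast and is not a metric used in this project.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(y_true, y_pred):
    """Mean of the absolute errors (shown for comparison only)."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(errors)))

actual = [200_000, 200_000, 200_000, 200_000]
even_misses = [210_000, 190_000, 210_000, 190_000]   # every prediction off by $10k
one_big_miss = [200_000, 200_000, 200_000, 240_000]  # one prediction off by $40k

print(mae(actual, even_misses), rmse(actual, even_misses))    # 10000.0 10000.0
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))  # 10000.0 20000.0
```

Both sets of predictions have the same average miss, but the set with a single large miss has twice the RMSE. This is worth keeping in mind when quoting RMSE as a dollar figure.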
data science portfolio