
Project 3 : House Prices

  • Writer: Varshita Yarabadi
  • Nov 8, 2022
  • 11 min read

Introduction


Hiring a valuation specialist is the standard way to determine a house price, but with the world's growing population and ongoing housing shortages, faster answers are needed. The purpose of this project is to build a predictive mathematical model and use it to estimate the prices of new homes. We have data on recent home sales from the city of Ames, Iowa. After researching the problem, I used linear regression to build the models. The underlying Kaggle competition asks you to estimate the final price of each residential property in Ames, Iowa, given the 79 explanatory variables that describe (nearly) every feature of residential properties there.



Introduction of Problem & Data


Kaggle is a website that hosts data science competitions on a wide variety of data sets; users compete to build the best-performing model for each one. Here, we're using housing data from Ames, Iowa, from 2000 to 2010. The testing data comes from sales made in 2010, whereas the training data comes from sales made before 2010. Kaggle houses data sets with varying degrees of complexity and cleanliness, and the Ames set is one of the more challenging ones that still works "out of the box." It has a wide range of features: some contain N/A values, some are categorical, and in some a N/A value reflects something meaningful rather than merely missing data. Our goal with this data set is to use a number of variables and a variety of multiple linear regression models to forecast the sale price of a house in US dollars. Each model has specifics that we'll go over in more depth below, but they all employ statistical techniques to forecast a dependent variable using one or more independent variables.



What is regression and how does it work?


Regression is an analysis technique that estimates relationships between variables by measuring how one variable affects another. It uses a mathematical method to predict a continuous outcome (y) from a predictor variable (x). There are many kinds of regression analysis, such as linear regression, logistic regression, ridge regression, lasso regression, and more. For this project we will focus on linear regression but also experiment with other models. Linear regression is a technique used to find linear relationships between target and predictor variables. There are three types of linear regression: simple, multiple, and multivariate. Simple linear regression, which finds the relationship between two continuous variables, is the starting point here, and the later experiments extend it to several predictors.
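As a quick illustration of the simple case, here is a minimal sketch that fits one predictor against the sale price with scikit-learn. The file name train.csv and the column names GrLivArea and SalePrice are my assumptions based on the standard Kaggle Ames data, not code from the original notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the Kaggle training data (standard file name assumed).
train = pd.read_csv("train.csv")

X = train[["GrLivArea"]]   # predictor (x): above-grade living area
y = train["SalePrice"]     # continuous outcome (y): sale price in US dollars

# Fit the line y = intercept + coefficient * x
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])
```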


Experiment 1:

We chose an intuitive approach for our initial run and then looked at how we could iterate and improve on it in subsequent attempts. The price of a house should be affected by its qualities: for instance, it makes sense to assume a relationship between a home's price and its square footage. We'll employ three features at most, keeping things straightforward so we can concentrate on analyzing and understanding the data. Each feature should be of high quality, and we must stay away from collinear features. Square footage and the number of rooms in a home are obviously related to some extent and both probably affect value, so it would do us well to steer clear of that kind of overlap and the potential overfitting it brings.

Data Understanding


Our training data set had 1,460 observations and 81 features in total. With that many features it would be pointless to go through each one in detail, so we'll focus on helpful details where they apply; the complete list of variables is available if you're interested. Because working with over 118,000 data points can get a little computationally heavy, we took a slightly unconventional approach: we identified our features, explored the data, and then pre-processed them, knowing we might have to iterate through that cycle again.
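As a rough sketch of this first step (again assuming the standard train.csv file name), loading the data and confirming its size and feature types might look like this:

```python
import pandas as pd

train = pd.read_csv("train.csv")

print(train.shape)                   # expected: (1460, 81)
print(train.dtypes.value_counts())   # mix of numeric and object (categorical) columns
print(train.columns.tolist())        # the complete list of variables
```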


INDEPENDENT VALUES


Distribution of garage sizes

The capacity of a given home's garage, a discrete variable that is quite simple to understand. A 0 indicates that there is no garage, and the feature contains no NaN values.




Above grade living area

A continuous variable that represents the amount of floor space a home has above grade (floor area above ground level, excluding basements). There are no NaN values to take into account, because every home in the data set has this measurement and every property must have at least some indoor square footage.


Distribution of home quality

A discrete score from one through ten indicating the home's overall material and finish quality.







DEPENDENT VALUES


Sale prices

The price for which a given home is sold in US dollars.
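For reference, here is a small sketch of how these distributions could be plotted. The column names GarageCars, GrLivArea, OverallQual, and SalePrice are my assumptions for the garage capacity, above-grade living area, quality, and sale price features described above.

```python
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")

# Garage capacity, above-grade living area, overall quality, and sale price.
features = ["GarageCars", "GrLivArea", "OverallQual", "SalePrice"]
train[features].hist(bins=30, figsize=(10, 6))
plt.tight_layout()
plt.show()
```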








Data Pre-processing:


Using data from a Kaggle competition is a little different from having to go out and find (or gather) data in the wild, and it gives us much cleaner data to start with. As a result, our data pre-processing is a little less involved than it might otherwise be, but we still need to double-check everything to confirm that the data really is as clean as we think it is.

To make changes as needed without affecting the original training data, we started by pulling all of our features into a new data frame. We also verified that all observations in each feature used the same data type. Next, we looked for NA values that would indicate a missing observation, but found that the original data contained all of the observations for our features. Once we were certain that all the data was there, we searched for duplicate values. Our initial check suggested there were some potential duplicates, but it turned out the function we'd used was checking whether specific features matched, not whether an entire observation matched another.
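A minimal sketch of those checks, assuming the same three predictors plus the target:

```python
import pandas as pd

train = pd.read_csv("train.csv")
cols = ["GarageCars", "GrLivArea", "OverallQual", "SalePrice"]  # assumed feature set

# Work on a copy so the original training data stays untouched.
df = train[cols].copy()

print(df.dtypes)             # every feature should use a consistent numeric type
print(df.isna().sum())       # count of missing observations per feature
print(df.duplicated().sum()) # full-row duplicates, not single-feature matches
```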


If you go back and look at the sale price distribution graphic, you'll see that it is right-skewed and includes a fair number of outliers. Cleaning up our target variable looked like the hardest part of pre-processing. If we used the sale price data as-is, we would risk the model being skewed by outliers and performing badly on the test data. After some investigation, we decided to take the natural logarithm (base e) of each value in our sale price data. Fortunately, numpy has a very simple function for this operation. Even though the result isn't perfect, the number of outliers is substantially lower and they are spread more evenly between the high and low ends. If we were putting a model like this into production it would be worth testing a range of methods to standardize the data, but for this discussion this appears adequate. It's important to note that any predictions produced by a model trained on log-transformed data must be transformed back with numpy's exp function.
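The transform itself is a one-liner with numpy, and exp reverses it; a quick sketch:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")

log_price = np.log(train["SalePrice"])   # natural logarithm (base e)
recovered = np.exp(log_price)            # inverse transform back to US dollars

# Round-trip check: exp undoes log.
assert np.allclose(recovered, train["SalePrice"])
```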


Modeling:


Once we had our data ready and understood it, it was time to begin modeling. We set up a linear regression model, selected our target and predictor variables, and fit the model; all of these steps can be found in our code. Given the three predictor variables, our model produced a formula, shown below, that it uses to forecast the sale price. Keep in mind that because the model forecasts the natural logarithm of the sale price, we must use numpy's exp function to convert predictions back into US dollars.
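A sketch of the fitting step, with GarageCars, GrLivArea, and OverallQual standing in for the three chosen features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.read_csv("train.csv")
features = ["GarageCars", "GrLivArea", "OverallQual"]  # assumed predictor set

X = train[features]
y = np.log(train["SalePrice"])  # target in log space

model = LinearRegression().fit(X, y)

# The fitted "formula": log(SalePrice) = intercept + sum(coef_i * feature_i)
terms = " + ".join(f"{c:.5f}*{f}" for c, f in zip(model.coef_, features))
print(f"log(SalePrice) = {model.intercept_:.5f} + {terms}")
```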

Once our model was fitted, we were able to make predictions for observations in the test set. Unlike the training set, the test set does not include prices, so the model had to predict them using our formula.

Evaluation:


We at last have predictions! There is a minor snag, though. Because we decided to enter the Kaggle competition rather than carve a train-test split out of our training set, we didn't have access to any of the actual sale prices for the test set. This means our ability to properly assess our model was somewhat constrained. Normally we would examine a number of metrics in an ANOVA (analysis of variance) table, including R-squared, mean squared error, and the p (or F) value. We searched the internet for the missing test results for the Ames data set for a while, but we were unable to find anything we were certain represented the genuine values.

Despite our analytical conundrum, we were able to produce a new dataframe with our predicted values (transformed back with numpy's exp function) and the original identifier for each observation from the test data set. The Kaggle competition is scored with root mean squared error, or RMSE, which essentially measures the average difference between our predicted value for an observation and the actual value. Our model produced an RMSE of 0.19001, placing us in the top three thousand out of just over four thousand entrants. It's not surprising that our model placed near the bottom given some of the competition's extremely sophisticated entries, but it was still a little demoralizing.
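A sketch of that final step, from predictions to the submission data frame. Kaggle scores RMSE on log prices, so the training-set RMSE printed below is only a rough sanity check, not the leaderboard value, and the feature list is again my assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

features = ["GarageCars", "GrLivArea", "OverallQual"]  # assumed predictor set
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

model = LinearRegression().fit(train[features], np.log(train["SalePrice"]))

# RMSE in log space on the training data, since the true test prices are hidden.
rmse = np.sqrt(mean_squared_error(np.log(train["SalePrice"]),
                                  model.predict(train[features])))
print("training RMSE (log space):", rmse)

# Fill any stray missing values in the test features, predict, and reverse the log.
X_test = test[features].fillna(train[features].mean())
submission = pd.DataFrame({"Id": test["Id"],
                           "SalePrice": np.exp(model.predict(X_test))})
submission.to_csv("submission.csv", index=False)
```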


Experiment 2:


With experiment 2, we aimed to be a little more methodical and examine the data for features that would be helpful in predicting the sale price of a home. Since home attributes are generally straightforward to understand, the intuitive model performed about as well as it could have, but this more systematic approach would be typical if we lacked domain experience and were merely trying to get the lay of the land. One requirement for this experiment was that only quantitative variables would be used.


Data Exploration & Feature Engineering


To start this experiment, we first built a new data frame from the initial training set that comprised all continuous features. Similar to our previous run, we had a lot of features (39 of them), so we'll hold off on explanations and specifics until we've determined which are actually usable. We chose to look for correlation visually, so we made a pair plot of all quantitative features against sale price, which, as you can see, contained a great deal of data.
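Here is a sketch of how such a plot can be built with seaborn. Only a small subset of assumed column names is plotted, since the full 39-feature grid is exactly the unwieldy picture described above.

```python
import pandas as pd
import seaborn as sns

train = pd.read_csv("train.csv")

# All quantitative columns (drop the Id identifier).
numeric = train.select_dtypes(include="number").drop(columns=["Id"])

# A full pair plot of every numeric feature is enormous; a per-category
# subset is far easier to read.
subset = ["OverallQual", "GrLivArea", "GarageArea", "SalePrice"]
sns.pairplot(numeric[subset])
```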

Because it was evident that this was cumbersome and produced noisy plots, the next logical step was to sort the features into several categories and look at more manageable pair plots. Although the exact grouping is unimportant to our methodology, our categories were as follows:

House Information: Features which dealt with characteristics of the house, finish, and quality. At first glance, it appeared that overall quality, general condition, and perhaps lot frontage had some relationship and might be helpful. For this model, we removed the last few features from our master data frame.








Size: Features relevant to square footage in its different forms. The living space above grade, on the first floor, and on the second story all appeared to be immediately usable. We kept basement finished square footage number one for additional examination in a correlation matrix once all aspects had been discovered since it appeared to have potential application. After deciding which features to preserve, we removed the remaining columns from our master data frame by name. Since this process is repeated for each category, we won't repeat it in the following three sets.


Rooms: Features dealing with the number of rooms in different parts of a house. These plots are a little harder to read than most of those we've seen so far because, with the exception of bathrooms, room counts are discrete, so data points stack vertically on whole integers. Here we're looking for tighter clustering and upward movement as we move along the x axis, as opposed to the earlier sets where we were looking for a linear pattern. The rooms category was intriguing because it contained several interesting relationships, but they occurred infrequently enough to raise concerns about over-fitting. Our final picks here were Full Bath, Bedrooms Above Grade, and Total Rooms Above Grade.


Exterior: Features dealing with outdoor living spaces and other outdoor characteristics of a house. Outdoor features were a mix of continuous and discrete values. Fireplaces, the garage, and the square footage of the wood deck were carried over. Pool Area appeared to be quite useful, but we decided against it: as data science practitioners we should take into account the context in which our model will be employed, and it makes little sense to build a strategy that depends on pools in a cold market like Iowa, where they will be the exception rather than the rule.


Miscellaneous: Various other features that didn't fit neatly into other categories but still might be useful.













The Year Built feature was the only one we kept from the remaining features. When we were done with the last category, we checked the columns of the model 2 master data frame to make sure they matched what we expected, and they did. Because some intuition was still involved in the feature selection, we decided it would be helpful to create a correlation matrix and check for any stragglers that fell through the cracks. The original correlation matrix can be found in our Jupyter notebook. Although our feature set looked rather good, some features were still worth deleting. We set a threshold of 0.45 and eliminated all features whose correlation fell below that mark, including general condition, finished basement square footage, second floor square footage, bedrooms above grade, lot frontage, and square footage of wooden decks. After cleaning, the matrix of our model appeared as follows:
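The cleaned matrix itself is shown as an image in the original post; as a sketch of the thresholding step, with the standard Kaggle column names standing in for the features named above:

```python
import pandas as pd

train = pd.read_csv("train.csv")

candidates = ["OverallQual", "OverallCond", "LotFrontage", "GrLivArea",
              "1stFlrSF", "2ndFlrSF", "BsmtFinSF1", "FullBath",
              "BedroomAbvGr", "TotRmsAbvGrd", "Fireplaces", "GarageArea",
              "WoodDeckSF", "YearBuilt", "SalePrice"]

# Correlation of each candidate feature with the sale price.
corr = train[candidates].corr()["SalePrice"].drop("SalePrice")
print(corr.sort_values(ascending=False))

# Keep only features whose correlation clears the 0.45 threshold.
keep = corr[corr.abs() >= 0.45].index.tolist()
print("kept:", keep)
```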


Modeling and Evaluation:


With our full set of features in hand, it was time to start modeling. Apart from a null value for garage area in our test set, the procedure was basically the same as the previous iteration. There are a few alternative methods for filling in missing values, but we decided to use the feature's mean to complete the missing observation. After ensuring that everything was tidy and prepared, we converted sale prices to their natural logarithm, fit the model, made forecasts, created a data frame for the predictions using the reversed log transformation, and submitted it to the Kaggle competition.
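A sketch of that pipeline, with an illustrative feature list (the mean fill handles the lone missing garage value in the test set):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative feature list for model 2; the real set is described above.
features = ["OverallQual", "GrLivArea", "1stFlrSF", "FullBath",
            "TotRmsAbvGrd", "Fireplaces", "GarageArea", "YearBuilt"]

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train = train[features]
y_train = np.log(train["SalePrice"])                   # log-transformed target

# Fill the missing test observation(s) with the training mean of that feature.
X_test = test[features].fillna(train[features].mean())

model = LinearRegression().fit(X_train, y_train)
pd.DataFrame({"Id": test["Id"],
              "SalePrice": np.exp(model.predict(X_test))  # reverse the log transform
             }).to_csv("submission_model2.csv", index=False)
```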


Experiment 3:


We first had to separate and examine the categorical variables in order to decide which ones to use. We created a new data frame, which we could later scaffold onto our data frame from model 2, by filtering out all features of the "object" type, i.e. non-numerical, non-boolean features. We then followed the standard procedure of checking the shape and column names to ensure that everything looked reasonable, and, needless to say, we were relieved to see that it did. An intriguing aspect of using categorical variables in multiple linear regression is that you can't (or rather shouldn't) just plug them in as separate features. Instead, you must employ a method called dummy encoding, which represents each possible value of a feature as its own presence-or-absence column. For illustration, suppose we had a feature for roads that indicated whether a house was on a street, avenue, or road. Before fitting, we convert the feature into three columns, such as road_street, road_avenue, and road_road, and give each home a 1 or a 0 in each column depending on what kind of road it is on. With all of the "sub-types" present as columns, the model fits a coefficient for each one, which is multiplied by that 1 or 0. It is a crucial component of using categorical variables in regression, despite sounding harder than it is.
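A toy sketch of dummy encoding with pandas, using the hypothetical road-type feature from the example above (not a real column in the Ames data):

```python
import pandas as pd

# Hypothetical categorical feature: what kind of road a house sits on.
homes = pd.DataFrame({"road_type": ["street", "avenue", "road", "street"]})

# One column per category; each row gets a 1 in its category and 0 elsewhere.
dummies = pd.get_dummies(homes["road_type"], prefix="road")
print(dummies)

# For the real data set, the same idea applies to every "object" column, e.g.:
# encoded = pd.get_dummies(train, columns=train.select_dtypes("object").columns)
```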


Impact/Conclusion:


I believe this project can have a positive impact on people who are looking to buy a house. Depending on what a person wants from their house, certain features can be selected to estimate the price of the house they want. I think this can help people who might be on a budget but have certain features that the house must have: it provides a way for them to test out different combinations of features and see an approximate price. As mentioned earlier, there are multiple kinds of linear regression, including simple regression and multivariable regression. To readily estimate the price of a house, I have primarily employed simple regression.


This research allowed me to gain more knowledge about linear regression. I learned how to use specific features of the data set to predict the SalePrice, and the models let me observe the best-fit line, which depicts the linear relationship between each feature and the target. For each experiment, I was able to try out various features to look for patterns, and I could see how attributes with a weaker association to the Sale Price affected the predicted price.


Here is my code:



