What features affect the price of an Airbnb stay in Boston?

12 min read · Feb 21, 2021

Business Understanding

Airbnb is an American online marketplace for vacation rentals, based in San Francisco, California. It maintains and hosts the marketplace, which consumers access through its website or app. Hosts can list their spare rooms, properties, or parts of them for rental, while travelers arrange lodging, primarily homestays, and tourism experiences by searching for properties and rooms by neighborhood or location. Airbnb recommends the best price in the neighborhood and users book the best deal.

Thanks to Kaggle and Udacity, I got a chance to analyze Airbnb listings for the city of Boston. The Boston Airbnb listings dataset has various features such as neighborhood, property type, bedrooms, bathrooms, beds, price, reviews, ratings, etc. It would be interesting to see which features affect the price in Boston and to draw conclusions from them. I was especially interested in training and evaluating a model and seeing how well it predicts Airbnb prices in Boston.

My primary goal is to answer the following questions:

1. Which features affect the price most?
2. How do features affect the price of listings? Do experience and comfort cost more to the user?
3. Can we predict the price of a Boston Airbnb listing?

Data Understanding

To understand the dataset, we have to explore it. Python, pandas, NumPy, Matplotlib, seaborn, and scikit-learn (sklearn) made these data science activities easy. Pandas is excellent for loading, cleaning, and transforming datasets. Seaborn is a handy package for visualizing the results of pandas transformations; it offers high-level functions to plot bar charts, histograms, distributions, box plots, etc. I will use all these packages to explore the data. I performed the following data science activities:

Import packages and read Boston Airbnb datasets
Data cleaning and transformation
Numerical features analysis
Categorical features analysis

Import packages

To explore and analyze the listings dataset, we have to import certain Python packages that take the pain away. In this study, I imported NumPy for linear algebra and pandas for data processing, matplotlib and seaborn for plotting, and sklearn for training and evaluating a model.

Reading listings data sets

After importing all the necessary packages, let’s load the Boston Airbnb listings dataset into memory. The pandas read_csv function makes reading CSV files easy: it takes the file path along with other optional parameters and returns a DataFrame object.
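A minimal sketch of the loading step follows. The real notebook reads the Kaggle CSV from disk; here a few made-up rows in an in-memory buffer stand in for that file so the snippet runs on its own, and the file name in the comment is an assumption.

```python
import io

import pandas as pd

# Stand-in for the real file. In the notebook this would be something like
# pd.read_csv("listings.csv", index_col="id"); the path is an assumption.
csv_text = io.StringIO(
    'id,name,price,bedrooms,square_feet\n'
    '1,Sunny room,$85.00,1,\n'
    '2,Harbor loft,"$1,250.00",2,\n'
    '3,Cozy studio,$120.00,,400\n'
)
listings = pd.read_csv(csv_text, index_col="id")

# shape is an attribute: (number of rows, number of columns)
print(listings.shape)

# info() prints every column with its non-null count and dtype
listings.info()
```

Note that price is read as text because of the currency symbol and thousands separator, which is why it needs converting later.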

Exploring dataset

Exploring datasets is one of my favorite data science activities. It reveals lots of interesting and sometimes surprising facts about the features of the dataset. Moreover, it helps to identify the features that best explain the target variable. There are some handy tools for this: shape returns the number of rows and columns of the dataset, and info outputs the full list of columns with each column's data type and count of non-null values. These help me understand the nature of the features. Let’s have a look at all the features of the listings dataset.

Int64Index: 3585 entries, 12147973 to 14504422
Data columns (total 94 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   listing_url                       3585 non-null   object 
 1   scrape_id                         3585 non-null   int64  
 2   last_scraped                      3585 non-null   object 
 3   name                              3585 non-null   object 
 4   summary                           3442 non-null   object 
 5   space                             2528 non-null   object 
 6   description                       3585 non-null   object 
 7   experiences_offered               3585 non-null   object 
 8   neighborhood_overview             2170 non-null   object 
 9   notes                             1610 non-null   object 
 10  transit                           2295 non-null   object 
 11  access                            2096 non-null   object 
 12  interaction                       2031 non-null   object 
 13  house_rules                       2393 non-null   object 
 14  thumbnail_url                     2986 non-null   object 
 15  medium_url                        2986 non-null   object 
 16  picture_url                       3585 non-null   object 
 17  xl_picture_url                    2986 non-null   object 
 18  host_id                           3585 non-null   int64  
 19  host_url                          3585 non-null   object 
 20  host_name                         3585 non-null   object 
 21  host_since                        3585 non-null   object 
 22  host_location                     3574 non-null   object 
 23  host_about                        2276 non-null   object 
 24  host_response_time                3114 non-null   object 
 25  host_response_rate                3114 non-null   object 
 26  host_acceptance_rate              3114 non-null   object 
 27  host_is_superhost                 3585 non-null   object 
 28  host_thumbnail_url                3585 non-null   object 
 29  host_picture_url                  3585 non-null   object 
 30  host_neighbourhood                3246 non-null   object 
 31  host_listings_count               3585 non-null   int64  
 32  host_total_listings_count         3585 non-null   int64  
 33  host_verifications                3585 non-null   object 
 34  host_has_profile_pic              3585 non-null   object 
 35  host_identity_verified            3585 non-null   object 
 36  street                            3585 non-null   object 
 37  neighbourhood                     3042 non-null   object 
 38  neighbourhood_cleansed            3585 non-null   object 
 39  neighbourhood_group_cleansed      0 non-null      float64
 40  city                              3583 non-null   object 
 41  state                             3585 non-null   object 
 42  zipcode                           3547 non-null   object 
 43  market                            3571 non-null   object 
 44  smart_location                    3585 non-null   object 
 45  country_code                      3585 non-null   object 
 46  country                           3585 non-null   object 
 47  latitude                          3585 non-null   float64
 48  longitude                         3585 non-null   float64
 49  is_location_exact                 3585 non-null   object 
 50  property_type                     3582 non-null   object 
 51  room_type                         3585 non-null   object 
 52  accommodates                      3585 non-null   int64  
 53  bathrooms                         3571 non-null   float64
 54  bedrooms                          3575 non-null   float64
 55  beds                              3576 non-null   float64
 56  bed_type                          3585 non-null   object 
 57  amenities                         3585 non-null   object 
 58  square_feet                       56 non-null     float64
 59  price                             3585 non-null   object 
 60  weekly_price                      892 non-null    object 
 61  monthly_price                     888 non-null    object 
 62  security_deposit                  1342 non-null   object 
 63  cleaning_fee                      2478 non-null   object 
 64  guests_included                   3585 non-null   int64  
 65  extra_people                      3585 non-null   object 
 66  minimum_nights                    3585 non-null   int64  
 67  maximum_nights                    3585 non-null   int64  
 68  calendar_updated                  3585 non-null   object 
 69  has_availability                  0 non-null      float64
 70  availability_30                   3585 non-null   int64  
 71  availability_60                   3585 non-null   int64  
 72  availability_90                   3585 non-null   int64  
 73  availability_365                  3585 non-null   int64  
 74  calendar_last_scraped             3585 non-null   object 
 75  number_of_reviews                 3585 non-null   int64  
 76  first_review                      2829 non-null   object 
 77  last_review                       2829 non-null   object 
 78  review_scores_rating              2772 non-null   float64
 79  review_scores_accuracy            2762 non-null   float64
 80  review_scores_cleanliness         2767 non-null   float64
 81  review_scores_checkin             2765 non-null   float64
 82  review_scores_communication       2767 non-null   float64
 83  review_scores_location            2763 non-null   float64
 84  review_scores_value               2764 non-null   float64
 85  requires_license                  3585 non-null   object 
 86  license                           0 non-null      float64
 87  jurisdiction_names                0 non-null      float64
 88  instant_bookable                  3585 non-null   object 
 89  cancellation_policy               3585 non-null   object 
 90  require_guest_profile_picture     3585 non-null   object 
 91  require_guest_phone_verification  3585 non-null   object 
 92  calculated_host_listings_count    3585 non-null   int64  
 93  reviews_per_month                 2829 non-null   float64
dtypes: float64(18), int64(14), object(62)

Observations:

We can see that the Boston Airbnb listings dataset has 3585 rows and 94 columns, which is too many columns. There are two kinds of columns: numeric (int64 and float64) and object.
Some columns have very few or zero non-null values. I removed these columns from the dataset.
Columns such as host_url, medium_url, picture_url, etc. are not useful for the analysis, so I dropped them from the dataset I studied.
Columns such as price, cleaning_fee, security_deposit, host_response_rate, etc. are of type object but hold numeric values, so I converted them to a numeric type.

Data cleaning and transformations

Based on the above observations, I wrote a function that uses pandas high-level functions to drop columns that are not useful, drop columns with too few values, fill NA values, and convert some object-type columns to numeric columns. This cleans the data and makes it easier to interpret.
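The cleaning steps described above could be sketched roughly as follows. This is not the notebook's actual function: the helper name clean_listings, its parameters, the 50% non-null threshold, and the median fill are all assumptions made for illustration, and the sample rows are made up.

```python
import numpy as np
import pandas as pd


def clean_listings(df, drop_cols=(), min_non_null_ratio=0.5,
                   money_cols=(), percent_cols=()):
    """Hypothetical cleaning helper; names and thresholds are assumptions."""
    # 1. Drop columns that are not useful for the analysis (URLs etc.).
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])

    # 2. Drop columns where too few rows have a value.
    min_count = int(np.ceil(min_non_null_ratio * len(df)))
    df = df.dropna(axis=1, thresh=min_count).copy()

    # 3. Convert money strings such as "$1,250.00" to floats.
    for col in money_cols:
        if col in df.columns:
            stripped = df[col].astype(str).str.replace(r"[$,]", "", regex=True)
            df[col] = pd.to_numeric(stripped, errors="coerce")

    # 4. Convert percentage strings such as "95%" to floats.
    for col in percent_cols:
        if col in df.columns:
            stripped = df[col].astype(str).str.rstrip("%")
            df[col] = pd.to_numeric(stripped, errors="coerce")

    # 5. Fill remaining numeric gaps with the column median (an assumption).
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    return df


raw = pd.DataFrame({
    "host_url": ["u1", "u2", "u3", "u4"],
    "price": ["$85.00", "$1,250.00", "$120.00", "$60.00"],
    "cleaning_fee": ["$10.00", None, "$30.00", "$20.00"],
    "host_response_rate": ["100%", "90%", None, "80%"],
    "square_feet": [None, None, None, 400.0],
})
clean = clean_listings(
    raw,
    drop_cols=["host_url"],
    money_cols=["price", "cleaning_fee"],
    percent_cols=["host_response_rate"],
)
print(clean)
```

In this toy frame, square_feet is dropped because only one of four rows has a value, mirroring what happens to the nearly empty columns in the real dataset.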

Numerical features analysis

With the cleaned dataset, let’s see how the price is distributed. To do that, I took the price column from the listings dataset and plotted a histogram using the seaborn histplot function.

Price distribution of Boston Airbnb Listings

Observations:

We can see that most prices fall within about 220 USD, while prices below 20 USD or above 500 USD are rare. I removed these rows since such prices are unusual for Airbnb listings.
I can further take the log of the price to normalize the distribution when training the model.

Removing outliers

Based on the above observation, I kept only rows where the price was between 20 and 500 USD. After removing the outliers, the price distribution looks better.

Improved price distribution of Boston Airbnb Listings

With the outliers removed, the price distribution is in better shape. The distribution of the log of the price is even better; I will use this transformation later while preparing the train and test splits.
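As a rough illustration of the outlier filter and the log transform, here is a sketch on made-up prices. Treating both bounds as inclusive and naming the new column log_price are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up prices, including values outside the 20 to 500 USD range.
prices = pd.DataFrame(
    {"price": [10.0, 45.0, 150.0, 220.0, 499.0, 500.0, 4000.0]}
)

# Keep rows with 20 <= price <= 500 (between() is inclusive by default).
mask = prices["price"].between(20, 500)
trimmed = prices[mask].copy()

# The natural log compresses the long right tail of the distribution.
trimmed["log_price"] = np.log(trimmed["price"])
print(trimmed)
```

A model trained on log_price predicts on the log scale, so its predictions and errors have to be passed through np.exp to be read in USD.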

Correlation Matrix of all Numerical columns

To plot a correlation matrix of the numeric variables, I selected the numeric columns from the listings dataset and used the pandas corr function to compute the correlation between every pair of variables. Using the seaborn heatmap function, I plotted the correlation matrix as a heatmap. Positive values mean two features tend to increase together, negative values mean one tends to decrease as the other increases, and values near zero indicate a weak relationship. This heatmap helped me select columns for the final train and test splits.

1. Which features affect the price most? It would be interesting to see how the features relate to price and which ones have the strongest relationship with it.
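The idea can be sketched on synthetic data as below. The toy columns and the way price is generated are assumptions chosen only so that one feature is clearly related to price and another is not; the heatmap call is left as a comment so the snippet does not depend on seaborn being installed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
bedrooms = rng.integers(1, 5, size=n)

# Synthetic stand-in for the numeric listing columns.
toy = pd.DataFrame({
    "bedrooms": bedrooms,
    "accommodates": bedrooms * 2 + rng.integers(0, 2, size=n),
    "number_of_reviews": rng.integers(0, 100, size=n),
})
toy["price"] = 60 * toy["bedrooms"] + rng.normal(0, 20, size=n)

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr()

# Correlation of every feature with price, strongest positive first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)

# With seaborn installed, the matrix can be drawn as a heatmap:
# import seaborn as sns
# sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Sorting the price column of the matrix gives the same ranking the heatmap shows visually, which is convenient when there are dozens of features.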

Correlation of numeric features with price

Observations:

We can see that cleaning_fee, guests_included, security_deposit, beds, bedrooms, bathrooms, accommodates, longitude, latitude, etc. have a strong relationship with price. We should select these columns for our model.
Surprisingly, the number of reviews and reviews per month have a negative relationship with price. We can omit such features.

Let’s see how bedrooms, bathrooms, and the number of beds affect the price. We can group by bathrooms and bedrooms, take the mean, and then create a pivot with price as the response feature. We can plot the result as a heatmap using the seaborn heatmap function and see whether an extra bathroom or bed costs the user more.
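A small sketch of the group-by and pivot step, using made-up rows (the prices here are illustrative, not values from the dataset):

```python
import pandas as pd

# Made-up listings: two bedroom counts crossed with two bathroom counts.
rooms = pd.DataFrame({
    "bedrooms":  [1, 1, 1, 2, 2, 2],
    "bathrooms": [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
    "price":     [80.0, 100.0, 130.0, 150.0, 190.0, 210.0],
})

# Mean price per (bathrooms, bedrooms) pair, pivoted so that bathrooms
# are rows and bedrooms are columns.
mean_price = (
    rooms.groupby(["bathrooms", "bedrooms"])["price"]
    .mean()
    .unstack("bedrooms")
)
print(mean_price)
```

The resulting table is exactly the shape a heatmap function expects: one cell per combination, with the mean price as the cell value.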

2. How do features affect the price of listings? Do experience and comfort cost more to the user?

Heatmap of bedrooms, bathrooms, beds with respect to price

Observations:

When a user seeks an extra bathroom, it costs about 40 USD more on average. Comfort costs more!
Interestingly, the pattern between beds, bedrooms, and price is not consistent, though it slightly indicates that an extra bed might cost extra.
Listings with zero beds or bedrooms are rare and not relevant, so they can be dropped from the dataset.

After removing rows with zero beds or bedrooms, I was left with a total of 3176 rows and 55 columns.

Categorical Features Analysis

Boston listings have many categorical features, and it was interesting to see which ones affect the price most. I looked at the pattern of price by neighborhood, property type, room type, bed type, cancellation policy, host_is_superhost, etc.

I used the seaborn boxplot function to visualize categorical columns against price. host_is_superhost, host_identity_verified, and instant_bookable take two values, t (true) and f (false), which are more naturally expressed as Yes and No or 1 and 0; I will transform them later. Amenities no doubt affect the price of a listing, but transforming and analyzing them would be tricky. I kept my study simple and plan a separate notebook to explore the textual columns further.

2. How do features affect the price of listings? Do experience and comfort cost more to the user?

Airbnb Boston Listing price distribution across neighborhoods

Airbnb Boston Listing distribution across neighborhoods

Relationship of price with various categorical features

Observations:

Jamaica Plain, South End, and Back Bay have the highest volume of listings; Leather District and Longwood Medical Area have the lowest.
Leather District, Chinatown, and Downtown are the most expensive, while Hyde Park, Mattapan, and Dorchester are the cheapest neighborhoods.
Guesthouses, boats, lofts, and villas are expensive compared to homes and apartments. Experience matters and costs more!
As expected, entire homes are the most expensive, while private rooms and shared rooms are the cheapest for users.
Listings that offer Real Beds are comparatively expensive, while Airbeds are cheaper. It’s all about comfort: as the comfort level goes up, the price also tends to go up. Again, comfort costs more!
Listings with high prices tend to have a strict cancellation policy, while cheaper listings tend to be very flexible.
Superhosts list their prices slightly higher than hosts who are not Superhosts. Again, experience matters, and it costs the user more.
Listings that offer instant booking are slightly cheaper than listings that don’t.
Listings where the host is verified are slightly higher in price than listings where the host is not verified.
Listings that require guest phone verification are comparatively higher in price than listings that don’t.

Data preparation for model

To build an accurate model, we should select features that are strongly correlated with the price and influence it most. The data exploration above shows that a few numeric features are strongly correlated with price, so I kept these features for the model. We have also seen how categorical features influence the price.

Categorical features such as host_is_superhost, instant_bookable, room_type, bed_type, cancellation_policy, etc. can be transformed into numeric features. Replacing each category with a numeric encoding lets the model work with it. The pandas get_dummies function comes in very handy for transforming such categorical features into binary (one-hot) vectors.
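Here is a minimal sketch of both encodings on a made-up frame: mapping the t/f flags to 1/0, and one-hot encoding a multi-valued column with get_dummies. Which columns the notebook encodes which way is an assumption.

```python
import pandas as pd

# Made-up categorical columns in the same format as the dataset.
cats = pd.DataFrame({
    "host_is_superhost": ["t", "f", "f"],
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
})

# Two-valued flags: map "t"/"f" directly to 1/0.
cats["host_is_superhost"] = cats["host_is_superhost"].map({"t": 1, "f": 0})

# Multi-valued categories: one binary column per category value.
encoded = pd.get_dummies(cats, columns=["room_type"], dtype=int)
print(encoded)
```

One-hot encoding avoids implying an order between categories, which a single integer code per category would do for a linear model.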

Finally, I prepared a data frame containing selected numeric and categorical features. I have selected the following features for the model based on the study above.

1. Which features affect the price most?

Selected Numerical Features:

price, latitude, longitude, accommodates, bedrooms, bathrooms, beds, security_deposit, cleaning_fee, guests_included, availability_30, availability_60, availability_90, availability_365, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_location, review_scores_value, calculated_host_listings_count.

Selected Categorical Features:

host_response_time, host_is_superhost, room_type, bed_type, neighbourhood_cleansed, cancellation_policy, property_type, host_identity_verified, instant_bookable, host_has_profile_pic, require_guest_profile_picture, require_guest_phone_verification.

Train and Evaluate model

Here comes the most exciting part of my study. Training and evaluating a model has always been interesting to me. I divided this activity into the following steps:

Extract the input (X) and output (y) features from the dataset.
Split the X and y samples into training and testing samples. The train_test_split function of sklearn does the job for us: it takes X and y and splits them into train and test datasets.
Instantiate and fit the model using the x_train and y_train samples. The model's fit method does the magic here by taking x_train and y_train as input.
Predict the price with the fitted model by providing the x_test samples. The model's predict method takes the x_test samples and returns the predictions.
Calculate the mean absolute error by providing y_test (the actual output) and the predictions. The mean_absolute_error function of sklearn calculates it.
Plot actual values vs predicted values and see how the model performed.

I used LinearRegression and RandomForestRegressor from sklearn as my models.
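The steps above can be sketched end to end as follows. The data is synthetic (three made-up features and a log-price target generated from them), and the 70/30 split, the random seeds, and the number of trees are assumptions, so the printed errors say nothing about the real dataset. The plotting step is left out to keep the sketch short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix and log-price target.
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.integers(1, 5, n),    # e.g. bedrooms
    rng.integers(1, 3, n),    # e.g. bathrooms
    rng.uniform(0, 1, n),     # an unrelated feature
])
y = 4.0 + 0.3 * X[:, 0] + 0.15 * X[:, 1] + rng.normal(0, 0.2, n)

# Step 2: hold out part of the data for testing.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Steps 3 to 5: fit each model, predict on the test set, score with MAE.
models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(
        n_estimators=100, random_state=42
    ),
}
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    prediction = model.predict(x_test)
    scores[name] = mean_absolute_error(y_test, prediction)
    print(f"{name}: MAE = {scores[name]:.3f}")
```

Because both models are scored on the same held-out rows with the same metric, their errors are directly comparable.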

3. Can we predict the price of a Boston Airbnb listing?
Answer: Yes, we can predict the price of a Boston Airbnb listing. RandomForestRegressor did a better job for me compared to LinearRegression: the mean absolute error is 0.31 for the random forest (RFR) versus 0.35 for linear regression (LR).

Distribution of actual and predicted values for Random Forest Regressor

Distribution of actual and predicted values for Linear Regressor

Conclusions

Finally, I trained and evaluated two models, and RandomForestRegressor was the winner: its mean absolute error is 0.31 versus 0.35 for linear regression. We have also seen the features that affect the price most. Experience and comfort cost more to the user.

I can see scope for doing better here; nevertheless, Kaggle and Udacity will give me many more opportunities to improve.

Here is the GitHub repository to refer to:

vishalpatidar00789/airbnb-boston-listings-price-predictor

This repository contains a python script and python notebook which requires a python environment to run. Boston Airbnb…

Here is the full Kaggle notebook to refer to:

Boston AirBnB Price Predictor

Explore and run machine learning code with Kaggle Notebooks | Using data from Boston Airbnb Open Data

Written by Vishal Patidar