What features affect the price of an Airbnb stay in Boston?

Vishal Patidar
12 min read | Feb 21, 2021
image source: Business Insider

Business Understanding

Airbnb is an American online marketplace for vacation rentals, based in San Francisco, California. It maintains and hosts the marketplace, which consumers access through its website or app. Hosts can list spare rooms, whole properties, or parts of them for rent, and travelers can arrange lodging, primarily homestays, and tourism experiences. Travelers looking for a stay search properties and rooms by neighborhood or location. Airbnb recommends the best price in the neighborhood, and users book the best deal.

Thanks to Kaggle and Udacity, I got a chance to analyze Airbnb listings for the city of Boston. The Boston Airbnb listings dataset has various features such as neighborhood, property type, bedrooms, bathrooms, beds, price, reviews, and ratings. It would be interesting to see which features affect the price in Boston and to draw conclusions from them. I was most interested in training and evaluating a model and seeing how well it predicts Airbnb prices in Boston.

My primary goal would be to answer the following questions:

1. Which features affect the price most?
2. How do features affect the price of listings? Do experience and comfort cost more to the user?
3. Can we predict the price of a listing in Boston Airbnb?

Data Understanding

To understand the dataset, we have to explore it. Python, pandas, NumPy, Matplotlib, seaborn, and scikit-learn (sklearn) made these data science tasks easy. Pandas is excellent for loading, cleaning, and transforming datasets. Seaborn is a handy package for visualizing the results of pandas transformations; it offers high-level functions to plot bar charts, histograms, distributions, box plots, and more. I used all of these packages to explore the data, in the following steps:

  1. Import packages and read Boston Airbnb datasets
  2. Data cleaning and transformation
  3. Numerical features analysis
  4. Categorical features analysis

Import packages

To explore and analyze the listings dataset, we import a few Python packages that do the heavy lifting. In this study, I imported NumPy for linear algebra and pandas for data processing, matplotlib and seaborn for plotting, and sklearn for training and evaluating models.

Reading listings data sets

After importing all the necessary packages, let’s load the Boston Airbnb listings dataset into memory. The pandas read_csv function makes reading CSV files easy: it takes the file path, along with other optional parameters, and returns a DataFrame object.
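
As a minimal sketch of this step (not the notebook's actual code), the example below reads a tiny in-memory CSV so that it runs anywhere. The sample rows are made up, and both the file name `listings.csv` in the comment and the use of `index_col="id"` are my assumptions; the latter is only inferred from the listing ids that appear as the index in the summary further down.

```python
import io

import pandas as pd

# A tiny stand-in for the real file so the example is self-contained.
# With the actual data you would pass a path instead, for example:
#   pd.read_csv("listings.csv", index_col="id")   # file name is an assumption
sample_csv = io.StringIO(
    "id,name,price,bedrooms\n"
    "101,Sunny room,$85.00,1\n"
    '102,Harbor loft,"$1,250.00",2\n'
    "103,Cozy studio,$60.00,\n"
)

# index_col="id" uses the listing id as the row index
listings = pd.read_csv(sample_csv, index_col="id")

print(listings.head())
```

Note that prices such as "$1,250.00" arrive as text rather than numbers, which is why a conversion step is needed later.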

Exploring dataset

Exploring datasets is one of my favorite data science activities. It reveals interesting and sometimes surprising facts about the features of the dataset, and it helps identify the features that most affect the target variable. Pandas offers some handy tools here: the shape attribute returns the number of rows and columns, and the info function prints the full list of columns with their data types and non-null counts. These help me understand the nature of the features. Let’s have a look at all the features of the listings dataset.

Int64Index: 3585 entries, 12147973 to 14504422
Data columns (total 94 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 listing_url 3585 non-null object
1 scrape_id 3585 non-null int64
2 last_scraped 3585 non-null object
3 name 3585 non-null object
4 summary 3442 non-null object
5 space 2528 non-null object
6 description 3585 non-null object
7 experiences_offered 3585 non-null object
8 neighborhood_overview 2170 non-null object
9 notes 1610 non-null object
10 transit 2295 non-null object
11 access 2096 non-null object
12 interaction 2031 non-null object
13 house_rules 2393 non-null object
14 thumbnail_url 2986 non-null object
15 medium_url 2986 non-null object
16 picture_url 3585 non-null object
17 xl_picture_url 2986 non-null object
18 host_id 3585 non-null int64
19 host_url 3585 non-null object
20 host_name 3585 non-null object
21 host_since 3585 non-null object
22 host_location 3574 non-null object
23 host_about 2276 non-null object
24 host_response_time 3114 non-null object
25 host_response_rate 3114 non-null object
26 host_acceptance_rate 3114 non-null object
27 host_is_superhost 3585 non-null object
28 host_thumbnail_url 3585 non-null object
29 host_picture_url 3585 non-null object
30 host_neighbourhood 3246 non-null object
31 host_listings_count 3585 non-null int64
32 host_total_listings_count 3585 non-null int64
33 host_verifications 3585 non-null object
34 host_has_profile_pic 3585 non-null object
35 host_identity_verified 3585 non-null object
36 street 3585 non-null object
37 neighbourhood 3042 non-null object
38 neighbourhood_cleansed 3585 non-null object
39 neighbourhood_group_cleansed 0 non-null float64
40 city 3583 non-null object
41 state 3585 non-null object
42 zipcode 3547 non-null object
43 market 3571 non-null object
44 smart_location 3585 non-null object
45 country_code 3585 non-null object
46 country 3585 non-null object
47 latitude 3585 non-null float64
48 longitude 3585 non-null float64
49 is_location_exact 3585 non-null object
50 property_type 3582 non-null object
51 room_type 3585 non-null object
52 accommodates 3585 non-null int64
53 bathrooms 3571 non-null float64
54 bedrooms 3575 non-null float64
55 beds 3576 non-null float64
56 bed_type 3585 non-null object
57 amenities 3585 non-null object
58 square_feet 56 non-null float64
59 price 3585 non-null object
60 weekly_price 892 non-null object
61 monthly_price 888 non-null object
62 security_deposit 1342 non-null object
63 cleaning_fee 2478 non-null object
64 guests_included 3585 non-null int64
65 extra_people 3585 non-null object
66 minimum_nights 3585 non-null int64
67 maximum_nights 3585 non-null int64
68 calendar_updated 3585 non-null object
69 has_availability 0 non-null float64
70 availability_30 3585 non-null int64
71 availability_60 3585 non-null int64
72 availability_90 3585 non-null int64
73 availability_365 3585 non-null int64
74 calendar_last_scraped 3585 non-null object
75 number_of_reviews 3585 non-null int64
76 first_review 2829 non-null object
77 last_review 2829 non-null object
78 review_scores_rating 2772 non-null float64
79 review_scores_accuracy 2762 non-null float64
80 review_scores_cleanliness 2767 non-null float64
81 review_scores_checkin 2765 non-null float64
82 review_scores_communication 2767 non-null float64
83 review_scores_location 2763 non-null float64
84 review_scores_value 2764 non-null float64
85 requires_license 3585 non-null object
86 license 0 non-null float64
87 jurisdiction_names 0 non-null float64
88 instant_bookable 3585 non-null object
89 cancellation_policy 3585 non-null object
90 require_guest_profile_picture 3585 non-null object
91 require_guest_phone_verification 3585 non-null object
92 calculated_host_listings_count 3585 non-null int64
93 reviews_per_month 2829 non-null float64
dtypes: float64(18), int64(14), object(62)
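
A summary like the one above can be produced roughly as follows. This is a sketch on a made-up four-column frame, not the real dataset; the column names are borrowed from the listing above purely for illustration.

```python
import pandas as pd

# Made-up miniature of the listings data
listings = pd.DataFrame(
    {
        "name": ["Sunny room", "Harbor loft", "Cozy studio"],
        "price": ["$85.00", "$1,250.00", "$60.00"],
        "bedrooms": [1.0, 2.0, None],
        "license": [None, None, None],
    }
)

# shape is an attribute (no parentheses): (rows, columns)
n_rows, n_cols = listings.shape
print(n_rows, n_cols)

# info() prints each column with its non-null count and dtype
listings.info()

# Non-null counts per column, emptiest columns first
non_null = listings.notna().sum().sort_values()
print(non_null)
```

Sorting the non-null counts makes nearly empty columns easy to spot.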

Observations:

  1. The Boston Airbnb listings dataset has 3585 rows and 94 columns, which is a lot of columns. They come in two kinds: numeric (int64 and float64) and object.
  2. Some columns, such as square_feet, license, and has_availability, have very few or zero non-null values. I removed these columns from the dataset.
  3. Columns such as host_url, medium_url, and picture_url are not useful for predicting price, so I dropped them from the dataset as well.
  4. Columns such as price, cleaning_fee, security_deposit, and host_response_rate are of type object but hold numeric values, so I converted them to a numeric type.

Data cleaning and transformations

Based on the above observations, I wrote a function that uses pandas high-level functions to drop columns that are not useful, drop columns with too few values, fill NA values, and convert some object-type columns to numeric. This step cleans the data and makes it easier to work with.
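
A cleaning function along these lines might look like the sketch below. The helper name `clean_listings`, the 50% non-null threshold, and the median fill are all assumptions made for illustration; the original function may differ on each of these points.

```python
import pandas as pd


def clean_listings(df, drop_cols=(), min_non_null=0.5):
    """Hypothetical cleaning helper; names and thresholds are assumptions."""
    # Drop columns judged not useful (URLs and the like)
    df = df.drop(columns=list(drop_cols), errors="ignore")

    # Drop columns where less than `min_non_null` of the rows have a value
    keep = df.notna().mean() >= min_non_null
    df = df.loc[:, keep].copy()

    # Currency text such as "$1,250.00" becomes 1250.0
    for col in ("price", "cleaning_fee", "security_deposit"):
        if col in df.columns:
            stripped = df[col].str.replace(r"[$,]", "", regex=True)
            df[col] = pd.to_numeric(stripped)

    # Percentage text such as "95%" becomes 95.0
    if "host_response_rate" in df.columns:
        df["host_response_rate"] = pd.to_numeric(
            df["host_response_rate"].str.rstrip("%")
        )

    # Fill remaining gaps in numeric columns with the column median
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    return df


# Made-up sample rows to exercise the helper
raw = pd.DataFrame(
    {
        "host_url": ["u1", "u2", "u3", "u4"],
        "square_feet": [None, None, None, 500.0],
        "price": ["$85.00", "$1,250.00", "$60.00", "$120.00"],
        "cleaning_fee": ["$10.00", None, "$30.00", "$20.00"],
        "host_response_rate": ["95%", "100%", None, "80%"],
    }
)
clean = clean_listings(raw, drop_cols=["host_url"])
print(clean)
```

Filling with the median is only one option; the right choice depends on the column, and a fee that is missing might just as reasonably be treated as zero.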

Numerical features analysis

With the cleaned dataset, let’s see how the price is distributed. To do that, I took the price column from the listings dataset and plotted its distribution as a histogram using the seaborn histplot function.

Price distribution of Boston Airbnb Listings

Observations:

  1. Most prices fall within about 220 USD, while a few listings are priced below 20 USD or above 500 USD. I removed these rare rows, since such prices are not typical of Airbnb listings.
  2. I can further take the log of the price to bring the distribution closer to normal when training the model.

Removing outliers

Based on the above observation, I kept only the rows where the price was between 20 and 500 USD. After removing the outliers, the price distribution looks better.

Improved price distribution of Boston Airbnb Listings

With the outliers removed, the price distribution is in better shape. The distribution of the log of the price is even better, so I will use this transformation later when preparing the train and test splits.
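
The filtering and log transform can be sketched as follows, on a handful of made-up prices. Whether the original bounds were inclusive is not stated, so the inclusive `between` used here is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up prices, including one very low and one very high value
prices = pd.DataFrame({"price": [10.0, 45.0, 99.0, 150.0, 220.0, 480.0, 1300.0]})

# Keep listings priced between 20 and 500 USD (both bounds inclusive here)
in_range = prices["price"].between(20, 500)
filtered = prices.loc[in_range].copy()

# The natural log compresses the long right tail of the distribution
filtered["log_price"] = np.log(filtered["price"])

print(len(prices), "rows before,", len(filtered), "rows after")
```

Passing either column to the seaborn histplot function would then draw the kind of histogram described in this section.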

Correlation Matrix of all Numerical columns

To plot a correlation matrix of the numeric variables, I selected the numeric columns from the listings dataset and used the pandas corr function to compute the correlation between every pair of variables. I then plotted the matrix as a heatmap using the seaborn heatmap function. Positive values mean two features tend to increase together, negative values mean one tends to decrease as the other increases, and values near zero indicate a weak relationship. This heatmap helped me select columns for the final train and test splits.
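
Here is a small sketch of that computation. The eight rows are invented so that bedrooms rise with price and review counts fall with it; they are not taken from the Boston data.

```python
import pandas as pd

# Invented values: bedrooms rise with price, review counts fall with it
df = pd.DataFrame(
    {
        "bedrooms": [1, 1, 2, 2, 3, 3, 4, 4],
        "number_of_reviews": [80, 60, 70, 40, 50, 20, 30, 10],
        "price": [70, 90, 120, 150, 170, 210, 230, 260],
    }
)

# Pairwise Pearson correlation between all numeric columns
corr = df.select_dtypes(include="number").corr()

# Correlation of every other feature with price, strongest positive first
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

The full `corr` matrix is what would be handed to the seaborn heatmap function, while `price_corr` is a compact way to rank features against the target.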

1. Which features affect the price most? It would be interesting to see how the features are related to price, and which ones have the strongest relationship with it.

Correlation of numeric features with price

Observations:

  1. cleaning_fee, guests_included, security_deposit, beds, bedrooms, bathrooms, accommodates, longitude, and latitude have a comparatively strong relationship with price. These columns should be selected for the model.
  2. Surprisingly, the number of reviews and reviews per month have a negative relationship with price. Such features can be omitted.

Let’s see how bedrooms, bathrooms, and the number of beds affect the price. We can group by bathrooms and bedrooms, take the mean price of each group, and pivot the result so that price is the response value. Plotting the result with the seaborn heatmap function shows whether an extra bathroom or bed costs the user more.
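
The group-and-pivot step could be sketched like this, using six made-up listings. Using `unstack` to pivot is one of several equivalent ways to do it, not necessarily the one used in the notebook.

```python
import pandas as pd

# Six made-up listings
df = pd.DataFrame(
    {
        "bedrooms": [1, 1, 1, 2, 2, 2],
        "bathrooms": [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
        "price": [100.0, 120.0, 150.0, 180.0, 220.0, 240.0],
    }
)

# Mean price per (bathrooms, bedrooms) pair, then bedrooms moved to columns
mean_price = (
    df.groupby(["bathrooms", "bedrooms"])["price"]
    .mean()
    .unstack("bedrooms")
)
print(mean_price)
```

Each row of `mean_price` is a bathroom count and each column a bedroom count, which is the layout a heatmap expects.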

2. How do features affect the price of listings? Do experience and comfort cost more to the user?

Heatmap of bedrooms, bathrooms, beds with respect to price

Observations:

  1. An extra bathroom costs the user about 40 USD more on average. Comfort costs more!
  2. Interestingly, the pattern between beds, bedrooms, and price is not consistent, though it slightly indicates that an extra bed might cost extra.
  3. Listings with zero beds or bedrooms are rare and not relevant, so they can be dropped from the dataset.

After removing rows with zero beds or bedrooms, I was left with 3176 rows and 55 columns.

Categorical Features Analysis

Boston listings have many categorical features, and it was interesting to see which ones affect the price most. I looked at the pattern of price by neighborhood, property type, room type, bed type, cancellation policy, host_is_superhost, and so on.

I used the seaborn boxplot function to visualize the categorical columns against price. host_is_superhost, host_identity_verified, and instant_bookable take two values, t (true) and f (false), which are more naturally expressed as Yes and No or 1 and 0; I will transform them later. Amenities no doubt affect the price of a listing, but transforming and analyzing that column would be tricky. I kept this study simple and plan a separate notebook to explore the textual columns further.

2. How do features affect the price of listings? Do experience and comfort cost more to the user?

Airbnb Boston Listing price distribution across neighborhoods
Airbnb Boston Listing distribution across neighborhoods
Relationship of price with various categorical features

Observations:

  1. Jamaica Plain, South End, and Back Bay have the highest volume of listings; Leather District and Longwood Medical Area have the lowest.
  2. Leather District, Chinatown, and Downtown are the most expensive neighborhoods, while Hyde Park, Mattapan, and Dorchester are the cheapest.
  3. Guesthouses, boats, lofts, and villas are expensive compared to homes and apartments. Experience matters and costs more!
  4. As expected, entire homes are the most expensive, while private rooms and shared rooms are the cheapest for users.
  5. Listings that offer real beds are comparatively expensive, while airbeds are cheaper. It’s all about comfort: as the comfort level goes up, the price tends to go up too. Again, comfort costs more!
  6. Listings with high prices tend to have a strict cancellation policy, while cheaper listings tend to be flexible.
  7. Superhosts list their properties at slightly higher prices than hosts who are not Superhosts. Again, experience matters and costs the user more.
  8. Listings that offer instant booking are slightly cheaper than listings that don’t.
  9. Listings where the host is verified are slightly more expensive than listings where the host is not verified.
  10. Listings that require guest phone verification are comparatively more expensive than listings that don’t.

Data preparation for model

To build an accurate model, we should select the features that are highly correlated with price and influence it most. The data exploration above showed that a few numeric features are highly correlated with price, so I kept these features for the model. We have also seen how the categorical features influence the price.

Categorical features such as host_is_superhost, instant_bookable, room_type, bed_type, and cancellation_policy can be transformed into numeric features. Replacing each category value with a number lets the model work with it. The pandas get_dummies function comes in handy for transforming such categorical features into binary vectors.
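
As a rough illustration on three made-up rows, the two-valued t/f columns can be mapped to 1 and 0, and a multi-valued column such as room_type can be expanded with get_dummies. The exact set of columns encoded in the notebook is not shown, so the choice below is an assumption.

```python
import pandas as pd

# Three made-up listings
df = pd.DataFrame(
    {
        "host_is_superhost": ["t", "f", "f"],
        "room_type": ["Entire home/apt", "Private room", "Shared room"],
        "price": [200.0, 90.0, 45.0],
    }
)

# Two-valued flag: "t" becomes 1 and "f" becomes 0
df["host_is_superhost"] = df["host_is_superhost"].map({"t": 1, "f": 0})

# One indicator column per room type; the original room_type column is replaced
encoded = pd.get_dummies(df, columns=["room_type"], dtype=int)
print(encoded.columns.tolist())
```

Each new column is named after the original column and the category, for example `room_type_Private room`.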

Finally, I prepared a data frame containing selected numeric and categorical features. I have selected the following features for the model based on the study above.

1. Which features affect the price most?

Selected Numerical Features:

price, latitude, longitude, accommodates, bedrooms, bathrooms, beds, security_deposit, cleaning_fee, guests_included, availability_30, availability_60, availability_90, availability_365, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_location, review_scores_value, calculated_host_listings_count.

Selected Categorical Features:

host_response_time, host_is_superhost, room_type, bed_type, neighbourhood_cleansed, cancellation_policy, property_type, host_identity_verified, instant_bookable, host_has_profile_pic, require_guest_profile_picture, require_guest_phone_verification.

Train and Evaluate model

Here comes the most exciting part of my study: training and evaluating a model has always been interesting to me. I divided this activity into the following steps:

  1. Extract the input (X) and output (y) features from the dataset.
  2. Split the X and y samples into training and testing samples. The train_test_split function of sklearn does the job: it takes X and y and splits them into train and test sets.
  3. Instantiate the model and fit it using the x_train and y_train samples. The model’s fit method does the learning here, taking x_train and y_train as input.
  4. Predict prices by passing the x_test samples to the fitted model. The model’s predict method takes x_test and returns the predictions.
  5. Calculate the mean absolute error from y_test (the actual output) and the predictions, using sklearn’s mean_absolute_error.
  6. Plot actual versus predicted values to see how the model performed.

I used LinearRegression and RandomForestRegressor from sklearn as my models.
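
The steps above can be sketched end to end as follows. The data here is synthetic (two made-up features standing in for bedrooms and accommodates), and the test size, random seeds, and number of trees are assumptions, so the printed errors will not match the figures reported in this article.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the Boston listings
rng = np.random.default_rng(42)
n = 300
X = np.column_stack(
    [
        rng.integers(1, 5, size=n),  # stand-in for bedrooms
        rng.integers(1, 9, size=n),  # stand-in for accommodates
    ]
)
price = 40 + 35 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 15, size=n)
price = np.clip(price, 20, None)  # keep prices positive before taking logs
y = np.log(price)  # the model is trained on log price

# Step 2: hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(
        n_estimators=100, random_state=42
    ),
}

# Steps 3 to 5: fit, predict, and score each model
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = mean_absolute_error(y_test, predictions)
    print(f"{name}: MAE on log price = {scores[name]:.3f}")
```

Because the target is the log of the price, the error is in log units; applying `np.exp` to the predictions converts them back to USD before plotting actual against predicted values.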

3. Can we predict the price of a listing in Boston Airbnb?

Answer: Yes, we can predict the price of a listing in Boston Airbnb. RandomForestRegressor (RFR) did a better job for me than LinearRegression (LR): the mean absolute error is 0.31 for RFR versus 0.35 for LR.

Distribution of actual and predicted values for Random Forest Regressor
Distribution of actual and predicted values for Linear Regressor

Conclusions

Finally, I trained and evaluated two models, and RandomForestRegressor was the winner: its mean absolute error was 0.31, versus 0.35 for LinearRegression. We have also seen which features affect the price most, and that experience and comfort cost the user more.

There is room to do better here; nevertheless, Kaggle and Udacity will give me many more opportunities to improve.

Here is the GitHub repository to refer to:

Here is the full Kaggle notebook to refer to:
