A Data-Driven Story of Airbnb

Introduction

Airbnb, Inc. is a company based in San Francisco that operates an online marketplace and hospitality service. It allows people to lease or rent short-term lodging, including holiday cottages, apartments, homestays, hostel beds, and hotel rooms, and to make reservations at restaurants.

I was curious about how we could use machine learning to answer questions like “Can we predict the price of a home based on listing information?”, “Can we predict a positive or negative review based on a user’s comments?” or “What is the busiest month in Seattle?”. Let’s dive into this!

Data

First things first: we need data. I decided to investigate Airbnb in Boston. Fortunately, the data was available for free 🙂 on Kaggle. It contains listing information and users’ review comments for Airbnb homes in Boston.

Can we predict the price of a home based on listing information?

To answer this question, let us start by investigating the data set. I started by computing some basic statistics, and I also used the Airbnb data set for Seattle so that I could make some comparisons. You can find the code here. I noticed that the cheapest home was a private room in Boston, which costs $10. Also, the homes with the highest ratings were in Seattle, with prices ranging from $250 to $500.
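Here is a minimal sketch of the kind of basic statistics described above, using only the Python standard library. The rows and the field names (`city`, `room_type`, `price`) are made up for illustration; the real listings files have many more columns, and this is not the code linked in the post.

```python
from statistics import mean

# Made-up sample rows for illustration only (not real Airbnb data).
listings = [
    {"city": "Boston", "room_type": "Private room", "price": 10.0},
    {"city": "Boston", "room_type": "Entire home/apt", "price": 220.0},
    {"city": "Seattle", "room_type": "Entire home/apt", "price": 300.0},
    {"city": "Seattle", "room_type": "Shared room", "price": 45.0},
]


def cheapest(rows):
    """Return the row with the lowest price."""
    return min(rows, key=lambda row: row["price"])


def mean_price_by_city(rows):
    """Return the average price per city as {city: mean price}."""
    prices_by_city = {}
    for row in rows:
        prices_by_city.setdefault(row["city"], []).append(row["price"])
    return {city: mean(prices) for city, prices in prices_by_city.items()}


cheapest_row = cheapest(listings)
city_means = mean_price_by_city(listings)
print(cheapest_row["city"], cheapest_row["room_type"], cheapest_row["price"])
print(city_means)
```

Grouping by city first makes it easy to compare Boston and Seattle side by side with the same helper.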

OK, let’s now dive into our main focus.

The data set was really messy, so I had to do some cleaning first. I then used LightGBM, a machine learning library that is popular in data science competitions. I was amazed to achieve a mean error of 0.011, which is really good. The code can be found here.
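To give an idea of what the cleaning and the error measurement can look like, here is a small sketch. It rests on two assumptions that the post does not spell out: that prices arrive as strings such as "$1,250.00", and that the reported "mean error" is a mean absolute error. The helper names `parse_price` and `mean_absolute_error` are hypothetical.

```python
def parse_price(raw):
    """Convert a price string such as "$1,250.00" to a float.

    Returns None for missing or empty values.
    Assumption: prices are stored as strings with "$" and "," characters.
    """
    if raw is None:
        return None
    cleaned = raw.strip().replace("$", "").replace(",", "")
    return float(cleaned) if cleaned else None


def mean_absolute_error(actual, predicted):
    """Average absolute difference between true and predicted values.

    Assumption: this is one possible reading of "mean error" in the post.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


prices = [parse_price(p) for p in ["$85.00", "$1,250.00", ""]]
error = mean_absolute_error([100.0, 200.0], [110.0, 190.0])
print(prices, error)
```

The size of an error like this depends on the scale of the target (raw dollars, normalized, or log-transformed), so the number is only meaningful alongside the scale the model was trained on.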

I then wanted to know which feature was important to the algorithm in predicting the price. This can be summarized in this chart.

As we can see, the most important feature in predicting price is room_type, i.e. whether the room is an apartment, a house, etc.

Can we predict a positive or negative review based on a user’s comments?

To answer this question, I used the review comments from the data set to predict the users’ rating scores. To separate the reviews into positive and negative, I labeled rating scores below 80% as negative and rating scores above 80% as positive. After pre-processing the comments, I used an LSTM (long short-term memory) network, a deep learning model widely used for text data. The code can be found here. The plot of the loss function gave the following.

From the graph, we see that the training loss decreases over time but the test loss increases, meaning the model fits the training data well but does not generalize to new data; in other words, it overfits. Possible improvements to this model could be:

  • Get more training data. The data for negative reviews in particular was small, so adding more negative reviews should help reduce the test loss.
  • We could also train a less complex model on the data.
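The 80% labeling rule from this section, and the class imbalance mentioned in the first bullet, can be sketched as follows. The scores are made up, and treating a score of exactly 80 as positive is my assumption, since the rule in the text only covers scores below and above 80.

```python
from collections import Counter

THRESHOLD = 80  # rating score cut-off used in the post


def label_review(score, threshold=THRESHOLD):
    """Map a 0-100 rating score to "positive" or "negative".

    Assumption: a score exactly at the threshold counts as positive.
    """
    return "positive" if score >= threshold else "negative"


def class_balance(scores):
    """Count how many reviews fall into each label."""
    return Counter(label_review(score) for score in scores)


# Made-up scores for illustration only.
sample_scores = [95, 100, 72, 88, 80, 91, 60, 97]
balance = class_balance(sample_scores)
print(balance)
```

Counting the labels before training is a quick way to see how few negative examples the model has to learn from.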

What is the busiest month in Seattle?

Here I used the date and availability of each home, and I considered a home busy on a given date if it was not available. I then plotted the percentage of busy homes against the date, and here is what I got.

As you can see, the busiest period was January 2016. The code can be found here.
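The "busy means not available" idea can be sketched like this. I am assuming calendar rows of the form (listing id, ISO date, availability flag), with "t" for available and "f" for not available; the rows below are made up, and the function name `busy_percent_by_month` is hypothetical.

```python
from collections import defaultdict
from datetime import date

# Made-up calendar rows: (listing_id, ISO date, available flag).
# Assumption: "t" means available and "f" means not available (busy).
calendar_rows = [
    (1, "2016-01-04", "f"),
    (1, "2016-01-05", "f"),
    (2, "2016-01-04", "t"),
    (2, "2016-01-05", "f"),
    (1, "2016-02-01", "t"),
    (2, "2016-02-01", "f"),
]


def busy_percent_by_month(rows):
    """Return {(year, month): percentage of listing-days that were busy}."""
    totals = defaultdict(int)
    busy = defaultdict(int)
    for _listing_id, day, available in rows:
        parsed = date.fromisoformat(day)
        key = (parsed.year, parsed.month)
        totals[key] += 1
        if available == "f":
            busy[key] += 1
    return {key: 100.0 * busy[key] / totals[key] for key in sorted(totals)}


monthly = busy_percent_by_month(calendar_rows)
busiest = max(monthly, key=monthly.get)
print(monthly, busiest)
```

Aggregating by (year, month) rather than by day smooths out single-day spikes, which makes the busiest period easier to read off.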

To Recap:

  • We used LightGBM (a machine learning library) to predict the price of a room based on the listing information. We found that the most relevant feature in predicting the price was the room type (i.e. whether the room is an apartment, a house, etc.).
  • We performed some basic statistics on the listing information from Boston and Seattle. We found that the homes with the highest rating scores were in Seattle, with prices ranging from $250 to $500.
  • We saw that users’ rating scores can be predicted from their review comments, although the current model overfits and needs more data to generalize.
  • And we also saw that the busiest period in Seattle was January 2016.

Hope you enjoyed the read. See you soon for more informative posts.

I want to thank Udacity for the wonderful job they are doing. This blog post was written in partial fulfillment of the Data Science Nanodegree program.

A Data-Driven Story of Airbnb was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.