Who Really Owns the Airbnbs You’re Booking? — Marketing Perception vs Data Analytics Reality | by Anna Gordun Peiro | Oct, 2024

I’ve written a guide on how to perform this data analysis and generate the graph in the previous section. I’m using the dataset from the city of Barcelona to illustrate the different data analysis steps.

After downloading the listings.csv.gz files from Inside Airbnb, I opened them in Python without decompressing them. I’m using Polars for this project just to become familiar with its commands (you can use Pandas if you prefer):

import polars as pl

# Polars reads the gzip-compressed CSV directly, no need to decompress first
df = pl.read_csv('listings.csv.gz')

Here are the cities that I used for the analysis and the number of listings in each dataset:

Cities used for analysis. Image by author

If you like this packed bubble plot, make sure to check out my latest article:

A first look into the dataset, and this is what it looks like:

Dataframe snippet. Image by author

The content is based on the publicly available data at each listing URL: one row per listing and 75 columns that cover everything from description, neighbourhood, and number of bedrooms to ratings, minimum number of nights, and price.

As mentioned earlier, although this dataset has limitless potential, I’ll focus solely on multi-property ownership.

After downloading the data, there’s little data cleaning to do:

1- Filtering “property_type” to only “Entire rental unit” to filter out room listings.

2- Filtering “has_availability” to “t” (True) to remove non-active listings.

import polars as pl

# I renamed the listings.csv.gz file to cityname_listings.csv.gz
df = pl.read_csv('barcelona_listings.csv.gz')
df = df.filter(
    (pl.col('property_type') == "Entire rental unit") & (pl.col('has_availability') == "t")
)

For data processing, I transformed the original data into a different structure that would let me quantify how many listings in the dataset are owned by the same host. Or, rephrased: what share of the city’s listings are owned by multi-property hosts. This is how I approached it:

  • Performed a value_counts on the “host_id” column to count how many listings are owned by the same host id.
  • Created 6 different bins to quantify multi-property levels: 1 property, 2 properties, +2 properties, +5 properties, +10 properties and +100 properties.
  • Performed a polars.cut to bin the count of listings per host_id (a continuous value) into my discrete categories (bins).
host_count = df['host_id'].value_counts().sort('count')
breaks = [1, 2, 5, 10, 100]
labels = ['1', '2', '+2', '+5', '+10', '+100']
host_count = host_count.with_columns(
    pl.col("count").cut(breaks=breaks, labels=labels, left_closed=False).alias("binned_counts")
)
host_count

This is the result: host_id, number of listings, and bin category. The data shown corresponds to the city of Barcelona.

Binned listing counts. Image by author

Please take a second to digest the fact that host id 346367515 (last on the list) owns 406 listings. Is the Airbnb community spirit starting to feel like an illusion at this point?

To get a city-wide view, independent of the host_id, I joined the host_count dataframe with the original df to match each listing to the correct multi-property label. After that, all that’s left is a simple value_counts() on each property label to get the total number of listings that fall under that category.

I also added a share column to quantify the weight of each label:

df = df.join(host_count, on='host_id', how='left')

graph_data = df['binned_counts'].value_counts().sort('binned_counts')
total_sum = graph_data['count'].sum()
graph_data = graph_data.with_columns(
    ((pl.col('count') / total_sum) * 100).round().cast(pl.Int32).alias('share')
)

Final data result. Image by author

Don’t worry, I’m a visual person too; here’s the graph representation of the table:

import plotly.express as px

palette = ["#537c78", "#7ba591", "#cc222b", "#f15b4c", "#faa41b", "#ffd45b"]

# I wrote the text annotations manually because I like adjusting the x position
text_annotation = ['19%', '7%', '10%', '10%', '37%', '17%']
text_annotation_xpos = [17, 5, 8, 8, 35, 15]
text_annotation_ypos = [5, 4, 3, 2, 1, 0]
annotations_text = [
    dict(x=x, y=y, text=text, showarrow=False,
         font=dict(color="white", weight='bold', size=20))
    for x, y, text in zip(text_annotation_xpos, text_annotation_ypos, text_annotation)
]

fig = px.bar(graph_data, x="share", y='binned_counts', orientation='h', color='binned_counts',
             color_discrete_sequence=palette,
             category_orders={"binned_counts": ["1", "2", "+2", "+5", "+10", "+100"]}
             )
fig.update_layout(
    height=700,
    width=1100,
    template='plotly_white',
    annotations=annotations_text,
    xaxis_title="% of listings",
    yaxis_title="Number of listings owned by the same host",
    title=dict(text="Prevalence of multi-property in Barcelona's airbnb listings<br><sup>% of airbnb listings in Barcelona owned by multiproperty hosts</sup>", font=dict(size=30)),
    font=dict(family="Franklin Gothic"),
    legend=dict(
        orientation='h',
        x=0.5,
        y=-0.125,
        xanchor='center',
        yanchor='bottom',
        title="Number of properties per host"
    ))

fig.update_yaxes(anchor='free', shift=-10,
                 tickfont=dict(size=18, weight='normal'))

fig.show()

Multi-property in Barcelona’s Airbnb. Image by author

Back to the question at the beginning: how can I conclude that the Airbnb essence is lost in Barcelona?

  • Most listings (64%) are owned by hosts with more than 5 properties. A significant 17% of listings are managed by hosts who own more than 100 properties.
  • Only 26% of listings belong to hosts with just 1 or 2 properties.
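These headline figures follow directly from the share column in the table above; here is a quick arithmetic check, with the Barcelona shares hard-coded for illustration:

```python
# Barcelona's share per bin, read off the chart above
shares = {'1': 19, '2': 7, '+2': 10, '+5': 10, '+10': 37, '+100': 17}

# Listings owned by hosts with more than 5 properties: bins +5, +10 and +100
more_than_5 = shares['+5'] + shares['+10'] + shares['+100']

# Listings owned by hosts with just 1 or 2 properties
small_hosts = shares['1'] + shares['2']

print(more_than_5, small_hosts)  # 64 26
```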

If you wish to analyse more than one city at the same time, you can use a function that performs all the cleaning and processing at once:

import polars as pl

def airbnb_per_host(file, ptype, neighbourhood):
    df = pl.read_csv(file)
    if neighbourhood:
        df = df.filter((pl.col('property_type') == ptype) &
                       (pl.col('neighbourhood_group_cleansed') == neighbourhood) &
                       (pl.col('has_availability') == "t"))
    else:
        df = df.filter((pl.col('property_type') == ptype) & (pl.col('has_availability') == "t"))

    host_count = df['host_id'].value_counts().sort('count')
    breaks = [1, 2, 5, 10, 100]
    labels = ['1', '2', '+2', '+5', '+10', '+100']
    host_count = host_count.with_columns(
        pl.col("count").cut(breaks=breaks, labels=labels, left_closed=False).alias("binned_counts"))

    df = df.join(host_count, on='host_id', how='left')

    graph_data = df['binned_counts'].value_counts().sort('binned_counts')
    total_sum = graph_data['count'].sum()
    graph_data = graph_data.with_columns(((pl.col('count') / total_sum) * 100).alias('share'))

    return graph_data

And then run it for every city in your folder:

import os
import glob

# please remember that I renamed my files to: cityname_listings.csv.gz
df_combined = pl.DataFrame({
    "binned_counts": pl.Series(dtype=pl.Categorical),
    "count": pl.Series(dtype=pl.UInt32),
    "share": pl.Series(dtype=pl.Float64),
    "city": pl.Series(dtype=pl.String)
})

city_files = glob.glob("*.csv.gz")

for file in city_files:
    file_name = os.path.basename(file)
    city = file_name.split('_')[0]
    print('Scanning started for --->', city)

    data = airbnb_per_host(file, 'Entire rental unit', None)

    data = data.with_columns(pl.lit(city.capitalize()).alias("city"))

    df_combined = pl.concat([df_combined, data], how="vertical")

print('Finished scanning of ' + str(len(city_files)) + ' cities')

Check out my GitHub repository for the code to build this graph, as it’s a bit too long to attach here:

Final comparison graph. Image by author

And that’s it!

Let me know your thoughts in the comments and, in the meantime, I wish you a very fulfilling and authentic Airbnb experience on your next stay 😉

All images and code in this article are by the author.