Advanced Plotly with code series (Part 9): To dot, to slope or to stack?


Photo by Steffen Petermann on Unsplash (a bubble's added by me)

The statue can be found in Weimar – Park an der Ilm (but Shakespeare obviously doesn’t speak)

Welcome to the ninth post in my “Plotly with code” series! If you missed the first one, you can check it out in the link below, or browse through my “one post to rule them all” to follow along with the entire series or other topics I have previously written about.

Advanced Plotly with Code Series (Part 1): Alternatives to Bar Charts

A short summary on why I’m writing this series

My go-to tool for creating visualisations is Plotly. It’s incredibly intuitive, from layering traces to adding interactivity. However, whilst Plotly excels at functionality, it doesn’t come with a “data journalism” template that offers polished charts right out of the box.

That’s where this series comes in – I’ll be sharing how to transform Plotly’s charts into sleek, professional-grade charts that meet data journalism standards.

PS: All images are authored by myself unless otherwise specified.

Intro – Clustered columns cluster your brain

How many times have you used multiple colours to represent multiple categories in a bar chart? I bet quite a few…

These multiple colours mixed with multiple categories feel like you are clustering bars together. Clustering doesn’t seem like an inviting word when you are communicating insights. Sure, clustering is useful when you are analysing patterns, but when you communicate what you found from these patterns, you should probably be looking to remove, clean and declutter (decluttering is my golden rule after having read Cole Nussbaumer’s Storytelling with Data book).

In Advanced Plotly with code series (Part 4): Grouping bars vs multi-coloured bars, we already covered a scenario where using colours to represent a 3rd dimension made it quite difficult for a reader to understand. The difference we will be covering in this blog is when the cardinality of these categories explodes. For example, in the Part 4 blog, we represented countries with continents, which are really easy to mentally map. However, what happens if we try to represent food categories with countries?

Now, that is a different problem.

What will we cover in this blog?

  1. Scenario 1. To subplot or to stack bars? That’s the question.
  2. Scenario 2. How on earth to plot 5 countries against 7 types of food?
  3. Scenario 3. Clustered charts fail to convey change over 2 periods for 2 groups.

PS: As always, code and links to my GitHub repository will be provided along the way. Let’s get started!

Scenario 1: Visualisation techniques… To subplot or to stack bars? That’s the question.

Imagine you are a consultant presenting at a conference about how the workforce is distributed in each country. You collected the required data, which might look like the screenshot below.

Source: ILOSTAT data explorer

You want the chart to show the percentage of each sector by country. You don’t think too much about the chart design, and use the default output from plotly.express.

What a mess to glance at...

Where do I think this plot has issues?

The first thing to say is how poor this chart is at telling an interesting story. Some clear issues are:

  • You have to use a key, which slows down understanding. Back and forth we go, between bar and key.
  • You don’t have much space for data labels. I have tried adding them, but the bars are too narrow, and the labels are rotated. So either you are the exorcist child or you keep trying to decipher a value based on the y-axis or the provided gridlines.
  • There are just too many bars, in no particular order. You can’t rank clustered bars in the same way as you can rank bars showing a single variable. Do you rank by the value of a specific “sector” category? Do you order alphabetically by the x-axis or the legend categories?
  • Comparing the top of the bars is nearly impossible. Say that you want to compare if Vietnam has more workforce in construction than Spain… was it a fun exercise to look for the country, then figure out that construction is the purple bar and somehow look across the chart? Nope. I mean, if I hadn’t added the labels (even if rotated), you would probably not have been able to tell the difference.
Summary of issues

Let’s see if there are better alternatives to the chart above.

Alternative #1. Using subplots.

If the story you want to tell is one where the focus is on comparing which countries have the highest percentage per sector, then I would recommend separating categories into subplots. In Advanced Plotly with code series (Part 8): balance dominant bar chart categories, we already introduced the use of subplots. The scenario was completely different, but it still showed how effective subplots can be.

Tweaking the chart above could render the following.

Why do I think this plot is better than the clustered bar chart?

  1. The horizontal bar charts now allow for a more legible rendering of the data labels.
  2. There is no need for colour coding. Thanks to the subplot titles, one can easily understand which category is being displayed.

A word of caution

  1. Separating by subplots means that you have to choose how to order the countries in the y-axis. This is done by choosing a specific category. In this case, I chose “agriculture”, which means that the other 3 categories don’t preserve their natural order, making comparisons difficult.
  2. The magnitude of the bars may (or may not) be kept across all subplots. In this case, we didn’t normalise the magnitudes. What I mean is that the range of values – from min to max – is determined for each individual subplot. The consequence is that you have bars of value 12 (see “construction” for India) rendered much larger than a bar of value 30 (see “services” for India).
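
One possible way to address the second caveat is to give every subplot the same x-axis range. The snippet below is only a minimal sketch and not the code behind the charts in this post: the numbers are made up for illustration and `shared_axis_range` is a hypothetical helper name.

```python
# Illustrative values only (these are not the ILOSTAT figures used in the charts).
percentages = {
    "agriculture": {"India": 44.0, "Spain": 4.0},
    "construction": {"India": 12.0, "Spain": 7.0},
    "services": {"India": 30.0, "Spain": 76.0},
}


def shared_axis_range(data, padding=0.1):
    """Return one [min, max] range wide enough for the largest value of any category."""
    largest = max(
        value for by_country in data.values() for value in by_country.values()
    )
    # A little padding leaves room for the data labels next to the longest bar.
    return [0, largest * (1 + padding)]


x_range = shared_axis_range(percentages)
print(x_range)
```

On a figure created with `make_subplots`, calling `fig.update_xaxes(range=x_range)` without a row or column applies the range to every subplot, so a bar of 12 is always drawn shorter than a bar of 30. The trade-off is that categories with small values end up with short bars.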
Summary of improvements (and some caveats)

However, even with its flaws, the subplot chart is much more legible than the first clustered bar chart.

Alternative #2. Using a stacked bar chart.

Now, say that what you want to convey is how skewed (or not) the distribution of the workforce by country is. As we saw in the subplot alternative, this is a bit difficult, as each bar chart is rendered differently by each subplot. The subplot alternative was great to answer that “India has the largest % of their workforce dedicated to construction, with Nigeria having the smallest”, but it is much more difficult to answer that “construction and services represent 77% of India’s workforce”.

Check the stacked bar below and decide which one you prefer.


Why do I think this plot is better than the clustered bar chart?

  1. Stacked bar charts help pin a story for each of the rows in the y-axis. Now, even what was relatively easy to understand in the clustered bar chart becomes much easier to understand in the stacked one.
  2. The horizontal bar charts now allow for a more legible rendering of the data labels.
  3. Because you are dealing with percentages, stacked bar charts can really convey that numbers add up to 100%.
  4. Finally, all bars have the correct magnitudes.
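
Point 3 only holds if each country’s shares really do add up to 100%, so a quick check before stacking can be worthwhile. Here is a small sketch of such a check, under assumptions: the shares are invented, “industry” is a placeholder sector name, and `countries_not_summing_to_100` is a hypothetical helper.

```python
import math

# Made-up shares per country, one value per sector (illustrative only).
shares = {
    "India": {"agriculture": 44.0, "industry": 14.0, "construction": 12.0, "services": 30.0},
    "Spain": {"agriculture": 4.0, "industry": 13.0, "construction": 7.0, "services": 76.0},
}


def countries_not_summing_to_100(data, tolerance=0.5):
    """List the countries whose sector shares do not add up to 100% (within a tolerance)."""
    return [
        country
        for country, by_sector in data.items()
        if not math.isclose(sum(by_sector.values()), 100.0, abs_tol=tolerance)
    ]


print(countries_not_summing_to_100(shares))
```

The tolerance is there because published percentages are usually rounded, so totals such as 99.9 or 100.1 should not be flagged.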

A word of caution

  1. Similarly to the subplot alternative, you have to choose how to order the countries in the y-axis.
  2. A colour-coded legend is required, so some extra processing time is needed.
Summary of improvements (and some caveats)

Again, despite its issues, I hope you would agree with me that the stacked bar chart is much more legible than the first clustered bar chart.

Tips on how to create these 2 plots

Creating a subplot chart

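  • 1st, you’ll of course need to create a subplot object
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# One column per category; all subplots share the country axis
fig = make_subplots(
    rows=1, cols=4,
    shared_yaxes=True,
    subplot_titles=list_of_categories,
)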
  • 2nd, simply loop through each category and plot each bar chart on the specific “column trace”
# Inside the loop: i is the 1-based column index and aux_df holds the rows of the current category
fig.add_trace(
    go.Bar(
        y=aux_df['country'],
        x=aux_df['percentage'],
        marker=dict(color='darkblue'),
        text=aux_df['percentage'].round(0),
        textposition='auto',
        orientation='h',
        showlegend=False,
    ),
    row=1, col=i
)
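
The snippet above leaves the loop itself out. As a rough sketch of how that loop could look, the version below builds each trace as a plain dictionary so that it runs without any dependencies. The rows are made up, `bar_traces_by_column` is a hypothetical helper, and the actual code in the repository may well be organised differently.

```python
# Made-up rows in long format: (country, sector, percentage).
rows = [
    ("India", "agriculture", 44.0), ("Spain", "agriculture", 4.0),
    ("India", "services", 30.0), ("Spain", "services", 76.0),
]
list_of_categories = sorted({sector for _, sector, _ in rows})


def bar_traces_by_column(rows, categories):
    """Build one horizontal bar trace (as a plain dict) per category.

    Returns (column, trace) pairs. Columns start at 1, as subplot columns do.
    """
    pairs = []
    for i, category in enumerate(categories, start=1):
        subset = [(country, pct) for country, sector, pct in rows if sector == category]
        trace = dict(
            type="bar",
            y=[country for country, _ in subset],
            x=[pct for _, pct in subset],
            text=[round(pct) for _, pct in subset],
            textposition="auto",
            orientation="h",
            marker=dict(color="darkblue"),
            showlegend=False,
        )
        pairs.append((i, trace))
    return pairs


for col, trace in bar_traces_by_column(rows, list_of_categories):
    print(col, trace["y"], trace["x"])
```

Plotly accepts a dictionary with a `type` key wherever it accepts a trace object, so each pair can be added with `fig.add_trace(trace, row=1, col=col)`. Starting the enumeration at 1 avoids the off-by-one error of passing a 0-based index as the column.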

Creating a stacked bar chart

  • 1st, decide the order of the main category
df = df.sort_values(['sector', 'percentage'], ascending=[True, True])
  • 2nd, loop through each category. Extract the information by country and add a trace.
for sector in df['sector'].unique():
    aux_df = df[df['sector'] == sector].copy()

    fig.add_trace(
        go.Bar(
            x=aux_df['percentage'],
            y=aux_df['country'],
            orientation='h',
            name=sector,  # label shown in the legend
            text=aux_df['percentage'],
            textposition='auto',
        )
    )
  • 3rd, you need to tell plotly that this is a stacked bar chart. You can do this in the update_layout method.
fig.update_layout(barmode='stack')
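
To make the ordering step more concrete, here is a dependency-free sketch of the same three steps: order the countries by one lead sector, build one trace per sector, and set the stacking mode. The data is invented and `stacked_traces` is a hypothetical helper, so treat it as an illustration rather than the code behind the chart above.

```python
# Made-up rows in long format: (country, sector, percentage).
rows = [
    ("India", "agriculture", 44.0), ("India", "services", 56.0),
    ("Spain", "agriculture", 4.0), ("Spain", "services", 96.0),
    ("Nigeria", "agriculture", 35.0), ("Nigeria", "services", 65.0),
]


def stacked_traces(rows, lead_sector):
    """Order countries by the lead sector, then build one bar trace dict per sector."""
    lead = {country: pct for country, sector, pct in rows if sector == lead_sector}
    country_order = sorted(lead, key=lead.get)  # ascending, like sort_values
    # The lead sector goes first so that its segment starts at the axis.
    sectors = [lead_sector] + sorted({s for _, s, _ in rows} - {lead_sector})
    lookup = {(country, sector): pct for country, sector, pct in rows}
    return [
        dict(
            type="bar",
            name=sector,
            y=country_order,
            x=[lookup[(country, sector)] for country in country_order],
            orientation="h",
        )
        for sector in sectors
    ]


traces = stacked_traces(rows, "agriculture")
layout = dict(barmode="stack")
print(traces[0]["y"])
```

These dictionaries can be handed to `go.Figure(data=traces, layout=layout)`. Every trace uses the same country order, which keeps the segments of one country on the same row.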

As Shakespeare would have said had he worked as a data analyst: to subplot or to stack?

Scenario 2: How on earth to plot 5 countries against 7 types of food?

In this second scenario, you are working with multi-category data that represents how much of each type of food is exported by each country as a percentage of its total production. In the dataset below, you will have information about 5 countries and 7 types of food. How would you convey this information?

Mock data by author

The default output which is doomed to fail

You check what the default output from plotly.express would show. And what you see is not something you like.

Option 1. Put the countries in the x-axis with the food categories in the legend

Countries in the x-axis

Option 2. Put the food categories in the x-axis and countries in the legend

Countries in the y-axis

You will agree with me that neither chart can be used to tell a clear story. Let’s see what happens if we use a stacked bar chart as above.

Alternative #1. The stacked bar chart (which in this case, fails)

The stacked bar chart served us well in the previous scenario. Can it also help us here, where we have more categories and where the sum of the percentages is not equal to 100%?

Check the bar charts below:

Option 1. Put the countries in the x-axis with the food categories in the legend

If you want to tell a story at a country level, I am afraid the stacked bar chart looks really weird.

Option 2. Put the food categories in the x-axis and countries in the legend

If you want to tell a story at the food category level, the chart is slightly better, but not by a huge amount.

Both stacked charts really fail to simplify what we are looking at. In fact, I would argue they are as difficult to read as the clustered bar chart. So, in this case, stacked bar charts have actually failed us.

Alternative #2. The dot plot (which is a fancy scatter plot)

The alternative I am about to present is inspired by the subplot idea we used in scenario 1. However, in this case, I will swap the bars for dots.

One thing I didn’t like from scenario 1 was that the magnitude of bars didn’t make sense across the different categories. Each subplot had its own x-axis range.

Now, what do you think of this dot plot approach for clearer data storytelling?

Why do I think this plot is better?

  1. Dot plot magnitudes are kept constant across the board.
  2. Given that I have a country (Netherlands) which surpasses the rest, I feel the dots convey this superiority better – even more when I colour them differently.
  3. Having these subplots arranged as a table makes things look aligned and neat. In other words, it is easy to scan for answers at the country level or at the food category level.
  4. No colour coding required! And we can use emojis!
Summary of improvements

Tips on how to create this plot

  • 1st, create a subplots object. I have defined the titles with emojis using a dictionary
list_of_categories = df['Food'].unique().tolist()
list_of_categories.sort()

# Map each food category to the emoji shown in its subplot title
food_to_emoji = {
    'Cucumbers': '🥒',
    'Eggs': '🥚',
    'Mushrooms': '🍄',
    'Onions': '🧅',
    'Peppers': '🌶️',
    'Potatoes': '🥔',
    'Tomatoes': '🍅',
}
subplot_titles = [f"{category} {food_to_emoji.get(category, '')}" for category in list_of_categories]
fig = make_subplots(rows=1, cols=7, shared_yaxes=True,
                    subplot_titles=subplot_titles
                    )
  • 2nd, how to add 1 single data point for each combination of {country}-{food}? Loop through the food categories, but in the x-axis force plotting a dummy value (I used the number 1)
for i, feature in enumerate(list_of_categories):
    c = i + 1  # subplot columns are 1-based

    # Only the first column is sorted by percentage
    if c == 1:
        aux_df = df[df['Food'] == feature].sort_values('percentage', ascending=False).copy()
    else:
        aux_df = df[df['Food'] == feature].copy()

    # text, textposition, textfont and marker are defined in the next step
    fig.add_trace(
        go.Scatter(
            y=aux_df['Country'],
            x=[1] * len(aux_df),  # <---- forced x-axis
            mode='markers+text',
            text=text,
            textposition=textposition,
            textfont=textfont,
            marker=marker,
            showlegend=False,
        ),
        row=1, col=c
    )
  • 3rd, but if you plot the value 1, how do you show the real food % values? Easy, you define these in the text, textposition, textfont and marker parameters.
# Label each dot with its percentage
text = [f"{val}%" for val in aux_df['percentage'].round(0)]

# Small dots get a grey label above them; larger dots get a white label inside
textposition = ['top center' if val < 10 else 'middle center' for val in aux_df['percentage']]
textfont = dict(color=['grey' if val < 10 else 'white' for val in aux_df['percentage']])
marker = dict(size=aux_df['percentage'] * 3,
              color=['rgb(0, 61, 165)' if country == 'Netherlands 🇳🇱 ' else 'darkgrey' for country in aux_df['Country']])
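
The four parameters above all depend on the same percentage column, so one option is to wrap them in a single function that is called once per subplot. This is a sketch under assumptions: `dot_style` is a hypothetical helper, the threshold of 10 and the scale of 3 simply mirror the values used above, and the inputs are invented.

```python
def dot_style(percentages, countries, highlight, threshold=10, scale=3):
    """Return the text, textposition, textfont and marker settings for one subplot."""
    small = [value < threshold for value in percentages]
    return dict(
        text=[f"{round(value)}%" for value in percentages],
        # Labels do not fit inside small dots, so they move above them.
        textposition=["top center" if s else "middle center" for s in small],
        textfont=dict(color=["grey" if s else "white" for s in small]),
        marker=dict(
            size=[value * scale for value in percentages],
            color=[
                "rgb(0, 61, 165)" if country == highlight else "darkgrey"
                for country in countries
            ],
        ),
    )


# Made-up values for two countries in one food category.
style = dot_style([42.0, 6.0], ["Netherlands", "Spain"], highlight="Netherlands")
print(style["text"], style["textposition"])
```

The result can be unpacked straight into the trace, for example `go.Scatter(y=countries, x=[1] * len(countries), mode='markers+text', **style)`. One design point to keep in mind: by default Plotly maps marker size to the diameter, so a value twice as large covers four times the area. Setting `sizemode='area'` in the marker (together with a suitable `sizeref`) is an alternative if the area should be proportional to the value.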

Scenario 3: Clustered charts fail to convey change over 2 groups.

In both scenarios above, we were dealing with multiple categories and saw how clustered bar charts hinder our ability to quickly understand the message we are trying to convey. In this last scenario, we cover the case where you only have 2 categories. In this case, 2 different periods in time (it could be 2 segments, 2 areas, etc.).

Because the cardinality is so small (only 2), I have seen many people still using stacked bar charts. Check the data below. It represents the ranking that different Spanish football teams have held in 2 different seasons.

Source: UEFA rankings

If you plotted the teams in the x-axis, the ranking in the y-axis and the season as a coloured legend, you would have the following plot.

Where do I think this plot has issues?

  1. Rankings are not well represented with bar charts. For example, here a larger ranking is worse (ie, rank = 1 is much better than rank = 50).
  2. It is not easy to compare rankings for the season 2023–2024. This is because we have sorted the chart in ascending order based on the 2013–2014 season.
  3. There are teams which had a UEFA ranking in season 2013–2014, but didn’t in 2023–2024 (Malaga). This is not immediately apparent from the chart.
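
Issues 2 and 3 can also be checked directly in the data before any plotting: how far each team moved, and which teams are missing from one of the seasons. The sketch below uses invented clubs and rankings (not the UEFA figures), and `compare_seasons` is a hypothetical helper.

```python
# Made-up rankings as (club, season, ranking); these are not the UEFA figures.
rows = [
    ("Club A", "2013-14", 5), ("Club A", "2023-24", 40),
    ("Club B", "2013-14", 30), ("Club B", "2023-24", 12),
    ("Club C", "2013-14", 20),
]


def compare_seasons(rows, start, end):
    """Return (changes, missing) for two seasons.

    changes maps club -> ranking in `end` minus ranking in `start`
    (a positive number means the club fell down the table).
    missing lists the clubs without a ranking in one of the two seasons.
    """
    by_club = {}
    for club, season, ranking in rows:
        by_club.setdefault(club, {})[season] = ranking

    changes, missing = {}, []
    for club, seasons in by_club.items():
        if start in seasons and end in seasons:
            changes[club] = seasons[end] - seasons[start]
        else:
            missing.append(club)
    return changes, missing


changes, missing = compare_seasons(rows, "2013-14", "2023-24")
print(changes)
print(missing)
```

The club with the largest positive change is a natural candidate for highlighting, and the list of missing clubs shows which lines would have only one end point.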

The slope graph alternative

I always move to slope graphs when I need to visualise rank comparison or any story of change (ie, this datapoint has travelled from here to here). Change doesn’t have to be over time (although it is the most common type of change). A slope graph could be used to compare 2 scenarios, 2 opposite views, 2 geographies, etc. I really like them because your eye can easily travel from start point to end point without interruption. In addition, it makes the degree of change much more obvious. Check the chart below… it is the story of how Valencia CF completely destroyed its ranking since the arrival of a new owner.

Tips on how to create this plot

  • 1st, loop through each club and plot a Scatter plot.
for club in df['club'].unique():
    club_data = df[df['club'] == club]

    # DEFINITION OF COLOUR PARAMETERS
    ...
    fig.add_trace(go.Scatter(
        x=club_data['season'],
        y=club_data['ranking'],
        mode='lines+markers+text',
        name=club,
        text=club_data['text_column'],
        textposition=['middle left' if season == '2013-14' else 'middle right' for season in club_data['season']],
        textfont=dict(color=colour_),
        marker=dict(color=color, size=marker_size),
        line=dict(color=color, width=line_width),
    ))
  • 2nd, define the text, textfont, marker and line parameters.
for club in df['club'].unique():
    club_data = df[df['club'] == club]

    # Highlight one club; every other club stays in the background
    if club == 'Valencia':
        color = 'orange'
        line_width = 4
        marker_size = 8
        colour_ = color
    else:
        color = 'lightgrey'
        line_width = 2
        marker_size = 6
        colour_ = 'grey'

    # go.Scatter()
    ...
  • 3rd, because we are dealing with “rankings”, you can set yaxis_autorange="reversed"
fig.update_layout(
    ...
    yaxis_autorange='reversed',  # Rankings are usually better when lower
)
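
Putting the three steps together, a slope graph needs little more than one styled line per club and a reversed y-axis. Here is a compact sketch that builds the traces as plain dictionaries so that it runs on its own. The clubs and rankings are invented, and `slope_trace` is a hypothetical helper rather than the function used for the chart above.

```python
def slope_trace(club, points, highlight):
    """Build one lines+markers+text trace dict for a club.

    points is a list of (season, ranking) pairs in chronological order.
    """
    is_highlight = club == highlight
    color = "orange" if is_highlight else "lightgrey"
    return dict(
        type="scatter",
        mode="lines+markers+text",
        name=club,
        x=[season for season, _ in points],
        y=[ranking for _, ranking in points],
        text=[f"{club} ({ranking})" for _, ranking in points],
        # First label sits left of the line, later labels sit right of it.
        textposition=["middle left" if i == 0 else "middle right" for i in range(len(points))],
        textfont=dict(color=color if is_highlight else "grey"),
        marker=dict(color=color, size=8 if is_highlight else 6),
        line=dict(color=color, width=4 if is_highlight else 2),
    )


# Made-up rankings for two clubs across two seasons.
data = {
    "Club A": [("2013-14", 5), ("2023-24", 40)],
    "Club B": [("2013-14", 30), ("2023-24", 12)],
}
traces = [slope_trace(club, points, highlight="Club A") for club, points in data.items()]
layout = dict(yaxis=dict(autorange="reversed"), showlegend=False)
print(traces[0]["text"])
```

Passing these to `go.Figure(data=traces, layout=layout)` should give a chart where rank 1 sits at the top. Since every line carries its own text label, the legend can be switched off, which is what `showlegend=False` does here.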

Summary of multi-category visualisation approaches

In this post, we explored how to move beyond clustered bar charts by using more effective visualisation techniques. Here’s a quick recap of the key takeaways:

Scenario 1: subplots vs. stacked bars

  • Subplots: Best for category-specific comparisons, with clear labels and no need for colour-coded legends.
  • Stacked Bars: Ideal for showing cumulative distributions, with consistent bar magnitudes and intuitive 100% totals.

Scenario 2: dot plot for high cardinality

  • When dealing with multiple categories both in the x and y axis, dot plots offer a cleaner view.
  • Unlike subplots or stacked bars, dot plots keep magnitudes constant and comparisons clear.

Scenario 3: slope graphs for two-point comparisons

  • For tracking changes between two points, slope graphs clearly show movement and direction.
  • They highlight upward, downward, or stable trends at a single glance.

Where can you find the code?

In my repo and the live Streamlit app:

Acknowledgements

Further reading

Thanks for reading the article! If you are interested in more of my written content, here is an article capturing all of my other blog posts organised by themes: Data Science team and project management, Data storytelling, Marketing & bidding science and Machine Learning & modelling.

All my written articles in one place

Stay tuned!

If you want to get notified when I release new written content, feel free to follow me on Medium or subscribe to my Substack newsletter. In addition, I would be very happy to chat on Linkedin!

Senior Data Science Lead | Jose Parreño Garcia | Substack


Originally published at https://joseparreogarcia.substack.com.