Which chart would you use to show the distribution of scores if your variable is continuous?

Choose the correct graph or chart style for the task you want your audience to accomplish.

Photo by Morgan Housel on Unsplash

This is the second installment in a two-part series on Data Visualization. If you haven’t read Part 1 of this series, I recommend checking that out!

In part 1 of this series, we walked through the first three data visualization functions: relationship, data over time, and ranking plot. In case you need a quick refresher:

  • Relationship: This method displays a connection or correlation between two or more variables. Scatter charts and heat maps are common choices.
  • Data over time: This method shows data over a period to reveal trends or changes. A line chart or an area chart is often used to represent data over a continuous time span.
  • Ranking: This method displays the relative order of data values. Bar charts are often used to present data points or compare metric values across different subgroups of our data.

For the second part, we’ll discuss the last two data visualization functions: distribution and comparison.

Distribution Plot

15. Histogram
16. Density Curve with Histogram
17. Density Plot
18. Box Plot
19. Strip Plot
20. Violin Plot
21. Population Pyramid

Comparisons Plot

22. Bubble Chart
23. Bullet Chart
24. Pie Chart
25. Nested Pie Chart
26. Donut Chart
27. TreeMap
28. Diverging Bar
29. Choropleth Map
30. Bubble Map

Disclaimer: I grouped the charts by the purpose of the visualization, but the grouping isn’t perfect. For example, scatter plots and bubble charts are both useful for quickly identifying relationships between numeric variables. However, unlike the scatter plot, each point on a bubble chart is assigned a label or category. Bubble charts can also be useful for comparing data points. Additionally, time can be shown either by having it as a variable on one axis or by animating the data variables changing over time. So bubble charts can be useful for relationships, data over time, and comparison. For those reasons, I consider this a guide for selecting a chart based on the purpose of the analysis or the communication need.

Distribution

Distribution charts show how the values of a variable are spread across their range, helping identify outliers and trends.

When evaluating a distribution, we want to find out the existence (or absence) of patterns and their evolution over time.

15. Histogram

A histogram is a vertical bar chart that depicts the distribution of a set of data. Each bar in a histogram represents the tabulated frequency at each interval/bin.

Note:

  • Histograms plot quantitative data with ranges of the data grouped into bins or intervals, while bar charts plot categorical data.
  • Bar graphs have space between the columns, while histograms do not.

The histogram shows the distribution of a variable by plotting quantitative data and showing how often values occur within each bucketed range. In other words, histograms help estimate where values are concentrated, what the extremes are, and whether there are any gaps or unusual values. They are also useful for giving a rough view of the probability distribution.
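To make the idea of "bucketed ranges" concrete, here is a minimal sketch of what a histogram counts before anything is drawn. The function name bin_counts and the scores are made up for illustration; plotting libraries apply their own binning rules, so treat this as a rough picture rather than their exact behavior.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical, so there is nothing to bin")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin, not to a new bin past the end.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return counts, edges


# Hypothetical scores, invented for this example only.
scores = [3.0, 3.5, 4.25, 4.5, 4.75, 5.25, 5.5, 5.75, 6.25, 6.5, 7.0, 4.5]
counts, edges = bin_counts(scores, n_bins=4)
print(counts)
print(edges)
```

Each count becomes the height of one bar, and the edges mark where one bar stops and the next begins, which is why histogram bars touch.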

If we are considering just one variable, the histogram is usually the best visualization to use. Since a histogram allows us to group continuous data into bins, it provides a good representation of where observations are concentrated. If considering two variables, we use a scatter chart as described previously.

Python Implementation

Here, we want to present the distribution of happiness scores.

import matplotlib.pyplot as plt
plt.hist(happy['Score'], edgecolor='black')  # happy: the happiness-score DataFrame
Image by Author

16. Density Curve with Histogram

A histogram can also be used to compare the data distribution to a theoretical model, such as a normal distribution. This requires using a density scale for the vertical axis.
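As a rough sketch of what a density scale means: each bin count is divided by the number of observations times the bin width, so the bar areas add up to 1 and can be compared with a probability density curve. The counts and bin width below are hypothetical.

```python
def to_density(counts, bin_width):
    """Rescale bin counts so that the total area of the bars equals 1."""
    n = sum(counts)
    return [c / (n * bin_width) for c in counts]


counts = [2, 4, 3, 3]   # hypothetical bin counts
bin_width = 0.5         # hypothetical bin width
density = to_density(counts, bin_width)

# Area of each bar is height * width; together they should cover an area of 1.
total_area = sum(d * bin_width for d in density)
print(density, total_area)
```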

# sns.distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(happy['Score'], kde=True, stat='density',
             edgecolor='black',
             line_kws={'linewidth': 4})
Image by Author

17. Density Plot

Density plots (aka Kernel Density Plots or Density Trace Graph) are used to observe a variable's distribution in a dataset.

This chart is a smoothed version of the histogram and is used for the same purpose. It uses a kernel density estimate to show the variable's probability density function, smoothing out the noise. The resulting curve does not depend on the number of bins chosen, so the distribution shape is often easier to read. The peaks of a density plot show where values are concentrated over the interval.

An advantage density plots have over histograms is that they’re better at showing the distribution shape because they’re not affected by the number of bins used (the bars of a typical histogram). The amount of smoothing is instead controlled by the kernel bandwidth.
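For intuition, a kernel density estimate can be sketched in a few lines: place a small Gaussian bump on every observation and average the bumps. The helper gaussian_kde and the sample are hypothetical, and the bandwidth is picked by hand here, whereas plotting libraries usually choose one automatically.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` at a point x."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centered on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density


# Hypothetical measurements, invented for this example only.
sample = [1.8, 2.0, 2.4, 2.5, 3.0, 3.3, 4.7, 5.3]
kde = gaussian_kde(sample, bandwidth=0.5)

# A density should integrate to about 1; check with a simple Riemann sum.
step = 0.01
area = sum(kde(i * step) * step for i in range(-500, 1500))
print(round(area, 3), kde(2.4) > kde(4.0))
```

A larger bandwidth gives a smoother, flatter curve, and a smaller one follows the individual observations more closely.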

Density plots are used to study the distribution of one or a few variables. Checking the distribution of our variables one by one is probably the first task we should do once we get a new dataset, since it delivers a good amount of information. Several distribution shapes exist; here is an illustration of the six most common ones.

Python Implementation

Here, we want to visualize the probability density of the engine displacement in liters (displ).

We use the kdeplot function to visualize the probability density of a continuous variable. It depicts the probability density at different values of that variable. We can also plot several samples on a single graph, which makes comparison more efficient.

# simple density plot
sns.kdeplot(car['displ'], fill=True)  # fill replaces the deprecated shade argument
Image by Author

The density plot also allows us to compare the distribution of a few variables. However, we should not compare more than three or four since it would make the figure cluttered and unreadable.

The cty column records miles per gallon (mpg) for city driving. Here, we want to compare three classes: compact, suv, and midsize.

for class_ in ['compact', 'suv', 'midsize']:
    # extract the city mpg values for this class
    x = car[car['class'] == class_]['cty']
    # plot the data using seaborn
    sns.kdeplot(x, fill=True, label=class_)
plt.legend()  # show which curve belongs to which class
Image by Author

18. Box Plot

A box plot or whisker plot summarizes a set of data measured on an interval scale. This type of graph shows the shape of the distribution, its central value, and its variability.

Image by Author

The box plot shows how data is distributed based on a five-number statistical summary. A small “box” indicates that most of the data falls within a narrow range, while a larger box indicates the data is more widely distributed.

  • The line that divides the box into two parts represents the median of the data. For example, if the median equals 5, the same number of data points lie below and above 5.
  • The ends of the box show the upper (Q3) and lower (Q1) quartiles. If the third quartile equals 10, 75% of the observations are lower than 10.
  • The difference between Q1 and Q3 is called the interquartile range (IQR).
  • The whiskers extend to the highest and lowest values that lie within Q3 + 1.5 × IQR and Q1 − 1.5 × IQR (that is, the highest and lowest values excluding outliers).
  • Dots (or other markers) beyond the whiskers show potential outliers.
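The quantities in the list above can be computed directly. The sketch below uses Python's statistics module on a small made-up data set; note that libraries differ slightly in how they interpolate quartiles, so the exact numbers a plotting function reports may vary.

```python
import statistics

# Hypothetical observations, invented for this example only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartiles with linear interpolation over the data (the "inclusive" method).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences: anything outside them is drawn as a potential outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# The whiskers stop at the most extreme observations that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Here the value 30 lies beyond the upper fence, so it would be plotted as a separate marker while the upper whisker ends at 9.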

We use box plots in descriptive data analysis to indicate whether a distribution is skewed and whether there are potential unusual observations (outliers) in the data set.

Box plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared.

One drawback of boxplots is that they emphasize the tails of a distribution, which are the least certain points in the data set. They also hide many of the details of the distribution.

Python Implementation

Here, we want to present the distribution of vehicle classes.

plot1 = ax.boxplot(vects,
                   notch=False, vert=True,
                   meanline=True, showmeans=True,
                   patch_artist=True)
Image by Author

19. Strip Plot

A strip plot is a scatter plot where one of the variables is categorical. The strip plot is an alternative to a histogram or a density plot. It is typically used for small data sets (histograms and density plots are usually preferred for larger data sets).

Source: Seaborn

A strip plot can be drawn on its own. Still, it is also a good complement to a box or violin plot in cases where we want to show all observations along with some representation of the underlying distribution.

Python Implementation

A boxplot is a fantastic way to study distributions. However, different distributions can be hidden behind the same box. Thus, overlaying a strip chart on a boxplot can be useful to display every observation, so that we don’t miss an interesting pattern.

jitter=True helps spread out overlapping points so that all the points don’t fall in a single vertical line above each category.
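Conceptually, jitter just adds a small random horizontal offset to each point. The sketch below illustrates the idea with a hypothetical helper, jittered_positions; it is not seaborn's actual implementation, and the width and seed are arbitrary choices.

```python
import random


def jittered_positions(n_points, center, width=0.1, seed=0):
    """Return n_points x-positions scattered uniformly around `center`."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [center + rng.uniform(-width, width) for _ in range(n_points)]


# Five observations that all belong to the category drawn at x = 2.
xs = jittered_positions(5, center=2)
print(xs)
```

The y-values are left untouched, so the jitter changes only how readable the points are, not what they say.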

ax = sns.boxplot(x=car['class'], y=car['hwy'], boxprops=dict(alpha=0.75))
ax = sns.stripplot(x=car['class'], y=car['hwy'], jitter=True, edgecolor="gray")
Image by Author

20. Violin Plot

Sometimes the median and mean aren’t enough to understand a dataset. This is where the violin plot comes in. A violin plot is a hybrid of a box plot and a density plot rotated and placed on each side. It is used to visualize the distribution of the data and its probability density.

The “violin” shape of a violin plot comes from the data’s density plot. We turn that density plot sideways and put it on both sides of the box plot, mirroring each other. Each side of a violin is a density estimation to show the distribution shape of the data. Reading the violin shape is exactly how we read a density plot: Wider sections of the violin plot represent a higher probability that members of the population will take on the given value; the skinnier sections represent a lower probability.

Box plots are well explained in many statistics courses, while violin plots are rarely mentioned. One downside of violin plots is that they are not familiar to many readers, so we should consider who our target readers are before using them.

Python Implementation

Here, we want to compare the distribution of miles per gallon for highway driving across vehicle classes.

# x and y are passed as keywords; seaborn 0.13+ renames scale to density_norm
sns.violinplot(x=car['class'], y=car['hwy'],
               scale='width', inner='quartile')
Image by Author

21. Population Pyramid

Population pyramids are ideal for detecting changes or differences in population patterns. Population pyramids are important graphs for visualizing how populations are composed when looking at groups divided by age and sex.

It is a pair of back-to-back histograms (one for each sex) that displays a population's distribution across all age groups. The x-axis is used to plot population numbers, and the y-axis lists all age groups. The shape of a population pyramid can be used to interpret a population. For instance, a pyramid with an extensive base and a narrow top section suggests a community with high fertility and death rates. A pyramid with a wider top half and a narrower base indicates an aging population with low fertility rates.

Population pyramids can also be used to speculate about a population’s future development. This makes population pyramids useful for fields such as ecology, sociology, and economics.

Python Implementation

Assume we want to show the age-sex distribution of a given population. It is a graphic profile of the population’s residents. Sex is shown on the left/right sides, age on the y-axis, and the number of people on the x-axis.

ax[0].barh(range(0, len(df)), df['Male'], align='center', color='#4c85ff')
ax[0].set(title='Males')
ax[1].barh(range(0, len(df)), df['Female'], align='center', color='#ff68b3')
ax[1].set(title='Females')
Image by Author

Comparison

When analyzing our data, we might be interested in comparing data sets to understand differences or similarities between data points or time periods. I grouped the charts that are most used to compare one or more datasets.

Comparison questions ask how different values or attributes within the data compare to each other.

Note:

  • Bar charts can also be listed under the comparison group since we can compare different groups during the same time period using horizontal or vertical bar charts.
  • Tables help you compare exact values to one another. Column and bar charts showcase comparisons across different categories, while line charts excel at showing trends over time.

22. Bubble Chart

A bubble chart displays multiple bubbles (circles) in a two-dimensional plot. It is a generalization of the scatter plot, replacing the dots with bubbles.

Like a scatter plot, a bubble chart uses a Cartesian coordinate system to plot points along a grid where the x-axis and y-axis are separate variables. However, unlike a scatter plot, each point is assigned a label or category (either displayed alongside or on a legend). Each plotted point then represents a third variable by the area of its bubble. We can use color to distinguish between categories or to describe an additional data variable. Time can be shown either by having it as a variable on one axis or by animating the data variables changing over time.
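Because the third variable is encoded by area, the radius should grow with the square root of the value, not with the value itself. Here is a minimal sketch; the helper bubble_radii and the max_radius setting are hypothetical, and plotting libraries handle this scaling in their own ways.

```python
import math


def bubble_radii(values, max_radius=20.0):
    """Scale radii so that bubble AREA, not radius, is proportional to value."""
    largest = max(values)
    return [max_radius * math.sqrt(v / largest) for v in values]


values = [1, 4, 16]            # hypothetical third-variable values
radii = bubble_radii(values)

# Area grows with the square of the radius, so compare squared radii.
area_ratio = radii[2] ** 2 / radii[0] ** 2
print(radii, area_ratio)
```

If the radius were scaled linearly instead, a value 16 times larger would be drawn with 256 times the area, which exaggerates the difference.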

We can use a bubble chart to depict and show relationships between numeric variables. However, marker size as a dimension allows for the comparison between three variables rather than just two. In a single bubble chart, we can make three different pairwise comparisons (X vs. Y, Y vs. Z, X vs. Z) and an overall three-way comparison. It would require multiple two-variable scatter plots to gain the same number of insights; even then, inferring a three-way relationship between data points will not be as direct as in a bubble chart.

Too many bubbles can make the chart hard to read, so bubble charts have a limited data size capacity. This can be somewhat remedied by interactivity: clicking or hovering over bubbles to display hidden information, having an option to reorganize or filter out grouped categories.

px.scatter(happy, x="GDP", y="Score", animation_frame="Year",
           animation_group="Country",
           size="Rank", color="Country", hover_name="Country",
           trendline="ols")
Image by Author

23. Bullet Chart

Bullet charts are typically used to display performance data; they are similar to bar charts but are accompanied by extra visual elements to pack in more context. Bullet graphs were originally developed by Stephen Few as an alternative to dashboard gauges and meters, which often displayed insufficient information, were less space-efficient, and were cluttered with “chartjunk.”

Image by Author

The length of the main bar in the middle of the chart encodes the primary data value, known as the featured measure. The line marker that runs perpendicular to the graph's orientation is known as the comparative measure and is used as a target marker to compare against the featured measure. So if the main bar has passed the comparative measure's position, we know we hit the goal.

The segmented colored bars behind the featured measure display the qualitative ranges. Each color shade (the three shades of grey in the example above) corresponds to a performance rating, for example, low, average, and excellent. When using a bullet chart, we usually keep the number of ranges to five at most.
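The reading logic of a bullet chart can be sketched as a small function: find which qualitative range the featured measure falls into, and check whether it reached the comparative measure (the target). The function name, thresholds, and labels below are hypothetical.

```python
def rate_performance(measure, target, range_limits, labels):
    """Return the qualitative band for `measure` and whether it met `target`.

    range_limits holds the upper bound of each band, in increasing order.
    """
    band = labels[-1]  # values above the last limit still count as the top band
    for limit, label in zip(range_limits, labels):
        if measure <= limit:
            band = label
            break
    return band, measure >= target


# Hypothetical revenue figures, invented for this example only.
limits = [150, 225, 300]
names = ["low", "average", "excellent"]

good = rate_performance(measure=270, target=250, range_limits=limits, labels=names)
weak = rate_performance(measure=100, target=250, range_limits=limits, labels=names)
print(good, weak)
```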

fig = ff.create_bullet(
data, titles='label', subtitles='sublabel', markers='point',
measures='performance', ranges='range', orientation='h',
measure_colors=['#1e747c', '#7ac7bf'],
range_colors=['#F5E1DA', '#F1F1F1']
)
Image by Author

24. Pie Chart

Pie charts are a classic way to show the composition of groups. A pie chart is a circular graph divided into slices. The larger a slice is, the larger the portion of the total quantity it represents.

However, pie charts are generally not advisable because the area of the slices can be misleading. So, when using pie charts, it is highly recommended to explicitly write down the percentage or number for each slice.

Pie charts are best suited to depict sections of a whole: the proportional distribution of the data. However, the significant downsides to pie charts are:

  • Unsuitable for extensive data: as the number of values shown increases, each segment/slice's size becomes smaller and harder to read.
  • Not great for making accurate comparisons between groups of pie charts.
  • Can’t display how the proportions change over time.
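Since labeling each slice with its percentage is recommended, it helps to see how those percentages (and the slice angles) come from the raw values. This sketch uses the spending figures from the example in this section; the dictionary names are just for illustration.

```python
labels = ["Food", "Housing", "Saving", "Gas", "Insurance", "Car"]
spend = [800, 2000, 500, 200, 300, 250]

total = sum(spend)

# Share of the whole for each category, and the angle its slice spans.
shares = {label: amount / total for label, amount in zip(labels, spend)}
angles = {label: share * 360 for label, share in shares.items()}

for label in labels:
    print(f"{label:<10}{shares[label]:6.1%}{angles[label]:8.1f} degrees")
```

Slices with small shares span only a few degrees, which is why pie charts become hard to read as the number of categories grows.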

Python Implementation

Assume we want to display the spending habits of a particular customer.

labels = 'Food', 'Housing', 'Saving', 'Gas', 'Insurance', 'Car'
spend = [800, 2000, 500, 200, 300, 250]
p = plt.pie(spend,                          # values
            labels=labels,                  # label for each section
            explode=(0.07, 0, 0, 0, 0, 0),  # pull the first slice out slightly
            colors=colors,                  # color of each section (list defined elsewhere)
            autopct='%1.1f%%',              # show percentages with one decimal place
            startangle=130,                 # start angle of the first section
            shadow=True                     # draw a shadow under the pie
            )
Image by Author

25. Nested Pie Chart

A nested pie chart goes one step further and splits each group of the pie chart into smaller subgroups. In the inner circle, we treat each number as belonging to its own group. In the outer circle, we plot them as members of their original groups.
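The relationship between the two rings can be sketched with plain lists: the outer ring uses one total per group, and the inner ring uses every individual number. The values are the ones used in the implementation in this section; plain Python stands in for the NumPy calls here.

```python
# One row per spending category, one entry per sub-item within that category.
vals = [[300.0, 500.0], [1800.0, 200.0], [500.0, 0.0],
        [200.0, 0.0], [150.0, 150.0], [150.0, 50.0]]

# Outer ring: one wedge per category (the row totals).
outer = [sum(row) for row in vals]

# Inner ring: one wedge per sub-item (all rows laid end to end).
inner = [v for row in vals for v in row]

print(outer)
print(inner)
```

Both rings add up to the same total, so each inner wedge lines up underneath the outer wedge of the group it belongs to.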

Python Implementation

We first generate the data to work on as a NumPy array, with one row per group and one column per sub-item.

The donut shape is achieved by setting a width for the pie’s wedges through the wedgeprops argument. For example, wedgeprops=dict(width=0.3) makes each ring 0.3 wide, and adding linewidth=5 sets the width of the wedge border lines to 5 points.

# get the data
size = 0.3
labels = 'Food', 'Housing', 'Saving', 'Gas', 'Insurance', 'Car'
spend = [800, 2000, 500, 200, 300, 250]
vals = np.array([[300., 500.], [1800., 200.], [500., 0.],[200., 0.], [150., 150.],[150., 50]])
in_labels = 'At Home','Out', 'Rent','Utilities','Saving','', 'Gas','','Car','Health','Tires','Mai
# outer level
ax.pie(vals.sum(axis=1),  # one outer wedge per category (row totals)
       radius=1,  # outer radius of the ring
       labels=labels,  # label for each section
       colors=outer_colors,  # color of each section
       wedgeprops=dict(linewidth=5, width=size, edgecolor='w')  # ring width and white borders
       )
# inner level
patches, texts = ax.pie(vals.flatten(),  # flatten so every sub-item gets its own inner wedge
                        radius=1-size,
                        labels=in_labels,
                        labeldistance=0.8,
                        colors=inner_colors,
                        wedgeprops=dict(linewidth=3, width=size, edgecolor='w'))
Image by Author

26. Donut Chart

A donut chart is essentially a pie chart with an area of the center cut out, making it look like a donut. As donut charts are hollowed out, there is no central point to attract your attention. Where do your eyes go instead?

Image by Author

If we are like most people, our eyes travel around the circumference and judge each piece according to its length. Therefore, we can also think of a donut chart as a stacked bar graph curled around on itself.

Pie charts and donut charts are commonly used to visualize election and census results, revenue by product or division, recycling data, survey responses, budget breakdowns, educational statistics, spending plans, or population segmentation.

As we discussed above, one drawback of pie charts is that they are not useful for making accurate comparisons between groups of pie charts. Since readers focus on the slices' proportional areas relative to one another and to the chart as a whole, it is tricky to see the differences between slices, especially when comparing multiple pie charts.

A donut chart resolves this problem because readers tend to focus more on reading the arcs' length instead of comparing the proportions between slices.

Python Implementation

trace = go.Pie(labels=labels,
values=spend,
marker=dict(colors=colors),
hole=0.3)
Image by Author

27. TreeMap

A treemap chart is similar to a pie chart, but it does a better job of showing each group's contribution without misleading the reader. A treemap allows us to split the whole into hierarchies and then show an internal breakdown of each of these hierarchies. At their simplest, treemaps display rectangles sized in proportion to their values, so bigger rectangles represent higher values.
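To illustrate "rectangles sized in proportion to their values", here is a deliberately simple layout that cuts one rectangle into vertical strips. This is an assumption-laden sketch: the helper slice_layout is hypothetical, and dedicated treemap libraries use more elaborate layouts that keep the rectangles closer to squares.

```python
def slice_layout(sizes, width, height):
    """Split a width x height rectangle into vertical strips.

    Each strip's area is proportional to its size.
    Returns (x, y, strip_width, strip_height) tuples.
    """
    total = sum(sizes)
    rects = []
    x = 0.0
    for s in sizes:
        strip_width = width * s / total
        rects.append((x, 0.0, strip_width, height))
        x += strip_width
    return rects


# Hypothetical category values, invented for this example only.
rects = slice_layout([50, 30, 20], width=100, height=10)
areas = [w * h for _, _, w, h in rects]
print(rects)
print(areas)
```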

We often use treemaps for sales data. They capture relative sizes of data categories, allowing for a quick, high-level summary of the similarities and anomalies within one category and between multiple categories.

Treemap charts are not suitable when our data is not divisible into categories and sub-categories.

Moreover, when we’re encoding data with area and intensity of color, our eyes aren’t great at detecting relatively minor differences in either of these dimensions. If our audience needs to make precise comparisons between categories, this becomes even more cumbersome because the categories aren’t aligned to a common baseline. We should never make our audience do more work than necessary to understand a graph!

Python Implementation

Python allows us to create these charts quickly, as it will calculate each rectangle's size and plot it in a way that fits. We can also combine our treemap with the Matplotlib library’s ability to scale colors against variables to make good looking and easy to understand plots with Python.

Here, we want to display the value counted for each type of car.

Make sure you install squarify!

import squarify
# plot the data using squarify
squarify.plot(sizes=label_value.values(), label=labels, color=colors, alpha=0.6)
Image by Author

28. Diverging Bar

A diverging bar chart is a bar chart with the marks for some dimension members pointing up or right, and the marks for other dimension members pointing in the opposite direction (down or left, respectively).

Note:

  • The marks flowing down or left do not necessarily represent negative values.
  • The divergent line can represent zero, but it can also separate the marks for two dimension members.

We use diverging bars to see how the items vary based on a single metric and visualize the order and amount of this variance. If our primary objective is to compare each dimension member's trend, a divergent bar chart is a good option.
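One common way to prepare data for a diverging bar chart is to center each value on the mean, so bars below average point one way and bars above average point the other. The sketch below standardizes hypothetical values and picks a color by sign; this is an assumed preparation step, not necessarily how the values plotted in this article were computed.

```python
import statistics


def diverging_values(values):
    """Standardize values and choose a bar color from the sign of each score."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    z_scores = [(v - mean) / sd for v in values]
    colors = ["red" if z < 0 else "green" for z in z_scores]
    return z_scores, colors


# Hypothetical metric values, invented for this example only.
values = [2.0, 4.0, 6.0, 8.0]
z_scores, colors = diverging_values(values)
print([round(z, 2) for z in z_scores])
print(colors)
```

Sorting the items by their standardized score before plotting makes the order and the amount of variance easy to read.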

The drawback to using diverging bar charts is that it’s not as easy to compare the values across dimension members as it is with a grouped bar chart.

Python Implementation

# plot using horizontal lines and make it look like a column by changing the linewidth
ax.hlines(y=health.index, xmin=0 , xmax=health['x_plot'], color=colors, linewidth=5)
Image by Author

29. Choropleth Map

A choropleth map is a type of thematic map. A set of pre-defined areas is colored or patterned in proportion to a statistical variable representing an aggregate summary of a geographic characteristic within each region. Typically the color scale will be darker for large values and lighter for small values.

We use a choropleth map to visualize how a measurement varies across a geographic area or to show the level of variability within a region. For example, we can use a choropleth map to show the percentage of unemployed people by region, with darker colors indicating that more people are unemployed.
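The "darker for larger values" rule can be sketched as a mapping from each region's value to a shade index. The helper shade_index, the region names, and the rates are all hypothetical, and the equal-interval classes used here are only one of several classification schemes in use.

```python
def shade_index(value, vmin, vmax, n_shades=5):
    """Map a value to a shade from 0 (lightest) to n_shades - 1 (darkest)."""
    if vmax == vmin:
        return 0  # no variation, so every region gets the same shade
    fraction = (value - vmin) / (vmax - vmin)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp values outside the range
    return min(int(fraction * n_shades), n_shades - 1)


# Hypothetical unemployment rates by region, invented for this example only.
rates = {"North": 3.0, "South": 7.5, "East": 12.0}
lo, hi = min(rates.values()), max(rates.values())
shades = {region: shade_index(rate, lo, hi) for region, rate in rates.items()}
print(shades)
```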

Note:

A large region doesn’t mean the colored statistic is of more importance than a smaller regional area. The larger area’s statistic may be related to the people in the area rather than the area itself — so we should beware of size distorting our interpretation.

Python Implementation

world_map_1 = go.Figure(data=[happiness_rank], layout=layout)
Image by Author

30. Bubble Map

A bubble map uses circles of different sizes to represent a numeric value on a territory. It displays one bubble per geographic coordinate or one bubble per region.

We often use bubble maps for a dataset with:

  • a list of geographic coordinates (longitude and latitude), with a numeric variable controlling the size of the bubble.
  • a list of regions with attributed values and known boundaries. In this case, the bubble map will replace the usual choropleth map.

Note: A bubble map avoids the bias caused by different regional areas in choropleth maps (large regions tend to carry more visual weight).

Python Implementation

fig = px.scatter_geo(
df_today, # provide the Pandas data frame
locations='countryCode', # indicate locations
color='continent',
hover_name='country', # what to display when the mouse hovering on the bubble
size='cases', # how large the bubble is
projection='equirectangular',
title=f'World COVID-19 Cases for {today}'
)
Image by Author

This note covers what I consider to be the essentials of data visualization. Data visualization isn’t going away anytime soon, so it’s essential to build a foundation of analysis, storytelling, and exploration that you can carry with you regardless of the tools or software you end up using. It takes months, sometimes years, to master a skill, so don’t stop learning! If you want to dig deeper into this particular topic, here are some excellent places to start.

1. Histograms and Density Plots in Python

2. The Next Level of Data Visualization in Python

3. A step-by-step guide for creating advanced Python data visualizations with Seaborn / Matplotlib

4. Matplotlib Cheat Sheet

What type of chart is best for continuous data?

Line charts are among the most frequently used chart types. Use lines when you have a continuous data set. These are best suited for trend-based visualizations of data over a period of time, when the number of data points is very high (more than 20).

How do you show continuous data distribution?

A histogram shows the distribution of a continuous variable and, since the variable is continuous, there should be no gaps between the bars. A bar chart shows the distribution of a discrete variable or a categorical one, and so will have spaces between the bars.

Which chart would you use to show a relationship between two continuous variables?

Scatter plots are used to display the relationship between two continuous variables x and y.

Which chart would you use to show the change in a continuous variable over time?

A line graph. Line graphs are used to track changes over short and long periods of time. When smaller changes exist, line graphs are better to use than bar graphs. Line graphs can also be used to compare changes over the same period of time for more than one group.

Which type of chart is used to display the distribution of a variable?

A histogram is a chart that plots the distribution of a numeric variable's values as a series of bars.