Which chart would you use to show the distribution of scores if your variable is continuous?
Choose the correct graph or chart style for the task you want your audience to accomplish.

Photo by Morgan Housel on Unsplash

This is the second installment in a two-part series on Data Visualization. If you haven’t read Part 1, I recommend checking it out! In Part 1, we walked through the first three data visualization functions: relationship, data over time, and ranking.
For the second part, we’ll discuss the last two data visualization functions: distribution and comparison.

Distribution plots: 15. Histogram through 21. Population Pyramid
Comparison plots: 22. Bubble Chart through 30. Bubble Map

Disclaimer: I grouped the charts by the purpose of the visualization, but the grouping isn’t perfect. For example, scatter plots and bubble charts are both useful for quickly identifying relationships between numeric variables. However, unlike the scatter plot, each point on a bubble chart is assigned a label or category, so bubble charts can also be used to compare data points. Additionally, time can be shown either by placing it on one axis or by animating the data as it changes. Bubble charts can therefore serve relationships, data over time, and comparison. For those reasons, treat this as a guide for selecting a chart based on the purpose of the analysis or the communication need.

Distribution

Distribution charts show how the values of a variable are spread out, helping identify outliers and trends. When evaluating a distribution, we look for the presence (or absence) of patterns and, when the data covers several periods, how those patterns evolve over time.

15. Histogram

A histogram is a vertical bar chart that depicts the distribution of a set of data. Each bar represents the tabulated frequency of one interval (bin). Note:
The histogram shows the distribution of a quantitative variable by plotting how often values fall within each bucketed range. In other words, histograms help estimate where values are concentrated, what the extremes are, and whether there are any gaps or unusual values. They are also useful for giving a rough view of the probability distribution.

If we are considering just one continuous variable, the histogram is usually the best visualization to use. Because a histogram groups continuous data into bins, it gives a good picture of where observations are concentrated. If considering two variables, we use a scatter chart as described previously.

Python Implementation

Here, we want to present the distribution of happiness scores.

plt.hist(happy['Score'], edgecolor = 'black')

Image by Author

16. Density Curve with Histogram

A histogram can also be used to compare the data distribution to a theoretical model, such as a normal distribution. This requires using a density scale for the vertical axis.

sns.distplot(happy['Score'], hist=True, kde=True,

Image by Author
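To make the difference between a frequency scale and a density scale concrete, here is a minimal sketch that bins a few values by hand. The `histogram` helper and the scores are hypothetical, made up for illustration (they are not the happiness data used above), and the sketch assumes the values are not all identical.

```python
def histogram(values, bins=10):
    """Return (edges, counts, densities) for equal-width bins.

    Assumes the values are not all identical (bin width must be > 0).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        # The last bin is closed on the right so the maximum is counted.
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    n = len(values)
    # Density scale: bar area (height * width) equals the bin's share of the data.
    densities = [c / (n * width) for c in counts]
    return edges, counts, densities


# Hypothetical scores, made up for illustration only.
scores = [3.2, 4.1, 4.4, 4.8, 5.0, 5.1, 5.3, 5.9, 6.2, 6.4, 7.0, 7.5]
edges, counts, densities = histogram(scores, bins=5)

width = edges[1] - edges[0]
total_area = sum(d * width for d in densities)

print("counts:   ", counts)
print("densities:", [round(d, 3) for d in densities])
print("total area under the bars:", round(total_area, 6))
```

On the frequency scale the bar heights add up to the number of observations. On the density scale the bar areas add up to 1, which is what lets a histogram share an axis with a probability density curve.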
17. Density Plot

Density plots (also known as kernel density plots or density trace graphs) are used to observe a variable's distribution in a dataset. The chart is a smoothed version of the histogram and is read the same way. It uses a kernel density estimate to show the variable's probability density function, smoothing out the noise. The peaks of a density plot show where values are concentrated over the interval.

An advantage density plots have over histograms is that they’re better at revealing the distribution shape, because the curve does not depend on the number of bins chosen (although it does depend on the kernel bandwidth).

Density plots are used to study the distribution of one or a few variables. Checking the distribution of our variables one by one is probably the first task we should do after getting a new dataset, since it delivers a good amount of information. Several distribution shapes exist; here is an illustration of the six most common ones.

Python Implementation

Here, we want to visualize the probability density of the engine displacement in liters.

# simple density plot

Image by Author

The density plot also allows us to compare the distribution of a few variables. However, we should not compare more than three or four, since more would make the figure cluttered and unreadable.
for class_ in ['compact', 'suv', 'midsize']:

Image by Author

18. Box Plot

A box plot (or box-and-whisker plot) summarizes a set of data measured on an interval scale. This type of graph shows the shape of the distribution, its central value, and its variability.

Image by Author

The box plot shows how the data is distributed based on a five-number summary: minimum, first quartile, median, third quartile, and maximum. A small “box” indicates that the middle half of the data falls within a narrow range, while a larger box indicates that the data is more widely spread.
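As a rough illustration of the five-number summary behind a box plot, the sketch below computes it with Python's standard `statistics` module. The mileage values are hypothetical, and the 1.5 × IQR rule for flagging outliers is a common convention assumed here; quartile conventions differ slightly between tools, so a plotting library may draw marginally different box edges.

```python
import statistics


def five_number_summary(values):
    """Return min, quartiles, and max of a list of numbers."""
    data = sorted(values)
    # method="inclusive" interpolates between the observed min and max.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}


def iqr_outliers(values, k=1.5):
    """Flag points beyond k * IQR from the box (k=1.5 is an assumed convention)."""
    s = five_number_summary(values)
    iqr = s["q3"] - s["q1"]
    low, high = s["q1"] - k * iqr, s["q3"] + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical highway mileage values, made up for illustration only.
hwy = [20, 22, 24, 25, 26, 27, 29, 31, 44]

summary = five_number_summary(hwy)
outliers = iqr_outliers(hwy)

print(summary)
print("outliers:", outliers)
```

The box spans `q1` to `q3`, the line inside it is the median, and points returned by `iqr_outliers` are the ones usually drawn as individual markers beyond the whiskers.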
We use box plots in descriptive data analysis to indicate whether a distribution is skewed and whether the data set contains potential unusual observations (outliers). Box plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared.

One drawback of box plots is that they emphasize the tails of a distribution, which are the least certain points in the data set. They also hide many of the details of the distribution.

Python Implementation

Here, we want to present the distribution of vehicle classes.

plot1 = ax.boxplot(vects,

Image by Author

19. Strip Plot

A strip plot is a scatter plot where one of the variables is categorical. The strip plot is an alternative to a histogram or a density plot. It is typically used for small data sets (histograms and density plots are usually preferred for larger data sets).

Source: Seaborn

A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot when we want to show all observations along with some representation of the underlying distribution.

Python Implementation

A box plot is a fantastic way to study distributions. However, different distributions can be hidden behind the same box. Overlaying a strip plot on the box plot displays every observation, so that we do not miss an interesting pattern.
ax = sns.boxplot(x=car['class'], y=car['hwy'], boxprops=dict(alpha=0.75))

Image by Author

20. Violin Plot

Sometimes the median and mean aren’t enough to understand a dataset. This is where the violin plot comes in. A violin plot is a hybrid of a box plot and a density plot. It is used to visualize the distribution of the data and its probability density.

The “violin” shape comes from the data’s density plot: we turn that density plot sideways and put it on both sides of the box plot, mirroring each other. Each side of a violin is therefore a density estimate showing the shape of the distribution. We read the violin shape exactly as we read a density plot: wider sections represent a higher probability that members of the population will take on the given value; skinnier sections represent a lower probability.

Box plots are well explained in many statistics courses, while violin plots are rarely mentioned. One downside of violin plots is that they are unfamiliar to many readers, so we should consider who our target readers are before using them.

Python Implementation

Here, we want to compare the distribution of miles per gallon for highway driving across vehicle classes.

sns.violinplot(x=car['class'], y=car['hwy'],

Image by Author

21. Population Pyramid

Population pyramids are ideal for detecting changes or differences in population patterns. They are important graphs for visualizing how a population is composed when it is divided by age and sex. A population pyramid is a pair of back-to-back histograms (one for each sex) that displays the population's distribution across all age groups. The x-axis plots population numbers, and the y-axis lists the age groups.

The shape of a population pyramid can be used to interpret a population. For instance, a pyramid with an extensive base and a narrow top section suggests a community with high fertility and death rates.
A pyramid with a wider top half and a narrower base suggests an aging population with low fertility rates. Population pyramids can also be used to speculate about a population’s future development. This makes them useful in fields such as ecology, sociology, and economics.

Python Implementation

Assume we want to show the age-sex distribution of a given population. It is a graphic profile of the population’s residents. Sex is shown on the left and right sides, age on the y-axis, and the number of people on the x-axis.

ax[0].barh(range(0, len(df)), df['Male'], align='center', color='#4c85ff')

Image by Author
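The sketch below shows one way the bar lengths for a pyramid could be prepared before plotting. Everything in it is an assumption for illustration: the age groups and counts are made up, `pyramid_bars` is a hypothetical helper, and negating one sex to draw both on a single axis is a common alternative to the two mirrored axes that the snippet above appears to use.

```python
# Hypothetical counts per age group, made up for illustration only.
age_groups = ["0-14", "15-29", "30-44", "45-59", "60+"]
male = [120, 110, 95, 80, 45]
female = [115, 108, 97, 85, 60]


def pyramid_bars(male, female):
    """Return (left, right) bar lengths as percentages of the total population."""
    total = sum(male) + sum(female)
    # Male bars are negated so they extend to the left of the central axis.
    left = [-100 * m / total for m in male]
    right = [100 * f / total for f in female]
    return left, right


left, right = pyramid_bars(male, female)

for group, l, r in zip(age_groups, left, right):
    print(f"{group:>6}  male {abs(l):5.1f}%  female {r:5.1f}%")
```

Expressing the bars as shares of the total population, rather than raw counts, makes pyramids of differently sized populations comparable by shape.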
Comparison

When analyzing our data, we might be interested in comparing data sets to understand the differences or similarities between data points or time periods. Comparison questions ask how different values or attributes within the data compare to each other. Here I have grouped the charts most often used to compare one or more datasets.
22. Bubble Chart

A bubble chart displays multiple bubbles (circles) in a two-dimensional plot. It is a generalization of the scatter plot, replacing the dots with bubbles. Like a scatter plot, a bubble chart uses a Cartesian coordinate system to plot points along a grid where the x-axis and y-axis are separate variables. However, unlike a scatter plot, each point is assigned a label or category (displayed either alongside the point or in a legend). Each plotted point then represents a third variable by the area of its bubble. Color can be used to distinguish between categories or to encode an additional data variable. Time can be shown either by placing it on one axis or by animating the data as it changes.

We can use a bubble chart to show relationships between numeric variables. Marker size as a dimension allows for comparison between three variables rather than just two. In a single bubble chart, we can make three different pairwise comparisons (X vs. Y, Y vs. Z, X vs. Z) and an overall three-way comparison. It would require multiple two-variable scatter plots to gain the same number of insights; even then, inferring a three-way relationship between data points would not be as direct as in a bubble chart.

Too many bubbles can make the chart hard to read, so bubble charts have a limited data size capacity. This can be somewhat remedied by interactivity: clicking or hovering over bubbles to display hidden information, or offering an option to reorganize or filter out grouped categories.

px.scatter(happy, x="GDP", y="Score", animation_frame="Year",

Image by Author
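Because the third variable is encoded by bubble area rather than radius, the radius has to grow with the square root of the value. The sketch below illustrates that scaling; `bubble_radii`, `max_radius`, and the sample values are hypothetical, and the values are assumed to be positive.

```python
import math


def bubble_radii(values, max_radius=30.0):
    """Scale radii so that bubble AREA is proportional to each value.

    Assumes all values are positive; max_radius is an arbitrary display size.
    """
    largest = max(values)
    return [max_radius * math.sqrt(v / largest) for v in values]


# Hypothetical third variable, made up for illustration only.
population = [1, 4, 9, 16]

radii = bubble_radii(population)
areas = [math.pi * r ** 2 for r in radii]

for value, r, a in zip(population, radii, areas):
    print(f"value {value:>2}  radius {r:5.1f}  area {a:8.1f}")
```

Scaling the radius linearly instead would make a value four times larger look sixteen times larger. In Matplotlib, the `s` argument of `scatter` is already an area (points squared), so the square root is only needed when a tool expects a radius or diameter.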
23. Bullet Chart

Bullet charts are typically used to display performance data. They are similar to bar charts but are accompanied by extra visual elements that pack in more context. Bullet graphs were originally developed by Stephen Few as an alternative to dashboard gauges and meters, which often displayed insufficient information, were less space-efficient, and were cluttered with “chartjunk.”

Image by Author

The length of the main bar in the middle of the chart encodes the primary data value, known as the feature measure. The line marker that runs perpendicular to the graph's orientation is known as the comparative measure and is used as a target marker to compare against the feature measure. So if the main bar has passed the comparative measure, we know we hit the goal.

The segmented colored bars behind the feature measure display the qualitative ranges. Each color shade (the three shades of grey in the example above) represents a performance rating, for example, low, average, and excellent. When using a bullet chart, we usually keep the number of ranges to five at most.

fig = ff.create_bullet(

Image by Author

24. Pie Chart

Pie charts are a classic way to show the composition of groups. A pie chart is a circular graph divided into slices; the larger a slice is, the larger the portion of the total quantity it represents. However, pie charts are generally not advisable, because the areas of the slices can be misleading. So, when using a pie chart, it is highly recommended to write the percentage or number explicitly on each slice.

Pie charts are best suited to depicting sections of a whole: the proportional distribution of the data. However, the significant downsides to pie charts are:
Python Implementation

Assume we want to display the spending habits of a particular customer.

labels = 'Food', 'Housing', 'Saving', 'Gas', 'Insurance', 'Car'

Image by Author

25. Nested Pie Chart

A nested pie chart goes one step further and splits the pie chart's outer level into smaller groups.

Python Implementation

We first generate a dataframe to work on. In the inner circle, we treat each number as belonging to its group. In the outer circle, we plot them as members of their original groups. The effect of the donut shape is achieved by setting a width on the pie’s wedges.

# get the data

Image by Author

26. Donut Chart

A donut chart is essentially a pie chart with the center cut out, making it look like a donut. Because donut charts are hollow, there is no central point to attract your attention. Where do your eyes go instead?

Image by Author

If you are like most people, your eyes travel around the circumference and judge each piece according to its length. Therefore, we can also think of a donut chart as a stacked bar graph curled around on itself.

Pie charts and donut charts are commonly used to visualize election and census results, revenue by product or division, recycling data, survey responses, budget breakdowns, educational statistics, spending plans, or population segmentation.

As discussed above, one drawback of pie charts is that they are not useful for making accurate comparisons between groups of pie charts. Since readers focus on the slices' areas relative to one another and to the chart as a whole, it is tricky to see the differences between slices, especially when comparing multiple pie charts. A donut chart eases this problem because readers tend to focus on the length of the arcs instead of comparing the areas of the slices.
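Both pie and donut slices encode a share of the whole, whether the reader judges it by area or by arc length. As a small worked example, the sketch below turns raw amounts into the percentage labels and slice angles a chart would draw. The labels reuse the spending categories from the pie chart example, while the amounts and the `slice_table` helper are hypothetical.

```python
labels = ["Food", "Housing", "Saving", "Gas", "Insurance", "Car"]
# Hypothetical monthly spending amounts, made up for illustration only.
amounts = [300, 900, 200, 100, 150, 350]


def slice_table(labels, amounts):
    """Return (label, percent, angle_in_degrees) for each slice."""
    total = sum(amounts)
    rows = []
    for label, amount in zip(labels, amounts):
        share = amount / total
        rows.append((label, round(100 * share, 1), round(360 * share, 1)))
    return rows


rows = slice_table(labels, amounts)

for label, percent, angle in rows:
    print(f"{label:<10} {percent:5.1f}%  {angle:6.1f} degrees")
```

Printing the percentage next to each label is the same information the text recommends writing on every slice, so the reader does not have to estimate it from the shape.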
Python Implementation

trace = go.Pie(labels=labels,

Image by Author

27. TreeMap

A treemap is similar to a pie chart, but it does a better job of showing each group's contribution without misleading the reader. A treemap allows us to split the whole into hierarchies and then show an internal breakdown of each of them. At their simplest, treemaps display rectangles sized according to their value, so bigger rectangles represent higher values.

We often use treemaps for sales data. They capture the relative sizes of data categories, allowing for a quick, high-level summary of the similarities and anomalies within one category and between multiple categories.

Treemaps are not suitable when our data is not divisible into categories and sub-categories. Moreover, when we’re encoding data with area and color intensity, our eyes aren’t great at detecting relatively minor differences in either dimension. If our audience needs to make precise comparisons between categories, it’s even more cumbersome when the categories aren’t aligned to a common baseline. We should never make our audience do more work than necessary to understand a graph!

Python Implementation

Python allows us to create these charts quickly, as it calculates each rectangle's size and plots it in a way that fits. We can also combine our treemap with Matplotlib’s ability to scale colors against variables to make good-looking and easy-to-understand plots. Here, we want to display the count for each type of car. Make sure you install squarify first.

import squarify
# plot the data using squarify

Image by Author

28. Diverging Chart

A diverging bar chart is a bar chart in which the marks for some dimension members point up or right, and the marks for other dimension members point in the opposite direction (down or left, respectively). Note:
We use diverging bars to see how items vary based on a single metric and to visualize the order and amount of this variance. If our primary objective is to compare each dimension member's trend, a diverging bar chart is a good option. The drawback of diverging bar charts is that it’s not as easy to compare values across dimension members as it is with a grouped bar chart.

Python Implementation

# plot using horizontal lines and make it look like a column by changing the linewidth

Image by Author

29. Choropleth Map

A choropleth map is a type of thematic map. A set of pre-defined areas is colored or patterned in proportion to a statistical variable representing an aggregate summary of a geographic characteristic within each region. Typically the color scale is darker for large values and lighter for small values.

We use a choropleth map to visualize how a measurement varies across a geographic area or to show the level of variability within a region. For example, we can use a choropleth map to show the percentage of unemployed people by region, with darker colors indicating higher unemployment.

Note: A large region doesn’t mean its statistic is more important than that of a smaller region. The statistic may relate to the people in the area rather than to the area itself, so we should beware of size distorting our interpretation.

Python Implementation

world_map_1 = go.Figure(data=[happiness_rank], layout=layout)

Image by Author

30. Bubble Map

A bubble map uses circles of different sizes to represent a numeric value on a territory. It displays one bubble per geographic coordinate or one bubble per region. We often use bubble maps for a dataset with:
Note: A bubble map avoids the bias caused by differing regional areas in choropleth maps, where large regions tend to carry more weight in the reader's eye.

Python Implementation

fig = px.scatter_geo(

Image by Author

This note covers what I consider to be the necessities of data visualization. Data visualization isn’t going away anytime soon, so it’s essential to build a foundation of analysis, storytelling, and exploration that you can carry with you regardless of the tools or software you end up using. It takes months, sometimes years, to master a skill, so don’t stop learning! If you want to dig deeper into this particular topic, here are some excellent places to start.
2. The Next Level of Data Visualization in Python
3. A step-by-step guide for creating advanced Python data visualizations with Seaborn / Matplotlib
4. Matplotlib Cheat Sheet

What type of chart is best for continuous data?
Line charts are among the most frequently used chart types. Use lines when you have a continuous data set. These are best suited for trend-based visualizations of data over a period of time, when the number of data points is very high (more than 20).
How do you show continuous data distribution?
A histogram shows the distribution of a continuous variable and, since the variable is continuous, there should be no gaps between the bars. A bar chart shows the distribution of a discrete or categorical variable, and so has spaces between the bars.

Which chart would you use to show a relationship between two continuous variables?
Scatter plots are used to display the relationship between two continuous variables x and y.

Which chart would you use to show the change in a continuous variable over time?
A line graph. Line graphs are used to track changes over short and long periods of time. When smaller changes exist, line graphs are better to use than bar graphs. Line graphs can also be used to compare changes over the same period of time for more than one group.

Which type of chart is used to display the distribution of a variable?
A histogram is a chart that plots the distribution of a numeric variable's values as a series of bars.