Communicate with Data Correctly

Find out how data visualization has evolved over the years and what you can do with it

Data Engineer working on his computer

In our daily work, at some point, we will be assessing information and trying to understand it. It is very common for work teams of all kinds to use metrics to evaluate how their work has evolved or if they are close to reaching a goal. When looking at data and striving to understand it, the use of a correct data display is key. And that is what we are going to talk about in this blog post.

I have been studying and working with data management as a Data Engineer for several years, and part of that includes finding the best way to visualize the data in question. However, we don’t always succeed in displaying and visualizing data in an easily understandable way. 

As software developers, we are often less qualified to tell stories with data. In our education we learn a lot about language and how to write papers and articles, and are also provided lectures on numbers and mathematics. But we are never really taught how to present data or quantifiable information in an easy and simple way. For me, learning more about the world of data visualization brought these two areas of knowledge together.

The evolution in data displays

Lately in the software world, we’ve seen a strong trend toward apps that display data in dashboard-style views, presenting information through graphics such as these:

Collaborative dashboard software mockup.

But the first data visualization tools appeared long ago, in a far more analog form than the ones we know today.

One of the earliest data display formats we know of is the map. The following image shows a world map from 1570 depicting the trade routes of that time. Its creator, Abraham Ortelius, published an atlas of 53 maps as a first effort to bring together the world’s cartographic knowledge.

Abraham Ortelius, Public domain, via Wikimedia Commons

Then, in 1645, a mathematician sent a letter to the Queen of Spain giving her what is known as the world’s first statistical data chart. In this letter he reported mistakes he identified in the ranges of longitude on the world map.

Section of a letter by Langren to Isabella Clara Eugenia, 1628

In the graphic he included in the letter, he wanted to demonstrate the correct estimate of the distance between Toledo and Rome and to show how it differed from the maps known at that time. He could simply have listed the longitude values in a table, but he realized that this form would not achieve his intended purpose.

More than 250 years ago, a history teacher wanted to summarize all the time periods and dynasties in human history. To communicate this to his students and make it easier for them to remember, he created a timeline, one of his first important contributions to the world of data visualization.

Joseph Priestley (1733-1804), Public domain, via Wikimedia Commons

Like these, there are several other interesting examples of data visualization that communicate and display information more quickly.

In this one, Charles Joseph Minard, a French civil engineer, wanted to illustrate the cost of Napoleon’s 1812 campaign in Russia. The graphic shows the number of men who started the campaign and how that number shrank as they advanced into Russian territory and battles took place. It also shows a line marking the retreat, and how the number of men who returned from the war is much lower than the starting number.

https://commons.princeton.edu/mg/minard-carte-figurative/

This next image refers to an outbreak of cholera in London in 1854. At first, there were many hypotheses about how the disease was transmitted. Many said it spread through the air, but Dr. John Snow refused to accept this idea. To prove his theory that the illness was associated with the use of something contaminated, he marked the place where each cholera victim had died with a small line on a map of the neighborhood. This record led to what is often described as the first heat map in history. Eventually, it became clear that the area where the lines were concentrated contained a contaminated water pump, which explained the outbreak.

Joseph Priestley (1733-1804), Public domain, via Wikimedia Commons

Why do we display data?

Data visualization is used to explore the information we have and see how it behaves, as well as understand why events occur in a certain way.

There are different types of data visualizations:

Quantity Visualization

Examples: bar charts, heat maps

Example of quantity visualization charts
https://clauswilke.com/dataviz/directory-of-visualizations.html
  • They are used to indicate the best or worst in a category by comparing two or more points, to compare performance against a target or goal, and to show what has or has not progressed over a certain period of time.
  • They are useful for showing similarities and differences in a straightforward way.

Distribution Visualization

Examples: histogram, density plot, box plots

Examples of distribution visualization charts
https://clauswilke.com/dataviz/directory-of-visualizations.html
  • They are used to indicate the highest, middle, and lowest values.
  • They are recommended for finding out whether something stands out from the rest, and for revealing outliers, the shape of the distribution, frequencies, ranges, etc.

Proportion Visualization

Examples: pie charts, bars, grouped bars, mosaic plots, etc.

Examples of proportion visualization charts
https://clauswilke.com/dataviz/directory-of-visualizations.html
  • They indicate which parts make up the whole and serve to highlight which is larger or smaller, which is similar or different, etc.
  • They are highly recommended for showing summaries, similarities, anomalies, percentages of the total, etc.

X-Y Relationship Visualization

Examples: line graph, bubble chart, 2D bins, etc

Examples of X-Y relationship visualization charts
https://clauswilke.com/dataviz/directory-of-visualizations.html
  • They are useful for asking whether the relationship between two numerical variables is positive, negative, or neither, or for understanding how one X value or group is related to another Y value or group.
  • They are used to show outliers, correlations, and positive or negative relationships between two or more variables.
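
As a rough sketch of the four families above, the following R snippet builds one chart of each type with ggplot2, the library used for the code later in this post. The datasets mtcars and faithful ship with R; the choice of columns and chart types is only an illustration, not a recommendation for every case.

```r
library(ggplot2)

# Quantity: number of cars per cylinder count (bar chart)
p_quantity <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Number of cars")

# Distribution: waiting time between Old Faithful eruptions (histogram)
p_distribution <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(bins = 20)

# Proportion: share of cars per cylinder count (one stacked bar scaled to 100%)
p_proportion <- ggplot(mtcars, aes(x = "", fill = factor(cyl))) +
  geom_bar(position = "fill") +
  labs(x = NULL, y = "Share of cars", fill = "Cylinders")

# X-Y relationship: car weight vs. fuel efficiency (scatter plot)
p_xy <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
```

Printing any of these objects, for example p_xy, draws the chart.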

Guidelines for good visualization:

  1. Understand the context: who is the audience for the data? Do they have technical knowledge about the topic? How much time do we have to show this data, and where are we going to show it? Will it be a face-to-face meeting for a hundred people using a projector or is it a one-on-one Zoom video call? What information is most essential? 
  2. Choose an appropriate visualization: we can display data through text, tables, and graphics, among other options. Depending on the information and what we want to highlight, it is important to choose the most effective visuals.
  3. Remove clutter and focus the reader’s attention where we want it through the choices we make when displaying the data: what shapes or colors we use, what elements we highlight, which ones we can take out because they don’t add value, etc.
  4. Try to tell a story to lead the reader toward what we want to reveal.

To discover the best way to display data, I recommend testing, experimenting, and making mistakes. Only by trying different alternatives and comparing different visualization styles can you evaluate which one really suits the guidelines and the situation. 

It is important to help the reader by highlighting the most important info and taking them where we want to go. For example, if we want to highlight the number 3 in this image, why not make it a different color than the rest?

https://www.storytellingwithdata.com/books
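
The same idea carries over to charts. The snippet below is a minimal sketch that colors a single bar differently from the rest. The data frame, its values, and the choice of category "C" as the one to highlight are hypothetical and exist only for illustration.

```r
library(ggplot2)

# HYPOTHETICAL category values, invented for illustration
df <- data.frame(category = c("A", "B", "C", "D"), value = c(12, 7, 15, 9))

# Flag the single element the reader should look at first
df$highlight <- df$category == "C"

p_highlight <- ggplot(df, aes(x = category, y = value, fill = highlight)) +
  geom_col() +
  # One accent color for the highlighted bar, neutral grey for everything else
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "grey80")) +
  theme_classic() +
  theme(legend.position = "none")
```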

However, balance is also essential. Avoid helping too much: do not rearrange things that we normally associate with a specific order or format. In the example below, the bars are sorted by size with the aim of making it easier for the viewer, but because this mixes up the age ranges, the information is easy to misinterpret:

Data visualization example
https://clauswilke.com/dataviz/
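
To make the difference concrete, here is a small sketch in R. The age ranges and counts are hypothetical. The first plot sorts the bars by value, which scrambles the age ranges; the second fixes the factor levels so the ranges keep their natural order.

```r
library(ggplot2)

# HYPOTHETICAL counts per age range, invented for illustration
ages <- data.frame(
  range = c("0-17", "18-34", "35-54", "55+"),
  n     = c(40, 95, 120, 60)
)

# Sorted by value: the x-axis reads 35-54, 18-34, 55+, 0-17
p_sorted <- ggplot(ages, aes(x = reorder(range, -n), y = n)) +
  geom_col()

# Natural order: set the factor levels explicitly
ages$range <- factor(ages$range, levels = c("0-17", "18-34", "35-54", "55+"))
p_natural <- ggplot(ages, aes(x = range, y = n)) +
  geom_col()
```

Sorting by value is still a good default for categories that have no inherent order, such as product names or countries.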

Typical mistakes:

  • Overload of information: by trying to show too many things at once, you end up not getting anything across.
  • Overusing pie charts: these are useful if we want to get a notion of parts vs. the whole. They are not recommended for measuring growth, for example, because it is difficult for the human eye to compare and evaluate angles.
  • Choosing color palettes that do not correctly reflect the differences you want to highlight.
  • Not respecting scales or not showing the whole timeline.
  • Showing variations instead of absolute numbers, or the other way around, can also give the wrong picture.
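
The last point, variations versus absolute numbers, can be checked with simple arithmetic. In this sketch the two teams and their ticket counts are hypothetical: the small team doubles its volume, yet the large team's increase is thirty times bigger in absolute terms.

```r
# HYPOTHETICAL monthly ticket counts for two teams, invented for illustration
before <- c(small_team = 2, large_team = 200)
after  <- c(small_team = 4, large_team = 260)

absolute_change <- after - before                    # 2 and 60 tickets
percent_change  <- (after - before) / before * 100   # 100% and 30%
```

A chart showing only percent_change would suggest the small team changed the most, while one showing only absolute_change would suggest the opposite.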

Good practices:

  • Never use 3D graphics. It is difficult to draw the imaginary lines or planes needed to find where a point intersects the axes. To compare three dimensions, it is better to use more than one graph than a single 3D one.
  • In a correlation or an average, show the points. Otherwise, the reader has no way of seeing which values produced that average.
  • Inverted axes can be confusing. Follow the logic of how humans read and interpret information.
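
A well-known illustration of why the points matter is Anscombe's quartet, which ships with R as the anscombe dataset. The sketch below is one possible way to explore it: the four sets have nearly the same mean and correlation, but plotting the points shows how different they are.

```r
library(ggplot2)

# anscombe ships with R: four x/y pairs with nearly identical summary statistics
means_y <- sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]]))
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(means_y, 2)  # all four means are close to 7.5
round(cors, 2)     # all four correlations are close to 0.82

# Reshape to long form so that each set gets its own panel
long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    set = paste("Set", i),
    x   = anscombe[[paste0("x", i)]],
    y   = anscombe[[paste0("y", i)]]
  )
}))

# Same fitted line in every panel, very different point clouds
p_points <- ggplot(long, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ set)
```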

There’s nothing like an example:

Let me show you how to use data visualization effectively with an example adapted from the books that I recommend for studying data visualization:

In order to make the post more insightful, we will replicate the graphics and show code examples using R for its simplicity and popularity in data analysis. However, it is possible to find many more examples for different languages at http://chartmaker.visualisingdata.com/.
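
The snippets in this example work on a data frame called dataTickets with the columns MonthNum, Month, Type, and Value. The original data is not included in this post, so here is a minimal sketch of how such a data frame could be built. The numbers and the Type labels are made up for illustration only; they are not the real ticket counts behind the images.

```r
# HYPOTHETICAL values, invented for illustration; not the data behind the figures
received  <- c(150, 170, 230, 150, 180, 160, 130, 200, 160, 140, 150, 180)
processed <- c(150, 170, 225, 148, 178, 150, 125, 155, 125, 105, 125, 140)

dataTickets <- data.frame(
  MonthNum = rep(1:12, times = 2),
  # A factor with calendar-ordered levels keeps the x-axis from sorting alphabetically
  Month    = factor(rep(month.abb, times = 2), levels = month.abb),
  Type     = rep(c("Received", "Processed"), each = 12),
  Value    = c(received, processed)
)
```

Each row holds one month and one series, which is the long format that ggplot2 expects when a column such as Type is mapped to color or fill.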

Data visualization example
				
library(dplyr)    # provides the %>% pipe
library(ggplot2)
dataTickets %>%
  ggplot(aes(x = factor(MonthNum), y = Value, fill = Type)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  ggtitle("Tickets")
				
			

In this image we can see the number of tickets received and processed by a support team. This team leader wants to demonstrate that the resignation of a team member has significantly affected the ticket processing capacity. However, as we can see in the image, it is not very easy to see a trend change in the number of tickets processed vs. received.

This is due to the fact that we are using bars instead of a line graph. If we change the data visualization format, we can showcase the trend better:

Data visualization example
				
dataTickets %>%
  ggplot(aes(x = factor(MonthNum), y = Value, col = Type)) +
  geom_point() +
  # group = Type draws one line per series across the discrete x-axis
  geom_line(aes(group = Type)) +
  # start both axes at zero
  expand_limits(x = 0, y = 0) +
  ggtitle("Tickets")

				
			

As you can see, it is much clearer that there was a “cut” in the trend from May (the month the member resigned). However, this graph could be improved further:

				
dataTickets %>%
  ggplot(aes(x = Month, y = Value, col = Type)) +
  # ggplot2 3.4.0 and later prefer linewidth over size for lines
  geom_line(aes(group = Type), size = 1.2) +
  ggtitle("Tickets") +
  # drop the grey background and grid lines
  theme_classic() +
  expand_limits(x = 0, y = 0) +
  # points and value labels only from August on
  geom_point(data = subset(dataTickets, Month %in% c("Aug", "Sep", "Oct", "Nov", "Dec"))) +
  geom_text(data = subset(dataTickets, Month %in% c("Aug", "Sep", "Oct", "Nov", "Dec")), aes(label = Value), vjust = -1, size = 3, color = "black")

				
			

The rationale for the changes: 

  • We do not need to compare exact values, so we removed the background and the unnecessary grid lines.
  • We added a point and its value (in a different color for easier reading) for each month from August on, where the biggest difference is.
  • We replaced the numbers on the X-axis with abbreviated months to make it easier for users to read.
				
library(directlabels)  # provides geom_dl(), used below for the line labels
dataTickets %>%
  ggplot(  aes (x= Month, y= Value, col=Type)) +
  geom_line(aes(group = Type), size= 1.2) +
  ggtitle( 'Tickets') + 
  theme_classic()  +
  expand_limits(x = 0, y = 0) +
  geom_point(data = subset( dataTickets , Month %in% c("Aug","Sep","Oct","Nov","Dec")), size=2) +
  geom_text(data = subset( dataTickets , Month %in% c("Aug","Sep","Oct","Nov","Dec")), aes(label = Value), vjust = -1  , size =3, color = 'black') +
  theme(legend.position = "none") + 
  scale_color_manual(values=c('steelblue', 'grey') ) + 
  geom_vline(xintercept = 'May',  color='dimgray', size=0.5) +
  labs(title = "Please approve the hire of 2 FTEs \nto backfill those who quit in the past.",
         subtitle = "Ticket volume over time",
         caption = "Data source: Tickets Computer System, as of 12/31/2014. Questions? Please write to aaaa@bbb.com",
         x= "Year 2014",
         y= "Number of Tickets"
       )+
  theme(
    plot.title = element_text( size = 18, face = "bold", colour = 'dimgray'),     
    plot.subtitle = element_text(size = 14, colour = 'dimgray'),           
    plot.caption = element_text(hjust = 0, face = "italic", colour = 'dimgray')
  ) + 
  geom_dl(aes(label = Type),method = list("last.points",   hjust = -.1)) +  
  theme(plot.margin = margin(1, 5, 1, 1, "cm")) +
  coord_cartesian(clip = "off") +
  geom_text( aes(  hjust = 0 , x = 'Jun', y = 50,  
                label = "2 employees quit in May. We nearly kept up with the incoming\nvolume in the following two months, but fell behind the increase\nin Aug and haven't been able to catch up since.")  
             , nudge_x = -0.8, color='grey',  size=2.5) 

				
			

Finally, we made a few other improvements: 

  • Reduced the colors to a single accent color plus grey, since extra colors can draw attention to areas we don’t want.
  • Replaced the classic legend with labels at the end of each line to make the reading even easier.
  • Added an eye-catching title, a subtitle, and a caption, and also improved the axis labels.
  • Added a vertical line at the break point, with a short description next to it.

If we compare the first graph with the last one, the last one communicates our message better and can spark much more interest in our audience. And that’s what data visualization should be all about.
