Data transformation
An important tool for tidying data is the option to pivot tables from a wide format to a long format and viceversa. Let’s have a look at these two preliminary cases.
Wide vs Long
In the following case we have a clear example where data are not tidy.
table_colvars
## # A tibble: 4 x 4
## country `2020` `2021` `2022`
## <chr> <dbl> <dbl> <dbl>
## 1 Italy 12 9 7
## 2 France 13 10 9
## 3 Germany 11 8 10
## 4 Belgium 14 12 14
Columns do not represent different variables, they are a variable themselves. In order to use these data properly, we should move those values into a column, and therefore pivot the table from this wide format to a longer one.
We do that, by using the code below:
table_colvars %>%
pivot_longer(
cols = c(`2020`:`2022`),
names_to = "year",
values_to = "cases"
)
## # A tibble: 12 x 3
## country year cases
## <chr> <chr> <dbl>
## 1 Italy 2020 12
## 2 Italy 2021 9
## 3 Italy 2022 7
## 4 France 2020 13
## 5 France 2021 10
## 6 France 2022 9
## 7 Germany 2020 11
## 8 Germany 2021 8
## 9 Germany 2022 10
## 10 Belgium 2020 14
## 11 Belgium 2021 12
## 12 Belgium 2022 14
Now colums are consistent with a single variable representation, and rows represent indeed one single record per variable combination.
Let’s have a look at this second table, were we have the opposite case:
table_diffvars
## # A tibble: 16 x 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Italy 2020 cases 12
## 2 Italy 2020 population 34564
## 3 Italy 2021 cases 11
## 4 Italy 2021 population 34575
## 5 France 2020 cases 13
## 6 France 2020 population 7656564
## 7 France 2021 cases 10
## 8 France 2021 population 7656874
## 9 Germany 2020 cases 11
## 10 Germany 2020 population 909987
## 11 Germany 2021 cases 9
## 12 Germany 2021 population 910987
## 13 Belgium 2020 cases 14
## 14 Belgium 2020 population 87667
## 15 Belgium 2021 cases 8
## 16 Belgium 2021 population 87887
Here we have one column that actually represents two distinct variables, both indicating an integer count of different entities though. They should be two separate columns. We can do this, by pivoting the table to a wider format, as with the code below:
table_diffvars %>%
pivot_wider(
names_from = type,
values_from = count
)
## # A tibble: 8 x 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Italy 2020 12 34564
## 2 Italy 2021 11 34575
## 3 France 2020 13 7656564
## 4 France 2021 10 7656874
## 5 Germany 2020 11 909987
## 6 Germany 2021 9 910987
## 7 Belgium 2020 14 87667
## 8 Belgium 2021 8 87887
Filter
Filtering is an important step in tidying the data and sometimes also in simplifying the plots: this can be done in tidyverse with the following code:
table_diffvars %>%
pivot_wider(
names_from = type,
values_from = count
) %>%
filter(year == 2020)
## # A tibble: 4 x 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Italy 2020 12 34564
## 2 France 2020 13 7656564
## 3 Germany 2020 11 909987
## 4 Belgium 2020 14 87667
Select
we can also select specific columns
table_diffvars %>%
pivot_wider(
names_from = type,
values_from = count
) %>%
select(country,year,population)
## # A tibble: 8 x 3
## country year population
## <chr> <dbl> <dbl>
## 1 Italy 2020 34564
## 2 Italy 2021 34575
## 3 France 2020 7656564
## 4 France 2021 7656874
## 5 Germany 2020 909987
## 6 Germany 2021 910987
## 7 Belgium 2020 87667
## 8 Belgium 2021 87887
Add variables
and use mutate to add variables from calculations we have made:
table_diffvars %>%
pivot_wider(
names_from = type,
values_from = count
) %>%
mutate(
perc_cases = cases/population * 100
)
## # A tibble: 8 x 5
## country year cases population perc_cases
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Italy 2020 12 34564 0.0347
## 2 Italy 2021 11 34575 0.0318
## 3 France 2020 13 7656564 0.000170
## 4 France 2021 10 7656874 0.000131
## 5 Germany 2020 11 909987 0.00121
## 6 Germany 2021 9 910987 0.000988
## 7 Belgium 2020 14 87667 0.0160
## 8 Belgium 2021 8 87887 0.00910
Groups and summaries
Grouping the data is also an important tool to better represent the information we have in our datasets, and we can then create summaries resulting from the operations done by groups:
table_diffvars %>%
pivot_wider(
names_from = type,
values_from = count
) %>%
mutate(
perc_cases = cases/population * 100
) %>%
group_by(country) %>%
summarise(
avg_population = mean(population),
avg_cases = mean(cases)
)
## # A tibble: 4 x 3
## country avg_population avg_cases
## <chr> <dbl> <dbl>
## 1 Belgium 87777 11
## 2 France 7656719 11.5
## 3 Germany 910487 10
## 4 Italy 34570. 11.5
Arrange rows
Rows can be arranged using the arrange function:
table_diffvars %>%
pivot_wider(
names_from = type,
values_from = count
) %>%
mutate(
perc_cases = cases/population * 100
) %>%
group_by(country) %>%
summarise(
avg_population = mean(population),
avg_cases = mean(cases)
) %>%
arrange(avg_population)
## # A tibble: 4 x 3
## country avg_population avg_cases
## <chr> <dbl> <dbl>
## 1 Italy 34570. 11.5
## 2 Belgium 87777 11
## 3 Germany 910487 10
## 4 France 7656719 11.5
Basic concepts
ggplot is built as introduced in the presentation using the key components of grammar of graphics:
- a dataset
- mapping the data to an aestethics
- a statistical transformation
- a geometry (which might involve a statistics)
- scales
- facets
To see this we are going to use the dataset mpg which reports fuel data on 38 popular models of cars.
Let’s see them:
ggplot(mpg, aes(x=class, fill = class))+
geom_bar()
if we don’t use a y value, we can use the geometry “bar”which relies on counts as default statistics. Alternatively we can transform the data
mpg %>%
group_by(class) %>%
summarise(count = n()) %>%
ggplot(aes(x=class, y=count, fill = class))+
geom_bar(stat = "identity")
The result is identical, we have changed the statistics of the geometry to accept an exact value for the y-axis without calculating a count or frequency.
This allows us to do an additional step, useful for visualisation, i.e. arranging the bars
mpg %>%
group_by(class) %>%
summarise(count = n()) %>%
ggplot(aes(x=reorder(class, -count), y=count, fill = class))+
geom_bar(stat = "identity")
Another thing many people like to do is to flip the coordinates
mpg %>%
group_by(class) %>%
summarise(count = n()) %>%
ggplot(aes(x=reorder(class, -count), y=count, fill = class))+
geom_bar(stat = "identity")+
coord_flip()
We need to customise a little the plot by adding the correct labels, remembering we flipped the coordinates:
mpg %>%
group_by(class) %>%
summarise(count = n()) %>%
ggplot(aes(x=reorder(class, -count), y=count, fill = class))+
geom_bar(stat = "identity")+
coord_flip()+
labs(y = "no. cars", x="class of car", title = "Number of cars produced by class")
Now let’s have a look at other elements, like facets and use another plot type:
ggplot(mpg, aes(x=displ, y=hwy))+
geom_point(colour = "blue")
Representations
This allows us to introduce one of the three elements of visualisation seen in the introduction, and start with relationships.
Relationships
Scatterplots with fits are usually a good (and very “statistical”) way to show relationships or trends
We can modify the above plot to show the fit:
ggplot(mpg, aes(x=displ, y=hwy))+
geom_point(colour = "blue")+
geom_smooth(method = "lm", colour = "red")
If we believe a linear model (i.e. lm) is not particularly ideal here, we can use a local polynomial regression or LOESS (which stands for locally estimated scatterplot smoothing):
ggplot(mpg, aes(x=displ, y=hwy))+
geom_point(colour = "blue")+
geom_smooth(method = "loess", colour = "red")
The relationships between engine displacement and fuel consumption on the highway might be however different depending on the manufacturer. We could explore this by using faceting:
ggplot(mpg, aes(x=displ, y=hwy))+
geom_point(colour = "blue")+
geom_smooth(method = "loess", colour = "red")+
facet_wrap(~manufacturer, ncol = 3)
Here we can already see:
- some manufacturer have a smaller range of engine displacement engines
- some datapoints are not enough to fit a loess regression
We can also cheat a little on the axes to view better some data, but this is something one should be very careful about:
ggplot(mpg, aes(x=displ, y=hwy))+
geom_point(colour = "blue")+
geom_smooth(method = "loess", colour = "red")+
facet_wrap(~manufacturer, ncol = 3, scales = "free")
Patterns
Another element we have seen in the introduction is that we are pretty good at spotting patterns, and particularly trends (for example over time) Here, a basic rule is to remember we should simplify / reduce data points, to highlight the pattern better.
In this case we use the dataset on housing sales in texas.
ggplot(txhousing,
aes(x=date, y=sales))+
geom_bar(stat = "identity")
The plot is a little messy, and doesn’t allow us to see very well. We can try and simplify the plot by plotting by year, and for this we need to summarise the data a little as we have seen before:
txhousing %>%
group_by(year) %>%
summarise(avg_sales = mean(sales, na.rm = TRUE),
sum_sales = sum(sales, na.rm = TRUE)) %>%
ggplot(aes(x = year, y = sum_sales))+
geom_bar(stat = "identity")
Besides bars, lines are also a very good way in particular if we wanted to compare things: we might want to see the trend of sales over the years, and by city
ggplot(txhousing, aes(x=date, y=sales, colour = city))+
geom_line()
however, this is terribly messy. there’s a few things we can do, with the goal of reducing the complexity.
First of all, we can remove the legend:
ggplot(txhousing, aes(x=date, y=sales, colour = city))+
geom_line()+
theme(legend.position = "none")
But the plot is unreadable and there’s so many colours.
Plus, we notice also that probably large cities also stand apart, and it becomes difficult to make real comparisons. Here we stress again the relationship between the data and the visualisation. Let’s consider the proportion over the number of listings, i.e. the successful sales if we may.
txhousing %>%
mutate(proportion_sales = sales / listings * 100) %>%
ggplot(aes(x=date, y=proportion_sales, colour = city))+
geom_line()+
theme(legend.position = "none")
Clearly there’s something going on - whatever the reasons in the data - when the sales are higher than listings, let’s check it:
txhousing %>%
filter(sales > listings)
## # A tibble: 5 x 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Garland 2015 3 172 25582593 135900 146 0.8 2015.
## 2 Garland 2015 5 195 32018011 151000 177 1 2015.
## 3 Garland 2015 6 212 35772714 154800 208 1.2 2015.
## 4 Midland 2010 2 100 19880017 175400 0 0 2010.
## 5 Odessa 2008 8 150 23100000 136700 91 0.8 2009.
We don’t know the reason for this, but let’s remove these data points which seem difficult to explain without additional information:
txhousing %>%
filter(sales < listings) %>%
mutate(proportion_sales = sales / listings * 100) %>%
ggplot(aes(x=date, y=proportion_sales, colour = city))+
geom_line()+
theme(legend.position = "none")
The plot is still very messy. There’s something we could do, quite easily and it uses the highlights package to highlight only those cities which kept the majority of sales above 25% of the listings:
txhousing %>%
filter(sales < listings) %>%
mutate(proportion_sales = sales / listings * 100) %>%
ggplot(aes(x=date, y=proportion_sales, colour = city))+
geom_line()+
theme(legend.position = "none")+
gghighlight(median(proportion_sales) > 25)
But we could also highlight those below a certain threshold:
txhousing %>%
filter(sales < listings) %>%
mutate(proportion_sales = sales / listings * 100) %>%
ggplot(aes(x=date, y=proportion_sales, colour = city))+
geom_line()+
theme(legend.position = "none")+
gghighlight(median(proportion_sales) > 25 | median(proportion_sales) < 5)
If we don’t like the labels there, we can put them on a side (we need to remove the setting on legends):
txhousing %>%
filter(sales < listings) %>%
mutate(proportion_sales = sales / listings * 100) %>%
ggplot(aes(x=date, y=proportion_sales, colour = city))+
geom_line()+
gghighlight(median(proportion_sales) > 25 | median(proportion_sales) < 5, use_direct_label = FALSE)
Let’s go back to bars again, as a way to show some patterns:
mpg %>%
group_by(trans, class) %>%
summarise(
avg_hwy = mean(hwy)
) %>%
ggplot(aes(x = class, y = avg_hwy, fill = trans))+
geom_bar(stat = "identity", position = "dodge")
Now the problem with this, is that it’s too cluttered and we can’t really compare what we’d like to see. If the goal is to show the impact of car class has on miles per gallon, then we should separate the transmission types.
mpg %>%
group_by(trans, class) %>%
summarise(
avg_hwy = mean(hwy)
) %>%
ggplot(aes(x = class, y = avg_hwy, fill = trans))+
geom_bar(stat = "identity", position = "dodge")+
facet_wrap(~trans)
Now last element is that x-axis labels overlap, so we can dodge them:
mpg %>%
group_by(trans, class) %>%
summarise(
avg_hwy = mean(hwy)
) %>%
ggplot(aes(x = class, y = avg_hwy, fill = trans))+
geom_bar(stat = "identity", position = "dodge")+
facet_wrap(~trans)+
scale_x_discrete(guide = guide_axis(n.dodge=2))
Proportions
An interesting data to explore is the expanded sleep dataset on mammals (msleep)
We will discuss pie charts in the next tutorial about storytelling. Let’s have a look at a different way to represent proportions, with bubble charts.
msleep %>%
arrange(bodywt) %>%
filter(!is.na(vore)) %>%
ggplot(aes(x=sleep_rem, y=sleep_total, colour = vore))+
geom_point(alpha=0.5, aes(size = bodywt))
But we can make it prettier, and control better the size using scales:
msleep %>%
arrange(bodywt) %>%
filter(!is.na(vore)) %>%
ggplot(aes(x=sleep_rem, y=sleep_total, colour = vore))+
geom_point(alpha=0.5, aes(size = bodywt))+
scale_size(range = c(2, 30), name="Body Weight")