Let's say you need to understand how your data changes within a day, and between different days. For example, you might have hourly pollution data that follows a regular pattern throughout a day, but a different pattern on a Wednesday than on a Saturday. Functional analysis is one approach to doing just that. During my PhD, I developed methods based on functional analysis to identify outlier demand in booking patterns for trains in railway networks. To demonstrate that those statistical methods are also applicable in other areas, I started analysing air pollution data across the United Kingdom. This blog post will discuss the idea of using functional analysis to model air pollution data, with the aim of identifying abnormal pollution days.

## Introducing the data

The data comes from DEFRA's (Department for Environment, Food and Rural Affairs) Automatic Urban and Rural Network (AURN), which reports the level of nitrogen dioxide (among other pollutants) in the air every hour at 164 different locations. For this analysis, we consider data recorded between November 6, 2018 and November 6, 2022. The data can be downloaded from uk-air.defra.gov.uk/data/data_selector_service.

Show code: reading in data

```r
# Load R packages
library(tidyverse)
library(PrettyCols)
library(fda)
library(usefunc)    # remotes::install_github("nrennie/usefunc")
library(funcnetout) # remotes::install_github("nrennie/funcnetout")

# Read in data (the assignment and read_csv() call were garbled in
# extraction and are reconstructed here; the path is a placeholder for
# the CSV downloaded from the DEFRA data selector linked above)
pollution <- read_csv("path-to-downloaded-data.csv") %>%
  # drop the unnamed columns that read_csv() labels "...1", "...2", etc.
  select(-starts_with("...")) %>%
  # drop the first and last rows (non-data rows in the DEFRA export)
  slice(-1) %>%
  slice(1:(n() - 1)) %>%
  # recode "No data" entries as NA, then convert readings to numeric
  mutate(across(c(Aberdeen:`York Fishergate`),
                ~ case_when(.x == "No data" ~ NA_character_,
                            TRUE ~ as.character(.x)))) %>%
  mutate(across(c(Aberdeen:`York Fishergate`), as.numeric)) %>%
  # parse dates and times
  mutate(DateTime = ymd_hms(paste(Date, Time)), .after = 2) %>%
  mutate(Date = ymd(Date), Time = hms(Time))
```

For this example, let's focus on a single station (we might come back to how to deal with multiple stations in a later blog post…). Here, let's choose the Aberdeen Wellington Road station to analyse…

Show code: selecting a station

```r
Aberdeen_WR <- pollution %>%
  select(c(Date, Time, DateTime, `Aberdeen Wellington Road`)) %>%
  rename(`NO2` = `Aberdeen Wellington Road`)
```

… and have a little look at what the data looks like. It's a little bit difficult to see in this plot, but there are a few missing values. Not every hour of every day has a recorded value for the level of nitrogen dioxide: there are 55 days with at least one missing value.

Show code: plotting the data

```r
ggplot(data = Aberdeen_WR,
       mapping = aes(x = DateTime, y = NO2)) +
  geom_line(linewidth = 0.1,
            colour = alpha(prettycols("Purples")[2], 0.5)) +
  labs(x = "", y = "Hourly Nitrogen Dioxide Levels") +
  theme_minimal() +
  theme(axis.text.y = element_text(margin = margin(l = 5)),
        plot.margin = unit(c(0.2, 0.5, 0.2, 0.2), unit = "cm"))
```

There are a few different approaches to dealing with missing data, and we'll use two of them here, depending on how many values are missing. If more than 10 hours of data are missing for a single day, we discard that entire day and exclude it from the data set. Only six days in the four-year period meet this criterion, so we're not losing too much data. For the remaining days, we use mean imputation.
For example, if a value is missing from 04:00 on a specific day, we'll use the mean of the non-missing values recorded at 04:00 on every other day.

Show code: dealing with missing data

```r
# days with missing values
missing <- Aberdeen_WR %>%
  group_by(Date) %>%
  summarise(num_missing = sum(is.na(NO2))) %>%
  filter(num_missing > 0) %>%
  arrange(desc(num_missing))

# which days are missing >= 10 hours of data
too_many_missing <- missing %>%
  filter(num_missing >= 10)

# remove days with too much missing data
# (%notin% comes from the {usefunc} package loaded earlier)
Aberdeen_WR <- Aberdeen_WR %>%
  filter(Date %notin% too_many_missing$Date)

# mean imputation for the others: average the non-missing values
# recorded at the same time of day
avg_hour <- Aberdeen_WR %>%
  group_by(Time) %>%
  summarise(avg = mean(NO2, na.rm = TRUE))

Aberdeen_WR <- Aberdeen_WR %>%
  left_join(avg_hour, by = "Time") %>%
  mutate(NO2 = case_when(is.na(NO2) ~ avg,
                         TRUE ~ NO2)) %>%
  select(-avg)
```

## Functional approaches to modelling

Now that we've tidied up the data and dealt with the missing values, it's time to get stuck into the modelling process! First of all, let's have another look at our data, but in a slightly different way:

Show code: plotting the data

```r
ggplot(data = Aberdeen_WR,
       mapping = aes(x = hour(Time), y = NO2)) +
  geom_line(mapping = aes(group = Date),
            linewidth = 0.1,
            colour = alpha(prettycols("Purples")[4], 0.2)) +
  geom_smooth(colour = prettycols("Purples")[1], linewidth = 2) +
  labs(x = "", y = "Hourly Nitrogen Dioxide Levels") +
  scale_x_continuous(name = "Time of day",
                     limits = c(1, 24),
                     breaks = seq(2, 24, by = 2),
                     labels = c("02:00", "04:00", "06:00", "08:00",
                                "10:00", "12:00", "14:00", "16:00",
                                "18:00", "20:00", "22:00", "00:00")) +
  theme_minimal() +
  theme(axis.text.y = element_text(margin = margin(l = 5)),
        axis.text.x = element_text(margin = margin(t = 5)),
        plot.margin = unit(c(0.2, 0.5, 0.2, 0.2), unit = "cm"),
        panel.grid.minor = element_blank())
```

The nitrogen dioxide levels follow a similar pattern each day, with higher levels around rush hour in the morning and evening. Functional analysis treats the pollution pattern for each day as a single observation of a function over time. This allows us to compare the patterns on different days, without complications from the varying pollution levels within each day.
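To make that idea concrete, here is a minimal sketch of building a functional data object from the tidied data using the {fda} package loaded earlier. The reshaping step and the basis choice are assumptions for illustration (a Fourier basis suits periodic within-day patterns), not necessarily the setup used in the analysis itself.

```r
# Reshape the hourly data so that each column of no2_mat holds one
# day's 24 observations (columns follow the order in which the dates
# first appear in Aberdeen_WR)
no2_mat <- Aberdeen_WR %>%
  mutate(hr = hour(Time)) %>%
  select(Date, hr, NO2) %>%
  pivot_wider(names_from = Date, values_from = NO2) %>%
  arrange(hr) %>%
  select(-hr) %>%
  as.matrix()

# Smooth each daily curve with a Fourier basis; nbasis = 9 is an
# arbitrary illustrative choice
hrs <- sort(unique(hour(Aberdeen_WR$Time)))
day_basis <- create.fourier.basis(rangeval = range(hrs), nbasis = 9)
no2_fd <- Data2fd(argvals = hrs, y = no2_mat, basisobj = day_basis)

# Plot one smoothed curve per day
plot(no2_fd, xlab = "Hour of day", ylab = "NO2")
```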
We still need to deal with the seasonal and trend components of the functional data. For example, since the higher levels of pollution around 8am and 5pm are most likely caused by people travelling to work, this raises the question of whether pollution levels are lower on weekends, when there are generally fewer people travelling to work. Are there different pollution patterns on different days of the week?

Show code: plotting the data

```r
Aberdeen_WR %>%
  mutate(hr = hour(Time),
         wday = wday(Date, label = TRUE)) %>%
  group_by(wday, hr) %>%
  summarise(avg = mean(NO2)) %>%
  ggplot(mapping = aes(x = hr, y = avg)) +
  geom_line(mapping = aes(colour = wday), linewidth = 0.1) +
  labs(x = "", y = "Hourly Nitrogen Dioxide Levels") +
  scale_x_continuous(name = "Time of day",
                     limits = c(1, 24),
                     breaks = seq(2, 24, by = 2),
                     labels = c("02:00", "04:00", "06:00", "08:00",
                                "10:00", "12:00", "14:00", "16:00",
                                "18:00", "20:00", "22:00", "00:00")) +
  scale_colour_brewer(palette = "Dark2") +
  guides(colour = guide_legend(nrow = 1)) +
  theme_minimal() +
  theme(axis.text.y = element_text(margin = margin(l = 5)),
        axis.text.x = element_text(margin = margin(t = 5)),
        plot.margin = unit(c(0.2, 0.5, 0.2, 0.2), unit = "cm"),
        panel.grid.minor = element_blank(),
        legend.title = element_blank(),
        legend.position = "top")
```

Visualising the average daily patterns suggests that pollution patterns on weekdays are generally very similar, with Saturdays and Sundays lower on average but still different from each other. To test this formally, a functional ANOVA test can be applied. This tests whether the average pollution pattern on a weekday is significantly different from the average pollution pattern on a Saturday or Sunday. The test returns the probability that, if the underlying patterns were equal, the observed patterns would be at least this different. For this analysis, that probability was 0, so we can conclude there is a significant difference between pollution levels on weekdays and those on Saturdays and Sundays.

We can either analyse Saturdays, Sundays, and weekdays separately, or fit a model to remove the seasonality and analyse the residuals instead. A simplified version of the regression model might look something like this:

$$ y_{n} = I_{weekday, n} + I_{Saturday, n} + e_{n} $$

where $I_{weekday, n}$ is an indicator variable that equals 1 if day $n$ is a weekday and 0 otherwise, $I_{Saturday, n}$ is the analogous indicator for Saturdays, and $e_{n}$ is the residual for day $n$, which we can then screen for abnormal pollution days.
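The post doesn't include the functional ANOVA code. As a concrete stand-in (not necessarily the test used here), the already-loaded {fda} package offers a permutation t-test for comparing two groups of curves, which covers a weekday-versus-weekend comparison; the sketch below assumes the no2_fd object built earlier.

```r
# Hypothetical sketch: permutation t-test comparing weekday and weekend
# curves (a two-group stand-in for the functional ANOVA described above);
# assumes no2_fd columns follow the order of unique(Aberdeen_WR$Date)
dates <- unique(Aberdeen_WR$Date)
is_weekend <- wday(dates) %in% c(1, 7)  # lubridate default: 1 = Sunday, 7 = Saturday
tperm.fd(no2_fd[!is_weekend], no2_fd[is_weekend], nperm = 500)
```

Similarly, since the simplified model contains only day-type indicators, fitting it amounts to estimating a mean pattern per day type and keeping the residuals. A minimal sketch, assuming weekdays, Saturdays, and Sundays as the three day types:

```r
# Remove the day-type seasonality: subtract the mean pattern for each
# day type, hour by hour, and keep the residuals e_n
Aberdeen_WR_resid <- Aberdeen_WR %>%
  mutate(day_type = case_when(
    wday(Date) == 7 ~ "Saturday",
    wday(Date) == 1 ~ "Sunday",
    TRUE ~ "weekday"
  )) %>%
  group_by(day_type, Time) %>%
  mutate(resid = NO2 - mean(NO2)) %>%
  ungroup()
```

The residual series is then what we would feed into an outlier-detection method, such as those in the {funcnetout} package loaded at the start.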