If you are starting a new session, you will need to follow a few simple steps to get data and functions ready to use in R. Here is a basic workflow.
Open a new or saved script that you will use to type code (.R file extension). This option is under the File menu at the top in both the basic R install and RStudio.
Load the R packages that you will use. The functions in a package are not available unless the package is loaded, and the package must first be installed (see below for package installation). Put the library() calls for the packages you use at the top of your script.
Set the working directory. This is where R will load and save files.
Load any workspace that you were using previously. This is an R-specific file format (.RData) that contains any and all objects from a previous workspace. Ideally, you will have saved a workspace from your previous session.
Alternatively, use RStudio projects to organize your data and workspace.
Here is a sample script of this workflow.
# my startup script
# load any installed packages
library(SWMPr)
library(ggplot2)
# set the working directory
# change as needed
setwd('C:/my_files')
# load your previous workspace if you saved one
# this must be in the working directory; otherwise use the full path
load(file = 'my_workspace.RData')
# check what you loaded
ls()
Other data formats can also be imported or loaded into R. Flat text files and comma-separated files are commonly used, although R can import many other types. Use the following functions to import text or csv data, and don’t forget to assign your data to an object.
# import data and assign
# data are in the working directory, or use a full path
# import a csv file
dat <- read.csv('my_data.csv', header = TRUE)
# import a text file, separated by commas
dat <- read.table('my_data.txt', header = TRUE, sep = ',')
Make sure you save your work when you close a session! Save the script you’re working with using the File menu at the top. You will also want to save your data. Just as data of different types can be imported, data can also be saved in different formats. You will either want to save the whole workspace or individual parts (e.g., a data.frame as a .csv file).
# save the whole workspace as a .RData file
# will be saved in the working directory
save(list = ls(), file = 'my_workspace.RData')
# save one object (e.g., dat) as a .RData file
save(dat, file = 'my_data.RData')
# save as .csv
write.csv(dat, 'my_data.csv')
# save as text file
write.table(dat, 'my_data.txt', sep = ',', row.names = FALSE, quote = FALSE)
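A related option for single objects is saveRDS(), which saves one object to an .rds file that you can assign to any name when reading it back. A minimal sketch, using the dat object from above:
# save a single object as .rds
saveRDS(dat, 'my_data.rds')
# read it back under a new name
dat2 <- readRDS('my_data.rds')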
R installs and loads packages from its library on your computer. You can see where your library is with .libPaths(). Packages that you install from CRAN or elsewhere will go in this library, and R will also look here when you load a package to see if it is installed. Most packages will be downloaded from CRAN, although you can also install packages from GitHub or Bioconductor. To install from GitHub, you will first have to install and load the devtools package from CRAN. Here are some basics for installing and loading a package.
# install a package from CRAN
install.packages('ggplot2')
# install a package from GitHub (requires devtools)
install.packages('devtools')
library(devtools)
install_github('fawda123/SWMPr')
library(SWMPr)
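As a quick check of the library lookup described above, you can print the library locations and guard an install with a test for whether the package is already installed. A minimal sketch:
# see where R looks for installed packages
.libPaths()
# install only if not already in the library
if (!requireNamespace('ggplot2', quietly = TRUE)) {
  install.packages('ggplot2')
}
library(ggplot2)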
You can view all the keyboard shortcuts in RStudio by clicking Help, then Keyboard Shortcuts on the top menu. Here are some common shortcuts for Windows/Linux (* denotes RStudio only).
run selection *: Ctrl+Enter
clear console: Ctrl+L
select all: Ctrl+A
copy: Ctrl+C
cut: Ctrl+X
paste: Ctrl+V
open document: Ctrl+O
save document: Ctrl+S
switch to source *: Ctrl+1
switch to console *: Ctrl+2
stop current execution: Esc
skip to beginning of line: Home
skip to end of line: End
If all else fails, a Google search will usually point you in the right direction. All of the documentation that comes with R and its packages is available online, and a quick search for a function or package will lead you to it. You can also access the help files on your computer from within R.
# access a help file for a function
help(mean)
# or do this
?mean
# run the examples in the help file
example(mean)
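If you don’t know the exact function name, you can also search the installed help files by keyword:
# search installed help files by keyword
help.search('linear model')
# shorthand for the same search
??'linear model'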
See also this lesson on data munging, QA/QC, and cleaning: https://nceas.github.io/oss-lessons/data-munging-qa-qc-cleaning/data-munging-qa-qc-cleaning.html
See what’s in your working directory, import a dataset, and view some data summaries.
# print your working directory
getwd()
## [1] "C:/proj/wq_data_analysis"
# what's in the working directory
dir()
## [1] "data" "README.md"
## [3] "wq_data_analysis.html" "wq_data_analysis.Rmd"
## [5] "wq_data_analysis.Rproj"
# what's in the data file
dir('data')
## [1] "pe_dat.csv" "pe_dat.RData" "sf_dat.csv" "sf_dat.RData"
## [5] "tb_dat.csv" "tb_dat.RData"
# import tampa bay data as csv, view summaries
tb_dat <- read.csv('data/tb_dat.csv', stringsAsFactors = FALSE)
dim(tb_dat)
## [1] 22060 7
names(tb_dat)
## [1] "seg" "stationid" "lat" "lon"
## [5] "chla_ugl" "salinity_ppt" "date"
class(tb_dat)
## [1] "data.frame"
str(tb_dat)
## 'data.frame': 22060 obs. of 7 variables:
## $ seg : chr "Hillsborough Bay" "Hillsborough Bay" "Hillsborough Bay" "Hillsborough Bay" ...
## $ stationid : int 55 52 44 80 7 70 71 6 73 8 ...
## $ lat : num 27.8 27.9 27.9 27.8 27.9 ...
## $ lon : num -82.4 -82.4 -82.5 -82.4 -82.5 ...
## $ chla_ugl : num 33 28 24 11 19 58 73 53 17 46 ...
## $ salinity_ppt: num 23.1 22.6 20.4 23.9 21 21.2 23.3 20.4 23.9 20.4 ...
## $ date : chr "1974-01-01" "1974-01-01" "1974-01-01" "1974-01-01" ...
summary(tb_dat)
## seg stationid lat lon
## Length:22060 Min. : 6.00 Min. :27.57 Min. :-82.78
## Class :character 1st Qu.:28.00 1st Qu.:27.72 1st Qu.:-82.62
## Mode :character Median :52.00 Median :27.85 Median :-82.57
## Mean :51.72 Mean :27.82 Mean :-82.57
## 3rd Qu.:71.00 3rd Qu.:27.92 3rd Qu.:-82.48
## Max. :96.00 Max. :27.99 Max. :-82.41
## chla_ugl salinity_ppt date
## Min. : 0.000 Min. : 4.10 Length:22060
## 1st Qu.: 3.873 1st Qu.:23.60 Class :character
## Median : 7.032 Median :26.90 Mode :character
## Mean : 10.100 Mean :26.65
## 3rd Qu.: 12.500 3rd Qu.:30.00
## Max. :333.400 Max. :48.65
head(tb_dat)
## seg stationid lat lon chla_ugl salinity_ppt
## 1 Hillsborough Bay 55 27.8493 -82.4314 33 23.1
## 2 Hillsborough Bay 52 27.8970 -82.4382 28 22.6
## 3 Hillsborough Bay 44 27.9237 -82.4807 24 20.4
## 4 Hillsborough Bay 80 27.8096 -82.4460 11 23.9
## 5 Hillsborough Bay 7 27.8589 -82.4686 19 21.0
## 6 Hillsborough Bay 70 27.9089 -82.4632 58 21.2
## date
## 1 1974-01-01
## 2 1974-01-01
## 3 1974-01-01
## 4 1974-01-01
## 5 1974-01-01
## 6 1974-01-01
tail(tb_dat)
## seg stationid lat lon chla_ugl salinity_ppt
## 22055 Lower Tampa Bay 91 27.6279 -82.6415 8.9 26.90
## 22056 Lower Tampa Bay 95 27.6112 -82.6947 11.9 28.99
## 22057 Lower Tampa Bay 93 27.5789 -82.7441 5.7 32.13
## 22058 Lower Tampa Bay 90 27.6260 -82.5915 10.8 27.68
## 22059 Lower Tampa Bay 96 27.6378 -82.6910 10.3 28.43
## 22060 Lower Tampa Bay 94 27.6100 -82.7832 6.0 31.77
## date
## 22055 2012-09-01
## 22056 2012-09-01
## 22057 2012-09-01
## 22058 2012-09-01
## 22059 2012-09-01
## 22060 2012-09-01
You can easily manipulate data using packages in the tidyverse. The tidyverse is a collection of packages for data import, manipulation, and plotting. See here for a cheatsheet on data manipulation.
# install only once
install.packages('tidyverse')
# load every time
library(tidyverse)
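The examples below use the pipe operator %>% (loaded with the tidyverse), which passes the object on its left as the first argument of the function on its right. A minimal sketch with the tb_dat object from above:
# these two lines are equivalent
head(select(tb_dat, stationid, date))
tb_dat %>% select(stationid, date) %>% head()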
Some basic data manipulation: selecting or removing columns, filtering by row, and creating new columns:
# select specific columns
tb_sub <- tb_dat %>%
select(stationid, chla_ugl, date)
head(tb_sub)
## stationid chla_ugl date
## 1 55 33 1974-01-01
## 2 52 28 1974-01-01
## 3 44 24 1974-01-01
## 4 80 11 1974-01-01
## 5 7 19 1974-01-01
## 6 70 58 1974-01-01
# keep only observations for station 55
tb_sub <- tb_sub %>%
filter(stationid == 55)
head(tb_sub)
## stationid chla_ugl date
## 1 55 33.0 1974-01-01
## 2 55 16.0 1975-01-01
## 3 55 15.0 1977-01-01
## 4 55 24.0 1978-01-01
## 5 55 37.2 1979-01-01
## 6 55 11.7 1980-01-01
# create a new column
tb_sub <- tb_sub %>%
mutate(
lnchla = log(chla_ugl)
)
head(tb_sub)
## stationid chla_ugl date lnchla
## 1 55 33.0 1974-01-01 3.496508
## 2 55 16.0 1975-01-01 2.772589
## 3 55 15.0 1977-01-01 2.708050
## 4 55 24.0 1978-01-01 3.178054
## 5 55 37.2 1979-01-01 3.616309
## 6 55 11.7 1980-01-01 2.459589
# convert a character column to a date and arrange
tb_sub <- tb_sub %>%
mutate(
date = as.Date(date, format = '%Y-%m-%d')
) %>%
arrange(date)
head(tb_sub)
## stationid chla_ugl date lnchla
## 1 55 33 1974-01-01 3.496508
## 2 55 52 1974-02-01 3.951244
## 3 55 6 1974-03-01 1.791759
## 4 55 12 1974-04-01 2.484907
## 5 55 17 1974-05-01 2.833213
## 6 55 16 1974-06-01 2.772589
Dates are easy to manipulate using the lubridate package.
# install.packages('lubridate')
library(lubridate)
# create a year column
tb_sub <- tb_sub %>%
mutate(
yr = year(date)
)
head(tb_sub)
## stationid chla_ugl date lnchla yr
## 1 55 33 1974-01-01 3.496508 1974
## 2 55 52 1974-02-01 3.951244 1974
## 3 55 6 1974-03-01 1.791759 1974
## 4 55 12 1974-04-01 2.484907 1974
## 5 55 17 1974-05-01 2.833213 1974
## 6 55 16 1974-06-01 2.772589 1974
Get annual averages:
tb_agg <- tb_sub %>%
group_by(yr) %>%
summarise(avechl = mean(lnchla))
head(tb_agg)
## # A tibble: 6 x 2
## yr avechl
## <dbl> <dbl>
## 1 1974 2.836989
## 2 1975 2.939806
## 3 1976 3.088073
## 4 1977 3.008182
## 5 1978 2.870005
## 6 1979 3.236837
Some basic plots:
plot(lnchla ~ date, tb_sub)
plot(lnchla ~ date, tb_sub, type = 'l')
boxplot(lnchla ~ yr, tb_sub)
barplot(tb_agg$avechl, names.arg = tb_agg$yr)
Or use ggplot:
ggplot(tb_sub, aes(x = lnchla)) +
geom_histogram() +
facet_wrap(~yr, ncol = 3)
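To save the most recent ggplot to a file, you can use ggsave(); the file name and dimensions here are just examples:
# save the last ggplot, dimensions in inches
ggsave('chla_hist.png', width = 7, height = 5)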
The complete repository is available here; the manuscript is at http://dx.doi.org/10.1007/s10666-015-9452-8.
data(tb_dat)
str(tb_dat)
## 'data.frame': 22060 obs. of 7 variables:
## $ seg : Factor w/ 4 levels "Hillsborough Bay",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ stationid : chr "55" "52" "44" "80" ...
## $ lat : num 27.8 27.9 27.9 27.8 27.9 ...
## $ lon : num -82.4 -82.4 -82.5 -82.4 -82.5 ...
## $ chla_ugl : num 33 28 24 11 19 58 73 53 17 46 ...
## $ salinity_ppt: num 23.1 22.6 20.4 23.9 21 21.2 23.3 20.4 23.9 20.4 ...
## $ date : Date, format: "1974-01-01" "1974-01-01" ...
head(tb_dat)
## seg stationid lat lon chla_ugl salinity_ppt
## 1 Hillsborough Bay 55 27.8493 -82.4314 33 23.1
## 2 Hillsborough Bay 52 27.8970 -82.4382 28 22.6
## 3 Hillsborough Bay 44 27.9237 -82.4807 24 20.4
## 4 Hillsborough Bay 80 27.8096 -82.4460 11 23.9
## 5 Hillsborough Bay 7 27.8589 -82.4686 19 21.0
## 6 Hillsborough Bay 70 27.9089 -82.4632 58 21.2
## date
## 1 1974-01-01
## 2 1974-01-01
## 3 1974-01-01
## 4 1974-01-01
## 5 1974-01-01
## 6 1974-01-01
The complete repository is available here; the manuscript is at http://dx.doi.org/10.1111/1752-1688.12489.
data(pe_dat)
str(pe_dat)
## 'data.frame': 5235 obs. of 7 variables:
## $ date : Date, format: "1986-01-13" "1986-01-13" ...
## $ STATION: chr "TF1.3" "TF1.4" "TF1.6" "RET1.1" ...
## $ lnchla : num 0.0488 0.0953 0.9895 3.2059 3.1361 ...
## $ sal : num 0 0 0.623 10.688 12.894 ...
## $ lnQ : num 4.95 4.95 4.95 4.95 4.95 ...
## $ LAT : num 38.8 38.8 38.7 38.5 38.4 ...
## $ LONG : num -76.7 -76.7 -76.7 -76.7 -76.6 ...
head(pe_dat)
## date STATION lnchla sal lnQ LAT LONG
## 1 1986-01-13 TF1.3 0.04879016 0.0000000 4.945919 38.81092 -76.71227
## 2 1986-01-13 TF1.4 0.09531018 0.0000000 4.945919 38.77302 -76.70927
## 3 1986-01-13 TF1.6 0.98954119 0.6234074 4.945919 38.65845 -76.68382
## 4 1986-01-13 RET1.1 3.20592575 10.6879594 4.945919 38.49090 -76.66429
## 5 1986-01-13 LE1.1 3.13609510 12.8944000 4.945919 38.42535 -76.60176
## 6 1986-01-13 LE1.2 2.88917830 13.4699673 4.945919 38.37887 -76.51132
The complete repository is available here; a manuscript is in preparation.
data(sf_dat)
str(sf_dat)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5595 obs. of 13 variables:
## $ Site_Code: chr "C10" "C10" "C10" "C10" ...
## $ Location : chr "Delta" "Delta" "Delta" "Delta" ...
## $ Date : Date, format: "1976-01-20" "1976-02-17" ...
## $ Latitude : num 37.7 37.7 37.7 37.7 37.7 ...
## $ Longitude: num -121 -121 -121 -121 -121 ...
## $ nh : num 0.25 0.19 0.13 0.14 NaN 0.13 0.1 0.06 0.02 0.04 ...
## $ no23 : num 1.02 1.64 1.53 1.4 1.9 1.5 1.6 1.2 0.98 1.4 ...
## $ din : num 1.27 1.83 1.66 1.54 NaN 1.63 1.7 1.26 1 1.44 ...
## $ tss : num 43 65 26 67 106 104 141 149 171 136 ...
## $ chl : num 4.14 7.1 12.66 14.67 23.93 ...
## $ sio2 : num 15 16 15 15 20 15 19 14 15 15 ...
## $ tp : num 0.2 0.12 0.16 0.23 0.33 0.28 0.2 0.52 0.47 0.58 ...
## $ sal : num 0.5 0.6 0.5 0.5 0.6 0.5 0.6 0.6 0.6 0.5 ...
head(sf_dat)
## # A tibble: 6 x 13
## Site_Code Location Date Latitude Longitude nh no23 din tss
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 C10 Delta 1976-01-20 37.67575 -121.265 0.25 1.02 1.27 43
## 2 C10 Delta 1976-02-17 37.67575 -121.265 0.19 1.64 1.83 65
## 3 C10 Delta 1976-03-09 37.67575 -121.265 0.13 1.53 1.66 26
## 4 C10 Delta 1976-03-22 37.67575 -121.265 0.14 1.40 1.54 67
## 5 C10 Delta 1976-04-05 37.67575 -121.265 NaN 1.90 NaN 106
## 6 C10 Delta 1976-04-19 37.67575 -121.265 0.13 1.50 1.63 104
## # ... with 4 more variables: chl <dbl>, sio2 <dbl>, tp <dbl>, sal <dbl>
The tidal (estuarine) adaptation of WRTDS can be downloaded from CRAN: WRTDStidal. Additional information is available on the GitHub page: description, tutorial.
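Installation follows the same CRAN workflow shown earlier:
# install the tidal adaptation of WRTDS from CRAN
install.packages('WRTDStidal')
library(WRTDStidal)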
An excellent blog post about time series modelling with GAMs: https://petolau.github.io/Analyzing-double-seasonal-time-series-with-GAM-in-R/
Beck and Murphy (2017) provide some context on GAMs vs. WRTDS; also see Harding et al. (2016). Wood (2006) is comprehensive on the theory, and Zuur (2012) is more applied.
Here is some example code we used in our 2017 paper, using the mgcv package:
# install.packages('mgcv')
library(mgcv)
# format the data to model
data(pe_dat)
pe_sub <- pe_dat %>%
filter(STATION %in% 'TF1.6') %>%
mutate(
doy = yday(date),
decyr = decimal_date(date),
mo = month(date)
)
# model using a tensor product smooth for the interactions
# k sets upper limits on the number of knots to reduce overfitting
mod <- gam(lnchla ~ te(decyr, doy, lnQ, bs = c('tp', 'cc', 'tp'), k = c(5, 10, 5)),
knots = list(doy = c(1, 366)),
data = pe_sub,
na.action = na.exclude
)
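Before predicting, it can help to inspect the fitted model:
# view the smooth terms, approximate significance, and deviance explained
summary(mod)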
# predictions
pe_sub <- pe_sub %>%
mutate(
pred = predict(mod)
)
# plot
ggplot(pe_sub, aes(x = date)) +
geom_point(aes(y = lnchla)) +
geom_line(aes(y = pred))
Flow-normalization from the GAM (this can be improved):
# matrix to hold decimal year and flow-normalized predictions
flow_norm <- matrix(0, nrow = nrow(pe_sub), ncol = 2)
ct1 <- 1
for(i in 1:12) {
# identify all flows/salinities in this month
all_i_Qs <- with(pe_sub, lnQ[mo == i])
all_i_doy <- with(pe_sub, doy[mo == i])
all_i_dec <- with(pe_sub, decyr[mo == i])
Q_pred <- matrix(0, nrow = length(all_i_Qs), ncol = length(all_i_Qs))
for(q in 1:length(all_i_Qs)) {
newd_q <- data.frame(decyr = all_i_dec, lnQ = all_i_Qs[q], doy = all_i_doy)
Q_pred[, q] <- predict(mod, newd_q, se.fit = FALSE)
}
# average the predictions across flows to get one flow-normalized value for each observation in this month
ct2 <- ct1 + length(all_i_Qs) - 1
flow_norm[ct1:ct2, 2] <- rowMeans(Q_pred)
flow_norm[ct1:ct2, 1] <- all_i_dec
ct1 <- ct2 + 1
}
ord_fn <- order(flow_norm[, 1])
fn_out <- flow_norm[ord_fn, ] %>%
data.frame
names(fn_out) <- c('decyr', 'pred_nrm')
# add to original data object
pe_sub <- pe_sub %>%
left_join(fn_out, by = 'decyr')
# plot
ggplot(pe_sub, aes(x = date)) +
geom_point(aes(y = lnchla)) +
geom_line(aes(y = pred)) +
geom_line(aes(y = pred_nrm), col = 'red')