If you are starting a new session, you will need to follow a few simple steps to get data and functions ready to use in R. Here is a basic workflow.
Open a new or saved script that you will use to type code (.R file extension). This option is under the File menu at the top in both the basic R install and RStudio.
Load the R packages that you will use. The functions in a package are not available unless the package is loaded, and the package must first be installed (see below for package installation). Put the library() calls for the packages you use at the top of your script.
Set the working directory. This is where R will load and save files.
Load any workspace that you were using previously. This is an R-specific file format (.RData) that contains any and all objects from a previous workspace. Ideally, you will have saved a workspace from your previous session.
Alternatively, use RStudio projects to organize your data and workspace.
Here is a sample script of this workflow.
# my startup script
# load any installed packages
library(SWMPr)
library(ggplot2)
# set the working directory
# change as needed
setwd('C:/my_files')
# load your previous workspace if you saved one
# this must be in the working directory; otherwise use the full path
load(file = 'my_workspace.RData')
# check what you loaded
ls()
Other data formats can also be imported or loaded into R. Flat text files and comma-separated files are commonly used, although R can import many other types. Use the following functions to import text or csv data, and don’t forget to assign your data to an object.
# import data and assign
# data are in the working directory, or use a full path
# import a csv file
dat <- read.csv('my_data.csv', header = TRUE)
# import a text file, separated by commas
dat <- read.table('my_data.txt', header = TRUE, sep = ',')
Make sure you save your work when you close a session! Save the script you’re working with using the File menu at the top. You will also want to save your data. Just as data of different types can be imported, data can also be saved in different formats. You will either want to save the whole workspace or individual parts (e.g., a data.frame as a .csv file).
# save the whole workspace as a .RData file
# will be saved in the working directory
save(list = ls(), file = 'my_workspace.RData')
# save one object (e.g., dat) as a .RData file
save(dat, file = 'my_data.RData')
# save as .csv
write.csv(dat, 'my_data.csv')
# save as text file
write.table(dat, 'my_data.txt', sep = ',', row.names = FALSE, quote = FALSE)
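A related option for single objects is saveRDS(), which saves one object to an .rds file that you can assign to any name when reading it back. A minimal sketch, using the dat object from above:
# save a single object as .rds
saveRDS(dat, 'my_data.rds')
# read it back under a new name
dat2 <- readRDS('my_data.rds')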
R installs and loads packages from its library on your computer. You can see where your library is with .libPaths(). Packages that you install from CRAN or elsewhere will go in this library, and R will also look here when you load a package to see if it is installed. Most packages will be downloaded from CRAN, although you can also install packages from GitHub or Bioconductor. To install from GitHub, you will first have to install and load the devtools package from CRAN. Here are some basics for installing and loading a package.
# install a package from CRAN
install.packages('ggplot2')
# install a package from GitHub (requires devtools)
install.packages('devtools')
library(devtools)
install_github('fawda123/SWMPr')
library(SWMPr)
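As a quick check of the library lookup described above, you can print the library locations and guard an install with a test for whether the package is already installed. A minimal sketch:
# see where R looks for installed packages
.libPaths()
# install only if not already in the library
if (!requireNamespace('ggplot2', quietly = TRUE)) {
  install.packages('ggplot2')
}
library(ggplot2)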
You can view all the keyboard shortcuts in RStudio by clicking Help, then Keyboard Shortcuts on the top menu. Here are some common shortcuts for Windows/Linux (* denotes RStudio only).
run selection *: Ctrl+Enter
clear console: Ctrl+L
select all: Ctrl+A
copy: Ctrl+C
cut: Ctrl+X
paste: Ctrl+V
open document: Ctrl+O
save document: Ctrl+S
switch to source *: Ctrl+1
switch to console *: Ctrl+2
stop current execution: Esc
skip to beginning of line: Home
skip to end of line: End
If all else fails, a Google search will usually point you in the right direction. All of the documentation that comes with R and its packages is available online, and a quick search for a function or package will lead you to it. You can also access the help files on your computer from within R.
# access a help file for a function
help(mean)
# or do this
?mean
# run the examples in the help file
example(mean)
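If you don’t know the exact function name, you can also search the installed help files by keyword:
# search installed help files by keyword
help.search('linear model')
# shorthand for the same search
??'linear model'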
See also this lesson on data munging, QA/QC, and cleaning: https://nceas.github.io/oss-lessons/data-munging-qa-qc-cleaning/data-munging-qa-qc-cleaning.html
See what’s in your working directory, import a dataset, and view some data summaries.
# print your working directory
getwd()
## [1] "C:/proj/wq_data_analysis"
# what's in the working directory
dir()
## [1] "data" "README.md"
## [3] "wq_data_analysis.html" "wq_data_analysis.Rmd"
## [5] "wq_data_analysis.Rproj"
# what's in the data file
dir('data')
## [1] "pe_dat.csv" "pe_dat.RData" "sf_dat.csv" "sf_dat.RData"
## [5] "tb_dat.csv" "tb_dat.RData"
# import tampa bay data as csv, view summaries
tb_dat <- read.csv('data/tb_dat.csv', stringsAsFactors = FALSE)
dim(tb_dat)
## [1] 22060 7
names(tb_dat)
## [1] "seg" "stationid" "lat" "lon"
## [5] "chla_ugl" "salinity_ppt" "date"
class(tb_dat)
## [1] "data.frame"
str(tb_dat)
## 'data.frame': 22060 obs. of 7 variables:
## $ seg : chr "Hillsborough Bay" "Hillsborough Bay" "Hillsborough Bay" "Hillsborough Bay" ...
## $ stationid : int 55 52 44 80 7 70 71 6 73 8 ...
## $ lat : num 27.8 27.9 27.9 27.8 27.9 ...
## $ lon : num -82.4 -82.4 -82.5 -82.4 -82.5 ...
## $ chla_ugl : num 33 28 24 11 19 58 73 53 17 46 ...
## $ salinity_ppt: num 23.1 22.6 20.4 23.9 21 21.2 23.3 20.4 23.9 20.4 ...
## $ date : chr "1974-01-01" "1974-01-01" "1974-01-01" "1974-01-01" ...
summary(tb_dat)
## seg stationid lat lon
## Length:22060 Min. : 6.00 Min. :27.57 Min. :-82.78
## Class :character 1st Qu.:28.00 1st Qu.:27.72 1st Qu.:-82.62
## Mode :character Median :52.00 Median :27.85 Median :-82.57
## Mean :51.72 Mean :27.82 Mean :-82.57
## 3rd Qu.:71.00 3rd Qu.:27.92 3rd Qu.:-82.48
## Max. :96.00 Max. :27.99 Max. :-82.41
## chla_ugl salinity_ppt date
## Min. : 0.000 Min. : 4.10 Length:22060
## 1st Qu.: 3.873 1st Qu.:23.60 Class :character
## Median : 7.032 Median :26.90 Mode :character
## Mean : 10.100 Mean :26.65
## 3rd Qu.: 12.500 3rd Qu.:30.00
## Max. :333.400 Max. :48.65
head(tb_dat)
## seg stationid lat lon chla_ugl salinity_ppt
## 1 Hillsborough Bay 55 27.8493 -82.4314 33 23.1
## 2 Hillsborough Bay 52 27.8970 -82.4382 28 22.6
## 3 Hillsborough Bay 44 27.9237 -82.4807 24 20.4
## 4 Hillsborough Bay 80 27.8096 -82.4460 11 23.9
## 5 Hillsborough Bay 7 27.8589 -82.4686 19 21.0
## 6 Hillsborough Bay 70 27.9089 -82.4632 58 21.2
## date
## 1 1974-01-01
## 2 1974-01-01
## 3 1974-01-01
## 4 1974-01-01
## 5 1974-01-01
## 6 1974-01-01
tail(tb_dat)
## seg stationid lat lon chla_ugl salinity_ppt
## 22055 Lower Tampa Bay 91 27.6279 -82.6415 8.9 26.90
## 22056 Lower Tampa Bay 95 27.6112 -82.6947 11.9 28.99
## 22057 Lower Tampa Bay 93 27.5789 -82.7441 5.7 32.13
## 22058 Lower Tampa Bay 90 27.6260 -82.5915 10.8 27.68
## 22059 Lower Tampa Bay 96 27.6378 -82.6910 10.3 28.43
## 22060 Lower Tampa Bay 94 27.6100 -82.7832 6.0 31.77
## date
## 22055 2012-09-01
## 22056 2012-09-01
## 22057 2012-09-01
## 22058 2012-09-01
## 22059 2012-09-01
## 22060 2012-09-01
You can easily manipulate data using packages in the tidyverse. The tidyverse is a collection of packages for data import, manipulation, and plotting. See here for a cheatsheet on data manipulation.
# install only once
install.packages('tidyverse')
# load every time
library(tidyverse)
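The examples below use the pipe operator %>% (loaded with the tidyverse), which passes the object on its left as the first argument of the function on its right. A minimal sketch with the tb_dat object from above:
# these two lines are equivalent
head(select(tb_dat, stationid, date))
tb_dat %>% select(stationid, date) %>% head()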
Some basic data manipulation: selecting or removing columns, filtering by row, and creating new columns:
# select specific columns
tb_sub <- tb_dat %>%
select(stationid, chla_ugl, date)
head(tb_sub)
## stationid chla_ugl date
## 1 55 33 1974-01-01
## 2 52 28 1974-01-01
## 3 44 24 1974-01-01
## 4 80 11 1974-01-01
## 5 7 19 1974-01-01
## 6 70 58 1974-01-01
# keep only observations for station 55
tb_sub <- tb_sub %>%
filter(stationid == 55)
head(tb_sub)
## stationid chla_ugl date
## 1 55 33.0 1974-01-01
## 2 55 16.0 1975-01-01
## 3 55 15.0 1977-01-01
## 4 55 24.0 1978-01-01
## 5 55 37.2 1979-01-01
## 6 55 11.7 1980-01-01
# create a new column
tb_sub <- tb_sub %>%
mutate(
lnchla = log(chla_ugl)
)
head(tb_sub)
## stationid chla_ugl date lnchla
## 1 55 33.0 1974-01-01 3.496508
## 2 55 16.0 1975-01-01 2.772589
## 3 55 15.0 1977-01-01 2.708050
## 4 55 24.0 1978-01-01 3.178054
## 5 55 37.2 1979-01-01 3.616309
## 6 55 11.7 1980-01-01 2.459589
# convert a character column to a date and arrange
tb_sub <- tb_sub %>%
mutate(
date = as.Date(date, format = '%Y-%m-%d')
) %>%
arrange(date)
head(tb_sub)
## stationid chla_ugl date lnchla
## 1 55 33 1974-01-01 3.496508
## 2 55 52 1974-02-01 3.951244
## 3 55 6 1974-03-01 1.791759
## 4 55 12 1974-04-01 2.484907
## 5 55 17 1974-05-01 2.833213
## 6 55 16 1974-06-01 2.772589
Dates are easy to manipulate using the lubridate package.
# install.packages('lubridate')
library(lubridate)
# create a year column
tb_sub <- tb_sub %>%
mutate(
yr = year(date)
)
head(tb_sub)
## stationid chla_ugl date lnchla yr
## 1 55 33 1974-01-01 3.496508 1974
## 2 55 52 1974-02-01 3.951244 1974
## 3 55 6 1974-03-01 1.791759 1974
## 4 55 12 1974-04-01 2.484907 1974
## 5 55 17 1974-05-01 2.833213 1974
## 6 55 16 1974-06-01 2.772589 1974
Get annual averages:
tb_agg <- tb_sub %>%
group_by(yr) %>%
summarise(avechl = mean(lnchla))
head(tb_agg)
## # A tibble: 6 x 2
## yr avechl
## <dbl> <dbl>
## 1 1974 2.836989
## 2 1975 2.939806
## 3 1976 3.088073
## 4 1977 3.008182
## 5 1978 2.870005
## 6 1979 3.236837
Some basic plots:
plot(lnchla ~ date, tb_sub)
plot(lnchla ~ date, tb_sub, type = 'l')
boxplot(lnchla ~ yr, tb_sub)
barplot(tb_agg$avechl, names.arg = tb_agg$yr)
Or use ggplot:
ggplot(tb_sub, aes(x = lnchla)) +
geom_histogram() +
facet_wrap(~yr, ncol = 3)
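To save the most recent ggplot to a file, you can use ggsave(); the file name and dimensions here are just examples:
# save the last ggplot, dimensions in inches
ggsave('chla_hist.png', width = 7, height = 5)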
The complete repository is available here; the manuscript is at http://dx.doi.org/10.1007/s10666-015-9452-8.
data(tb_dat)
str(tb_dat)
## 'data.frame': 22060 obs. of 7 variables:
## $ seg : Factor w/ 4 levels "Hillsborough Bay",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ stationid : chr "55" "52" "44" "80" ...
## $ lat : num 27.8 27.9 27.9 27.8 27.9 ...
## $ lon : num -82.4 -82.4 -82.5 -82.4 -82.5 ...
## $ chla_ugl : num 33 28 24 11 19 58 73 53 17 46 ...
## $ salinity_ppt: num 23.1 22.6 20.4 23.9 21 21.2 23.3 20.4 23.9 20.4 ...
## $ date : Date, format: "1974-01-01" "1974-01-01" ...
head(tb_dat)
## seg stationid lat lon chla_ugl salinity_ppt
## 1 Hillsborough Bay 55 27.8493 -82.4314 33 23.1
## 2 Hillsborough Bay 52 27.8970 -82.4382 28 22.6
## 3 Hillsborough Bay 44 27.9237 -82.4807 24 20.4
## 4 Hillsborough Bay 80 27.8096 -82.4460 11 23.9
## 5 Hillsborough Bay 7 27.8589 -82.4686 19 21.0
## 6 Hillsborough Bay 70 27.9089 -82.4632 58 21.2
## date
## 1 1974-01-01
## 2 1974-01-01
## 3 1974-01-01
## 4 1974-01-01
## 5 1974-01-01
## 6 1974-01-01
The complete repository is available here; the manuscript is at http://dx.doi.org/10.1111/1752-1688.12489.
data(pe_dat)
str(pe_dat)
## 'data.frame': 5235 obs. of 7 variables:
## $ date : Date, format: "1986-01-13" "1986-01-13" ...
## $ STATION: chr "TF1.3" "TF1.4" "TF1.6" "RET1.1" ...
## $ lnchla : num 0.0488 0.0953 0.9895 3.2059 3.1361 ...
## $ sal : num 0 0 0.623 10.688 12.894 ...
## $ lnQ : num 4.95 4.95 4.95 4.95 4.95 ...
## $ LAT : num 38.8 38.8 38.7 38.5 38.4 ...
## $ LONG : num -76.7 -76.7 -76.7 -76.7 -76.6 ...
head(pe_dat)
## date STATION lnchla sal lnQ LAT LONG
## 1 1986-01-13 TF1.3 0.04879016 0.0000000 4.945919 38.81092 -76.71227
## 2 1986-01-13 TF1.4 0.09531018 0.0000000 4.945919 38.77302 -76.70927
## 3 1986-01-13 TF1.6 0.98954119 0.6234074 4.945919 38.65845 -76.68382
## 4 1986-01-13 RET1.1 3.20592575 10.6879594 4.945919 38.49090 -76.66429
## 5 1986-01-13 LE1.1 3.13609510 12.8944000 4.945919 38.42535 -76.60176
## 6 1986-01-13 LE1.2 2.88917830 13.4699673 4.945919 38.37887 -76.51132
The complete repository is available here; a manuscript is in preparation.
data(sf_dat)
str(sf_dat)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5595 obs. of 13 variables:
## $ Site_Code: chr "C10" "C10" "C10" "C10" ...
## $ Location : chr "Delta" "Delta" "Delta" "Delta" ...
## $ Date : Date, format: "1976-01-20" "1976-02-17" ...
## $ Latitude : num 37.7 37.7 37.7 37.7 37.7 ...
## $ Longitude: num -121 -121 -121 -121 -121 ...
## $ nh : num 0.25 0.19 0.13 0.14 NaN 0.13 0.1 0.06 0.02 0.04 ...
## $ no23 : num 1.02 1.64 1.53 1.4 1.9 1.5 1.6 1.2 0.98 1.4 ...
## $ din : num 1.27 1.83 1.66 1.54 NaN 1.63 1.7 1.26 1 1.44 ...
## $ tss : num 43 65 26 67 106 104 141 149 171 136 ...
## $ chl : num 4.14 7.1 12.66 14.67 23.93 ...
## $ sio2 : num 15 16 15 15 20 15 19 14 15 15 ...
## $ tp : num 0.2 0.12 0.16 0.23 0.33 0.28 0.2 0.52 0.47 0.58 ...
## $ sal : num 0.5 0.6 0.5 0.5 0.6 0.5 0.6 0.6 0.6 0.5 ...
head(sf_dat)
## # A tibble: 6 x 13
## Site_Code Location Date Latitude Longitude nh no23 din tss
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 C10 Delta 1976-01-20 37.67575 -121.265 0.25 1.02 1.27 43
## 2 C10 Delta 1976-02-17 37.67575 -121.265 0.19 1.64 1.83 65
## 3 C10 Delta 1976-03-09 37.67575 -121.265 0.13 1.53 1.66 26
## 4 C10 Delta 1976-03-22 37.67575 -121.265 0.14 1.40 1.54 67
## 5 C10 Delta 1976-04-05 37.67575 -121.265 NaN 1.90 NaN 106
## 6 C10 Delta 1976-04-19 37.67575 -121.265 0.13 1.50 1.63 104
## # ... with 4 more variables: chl <dbl>, sio2 <dbl>, tp <dbl>, sal <dbl>
The tidal (estuarine) adaptation of WRTDS can be downloaded from CRAN: WRTDStidal. Additional information is available on the GitHub page: description, tutorial.
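Installation follows the same CRAN workflow shown earlier:
# install the tidal adaptation of WRTDS from CRAN
install.packages('WRTDStidal')
library(WRTDStidal)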
An excellent blog post about time series modelling with GAMs: https://petolau.github.io/Analyzing-double-seasonal-time-series-with-GAM-in-R/
Beck and Murphy (2017) provide some context on GAMs vs. WRTDS; also see Harding et al. (2016). Wood (2006) is comprehensive on the theory, and Zuur (2012) is more applied.
Here is some example code we used in our 2017 paper, using the mgcv package:
# install.packages('mgcv')
library(mgcv)
# format the data to model
data(pe_dat)
pe_sub <- pe_dat %>%
filter(STATION %in% 'TF1.6') %>%
mutate(
doy = yday(date),
decyr = decimal_date(date),
mo = month(date)
)
# model using a tensor product smooth for the interactions
# k sets upper limits on the number of knots to reduce overfitting
mod <- gam(lnchla ~ te(decyr, doy, lnQ, bs = c('tp', 'cc', 'tp'), k = c(5, 10, 5)),
knots = list(doy = c(1, 366)),
data = pe_sub,
na.action = na.exclude
)
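Before predicting, it can help to inspect the fitted model:
# view the smooth terms, approximate significance, and deviance explained
summary(mod)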
# predictions
pe_sub <- pe_sub %>%
mutate(
pred = predict(mod)
)
# plot
ggplot(pe_sub, aes(x = date)) +
geom_point(aes(y = lnchla)) +
geom_line(aes(y = pred))
Flow-normalization from the GAM (this can be improved):
# matrix to hold decimal year and flow-normalized predictions
flow_norm <- matrix(0, nrow = nrow(pe_sub), ncol = 2)
ct1 <- 1
for(i in 1:12) {
# identify all flows/salinities in this month
all_i_Qs <- with(pe_sub, lnQ[mo == i])
all_i_doy <- with(pe_sub, doy[mo == i])
all_i_dec <- with(pe_sub, decyr[mo == i])
Q_pred <- matrix(0, nrow = length(all_i_Qs), ncol = length(all_i_Qs))
for(q in 1:length(all_i_Qs)) {
newd_q <- data.frame(decyr = all_i_dec, lnQ = all_i_Qs[q], doy = all_i_doy)
Q_pred[, q] <- predict(mod, newd_q, se.fit = FALSE)
}
# average the predictions across flows to get one flow-normalized value for each observation in this month
ct2 <- ct1 + length(all_i_Qs) - 1
flow_norm[ct1:ct2, 2] <- rowMeans(Q_pred)
flow_norm[ct1:ct2, 1] <- all_i_dec
ct1 <- ct2 + 1
}
ord_fn <- order(flow_norm[, 1])
fn_out <- flow_norm[ord_fn, ] %>%
data.frame
names(fn_out) <- c('decyr', 'pred_nrm')
# add to original data object
pe_sub <- pe_sub %>%
left_join(fn_out, by = 'decyr')
# plot
ggplot(pe_sub, aes(x = date)) +
geom_point(aes(y = lnchla)) +
geom_line(aes(y = pred)) +
geom_line(aes(y = pred_nrm), col = 'red')