
R Programming Tutorial Pdf Download

Introductory tutorial to programming in R, split into 2 parts. Part 1 covers the basics (Online sources of information about R; Packages, Documentation and Help; Basics and syntax of R; Main R data structures: Vectors, Matrices, Data frames, Lists and Factors; Brief intro to R control-flow via Loops and Conditionals; Brief description of function declaration). Part 2 covers summary statistics and graphics.


R for Absolute Beginners - Part 1/2

Syntax and Data Structures in R

Authors: Isabel Duarte & Ramiro Magno | Collaborators: Bruno Louro & Rui Machado

4 June 2018

Contents

Introduction
    Online sources and other useful Bibliography
Basics
    General notes (about R and RStudio)
    Start/Quit RStudio
    Package repositories
    Installing packages and Getting help
    Working environment
Hands-on tutorial
    1. Create an RStudio project (30 min)
    2. Operators (60 min)
        2.1 Assignment operators
        2.2 Comparison operators
        2.3 Logical operators
        2.4 Arithmetic operators
    3. Data structures (120 min)
        3.1 Vectors
            Creating vectors
            Vectorized arithmetics
            Subsetting/Indexing vectors
            Naming indexes of a vector
            Excluding elements
        3.2 Matrices
            Subsetting/Indexing matrices
        3.3 Data frames
            Subsetting/Indexing Data frames
        3.4 Lists
            Subsetting/Indexing lists
        3.5 Data structure conversion
    4. Loops and Conditionals in R (60 min)
        4.1 for() and while() loops
        4.2 Conditionals: if() statements
        4.3 Conditionals: ifelse() statements
    5. Functions (60 min)
    6. Loading data and Saving files (30 min)
    7. Some great R functions to "play" with (60 min)
        7.1 Using the iris built-in dataset
        7.2 Using the esoph built-in dataset

Introduction

This mini hands-on tutorial serves as an introduction to R, covering the following topics:

Online sources of information about R;

Packages, Documentation and Help;

Basics and syntax of R;

Main R data structures: Vectors, Matrices, Data frames, Lists and Factors;

Brief intro to R control-flow via Loops and Conditionals;

Brief description of function declaration;

Listing of some of the most commonly used built-in R functions.

This document will guide you through the initial steps toward using R. RStudio will be used as the development platform for this workshop since it integrates many functionalities that facilitate the learning process, and it is free software, available for Linux, Mac and Windows. You can download it directly from: https://www.rstudio.com/products/rstudio/download/

This protocol is divided into 7 parts, each one identified by a Title, Maximum execution time (in parentheses), a brief Task description and the R commands to be executed. These will always be inside grey text boxes, with the font colored according to the R syntax highlighting.

Now, just Keep Calm... and Good Work!

Online sources and other useful Bibliography

Links

R Project (The developers of R)

Quick-R (Roadmap and R code to quickly use R)

Cookbook for R (R code "recipes")

Bioconductor workflows (R code for pipelines of genomic analyses)

Advanced R (If you want to learn R from a programmer's point of view)

Books

Introductory Statistics with R (Springer, Dalgaard, 2008)

A first course in statistical programming with R (CUP, Braun and Murdoch, 2016)

Computational Genome Analysis: An Introduction (Springer, Deonier, Tavaré and Waterman, 2005)

R programming for Bioinformatics (CRC Press, Gentleman, 2008)

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (O'Reilly, Wickham and Grolemund, 2017) (for advanced users)


Basics

General notes (about R and RStudio)

1. R is case sensitive - be aware of capital letters (b is different from B).

2. All R code lines starting with the # (hash) sign are interpreted as comments, and are therefore not evaluated.

# This is a comment
# 3 + 4 # this code is not evaluated, so it does not print any result
2 + 3 # the code before the hash sign is evaluated, so it prints the result (value 5)
[1] 5

3. Expressions in R are evaluated from the innermost parentheses toward the outermost ones (following the usual mathematical rules of operator precedence).

# Example with parentheses:
((2 + 2) / 2) - 2
[1] 0
# Without parentheses:
2 + 2 / 2 - 2
[1] 1

4. Spaces are not allowed in variable names. Use a dot or underscore to create longer, more descriptive names, e.g. my.variable_name.

5. Spaces between variables and operators do not matter: 3+2 is the same as 3 + 2, and function (arg1 , arg2) is the same as function(arg1,arg2).

6. If you want to write 2 expressions/commands on the same line, you have to separate them with a ; (semicolon).

# Example:
3 + 2; 5 + 1
[1] 5
[1] 6

7. More recent versions of RStudio auto-complete your commands by showing you possible alternatives as soon as you type 3 consecutive characters. If you want to see the options for fewer than 3 characters, just press tab to display the available options. Tip: Use auto-complete as much as possible to avoid typing mistakes.

8. There are 4 main vector data types: Logical (TRUE or FALSE); Numeric (e.g. 1, 2, 3...); Character (e.g. "u", "alg", "arve") and Complex (e.g. 3+2i).

9. Vectors are ordered sets of elements. In R, vectors are 1-based, i.e. the first index position is number 1 (as opposed to many other programming languages, whose indexes start at zero).

10. R objects can be divided into two main groups: Functions and Data-related objects. Functions receive arguments inside round brackets ( ) and data objects are indexed with arguments inside square brackets [ ]:

function (arguments)
data.object [arguments]
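As a minimal sketch of notes 9 and 10 (the vector name `heights` and its values are purely illustrative and not part of the original tutorial), the lines below contrast a function call, which uses round brackets, with indexing a data object, which uses square brackets and starts counting at 1:

```r
# 'heights' is a hypothetical example vector (values in metres)
heights <- c(1.70, 1.82, 1.65)

n <- length(heights)     # function call: arguments go inside ( )
first <- heights[1]      # indexing: the first element is at position 1
last <- heights[n]       # the last element is at position length(heights)

n
first
last
```

Note that `heights[0]` does not return the first element in R; it returns an empty vector.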

Start/Quit RStudio

RStudio can be opened by double-clicking its icon.

The R environment is controlled by hidden files (files whose names start with a .) in the startup directory: .RData, .Rhistory and .Rprofile (optional).

.RData is a file containing all the objects, data, and functions created during a work session. This file can then be loaded for future work without requiring the re-computation of the analysis. (Note: it can potentially be a very large file);

.Rhistory saves all commands that have been typed during the R session;

.Rprofile is useful for advanced users to customize RStudio behaviour.

It is good practice to save these files under project-specific names:

# DO NOT RUN
save.image(file = "myProjectName.RData")
savehistory(file = "myProjectName.Rhistory")

To quit R (close it), use the q() function, and you will be asked if you want to save the workspace image (i.e. the .RData file):

q()

Save workspace image to ~/path/to/your/working/directory/.RData? [y/n/c]:

If you type y (yes), the entire R workspace will be written to the .RData file (which can be very large).

Often it is sufficient to just save an analysis script (i.e. a reproducible protocol) in an R source file. This way,

one can quickly regenerate all data sets and objects for future analysis. The .RData file is particularly useful

to save the results from analyses that require a long time to compute.

Package repositories

In R, the fundamental unit of shareable code is the package. A package bundles together code, data, documentation, and tests, and is easy to share with others. Packages are stored in online repositories, from which they can be easily retrieved and installed on your computer (R packages by Hadley Wickham). There are 2 main R repositories:

The Comprehensive R Archive Network - CRAN (nearly 8500 packages)

Bioconductor (>1560 packages in June 2018) (bioscience data analysis)

This huge variety of packages is one of the reasons why R is so successful: the chances are that someone has

already solved a problem that you're working on, and you can benefit from their work by downloading their

package for free.

In this course, we will not use any packages. However, if you continue to use R for bioinformatics analysis

you will need to install Bioconductor. So for future reference, here is the code to install Bioconductor, and to

set the repositories that you want to use when searching and installing packages:

##### THIS PROCESS MIGHT TAKE A VERY LONG TIME #####
# To install Bioconductor, run the following code
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite()
# Note: from Bioconductor 3.8 onwards, biocLite() was replaced by the BiocManager package:
# install.packages("BiocManager"); BiocManager::install()
# To set the other relevant repositories:
setRepositories()
# then follow the instructions and input the numbers corresponding to the requested repositories
# (if you want to cover most packages, just use all listed repositories: 1 2 3 4 5 6 7 8 9)

Installing packages and Getting help

R has many built-in ways of providing help regarding its functions and packages:

install.packages("ggplot2") # install the package called ggplot2
library("ggplot2") # load the library ggplot2
help(package = "ggplot2") # help(package="package_name") to get help about a specific package
vignette("ggplot2") # show a document with the package manual (called R vignettes)
?qplot # ?function to get quick info about the function of interest

Working environment

Your working environment is the place where the variables, functions, and data that you create are stored.

More advanced users can create more than one environment.

ls() # list all objects in your environment
dir() # list all files in your working directory
getwd() # find out the path to your working directory
setwd("/home/isabel") # example of setting a new working directory path
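As a small sketch of how objects enter and leave the working environment (the variable name `tmp.value` is hypothetical, and `exists()` and `rm()` are base R functions that this tutorial does not otherwise cover):

```r
tmp.value <- 42                        # create an object in the working environment
listed <- "tmp.value" %in% ls()        # ls() now lists it
listed
rm(tmp.value)                          # remove the object again
still.there <- exists("tmp.value")     # check whether the name can still be found
still.there
```

Removing objects that are no longer needed keeps the environment, and therefore any saved .RData file, small.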

Hands-on tutorial

1. Create an RStudio project (30 min)

To start, we will open RStudio. This is an Integrated Development Environment - IDE - that includes a syntax-highlighting text editor (1), an R console to execute code (2), as well as workspace and history management (3), and tools for plotting and exporting images, browsing the workspace, managing packages and viewing html/pdf files created within RStudio (4).

Projects are a great functionality: they ease the transition between the analyses of different datasets, and allow fast navigation to your analysis/working directory. To create a new project:

File > New Project... > New Directory > New Project

Directory name: r-absoluteBeginners

Create project as a subdirectory of: ~/

Browse... (directory/folder to save the workshop data)

Create Project

Figure 1: RStudio Graphical User Interface (GUI)

Projects should be personalized by clicking on the menu in the upper right corner. The general options - R General - are the most important to customize, since they allow the definition of the RStudio "behavior" when the project is opened. The following suggestions are particularly useful:

Restore .RData at startup - Yes (for analyses with +1GB of data, you should choose "No")

Save .RData on exit - Ask

Always save history - Yes

Figure 2: Customize Project

2. Operators (60 min)

Important NOTE: Please create a new R Script file to save all the code you use for today's

tutorial and save it in your current working directory. Name it: r4ab_day1.R

2.1 Assignment operators

Values are assigned to named variables with an <- (arrow) or an = (equal) sign. In most cases they are interchangeable; however, it is good practice to use the arrow since it is explicit about the direction of the assignment. If the equal sign is used, the assignment always occurs from right to left, i.e. the value on the right is assigned to the name on the left.

x <- 7 # assign the number 7 to a variable named x
x # R will print the value associated with variable x
y <- 9 # assign the number 9 to the variable y
z = 3 # assign the value 3 to the variable z
42 -> lue # assign the value 42 to the variable named lue
x -> xx # assign the value of x (which is the number 7) to the variable named xx
xx
my_variable = 5 # assign the number 5 to the variable named my_variable

2.2 Comparison operators

Comparison operators allow the direct comparison between values:

Symbol  Description
==      exactly the same (equal)
!=      different (not equal)
<       smaller than
>       greater than
<=      smaller or equal
>=      greater or equal

1 == 1 # TRUE
1 != 1 # FALSE
x > 3 # TRUE (x is 7)
y <= 9 # TRUE (y is 9)
my_variable < z # FALSE (z is 3 and my_variable is 5)

2.3 Logical operators

Logical operators combine logical (TRUE/FALSE) values:

Symbol  Description
&       AND
|       OR
!       NOT

QUESTION: Are these TRUE, or FALSE?

x < y & x > 10 # AND means that both expressions have to be true
x < y | x > 10 # OR means that at least one expression must be true
!(x != y & my_variable <= y) # yet another AND example, this time negated with NOT


2.4 Arithmetic operators

R makes calculations using the following arithmetic operators:

Symbol  Description
+       addition
-       subtraction
*       multiplication
/       division
^       exponentiation

3 / y ## 0.3333333
x * 2 ## 14
3 - 4 ## -1
my_variable + 2 ## 7
2 ^ z ## 8

3. Data structures (120 min)

3.1 Vectors

The basic data structure in R is the vector, which requires all of its elements to be of the same type (e.g. all numeric; all character (text); all logical (TRUE/FALSE)).

Creating vectors

Function  Description
c         combine
:         integer sequence
seq       general sequence
rep       repetitive patterns

x <- c(1, 2, 3, 4, 5, 6)
x
[1] 1 2 3 4 5 6
class(x) # this function outputs the class of the object
[1] "numeric"
y <- 10
class(y)
[1] "numeric"
z <- "a string"
class(z)
[1] "character"
# The results are shown in the comments next to each line
seq(1, 6) ## 1 2 3 4 5 6
seq(from = 100, by = 1, length.out = 5) ## 100 101 102 103 104
1:6 ## 1 2 3 4 5 6
10:1 ## 10 9 8 7 6 5 4 3 2 1
rep(1:2, 3) ## 1 2 1 2 1 2

Vectorized arithmetics

Most arithmetic operations in the R language are vectorized, i.e. the operation is applied element-wise. When one operand is shorter than the other, the shorter one is recycled, i.e. the values from the shorter vector are re-used in order to have the same length as the longer vector.

Please note that when the length of the longer vector is not a multiple of the length of the shorter one, a warning is printed in the R Console. This warning is not an error, i.e. the operation has been completed despite the warning message.

1:3 + 10:12
[1] 11 13 15
# Notice the warning: this is recycling (the shorter vector "restarts" the "cycling")
1:5 + 10:12
Warning in 1:5 + 10:12: longer object length is not a multiple of shorter object length
[1] 11 13 15 14 16
x + y # Remember that x = c(1, 2, 3, 4, 5, 6) and y = 10
[1] 11 12 13 14 15 16
c(70, 80) + x
[1] 71 82 73 84 75 86
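To make the recycling rule concrete, here is a short sketch using two hypothetical vectors, `short` and `long` (names chosen for illustration only). It shows that recycling is silent when the longer length is a multiple of the shorter one, and uses base R's `tryCatch()` to confirm that a warning is raised otherwise:

```r
short <- c(10, 20)   # hypothetical example vectors
long <- 1:6

# 6 is a multiple of 2, so 'short' is recycled three times without any warning
silent <- long + short
silent               # 11 22 13 24 15 26

# 5 is not a multiple of 2, so R completes the sum but raises a warning;
# tryCatch() intercepts it here and returns a short message instead
outcome <- tryCatch(1:5 + 1:2, warning = function(w) "warning raised")
outcome
```

Silent recycling is convenient, but it can also hide mistakes when two vectors were expected to have the same length, so it is worth checking lengths with `length()` when in doubt.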

Subsetting/Indexing vectors

Subsetting is one of the most powerful features of R. It is the extraction of one or more elements of interest from vectors, allowing for example the filtering of data, the re-ordering of tables, the removal of unwanted datapoints, etc. There are several ways of subsetting data.

Note: Please remember that indices in R are 1-based (see the General notes).

# Subsetting by indices
myVec <- 1:26; myVec
myVec[1] # prints the first value of myVec
myVec[6:9] # prints the 6th, 7th, 8th and 9th values of myVec
# LETTERS is a built-in vector with the 26 letters of the alphabet
myLOL <- LETTERS # assign the 26 letters to the vector named myLOL
myLOL[c(3, 3, 13, 1, 18)] # print the requested positions of vector myLOL

# Subsetting by same-length logical vectors
myLogical <- myVec > 10; myLogical
# returns only the values in positions corresponding to TRUE in the logical vector
myVec[myLogical]
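Logical subsetting is usually written in a single step, and conditions can be combined with the logical operators from section 2.3. A minimal sketch, assuming a hypothetical vector named `scores`:

```r
scores <- c(4, 15, 8, 23, 16)   # hypothetical example values

# keep only the values greater than 10 AND smaller than 20
mid.scores <- scores[scores > 10 & scores < 20]
mid.scores                      # 15 16

# which() converts a logical vector into the positions of its TRUE values
positions <- which(scores > 10)
positions                       # 2 4 5
```

The comparison inside the square brackets produces a logical vector of the same length as `scores`, which is exactly the situation illustrated with myLogical.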

Naming indexes of a vector

Referring to an index by name rather than by position can make code more readable and flexible. Use the names() function to assign a name to each position of the vector.

joe <- c(24, 1.70)
names(joe) ## NULL
names(joe) <- c("age", "height")
names(joe) ## "age" "height"
joe["age"] == joe[1] ## age TRUE
names(myVec) <- LETTERS
myVec
# Subsetting by field names
myVec[c("A", "A", "B", "C", "E", "H", "M")] ## The Fibonacci Series :o)

Excluding elements

Sometimes we want to retain most elements of a vector, except for a few unwanted positions. Instead of

specifying all elements of interest, it is easier to specify the ones we want to remove. This is easily done using

the minus sign.

alphabet <- LETTERS

alphabet # print vector alphabet

[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"

[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

vowel.positions <- c(1,5,9,15,21)

alphabet[vowel.positions] # print alphabet in vowel.positions

[1] "A" "E" "I" "O" "U"

consonants <- alphabet[-vowel.positions] # exclude all vowels from the alphabet

consonants

[1] "B" "C" "D" "F" "G" "H" "J" "K" "L" "M" "N" "P" "Q" "R" "S" "T" "V"

[18] "W" "X" "Y" "Z"
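The minus sign excludes elements by position. When the unwanted elements are known by value rather than by position, a logical test can be negated with `!` instead. The following is a sketch under that assumption (the names `vowels` and `consonants.by.value` are illustrative, not from the original tutorial):

```r
alphabet <- LETTERS
vowels <- c("A", "E", "I", "O", "U")    # values to exclude (hypothetical helper vector)

# keep every letter that is NOT one of the vowels
consonants.by.value <- alphabet[!(alphabet %in% vowels)]
length(consonants.by.value)             # 21

# same result as excluding the vowel positions with the minus sign
same <- identical(consonants.by.value, alphabet[-c(1, 5, 9, 15, 21)])
same
```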

3.2 Matrices

Matrices are two-dimensional vectors (tables), explicitly created with the matrix function. Just like one-dimensional vectors, they store same-type elements.

IMPORTANT NOTE: R uses a column-major order for the internal linear storage of array values, meaning that first all of column 1 is stored, then all of column 2, etc. This implies that, by default, when you create a matrix, R will populate the first column, then the second, then the third, and so on until all values given to the matrix function are used. This default behaviour of the matrix function can be changed via the byrow parameter (default value is set to FALSE).

my.matrix <- matrix(1:12, nrow = 3, byrow = FALSE) # byrow = FALSE is the default (see ?matrix)
dim(my.matrix) # check the dimension (size) of the matrix: number of rows and number of columns
my.matrix # print the matrix
xx <- matrix(1:12, nrow = 3, byrow = TRUE)
dim(xx) # check if the dimensions of xx are the same as the dimensions of my.matrix
xx # compare my.matrix with xx and make sure you understand what is happening
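A smaller example may make the column-major rule easier to see. In this sketch the names `by.col` and `by.row` are hypothetical; the same six values are laid out in both filling orders and then flattened back into a vector with `as.vector()`:

```r
by.col <- matrix(1:6, nrow = 2)                 # default: filled column by column
by.row <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled row by row

by.col[1, ]          # first row: 1 3 5
by.row[1, ]          # first row: 1 2 3

# flattening always reads the stored values column by column
as.vector(by.col)    # 1 2 3 4 5 6
as.vector(by.row)    # 1 4 2 5 3 6
```

The byrow argument only affects how the input values are placed; the resulting matrix is still stored column by column.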

Subsetting/Indexing matrices

Very Important Note: The arguments inside the square brackets in matrices (and data.frames - see next section) are the [row_number, column_number]. If any of these is omitted, R assumes that all values are to be used.

# Creating a matrix of characters
my.matrix <- matrix(LETTERS, nrow = 4, byrow = TRUE)
# Please notice the warning message (related to the "recycling" of the LETTERS)
my.matrix # print the matrix
dim(my.matrix) # check the dimensions of the matrix
# Subsetting by indices
my.matrix[, 2] # all rows, column 2 (returns a vector)
my.matrix[3, ] # row 3, all columns (returns a vector)
my.matrix[1:3, c(4, 2)] # rows 1, 2 and 3 from columns 4 and 2 (in this order) (returns a matrix)
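As noted in the comments, selecting a single row or column returns a plain vector. If a one-column (or one-row) matrix is needed instead, the `drop = FALSE` argument of the square brackets keeps the matrix structure. A minimal sketch with a hypothetical matrix `m`:

```r
m <- matrix(1:12, nrow = 3)        # hypothetical 3 x 4 example matrix

col.vector <- m[, 2]               # simplified to a vector: 4 5 6
col.matrix <- m[, 2, drop = FALSE] # kept as a matrix with 3 rows and 1 column

dim(col.vector)                    # NULL, because vectors have no dimensions
dim(col.matrix)                    # 3 1
```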

3.3 Data frames

Data frames are the most flexible and commonly used R data structures, used to store datasets in spreadsheet-like tables.

In a data.frame, usually the observations are the rows and the variables are the columns. Unlike matrices, each column of a data frame can be a vector of a different type (i.e. text, numbers, logicals, etc. can all be stored in the same data frame). Within each column, however, all elements must be of the same data type. Data frames are easily subset by index number or by column name.

df <- data.frame(type = rep(c("case", "control"), c(2, 3)), time = rnorm(5))
# rnorm generates random numbers drawn from a normal distribution
class(df) ## "data.frame"
df

Subsetting/Indexing Data frames

Remember: The arguments inside the square brackets, just like in matrices, are the [row_number, column_number]. If any of these is omitted, R assumes that all values are to be used.

NOTE: R includes a package in its default base installation, named "The R Datasets Package". This resource includes a diverse group of datasets, containing data from different fields: biology, physics, chemistry, economics, psychology, mathematics. These data are very useful to learn R. For more info about these datasets, run the following command: library(help=datasets)

# Familiarize yourself with the iris dataset (built-in dataset with measurements of iris flowers)
iris
# Subset by indices the iris dataset
iris[, 3] # all rows, column 3
iris[1, ] # row 1, all columns
iris[1:9, c(3, 4, 1, 2)] # rows 1 to 9 with columns 3, 4, 1 and 2 (in this order)
# Subset by column name (for data.frames)
iris$Species # show only the species column
iris[, "Sepal.Length"]
# Select the time column from the df data frame created above
df$time ## e.g. 0.5229577 0.7732990 2.1108504 0.4792064 1.3923535 (random values, yours will differ)
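Rows of a data frame can also be selected with a logical condition in the row position, combining the logical subsetting shown for vectors with the [row, column] convention. A sketch using the built-in iris data (the threshold of 6 and the name `long.petals` are arbitrary choices for illustration):

```r
# rows where the petal is longer than 6, keeping only two named columns
long.petals <- iris[iris$Petal.Length > 6, c("Species", "Petal.Length")]

long.petals          # print the selected rows
nrow(long.petals)    # how many flowers satisfy the condition
```

The condition `iris$Petal.Length > 6` yields one TRUE/FALSE value per row, and only the rows marked TRUE are returned.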

3.4 Lists

Lists are very powerful data structures, consisting of ordered sets of elements that can be arbitrary R objects (vectors, strings, functions, etc.) and that are heterogeneous, i.e. each element can be of a different type.

lst = list(a = 1:3, b = "hello", fn = sqrt) # index 3 contains the function "square root"
lst
lst$fn(49) # outputs the square root of 49

Subsetting/Indexing lists

# Subsetting by indices
lst[1] # returns a list with the data contained in position 1 (preserves the type of data as list)
class(lst[1])
lst[[1]] # returns the data contained in position 1 (simplifies to inner data type)
class(lst[[1]])
# Subsetting by name
lst$b # returns the data contained in the element named b, i.e. position 2 (simplifies to inner data type)
class(lst$b)
# Compare the class of these alternative ways of indexing by name
lst["a"]
lst[["a"]]
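The difference between single and double square brackets matters as soon as the extracted element is used further. The sketch below re-creates the same list so that it runs on its own, and stores the classes in illustrative variables (`single.class`, `double.class`):

```r
lst <- list(a = 1:3, b = "hello", fn = sqrt)

single.class <- class(lst["a"])     # "list": still a list, holding one element
double.class <- class(lst[["a"]])   # "integer": the vector stored inside
single.class
double.class

lst[["a"]][2]        # second value of the vector stored in 'a'
lst[c("a", "b")]     # single brackets can select several elements at once
```

In short, single brackets return a smaller list, while double brackets (and `$`) return the content of one element.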

3.5 Data structure conversion

Data structures can be interconverted (coerced) from one type to another. Sometimes it is useful to convert

between data structure types (particularly when using packages). R has several functions for such conversions:

# To check the class of the object:

class(lst)

# To check the basic structure of an object:

str(lst)

# "Force" the object to be of a certain type:

# (this is not valid code, just a syntax example)
as.matrix(myDataFrame) # convert a data frame into a matrix
as.numeric(myChar) # convert text characters into numbers
as.data.frame(myMatrix) # convert a matrix into a data frame
as.character(myNumeric) # convert numbers into text chars
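Since the lines above are only a syntax template, here is a runnable sketch of the same conversions. All object names and values (`my.char`, `my.num`, `small.matrix`, `small.df`) are hypothetical and chosen only for illustration:

```r
my.char <- c("1", "2.5", "10")           # numbers stored as text
my.num <- as.numeric(my.char)            # converted to numeric values
my.num

small.matrix <- matrix(1:6, nrow = 2)
small.df <- as.data.frame(small.matrix)  # columns get default names V1, V2, V3
names(small.df)
is.matrix(as.matrix(small.df))           # and back to a matrix again

as.character(my.num)                     # numbers converted back to text

# text that cannot be read as a number becomes NA (R also raises a warning,
# which is silenced here with suppressWarnings)
not.a.number <- suppressWarnings(as.numeric("abc"))
not.a.number
```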

4. Loops and Conditionals in R (60 min)

4.1 for() and while() loops

R allows the implementation of loops (also called cycles), i.e. repeating instructions in an iterative way. The most common ones are for() loops and while() loops. The syntax for these loops is: for (variable in sequence) { code-block } and while (condition) { code-block }.

# creating a for loop to calculate the first 12 values of the Fibonacci sequence
my.x <- c(1, 1)
for (i in 1:10) {
  my.x <- c(my.x, my.x[i] + my.x[i + 1])
  print(my.x)
}
# while loops will execute a block of commands until a condition is no longer satisfied
x <- 3; x
while (x < 9) {
  cat("Number", x, "is smaller than 9.\n") # cat is a printing function (see ?cat)
  x <- x + 1
}
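When a loop walks over the positions of an existing vector, base R's `seq_along()` is a common way to build the sequence of indices. The following sketch (with a hypothetical vector `values`) accumulates a total in a for loop and compares it with the vectorized `sum()` function:

```r
values <- c(3, 8, 12)            # hypothetical example vector

total <- 0
for (i in seq_along(values)) {   # seq_along(values) is 1 2 3
  total <- total + values[i]     # add one element per iteration
}
total                            # 23
sum(values)                      # same result without an explicit loop
```

Unlike `1:length(values)`, `seq_along()` produces an empty sequence for an empty vector, so the loop body is simply skipped.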

4.2 Conditionals: if() statements

Conditionals allow running commands only when certain conditions are TRUE. The syntax is: if (condition) { code-block }.

x <- -5; x
if (x >= 0) { print("Non-negative number") } else { print("Negative number") }
# Note: The else clause is optional. If the command is run at the command-line,
# and there is an else clause, then either all the expressions must be enclosed
# in curly braces, or the else statement must be on the same line as the end of the if clause.
# coupled with a for loop
x <- c(-5:5); x
for (i in 1:length(x)) {
  if (x[i] >= 0) { # zero is non-negative, so it is printed as a number
    print(x[i])
  } else {
    print("negative number")
  }
}


4.3 Conditionals: ifelse() statements

The ifelse function combines element-wise operations (vectorized) and filtering with a condition that is evaluated. The major advantage of ifelse over the standard if-then-else statement is that it is vectorized. The syntax is: ifelse (condition-to-test, value-for-true, value-for-false).

# re-code gender 1 as F (female) and 2 as M (male)

gender <- c(1,1,1,2,2,1,2,1,2,1,1,1,2,2,2,2,2)

ifelse(gender == 1, "F","M")

[1] "F" "F" "F" "M" "M" "F" "M" "F" "M" "F" "F" "F" "M" "M" "M" "M" "M"
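Because ifelse returns a value for every element, calls can be nested to re-code a vector into more than two groups. A sketch with hypothetical temperature values (the thresholds and labels are arbitrary):

```r
temps <- c(-2, 0, 7, 15)   # hypothetical example values

# first test: below zero? otherwise, second test: below 10?
temp.labels <- ifelse(temps < 0, "below zero",
                      ifelse(temps < 10, "cold", "mild"))
temp.labels                # "below zero" "cold" "cold" "mild"
```

A plain if() statement would not work here, since its condition must be a single TRUE or FALSE value rather than a whole vector.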

5. Functions (60 min)

R allows defining new functions using the function keyword. The syntax (in pseudo-code) is the following:

my.function.name <- function (argument1, argument2, ...) {

expression1

expression2

...

return (value)

}

Now, let's code our own function to calculate the average (or mean) of the values from a vector:

# Define the function

# Please note that the function must be declared in the script before it can be used

my.average <- function(x) {
  average.result <- sum(x) / length(x)
  return(average.result)
}

# Create the data vector

my.data <- c(10,20,30)

# Run the function using the vector as argument

my.average(my.data)

# Compare with R built-in mean function

mean(my.data)
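Functions can take more than one argument, and arguments can have default values. As a sketch, the hypothetical variant below (`my.average2`, not part of the original tutorial) adds an `na.rm` argument that mimics the behaviour of the argument with the same name in the built-in mean function:

```r
# na.rm = FALSE is the default, used when the caller does not supply the argument
my.average2 <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]      # drop the missing values before averaging
  }
  sum(x) / length(x)       # the last evaluated expression is returned
}

with.na <- my.average2(c(10, 20, NA))                   # NA: missing value kept
without.na <- my.average2(c(10, 20, NA), na.rm = TRUE)  # 15
with.na
without.na
```

The example also shows that an explicit return() is optional: a function returns the value of its last evaluated expression.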

6. Loading data and Saving files (30 min)

Most R users need to load their own datasets, usually saved as table files (e.g. Excel, or .csv files), to be able to analyse and manipulate them. After the analysis, the results need to be exported/saved (e.g. to view or use with other software).

# Inspect the esoph built-in dataset

esoph

dim(esoph)

colnames(esoph)

### Saving ###
# Save to a file named esophData.csv the esoph R dataset, separated by commas and
# without quotes (the file will be saved in the current working directory)
write.table(esoph, file = "esophData.csv", sep = ",", quote = FALSE)
# Save to a file named esophData.tab the esoph dataset, separated by tabs and without
# quotes (the file will be saved in the current working directory)
write.table(esoph, file = "esophData.tab", sep = "\t", quote = FALSE)
### Loading ###
# Load a data file into R (the file should be in the working directory)
# read a table with columns separated by tabs
my.data.tab <- read.table("esophData.tab", sep = "\t", header = TRUE)
# read a table with columns separated by commas
my.data.csv <- read.csv("esophData.csv", header = TRUE)

Note: if you want to load or save the files in directories different from the working dir, just use (inside quotes) the full path as the first argument, instead of just the file name (e.g. "/home/Desktop/r_Workshop/esophData.csv").
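The following sketch shows a complete save-and-reload round trip that does not leave files behind. It assumes a temporary file created with base R's `tempfile()`, and uses `write.csv()` with `row.names = FALSE` so that the row numbers are not written as an extra column (the names `out.file` and `esoph.again` are illustrative):

```r
out.file <- tempfile(fileext = ".csv")                 # path to a temporary file
write.csv(esoph, file = out.file, row.names = FALSE)   # save without row names

esoph.again <- read.csv(out.file, header = TRUE)       # load the file back
dim(esoph.again)         # same number of rows and columns as esoph
colnames(esoph.again)    # same column names as esoph

unlink(out.file)         # delete the temporary file
```

Note that the column types may differ after reloading, since a text file stores the values but not the factor structure of the original data frame.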

7. Some great R functions to "play" with (60 min)

7.1 Using the iris built-in dataset

# the unique function returns a vector with unique entries only (removes duplicated elements)
unique(iris$Sepal.Length)
# length returns the size of the vector (i.e. the number of elements)
length(unique(iris$Sepal.Length))
# table counts the occurrences of entries (tally)
table(iris$Species)
# aggregate computes statistics of data aggregates (groups)
aggregate(iris[, 1:4], by = list(iris$Species), FUN = mean, na.rm = TRUE)
# the %in% operator tests which elements of the first vector are present in the second;
# used for subsetting, it returns the intersection between the two vectors
month.name[month.name %in% c("CCMar", "May", "Fish", "July", "September", "Cool")]
# merge joins data frames based on a common column (that functions as a "key")
df1 <- data.frame(x = 1:5, y = LETTERS[1:5]); df1
df2 <- data.frame(x = c("Eu", "Tu", "Ele"), y = 1:6); df2
merge(df1, df2, by.x = 1, by.y = 2, all = TRUE)
# cbind and rbind take a sequence of vector, matrix or data-frame arguments
# and combine them by columns or rows, respectively
my.binding <- as.data.frame(cbind(1:7, LETTERS[1:7])) # cbind coerces the numbers to text, since a matrix holds a single type
my.binding
my.binding <- cbind(my.binding, 8:14)[, c(1, 3, 2)] # insert a new column and re-order them
my.binding
my.binding2 <- rbind(seq(1, 21, by = 2), c(1:11))
my.binding2
# reverse the vector
rev(LETTERS)
# sum and cumulative sum
sum(1:50); cumsum(1:50)
# product and cumulative product
prod(1:25); cumprod(1:25)
### Playing with some R built-in datasets (see library(help=datasets) )
iris # familiarize yourself with the iris data
# mean, standard deviation, variance and median
mean(iris[, 2]); sd(iris[, 2]); var(iris[, 2]); median(iris[, 2])
# minimum, maximum, range and summary statistics
min(iris[, 1]); max(iris[, 1]); range(iris[, 1]); summary(iris)
# exponential, logarithm
exp(iris[1, 1:4]); log(iris[1, 1:4])
# sine, cosine and tangent (radians, not degrees)
sin(iris[1, 1:4]); cos(iris[1, 1:4]); tan(iris[1, 1:4])
# sort, order and rank the values (unlist converts the one-row data frame into a plain numeric vector)
first.row <- unlist(iris[1, 1:4])
sort(first.row); order(first.row); rank(first.row)
# useful to be used with if conditionals
any(iris[1, 1:4] > 2) # ask R if there are any values higher than 2
all(iris[1, 1:4] > 2) # ask R if all values are higher than 2
# select data
which(iris[1, 1:4] > 2)
which.max(iris[1, 1:4])
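Of the functions listed in this section, %in% is the one most often misread: on its own it returns one TRUE/FALSE value per element of the left-hand vector, and it only yields the common elements when that result is used for subsetting. A minimal sketch, using a hypothetical lookup vector `wanted`:

```r
wanted <- c("May", "Fish", "July")       # hypothetical lookup values

is.month <- wanted %in% month.name       # one logical value per element of 'wanted'
is.month                                 # TRUE FALSE TRUE

common <- month.name[month.name %in% wanted]  # subsetting keeps the matching months
common                                   # "May" "July"

intersect(month.name, wanted)            # base R function giving the same elements
```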

7.2 Using the esoph built-in dataset

The esoph (Smoking, Alcohol and (O)esophageal Cancer data) built-in dataset presents 2 types of variables: continuous numerical variables (the number of cases and the number of controls), and discrete categorical variables (the age group, the tobacco smoking group and the alcohol drinking group). Sometimes it is hard to "categorize" continuous variables, i.e. to group them in specific intervals of interest, and name these groups (also called levels).

Accordingly, imagine that we are interested in classifying the number of cancer cases according to their occurrence: frequent, intermediate and rare. This type of variable recoding into factors is easily accomplished using the function cut(), which divides the range of x into intervals and codes the values in x according to the interval into which they fall.

# subset non-contiguous data from the esoph dataset
esoph
summary(esoph)
# cancers in patients consuming 30 g/day of tobacco or more
subset(esoph$ncases, esoph$tobgp == "30+")
# total nr of cancers in patients aged 75 or older
sum(subset(esoph$ncases, esoph$agegp == "75+"))
# factorize the nr of cases in 3 levels, equally spaced,
# and add the new column named cat_ncases, to the dataset
esoph$cat_ncases <- cut(esoph$ncases, 3, labels = c("rare", "med", "freq"))
summary(esoph)
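Passing a single number to cut() creates equally spaced intervals. When the interval limits are known in advance, they can be passed explicitly through the `breaks` argument instead. The sketch below uses a hypothetical vector of counts and arbitrary limits (2 and 8) purely for illustration:

```r
case.counts <- c(0, 1, 4, 9, 17)   # hypothetical example counts

# intervals: up to 2 -> "rare", above 2 and up to 8 -> "med", above 8 -> "freq"
case.groups <- cut(case.counts,
                   breaks = c(-Inf, 2, 8, Inf),
                   labels = c("rare", "med", "freq"))
case.groups          # rare rare med freq freq
table(case.groups)   # number of values in each level
```

By default each interval includes its upper limit, so a value of exactly 2 would be coded as "rare".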

END


R for Absolute Beginners - Part 2/2

Summary Statistics and Graphics in R

Authors: Isabel Duarte & Ramiro Magno | Collaborators: Bruno Louro & Rui Machado

5 June 2018

Contents

Introduction
Hands-on Exercises
  Exercise 0. Understand the context of your data (15 min)
  Exercise 1. Get the data (10 min)
  Exercise 2. Format conversion (10 min)
  Exercise 3. Set working directory (15 min)
  Exercise 4. Checking the data file structure (10 min)
  Exercise 5. Load data into R (20 min)
    Exercise 5.1
    Exercise 5.2
  Exercise 6. Inspecting an R data frame (20 min)
  Exercise 7. Tidying-up the data (50 min)
    Exercise 7.1
    Exercise 7.2
    Exercise 7.3
    Exercise 7.4
    Exercise 7.5
    Exercise 7.6
  Exercise 8. Exploring the data (1h30m)
    Exercise 8.1
    Exercise 8.2
    Exercise 8.3
    Exercise 8.4
    Exercise 8.5
  Exercise 9. Exploring the data graphically (1h30m)
    Exercise 9.0 Introduction
    Exercise 9.1 Scatterplots: Basic plotting with plot
    9.2 Histograms and Boxplots
    9.3 Curves
    9.4 Multiple Graphs
    Exercise 9.5 Export and save plots
  Exercise 10. Extra study
    Exercise 10.1 Write a function of your own
    Exercise 10.2 Scatterplots with Error Bars

Introduction

The goal of this tutorial is to get you acquainted with common first steps when dealing with a dataset in R,

such as:

1. reading/loading data into R

2. inspecting R objects

3. cleaning and tidying-up the data

4. basic descriptive statistics.

To this end, you will be guided through various exercises. Each exercise's title indicates the time allocated

to solve it. Since this time indication is an over-estimation of the time needed to properly answer each

question, please take your time to think about it and discuss it with the instructors. The tutorial comprises

10 exercises, of which only the first nine are to be fully completed during the workshop. Feel free to proceed

with Exercise 10 if you finish earlier; if not, try to complete it at home. For now, keep calm and carry

on... and remember, do not hesitate to ask the instructors any questions!

Important Note: There are several alternative ways to accomplish these exercises in R; our suggestions are just one possibility, particularly targeted at beginners and using simple R functions, organized in small individual steps (without using any extra R packages). If you know another way that you find more intuitive (easier for you), please feel free to use it; just make sure that you truly understand each command used.

Hands-on Exercises

This tutorial uses a dataset retrieved from the study Reversal of ocean acidification enhances net coral reef

calcification by Albright et al., 2016.

Exercise 0. Understand the context of your data (15 min)

Please read the following introduction.

A recent in situ experiment (Albright et al., 2016) has found that reducing the acidity of the seawater

surrounding a natural coral reef in the southern Great Barrier Reef significantly increases calcification. In

this experiment the acidity levels have been reverted to those characteristic of the pre-industrial era. By

"turning back time", the authors demonstrate that, all else being equal, net coral-reef calcification would

have been around 7% higher than current observations, suggesting that ocean acidification may already be

diminishing coral-reef growth.

One Tree Reef encloses three lagoons, two of which are hydrologically distinct (i.e., separated by reef walls).

At low tide, the water level drops below the outer reef crest, and the lagoons are effectively isolated from the

ocean (Figure 2c). Since First Lagoon sits approximately 30 cm higher than Third Lagoon, gravity-driven,

unidirectional flow results from First Lagoon over the reef flat separating the two lagoons, ending up in Third

Lagoon. The study site is situated along a section of the reef wall separating First and Third Lagoons (Figure

2d).

Exercise 1. Get the data (10 min)

Download the Supplementary Table 1 file containing the raw data for chemical and physical parameters

measured (or calculated) for all days and station locations. Save it in a directory of your choice, and please

keep the file's original name.

Exercise 2. Format conversion (10 min)

Open the downloaded file (it should be named nature17155-s2.xlsx) with a spreadsheet program (e.g. Microsoft Excel or OpenOffice Calc) and export it as CSV (comma-separated values). Save the exported file in the same folder and name it nature17155-s2.csv.

Figure 1 | One Tree Reef in the southern Great Barrier Reef, Australia (Janice M. Lough, 2016).

Figure 2 | One Tree Reef in the southern Great Barrier Reef, Australia (Albright et al., 2016).

Exercise 3. Set working directory (15 min)

Open RStudio (if you have not done so already) and from the R console run a command (or a combination of commands) that confirms that the file nature17155-s2.csv is "visible" from R. If the working directory is not the one containing nature17155-s2.csv, then change to it accordingly.

Hint: the functions getwd, setwd, list.files, dir and file.exists are your friends.
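
One possible combination of these functions is sketched below. It is only an illustration: the path passed to setwd is a hypothetical placeholder that you must replace with your own folder, and csv.found is just a name chosen for this example.

```r
# where is R currently working?
getwd()

# list the CSV files visible from the working directory
list.files(pattern = "\\.csv$")

# TRUE if the exported file can be seen from here, FALSE otherwise
csv.found <- file.exists("nature17155-s2.csv")
csv.found

# if FALSE, move to the folder that holds the file, e.g.
# setwd("path/to/your/folder")   # hypothetical path, replace with your own
```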

Exercise 4. Checking the data file structure (10 min)

The file nature17155-s2.csv has been saved as a CSV, which is a particular type of text file, where each value (i.e. column) is separated by a comma character (,). However, a few variations are possible, the most relevant being: (i) the specific character used as value separator (e.g. comma, semi-colon (;), tab); (ii) whether values are quoted ("); (iii) whether the first line is a header (i.e. column names) or not. These details are decisive for correctly importing the data into R.

So, in order to have a glance at how the data are formatted/organized in the dataset file nature17155-s2.csv, read (i.e. show) the first 3 lines in the R console:

readLines("nature17155-s2.csv", n = 3)

[1] "Station ID,Transect,Date,Type,T (C) in situ,Salinity,Alkalinity (umol/kg), Rhodamine (ppb),..."

[2] "D-16,Down,20140916,Control ,22.507,35.8542,2280.58,0.0507,8.1131,298.44,2273.95,..."

[3] "D-16,Down,20140917,Experiment ,22.995,35.8287,2253.42,0.1940,8.1206,298.4,2248.47,..."

Exercise 5. Load data into R (20 min)

From the output of the last command one can see that the comma character is indeed separating the different columns. Notice however that data-fields with text containing commas are quoted, i.e. enclosed in quotation marks, so that those free-text commas are not mistakenly used as column separators. Additionally, notice that the first line is the header of the dataset (column names) and not an observation/data point.

Exercise 5.1

Now, import the dataset using the function read.table with appropriate arguments, and save it in an R object named reef:

reef <- read.table("nature17155-s2.csv", header = TRUE, sep = ",", strip.white = TRUE)

# the strip.white argument removes leading and trailing white space (spaces at the

# beginning and end of each unquoted character field)
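
The effect of strip.white can be checked on a tiny file. The sketch below is an illustration under stated assumptions: it writes a made-up two-row CSV to a temporary file (the file and its values are hypothetical), with a trailing space after each Type value, similar to what the lines printed in Exercise 4 appear to contain.

```r
# write a small, hypothetical CSV file with trailing spaces in the Type field
tmp <- tempfile(fileext = ".csv")
writeLines(c("Type,Value", "Control ,1.5", "Experiment ,2.5"), tmp)

# import it without and with strip.white
raw <- read.table(tmp, header = TRUE, sep = ",")
clean <- read.table(tmp, header = TRUE, sep = ",", strip.white = TRUE)

# as.character makes the comparison independent of whether text was read as factor
raw.type <- as.character(raw$Type)
clean.type <- as.character(clean$Type)

# number of characters in each value: the trailing space is counted in raw.type
nchar(raw.type)
nchar(clean.type)

# remove the temporary file
unlink(tmp)
```

Without strip.white, a comparison such as Type == "Control" would silently fail to match the values that carry an extra space.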

Exercise 5.2

Next, let us inspect some of the imported data which has been saved in the variable reef.

class(reef)

head(reef) # inspect the first lines of reef

tail(reef) # inspect the last lines of reef

Exercise 6. Inspecting an R data frame (20 min)

The imported data has been loaded as a data frame having several columns, such as Station.ID, Transect, Date, etc. Notice that special characters like white spaces or parentheses in the column names have been converted by R to dots (.).
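
This conversion can be reproduced with make.names, the function that read.table uses to adjust the header when its check.names argument is TRUE (the default). A minimal sketch, using three of the original header fields shown in Exercise 4 (the object names original.names and syntactic.names are chosen just for this example):

```r
# header fields as they appear in the CSV file
original.names <- c("Station ID", "T (C) in situ", "Alkalinity (umol/kg)")

# each character that is not allowed in an R name is replaced by a dot
syntactic.names <- make.names(original.names)
syntactic.names
```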

Please note that in RStudio, data frames can be inspected graphically: clicking on a data frame's name in the environment panel opens a new tab in the text editor panel, showing its first 1000 lines; so try that too!

Examine other relevant information about the reef data frame. Note: This is a challenge to try to discover the functions that output these results.

[1] 526 16

[1] "Station.ID" "Transect"

[3] "Date" "Type"

[5] "T..C..in.situ" "Salinity"

[7] "Alkalinity..umol.kg." "Rhodamine..ppb."

[9] "Spec.pH..total." "T..K..Spec.pH"

[11] "Alk..S.Normalized..umol.kg." "Rhodamine..S.Normalized..ppb."

[13] "in.situ.pH..total." "in.situ.pCO2..uatm."

[15] "in.situ.CT..umol.kg." "in.situ.aragonite"

[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"

[12] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22"

[23] "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33"

[34] "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44"

[45] "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55"

[56] "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66"

[67] "67" "68" "69" "70" "71" "72" "73" "74" "75" "76" "77"

[78] "78" "79" "80" "81" "82" "83" "84" "85" "86" "87" "88"

[89] "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99"

[100] "100" "101" "102" "103" "104" "105" "106" "107" "108" "109" "110"

[111] "111" "112" "113" "114" "115" "116" "117" "118" "119" "120" "121"

[122] "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132"

[133] "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143"

[144] "144" "145" "146" "147" "148" "149" "150" "151" "152" "153" "154"

[155] "155" "156" "157" "158" "159" "160" "161" "162" "163" "164" "165"

[166] "166" "167" "168" "169" "170" "171" "172" "173" "174" "175" "176"

[177] "177" "178" "179" "180" "181" "182" "183" "184" "185" "186" "187"

[188] "188" "189" "190" "191" "192" "193" "194" "195" "196" "197" "198"

[199] "199" "200" "201" "202" "203" "204" "205" "206" "207" "208" "209"

[210] "210" "211" "212" "213" "214" "215" "216" "217" "218" "219" "220"

[221] "221" "222" "223" "224" "225" "226" "227" "228" "229" "230" "231"

[232] "232" "233" "234" "235" "236" "237" "238" "239" "240" "241" "242"

[243] "243" "244" "245" "246" "247" "248" "249" "250" "251" "252" "253"

[254] "254" "255" "256" "257" "258" "259" "260" "261" "262" "263" "264"

[265] "265" "266" "267" "268" "269" "270" "271" "272" "273" "274" "275"

[276] "276" "277" "278" "279" "280" "281" "282" "283" "284" "285" "286"

[287] "287" "288" "289" "290" "291" "292" "293" "294" "295" "296" "297"

[298] "298" "299" "300" "301" "302" "303" "304" "305" "306" "307" "308"

[309] "309" "310" "311" "312" "313" "314" "315" "316" "317" "318" "319"

[320] "320" "321" "322" "323" "324" "325" "326" "327" "328" "329" "330"

[331] "331" "332" "333" "334" "335" "336" "337" "338" "339" "340" "341"

[342] "342" "343" "344" "345" "346" "347" "348" "349" "350" "351" "352"

[353] "353" "354" "355" "356" "357" "358" "359" "360" "361" "362" "363"

[364] "364" "365" "366" "367" "368" "369" "370" "371" "372" "373" "374"

[375] "375" "376" "377" "378" "379" "380" "381" "382" "383" "384" "385"

[386] "386" "387" "388" "389" "390" "391" "392" "393" "394" "395" "396"

[397] "397" "398" "399" "400" "401" "402" "403" "404" "405" "406" "407"

[408] "408" "409" "410" "411" "412" "413" "414" "415" "416" "417" "418"

[419] "419" "420" "421" "422" "423" "424" "425" "426" "427" "428" "429"

[430] "430" "431" "432" "433" "434" "435" "436" "437" "438" "439" "440"

[441] "441" "442" "443" "444" "445" "446" "447" "448" "449" "450" "451"

[452] "452" "453" "454" "455" "456" "457" "458" "459" "460" "461" "462"

[463] "463" "464" "465" "466" "467" "468" "469" "470" "471" "472" "473"

[474] "474" "475" "476" "477" "478" "479" "480" "481" "482" "483" "484"

[485] "485" "486" "487" "488" "489" "490" "491" "492" "493" "494" "495"

[496] "496" "497" "498" "499" "500" "501" "502" "503" "504" "505" "506"

[507] "507" "508" "509" "510" "511" "512" "513" "514" "515" "516" "517"

[518] "518" "519" "520" "521" "522" "523" "524" "525" "526"

'data.frame': 526 obs. of 16 variables:

$ Station.ID : Factor w/ 24 levels "D0","D1","D-1",..: 7 7 7 7 7 7 7 7 7 7 ...

$ Transect : Factor w/ 2 levels "Down","Up": 1 1 1 1 1 1 1 1 1 1 ...

$ Date : int 20140916 20140917 20140918 20140919 20140920 20140921 20140924 20140925 20140926 20140927 ...

$ Type : Factor w/ 2 levels "Control","Experiment": 1 2 2 2 1 2 2 2 2 2 ...

$ T..C..in.situ : num 22.5 23 23 24.5 24.6 ...

$ Salinity : num 35.9 35.8 35.8 35.9 35.9 ...

$ Alkalinity..umol.kg. : num 2281 2253 2245 2189 2186 ...

$ Rhodamine..ppb. : num 0.0507 0.194 0.6468 0.3877 0.1912 ...

$ Spec.pH..total. : num 8.11 8.12 8.16 8.13 8.18 ...

$ T..K..Spec.pH : num 298 298 298 299 299 ...

$ Alk..S.Normalized..umol.kg. : num 2274 2248 2243 2182 2178 ...

$ Rhodamine..S.Normalized..ppb.: num 0.0505 0.1935 0.6462 0.3864 0.1905 ...

$ in.situ.pH..total. : num 8.15 8.15 8.18 8.15 8.2 ...

$ in.situ.pCO2..uatm. : num 287 284 260 276 241 ...

$ in.situ.CT..umol.kg. : num 1932 1903 1879 1834 1801 ...

$ in.situ.aragonite : num 3.78 3.79 3.95 3.82 4.12 3.92 3.6 3.75 3.39 3.13 ...
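
If you get stuck, the same kinds of output can be produced for any data frame. The sketch below uses a tiny made-up data frame (toy and its values are hypothetical) rather than reef, so you can still match each function to the outputs above by yourself.

```r
# hypothetical miniature data frame, for illustration only
toy <- data.frame(Station.ID = c("D-16", "D0", "U1"),
                  Salinity = c(35.9, 35.8, 35.8))

dim(toy)        # number of rows and number of columns
colnames(toy)   # column names
rownames(toy)   # row names (character strings, even when they look like numbers)
str(toy)        # compact overview of the structure: type and first values of each column
```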

Exercise 7. Tidying-up the data (50 min)

From the output of the previous commands it can be seen that there are 16 variables (columns). Each

row refers to an observation. In this context, observations correspond to sampling stations where sets of

measurements were taken in the reef-flat study area.

The first column of the reef data frame is the Station.ID, an ID reference that identifies each location. This ID is composed of two parts: (i) the first character, either U (referring to the upstream transect) or D (for downstream transect); (ii) the following characters indicate the position (in metres) of the sampling location relative to the tank. Since this information is pivotal for later analyses, it is useful to save the station positions in their own column, formatted as a numeric vector.

The following exercises (7.*) show one possible way of transforming the Station.ID column into a station position vector: (i) splitting the string in two parts (splitting the U or D from the numeric portion), (ii) extracting the position (second part) as a numeric vector and (iii) creating a new data frame (reef2) containing all original reef data plus the position as a new column. Try it for yourself, and make sure that you understand all the steps and code involved.

Exercise 7.1

Format the column names removing the extra dots between the text. To do this we will use the function

gsub that finds patterns in text and replaces those patterns with other text (also called a string).

# Use gsub to substitute two consecutive dots with only one dot in the

# column names of reef. Please run ?gsub to learn about regular expressions

# (regex) and how to use them.

colnames(reef) <- gsub("\\.\\.", ".", colnames(reef))
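
To see why the dots must be escaped in the pattern, here is a minimal sketch on three of the column names listed in Exercise 6 (dotted and tidy are names chosen just for this example). In a regular expression an unescaped dot matches any character, so the second call replaces every pair of characters, not only pairs of dots.

```r
dotted <- c("T..C..in.situ", "Alk..S.Normalized..umol.kg.", "Salinity")

# escaped dots: only two consecutive literal dots are replaced
tidy <- gsub("\\.\\.", ".", dotted)
tidy

# unescaped dots: "." matches ANY character, so the whole name is mangled
mangled <- gsub("..", ".", "Salinity")
mangled
```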

Extract the Station.ID column as a character vector.

station.id <- as.character(reef$Station.ID)

Exercise 7.2

Use the strsplit function to split the ID string into its two relevant parts. The split argument indicates which characters are to be used to split the string. Please note that the characters used for the splitting are omitted (removed), leaving an empty string ("") in their place.

split.result.list <- strsplit(station.id, split = "[UD]")

Exercise 7.3

For each element in station.id we obtained two strings: the empty string "" and a string with the position in meters. Since strsplit returns a list, we will unlist it to convert it to a character vector.

split.result.vector <- unlist(split.result.list)

Exercise 7.4

Next we must remove the empty strings "" in order to get the positions nicely arranged in a single vector.

station.position <- split.result.vector[split.result.vector != ""]

Exercise 7.5

Since the positions are distances in meters, and we would like to use those values for future calculations, we

must convert them from characters (text) to numeric.

station.position <- as.numeric(station.position)
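
The steps of Exercises 7.2 to 7.5 can be tried out on a handful of IDs before running them on the whole column. The sketch below uses a made-up sample (toy.id is hypothetical, written in the same style as the station IDs) and also shows one possible alternative, which deletes the leading letter with sub instead of splitting the string.

```r
# hypothetical sample of station IDs, for illustration only
toy.id <- c("D-16", "D0", "U1", "U-1")

# steps 7.2 to 7.5 chained together: split, flatten, drop "", convert
parts <- unlist(strsplit(toy.id, split = "[UD]"))
toy.position <- as.numeric(parts[parts != ""])
toy.position

# alternative sketch: remove the leading U or D, then convert to numeric
toy.position.alt <- as.numeric(sub("^[UD]", "", toy.id))

# both approaches give the same positions
identical(toy.position, toy.position.alt)
```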

Exercise 7.6

Finally, to include the new vector in the reef data frame, we will create a new data frame (reef2) by combining columns (cbind): the first column of the reef data frame, then the newly created column (to be named Station.Position) as column number 2, followed by all other columns from the reef data frame (removing column 1, which is already present in position 1).

# create the new reef2 data frame by binding the first column of reef, with the

# station.position vector and the rest of the reef data frame (without the 1st column)

reef2 <- cbind(reef[, 1], station.position, reef[, -1])

# check the column names attribute

names(reef2)

[1] "reef[, 1]" "station.position"

[3] "Transect" "Date"

[5] "Type" "T.C.in.situ"

[7] "Salinity" "Alkalinity.umol.kg."

[9] "Rhodamine.ppb." "Spec.pH.total."

[11] "T.K.Spec.pH" "Alk.S.Normalized.umol.kg."

[13] "Rhodamine.S.Normalized.ppb." "in.situ.pH.total."

[15] "in.situ.pCO2.uatm." "in.situ.CT.umol.kg."

[17] "in.situ.aragonite"

# assign meaningful column names to the first two columns

names(reef2)[c(1, 2)] <- c("Station.ID", "Station.Position")

# check the final column names

names(reef2)

[1] "Station.ID" "Station.Position"

[3] "Transect" "Date"

[5] "Type" "T.C.in.situ"

[7] "Salinity" "Alkalinity.umol.kg."

[9] "Rhodamine.ppb." "Spec.pH.total."

[11] "T.K.Spec.pH" "Alk.S.Normalized.umol.kg."

[13] "Rhodamine.S.Normalized.ppb." "in.situ.pH.total."

[15] "in.situ.pCO2.uatm." "in.situ.CT.umol.kg."

[17] "in.situ.aragonite"
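
As a side note, the renaming step can be avoided by naming the arguments passed to cbind. This is only an alternative sketch, shown on a made-up miniature of the data (toy, toy.position and toy2 are hypothetical names and values):

```r
# hypothetical miniature of the reef data frame, for illustration only
toy <- data.frame(Station.ID = c("D-16", "U1"),
                  Transect = c("Down", "Up"),
                  Salinity = c(35.9, 35.8))
toy.position <- c(-16, 1)

# naming the arguments of cbind names the new columns directly
toy2 <- cbind(Station.ID = toy[, 1],
              Station.Position = toy.position,
              toy[, -1])

names(toy2)
```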

Exercise 8. Exploring the data (1h30m)

Exercise 8.1

This reef experiment is a case-control study. How many observations (rows) are there for Control and Experiment days?

(Hint: reef2$Type indicates whether each observation was taken on a Control or an Experiment day; the table function discussed yesterday can be useful for counting.)

(Answer: There are 166 and 360 observations for Control and Experiment days, respectively.)

Exercise 8.2

What is the time interval for this study?

(Hint: The column Date contains this information; the min, max and/or range functions might help.)

(Answer: The study took place between 16/09/2014 and 10/10/2014.)
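
The Date column stores each day as an integer such as 20140916. If you want to present the interval as calendar dates, one possible approach is to convert these integers with as.Date. A minimal sketch, using the minimum and maximum values of Date (the object names are chosen just for this example):

```r
# first and last day of the study, as stored in the Date column
study.limits <- c(20140916, 20141010)

# convert the integers to text, then parse them as year-month-day
study.dates <- as.Date(as.character(study.limits), format = "%Y%m%d")

# print them as day/month/year
format(study.dates, "%d/%m/%Y")

# number of days between the first and the last day
diff(study.dates)
```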

Exercise 8.3

This study comprises Control days (when no alkalinity is added to the solution pumped to the reef flat) and

Experiment days (when 600 grams of NaOH are added). From the time interval of the study, how many days

were "Control days", and how many days were "Experimental days"?

(Hint: the unique and length functions will be useful. The Date and Type columns are pivotal.)

(Answer: 7 were control days and 15 were experimental days.)
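
The difference between counting observations (Exercise 8.1) and counting days (this exercise) is easier to see on a small example. The sketch below uses a made-up data frame (toy and its values are hypothetical) in which each day has more than one observation:

```r
# hypothetical observations: several rows share the same date
toy <- data.frame(Date = c(20140916, 20140916, 20140917, 20140917, 20140918),
                  Type = c("Control", "Control", "Experiment", "Experiment", "Experiment"))

# number of observations (rows) per type
table(toy$Type)

# number of distinct days per type: keep each date only once, then count
n.control.days <- length(unique(toy$Date[toy$Type == "Control"]))
n.experiment.days <- length(unique(toy$Date[toy$Type == "Experiment"]))
c(Control = n.control.days, Experiment = n.experiment.days)
```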

Exercise 8.4

Measurements were taken along two transects: upstream and downstream of the reef flat (Figure 3). Compare

the spreading of the locations of the sampling stations up- and downstream of the reef flat.

(i) Are the two spreads alike?

(ii) Do you think there is an experimental design reason that explains this difference?

(iii) The standard deviation and the inter-quartile range seem to give contradictory results on which transect has its sampling stations more spread out. Which metric, in your opinion, best reflects your visual impression of spread?

# Assessing the spread of the stations' positions along each transect by

# looking at the standard deviation, inter-quartile range and range

# Upstream transect

up.pos <- unique(reef2$Station.Position[reef2$Transect == "Up"])

# mean position of the upstream transect stations

mean(up.pos)

[1] 0

# standard deviation of the upstream transect stations' positions

sd(up.pos)

[1] 8.284021

sqrt(var(up.pos)) # the standard deviation is the square root of the variance!

[1] 8.284021

# inter-quartile range of the upstream transect stations' positions

IQR(up.pos)

[1] 3

# range of the upstream transect stations' positions

range(up.pos)

[1] -16 16

# Downstream transect

dn.pos <- unique(reef2$Station.Position[reef2$Transect == "Down"])

# mean position of the downstream transect stations

mean(dn.pos)

# standard deviation of the downstream transect stations' positions

sd(dn.pos)

# inter-quartile range of the downstream transect stations' positions

IQR(dn.pos)

# range of the downstream transect stations' positions

range(dn.pos)

Answer: The spread, as measured by the standard deviation σ, is surprisingly similar: upstream σ ≈ 8.3 and downstream σ ≈ 7.96. Accordingly, judging by the standard deviation alone, one might think that the sampling stations are slightly more spread out along the upstream transect. However, this contradicts our intuition when looking at Figure 3, since the majority of upstream stations are very close to the centre. This could be explained by the fact that the standard deviation is known to be very sensitive to outliers; however, both transects have "outlier" stations at the edges, at positions -16 and 16 metres, as one can observe from the output of range. For this case, the inter-quartile range (IQR) proves to be a more robust metric (IQR=3 upstream; IQR=8 downstream), working best at mathematically describing our intuition when observing Figure 3. This difference in spread between the two transects was probably a choice made during the experimental design phase, reflecting the anticipated mixing and dilution of the solution as it flowed from upstream to downstream. Therefore it made sense to concentrate the sampling effort close to the source (upstream) and spread it out more at the downstream transect.

Figure 3 | Sampling stations' locations (blue circles).
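
The different behaviour of the two metrics can be reproduced with two small made-up sets of positions (the vectors clustered and even are hypothetical and are not the real station positions). In this sketch the clustered set has the larger standard deviation, because of its two distant stations, but the smaller inter-quartile range, because its central half is tightly packed.

```r
# hypothetical station positions (in metres), for illustration only
clustered <- c(-16, -2, -1, 0, 1, 2, 16)   # most stations near the centre
even <- c(-12, -8, -4, 0, 4, 8, 12)        # stations evenly spaced

# standard deviation: dominated by the two distant stations at -16 and 16
sd(clustered)
sd(even)

# inter-quartile range: only looks at the central half of the positions
IQR(clustered)
IQR(even)
```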

Exercise 8.5

summary is a very useful function that outputs the summary statistics for an R object. Try it on a subset of the reef2 data frame: run summary for the variables Date, Type, Station.Position and Transect. The output should look like this:

      Date                  Type      Station.Position      Transect
 Min.   :20140916   Control   :166   Min.   :-16.00000   Down:328
 1st Qu.:20140921   Experiment:360   1st Qu.: -3.75000   Up  :198
 Median :20140928                    Median :  0.00000
 Mean   :20140960                    Mean   : -0.01141
 3rd Qu.:20141005                    3rd Qu.:  3.00000
 Max.   :20141010                    Max.   : 16.00000

Notice how the output is presented differently for Date and Station.Position compared to Type and Transect. Why is it presented differently?
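
One way to explore this question is to call summary on two small vectors of different types. The sketch below uses made-up values (positions and transect are hypothetical): one numeric vector and one factor.

```r
# hypothetical values, for illustration only
positions <- c(-16, -2, 0, 2, 16)                            # numeric vector
transect <- factor(c("Up", "Down", "Up", "Down", "Down"))    # factor (categorical)

# for numbers, summary reports quartiles and the mean
num.summary <- summary(positions)
num.summary

# for a factor, summary counts the observations in each level
fac.summary <- summary(transect)
fac.summary
```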

Exercise 9. Exploring the data graphically (1h30m)

Base R alone allows the plotting of several, highly customizable graphics. There are however many graphical packages developed by the community that greatly expand its plotting potential (e.g., ggplot2). Nevertheless, in this tutorial we will focus only on a few of the most common plots that can be generated with functions included in the base installation of R.

R graphics are created using a series of high- and low-level plotting commands. High-level commands create new plots via functions such as plot, hist, boxplot, or curve, whereas low-level functions add to an existing plot created with a high-level plotting function; examples are points, lines, text, axis, arrows, etc. Graphical parameters are customizable via the function par, containing over 70 different customizable fields (for details, see ?par). In this exercise we will look into a few of the common plotting functions: plot, hist, boxplot and curve; as well as several parameters that allow you to tweak the look and feel of your graphics.

Exercise 9.0 Introduction

Ocean acidification is the ongoing decrease in the pH of the Earth's oceans, caused by the uptake of carbon dioxide (CO2) from the atmosphere. Seawater is slightly alkaline (pH > 8), and this acidification is a shift towards less alkaline conditions rather than acidic conditions (pH < 7). An estimated 30–40% of the carbon dioxide from human activity released into the atmosphere dissolves into oceans, rivers and lakes. To achieve chemical equilibrium, some of it reacts with the water to form carbonic acid. Some of these extra carbonic acid molecules react with a water molecule to give a bicarbonate ion and a hydronium ion, thus increasing ocean acidity (H+ ion concentration).

Aragonite is a carbonate mineral, one of the two common, naturally occurring crystal forms of calcium carbonate, CaCO3 (the other form being the mineral calcite). The CaCO3 saturation state Ω_arag was one of the chemical parameters measured at the sampling stations.

The saturation state of seawater with respect to aragonite can be defined as the product of the concentrations of dissolved calcium and carbonate ions in seawater divided by their product at equilibrium:

$$\Omega_{arag} = \frac{[\mathrm{Ca}^{2+}]\,[\mathrm{CO}_3^{2-}]}{[\mathrm{CaCO}_3]}$$

Figure 4 | Ocean acidification and the resulting reduction in carbonate ions (climatecommission.angrygoats.net).

Exercise 9.1 Scatterplots: Basic plotting with plot

Let's see if the Albright et al. study recapitulates the reported effect of acidic conditions leading to lower levels of CaCO3 (corresponding to a lower Ω_arag). To this end we will plot the aragonite saturation state Ω_arag (reef2$in.situ.aragonite) vs pH (reef2$in.situ.pH.total.).

# xvalues: pH

xvalues <- reef2$in.situ.pH.total.

# yvalues: aragonite saturation state

yvalues <- reef2$in.situ.aragonite

plot(xvalues, yvalues)

[Figure: scatterplot with x axis "xvalues" (about 7.9 to 8.4) and y axis "yvalues" (about 3 to 6).]

From the generated plot it is clear that the two variables are indeed correlated: the higher the pH, the higher the Ω_arag. Notice how the axes' labels were automatically set based on the names of the variables passed as arguments to the plot function.

To change the axes' labels, you may specify them explicitly by setting plot's arguments: xlab and ylab.

Generate the following plot:

[Figure: the same scatterplot, with x axis labelled "pH" and y axis labelled "aragonite saturation state".]
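
As a hint, here is how xlab and ylab are passed to plot. The sketch uses a few made-up points (toy.ph and toy.arag are hypothetical values, not measurements from the study); for the exercise, use xvalues and yvalues instead.

```r
# hypothetical values, for illustration only
toy.ph <- c(7.9, 8.0, 8.1, 8.2, 8.3)
toy.arag <- c(3.0, 3.6, 4.1, 4.9, 5.6)

# xlab and ylab replace the automatic labels (the variable names)
plot(toy.ph, toy.arag, xlab = "pH", ylab = "aragonite saturation state")
```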

Exercise 9.1.1 More parameters for plot

There are many parameters that allow you to customise your plot (see ?par). Here are some of the most commonly used:

Argument   Description
main       an overall title for the plot
type       what type of plot should be drawn: "p" points, "l" lines, "n" no plotting (see ?plot)
sub        a sub title for the plot
xlab       a title for the x axis
ylab       a title for the y axis
asp        the y/x aspect ratio
cex        plotting text and symbols magnification factor relative to the default
cex.axis   magnification to be used for axis annotation relative to the current setting of cex
axes       whether to draw axes (TRUE) or not (FALSE)
xlim       x axis range (should be a vector of two numbers: xmin and xmax, respectively)
ylim       y axis range (should be a vector of two numbers: ymin and ymax, respectively)
pch        either an integer or a single character to be used as the default symbol in plotting points (see ?points)
col        set plotting color of each point (see named colors with colors())

Note: The demo("graphics") command shows examples of available plots in R, together with the R code that can be used to generate them. The colors() command shows the names of the available colors.

Try them out! Start by adding a main title, changing the type of points and their color.

[Figure: scatterplot titled "Aragonite Saturation State vs pH"; x axis: pH; y axis: aragonite saturation state.]

Here is a contrived example using many parameters at once.

# define logical vector based on experiment type: TRUE ("Control"), FALSE ("Experiment")

type.logical <- reef2$Type == "Control"

# check if "violet" is a named color: "violet" %in% colors()

# define colors according to experiment type (either "Control" or "Experiment")

plot.colours <- ifelse(type.logical, "orange", "violet")

# define type of symbol for plotting points (see more options with ?points)

plot.points <- ifelse(type.logical, 22, 1)

# draw contrived plot example

plot(xvalues, yvalues, xlab = "pH", ylab = "aragonite saturation state",
     main = "Arag Sat. State vs pH", sub = "an R for Absolute Beginners Contrived Plot Example",
     xlim = c(7.5, 8.5), ylim = c(1, 7), cex = 1.5, cex.axis = 1.5, pch = plot.points,
     col = plot.colours)

[Figure: scatterplot titled "Arag Sat. State vs pH", with sub title "an R for Absolute Beginners Contrived Plot Example"; x axis: pH (7.5 to 8.5); y axis: aragonite saturation state (1 to 7).]

9.2 Histograms and Boxplots

The function hist() shows the frequency (number of occurrences) of the observations within each interval of values; and the function boxplot() shows the distribution of the observations in each category (here, Control and Experiment).

# basic histogram, with labels, title and bars each with a different color

# (rainbow function), using ~ 20 breaks

hist(reef2$in.situ.aragonite, breaks = 20, xlab = "Aragonite", main = "Aragonite Histogram",
     col = rainbow(20))

[Figure: histogram titled "Aragonite Histogram"; x axis: Aragonite (3 to 6); y axis: Frequency (0 to 60).]

# basic boxplot of the aragonite saturation state per experiment type

boxplot(reef2$in.situ.aragonite ~ reef2$Type, main = "Aragonite in Controls and Experiments",
        border = "gray", lwd = 1, col = c("orange", "green"))

[Figure: boxplots titled "Aragonite in Controls and Experiments", one box for Control and one for Experiment; y axis from 3 to 6.]

9.3 Curves

These are continuous plots (usually of known statistical distributions, like the Gaussian (dnorm), gamma, beta, etc.). Here we will see how to add lines and text to the plot (in specific locations/coordinates), as well as an extra axis on top with a different color.

# multiple normal distribution curves, with different mean and sd, plotted

# in the same plot (add = TRUE)

curve(dnorm, from = -3, to = 5, lwd = 2, col = "red")

curve(dnorm(x, mean = 2), lwd = 2, col = "blue", add = TRUE)

curve(dnorm(x, mean = -1), lwd = 2, col = "green", add = TRUE)

curve(dnorm(x, mean = 0, sd = 1.5), lwd = 2, lty = 2, col = "red", add = TRUE)

# add a vertical line at the mean of the standard 'red' distribution

lines(c(0, 0), c(0, dnorm(0)), lty = 1, col = "red")

# add free text to the plot, in coordinates x=4, y=0.2

text(4, 0.2, "Gaussian distributions")

# add extra axis, on top (side 3), from -3 to 5, with tick-marks from -3 to

# 5, and colored violet

axis(3, -3:5, seq(-3, 5), col.axis = "violet")

[Figure: normal density curves drawn in the same plot; x axis: x; y axis: dnorm(x) (0.0 to 0.4).]

9.4 Multiple Graphs

Different types of graphs can be combined in the same plotting area. Start by trying to plot a histogram of the temperature (in Kelvin) together with the Gaussian distribution that best fits it (with appropriate mean and standard deviation).

# plot the histogram

hist(reef2$T.K.Spec.pH, col = "red", xlab = "Temp (K)", freq = FALSE, main = "Hist with Normal curve")

# calculate the x and y values

temp.x <- seq(min(reef2$T.K.Spec.pH), max(reef2$T.K.Spec.pH), length.out = 100)

temp.y <- dnorm(temp.x, mean = mean(reef2$T.K.Spec.pH), sd = sd(reef2$T.K.Spec.pH))

# plot the normal curve using the function lines

lines(temp.x, temp.y, col = "blue", lwd = 2)

[Figure: histogram titled "Hist with Normal curve" with the normal curve overlaid; x axis: Temp (K) (295 to 300); y axis: Density (0.0 to 0.5).]
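
The argument freq = FALSE is what makes the histogram comparable with dnorm: the bars then show densities, whose total area is 1, just like the area under the normal curve. The sketch below illustrates this with simulated data (toy.temp is a hypothetical sample drawn with rnorm, not the measured temperatures) and overlays the curve with curve(..., add = TRUE) as an alternative to lines.

```r
# simulated, hypothetical temperatures (in Kelvin), for illustration only
set.seed(1)
toy.temp <- rnorm(200, mean = 298, sd = 1)

# density histogram; hist also returns the computed bins invisibly
toy.hist <- hist(toy.temp, freq = FALSE, xlab = "Temp (K)", main = "Simulated example")

# total area of the bars: sum of (bar height x bar width)
bars.area <- sum(toy.hist$density * diff(toy.hist$breaks))
bars.area

# overlay the normal curve with the mean and sd estimated from the sample
curve(dnorm(x, mean = mean(toy.temp), sd = sd(toy.temp)),
      add = TRUE, col = "blue", lwd = 2)
```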

Exercise 9.4.1 Arranging several plots in one page

To create a page with several plots located in side-by-side panels, we can use the function par with one of the following parameters: par(mfrow=c(r,c)) or par(mfcol=c(r,c)). mfrow adds images per line, from left to right, and mfcol adds per column, from top to bottom.

Make sure you understand how the mfrow parameter specifies the order in which the plots fill up the layout.

# make a 2 by 2 array of plot panels

# fill up row by row

par(mfrow = c (2,2))

# create a new plot, type="n" means plot none

# first plot

#plot(c(0.5),type="n", axes = FALSE, ann=FALSE)

plot.new()

text(0.5 , 0.5 , "1" , cex = 5 )

box()

# second plot

plot.new()

#plot(c(0.5),type="n", axes = FALSE, ann=FALSE)

text(0.5 , 0.5 , "2" , cex = 5 )

box()

# third plot

plot.new()

#plot(c(0.5),type="n", axes = FALSE, ann=FALSE)

text(0.5 , 0.5 , "3" , cex = 5 )

box()

# fourth plot

plot.new()

#plot(c(0.5),type="n", axes = FALSE, ann=FALSE)

text(0.5 , 0.5 , "4" , cex = 5 )

box()

Here is the same example but with mfcol.

# make a 2 by 2 array of plot panels

# fill up column by column

par(mfcol = c (2,2))

# create a new plot, type="n" means plot none

# first plot

plot(c(1),type="n" , axes = FALSE , ann=FALSE )

text(1 , 1 , "1" , cex = 5 )

box()

# second plot

plot(c(1),type="n" , axes = FALSE , ann=FALSE )

text(1 , 1 , "2" , cex = 5 )

box()

# third plot

plot(c(1),type="n" , axes = FALSE , ann=FALSE )

text(1 , 1 , "3" , cex = 5 )

box()

# fourth plot

plot(c(1),type="n" , axes = FALSE , ann=FALSE )

text(1 , 1 , "4" , cex = 5 )

box()
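The difference between the two fill orders can also be checked without drawing anything. As a rough sketch (fill.mfrow and fill.mfcol are hypothetical names), the matrices below show which panel position each plot number lands in for a 2 by 2 layout.

```r
# panel numbers as they are filled with par(mfrow = c(2, 2)): row by row
fill.mfrow <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)

# panel numbers as they are filled with par(mfcol = c(2, 2)): column by column
fill.mfcol <- matrix(1:4, nrow = 2, ncol = 2)

print(fill.mfrow)
print(fill.mfcol)
```

In the mfrow layout plot "2" sits at the top right, whereas in the mfcol layout it sits at the bottom left.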

Once the panel plots are finished, we must revert the graphical parameters to their default values, so that we can go back to plotting one chart per page.

# reset the graphical display parameters to 1 row and 1 column

par(mfrow = c (1 , 1))
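An alternative to resetting the layout by hand is to store the previous settings when changing them, since par returns the old values of the parameters it modifies. The sketch below uses random numbers as placeholder data, and old.par is a hypothetical name.

```r
# par() returns (invisibly) the previous values of the parameters it changes
old.par <- par(mfrow = c(1, 2))

# two placeholder plots drawn side by side
hist(rnorm(100), main = "Panel 1")
boxplot(rnorm(100), main = "Panel 2")

# restore exactly the settings that were active before the change
par(old.par)
```

This pattern also works when the previous layout was not the default of 1 row and 1 column.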

Now let's try to plot the real data from our tutorial dataset.

# set the graphical display parameters to 3 rows and 2 columns

par(mfrow = c (3 , 2 )) # mfrow adds plots per row, from left to right

# draw boxplots for experiments and controls, per each group

boxplot(reef2$Salinity ~ reef2$Type, xlab = "Salinity", border = "gray", lwd = 1,
        col = c("violet", "magenta"))

boxplot(reef2$Alkalinity.umol.kg. ~ reef2$Type, xlab = "Alkalinity", border = "gray",
        lwd = 1, col = c("yellow", "yellow2"))

boxplot(reef2$Spec.pH.total. ~ reef2$Type, border = "gray", xlab = "pH", lwd = 1,
        col = c("green", "limegreen"))

boxplot(reef2$Rhodamine.ppb. ~ reef2$Type, border = "gray", xlab = "Rhodamine",
        lwd = 1, col = c("blue", "lightskyblue"))

boxplot(reef2$T.K.Spec.pH ~ reef2$Type, border = "gray", xlab = "Temp (K)",
        lwd = 1, col = c("tan", "tan4"))

boxplot(reef2$in.situ.pCO2.uatm. ~ reef2$Type, border = "gray", xlab = "pCO2",
        lwd = 1, col = c("orange", "orange3"))

# add a title outside of the plotting area

title("Boxplots of Experiments and Controls" , outer = TRUE , line = -2, cex.main = 2)

[Figure: "Boxplots of Experiments and Controls", six boxplot panels comparing Control and Experiment]

Exercise 9.5 Export and save plots

RStudio allows the visualization of the plots before exporting/saving them to an image file, either as a bitmap (such as jpeg or png, which becomes pixelated when zoomed in) or as pdf or svg (vector formats that can be zoomed and stretched indefinitely without losing image quality). However, pdf files don't always export correctly, so they require some "trial and error" until one can get the "perfect" image.

To save plots you can quickly go to the Plots tab in the Workspace window and click on Export. However, this approach is not easily reproducible and can be time-consuming if the plots need to be generated many times. Alternatively, in order to save the plots programmatically, you can wrap the plotting functions between two commands: pdf (or another device function, depending on the desired file format) and dev.off.

# pdf(...) or jpeg(...), png(...), svg(...), etc..

# Start graphics device

pdf("aragonite_ctrl_exp.pdf", width = 7, height = 5)

# draw boxplots for experiments and controls, per each group

boxplot(reef2$in.situ.aragonite ~ reef2$Type, main = "Aragonite in Controls and Experiments",
        border = "gray", lwd = 1, col = c("orange", "blue"))

dev.off() # close graphics device (stop writing to file)

Save one of the previous plots as pdf and open it with your pdf viewer (e.g. Acrobat Reader).
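The same wrapping pattern applies to bitmap devices. As a minimal sketch, the code below writes a png to a temporary file and uses the built-in InsectSprays dataset instead of the tutorial data. The file location, the image dimensions and the name out.file are assumptions chosen for illustration.

```r
# write to a temporary file (assumption: any writable path would do)
out.file <- tempfile(fileext = ".png")

# for png(), width and height are in pixels by default; res is the nominal resolution in ppi
png(out.file, width = 800, height = 600, res = 120)

# any plotting commands issued now are drawn into the file, not on screen
boxplot(count ~ spray, data = InsectSprays, main = "Built-in InsectSprays data")

# close the device so that the file is completely written
dev.off()

# check that the file was created
file.exists(out.file)
```

Note that pdf takes its width and height in inches, whereas png takes pixels unless another unit is requested.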

Exercise 10. Extra study

Exercise 10.1 Write a function of your own

To assess the fraction of alkalinity taken up by the reef, a passive tracer, i.e. the non-reactive dye Rhodamine

WT, was mixed with ambient sea water in the tank. Rhodamine WT concentration was then measured

fluorometrically. Given that this measurement is temperature dependent, it needs to be corrected. The

following formula provides this correction:

$$F_r = F_s \, e^{k (T_s - T_r)}$$

where Fr and Fs are the fluorescences at the reference and sample temperatures, Tr and Ts (in Kelvin), and k = 0.026 per Kelvin (2.6% correction per Kelvin).

Challenge: Write a function named f.r that returns the Fr value given as input Fs, Tr and Ts. Also, include an argument in the function that allows changing the accepted temperature units from Kelvin (default) to Celsius.

Use your function f.r to calculate the temperature-corrected Rhodamine concentrations. Hint: Fs is reef2$Rhodamine.ppb., Tr and Ts are T.C.in.situ and T.K.Spec.pH, respectively. Plot the calculated temperature-corrected Rhodamine concentrations versus reef2$Rhodamine.S.Normalized.ppb..
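One possible shape for such a function is sketched below. It is not the authors' solution. It assumes the correction has the form Fr = Fs * exp(k * (Ts - Tr)), that the units argument applies to both temperatures, and that Celsius values are converted by adding 273.15. The argument names units and k, and the input values, are made up for illustration.

```r
# sketch of f.r: temperature-corrected fluorescence
# Fs:     fluorescence measured at the sample temperature
# Tr, Ts: reference and sample temperatures
# units:  "K" (default) or "C"; assumed to apply to both temperatures
# k:      correction coefficient per Kelvin
f.r <- function(Fs, Tr, Ts, units = c("K", "C"), k = 0.026) {
  units <- match.arg(units)
  if (units == "C") {
    # convert Celsius to Kelvin
    Tr <- Tr + 273.15
    Ts <- Ts + 273.15
  }
  Fs * exp(k * (Ts - Tr))
}

# made-up values: a sample measured 2 K above the reference temperature
f.r(Fs = 10, Tr = 298.15, Ts = 300.15)

# the same temperatures given in Celsius
f.r(Fs = 10, Tr = 25, Ts = 27, units = "C")
```

Because only the difference between the two temperatures enters the formula, both calls give the same result. Judging by their names, the two columns in the hint appear to use different units, so one of them would probably need converting before it is passed to a function like this.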

Exercise 10.2 Scatterplots with Error Bars

Base R does not provide plots with error bars. However, it is easy (albeit laborious) to use the arrows function to mimic the same effect.

arrows(x, avg - sdev, x, avg + sdev, length = 0.05 , angle = 90 , code = 3 )

In the code above, x is a vector of x-positions, and avg-sdev and avg+sdev are vectors of the lower and upper y-positions of the error bars.
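To see the arrows call in context, here is a small self-contained sketch. The three averages and standard deviations are made-up values, not taken from the tutorial dataset.

```r
# made-up measurements at three x positions (assumption: illustrative values only)
x <- c(1, 2, 3)
avg <- c(8.10, 8.20, 8.15)
sdev <- c(0.05, 0.08, 0.03)

# leave room on the y axis for the full length of the error bars
plot(x, avg, pch = 19, ylim = range(avg - sdev, avg + sdev), xlab = "x", ylab = "avg")

# code = 3 draws a head at both ends; angle = 90 flattens each head into a cap
arrows(x, avg - sdev, x, avg + sdev, length = 0.05, angle = 90, code = 3)
```

Setting ylim explicitly matters here: without it the axis range is computed from avg alone and the ends of the bars may be clipped.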

Using many of the elements you have seen today, try to generate this set of two plots. See how these plots compare with those of Figure 2c-d from Albright et al., 2016.

[Figure: two scatterplots with error bars, panels "Control" and "Experiment"; x-axis: Sampling station (m), y-axis: pH]

END


Posted by: pierreczepiel.blogspot.com

Source: https://www.researchgate.net/publication/331209857_R_for_Absolute_Beginners_-_Hands-on_R_Tutorial