Basics of coding in R

Week 1 Session 2

Claire Donnat

(based on the material by Lan Huong Nguyen)

July 1st, 2020

R as a calculator

  • R can be used as a calculator, e.g.
23 + sin(pi/2) 
## [1] 24
abs(-10) + (17-3)^4
## [1] 38426
4 * exp(10) + sqrt(2) 
## [1] 88107.28
  • Intuitive arithmetic operators: addition (+), subtraction (-), multiplication (*), division (/), exponentiation (^), modulus (%%)

  • Built-in constants: pi, LETTERS, letters, month.abb, month.name
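As a small illustration of the operators and constants listed above, the sketch below pairs the modulus with integer division (%/%, which is part of base R but not in the list above) and indexes into the built-in constant vectors.

```r
# Modulus and integer division (%/% is base R, not in the list above)
17 %% 5      # remainder of 17 / 5, i.e. 2
17 %/% 5     # integer quotient of 17 / 5, i.e. 3

# Built-in constants are ordinary vectors and can be indexed
letters[1:3]          # "a" "b" "c"
LETTERS[24:26]        # "X" "Y" "Z"
month.abb[c(1, 12)]   # "Jan" "Dec"
```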

Variables

  • Variables are named objects used to store information.
  • A variable is a name bound to a value held in memory.
  • In contrast to languages like C or Java, variables in R are NOT declared with a data type or class (e.g. vector, list, data frame).
  • When an R object is assigned to a variable, the variable takes on the data type of that object.

Variable assignment

Variable assignment can be done using the following operators: =, <-, ->:

# Assignment using equal operator. 
var.1 = 34759

# Assignment using leftward operator. 
var.2 <- "learn R"

# Assignment using rightward operator. 
TRUE -> var.3 

The values of the variables can be printed with the print() or cat() functions.

print(var.1) 
## [1] 34759
cat("var.2 is ", var.2)
## var.2 is  learn R
cat("var.3 is ", var.3 ,"\n") 
## var.3 is  TRUE

Naming variables

Variable names must start with a letter (or with a dot that is not followed by a number), and can only contain:

  • letters
  • numbers
  • the character _
  • the character .
a <- 0
first.variable <- 1
SecondVariable <- 2
variable_2 <- 1 + first.variable
very_long_name.3 <- 4

Some words are reserved in R and cannot be used as object names, for example Inf, NULL, NA and NaN (as well as TRUE and FALSE):

  • Inf and -Inf stand for positive and negative infinity. R returns Inf when a value is too big to represent, e.g. 2^1024.
  • NULL denotes a null object. It is often used as the default value of an optional function argument.
  • NA represents a missing value (“Not Available”).
  • NaN means “Not a Number”. R returns this when a computation is undefined, e.g. 0/0.
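The special values above can be told apart with base R's is.*() test functions. The following is a minimal sketch; the variable names are arbitrary and chosen only for illustration.

```r
too.big       <- 2^1024   # too large for a double, so R returns Inf
not.a.number  <- 0 / 0    # undefined computation, so R returns NaN
not.available <- NA       # a missing value
nothing       <- NULL     # the null object

is.infinite(too.big)       # TRUE
is.nan(not.a.number)       # TRUE
is.na(not.available)       # TRUE
is.na(not.a.number)        # TRUE: NaN is also treated as missing
is.null(nothing)           # TRUE
length(nothing)            # 0: NULL has no elements
```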

Data types

Values in R are limited to only 6 atomic classes:

  • Logical: TRUE/FALSE (T and F also work as shorthand, but they are ordinary variables that can be overwritten)
  • Numeric: 12.4, 30, 2, 1009, 3.141593
  • Integer: 2L, 34L, -21L, 0L
  • Complex: 3 + 2i, -10 - 4i
  • Character: 'a', '23.5', "good", "Hello world!", "TRUE"
  • Raw (holding raw bytes): as.raw(2), charToRaw("Hello")

Objects can have different structures based on atomic class and dimensions:

Dimensions   Homogeneous   Heterogeneous
1d           vector        list
2d           matrix        data.frame
nd           array

R also supports more complicated objects built upon these.
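To make the structures above concrete, here is a small sketch that builds one object of each kind and inspects it. The object names and values are made up for illustration. The last lines show what "homogeneous" implies: mixing types in a vector makes R coerce everything to a common type.

```r
vec <- c(1.5, 2, 3)                               # 1d, homogeneous
lst <- list(1.5, "two", TRUE)                     # 1d, heterogeneous
mat <- matrix(1:6, nrow = 2)                      # 2d, homogeneous
df  <- data.frame(id = 1:2, name = c("a", "b"))   # 2d, heterogeneous
arr <- array(1:24, dim = c(2, 3, 4))              # nd, homogeneous

class(vec)    # "numeric"
class(lst)    # "list"
class(df)     # "data.frame"
dim(mat)      # 2 3
dim(arr)      # 2 3 4

# Mixing types in a homogeneous structure triggers coercion
mixed <- c(1, "a", TRUE)
class(mixed)  # "character"
```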

Variable class

R is a dynamically typed language, which means that the data type of a variable can change each time a new value is assigned to it.

x <- "Hello" 
cat("The class of x is", class(x),"\n")
## The class of x is character
x <- 34.5 
cat("  Now the class of x is ", class(x),"\n")
##   Now the class of x is  numeric
x <- 27L 
cat("   Next the class of x becomes ", class(x),"\n") 
##    Next the class of x becomes  integer

You can list the variables currently available in the workspace by calling ls():

print(ls()) 
## [1] "a"                "first.variable"   "SecondVariable"   "var.1"            "var.2"            "var.3"            "variable_2"       "very_long_name.3" "x"

Vectors

Vector indexing

  • Elements of a vector can be accessed using indexing, with square brackets, [].

  • Unlike in many languages, in R indexing starts with 1.

  • Negative integer indices drop the corresponding elements of the vector.

  • Logical indexing (TRUE/FALSE) is allowed.

days <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat") 
(today <- days[5])
## [1] "Thurs"
# Accessing vector elements using position. 
(weekend.days <- days[c(1, 7)])
## [1] "Sun" "Sat"
# Accessing vector elements using negative indexing. 
(week.days <- days[c(-1,-7)])
## [1] "Mon"   "Tue"   "Wed"   "Thurs" "Fri"
# Accessing vector elements using logical indexing. 
(birthday <- days[c(F, F, F, F, T, F, F)])
## [1] "Thurs"
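A few further cases follow from the indexing rules above. This is a short sketch that reuses the same days vector. which() and nchar() are base R functions that do not appear elsewhere in these slides.

```r
days <- c("Sun", "Mon", "Tue", "Wed", "Thurs", "Fri", "Sat")

days[2:4]               # a range of positions: "Mon" "Tue" "Wed"
days[length(days)]      # the last element: "Sat"
which(days == "Wed")    # position where the condition holds: 4
days[nchar(days) > 3]   # logical index built from a condition: "Thurs"
days[10]                # out-of-range index gives NA, not an error
```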

Logical operations


# Comparisons (==,!=,>,>=,<,<=)
1 == 2
## [1] FALSE
# Check whether number is even
# (%% is the modulus)
(5 %% 2) == 0
## [1] FALSE
# Logical indexing
x <- seq(1,10)
x[(x%%2) == 0]
## [1]  2  4  6  8 10
# Element-wise comparison
c(1,2,3) > c(3,2,1)
## [1] FALSE FALSE  TRUE
# Check whether numbers are even,
# one by one
(seq(1,4) %% 2) == 0
## [1] FALSE  TRUE FALSE  TRUE
# Logical indexing
x <- seq(1,10)
x[x>=5]
## [1]  5  6  7  8  9 10

Vector arithmetic

Two vectors of the same length can be added, subtracted, multiplied or divided element by element. Vectors can be concatenated with the combine function c().

# Create two vectors. 
v1 <- c(1,4,7,3,8,15) 
v2 <- c(12,9,4,11,0,8)

# Vector addition. 
(vec.sum <- v1+v2)
## [1] 13 13 11 14  8 23
# Vector subtraction. 
(vec.difference <- v1-v2)
## [1] -11  -5   3  -8   8   7
# Vector multiplication. 
(vec.product <- v1*v2)
## [1]  12  36  28  33   0 120
# Vector division. 
(vec.ratio <- v1/v2)
## [1] 0.08333333 0.44444444 1.75000000 0.27272727        Inf 1.87500000
# Vector concatenation
vec.concat <- c(v1, v2)
# Size of vector
length(vec.concat)
## [1] 12

Recycling

  • Recycling is an automatic lengthening of vectors in certain settings.
# Element-wise multiplication
v1 <- c(1,2,3,4,5,6,7,8,9,10)
v1 * 2
##  [1]  2  4  6  8 10 12 14 16 18 20
  • When two vectors have different lengths, R repeats the shorter vector until the length of the longer vector is reached.
# Element-wise multiplication
v1 * c(1,2)
##  [1]  1  4  3  8  5 12  7 16  9 20
v1 + c(3, 7, 10)
##  [1]  4  9 13  7 12 16 10 15 19 13

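Note: the last call also issues a warning, because the longer length (10) is not a multiple of the shorter length (3). A warning is not an error. It only informs you that your code continued to run, but perhaps it did not work as you intended.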
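The sketch below contrasts the two recycling cases. tryCatch() is used here only to capture the warning text for inspection. It is one possible way to look at a warning, and unlike a plain call it stops at the warning instead of returning the sum.

```r
v1 <- 1:10

# Length 10 is a multiple of 2: the shorter vector is recycled silently
length(v1 * c(1, 2))    # 10

# Length 10 is not a multiple of 3: a plain call returns a result AND warns.
# Here tryCatch() intercepts the warning and returns its message instead.
msg <- tryCatch(v1 + c(3, 7, 10),
                warning = function(w) conditionMessage(w))
msg
```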

Matrices

Matrices in R are objects with homogeneous elements (of the same type), arranged in a 2D rectangular layout. A matrix can be created with the matrix() function:

matrix(data, nrow, ncol, byrow, dimnames)

where:

  • data is the input vector with elements of the matrix.
  • nrow is the number of rows to be created.
  • ncol is the number of columns to be created.
  • byrow is a logical value. If FALSE (the default) the matrix is filled by columns, otherwise the matrix is filled by rows.
  • dimnames is NULL or a list of length 2 giving the row and column names respectively.
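As a quick sketch of how these arguments combine, the call below specifies ncol instead of nrow and passes dimnames directly. The row and column labels are made up for illustration.

```r
# 6 values, 3 columns, so R infers 2 rows; filled row by row
m <- matrix(1:6, ncol = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
m
dim(m)         # 2 3
m["r2", "b"]   # 5: names can be used in place of positions
```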

# Elements are arranged sequentially by column. 
(N <- matrix(seq(1,20), nrow = 4, byrow = FALSE))
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20
# Elements are arranged sequentially by row. 
(M <- matrix(seq(1,20), nrow = 5, byrow = TRUE))
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
## [4,]   13   14   15   16
## [5,]   17   18   19   20

Accessing Elements of a Matrix

# Define the row and column names (rownames and colnames are base R functions, so other names are used here). 
row_names <- c("row1", "row2", "row3") 
col_names <- c("col1", "col2", "col3", "col4", "col5") 
(P <- matrix(5:19, nrow = 3, byrow = TRUE, 
             dimnames = list(row_names, col_names))) 
##      col1 col2 col3 col4 col5
## row1    5    6    7    8    9
## row2   10   11   12   13   14
## row3   15   16   17   18   19
P[2, 5] # the element in 2nd row and 5th column. 
## [1] 14
P[2, ] # the 2nd row. 
## col1 col2 col3 col4 col5 
##   10   11   12   13   14
P[, 3] # the 3rd column. 
## row1 row2 row3 
##    7   12   17
P[c(3,2), ] # the 3rd and 2nd row. 
##      col1 col2 col3 col4 col5
## row3   15   16   17   18   19
## row2   10   11   12   13   14
P[, c(3, 1)] # the 3rd and 1st column. 
##      col3 col1
## row1    7    5
## row2   12   10
## row3   17   15
P[1:2, 3:5] # Subset 1:2 row 3:5 column 
##      col3 col4 col5
## row1    7    8    9
## row2   12   13   14

Matrix Computations

Matrix addition and subtraction need matrices of the same dimensions:

# Create two 2x3 matrices. 
(A <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)) 
##      [,1] [,2] [,3]
## [1,]    3   -1    2
## [2,]    9    4    6
(B <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2))
##      [,1] [,2] [,3]
## [1,]    5    0    3
## [2,]    2    9    4
A + B  # Element-wise sum; (A - B) difference
##      [,1] [,2] [,3]
## [1,]    8   -1    5
## [2,]   11   13   10
A * B  # Element-wise multiplication
##      [,1] [,2] [,3]
## [1,]   15    0    6
## [2,]   18   36   24
A / B  # Element-wise division
##      [,1]      [,2]      [,3]
## [1,]  0.6      -Inf 0.6666667
## [2,]  4.5 0.4444444 1.5000000
t(A)   # Matrix transpose
##      [,1] [,2]
## [1,]    3    9
## [2,]   -1    4
## [3,]    2    6

Matrix Algebra

True matrix multiplication \(AB\), with \(A \in \mathbb{R}^{m \times p}\) and \(B \in \mathbb{R}^{p \times n}\):

\[ (AB)_{ij} = \sum_{k = 1}^p A_{ik}B_{kj} \]

# A is (2 x 3) and t(B) is (3 x 2)
A %*% t(B)     # (2 x 2)-matrix
##      [,1] [,2]
## [1,]   21    5
## [2,]   63   78
# t(A) is (3 x 2) and B is (2 x 3)
t(A) %*% B    # (3 x 3)-matrix
##      [,1] [,2] [,3]
## [1,]   33   81   45
## [2,]    3   36   13
## [3,]   22   54   30
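To connect the %*% operator to the sum in the formula above, the following sketch recomputes the product entry by entry with explicit loops and compares it to the built-in result. The names X, Y and manual are introduced only for this illustration.

```r
A <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)   # 2 x 3
B <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)    # 2 x 3
X <- A                                        # 2 x 3
Y <- t(B)                                     # 3 x 2

# Entry (i, j) is the sum over k of X[i, k] * Y[k, j]
manual <- matrix(0, nrow = nrow(X), ncol = ncol(Y))
for (i in seq_len(nrow(X))) {
  for (j in seq_len(ncol(Y))) {
    manual[i, j] <- sum(X[i, ] * Y[, j])
  }
}
manual
all.equal(manual, X %*% Y)   # TRUE
```

In practice %*% should be preferred; the loop version is only meant to make the definition explicit.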

Arrays

  • In R, arrays are data objects that can have more than two dimensions, e.g. a (4x3x2)-array holds 2 tables, each with 4 rows and 3 columns.
  • Arrays can store only one data type and are created using array().
  • Accessing and subsetting elements of an array is similar to accessing elements of a matrix.
row.names <- c("ROW1","ROW2","ROW3", "ROW4") 
column.names <- c("COL1","COL2","COL3") 
matrix.names <- c("Matrix1","Matrix2")

(arr <- array(
  seq(1, 24), dim = c(4,3,2), 
  dimnames = list(row.names, column.names, matrix.names))) 
## , , Matrix1
## 
##      COL1 COL2 COL3
## ROW1    1    5    9
## ROW2    2    6   10
## ROW3    3    7   11
## ROW4    4    8   12
## 
## , , Matrix2
## 
##      COL1 COL2 COL3
## ROW1   13   17   21
## ROW2   14   18   22
## ROW3   15   19   23
## ROW4   16   20   24
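Subsetting an array works like subsetting a matrix, with one index per dimension. Here is a minimal sketch on an unnamed array with the same values and dimensions as above:

```r
arr <- array(seq(1, 24), dim = c(4, 3, 2))

arr[2, 3, 1]      # single element: row 2, column 3, table 1, i.e. 10
arr[, , 2]        # the whole second table, a 4 x 3 matrix
arr[1, , ]        # first row of every table, returned as a 3 x 2 matrix
dim(arr[1, , ])   # 3 2
```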

Lists

Lists can contain elements of different types, e.g. numbers, strings, vectors and/or other lists. A list is created using the list() function.

# Unnamed list
v <- c("Jan","Feb","Mar")
M <- matrix(c(1,2,3,4),nrow=2)
lst <- list("green", 12.3)
(u.list <- list(v, M, lst))
## [[1]]
## [1] "Jan" "Feb" "Mar"
## 
## [[2]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## [[3]]
## [[3]][[1]]
## [1] "green"
## 
## [[3]][[2]]
## [1] 12.3
# Access 2nd element
u.list[[2]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
# Named list 
(n.list <- list(
  first = "Jane", last = "Doe", 
  gender = "Female", yearOfBirth = 1990)) 
## $first
## [1] "Jane"
## 
## $last
## [1] "Doe"
## 
## $gender
## [1] "Female"
## 
## $yearOfBirth
## [1] 1990
# Access 3rd element
n.list[[3]]
## [1] "Female"
# Access "yearOfBirth" element
n.list$yearOfBirth
## [1] 1990
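One point the examples above rely on is the difference between single and double brackets. The sketch below uses a shortened version of the named list; the added initials element is hypothetical and serves only to show modification.

```r
n.list <- list(first = "Jane", last = "Doe", yearOfBirth = 1990)

sub  <- n.list["first"]      # single brackets: a list of length 1
elem <- n.list[["first"]]    # double brackets: the element itself
class(sub)                   # "list"
class(elem)                  # "character"

n.list$initials <- "JD"      # add a new element (hypothetical field)
n.list$last <- NULL          # assigning NULL removes an element
names(n.list)                # "first" "yearOfBirth" "initials"
```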

Data-frames

A data frame is a table, or 2D array-like structure, in which:

  • Columns can store data of different types, e.g. numeric, character etc.
  • Each column must contain the same number of data items.
  • The column names should be non-empty.
  • The row names should be unique.
# Create the data frame. 
employees <- data.frame(
  row.names = c("E1", "E2", "E3","E4", "E5"),
  name = c("Rick","Dan","Michelle","Ryan","Gary"), 
  salary = c(623.3,515.2,611.0,729.0,843.25), 
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE ) 
# Print the data frame. 
employees
##        name salary start_date
## E1     Rick 623.30 2012-01-01
## E2      Dan 515.20 2013-09-23
## E3 Michelle 611.00 2014-11-15
## E4     Ryan 729.00 2014-05-11
## E5     Gary 843.25 2015-03-27

Useful functions for data-frames

# Get the structure of the data frame. 
str(employees) 
## 'data.frame':    5 obs. of  3 variables:
##  $ name      : chr  "Rick" "Dan" "Michelle" "Ryan" ...
##  $ salary    : num  623 515 611 729 843
##  $ start_date: Date, format: "2012-01-01" "2013-09-23" "2014-11-15" "2014-05-11" ...
# Print first few rows of the data frame. 
head(employees, 2) 
##    name salary start_date
## E1 Rick  623.3 2012-01-01
## E2  Dan  515.2 2013-09-23
# Print statistical summary of the data frame.
summary(employees)
##      name               salary        start_date        
##  Length:5           Min.   :515.2   Min.   :2012-01-01  
##  Class :character   1st Qu.:611.0   1st Qu.:2013-09-23  
##  Mode  :character   Median :623.3   Median :2014-05-11  
##                     Mean   :664.4   Mean   :2014-01-14  
##                     3rd Qu.:729.0   3rd Qu.:2014-11-15  
##                     Max.   :843.2   Max.   :2015-03-27

Subsetting data-frames

  • We can extract specific columns:
# using column names. 
employees$name
employees[, c("name", "salary")]

# # or using integer indexing
# employees[, 1]
# employees[, c(1, 2)]
## [1] "Rick"     "Dan"      "Michelle" "Ryan"     "Gary"
##        name salary
## E1     Rick 623.30
## E2      Dan 515.20
## E3 Michelle 611.00
## E4     Ryan 729.00
## E5     Gary 843.25
  • We can extract specific rows:
# using row names. 
employees["E1",]
employees[c("E2", "E3"), ]

# # or using integer indexing
# employees[1, ]
# employees[c(2, 3), ]
##    name salary start_date
## E1 Rick  623.3 2012-01-01
##        name salary start_date
## E2      Dan  515.2 2013-09-23
## E3 Michelle  611.0 2014-11-15
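Rows can also be selected with a logical condition, in the same way as logical indexing of vectors. The sketch below rebuilds a two-column version of the employees table so that it runs on its own; the thresholds are arbitrary.

```r
employees <- data.frame(
  row.names = c("E1", "E2", "E3", "E4", "E5"),
  name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  stringsAsFactors = FALSE)

# Keep the rows where the condition is TRUE, and all columns
high.paid <- employees[employees$salary > 700, ]
high.paid

# Filter rows and pick a column at the same time
employees[employees$salary < 620, "name"]   # "Dan" "Michelle"
```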

Adding data to data-frames

  • Add a new column using the assignment operator:
# Add the "dept" column. 
employees$dept <- 
  c("IT","Operations","IT","HR","Finance") 
employees
##        name salary start_date       dept
## E1     Rick 623.30 2012-01-01         IT
## E2      Dan 515.20 2013-09-23 Operations
## E3 Michelle 611.00 2014-11-15         IT
## E4     Ryan 729.00 2014-05-11         HR
## E5     Gary 843.25 2015-03-27    Finance
  • Add new rows using the rbind() function:
# Create the second data frame 
new.employees <- data.frame(
  row.names = paste0("E", 6:8), 
  name = c("Rasmi","Pranab","Tusar"), 
  salary = c(578.0,722.5,632.8), 
  start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")), 
  dept = c("IT","Operations","Finance"), 
  stringsAsFactors = FALSE )

# Concatenate two data frames. 
(all.employees <- rbind(employees, new.employees)) 
##        name salary start_date       dept
## E1     Rick 623.30 2012-01-01         IT
## E2      Dan 515.20 2013-09-23 Operations
## E3 Michelle 611.00 2014-11-15         IT
## E4     Ryan 729.00 2014-05-11         HR
## E5     Gary 843.25 2015-03-27    Finance
## E6    Rasmi 578.00 2013-05-21         IT
## E7   Pranab 722.50 2013-07-30 Operations
## E8    Tusar 632.80 2014-06-17    Finance

Factors

Factors are used to categorize the data and store it as levels. They are useful for variables which take on a limited number of unique values.

days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
is.factor(days) 
## [1] FALSE
class(days) # Indeed these are strings of characters
## [1] "character"

If the levels are not specified, R orders the levels of a character vector alphabetically.

( days <- factor(days) ) # Convert to factors
## [1] Mon Tue Wed Thu Fri Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed
is.factor(days)
## [1] TRUE

Factor ordering

days.sample <- sample(days, 5)
days.sample
## [1] Thu Sun Tue Mon Fri
## Levels: Fri Mon Sat Sun Thu Tue Wed
# Create factor with given levels
(days.sample <- factor(days.sample, levels = days)) 
## [1] Thu Sun Tue Mon Fri
## Levels: Mon Tue Wed Thu Fri Sat Sun
# Create factor with ordered levels
(days.sample <- factor(days.sample, levels = days, ordered = TRUE)) 
## [1] Thu Sun Tue Mon Fri
## Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun

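Note that factor labels are not the same as levels: the levels are the values expected in the input, while the labels are the names displayed for those levels.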

day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
(days <- factor(days, levels = days, labels = day_names))
## [1] Monday    Tuesday   Wednesday Thursday  Friday    Saturday  Sunday   
## Levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
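The distinction between levels, labels and the underlying integer codes can be seen in a small sketch. The sizes vector is made-up data used only for illustration.

```r
sizes <- c("small", "large", "medium", "small")   # hypothetical data

f <- factor(sizes,
            levels = c("small", "medium", "large"),   # values found in the input
            labels = c("S", "M", "L"),                # names shown for each level
            ordered = TRUE)
f
levels(f)       # "S" "M" "L"
as.integer(f)   # 1 3 2 1: position of each value among the levels
f < "L"         # TRUE FALSE TRUE TRUE: comparisons work for ordered factors
table(f)        # counts per level, in level order
```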

Dates

R makes it easy to work with dates.

# Define a sequence of dates
x <- seq(from=as.Date("2018-01-01"),to=as.Date("2018-05-31"), by=1)
table(months(x)) 
## 
##    April February  January    March      May 
##       30       28       31       31       31
Sys.Date()     # What day  is it? 
## [1] "2020-06-22"
Sys.time()     # What time is it? 
## [1] "2020-06-22 10:31:15 PDT"
# Number of days from today to a given date (negative if the date is in the past). 
as.Date('2019-01-01') - Sys.Date() 
## Time difference of -538 days

Type ?strptime for a list of possible date formats.
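As a brief sketch of how such format strings are used, the example below parses a date written as day/month/year and prints it back in other layouts. Only numeric format codes are used, since month and weekday names depend on the locale.

```r
# Parse a date that is not in the default YYYY-MM-DD layout
d <- as.Date("01/07/2020", format = "%d/%m/%Y")
d                       # "2020-07-01"

# Convert a date back to text in a chosen layout
format(d, "%Y")         # "2020"
format(d, "%d.%m.%y")   # "01.07.20"

# Date differences can be converted to plain numbers
as.numeric(as.Date("2020-12-31") - d)   # 183
```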

Random numbers

Random sampling

You can generate a random sample from the elements of a vector using the function sample.

v <- seq(1, 10)
sample(v, 5)                             # Sampling without replacement
## [1]  3  6  7 10  2
month.name
##  [1] "January"   "February"  "March"     "April"     "May"       "June"      "July"      "August"    "September" "October"   "November"  "December"
sample(month.name, 10, replace = TRUE)   # Sampling with replacement
##  [1] "June"      "June"      "June"      "February"  "March"     "February"  "September" "September" "October"   "April"

Tables – the contents of a discrete vector can be easily summarized in a table.

x <- sample(v, 1000, replace=TRUE)          # Random sample
table(x)
## x
##   1   2   3   4   5   6   7   8   9  10 
##  87  94 102 100 100  90 101  99 117 110
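Because sample() is random, the outputs above change from run to run. A common way to make results reproducible is to fix the random seed with set.seed(). The sketch below uses an arbitrary seed value and also converts the counts to proportions with prop.table().

```r
set.seed(2020)                 # arbitrary seed value
first <- sample(1:10, 5)
set.seed(2020)                 # resetting the seed replays the same draws
second <- sample(1:10, 5)
identical(first, second)       # TRUE

# Proportions instead of counts
x <- sample(1:10, 1000, replace = TRUE)
prop.table(table(x))
```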