R Notes for Professionals

🔙 Quay lại trang tải sách pdf ebook R Notes for Professionals Ebooks Nhóm Zalo R R Notes for Professionals Notes for Professionals 400+ pages of professional hints and tricks GoalKicker.com Free Programming Books Disclaimer This is an unocial free book created for educational purposes and is not aliated with ocial R group(s) or company(s). All trademarks and registered trademarks are the property of their respective owners Contents About ................................................................................................................................................................................... 1 Chapter 1: Getting started with R Language .................................................................................................. 2 Section 1.1: Installing R ................................................................................................................................................... 2 Section 1.2: Hello World! ................................................................................................................................................ 3 Section 1.3: Getting Help ............................................................................................................................................... 3 Section 1.4: Interactive mode and R scripts ................................................................................................................ 3 Chapter 2: Variables .................................................................................................................................................... 7 Section 2.1: Variables, data structures and basic Operations .................................................................................. 7 Chapter 3: Arithmetic Operators ........................................................................................................................ 
10 Section 3.1: Range and addition ................................................................................................................................. 10 Section 3.2: Addition and subtraction ....................................................................................................................... 10 Chapter 4: Matrices ................................................................................................................................................... 13 Section 4.1: Creating matrices .................................................................................................................................... 13 Chapter 5: Formula .................................................................................................................................................... 15 Section 5.1: The basics of formula ............................................................................................................................. 15 Chapter 6: Reading and writing strings .......................................................................................................... 17 Section 6.1: Printing and displaying strings ............................................................................................................... 17 Section 6.2: Capture output of operating system command ................................................................................. 18 Section 6.3: Reading from or writing to a file connection ....................................................................................... 19 Chapter 7: String manipulation with stringi package .............................................................................. 21 Section 7.1: Count pattern inside string ..................................................................................................................... 
21 Section 7.2: Duplicating strings .................................................................................................................................. 21 Section 7.3: Paste vectors ........................................................................................................................................... 22 Section 7.4: Splitting text by some fixed pattern ...................................................................................................... 22 Chapter 8: Classes ...................................................................................................................................................... 23 Section 8.1: Inspect classes ......................................................................................................................................... 23 Section 8.2: Vectors and lists ..................................................................................................................................... 23 Section 8.3: Vectors ..................................................................................................................................................... 24 Chapter 9: Lists ............................................................................................................................................................ 25 Section 9.1: Introduction to lists .................................................................................................................................. 25 Section 9.2: Quick Introduction to Lists ..................................................................................................................... 25 Section 9.3: Serialization: using lists to pass information ........................................................................................ 
27 Chapter 10: Hashmaps ............................................................................................................................................. 29 Section 10.1: Environments as hash maps ................................................................................................................ 29 Section 10.2: package:hash ........................................................................................................................................ 32 Section 10.3: package:listenv ...................................................................................................................................... 33 Chapter 11: Creating vectors ................................................................................................................................. 35 Section 11.1: Vectors from build in constants: Sequences of letters & month names ........................................... 35 Section 11.2: Creating named vectors ........................................................................................................................ 35 Section 11.3: Sequence of numbers ............................................................................................................................ 37 Section 11.4: seq() ......................................................................................................................................................... 37 Section 11.5: Vectors .................................................................................................................................................... 38 Section 11.6: Expanding a vector with the rep() function ......................................................................................... 39 Chapter 12: Date and Time .................................................................................................................................... 
41 Section 12.1: Current Date and Time .......................................................................................................................... 41 Section 12.2: Go to the End of the Month .................................................................................................................. 41 Section 12.3: Go to First Day of the Month ................................................................................................................ 42 Section 12.4: Move a date a number of months consistently by months ............................................................. 42 Chapter 13: The Date class ..................................................................................................................................... 44 Section 13.1: Formatting Dates ................................................................................................................................... 44 Section 13.2: Parsing Strings into Date Objects ........................................................................................................ 44 Section 13.3: Dates ....................................................................................................................................................... 45 Chapter 14: Date-time classes (POSIXct and POSIXlt) ............................................................................ 47 Section 14.1: Formatting and printing date-time objects ......................................................................................... 47 Section 14.2: Date-time arithmetic ............................................................................................................................. 47 Section 14.3: Parsing strings into date-time objects ................................................................................................ 
48 Chapter 15: The character class .......................................................................................................................... 50 Section 15.1: Coercion .................................................................................................................................................. 50 Chapter 16: Numeric classes and storage modes ...................................................................................... 51 Section 16.1: Numeric ................................................................................................................................................... 51 Chapter 17: The logical class ................................................................................................................................. 53 Section 17.1: Logical operators ................................................................................................................................... 53 Section 17.2: Coercion ................................................................................................................................................. 53 Section 17.3: Interpretation of NAs ............................................................................................................................. 53 Chapter 18: Data frames ......................................................................................................................................... 55 Section 18.1: Create an empty data.frame ................................................................................................................ 55 Section 18.2: Subsetting rows and columns from a data frame ............................................................................ 56 Section 18.3: Convenience functions to manipulate data.frames .......................................................................... 
59 Section 18.4: Introduction ............................................................................................................................................ 60 Section 18.5: Convert all columns of a data.frame to character class .................................................................. 61 Chapter 19: Split function ....................................................................................................................................... 63 Section 19.1: Using split in the split-apply-combine paradigm ............................................................................... 63 Section 19.2: Basic usage of split ............................................................................................................................... 64 Chapter 20: Reading and writing tabular data in plain-text files (CSV, TSV, etc.) ................... 67 Section 20.1: Importing .csv files ................................................................................................................................ 67 Section 20.2: Importing with data.table .................................................................................................................... 68 Section 20.3: Exporting .csv files ................................................................................................................................ 69 Section 20.4: Import multiple csv files ....................................................................................................................... 69 Section 20.5: Importing fixed-width files ................................................................................................................... 69 Chapter 21: Pipe operators (%>% and others) ............................................................................................. 
71 Section 21.1: Basic use and chaining .......................................................................................................................... 71 Section 21.2: Functional sequences ........................................................................................................................... 72 Section 21.3: Assignment with %<>% .......................................................................................................................... 73 Section 21.4: Exposing contents with %$% ................................................................................................................ 73 Section 21.5: Creating side eects with %T>% .......................................................................................................... 74 Section 21.6: Using the pipe with dplyr and ggplot2 ................................................................................................ 75 Chapter 22: Linear Models (Regression) ......................................................................................................... 76 Section 22.1: Linear regression on the mtcars dataset ........................................................................................... 76 Section 22.2: Using the 'predict' function .................................................................................................................. 78 Section 22.3: Weighting .............................................................................................................................................. 79 Section 22.4: Checking for nonlinearity with polynomial regression ..................................................................... 81 Section 22.5: Plotting The Regression (base) ........................................................................................................... 
83 Section 22.6: Quality assessment .............................................................................................................................. 85 Chapter 23: data.table ............................................................................................................................................. 87 Section 23.1: Creating a data.table ............................................................................................................................ 87 Section 23.2: Special symbols in data.table ............................................................................................................. 88 Section 23.3: Adding and modifying columns .......................................................................................................... 89 Section 23.4: Writing code compatible with both data.frame and data.table ...................................................... 91 Section 23.5: Setting keys in data.table .................................................................................................................... 93 Chapter 24: Pivot and unpivot with data.table .......................................................................................... 95 Section 24.1: Pivot and unpivot tabular data with data.table - I ............................................................................. 95 Section 24.2: Pivot and unpivot tabular data with data.table - II ........................................................................... 96 Chapter 25: Bar Chart .............................................................................................................................................. 98 Section 25.1: barplot() function .................................................................................................................................. 
98 Chapter 26: Base Plotting .................................................................................................................................... 104 Section 26.1: Density plot .......................................................................................................................................... 104 Section 26.2: Combining Plots .................................................................................................................................. 105 Section 26.3: Getting Started with R_Plots ............................................................................................................. 107 Section 26.4: Basic Plot ............................................................................................................................................. 108 Section 26.5: Histograms .......................................................................................................................................... 111 Section 26.6: Matplot ................................................................................................................................................ 113 Section 26.7: Empirical Cumulative Distribution Function ..................................................................................... 119 Chapter 27: boxplot ................................................................................................................................................. 121 Section 27.1: Create a box-and-whisker plot with boxplot() {graphics} .............................................................. 121 Section 27.2: Additional boxplot style parameters ................................................................................................ 125 Chapter 28: ggplot2 ................................................................................................................................................ 
128 Section 28.1: Displaying multiple plots .................................................................................................................... 128 Section 28.2: Prepare your data for plotting ......................................................................................................... 131 Section 28.3: Add horizontal and vertical lines to plot .......................................................................................... 133 Section 28.4: Scatter Plots ........................................................................................................................................ 136 Section 28.5: Produce basic plots with qplot .......................................................................................................... 136 Section 28.6: Vertical and Horizontal Bar Chart .................................................................................................... 138 Section 28.7: Violin plot ............................................................................................................................................. 140 Chapter 29: Factors ................................................................................................................................................. 143 Section 29.1: Consolidating Factor Levels with a List ............................................................................................ 143 Section 29.2: Basic creation of factors ................................................................................................................... 144 Section 29.3: Changing and reordering factors ..................................................................................................... 145 Section 29.4: Rebuilding factors from zero ............................................................................................................ 
150 Chapter 30: Pattern Matching and Replacement .................................................................................... 152 Section 30.1: Finding Matches .................................................................................................................................. 152 Section 30.2: Single and Global match ................................................................................................................... 153 Section 30.3: Making substitutions .......................................................................................................................... 154 Section 30.4: Find matches in big data sets ........................................................................................................... 154 Chapter 31: Run-length encoding ..................................................................................................................... 156 Section 31.1: Run-length Encoding with `rle` ............................................................................................................ 156 Section 31.2: Identifying and grouping by runs in base R ..................................................................................... 156 Section 31.3: Run-length encoding to compress and decompress vectors ........................................................ 157 Section 31.4: Identifying and grouping by runs in data.table ............................................................................... 158 Chapter 32: Speeding up tough-to-vectorize code ................................................................................. 159 Section 32.1: Speeding tough-to-vectorize for loops with Rcpp ........................................................................... 159 Section 32.2: Speeding tough-to-vectorize for loops by byte compiling ............................................................ 
159 Chapter 33: Introduction to Geographical Maps ...................................................................................... 161 Section 33.1: Basic map-making with map() from the package maps ............................................................... 161 Section 33.2: 50 State Maps and Advanced Choropleths with Google Viz ......................................................... 164 Section 33.3: Interactive plotly maps ...................................................................................................................... 165 Section 33.4: Making Dynamic HTML Maps with Leaflet ...................................................................................... 167 Section 33.5: Dynamic Leaflet maps in Shiny applications .................................................................................. 168 Chapter 34: Set operations ................................................................................................................................. 171 Section 34.1: Set operators for pairs of vectors ..................................................................................................... 171 Section 34.2: Cartesian or "cross" products of vectors ......................................................................................... 171 Section 34.3: Set membership for vectors .............................................................................................................. 172 Section 34.4: Make unique / drop duplicates / select distinct elements from a vector .................................... 172 Section 34.5: Measuring set overlaps / Venn diagrams for vectors ................................................................... 173 Chapter 35: tidyverse ............................................................................................................................................. 
174 Section 35.1: tidyverse: an overview ........................................................................................................................ 174 Section 35.2: Creating tbl_df’s ................................................................................................................................. 175 Chapter 36: Rcpp ...................................................................................................................................................... 176 Section 36.1: Extending Rcpp with Plugins .............................................................................................................. 176 Section 36.2: Inline Code Compile ............................................................................................................................ 176 Section 36.3: Rcpp Attributes ................................................................................................................................... 177 Section 36.4: Specifying Additional Build Dependencies ...................................................................................... 178 Chapter 37: Random Numbers Generator .................................................................................................. 179 Section 37.1: Random permutations ........................................................................................................................ 179 Section 37.2: Generating random numbers using various density functions ..................................................... 179 Section 37.3: Random number generator's reproducibility .................................................................................. 181 Chapter 38: Parallel processing ........................................................................................................................ 
182 Section 38.1: Parallel processing with parallel package ........................................................................................ 182 Section 38.2: Parallel processing with foreach package ...................................................................................... 183 Section 38.3: Random Number Generation ............................................................................................................ 184 Section 38.4: mcparallelDo ....................................................................................................................................... 184 Chapter 39: Subsetting .......................................................................................................................................... 186 Section 39.1: Data frames ......................................................................................................................................... 186 Section 39.2: Atomic vectors .................................................................................................................................... 187 Section 39.3: Matrices ............................................................................................................................................... 188 Section 39.4: Lists ...................................................................................................................................................... 190 Section 39.5: Vector indexing ................................................................................................................................... 191 Section 39.6: Other objects ....................................................................................................................................... 192 Section 39.7: Elementwise Matrix Operations ........................................................................................................ 
192 Chapter 40: Debugging ......................................................................................................................................... 194 Section 40.1: Using debug ........................................................................................................................................ 194 Section 40.2: Using browser ..................................................................................................................................... 194 Chapter 41: Installing packages ....................................................................................................................... 196 Section 41.1: Install packages from GitHub ............................................................................................................. 196 Section 41.2: Download and install packages from repositories ......................................................................... 197 Section 41.3: Install package from local source ..................................................................................................... 198 Section 41.4: Install local development version of a package .............................................................................. 198 Section 41.5: Using a CLI package manager -- basic pacman usage ................................................................. 199 Chapter 42: Inspecting packages .................................................................................................................... 200 Section 42.1: View Package Version ........................................................................................................................ 200 Section 42.2: View Loaded packages in Current Session ..................................................................................... 200 Section 42.3: View package information ................................................................................................................ 
200 Section 42.4: View package's built-in data sets ..................................................................................................... 200 Section 42.5: List a package's exported functions ................................................................................................ 200 Chapter 43: Creating packages with devtools ......................................................................................... 201 Section 43.1: Creating and distributing packages .................................................................................................. 201 Section 43.2: Creating vignettes .............................................................................................................................. 203 Chapter 44: Using pipe assignment in your own package %<>%: How to ? .............................. 204 Section 44.1: Putting the pipe in a utility-functions file .......................................................................................... 204 Chapter 45: Arima Models ................................................................................................................................... 205 Section 45.1: Modeling an AR1 Process with Arima ................................................................................................ 205 Chapter 46: Distribution Functions ................................................................................................................. 210 Section 46.1: Normal distribution ............................................................................................................................. 210 Section 46.2: Binomial Distribution .......................................................................................................................... 210 Chapter 47: Shiny ..................................................................................................................................................... 
Section 47.1: Create an app
Section 47.2: Checkbox Group
Section 47.3: Radio Button
Section 47.4: Debugging
Section 47.5: Select box
Section 47.6: Launch a Shiny app
Section 47.7: Control widgets
Chapter 48: spatial analysis
Section 48.1: Create spatial points from XY data set
Section 48.2: Importing a shape file (.shp)
Chapter 49: sqldf
Section 49.1: Basic Usage Examples
Chapter 50: Code profiling
Section 50.1: Benchmarking using microbenchmark
Section 50.2: proc.time()
Section 50.3: Microbenchmark
Section 50.4: System.time
Section 50.5: Line Profiling
Chapter 51: Control flow structures
Section 51.1: Optimal Construction of a For Loop
Section 51.2: Basic For Loop Construction
Section 51.3: The Other Looping Constructs: while and repeat
Chapter 52: Column wise operation
Section 52.1: sum of each column
Chapter 53: JSON
Section 53.1: JSON to / from R objects
Chapter 54: RODBC
Section 54.1: Connecting to Excel Files via RODBC
Section 54.2: SQL Server Management Database connection to get individual table
Section 54.3: Connecting to relational databases
Chapter 55: lubridate
Section 55.1: Parsing dates and datetimes from strings with lubridate
Section 55.2: Difference between period and duration
Section 55.3: Instants
Section 55.4: Intervals, Durations and Periods
Section 55.5: Manipulating date and time in lubridate
Section 55.6: Time Zones
Section 55.7: Parsing date and time in lubridate
Section 55.8: Rounding dates
Chapter 56: Time Series and Forecasting
Section 56.1: Creating a ts object
Section 56.2: Exploratory Data Analysis with time-series data
Chapter 57: strsplit function
Section 57.1: Introduction
Chapter 58: Web scraping and parsing
Section 58.1: Basic scraping with rvest
Section 58.2: Using rvest when login is required
Chapter 59: Generalized linear models
Section 59.1: Logistic regression on Titanic dataset
Chapter 60: Reshaping data between long and wide forms
Section 60.1: Reshaping data
Section 60.2: The reshape function
Chapter 61: RMarkdown and knitr presentation
Section 61.1: Adding a footer to an ioslides presentation
Section 61.2: RStudio example
Chapter 62: Scope of variables
Section 62.1: Environments and Functions
Section 62.2: Function Exit
Section 62.3: Sub functions
Section 62.4: Global Assignment
Section 62.5: Explicit Assignment of Environments and Variables
Chapter 63: Performing a Permutation Test
Section 63.1: A fairly general function
Chapter 64: xgboost
Section 64.1: Cross Validation and Tuning with xgboost
Chapter 65: R code vectorization best practices
Section 65.1: By row operations
Chapter 66: Missing values
Section 66.1: Examining missing data
Section 66.2: Reading and writing data with NA values
Section 66.3: Using NAs of different classes
Section 66.4: TRUE/FALSE and/or NA
Chapter 67: Hierarchical Linear Modeling
Section 67.1: basic model fitting
Chapter 68: *apply family of functions (functionals)
Section 68.1: Using built-in functionals
Section 68.2: Combining multiple `data.frames` (`lapply`, `mapply`)
Section 68.3: Bulk File Loading
Section 68.4: Using user-defined functionals
Chapter 69: Text mining
Section 69.1: Scraping Data to build N-gram Word Clouds
Chapter 70: ANOVA
Section 70.1: Basic usage of aov()
Section 70.2: Basic usage of Anova()
Chapter 71: Raster and Image Analysis
Section 71.1: Calculating GLCM Texture
Section 71.2: Mathematical Morphologies
Chapter 72: Survival analysis
Section 72.1: Random Forest Survival Analysis with randomForestSRC
Section 72.2: Introduction - basic fitting and plotting of parametric survival models with the survival package
Section 72.3: Kaplan Meier estimates of survival curves and risk set tables with survminer
Chapter 73: Fault-tolerant/resilient code
Section 73.1: Using tryCatch()
Chapter 74: Reproducible R
Section 74.1: Data reproducibility
Section 74.2: Package reproducibility
Chapter 75: Fourier Series and Transformations
Section 75.1: Fourier Series
Chapter 76: .Rprofile
Section 76.1: .Rprofile - the first chunk of code executed
Section 76.2: .Rprofile example
Chapter 77: dplyr
Section 77.1: dplyr's single table verbs
Section 77.2: Aggregating with %>% (pipe) operator
Section 77.3: Subset Observation (Rows)
Section 77.4: Examples of NSE and string variables in dplyr
Chapter 78: caret
Section 78.1: Preprocessing
Chapter 79: Extracting and Listing Files in Compressed Archives
Section 79.1: Extracting files from a .zip archive
Chapter 80: Probability Distributions with R
Section 80.1: PDF and PMF for different distributions in R
Chapter 81: R in LaTeX with knitr
Section 81.1: R in LaTeX with Knitr and Code Externalization
Section 81.2: R in LaTeX with Knitr and Inline Code Chunks
Section 81.3: R in LaTeX with Knitr and Internal Code Chunks
Chapter 82: Web Crawling in R
Section 82.1: Standard scraping approach using the RCurl package
Chapter 83: Creating reports with RMarkdown
Section 83.1: Including bibliographies
Section 83.2: Including LaTeX Preamble Commands
Section 83.3: Printing tables
Section 83.4: Basic R-markdown document structure
Chapter 84: GPU-accelerated computing
Section 84.1: gpuR gpuMatrix objects
Section 84.2: gpuR vclMatrix objects
Chapter 85: heatmap and heatmap.2
Section 85.1: Examples from the official documentation
Section 85.2: Tuning parameters in heatmap.2
Chapter 86: Network analysis with the igraph package
Section 86.1: Simple Directed and Non-directed Network Graphing
Chapter 87: Functional programming
Section 87.1: Built-in Higher Order Functions
Chapter 88: Get user input
Section 88.1: User input in R
Chapter 89: Spark API (SparkR)
Section 89.1: Setup Spark context
Section 89.2: Cache data
Section 89.3: Create RDDs (Resilient Distributed Datasets)
Chapter 90: Meta: Documentation Guidelines
Section 90.1: Style
Section 90.2: Making good examples
Chapter 91: Input and output
Section 91.1: Reading and writing data frames
Chapter 92: I/O for foreign tables (Excel, SAS, SPSS, Stata)
Section 92.1: Importing data with rio
Section 92.2: Read and write Stata, SPSS and SAS files
Section 92.3: Importing Excel files
Section 92.4: Import or Export of Feather file
Chapter 93: I/O for database tables
Section 93.1: Reading Data from MySQL Databases
Section 93.2: Reading Data from MongoDB Databases
Chapter 94: I/O for geographic data (shapefiles, etc.)
Section 94.1: Import and Export Shapefiles
Chapter 95: I/O for raster images
Section 95.1: Load a multilayer raster
Chapter 96: I/O for R's binary format
Section 96.1: Rds and RData (Rda) files
Section 96.2: Environments
Chapter 97: Recycling
Section 97.1: Recycling use in subsetting
Chapter 98: Expression: parse + eval
Section 98.1: Execute code in string format
Chapter 99: Regular Expression Syntax in R
Section 99.1: Use `grep` to find a string in a character vector
Chapter 100: Regular Expressions (regex)
Section 100.1: Differences between Perl and POSIX regex
Section 100.2: Validate a date in a "YYYYMMDD" format
Section 100.3: Escaping characters in R regex patterns
Section 100.4: Validate US States postal abbreviations
Section 100.5: Validate US phone numbers
Chapter 101: Combinatorics
Section 101.1: Enumerating combinations of a specified length
Section 101.2: Counting combinations of a specified length
Chapter 102: Solving ODEs in R
Section 102.1: The Lorenz model
Section 102.2: Lotka-Volterra or: Prey vs. predator
Section 102.3: ODEs in compiled languages - definition in R
Section 102.4: ODEs in compiled languages - definition in C
Section 102.5: ODEs in compiled languages - definition in Fortran
Section 102.6: ODEs in compiled languages - a benchmark test
Chapter 103: Feature Selection in R -- Removing Extraneous Features
Section 103.1: Removing features with zero or near-zero variance
Section 103.2: Removing features with high numbers of NA
Section 103.3: Removing closely correlated features
Chapter 104: Bibliography in RMD
Section 104.1: Specifying a bibliography and cite authors
Section 104.2: Inline references
Section 104.3: Citation styles
Chapter 105: Writing functions in R
Section 105.1: Anonymous functions
Section 105.2: RStudio code snippets
Section 105.3: Named functions
Chapter 106: Color schemes for graphics
Section 106.1: viridis - print and colorblind friendly palettes
Section 106.2: A handy function to glimpse a vector of colors
Section 106.3: colorspace - click&drag interface for colors
Section 106.4: Colorblind-friendly palettes
Section 106.5: RColorBrewer
Section 106.6: basic R color functions
Chapter 107: Hierarchical clustering with hclust
Section 107.1: Example 1 - Basic use of hclust, display of dendrogram, plot clusters
Section 107.2: Example 2 - hclust and outliers
Chapter 108: Random Forest Algorithm
Section 108.1: Basic examples - Classification and Regression
Chapter 109: RESTful R Services
Section 109.1: opencpu Apps
Chapter 110: Machine learning
Section 110.1: Creating a Random Forest model
Chapter 111: Using texreg to export models in a paper-ready way
Section 111.1: Printing linear regression results
404
Chapter 112: Publishing ... 406
Section 112.1: Formatting tables ... 406
Section 112.2: Formatting entire documents ... 406
Chapter 113: Implement State Machine Pattern using S4 Class ... 407
Section 113.1: Parsing Lines using State Machine ... 407
Chapter 114: Reshape using tidyr ... 419
Section 114.1: Reshape from long to wide format with spread() ... 419
Section 114.2: Reshape from wide to long format with gather() ... 419
Chapter 115: Modifying strings by substitution ... 421
Section 115.1: Rearrange character strings using capture groups ... 421
Section 115.2: Eliminate duplicated consecutive elements ... 421
Chapter 116: Non-standard evaluation and standard evaluation ... 423
Section 116.1: Examples with standard dplyr verbs ...
423
Chapter 117: Randomization ... 425
Section 117.1: Random draws and permutations ... 425
Section 117.2: Setting the seed ... 427
Chapter 118: Object-Oriented Programming in R ... 428
Section 118.1: S3 ... 428
Chapter 119: Coercion ... 429
Section 119.1: Implicit Coercion ... 429
Chapter 120: Standardize analyses by writing standalone R scripts ... 430
Section 120.1: The basic structure of standalone R program and how to call it ... 430
Section 120.2: Using littler to execute R scripts ... 431
Chapter 121: Analyze tweets with R ... 433
Section 121.1: Download Tweets ...
433
Section 121.2: Get text of tweets ... 433
Chapter 122: Natural language processing ... 435
Section 122.1: Create a term frequency matrix ... 435
Chapter 123: R Markdown Notebooks (from RStudio) ... 437
Section 123.1: Creating a Notebook ... 437
Section 123.2: Inserting Chunks ... 437
Section 123.3: Executing Chunk Code ... 438
Section 123.4: Execution Progress ... 439
Section 123.5: Preview Output ... 440
Section 123.6: Saving and Sharing ... 440
Chapter 124: Aggregating data frames ... 442
Section 124.1: Aggregating with data.table ...
442
Section 124.2: Aggregating with base R ... 443
Section 124.3: Aggregating with dplyr ... 444
Chapter 125: Data acquisition ... 446
Section 125.1: Built-in datasets ... 446
Section 125.2: Packages to access open databases ... 446
Section 125.3: Packages to access restricted data ... 448
Section 125.4: Datasets within packages ... 452
Chapter 126: R memento by examples ... 454
Section 126.1: Plotting (using plot) ... 454
Section 126.2: Commonly used functions ... 454
Section 126.3: Data types ... 455
Chapter 127: Updating R version ...
457
Section 127.1: Installing from R Website ... 457
Section 127.2: Updating from within R using installr Package ... 457
Section 127.3: Deciding on the old packages ... 457
Section 127.4: Updating Packages ... 459
Section 127.5: Check R Version ... 459
Credits ... 460
You may also like ... 464

About

Please feel free to share this PDF with anyone for free. The latest version of this book can be downloaded from: https://goalkicker.com/RBook

This R Notes for Professionals book is compiled from Stack Overflow Documentation; the content is written by the beautiful people at Stack Overflow. Text content is released under Creative Commons BY-SA; see the credits at the end of this book for who contributed to the various chapters. Images may be copyright of their respective owners unless otherwise specified.

This is an unofficial free book created for educational purposes and is not affiliated with official R group(s) or company(s) nor Stack Overflow.
All trademarks and registered trademarks are the property of their respective company owners.

The information presented in this book is not guaranteed to be correct nor accurate; use at your own risk.

Please send feedback and corrections to [email protected]

GoalKicker.com – R Notes for Professionals 1

Chapter 1: Getting started with R Language

Section 1.1: Installing R

You might wish to install RStudio after you have installed R. RStudio is a development environment for R that simplifies many programming tasks.

Windows only: Visual Studio (starting from version 2015 Update 3) features a development environment for R called R Tools, which includes a live interpreter, IntelliSense, and a debugging module. If you choose this method, you won't have to install R as specified in the following section.

For Windows

1. Go to the CRAN website, click on "Download R for Windows", and download the latest version of R.
2. Right-click the installer file and run it as administrator.
3. Select the operational language for installation.
4. Follow the instructions for installation.

For OSX / macOS

Alternative 1

0. Ensure XQuartz is installed.
1. Go to the CRAN website and download the latest version of R.
2. Open the disk image and run the installer.
3. Follow the instructions for installation.

This will install both R and the R-MacGUI. It will put the GUI in the /Applications/ folder as R.app, where it can either be double-clicked or dragged to the Dock. When a new version is released, the (re)installation process will overwrite R.app, but prior major versions of R will be maintained. The actual R code will be in the /Library/Frameworks/R.framework/Versions/ directory. Using R within RStudio is also possible and would use the same R code with a different GUI.

Alternative 2

1. Install Homebrew (the missing package manager for macOS) by following the instructions on https://brew.sh/
2.
brew install R

Those choosing the second method should be aware that the maintainer of the Mac fork advises against it, and will not respond to questions about difficulties on the R-SIG-Mac Mailing List.

For Debian, Ubuntu and derivatives

You can get the version of R corresponding to your distro via apt-get. However, this version will frequently be quite far behind the most recent version available on CRAN. You can add CRAN to your list of recognized "sources".

    sudo apt-get install r-base

You can get a more recent version directly from CRAN by adding CRAN to your sources list. Follow the directions from CRAN for more details. Note in particular the need to also execute the following so that you can use install.packages(). Linux packages are usually distributed as source files and need compilation:

    sudo apt-get install r-base-dev

For Red Hat and Fedora

    sudo dnf install R

For Archlinux

R is directly available in the Extra package repo.

    sudo pacman -S r

More info on using R under Archlinux can be found on the ArchWiki R page.

Section 1.2: Hello World!

    "Hello World!"

Also, check out the detailed discussion of how, when, whether and why to print a string.

Section 1.3: Getting Help

You can use the function help() or ? to access documentation and search for help in R. For even more general searches, you can use help.search() or ??.

    # For help on the help function of R
    help()
    # For help on the paste function
    help(paste)    # OR
    help("paste")  # OR
    ?paste         # OR
    ?"paste"

Visit https://www.r-project.org/help.html for additional information.

Section 1.4: Interactive mode and R scripts

The interactive mode

The most basic way to use R is the interactive mode. You type commands and immediately get the result from R.

Using R as a calculator

Start R by typing R at the command prompt of your operating system or by executing RGui on Windows.
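Once the prompt appears, expressions can be entered one at a time. As a rough sketch of what such a calculator-style session might contain (these particular expressions are illustrative and are not taken from the original screenshots):

```r
# Arithmetic typed at the R prompt; each result is printed as a length-one vector
2 + 3 * 4      # multiplication is done before addition: 14
(2 + 3) * 4    # parentheses change the order: 20
2^10           # exponentiation: 1024
7 %/% 2        # integer division: 3
7 %% 2         # remainder: 1
sqrt(16)       # square root: 4
```

Each line can be typed on its own; R evaluates it and prints the value immediately.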
Below you can see a screenshot of an interactive R session on Linux:

This is RGui on Windows, the most basic working environment for R under Windows:

After the > sign, expressions can be typed in. Once an expression is typed, the result is shown by R. In the screenshot above, R is used as a calculator: Type

    1+1

to immediately see the result, 2. The leading [1] indicates that R returns a vector. In this case, the vector contains only one number (2).

The first plot

R can be used to generate plots. The following example uses the data set PlantGrowth, which comes as an example data set along with R.

Type all of the following lines that do not start with ## into the R prompt. Lines starting with ## document the result that R will return.

    data(PlantGrowth)
    str(PlantGrowth)
    ## 'data.frame': 30 obs. of 2 variables:
    ## $ weight: num 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
    ## $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
    anova(lm(weight ~ group, data = PlantGrowth))
    ## Analysis of Variance Table
    ##
    ## Response: weight
    ##           Df  Sum Sq Mean Sq F value  Pr(>F)
    ## group      2  3.7663  1.8832  4.8461 0.01591 *
    ## Residuals 27 10.4921  0.3886
    ## ---
    ## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    boxplot(weight ~ group, data = PlantGrowth, ylab = "Dry weight")

The following plot is created:

data(PlantGrowth) loads the example data set PlantGrowth, which records the dry masses of plants that were subject to two different treatment conditions or no treatment at all (control group). The data set is made available under the name PlantGrowth. Such a name is also called a variable.

To load your own data, the following two documentation pages might be helpful:

Reading and writing tabular data in plain-text files (CSV, TSV, etc.)
I/O for foreign tables (Excel, SAS, SPSS, Stata)

str(PlantGrowth) shows information about the data set which was loaded. The output indicates that PlantGrowth is a data.frame, which is R's name for a table. The data.frame consists of two columns and 30 rows. In this case, each row corresponds to one plant. Details of the two columns are shown in the lines starting with $: The first column is called weight and contains numbers (num, the dry weight of the respective plant). The second column, group, contains the treatment that the plant was subjected to. This is categorical data, which is called a factor in R. Read more information about data frames.

To compare the dry masses of the three different groups, a one-way ANOVA is performed using anova(lm( ... )). weight ~ group means "Compare the values of the column weight, grouping by the values of the column group". This is called a formula in R. data = ... specifies the name of the table where the data can be found.

The result shows, among other things, that there is a significant difference (column Pr(>F), p = 0.01591) between some of the three groups. Post-hoc tests, like Tukey's test, must be performed to determine which groups' means differ significantly.

boxplot(...) creates a box plot of the data. data = ... specifies where the values to be plotted come from. weight ~ group means "Plot the values of the column weight versus the values of the column group". ylab = ... specifies the label of the y axis. More information: Base plotting

Type q() or press Ctrl-D to exit from the R session.

R scripts

To document your research, it is advisable to save the commands you use for calculation in a file. For that purpose, you can create R scripts. An R script is a simple text file containing R commands.
Create a text file with the name plants.R, and fill it with the following text, where some commands are familiar from the code block above:

    data(PlantGrowth)
    anova(lm(weight ~ group, data = PlantGrowth))
    png("plant_boxplot.png", width = 400, height = 300)
    boxplot(weight ~ group, data = PlantGrowth, ylab = "Dry weight")
    dev.off()

Execute the script by typing the following into your terminal (the terminal of your operating system, not an interactive R session like in the previous section!):

    R --no-save < plants.R > plant_result.txt

The file plant_result.txt contains the results of your calculation, as if you had typed them into the interactive R prompt. Thereby, your calculations are documented.

The new commands png and dev.off are used for saving the boxplot to disk. The two commands must enclose the plotting command, as shown in the example above. png("FILENAME", width = ..., height = ...) opens a new PNG file with the specified file name, width and height in pixels. dev.off() finishes plotting and saves the plot to disk. No output is saved until dev.off() is called.

Chapter 2: Variables

Section 2.1: Variables, data structures and basic Operations

In R, data objects are manipulated using named data structures. The names of the objects might be called "variables", although that term does not have a specific meaning in the official R documentation. R names are case sensitive and may contain alphanumeric characters (a-z, A-Z, 0-9), the dot/period (.) and the underscore (_). To create names for the data structures, we have to follow these rules:

Names that start with a digit or an underscore (e.g. 1a), or names that are valid numerical expressions (e.g. .11), or names with dashes ('-') or spaces can only be used when they are quoted: `1a` and `.11`.
The names will be printed with backticks:

    list( '.11' ="a")
    #$`.11`
    #[1] "a"

All other combinations of alphanumeric characters, dots and underscores can be used freely, where reference with or without backticks points to the same object.

Names that begin with . are considered system names and are not always visible using the ls() function.

There is no practical restriction on the number of characters in a variable name (names are limited to 10,000 bytes).

Some examples of valid object names are: foobar, foo.bar, foo_bar, .foobar

In R, variables are assigned values using the infix assignment operator <-. The operator = can also be used for assigning values to variables; however, its proper use is for associating values with parameter names in function calls. Note that omitting spaces around operators may create confusion for users. The expression a<-1 is parsed as assignment (a <- 1) rather than as a logical comparison (a < -1).

    > foo <- 42
    > fooEquals = 43

So foo is assigned the value of 42. Typing foo within the console will output 42, while typing fooEquals will output 43.

    > foo
    [1] 42
    > fooEquals
    [1] 43

The following command assigns a value to the variable named x and prints the value simultaneously:

    > (x <- 5)
    [1] 5
    # actually two function calls: first one to `<-`; second one to the `()`-function
    > is.function(`(`)
    [1] TRUE
    # Often used in R help page examples for its side-effect of printing.

It is also possible to make assignments to variables using ->.

    > 5 -> x
    > x
    [1] 5
    >

Types of data structures

There are no scalar data types in R. Vectors of length one act like scalars.

Vectors: Atomic vectors must be a sequence of same-class objects: a sequence of numbers, or a sequence of logicals, or a sequence of characters. v <- c(2, 3, 7, 10) and v2 <- c("a", "b", "c") are both vectors.

Matrices: A matrix of numbers, logicals or characters. a <- matrix(data = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), nrow = 4, ncol = 3, byrow = F).
Like vectors, a matrix must be made of same-class elements. To extract elements from a matrix, rows and columns must be specified: a[1,2] returns [1] 5, that is, the element in the first row, second column.

Lists: concatenation of different elements. mylist <- list(course = 'stat', date = '04/07/2009', num_isc = 7, num_cons = 6, num_mat = as.character(c(45020, 45679, 46789, 43126, 42345, 47568, 45674)), results = c(30, 19, 29, NA, 25, 26, 27)). Extracting elements from a list can be done by name (if the list is named) or by index. In the given example, mylist$results and mylist[[6]] obtain the same element. Warning: if you try mylist[6], R won't give you an error, but it extracts the result as a list. While mylist[[6]][2] is permitted (it gives you 19), mylist[6][2] does not give you 19: because mylist[6] is a list of length one, it returns a list whose only element is NULL.

data.frame: object with columns that are vectors of equal length, but (possibly) different types. They are not matrices. exam <- data.frame(matr = as.character(c(45020, 45679, 46789, 43126, 42345, 47568, 45674)), res_S = c(30, 19, 29, NA, 25, 26, 27), res_O = c(3, 3, 1, NA, 3, 2, NA), res_TOT = c(30, 22, 30, NA, 28, 28, 27)). Columns can be read by name (exam$matr, exam[, 'matr']) or by index (exam[1], exam[, 1]). Rows can also be read by name (exam['rowname', ]) or by index (exam[1, ]). Data frames are actually just lists with a particular structure (a rownames attribute and equal-length components).

Common operations and some cautionary advice

Default operations are done element by element. See ?Syntax for the rules of operator precedence. Most operators (and many other functions in base R) have recycling rules that allow arguments of unequal length. Given these objects:

Example objects

    > a <- 1
    > b <- 2
    > c <- c(2,3,4)
    > d <- c(10,10,10)
    > e <- c(1,2,3,4)
    > f <- 1:6
    > W <- cbind(1:4,5:8,9:12)
    > Z <- rbind(rep(0,3),1:3,rep(10,3),c(4,7,1))

Some vector operations

    > a+b # scalar + scalar
    [1] 3
    > c+d # vector + vector
    [1] 12 13 14
    > a*b # scalar * scalar
    [1] 2
    > c*d # vector * vector (componentwise!)
    [1] 20 30 40
    > c+a # vector + scalar
    [1] 3 4 5
    > c^2 #
    [1] 4 9 16
    > exp(c)
    [1] 7.389056 20.085537 54.598150

Some vector operation Warnings!

    > c+e # warning but.. no errors, since recycling is assumed to be desired.
    [1] 3 5 7 6
    Warning message:
    In c + e : longer object length is not a multiple of shorter object length

R sums what it can and then reuses the shorter vector to fill in the blanks... The warning was given only because the length of the longer vector is not an exact multiple of the length of the shorter one.

    > c+f # no warning whatsoever.

Some Matrix operations Warning!

    > Z+W # matrix + matrix #(componentwise)
    > Z*W # matrix* matrix#(Standard product is always componentwise)

For a true matrix multiplication, use %*%, as in V %*% W (for conformable matrices V and W).

    > W + a # matrix+ scalar is still componentwise
         [,1] [,2] [,3]
    [1,]    2    6   10
    [2,]    3    7   11
    [3,]    4    8   12
    [4,]    5    9   13
    > W + c # matrix + vector... : no warnings and R does the operation in a column-wise manner
         [,1] [,2] [,3]
    [1,]    3    8   13
    [2,]    5   10   12
    [3,]    7    9   14
    [4,]    6   11   16

"Private" variables

A leading dot in the name of a variable or function in R is commonly used to denote that the variable or function is meant to be hidden. So, declaring the following variables

    > foo <- 'foo'
    > .foo <- 'bar'

and then using the ls function to list objects will only show the first object.

    > ls()
    [1] "foo"

However, passing all.names = TRUE to the function will show the 'private' variable:

    > ls(all.names = TRUE)
    [1] ".foo" "foo"

Chapter 3: Arithmetic Operators

Section 3.1: Range and addition

Let's take an example of adding a value to a range (as could be done in a loop, for example):

    3+1:5

Gives:

    [1] 4 5 6 7 8

This is because the range operator : has higher precedence than the addition operator +. What happens during evaluation is as follows:

    3+1:5
    3+c(1, 2, 3, 4, 5)   expansion of the range operator to make a vector of integers.
    c(4, 5, 6, 7, 8)     addition of 3 to each member of the vector.
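The evaluation order described above can be checked directly at the prompt. The following is a small sketch; the variable names are illustrative only, and the second half shows a related precedence case (unary minus versus the range operator) that is not discussed in the original text:

```r
# `:` binds tighter than binary `+`, so both expressions are parsed the same way
x <- 3 + 1:5
y <- 3 + (1:5)
identical(x, y)   # TRUE

# Unary minus binds tighter than `:`, so the range starts at -1
z <- -1:3         # -1 0 1 2 3
w <- -(1:3)       # -1 -2 -3
```

Comparing lengths (length(z) is 5, length(w) is 3) is a quick way to notice when a range was built differently than intended.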
To avoid this behavior you have to tell the R interpreter how you want it to order the operations with ( ) like this:

    (3+1):5

Now R will compute what is inside the parentheses before expanding the range and gives:

    [1] 4 5

Section 3.2: Addition and subtraction

The basic math operations are performed mainly on numbers or on vectors (sequences of numbers).

1. Using single numbers

We can simply enter the numbers concatenated with + for adding and - for subtracting:

    > 3 + 4.5
    # [1] 7.5
    > 3 + 4.5 + 2
    # [1] 9.5
    > 3 + 4.5 + 2 - 3.8
    # [1] 5.7
    > 3 + NA
    #[1] NA
    > NA + NA
    #[1] NA
    > NA - NA
    #[1] NA
    > NaN - NA
    #[1] NaN
    > NaN + NA
    #[1] NaN

We can assign the numbers to variables (constants in this case) and do the same operations:

    > a <- 3; B <- 4.5; cc <- 2; Dd <- 3.8 ;na<-NA;nan<-NaN
    > a + B
    # [1] 7.5
    > a + B + cc
    # [1] 9.5
    > a + B + cc - Dd
    # [1] 5.7
    > B-nan
    #[1] NaN
    > a+na-na
    #[1] NA
    > a + na
    #[1] NA

2. Using vectors

In this case we create vectors of numbers and do the operations using those vectors, or combinations with single numbers.
In this case the operation is done considering each element of the vector:

    > A <- c(3, 4.5, 2, -3.8);
    > A
    # [1] 3.0 4.5 2.0 -3.8
    > A + 2 # Adding a number
    # [1] 5.0 6.5 4.0 -1.8
    > 8 - A # number less vector
    # [1] 5.0 3.5 6.0 11.8
    > n <- length(A) #number of elements of vector A
    > n
    # [1] 4
    > A[-n] + A[n] # Add the last element to the same vector without the last element
    # [1] -0.8 0.7 -1.8
    > A[1:2] + 3 # vector with the first two elements plus a number
    # [1] 6.0 7.5
    > A[1:2] - A[3:4] # vector with the first two elements less the vector with elements 3 and 4
    # [1] 1.0 8.3

We can also use the function sum to add all elements of a vector:

    > sum(A)
    # [1] 5.7
    > sum(-A)
    # [1] -5.7
    > sum(A[-n]) + A[n]
    # [1] 5.7

We must take care with recycling, which is one of the characteristics of R: a behavior that happens when doing math operations where the lengths of the vectors differ. Shorter vectors in the expression are recycled as often as need be (perhaps fractionally) until they match the length of the longest vector. In particular, a constant is simply repeated. When the longer length is not a multiple of the shorter one, a warning is shown.

    > B <- c(3, 5, -3, 2.7, 1.8)
    > B
    # [1] 3.0 5.0 -3.0 2.7 1.8
    > A
    # [1] 3.0 4.5 2.0 -3.8
    > A + B # the first element of A is repeated
    # [1] 6.0 9.5 -1.0 -1.1 4.8
    Warning message:
    In A + B : longer object length is not a multiple of shorter object length
    > B - A # the first element of A is repeated
    # [1] 0.0 0.5 -5.0 6.5 -1.2
    Warning message:
    In B - A : longer object length is not a multiple of shorter object length

In this case the correct procedure will be to consider only the elements of the shorter vector:

    > B[1:n] + A
    # [1] 6.0 9.5 -1.0 -1.1
    > B[1:n] - A
    # [1] 0.0 0.5 -5.0 6.5

When using the sum function, again all the elements inside the function are added.
    > sum(A, B)
    # [1] 15.2
    > sum(A, -B)
    # [1] -3.8
    > sum(A)+sum(B)
    # [1] 15.2
    > sum(A)-sum(B)
    # [1] -3.8

Chapter 4: Matrices

Matrices store data

Section 4.1: Creating matrices

Under the hood, a matrix is a special kind of vector with two dimensions. Like a vector, a matrix can only have one data class. You can create matrices using the matrix function as shown below.

    matrix(data = 1:6, nrow = 2, ncol = 3)
    ##      [,1] [,2] [,3]
    ## [1,]    1    3    5
    ## [2,]    2    4    6

As you can see, this gives us a matrix of all numbers from 1 to 6 with two rows and three columns. The data parameter takes a vector of values, nrow specifies the number of rows in the matrix, and ncol specifies the number of columns. By convention the matrix is filled by column. The default behavior can be changed with the byrow parameter as shown below:

    matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)
    ##      [,1] [,2] [,3]
    ## [1,]    1    2    3
    ## [2,]    4    5    6

Matrices do not have to be numeric – any vector can be transformed into a matrix. For example:

    matrix(data = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE), nrow = 3, ncol = 2)
    ##      [,1]  [,2]
    ## [1,] TRUE FALSE
    ## [2,] TRUE FALSE
    ## [3,] TRUE FALSE
    matrix(data = c("a", "b", "c", "d", "e", "f"), nrow = 3, ncol = 2)
    ##      [,1] [,2]
    ## [1,] "a"  "d"
    ## [2,] "b"  "e"
    ## [3,] "c"  "f"

Like vectors, matrices can be stored as variables and then called later. The rows and columns of a matrix can have names. You can look at these using the functions rownames and colnames. As shown below, the rows and columns don't initially have names, which is denoted by NULL. However, you can assign values to them.

    mat1 <- matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)
    rownames(mat1)
    ## NULL
    colnames(mat1)
    ## NULL
    rownames(mat1) <- c("Row 1", "Row 2")
    colnames(mat1) <- c("Col 1", "Col 2", "Col 3")
    mat1
    ##       Col 1 Col 2 Col 3
    ## Row 1     1     2     3
    ## Row 2     4     5     6

It is important to note that, similarly to vectors, matrices can only have one data type.
If you try to specify a matrix with multiple data types, the data will be coerced to the higher-order data class.

The class, is, and as functions can be used to check and coerce data structures in the same way as for vectors.

    class(mat1)
    ## [1] "matrix"
    is.matrix(mat1)
    ## [1] TRUE
    as.vector(mat1)
    ## [1] 1 4 2 5 3 6

Note that in R 4.0.0 and later, class(mat1) returns "matrix" "array".

Chapter 5: Formula

Section 5.1: The basics of formula

Statistical functions in R make heavy use of the so-called Wilkinson-Rogers formula notation¹. When running model functions like lm for linear regressions, they need a formula. This formula specifies which regression coefficients shall be estimated.

    my_formula1 <- formula(mpg ~ wt)
    class(my_formula1)
    # gives "formula"
    mod1 <- lm(my_formula1, data = mtcars)
    coef(mod1)
    # gives (Intercept)          wt
    #         37.285126   -5.344472

On the left side of the ~ (LHS) the dependent variable is specified, while the right hand side (RHS) contains the independent variables. Technically the formula call above is redundant because the tilde operator is an infix function that returns an object with formula class:

    form <- mpg ~ wt
    class(form)
    #[1] "formula"

The advantage of the formula function over ~ is that it also allows an environment for evaluation to be specified:

    form_mt <- formula(mpg ~ wt, env = mtcars)

In this case, the output shows that a regression coefficient for wt is estimated, as well as (per default) an intercept parameter. The intercept can be excluded / forced to be 0 by including 0 or -1 in the formula:

    coef(lm(mpg ~ 0 + wt, data = mtcars))
    coef(lm(mpg ~ wt -1, data = mtcars))

Interactions between variables a and b can be added by including a:b in the formula:

    coef(lm(mpg ~ wt:vs, data = mtcars))

As it is (from a statistical point of view) generally advisable not to have interactions in the model without the main effects, the naive approach would be to expand the formula to a + b + a:b.
This works but can be simplified by writing a*b, where the * operator indicates factor crossing (when between two factor columns) or multiplication when one or both of the columns are 'numeric':

    coef(lm(mpg ~ wt*vs, data = mtcars))

Using the * notation expands a term to include all lower order effects, such that:

    coef(lm(mpg ~ wt*vs*hp, data = mtcars))

will give, in addition to the intercept, 7 regression coefficients: one for the three-way interaction, three for the two-way interactions and three for the main effects.

If one wants, for example, to exclude the three-way interaction, but retain all two-way interactions, there are two shorthands. First, using - we can subtract any particular term:

    coef(lm(mpg ~ wt*vs*hp - wt:vs:hp, data = mtcars))

Or, we can use the ^ notation to specify which level of interaction we require:

    coef(lm(mpg ~ (wt + vs + hp) ^ 2, data = mtcars))

Those two formula specifications should create the same model matrix.

Finally, . is shorthand to use all available variables as main effects. In this case, the data argument is used to obtain the available variables (which are not on the LHS). Therefore:

    coef(lm(mpg ~ ., data = mtcars))

gives coefficients for the intercept and 10 independent variables. This notation is frequently used in machine learning packages, where one would like to use all variables for prediction or classification. Note that the meaning of . depends on context (see e.g. ?update.formula for a different meaning).

1. G. N. Wilkinson and C. E. Rogers. Journal of the Royal Statistical Society. Series C (Applied Statistics) Vol. 22, No. 3 (1973), pp. 392-399

Chapter 6: Reading and writing strings

Section 6.1: Printing and displaying strings

R has several built-in functions that can be used to print or display information, but print and cat are the most basic.
As R is an interpreted language, you can try these out directly in the R console:

    print("Hello World")
    #[1] "Hello World"
    cat("Hello World\n")
    #Hello World

Note the difference in both input and output for the two functions. (Note: there are no quote characters in the value of x created with x <- "Hello World". They are added by print at the output stage.)

cat takes one or more character vectors as arguments and prints them to the console. If a character vector has a length greater than 1, its elements are separated by a space (by default):

    cat(c("hello", "world", "\n"))
    #hello world

Without the newline character (\n) the output would be:

    cat("Hello World")
    #Hello World>

The prompt for the next command appears immediately after the output. (Some consoles, such as RStudio's, may automatically append a newline to strings that do not end with a newline.)

print is an example of a "generic" function, which means the class of the first argument passed is detected and a class-specific method is used for output. For a character vector like "Hello World", the result is similar to the output of cat. However, the character string is quoted and a number [1] is output to indicate the first element of a character vector (in this case, the first and only element):

    print("Hello World")
    #[1] "Hello World"

This default print method is also what we see when we simply ask R to print a variable. Note how the output of typing s is the same as calling print(s) or print("Hello World"):

    s <- "Hello World"
    s
    #[1] "Hello World"

Or even without assigning it to anything:

    "Hello World"
    #[1] "Hello World"

If we add another character string as a second element of the vector (using the c() function to concatenate the elements together), then the behavior of print() looks quite a bit different from that of cat:

    print(c("Hello World", "Here I am."))
    #[1] "Hello World" "Here I am."

Observe that the c() function does not do string concatenation.
(One needs to use paste for that purpose.) R shows that the character vector has two elements by quoting them separately. If we have a vector long enough to span multiple lines, R will print the index of the element starting each line, just as it prints [1] at the start of the first line.

    c("Hello World", "Here I am!", "This next string is really long.")
    #[1] "Hello World"                      "Here I am!"
    #[3] "This next string is really long."

The particular behavior of print depends on the class of the object passed to the function. If we call print on an object with a different class, such as "numeric" or "logical", the quotes are omitted from the output to indicate we are dealing with an object that is not of character class:

    print(1)
    #[1] 1
    print(TRUE)
    #[1] TRUE

Factor columns in a data frame are printed in the same fashion as character columns, which often creates ambiguity when console output is used to display objects in Stack Overflow question bodies.

It is rare to use cat or print except in an interactive context. Explicitly calling print() is particularly rare (unless you want to suppress the appearance of the quotes or view an object that is returned invisibly by a function), as entering foo at the console is a shortcut for print(foo). The interactive console of R is known as a REPL, a "read-eval-print loop". The cat function is best saved for special purposes (like writing output to an open file connection). Sometimes it is used inside functions (where automatic printing does not happen); however, using cat() inside a function to generate output to the console is bad practice. The preferred method is to use message() or warning() for intermediate messages; they behave similarly to cat but can optionally be suppressed by the end user. The final result should simply be returned so that the user can assign it and store it if necessary.
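As a minimal sketch of this advice, the function below reports progress with message() and returns its result instead of printing it with cat(). The function name summarise_values and its input are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: reports progress via message() and returns the result
summarise_values <- function(x) {
  message("Summarising ", length(x), " values")  # diagnostic output, suppressible
  mean(x)                                        # returned to the caller, not printed with cat()
}

result <- summarise_values(c(2, 4, 6))                     # message is shown, value is stored
quiet  <- suppressMessages(summarise_values(c(2, 4, 6)))   # same value, message hidden
```

Because the value is returned rather than printed, the caller decides whether to store it, print it, or pass it on, and whether the progress message should appear at all.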
For example, message() output can be silenced with suppressMessages():

    message("hello world")
    #hello world
    suppressMessages(message("hello world"))

Section 6.2: Capture output of operating system command

Functions which return a character vector

Base R has two functions for invoking a system command. Both require an additional parameter to capture the output of the system command.

    system("top -a -b -n 1", intern = TRUE)
    system2("top", "-a -b -n 1", stdout = TRUE)

Both return a character vector.

    [1] "top - 08:52:03 up 70 days, 15:09, 0 users, load average: 0.00, 0.00, 0.00"
    [2] "Tasks: 125 total, 1 running, 124 sleeping, 0 stopped, 0 zombie"
    [3] "Cpu(s): 0.9%us, 0.3%sy, 0.0%ni, 98.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st"
    [4] "Mem: 12194312k total, 3613292k used, 8581020k free, 216940k buffers"
    [5] "Swap: 12582908k total, 2334156k used, 10248752k free, 1682340k cached"
    [6] ""
    [7] " PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND "
    [8] "11300 root 20 0 1278m 375m 3696 S 0.0 3.2 124:40.92 trala "
    [9] " 6093 user1 20 0 1817m 269m 1888 S 0.0 2.3 12:17.96 R "
    [10] " 4949 user2 20 0 1917m 214m 1888 S 0.0 1.8 11:16.73 R "

For illustration, the UNIX command top -a -b -n 1 is used. This is OS-specific and may need to be amended to run the examples on your computer.

Package devtools has a function to run a system command and capture the output without an additional parameter. It also returns a character vector. (This function belongs to older versions of devtools and may not be available in current releases.)

    devtools::system_output("top", "-a -b -n 1")

Functions which return a data frame

The fread function in package data.table can execute a shell command and read its output like read.table. It returns a data.table or a data.frame.

    fread("top -a -b -n 1", check.names = TRUE)
         PID  USER PR NI  VIRT  RES  SHR S X.CPU X.MEM     TIME. COMMAND
    1: 11300  root 20  0 1278m 375m 3696 S     0   3.2 124:40.92   trala
    2:  6093 user1 20  0 1817m 269m 1888 S     0   2.3  12:18.56       R
    3:  4949 user2 20  0 1917m 214m 1888 S     0   1.8  11:17.33       R
    4:  7922 user3 20  0 3094m 131m 1892 S     0   1.1  21:04.95       R

Note that fread has automatically skipped the top 6 header lines. Here the parameter check.names = TRUE was added to convert %CPU, %MEM, and TIME+ to syntactically valid column names.

Section 6.3: Reading from or writing to a file connection

We do not always have the liberty to read from or write to a local file system path. For example, R code run under streaming MapReduce must read from and write to file connections. There are other scenarios as well in which one goes beyond the local system, and with the advent of cloud and big data this is becoming increasingly common. One way to handle this is in a logical sequence.

Establish a file connection for reading with the file() command ("r" is for read mode):

    conn <- file("/path/example.data", "r")   # when the file is on the local system
    conn1 <- file("stdin", "r")               # when only standard input/output is available

As this only establishes the file connection, the data can then be read from the connection as follows:

    line <- readLines(conn, n = 1, warn = FALSE)

Here we read the data from the file connection conn line by line, since n = 1. One can change the value of n (say 10, 20, etc.) to read the data in blocks for faster reading (a block of 10 or 20 lines is read in one go). To read the complete file in one go, set n = -1.

After data processing or, say, model execution, one can write the results back to a file connection using commands such as writeLines() or cat(), which are capable of writing to a file connection. However, all of these commands need a file connection that has been established for writing.
This can be done using the file() command:

    conn2 <- file("/path/result.data", "w")   # when the file is on the local system
    conn3 <- file("stdout", "w")              # when only standard input/output is available
    # (base R also provides stdout(), which returns the standard output connection)

Then write the data as follows:

    writeLines("text", conn2, sep = "\n")
    close(conn2)   # close the connection when done

Chapter 7: String manipulation with stringi package

Section 7.1: Count pattern inside string

With fixed pattern

    library(stringi)

    stri_count_fixed("babab", "b")
    # [1] 3
    stri_count_fixed("babab", "ba")
    # [1] 2
    stri_count_fixed("babab", "bab")
    # [1] 1

Natively:

    length(gregexpr("b","babab")[[1]])
    # [1] 3
    length(gregexpr("ba","babab")[[1]])
    # [1] 2
    length(gregexpr("bab","babab")[[1]])
    # [1] 1

The function is vectorized over string and pattern:

    stri_count_fixed("babab", c("b","ba"))
    # [1] 3 2
    stri_count_fixed(c("babab","bbb","bca","abc"), c("b","ba"))
    # [1] 3 0 1 0

A base R solution:

    sapply(c("b","ba"),function(x)length(gregexpr(x,"babab")[[1]]))
    #  b ba
    #  3  2

With regex

The first example finds a followed by any character; the second finds a followed by any digit.

    stri_count_regex("a1 b2 a3 b4 aa", "a.")
    # [1] 3
    stri_count_regex("a1 b2 a3 b4 aa", "a\\d")
    # [1] 2

Section 7.2: Duplicating strings

    stri_dup("abc",3)
    # [1] "abcabcabc"

A base R solution that does the same would look like this:

    paste0(rep("abc",3),collapse = "")
    # [1] "abcabcabc"

Section 7.3: Paste vectors

    stri_paste(LETTERS,"-", 1:13)
    # [1] "A-1" "B-2" "C-3" "D-4" "E-5" "F-6" "G-7" "H-8" "I-9" "J-10" "K-11" "L-12" "M-13"
    # [14] "N-1" "O-2" "P-3" "Q-4" "R-5" "S-6" "T-7" "U-8" "V-9" "W-10" "X-11" "Y-12" "Z-13"

Natively, we could do this in R via:

    paste(LETTERS,1:13,sep="-")
    #[1] "A-1" "B-2" "C-3" "D-4" "E-5" "F-6" "G-7" "H-8" "I-9" "J-10" "K-11" "L-12" "M-13"
    #[14] "N-1" "O-2" "P-3" "Q-4" "R-5" "S-6" "T-7" "U-8" "V-9" "W-10" "X-11" "Y-12" "Z-13"

Section 7.4: Splitting text by some fixed pattern

Split vector of texts using one pattern:
    stri_split_fixed(c("To be or not to be.", "This is very short sentence.")," ")
    # [[1]]
    # [1] "To"  "be"  "or"  "not" "to"  "be."
    #
    # [[2]]
    # [1] "This"      "is"        "very"      "short"     "sentence."

Split one text using many patterns:

    stri_split_fixed("Apples, oranges and pineapples.",c(" ", ",", "s"))
    # [[1]]
    # [1] "Apples,"     "oranges"     "and"         "pineapples."
    #
    # [[2]]
    # [1] "Apples"                   " oranges and pineapples."
    #
    # [[3]]
    # [1] "Apple"          ", orange"       " and pineapple" "."

Chapter 8: Classes

The class of a data object determines which functions will process its contents. The class attribute is a character vector, and objects can have zero, one or more classes. If there is no class attribute, there will still be an implicit class determined by an object's mode. The class can be inspected with the function class and it can be set or modified by the class<- function. The S3 class system was established early in S's history. The more complex S4 class system was established later.

Section 8.1: Inspect classes

Every object in R is assigned a class. You can use class() to find the object's class and str() to see its structure, including the classes it contains. For example:

    class(iris)
    [1] "data.frame"

    str(iris)
    'data.frame': 150 obs. of 5 variables:
     $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
     $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
     $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
     $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
     $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 ...

    class(iris$Species)
    [1] "factor"

We see that iris has the class data.frame, and using str() allows us to examine the data inside. The variable Species in the iris data frame is of class factor, in contrast to the other variables, which are of class numeric.
The str() function also provides the length of the variables and shows the first couple of observations, while the class() function only provides the object's class.

Section 8.2: Vectors and lists

Data in R are stored in vectors. A typical vector is a sequence of values all having the same storage mode (e.g., character vectors, numeric vectors). See ?atomic for details on the atomic implicit classes and their corresponding storage modes: "logical", "integer", "numeric" (synonym "double"), "complex", "character" and "raw". Many classes are simply an atomic vector with a class attribute on top:

    x <- 1826
    class(x) <- "Date"
    x
    # [1] "1975-01-01"

    x <- as.Date("1970-01-01")
    class(x)
    #[1] "Date"
    is(x,"Date")
    #[1] TRUE
    is(x,"integer")
    #[1] FALSE
    is(x,"numeric")
    #[1] FALSE
    mode(x)
    #[1] "numeric"

Lists are a special type of vector where each element can be anything, even another list, hence the R term for lists: "recursive vectors":

    mylist <- list( A = c(5,6,7,8), B = letters[1:10], CC = list( 5, "Z") )

Lists have two very important uses:

Since functions can only return a single value, it is common to return complicated results in a list:

    f <- function(x) list(xplus = x + 10, xsq = x^2)
    f(7)
    # $xplus
    # [1] 17
    #
    # $xsq
    # [1] 49

Lists are also the underlying fundamental class for data frames. Under the hood, a data frame is a list of vectors all having the same length:

    L <- list(x = 1:2, y = c("A","B"))
    DF <- data.frame(L)
    DF
    #   x y
    # 1 1 A
    # 2 2 B
    is.list(DF)
    # [1] TRUE

The other class of recursive vectors is R expressions, which are "language" objects.

Section 8.3: Vectors

The simplest data structure available in R is a vector. You can make vectors of numeric values, logical values, and character strings using the c() function. For example:

    c(1, 2, 3)
    ## [1] 1 2 3
    c(TRUE, TRUE, FALSE)
    ## [1] TRUE TRUE FALSE
    c("a", "b", "c")
    ## [1] "a" "b" "c"

You can also join two vectors using the c() function.
    x <- c(1, 2, 5)
    y <- c(3, 4, 6)
    z <- c(x, y)
    z
    ## [1] 1 2 5 3 4 6

A more elaborate treatment of how to create vectors can be found in the "Creating vectors" topic.

Chapter 9: Lists

Section 9.1: Introduction to lists

Lists allow users to store multiple elements (like vectors and matrices) under a single object. You can use the list function to create a list:

    l1 <- list(c(1, 2, 3), c("a", "b", "c"))
    l1
    ## [[1]]
    ## [1] 1 2 3
    ##
    ## [[2]]
    ## [1] "a" "b" "c"

Notice that the vectors that make up the above list are of different classes. Lists allow users to group elements of different classes. Each element in a list can also have a name. List names are accessed by the names function, and are assigned in the same manner as row and column names are assigned in a matrix.

    names(l1)
    ## NULL
    names(l1) <- c("vector1", "vector2")
    l1
    ## $vector1
    ## [1] 1 2 3
    ##
    ## $vector2
    ## [1] "a" "b" "c"

It is often easier and safer to declare the list names when creating the list object.

    l2 <- list(vec = c(1, 3, 5, 7, 9),
               mat = matrix(data = c(1, 2, 3), nrow = 3))
    l2
    ## $vec
    ## [1] 1 3 5 7 9
    ##
    ## $mat
    ##      [,1]
    ## [1,]    1
    ## [2,]    2
    ## [3,]    3
    names(l2)
    ## [1] "vec" "mat"

Above, the list has two elements, named "vec" and "mat", a vector and a matrix, respectively.

Section 9.2: Quick Introduction to Lists

In general, most of the objects you interact with as a user tend to be vectors, e.g. a numeric vector or a logical vector. These objects can only hold a single type of value (a numeric vector can only have numbers inside it). A list can store values of any type, which makes it the generic object for storing whatever kinds of variables we need.

Example of initializing a list

    exampleList1 <- list('a', 'b')
    exampleList2 <- list(1, 2)
    exampleList3 <- list('a', 1, 2)

In order to understand the data that was defined in the list, we can use the str function.
    str(exampleList1)
    str(exampleList2)
    str(exampleList3)

Subsetting of lists distinguishes between extracting a slice of the list, i.e. obtaining a list containing a subset of the elements in the original list, and extracting a single element. Using the [ operator commonly used for vectors produces a new list.

    # Returns List
    exampleList3[1]
    exampleList3[1:2]

To obtain a single element use [[ instead.

    # Returns Character
    exampleList3[[1]]

List entries may be named:

    exampleList4 <- list(
        num = 1:3,
        numeric = 0.5,
        char = c('a', 'b')
    )

The entries in named lists can be accessed by their name instead of their index.

    exampleList4[['char']]

Alternatively the $ operator can be used to access named elements.

    exampleList4$num

This has the advantage that it is faster to type and may be easier to read, but it is important to be aware of a potential pitfall. The $ operator uses partial matching to identify matching list elements and may produce unexpected results.

    exampleList5 <- exampleList4[2:3]

    exampleList4$num
    # c(1, 2, 3)

    exampleList5$num
    # 0.5

    exampleList5[['num']]
    # NULL

Lists can be particularly useful because they can store objects of different lengths and of various classes.

    ## Numeric vector
    exampleVector1 <- c(12, 13, 14)
    ## Character vector
    exampleVector2 <- c("a", "b", "c", "d", "e", "f")
    ## Matrix
    exampleMatrix1 <- matrix(rnorm(4), ncol = 2, nrow = 2)
    ## List
    exampleList3 <- list('a', 1, 2)

    exampleList6 <- list(
        num = exampleVector1,
        char = exampleVector2,
        mat = exampleMatrix1,
        list = exampleList3
    )
    exampleList6
    #$num
    #[1] 12 13 14
    #
    #$char
    #[1] "a" "b" "c" "d" "e" "f"
    #
    #$mat
    #          [,1]        [,2]
    #[1,] 0.5013050 -1.88801542
    #[2,] 0.4295266  0.09751379
    #
    #$list
    #$list[[1]]
    #[1] "a"
    #
    #$list[[2]]
    #[1] 1
    #
    #$list[[3]]
    #[1] 2

Section 9.3: Serialization: using lists to pass information

There exist cases in which it is necessary to put data of different types together.
In Azure ML, for example, it is necessary to pass information from one R script module to another exclusively through data frames. Suppose we have a data frame and a number:

    > df
           name height       team fun_index title age        desc Y
    1    Andrea    195      Lazio        97     6  33  eccellente 1
    2      Paja    165 Fiorentina        87     6  31      deciso 1
    3      Roro    190      Lazio        65     6  28      strano 0
    4    Gioele     70      Lazio       100     0   2   simpatico 1
    5     Cacio    170   Juventus        81     3  33        duro 0
    6     Edola    171      Lazio        72     5  32    svampito 1
    7    Salami    175      Inter        75     3  30 doppiopasso 1
    8    Braugo    180      Inter        79     5  32         gjn 0
    9     Benna    158   Juventus        80     6  28    esaurito 0
    10   Riggio    182      Lazio        92     5  31    certezza 1
    11 Giordano    185       Roma        79     5  29       buono 1

    > number <- "42"

We can access this information:

    > paste(df$name[4],"is a",df$team[4], "supporter." )
    [1] "Gioele is a Lazio supporter."
    > paste("The answer to THE question is", number )
    [1] "The answer to THE question is 42"

In order to put different types of data in a data frame we have to use a list object and serialization. In particular, we have to put the data in a generic list and then put the serialized list in a data frame:

    l <- list(df,number)
    dataframe_container <- data.frame(out2 = as.integer(serialize(l, connection=NULL)))

Once we have stored the information in the data frame, we need to deserialize it in order to use it:

    #----- unserialize ----------------------------------------+
    unser_obj <- unserialize(as.raw(dataframe_container$out2))
    #----- taking back the elements----------------------------+
    df_mod <- unser_obj[[1]]
    number_mod <- unser_obj[[2]]

Then, we can verify that the data were transferred correctly:

    > paste(df_mod$name[4],"is a",df_mod$team[4], "supporter." )
    [1] "Gioele is a Lazio supporter."
    > paste("The answer to THE question is", number_mod )
    [1] "The answer to THE question is 42"

Chapter 10: Hashmaps

Section 10.1: Environments as hash maps

Note: in the subsequent passages, the terms hash map and hash table are used interchangeably and refer to the same concept, namely, a data structure providing efficient key lookup through use of an internal hash function.

Introduction

Although R does not provide a native hash table structure, similar functionality can be achieved by leveraging the fact that the environment object returned from new.env (by default) provides hashed key lookups. The following two statements are equivalent, as the hash parameter defaults to TRUE:

    H <- new.env(hash = TRUE)
    H <- new.env()

Additionally, one may specify that the internal hash table is pre-allocated with a particular size via the size parameter, which has a default value of 29. Like all other R objects, environments manage their own memory and will grow in capacity as needed, so while it is not necessary to request a non-default value for size, there may be a slight performance advantage in doing so if the object will (eventually) contain a very large number of elements.
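A minimal sketch of this growth behaviour: the environment below is created with a deliberately small size and then filled with many more keys than that. The initial size of 5 and the key names are arbitrary choices made for illustration.

```r
# Deliberately small initial hash table size (arbitrary value for illustration)
H_small <- new.env(hash = TRUE, size = 5L)

# Insert far more keys than the initial size; the environment grows as needed
for (k in paste0("key", 1:100)) {
  H_small[[k]] <- nchar(k)
}

n_keys <- length(ls(H_small))   # number of keys stored
n_keys
H_small[["key100"]]             # lookup still works after growth
```

All 100 keys remain retrievable, so size only affects the initial allocation and not how many elements the environment can hold.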
It is worth noting that allocating extra space via size does not, in itself, result in an object with a larger memory footprint:

    object.size(new.env())
    # 56 bytes
    object.size(new.env(size = 10e4))
    # 56 bytes

Insertion

Insertion of elements may be done using either of the [[<- or $<- methods provided for the environment class, but not by using "single bracket" assignment ([<-):

    H <- new.env()
    H[["key"]] <- rnorm(1)
    key2 <- "xyz"
    H[[key2]] <- data.frame(x = 1:3, y = letters[1:3])
    H$another_key <- matrix(rbinom(9, 1, 0.5) > 0, nrow = 3)
    H["error"] <- 42
    #Error in H["error"] <- 42 :
    #  object of type 'environment' is not subsettable

Like other facets of R, the first method (object[[key]] <- value) is generally preferred to the second (object$key <- value) because in the former case a variable may be used instead of a literal value (e.g. key2 in the example above).

As is generally the case with hash map implementations, the environment object will not store duplicate keys. Attempting to insert a key-value pair for an existing key will replace the previously stored value:

    H[["key3"]] <- "original value"
    H[["key3"]] <- "new value"
    H[["key3"]]
    #[1] "new value"

Key Lookup

Likewise, elements may be accessed with [[ or $, but not with [:

    H[["key"]]
    #[1] 1.630631

    H[[key2]] ## assuming key2 <- "xyz"
    #   x y
    # 1 1 a
    # 2 2 b
    # 3 3 c

    H$another_key
    #       [,1]  [,2]  [,3]
    # [1,]  TRUE  TRUE  TRUE
    # [2,] FALSE FALSE FALSE
    # [3,]  TRUE  TRUE  TRUE

    H[1]
    #Error in H[1] : object of type 'environment' is not subsettable

Inspecting the Hash Map

Being just an ordinary environment, the hash map can be inspected by typical means:

    names(H)
    #[1] "another_key" "xyz" "key" "key3"

    ls(H)
    #[1] "another_key" "key" "key3" "xyz"

    str(H)
    # (prints the environment and its memory address)

    ls.str(H)
    # another_key : logi [1:3, 1:3] TRUE FALSE TRUE TRUE FALSE TRUE ...
    # key : num 1.63
    # key3 : chr "new value"
    # xyz : 'data.frame': 3 obs. of 2 variables:
    #  $ x: int 1 2 3
    #  $ y: chr "a" "b" "c"

Elements can be removed using rm:

    rm(list = c("key", "key3"), envir = H)
    ls.str(H)
    # another_key : logi [1:3, 1:3] TRUE FALSE TRUE TRUE FALSE TRUE ...
    # xyz : 'data.frame': 3 obs. of 2 variables:
    #  $ x: int 1 2 3
    #  $ y: chr "a" "b" "c"

Flexibility

One of the major benefits of using environment objects as hash tables is their ability to store virtually any type of object as a value, even other environments:

    H2 <- new.env()
    H2[["a"]] <- LETTERS
    H2[["b"]] <- as.list(x = 1:5, y = matrix(rnorm(10), 2))
    H2[["c"]] <- head(mtcars, 3)
    H2[["d"]] <- Sys.Date()
    H2[["e"]] <- Sys.time()
    H2[["f"]] <- (function() {
        H3 <- new.env()
        for (i in seq_along(names(H2))) {
            H3[[names(H2)[i]]] <- H2[[names(H2)[i]]]
        }
        H3
    })()

    ls.str(H2)
    # a : chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" ...
    # b : List of 5
    #  $ : int 1
    #  $ : int 2
    #  $ : int 3
    #  $ : int 4
    #  $ : int 5
    # c : 'data.frame': 3 obs. of 11 variables:
    #  $ mpg : num 21 21 22.8
    #  $ cyl : num 6 6 4
    #  $ disp: num 160 160 108
    #  $ hp : num 110 110 93
    #  $ drat: num 3.9 3.9 3.85
    #  $ wt : num 2.62 2.88 2.32
    #  $ qsec: num 16.5 17 18.6
    #  $ vs : num 0 0 1
    #  $ am : num 1 1 1
    #  $ gear: num 4 4 4
    #  $ carb: num 4 4 1
    # d : Date[1:1], format: "2016-08-03"
    # e : POSIXct[1:1], format: "2016-08-03 19:25:14"
    # f : <environment>

    ls.str(H2$f)
    # a : chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" ...
    # b : List of 5
    #  $ : int 1
    #  $ : int 2
    #  $ : int 3
    #  $ : int 4
    #  $ : int 5
    # c : 'data.frame': 3 obs. of 11 variables:
    #  $ mpg : num 21 21 22.8
    #  $ cyl : num 6 6 4
    #  $ disp: num 160 160 108
    #  $ hp : num 110 110 93
    #  $ drat: num 3.9 3.9 3.85
    #  $ wt : num 2.62 2.88 2.32
    #  $ qsec: num 16.5 17 18.6
    #  $ vs : num 0 0 1
    #  $ am : num 1 1 1
    #  $ gear: num 4 4 4
    #  $ carb: num 4 4 1
    # d : Date[1:1], format: "2016-08-03"
    # e : POSIXct[1:1], format: "2016-08-03 19:25:14"

Limitations

One of the major limitations of using environment objects as hash maps is that, unlike many aspects of R, vectorization is not supported for element lookup or insertion:

    names(H2)
    #[1] "a" "b" "c" "d" "e" "f"

    H2[[c("a", "b")]]
    #Error in H2[[c("a", "b")]] :
    #  wrong arguments for subsetting an environment

    Keys <- c("a", "b")
    H2[[Keys]]
    #Error in H2[[Keys]] : wrong arguments for subsetting an environment

Depending on the nature of the data being stored in the object, it may be possible to use vapply or list2env for assigning many elements at once:

    E1 <- new.env()
    invisible({
        vapply(letters, function(x) {
            E1[[x]] <- rnorm(1)
            logical(0)
        }, FUN.VALUE = logical(0))
    })

    all.equal(sort(names(E1)), letters)
    #[1] TRUE

    Keys <- letters
    E2 <- list2env(
        setNames(
            as.list(rnorm(26)),
            nm = Keys),
        envir = NULL,
        hash = TRUE
    )

    all.equal(sort(names(E2)), letters)
    #[1] TRUE

Neither of the above is particularly concise, but they may be preferable to using a for loop, etc., when the number of key-value pairs is large.

Section 10.2: package:hash

The hash package offers a hash structure in R. However, in terms of timing for both inserts and reads it compares unfavorably to using environments as a hash. This documentation simply acknowledges its existence and provides sample timing code below for the reasons stated above. There is no identified case where hash is an appropriate solution in R code today.
Consider:

    # Generic unique string generator
    unique_strings <- function(n){
        string_i <- 1
        string_len <- 1
        ans <- character(n)
        chars <- c(letters,LETTERS)
        new_strings <- function(len,pfx){
            for(i in 1:length(chars)){
                if (len == 1){
                    ans[string_i] <<- paste(pfx,chars[i],sep='')
                    string_i <<- string_i + 1
                } else {
                    new_strings(len-1,pfx=paste(pfx,chars[i],sep=''))
                }
                if (string_i > n) return ()
            }
        }
        while(string_i <= n){
            new_strings(string_len,'')
            string_len <- string_len + 1
        }
        sample(ans)
    }

    # Generate timings using an environment
    timingsEnv <- plyr::adply(2^(10:15),.mar=1,.fun=function(i){
        strings <- unique_strings(i)
        ht1 <- new.env(hash=TRUE)
        lapply(strings, function(s){ ht1[[s]] <<- 0L})
        data.frame(
            size=c(i,i),
            seconds=c(
                system.time(for (j in 1:i) ht1[[strings[j]]]==0L)[3]),
            type = c('1_hashedEnv')
        )
    })

    # Generate timings using the hash package
    timingsHash <- plyr::adply(2^(10:15),.mar=1,.fun=function(i){
        strings <- unique_strings(i)
        ht <- hash::hash()
        lapply(strings, function(s) ht[[s]] <<- 0L)
        data.frame(
            size=c(i,i),
            seconds=c(
                system.time(for (j in 1:i) ht[[strings[j]]]==0L)[3]),
            type = c('3_stringHash')
        )
    })

Section 10.3: package:listenv

Although package:listenv implements a list-like interface to environments, its performance relative to environments for hash-like purposes is poor on hash retrieval. However, if the indexes are numeric, it can be quite fast on retrieval. listenv objects also have other advantages, e.g. compatibility with package:future. Covering this package for that purpose goes beyond the scope of the current topic. However, the timing code provided here can be used in conjunction with the example for package:hash for write timings.
    timingsListEnv <- plyr::adply(2^(10:15),.mar=1,.fun=function(i){
        strings <- unique_strings(i)
        le <- listenv::listenv()
        lapply(strings, function(s) le[[s]] <<- 0L)
        data.frame(
            size=c(i,i),
            seconds=c(
                system.time(for (k in 1:i) le[[k]]==0L)[3]),
            type = c('2_numericListEnv')
        )
    })

Chapter 11: Creating vectors

Section 11.1: Vectors from built-in constants: Sequences of letters & month names

R has a number of built-in constants. The following constants are available:

LETTERS: the 26 upper-case letters of the Roman alphabet
letters: the 26 lower-case letters of the Roman alphabet
month.abb: the three-letter abbreviations for the English month names
month.name: the English names for the months of the year
pi: the ratio of the circumference of a circle to its diameter

From the letters and month constants, vectors can be created.

1) Sequences of letters:

    > letters
    [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

    > LETTERS[7:9]
    [1] "G" "H" "I"

    > letters[c(1,5,3,2,4)]
    [1] "a" "e" "c" "b" "d"

2) Sequences of month abbreviations or month names:

    > month.abb
    [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

    > month.name[1:4]
    [1] "January" "February" "March" "April"

    > month.abb[c(3,6,9,12)]
    [1] "Mar" "Jun" "Sep" "Dec"

Section 11.2: Creating named vectors

Named vectors can be created in several ways.
With c:

    xc <- c('a' = 5, 'b' = 6, 'c' = 7, 'd' = 8)

which results in:

    > xc
    a b c d
    5 6 7 8

With list:

    xl <- list('a' = 5, 'b' = 6, 'c' = 7, 'd' = 8)

which results in:

    > xl
    $a
    [1] 5

    $b
    [1] 6

    $c
    [1] 7

    $d
    [1] 8

With the setNames function, two vectors of the same length can be used to create a named vector:

    x <- 5:8
    y <- letters[1:4]
    xy <- setNames(x, y)

which results in a named integer vector:

    > xy
    a b c d
    5 6 7 8

As can be seen, this gives the same result as the c method.

You may also use the names function to get the same result:

    xy <- 5:8
    names(xy) <- letters[1:4]

With such a vector it is also possible to select elements by name:

    > xy["c"]
    c
    7

This feature makes it possible to use such a named vector as a look-up vector/table to match the values to values of another vector or column in a data frame. Consider the following data frame:

    mydf <- data.frame(let = c('c','a','b','d'))

    > mydf
      let
    1   c
    2   a
    3   b
    4   d

Suppose you want to create a new variable in the mydf data frame called num with the correct values from xy in the rows. Using the match function, the appropriate values from xy can be selected:

    mydf$num <- xy[match(mydf$let, names(xy))]

which results in:

    > mydf
      let num
    1   c   7
    2   a   5
    3   b   6
    4   d   8

Section 11.3: Sequence of numbers

Use the : operator to create sequences of numbers, such as for use in vectorizing larger chunks of your code:

    x <- 1:5
    x
    ## [1] 1 2 3 4 5

This works both ways

    10:4
    # [1] 10 9 8 7 6 5 4

and even with floating point numbers

    1.25:5
    # [1] 1.25 2.25 3.25 4.25

or negatives

    -4:4
    #[1] -4 -3 -2 -1 0 1 2 3 4

Section 11.4: seq()

seq is a more flexible function than the : operator, allowing steps other than 1 to be specified.

The function creates a sequence from the start (default is 1) up to the end, including the end value when the step lands on it exactly.
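A small sketch of how the end value is treated when a step is given through the by argument: the end is part of the result only if the sequence lands on it exactly. The variable names s_exact and s_inexact are arbitrary.

```r
s_exact   <- seq(2, 10, by = 4)   # 2, 6, 10: the step lands exactly on the end
s_inexact <- seq(1, 10, by = 4)   # 1, 5, 9: the next value (13) would overshoot 10

s_exact
# [1]  2  6 10
s_inexact
# [1] 1 5 9
```

In the second call the sequence stops at the last value that does not go past the end.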
You can supply only the end (to) parameter

    seq(5)
    # [1] 1 2 3 4 5

As well as the start

    seq(2, 5) # or seq(from=2, to=5)
    # [1] 2 3 4 5

And finally the step (by)

    seq(2, 5, 0.5) # or seq(from=2, to=5, by=0.5)
    # [1] 2.0 2.5 3.0 3.5 4.0 4.5 5.0

seq can optionally infer the (evenly spaced) steps when, alternatively, the desired length of the output (length.out) is supplied

    seq(2, 5, length.out = 11)
    # [1] 2.0 2.3 2.6 2.9 3.2 3.5 3.8 4.1 4.4 4.7 5.0

If the sequence needs to have the same length as another vector, we can use along.with as a shorthand for length.out = length(x)

    x = 1:8
    seq(2,5,along.with = x)
    # [1] 2.000000 2.428571 2.857143 3.285714 3.714286 4.142857 4.571429 5.000000

There are three useful simplified functions in the seq family: seq_along, seq_len, and seq.int. The seq_along and seq_len functions construct the natural (counting) numbers from 1 through N, where N is determined by the function argument: the length of a vector or list with seq_along, and the integer argument with seq_len.

    seq_along(x)
    # [1] 1 2 3 4 5 6 7 8

Note that seq_along returns the indices of an existing object.

    # counting numbers 1 through 10
    seq_len(10)
    # [1] 1 2 3 4 5 6 7 8 9 10

    # indices of existing vector (or list) with seq_along
    letters[1:10]
    # [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
    seq_along(letters[1:10])
    # [1] 1 2 3 4 5 6 7 8 9 10

seq.int behaves like seq for ordinary numeric arguments; it is a primitive that can be much faster but has a few restrictions.

There is also an old function, sequence, that creates a vector of sequences from a vector of non-negative arguments.

    sequence(4)
    # [1] 1 2 3 4
    sequence(c(3, 2))
    # [1] 1 2 3 1 2
    sequence(c(3, 2, 5))
    # [1] 1 2 3 1 2 1 2 3 4 5

Section 11.5: Vectors

Vectors in R can have different types (e.g. integer, logical, character). The most general way of defining a vector is by using the function vector().

    vector('integer',2)   # creates a vector of integers of size 2.
    vector('character',2) # creates a vector of characters of size 2.
    vector('logical',2)   # creates a vector of logicals of size 2.

However, in R, the shorthand functions are generally more popular.

    integer(2)   # is the same as vector('integer',2) and creates an integer vector with two elements
    character(2) # is the same as vector('character',2) and creates a character vector with two elements
    logical(2)   # is the same as vector('logical',2) and creates a logical vector with two elements

Creating vectors with values other than the default values is also possible. Often the function c() is used for this. The c is short for combine or concatenate.

    c(1, 2)     # creates a numeric (double) vector of two elements: 1 and 2.
    c('a', 'b') # creates a character vector of two elements: a and b.
    c(T,F)      # creates a logical vector of two elements: TRUE and FALSE.

Important to note here is that R interprets any number (e.g. 1 or 1.1) as a numeric vector of length one. The same holds for integers (e.g. 1L), logicals (e.g. T or F), and characters (e.g. 'a'). Therefore, you are in essence combining vectors, and the result is again a vector.

Take care to combine vectors of the same type. Otherwise, R will convert the values to a single common type.

    c(1,1.1,'a',T) # all elements (numeric, character and logical) are converted to the most general type, which is character.

Finding elements in vectors can be done with the [ operator.

    vec_int <- c(1,2,3)
    vec_char <- c('a','b','c')
    vec_int[2]  # accessing the second element will return 2
    vec_char[2] # accessing the second element will return 'b'

This can also be used to change values

    vec_int[2] <- 5 # change the second value from 2 to 5
    vec_int         # returns [1] 1 5 3

Finally, the : operator (short for the function seq()) can be used to quickly create a vector of numbers.
    vec_int <- 1:10
    vec_int # returns [1] 1 2 3 4 5 6 7 8 9 10

This can also be used to subset vectors (from easy to more complex subsets)

    vec_char <- c('a','b','c','d','e')
    vec_char[2:4] # returns [1] "b" "c" "d"
    vec_char[c(1,3,5)] # returns [1] "a" "c" "e"

Section 11.6: Expanding a vector with the rep() function

The rep function can be used to repeat a vector in a fairly flexible manner.

    # repeat counting numbers, 1 through 5 twice
    rep(1:5, 2)
    [1] 1 2 3 4 5 1 2 3 4 5

    # repeat vector with incomplete recycling
    rep(1:5, 2, length.out=7)
    [1] 1 2 3 4 5 1 2

The each argument is especially useful for expanding a vector of statistics of observational/experimental units into a data.frame column with repeated observations of these units.

    # same except repeat each integer next to each other
    rep(1:5, each=2)
    [1] 1 1 2 2 3 3 4 4 5 5

A nice feature of rep for this kind of expansion is that expanding a vector to an unbalanced panel can be accomplished by replacing the scalar times argument with a vector that dictates the number of times to repeat each element:

    # automated length repetition
    rep(1:5, 1:5)
    [1] 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5

    # hand-fed repetition length vector
    rep(1:5, c(1,1,1,2,2))
    [1] 1 2 3 4 4 5 5

This opens up the possibility of letting an external function feed the second argument of rep in order to dynamically construct a vector that expands according to the data.

As with seq, there are faster, simplified versions of rep: rep_len and rep.int. These drop some attributes that rep maintains and so may be most useful in situations where speed is a concern and additional aspects of the repeated vector are unnecessary.
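To make the remark about dropped attributes concrete, here is a minimal sketch using names as the attribute. The named vector v is made up for illustration; the behaviour shown is the one described in ?rep, where the default method of rep carries names over while rep.int and rep_len return no attributes.

```r
v <- c(a = 1, b = 2)         # hypothetical named vector, for illustration only

names(rep(v, 2))             # names are repeated along with the values
names(rep.int(v, 2))         # NULL: rep.int drops the names
names(rep_len(v, 3))         # NULL: rep_len drops the names
```

If the names matter downstream (for example in a look-up vector), rep is the safer choice; the simplified versions are mainly worth it in tight loops.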
    # repeat counting numbers, 1 through 5 twice
    rep.int(1:5, 2)
    [1] 1 2 3 4 5 1 2 3 4 5

    # repeat vector with incomplete recycling
    rep_len(1:5, length.out=7)
    [1] 1 2 3 4 5 1 2

Chapter 12: Date and Time

R comes with classes for dates, date-times and time differences; see ?Dates, ?DateTimeClasses, ?difftime and follow the "See Also" section of those docs for further documentation. Related Docs: Dates and Date-Time Classes.

Section 12.1: Current Date and Time

R is able to access the current date, time and time zone:

    Sys.Date() # Returns date as a Date object
    ## [1] "2016-07-21"

    Sys.time() # Returns date & time at current locale as a POSIXct object
    ## [1] "2016-07-21 10:04:39 CDT"

    as.numeric(Sys.time()) # Seconds from UNIX Epoch (1970-01-01 00:00:00 UTC)
    ## [1] 1469113479

    Sys.timezone() # Time zone at current location
    ## [1] "Australia/Melbourne"

Use OlsonNames() to view the time zone names in the Olson/IANA database on the current system:

    str(OlsonNames())
    ## chr [1:589] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa" "Africa/Algiers" "Africa/Asmara" "Africa/Asmera" "Africa/Bamako" ...
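As a small sketch of how these pieces fit together, the current time can be displayed in any zone listed by OlsonNames() through the tz argument of format(). The zone "Asia/Tokyo" is just an example choice, and no output is shown because it depends on when and where the code is run.

```r
now <- Sys.time()                               # POSIXct: one absolute instant

format(now, tz = "UTC", usetz = TRUE)           # the same instant displayed in UTC
format(now, tz = "Asia/Tokyo", usetz = TRUE)    # the same instant displayed in Tokyo time

"Asia/Tokyo" %in% OlsonNames()                  # check that a zone name is known on this system
```

Changing tz only changes how the instant is printed; the underlying number of seconds since the epoch stays the same.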
Section 12.2: Go to the End of the Month

Let's say we want to go to the last day of the month. The following function does that:

    eom <- function(x, p=as.POSIXlt(x)) as.Date(modifyList(p, list(mon=p$mon + 1, mday=0)))

Test:

    x <- seq(as.POSIXct("2000-12-10"),as.POSIXct("2001-05-10"),by="months")
    > data.frame(before=x,after=eom(x))
          before      after
    1 2000-12-10 2000-12-31
    2 2001-01-10 2001-01-31
    3 2001-02-10 2001-02-28
    4 2001-03-10 2001-03-31
    5 2001-04-10 2001-04-30
    6 2001-05-10 2001-05-31
    >

Using a date in a string format:

    > eom('2000-01-01')
    [1] "2000-01-31"

Section 12.3: Go to First Day of the Month

Let's say we want to go to the first day of a given month:

    date <- as.Date("2017-01-20")
    > as.POSIXlt(cut(date, "month"))
    [1] "2017-01-01 EST"

Section 12.4: Move a date a number of months consistently by months

Let's say we want to move a given date by a number of months. We can define the following function, which uses the mondate package:

    library(mondate) # provides mondate()
    moveNumOfMonths <- function(date, num) {
        as.Date(mondate(date) + num)
    }

It moves the month part of the date consistently and adjusts the day when the date refers to the last day of the month. For example:

Back one month:

    > moveNumOfMonths("2017-10-30",-1)
    [1] "2017-09-30"

Back two months:

    > moveNumOfMonths("2017-10-30",-2)
    [1] "2017-08-30"

Forward two months:

    > moveNumOfMonths("2017-02-28", 2)
    [1] "2017-04-30"

It moves two months from the last day of February and therefore lands on the last day of April.
Let's see how it works for backward and forward operations when it is the last day of the month:

    > moveNumOfMonths("2016-11-30", 2)
    [1] "2017-01-31"
    > moveNumOfMonths("2017-01-31", -2)
    [1] "2016-11-30"

Because November has 30 days, we get the same date in the backward operation, but:

    > moveNumOfMonths("2017-01-30", -2)
    [1] "2016-11-30"
    > moveNumOfMonths("2016-11-30", 2)
    [1] "2017-01-31"

Because January has 31 days, moving two months forward from the last day of November gives the last day of January.

Chapter 13: The Date class

Section 13.1: Formatting Dates

To format dates we use the format(date, format="%Y-%m-%d") function. It works on Date objects (as in the example below) as well as on POSIXct (given from as.POSIXct()) and POSIXlt (given from as.POSIXlt()) objects.

    d = as.Date("2016-07-21") # example date

    format(d,"%a") # Abbreviated Weekday
    ## [1] "Thu"

    format(d,"%A") # Full Weekday
    ## [1] "Thursday"

    format(d,"%b") # Abbreviated Month
    ## [1] "Jul"

    format(d,"%B") # Full Month
    ## [1] "July"

    format(d,"%m") # 01-12 Month Format
    ## [1] "07"

    format(d,"%d") # 01-31 Day Format
    ## [1] "21"

    format(d,"%e") # 1-31 Day Format (single digits padded with a leading space)
    ## [1] "21"

    format(d,"%y") # 00-99 Year
    ## [1] "16"

    format(d,"%Y") # Year with Century
    ## [1] "2016"

For more, see ?strptime.

Section 13.2: Parsing Strings into Date Objects

R contains a Date class, which is created with as.Date(). It takes a string or vector of strings and, if the date is not in ISO 8601 date format YYYY-MM-DD, a formatting string of strptime-style tokens.
    as.Date('2016-08-01') # in ISO format, so does not require formatting string
    ## [1] "2016-08-01"

    as.Date('05/23/16', format = '%m/%d/%y')
    ## [1] "2016-05-23"

    as.Date('March 23rd, 2016', '%B %drd, %Y') # add separators and literals to format
    ## [1] "2016-03-23"

    as.Date(' 2016-08-01 foo') # leading whitespace and all trailing characters are ignored
    ## [1] "2016-08-01"

    as.Date(c('2016-01-01', '2016-01-02'))
    # [1] "2016-01-01" "2016-01-02"

Section 13.3: Dates

To coerce a variable to a date use the as.Date() function.

    > x <- as.Date("2016-8-23")
    > x
    [1] "2016-08-23"
    > class(x)
    [1] "Date"

The as.Date() function allows you to provide a format argument. The default is %Y-%m-%d, which is year-month-day.

    > as.Date("23-8-2016", format="%d-%m-%Y") # To read in a European-style date
    [1] "2016-08-23"

The format string can be placed either within a pair of single quotes or double quotes.

Dates are usually expressed in a variety of forms such as: "d-m-yy" or "d-m-YYYY" or "m-d-yy" or "m-d-YYYY" or "YYYY-m-d" or "YYYY-d-m". These formats can also be expressed by replacing "-" by "/". Further, dates are also expressed in forms such as "Nov 6, 1986" or "November 6, 1986" or "6 Nov, 1986" or "6 November, 1986" and so on. The as.Date() function accepts all such character strings, and when we supply the appropriate format of the string, it always outputs the date in the form "YYYY-mm-dd".

Suppose we have a date string "9-6-1962" in the format "%d-%m-%Y".

    #
    # It tries to interpret the string as YYYY-m-d
    #
    > as.Date("9-6-1962")
    [1] "0009-06-19" #interprets as "%Y-%m-%d"
    > as.Date("9/6/1962")
    [1] "0009-06-19" #again interprets as "%Y-%m-%d"
    >
    # It has no problem in understanding, if the date is in form YYYY-m-d or YYYY/m/d
    #
    > as.Date("1962-6-9")
    [1] "1962-06-09" # no problem
    > as.Date("1962/6/9")
    [1] "1962-06-09" # no problem
    >

By specifying the correct format of the input string, we can get the desired results.
We use the following codes for specifying the formats to the as.Date() function.

    Format Code   Meaning
    %d            day
    %m            month
    %y            year in 2-digits
    %Y            year in 4-digits
    %b            abbreviated month in 3 chars
    %B            full name of the month

Consider the following example specifying the format parameter:

    > as.Date("9-6-1962",format="%d-%m-%Y")
    [1] "1962-06-09"
    >

The parameter name format can be omitted.

    > as.Date("9-6-1962", "%d-%m-%Y")
    [1] "1962-06-09"
    >

Sometimes month names abbreviated to their first three characters are used when writing dates. In that case we use the format specifier %b.

    > as.Date("6Nov1962","%d%b%Y")
    [1] "1962-11-06"
    >

Note that there are no '-', '/' or white spaces between the members of the date string. The format string should exactly match the input string. Consider the following example:

    > as.Date("6 Nov, 1962","%d %b, %Y")
    [1] "1962-11-06"
    >

Note that there is a comma in the date string and hence a comma in the format specification too. If the comma is omitted in the format string, the result is NA. An example usage of the %B format specifier is as follows:

    > as.Date("October 12, 2016", "%B %d, %Y")
    [1] "2016-10-12"
    >
    > as.Date("12 October, 2016", "%d %B, %Y")
    [1] "2016-10-12"
    >

How %y maps a two-digit year to a century can be system-specific (see ?strptime), so it should be used with caution.

Other parameters used with this function are origin and tz (time zone).

Chapter 14: Date-time classes (POSIXct and POSIXlt)

R includes two date-time classes -- POSIXct and POSIXlt -- see ?DateTimeClasses.
Section 14.1: Formatting and printing date-time objects

    # test date-time object
    options(digits.secs = 3)
    d = as.POSIXct("2016-08-30 14:18:30.58", tz = "UTC")

    format(d,"%S") # 00-61 Second as integer
    ## [1] "30"

    format(d,"%OS") # 00-60.99… Second as fractional
    ## [1] "30.579"

    format(d,"%M") # 00-59 Minute
    ## [1] "18"

    format(d,"%H") # 00-23 Hours
    ## [1] "14"

    format(d,"%I") # 01-12 Hours
    ## [1] "02"

    format(d,"%p") # AM/PM Indicator
    ## [1] "PM"

    format(d,"%z") # Signed offset
    ## [1] "+0000"

    format(d,"%Z") # Time Zone Abbreviation
    ## [1] "UTC"

See ?strptime for details on the format strings here, as well as other formats.

Section 14.2: Date-time arithmetic

To add/subtract time, use POSIXct, since it stores times in seconds

    ## adding/subtracting times - 60 seconds
    as.POSIXct("2016-01-01") + 60
    # [1] "2016-01-01 00:01:00 AEDT"

    ## adding 3 hours, 14 minutes, 15 seconds
    as.POSIXct("2016-01-01") + ( (3 * 60 * 60) + (14 * 60) + 15)
    # [1] "2016-01-01 03:14:15 AEDT"

More formally, as.difftime can be used to specify time periods to add to a date or datetime object. E.g.:

    as.POSIXct("2016-01-01") +
      as.difftime(3, units="hours") +
      as.difftime(14, units="mins") +
      as.difftime(15, units="secs")
    # [1] "2016-01-01 03:14:15 AEDT"

To find the difference between dates/times use difftime() for differences in seconds, minutes, hours, days or weeks.

    # using POSIXct objects
    difftime(
      as.POSIXct("2016-01-01 12:00:00"),
      as.POSIXct("2016-01-01 11:59:59"),
      units = "secs")
    # Time difference of 1 secs

To generate sequences of date-times use seq.POSIXt() or simply seq.

Section 14.3: Parsing strings into date-time objects

The functions for parsing a string into POSIXct and POSIXlt take similar parameters and return a similar-looking result, but there are differences in how that date-time is stored; see "Remarks."
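The storage difference mentioned above can be inspected directly. The following is a minimal sketch, assuming an arbitrary example timestamp in UTC: a POSIXct value is a single number of seconds since the epoch, while a POSIXlt value is a list of calendar components.

```r
ct <- as.POSIXct("2016-07-21 11:38:00", tz = "UTC")
lt <- as.POSIXlt("2016-07-21 11:38:00", tz = "UTC")

unclass(ct)            # one number (seconds since 1970-01-01 00:00:00 UTC) plus a tzone attribute
lt$hour                # hour component of the list
lt$min                 # minute component
lt$year + 1900         # years are stored as an offset from 1900
lt$mon + 1             # months are stored as 0-11

ct == as.POSIXct(lt)   # both objects describe the same instant
```

POSIXct is usually the more convenient choice for data frame columns and arithmetic; POSIXlt is handy when individual components such as the hour or month are needed.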
    as.POSIXct("11:38", # time string
               format = "%H:%M") # formatting string
    ## [1] "2016-07-21 11:38:00 CDT"

    strptime("11:38", # identical, but makes a POSIXlt object
             format = "%H:%M")
    ## [1] "2016-07-21 11:38:00 CDT"

    as.POSIXct("11 AM", format = "%I %p")
    ## [1] "2016-07-21 11:00:00 CDT"

Note that date and timezone are imputed.

    as.POSIXct("11:38:22", # time string without timezone
               format = "%H:%M:%S",
               tz = "America/New_York") # set time zone
    ## [1] "2016-07-21 11:38:22 EDT"

    as.POSIXct("2016-07-21 00:00:00",
               format = "%F %T") # shortcut tokens for "%Y-%m-%d" and "%H:%M:%S"

See ?strptime for details on the format strings here.

Notes

Missing elements

- If a date element is not supplied, then that from the current date is used.
- If a time element is not supplied, then that from midnight is used, i.e. 0s.
- If no timezone is supplied in either the string or the tz parameter, the local timezone is used.

Time zones

- The accepted values of tz depend on the location.
  - CST is given with "CST6CDT" or "America/Chicago"
- For supported locations and time zones use:
  - In R: OlsonNames()
  - Alternatively, try in R: system("cat $R_HOME/share/zoneinfo/zone.tab")
- These locations are given by Internet Assigned Numbers Authority (IANA)
  - List of tz database time zones (Wikipedia)
  - IANA TZ Data (2016e)

Chapter 15: The character class

Characters are what other languages call 'string vectors.'

Section 15.1: Coercion

To check whether a value is a character use the is.character() function. To coerce a variable to a character use the as.character() function.

    x <- "The quick brown fox jumps over the lazy dog"
    class(x)
    [1] "character"
    is.character(x)
    [1] TRUE

Note that numerics can be coerced to characters, but attempting to coerce a character to numeric may result in NA.
    as.numeric("2")
    [1] 2
    as.numeric("fox")
    [1] NA
    Warning message:
    NAs introduced by coercion

Chapter 16: Numeric classes and storage modes

Section 16.1: Numeric

Numeric represents integers and doubles and is the default mode assigned to vectors of numbers. The function is.numeric() will evaluate whether a vector is numeric. It is important to note that although integers and doubles will pass is.numeric(), the function as.numeric() will always attempt to convert to type double.

    x <- 12.3
    y <- 12L

    #confirm types
    typeof(x)
    [1] "double"
    typeof(y)
    [1] "integer"

    # confirm both numeric
    is.numeric(x)
    [1] TRUE
    is.numeric(y)
    [1] TRUE

    # logical to numeric
    as.numeric(TRUE)
    [1] 1

    # While TRUE == 1, it is a double and not an integer
    is.integer(as.numeric(TRUE))
    [1] FALSE

Doubles are R's default numeric value. They are double precision vectors, meaning that they take up 8 bytes of memory for each value in the vector. R has no single precision data type and so all real numbers are stored in the double precision format.

    is.double(1)
    [1] TRUE
    is.double(1.0)
    [1] TRUE
    is.double(1L)
    [1] FALSE

Integers are whole numbers that can be written without a fractional component. Integers are represented by a number with an L after it. Any number without an L after it will be considered a double.

    typeof(1)
    [1] "double"
    class(1)
    [1] "numeric"
    typeof(1L)
    [1] "integer"
    class(1L)
    [1] "integer"

Though in most cases using an integer or double will not matter, sometimes replacing doubles with integers will consume less memory and operational time. A double vector uses 8 bytes per element while an integer vector uses only 4 bytes per element. As the size of vectors increases, using the appropriate type can noticeably reduce memory use and, in some cases, run time.
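The per-element sizes quoted above can be checked with object.size(). This is a small sketch; the vector length n is an arbitrary choice, and the exact byte counts include a small fixed header, so the ratio is close to, but not exactly, 2.

```r
n <- 1e6
ints <- integer(n)    # integer vector: 4 bytes per element
dbls <- numeric(n)    # double vector: 8 bytes per element

object.size(ints)
object.size(dbls)

# ratio of the two sizes, close to 2
as.numeric(object.size(dbls)) / as.numeric(object.size(ints))
```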
    # test speed on lots of arithmetic
    library(microbenchmark)
    microbenchmark(
      for (i in 1:100000) {
        2L * i
        10L + i
      },
      for (i in 1:100000) {
        2.0 * i
        10.0 + i
      }
    )

    Unit: milliseconds
                                      expr      min       lq     mean   median       uq      max neval
     for (i in 1:1e+05) { 2L * i 10L + i } 40.74775 42.34747 50.70543 42.99120 65.46864 94.11804   100
     for (i in 1:1e+05) { 2 * i 10 + i }   41.07807 42.38358 53.52588 44.26364 65.84971 83.00456   100

Chapter 17: The logical class

Logical is a mode (and an implicit class) for vectors.

Section 17.1: Logical operators

There are two sorts of logical operators: those that accept and return vectors of any length (elementwise operators: !, |, &, xor()) and those that work on a single element in each argument (&&, ||). The second sort is primarily used as the cond argument to the if function. Since R 4.3.0, passing an argument of length greater than one to && or || is an error rather than a silent use of the first element.

    Logical Operator   Meaning                                  Syntax
    !                  Not                                      !x
    &                  element-wise (vectorized) and            x & y
    &&                 and (single element only)                x && y
    |                  element-wise (vectorized) or             x | y
    ||                 or (single element only)                 x || y
    xor                element-wise (vectorized) exclusive OR   xor(x,y)

Note that the || operator evaluates the left condition first, and if the left condition is TRUE the right side is never evaluated. This can save time if the right side is the result of a complex operation. The && operator will likewise return FALSE without evaluating the second argument when the first argument is FALSE.

    > x <- 5
    > x > 6 || stop("X is too small")
    Error: X is too small
    > x > 3 || stop("X is too small")
    [1] TRUE

To check whether a value is a logical you can use the is.logical() function.

Section 17.2: Coercion

To coerce a variable to a logical use the as.logical() function.

    > x <- 2
    > z <- x > 4
    > z
    [1] FALSE
    > class(x)
    [1] "numeric"
    > as.logical(2)
    [1] TRUE

When applying as.numeric() to a logical, a double will be returned. NA is a logical value, and a logical operator with an NA will return NA if the outcome is ambiguous.

Section 17.3: Interpretation of NAs

See Missing values for details.
    > TRUE & NA
    [1] NA
    > FALSE & NA
    [1] FALSE
    > TRUE || NA
    [1] TRUE
    > FALSE || NA
    [1] NA

Chapter 18: Data frames

Section 18.1: Create an empty data.frame

A data.frame is a special kind of list: it is rectangular. Each element (column) of the list has the same length, and each row has a "row name". Each column has its own class, but the class of one column can be different from the class of another column (unlike a matrix, where all elements must have the same class).

In principle, a data.frame could have no rows and no columns:

    > structure(list(character()), class = "data.frame")
    NULL
    <0 rows> (or 0-length row.names)

But this is unusual. It is more common for a data.frame to have many columns and many rows. Here is a data.frame with three rows and two columns (a is integer class and b is character class):

    > structure(list(a = 1:3, b = letters[1:3]), class = "data.frame")
    [1] a b
    <0 rows> (or 0-length row.names)

In order for the data.frame to print, we will need to supply some row names. Here we use just the numbers 1:3:

    > structure(list(a = 1:3, b = letters[1:3]), class = "data.frame", row.names = 1:3)
      a b
    1 1 a
    2 2 b
    3 3 c

Now it becomes obvious that we have a data.frame with 3 rows and 2 columns. You can check this using nrow(), ncol(), and dim():

    > x <- structure(list(a = numeric(3), b = character(3)), class = "data.frame", row.names = 1:3)
    > nrow(x)
    [1] 3
    > ncol(x)
    [1] 2
    > dim(x)
    [1] 3 2

R provides two other functions (besides structure()) that can be used to create a data.frame. The first is called, intuitively, data.frame(). It checks to make sure that the column names you supplied are valid, that the list elements are all the same length, and supplies some automatically generated row names. This means that the output of data.frame() might not always be exactly what you expect:

    > str(data.frame("a a a" = numeric(3), "b-b-b" = character(3)))
    'data.frame': 3 obs. of 2 variables:
     $ a.a.a: num 0 0 0
     $ b.b.b: Factor w/ 1 level "": 1 1 1

(The Factor column shown here is what versions of R before 4.0.0 produce. Since R 4.0.0, where stringsAsFactors defaults to FALSE, the column is reported as chr.)

The other function is called as.data.frame(). This can be used to coerce an object that is not a data.frame into being a data.frame by running it through data.frame(). As an example, consider a matrix:

    > m <- matrix(letters[1:9], nrow = 3)
    > m
         [,1] [,2] [,3]
    [1,] "a"  "d"  "g"
    [2,] "b"  "e"  "h"
    [3,] "c"  "f"  "i"

And the result:

    > as.data.frame(m)
      V1 V2 V3
    1  a  d  g
    2  b  e  h
    3  c  f  i
    > str(as.data.frame(m))
    'data.frame': 3 obs. of 3 variables:
     $ V1: Factor w/ 3 levels "a","b","c": 1 2 3
     $ V2: Factor w/ 3 levels "d","e","f": 1 2 3
     $ V3: Factor w/ 3 levels "g","h","i": 1 2 3

(Again, in R 4.0.0 and later these columns are reported as chr rather than Factor.)

Section 18.2: Subsetting rows and columns from a data frame

Syntax for accessing rows and columns: [, [[, and $

This topic covers the most common syntax to access specific rows and columns of a data frame. These are

- Like a matrix with single brackets data[rows, columns]
  - Using row and column numbers
  - Using column (and row) names
- Like a list:
  - With single brackets data[columns] to get a data frame
  - With double brackets data[[one_column]] to get a vector
- With $ for a single column data$column_name

We will use the built-in mtcars data frame to illustrate.

Like a matrix: data[rows, columns]

With numeric indexes

Using the built-in data frame mtcars, we can extract rows and columns using [] brackets with a comma included. Indices before the comma are rows:

    # get the first row
    mtcars[1, ]
    # get the first five rows
    mtcars[1:5, ]

Similarly, after the comma are columns:

    # get the first column
    mtcars[, 1]
    # get the first, third and fifth columns:
    mtcars[, c(1, 3, 5)]

As shown above, if either rows or columns are left blank, all will be selected. mtcars[1, ] indicates the first row with all the columns.

With column (and row) names

So far, this is identical to how rows and columns of matrices are accessed.
With data.frames, most of the time it is preferable to use a column name rather than a column index. This is done by using a character with the column name instead of a numeric with the column number:

    # get the mpg column
    mtcars[, "mpg"]
    # get the mpg, cyl, and disp columns
    mtcars[, c("mpg", "cyl", "disp")]

Though less common, row names can also be used:

    mtcars["Mazda RX4", ]

Rows and columns together

The row and column arguments can be used together:

    # first four rows of the mpg column
    mtcars[1:4, "mpg"]
    # 2nd and 5th row of the mpg, cyl, and disp columns
    mtcars[c(2, 5), c("mpg", "cyl", "disp")]

A warning about dimensions:

When using these methods, if you extract multiple columns, you will get a data frame back. However, if you extract a single column, you will get a vector, not a data frame, under the default options.

    ## multiple columns returns a data frame
    class(mtcars[, c("mpg", "cyl")])
    # [1] "data.frame"
    ## single column returns a vector
    class(mtcars[, "mpg"])
    # [1] "numeric"

There are two ways around this. One is to treat the data frame as a list (see below), the other is to add a drop = FALSE argument. This tells R to not "drop the unused dimensions":

    class(mtcars[, "mpg", drop = FALSE])
    # [1] "data.frame"

Note that matrices work the same way - by default a single column or row will be a vector, but if you specify drop = FALSE you can keep it as a one-column or one-row matrix.

Like a list

Data frames are essentially lists, i.e., they are a list of column vectors (that all must have the same length). Lists can be subset using single brackets [ for a sub-list, or double brackets [[ for a single element.

With single brackets data[columns]

When you use single brackets and no commas, you will get columns back because data frames are lists of columns.

    mtcars["mpg"]
    mtcars[c("mpg", "cyl", "disp")]
    my_columns <- c("mpg", "cyl", "hp")
    mtcars[my_columns]

Single brackets like a list vs. single brackets like a matrix

The difference between data[columns] and data[, columns] is that when treating the data.frame as a list (no comma in the brackets) the object returned will be a data.frame. If you use a comma to treat the data.frame like a matrix then selecting a single column will return a vector but selecting multiple columns will return a data.frame.

    ## When selecting a single column
    ## like a list will return a data frame
    class(mtcars["mpg"])
    # [1] "data.frame"
    ## like a matrix will return a vector
    class(mtcars[, "mpg"])
    # [1] "numeric"

With double brackets data[[one_column]]

To extract a single column as a vector when treating your data.frame as a list, you can use double brackets [[. This will only work for a single column at a time.

    # extract a single column by name as a vector
    mtcars[["mpg"]]
    # extract a single column by name as a data frame (as above)
    mtcars["mpg"]

Using $ to access columns

A single column can be extracted using the magical shortcut $ without using a quoted column name:

    # get the column "mpg"
    mtcars$mpg

Columns accessed by $ will always be vectors, not data frames.

Drawbacks of $ for accessing columns

The $ can be a convenient shortcut, especially if you are working in an environment (such as RStudio) that will auto complete the column name in this case. However, $ has drawbacks as well: it uses non-standard evaluation to avoid the need for quotes, which means it will not work if your column name is stored in a variable.

    my_column <- "mpg"
    # the below will not work
    mtcars$my_column
    # but these will work
    mtcars[, my_column] # vector
    mtcars[my_column] # one-column data frame
    mtcars[[my_column]] # vector

Due to these concerns, $ is best used in interactive R sessions when your column names are constant. For programmatic use, for example in writing a generalizable function that will be used on different data sets with different column names, $ should be avoided.
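As a minimal sketch of such a generalizable function, the helper below takes the column name as a string and uses [[ to look it up. The name col_mean is hypothetical and not part of base R; it is only meant to illustrate the pattern.

```r
# col_mean() is a hypothetical helper, not a base R function
col_mean <- function(data, column) {
  # fail early if the name is not a single existing column
  stopifnot(is.character(column), length(column) == 1, column %in% names(data))
  mean(data[[column]])   # [[ resolves the name at run time, which $ cannot do
}

col_mean(mtcars, "mpg")
col_mean(mtcars, "hp")

my_column <- "mpg"
is.null(mtcars$my_column)   # TRUE: $ looks for a column literally called "my_column"
```

The same function works unchanged on any data frame with a numeric column, which is exactly the situation where $ falls short.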
Also note that the default behaviour is to use partial matching only when extracting from recursive objects (except environments) by $

    # give you the values of "mpg" column
    # as "mtcars" has only one column having name starting with "m"
    mtcars$m
    # will give you "NULL"
    # as "mtcars" has more than one column having name starting with "d"
    mtcars$d

Advanced indexing: negative and logical indices

Whenever we have the option to use numbers for an index, we can also use negative numbers to omit certain indices or a boolean (logical) vector to indicate exactly which items to keep.

Negative indices omit elements

    mtcars[1, ] # first row
    mtcars[ -1, ] # everything but the first row
    mtcars[-(1:10), ] # everything except the first 10 rows

Logical vectors indicate specific elements to keep

We can use a condition such as < to generate a logical vector, and extract only the rows that meet the condition:

    # logical vector indicating TRUE when a row has mpg less than 15
    # FALSE when a row has mpg >= 15
    test <- mtcars$mpg < 15
    # extract these rows from the data frame
    mtcars[test, ]

We can also bypass the step of saving the intermediate variable

    # extract all columns for rows where the value of cyl is 4.
    mtcars[mtcars$cyl == 4, ]
    # extract the cyl, mpg, and hp columns where the value of cyl is 4
    mtcars[mtcars$cyl == 4, c("cyl", "mpg", "hp")]

Section 18.3: Convenience functions to manipulate data.frames

Some convenience functions to manipulate data.frames are subset(), transform(), with() and within().

subset

The subset() function allows you to subset a data.frame in a more convenient way (subset also works with other classes):

    subset(mtcars, subset = cyl == 6, select = c("mpg", "hp"))

                    mpg  hp
    Mazda RX4      21.0 110
    Mazda RX4 Wag  21.0 110
    Hornet 4 Drive 21.4 110
    Valiant        18.1 105
    Merc 280       19.2 123
    Merc 280C      17.8 123
    Ferrari Dino   19.7 175

In the code above we are asking only for the rows in which cyl == 6 and for the columns mpg and hp.
You could achieve the same result using [] with the following code:

    mtcars[mtcars$cyl == 6, c("mpg", "hp")]

transform

The transform() function is a convenience function to change columns inside a data.frame. For instance, the following code adds another column named mpg2 with the result of mpg^2 to the mtcars data.frame:

    mtcars <- transform(mtcars, mpg2 = mpg^2)

with and within

Both with() and within() let you evaluate expressions inside the data.frame environment, allowing a somewhat cleaner syntax and saving you the use of some $ or []. For example, if you want to create, change and/or remove multiple columns in the airquality data.frame:

    aq <- within(airquality, {
        lOzone <- log(Ozone) # creates new column
        Month <- factor(month.abb[Month]) # changes Month Column
        cTemp <- round((Temp - 32) * 5/9, 1) # creates new column
        S.cT <- Solar.R / cTemp # creates new column
        rm(Day, Temp) # removes columns
    })

Section 18.4: Introduction

Data frames are likely the data structure you will use most in your analyses. A data frame is a special kind of list that stores same-length vectors of different classes. You create data frames using the data.frame function. The example below shows this by combining a numeric and a character vector into a data frame. It uses the : operator, which will create a vector containing all integers from 1 to 3.

    df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
    df1
    ##   x y
    ## 1 1 a
    ## 2 2 b
    ## 3 3 c
    class(df1)
    ## [1] "data.frame"

Data frame objects do not print with quotation marks, so the class of the columns is not always obvious.

    df2 <- data.frame(x = c("1", "2", "3"), y = c("a", "b", "c"))
    df2
    ##   x y
    ## 1 1 a
    ## 2 2 b
    ## 3 3 c

Without further investigation, the "x" columns in df1 and df2 cannot be differentiated. The str function can be used to describe objects with more detail than class.

    str(df1)
    ## 'data.frame': 3 obs. of 2 variables:
    ## $ x: int 1 2 3
    ## $ y: Factor w/ 3 levels "a","b","c": 1 2 3
    str(df2)
    ## 'data.frame': 3 obs. of 2 variables:
    ## $ x: Factor w/ 3 levels "1","2","3": 1 2 3
    ## $ y: Factor w/ 3 levels "a","b","c": 1 2 3

Here you see that df1 is a data.frame and has 3 observations of 2 variables, "x" and "y." Then you are told that "x" has the data type integer (not important for this class, but for our purposes it behaves like a numeric) and "y" is a factor with three levels (another data class we are not discussing).

It is important to note that, in versions of R before 4.0.0, data.frame() coerces character vectors to factors by default, which is what the output above shows. Since R 4.0.0 the default is stringsAsFactors = FALSE, so the same calls produce chr columns. In any version, the behavior can be set explicitly with the stringsAsFactors parameter:

    df3 <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
    str(df3)
    ## 'data.frame': 3 obs. of 2 variables:
    ## $ x: int 1 2 3
    ## $ y: chr "a" "b" "c"

Now the "y" column is a character. As mentioned above, each "column" of a data frame must have the same length. Trying to create a data.frame from vectors with different lengths will result in an error. (Try running data.frame(x = 1:3, y = 1:4) to see the resulting error.)

As test cases for data frames, R provides some data sets by default. One of them is iris, loaded as follows:

    mydataframe <- iris
    str(mydataframe)

Section 18.5: Convert all columns of a data.frame to character class

A common task is to convert all columns of a data.frame to character class for ease of manipulation, such as when sending data.frames to an RDBMS or merging data.frames containing factors where levels may differ between input data.frames.

The best time to do this is when the data is read in: almost all input methods that create data frames have an option stringsAsFactors which can be set to FALSE.

If the data has already been created, factor columns can be converted to character columns as shown below.
# stringsAsFactors = TRUE reproduces the pre-4.0.0 default shown in the output below
bob <- data.frame(jobs = c("scientist", "analyst"),
                  pay = c(160000, 100000), age = c(30, 25),
                  stringsAsFactors = TRUE)
str(bob)
## 'data.frame': 2 obs. of 3 variables:
##  $ jobs: Factor w/ 2 levels "analyst","scientist": 2 1
##  $ pay : num 160000 100000
##  $ age : num 30 25

# Convert *all columns* to character
bob[] <- lapply(bob, as.character)
str(bob)
## 'data.frame': 2 obs. of 3 variables:
##  $ jobs: chr "scientist" "analyst"
##  $ pay : chr "160000" "1e+05"
##  $ age : chr "30" "25"

# Convert only factor columns to character
bob[] <- lapply(bob, function(x) {
    if (is.factor(x)) x <- as.character(x)
    return(x)
})

Chapter 19: Split function

Section 19.1: Using split in the split-apply-combine paradigm

A popular form of data analysis is split-apply-combine, in which you split your data into groups, apply some sort of processing on each group, and then combine the results.

Let's consider a data analysis where we want to obtain the two cars with the best miles per gallon (mpg) for each cylinder count (cyl) in the built-in mtcars dataset. First, we split the mtcars data frame by the cylinder count:

(spl <- split(mtcars, mtcars$cyl))
# $`4`
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
# Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
# Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
# Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
# ...
#
# $`6`
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
# ...
#
# $`8`
#                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
# Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
# Merc 450SL        17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
# ...

This has returned a list of data frames, one for each cylinder count. As indicated by the output, we could obtain the relevant data frames with spl$`4`, spl$`6`, and spl$`8` (some might find it more visually appealing to use spl$"4" or spl[["4"]] instead).

Now, we can use lapply to loop through this list, applying our function that extracts the cars with the best 2 mpg values from each of the list elements:

(best2 <- lapply(spl, function(x) tail(x[order(x$mpg),], 2)))
# $`4`
#                 mpg cyl disp hp drat    wt  qsec vs am gear carb
# Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
# Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
#
# $`6`
#                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#
# $`8`
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
# Pontiac Firebird  19.2   8  400 175 3.08 3.845 17.05  0  0    3    2

Finally, we can combine everything together using rbind. We want to call rbind(best2[["4"]], best2[["6"]], best2[["8"]]), but this would be tedious if we had a huge list.
As a result, we use:

do.call(rbind, best2)
#                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# 4.Fiat 128          32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
# 4.Toyota Corolla    33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# 6.Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# 6.Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# 8.Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# 8.Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2

This returns the result of rbind (argument 1, a function) with all the elements of best2 (argument 2, a list) passed as arguments.

With simple analyses like this one, it can be more compact (and possibly much less readable!) to do the whole split-apply-combine in a single line of code:

do.call(rbind, lapply(split(mtcars, mtcars$cyl), function(x) tail(x[order(x$mpg),], 2)))

It is also worth noting that the lapply(split(x,f), FUN) combination can alternatively be framed using the by function (see ?by):

by(mtcars, mtcars$cyl, function(x) tail(x[order(x$mpg),], 2))
do.call(rbind, by(mtcars, mtcars$cyl, function(x) tail(x[order(x$mpg),], 2)))

Section 19.2: Basic usage of split

split allows you to divide a vector or a data.frame into buckets according to a factor or grouping variable. The buckets are returned as a list, which can then be used to apply group-wise computations (for loops or lapply/sapply).

The first example shows the usage of split on a vector. Consider the following vector of letters:

testdata <- c("e", "o", "r", "g", "a", "y", "w", "q", "i", "s", "b", "v", "x", "h", "u")

The objective is to separate those letters into vowels and consonants, i.e. split them according to letter type.

Let's first create a grouping vector:

vowels <- c('a','e','i','o','u','y')
letter_type <- ifelse(testdata %in% vowels, "vowels", "consonants")

Note that letter_type has the same length as our vector testdata.
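Because split pairs each element with its group label by position, it can be worth checking the grouping vector before splitting. The following is a minimal sketch that repeats the definitions above so it runs on its own; the name group_sizes is introduced here purely for illustration.

```r
testdata <- c("e", "o", "r", "g", "a", "y", "w", "q", "i", "s", "b", "v", "x", "h", "u")
vowels <- c("a", "e", "i", "o", "u", "y")
letter_type <- ifelse(testdata %in% vowels, "vowels", "consonants")

# There must be exactly one group label per element of testdata
stopifnot(length(letter_type) == length(testdata))

# Count how many elements will land in each bucket (illustrative helper variable)
group_sizes <- table(letter_type)
group_sizes
```

With this data the counts should come out as 9 consonants and 6 vowels, since "y" is listed among the vowels.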
Now we can split this test data into the two groups, vowels and consonants:

split(testdata, letter_type)
#$consonants
#[1] "r" "g" "w" "q" "s" "b" "v" "x" "h"
#$vowels
#[1] "e" "o" "a" "y" "i" "u"

Hence, the result is a list whose names come from our grouping vector/factor letter_type.

split also has a method to deal with data.frames. Consider for instance the iris data:

data(iris)

By using split, one can create a list containing one data.frame per iris species (variable: Species):

> liris <- split(iris, iris$Species)
> names(liris)
[1] "setosa" "versicolor" "virginica"
> head(liris$setosa)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

(contains only data for the setosa group).

One example operation would be to compute the correlation matrix per iris species; one would then use lapply:

> (lcor <- lapply(liris, FUN=function(df) cor(df[,1:4])))
$setosa
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000   0.7425467    0.2671758   0.2780984
Sepal.Width     0.7425467   1.0000000    0.1777000   0.2327520
Petal.Length    0.2671758   0.1777000    1.0000000   0.3316300
Petal.Width     0.2780984   0.2327520    0.3316300   1.0000000

$versicolor
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000   0.5259107    0.7540490   0.5464611
Sepal.Width     0.5259107   1.0000000    0.5605221   0.6639987
Petal.Length    0.7540490   0.5605221    1.0000000   0.7866681
Petal.Width     0.5464611   0.6639987    0.7866681   1.0000000

$virginica
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000   0.4572278    0.8642247   0.2811077
Sepal.Width     0.4572278   1.0000000    0.4010446   0.5377280
Petal.Length    0.8642247   0.4010446    1.0000000   0.3221082
Petal.Width     0.2811077   0.5377280    0.3221082   1.0000000

Then we can retrieve, per group, the best pair of correlated variables (the correlation matrix is reshaped/melted, the diagonal is filtered out
and the best record is selected):

> library(reshape)
> (topcor <- lapply(lcor, FUN=function(cormat){
    correlations <- melt(cormat, variable_name="correlation");
    filtered <- correlations[correlations$X1 != correlations$X2,];
    filtered[which.max(filtered$correlation),]
  }))
$setosa
           X1           X2 correlation
2 Sepal.Width Sepal.Length   0.7425467

$versicolor
            X1           X2 correlation
12 Petal.Width Petal.Length   0.7866681

$virginica
            X1           X2 correlation
3 Petal.Length Sepal.Length   0.8642247

Note that once computations have been performed at such a groupwise level, one may be interested in stacking the results, which can be done with:

> (result <- do.call("rbind", topcor))
                     X1           X2 correlation
setosa      Sepal.Width Sepal.Length   0.7425467
versicolor  Petal.Width Petal.Length   0.7866681
virginica  Petal.Length Sepal.Length   0.8642247

Chapter 20: Reading and writing tabular data in plain-text files (CSV, TSV, etc.)

Parameter     Details
file          name of the CSV file to read
header        logical: does the .csv file contain a header row with column names?
sep           character: symbol that separates the cells on each row
quote         character: symbol used to quote character strings
dec           character: symbol used as decimal separator
fill          logical: when TRUE, rows with unequal length are filled with blank fields.
comment.char  character: character used as comment in the csv file. Lines preceded by this character are ignored.
...           extra arguments to be passed to read.table

Section 20.1: Importing .csv files

Importing using base R

Comma separated value files (CSVs) can be imported using read.csv, which wraps read.table, but uses sep = "," to set the delimiter to a comma.
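To make the relationship between the two functions concrete, the sketch below reads the same small CSV text once with read.csv and once with read.table, spelling out the parameters from the table above. The inline csv_text and the variable names via_csv and via_table are made up for illustration; the explicit read.table arguments are the documented read.csv defaults.

```r
# Hypothetical inline data, used instead of a file so the example is self-contained
csv_text <- "id,name,score
1,Ann,2.5
2,Bob,3.0"

via_csv <- read.csv(text = csv_text)

# The defaults of read.csv written out as an explicit read.table call
via_table <- read.table(text = csv_text, header = TRUE, sep = ",",
                        quote = "\"", dec = ".", fill = TRUE,
                        comment.char = "")

# Both calls should yield the same data.frame
identical(via_csv, via_table)
```

Passing text = instead of a file name is only a convenience for the example; with a real file the first argument would be its path.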
# get the file path of a CSV included in R's utils package
csv_path <- system.file("misc", "exDIF.csv", package = "utils")

# path will vary based on installation location
csv_path
## [1] "/Library/Frameworks/R.framework/Resources/library/utils/misc/exDIF.csv"

df <- read.csv(csv_path)
df
##    Var1 Var2
## 1  2.70    A
## 2  3.14    B
## 3 10.00    A
## 4 -7.00    A

A user-friendly option, file.choose, lets you browse through the directories:

df <- read.csv(file.choose())

Notes

Unlike read.table, read.csv defaults to header = TRUE, and uses the first row as column names.
In R versions before 4.0.0, all these functions convert strings to factor class by default unless either as.is = TRUE or stringsAsFactors = FALSE is set; since R 4.0.0 strings are kept as character by default.
The read.csv2 variant defaults to sep = ";" and dec = "," for use on data from countries where the comma is used as a decimal point and the semicolon as a field separator.

Importing using packages

The readr package's read_csv function offers much faster performance, a progress bar for large files, and more popular default options than standard read.csv, including never converting strings to factors (the equivalent of stringsAsFactors = FALSE).

library(readr)
df <- read_csv(csv_path)
df
## # A tibble: 4 x 2
##    Var1 Var2
##   <dbl> <chr>
## 1  2.70 A
## 2  3.14 B
## 3 10.00 A
## 4 -7.00 A

Section 20.2: Importing with data.table

The data.table package introduces the function fread. While it is similar to read.table, fread is usually faster and more flexible, guessing the file's delimiter automatically.

library(data.table)

# get the file path of a CSV included in R's utils package
csv_path <- system.file("misc", "exDIF.csv", package = "utils")

# path will vary based on R installation location
csv_path
## [1] "/Library/Frameworks/R.framework/Resources/library/utils/misc/exDIF.csv"

dt <- fread(csv_path)
dt
##     Var1 Var2
## 1:  2.70    A
## 2:  3.14    B
## 3: 10.00    A
## 4: -7.00    A

The first argument, input, is a string representing: the filename (e.g. "filename.csv"), a shell command that acts on a file (e.g. "grep 'word' filename"), or the input itself (e.g. "input1, input2 \n A, B \n C, D").

fread returns an object of class data.table that inherits from class data.frame, suitable for use with data.table's usage of []. To return an ordinary data.frame, set the data.table parameter to FALSE:

df <- fread(csv_path, data.table = FALSE)
class(df)
## [1] "data.frame"
df
##    Var1 Var2
## 1  2.70    A
## 2  3.14    B
## 3 10.00    A
## 4 -7.00    A

Notes

fread does not have all the same options as read.table. One missing argument is comment.char, which may lead to unwanted behavior if the source file contains #.
fread uses only " for the quote parameter.
fread uses only a few (5) lines to guess variable types.

Section 20.3: Exporting .csv files

Exporting using base R

Data can be written to a CSV file using write.csv():

write.csv(mtcars, "mtcars.csv")

Commonly specified parameters include row.names = FALSE and na = "".

Exporting using packages

readr::write_csv is significantly faster than write.csv and does not write row names.

library(readr)
write_csv(mtcars, "mtcars.csv")

Section 20.4: Import multiple csv files

# pattern is a regular expression, so match a literal ".csv" at the end of the name
files <- list.files(pattern = "\\.csv$")
data_list <- lapply(files, read.csv)

This reads every file and adds it to a list. Afterwards, if all the data.frames have the same structure, they can be combined into one big data.frame:

df <- do.call(rbind, data_list)

Section 20.5: Importing fixed-width files

Fixed-width files are text files in which columns are not separated by any character delimiter, like , or ;, but rather have a fixed character length (width). Data is usually padded with white spaces.

An example:

Column1 Column2   Column3           Column4Column5
1647    pi        'important'       3.141596.28318
1731    euler     'quite important' 2.718285.43656
1979    answer    'The Answer.'     42     42

Let's assume this data table exists in the local file constants.txt in the working directory.
Importing with base R

df <- read.fwf('constants.txt', widths = c(8,10,18,7,8), header = FALSE, skip = 1)
df
#> V1 V2 V3 V4 V5