" R for Data Science: Import, Tidy, Transform, Visualize, and Model Data 🔙 Quay lại trang tải sách pdf ebook R for Data Science: Import, Tidy, Transform, Visualize, and Model Data Ebooks Nhóm Zalo R for Data Science IMPORT, TIDY, TRANSFORM, VISUALIZE, AND MODEL DATA Hadley Wickham & Garrett Grolemund R for Data Science Import, Tidy, Transform, Visualize, and Model Data Hadley Wickham and Garrett Grolemund Beijing Boston Farnham Sebastopol Tokyo R for Data Science by Hadley Wickham and Garrett Grolemund Copyright © 2017 Garrett Grolemund, Hadley Wickham. All rights reserved. Printed in Canada. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com. Editors: Marie Beaugureau and Mike Loukides Production Editor: Nicholas Adams Copyeditor: Kim Cofer Proofreader: Charles Roumeliotis December 2016: First Edition Revision History for the First Edition 2016-12-06: First Release Indexer: Wendy Catalano Interior Designer: David Futato Cover Designer: Karen Montgomery Illustrator: Rebecca Demarest See http://oreilly.com/catalog/errata.csp?isbn=9781491910399 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. R for Data Sci‐ ence, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is sub‐ ject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-491-91039-9 [TI] Table of Contents Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Part I. Explore 1. Data Visualization with ggplot2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Introduction 3 First Steps 4 Aesthetic Mappings 7 Common Problems 13 Facets 14 Geometric Objects 16 Statistical Transformations 22 Position Adjustments 27 Coordinate Systems 31 The Layered Grammar of Graphics 34 2. Work€ow: Basics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Coding Basics 37 What’s in a Name? 38 Calling Functions 39 3. Data Transformation with dplyr. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Introduction 43 Filter Rows with filter() 45 Arrange Rows with arrange() 50 Select Columns with select() 51 iii Add New Variables with mutate() 54 Grouped Summaries with summarize() 59 Grouped Mutates (and Filters) 73 4. Work€ow: Scripts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Running Code 78 RStudio Diagnostics 79 5. Exploratory Data Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
   Introduction; Questions; Variation; Missing Values; Covariation; Patterns and Models; ggplot2 Calls; Learning More
6. Workflow: Projects
   What Is Real?; Where Does Your Analysis Live?; Paths and Directories; RStudio Projects; Summary

Part II. Wrangle

7. Tibbles with tibble
   Introduction; Creating Tibbles; Tibbles Versus data.frame; Interacting with Older Code
8. Data Import with readr
   Introduction; Getting Started; Parsing a Vector; Parsing a File; Writing to a File; Other Types of Data
9. Tidy Data with tidyr
   Introduction; Tidy Data; Spreading and Gathering; Separating and Pull; Missing Values; Case Study; Nontidy Data
10. Relational Data with dplyr
    Introduction; nycflights13; Keys; Mutating Joins; Filtering Joins; Join Problems; Set Operations
11. Strings with stringr
    Introduction; String Basics; Matching Patterns with Regular Expressions; Tools; Other Types of Pattern; Other Uses of Regular Expressions; stringi
12. Factors with forcats
    Introduction; Creating Factors; General Social Survey; Modifying Factor Order; Modifying Factor Levels
13. Dates and Times with lubridate
    Introduction; Creating Date/Times; Date-Time Components; Time Spans; Time Zones

Part III. Program

14. Pipes with magrittr
    Introduction; Piping Alternatives; When Not to Use the Pipe; Other Tools from magrittr
15. Functions
    Introduction; When Should You Write a Function?; Functions Are for Humans and Computers; Conditional Execution; Function Arguments; Return Values; Environment
16. Vectors
    Introduction; Vector Basics; Important Types of Atomic Vector; Using Atomic Vectors; Recursive Vectors (Lists); Attributes; Augmented Vectors
17. Iteration with purrr
    Introduction; For Loops; For Loop Variations; For Loops Versus Functionals; The Map Functions; Dealing with Failure; Mapping over Multiple Arguments; Walk; Other Patterns of For Loops

Part IV. Model

18. Model Basics with modelr
    Introduction; A Simple Model; Visualizing Models; Formulas and Model Families; Missing Values; Other Model Families
19. Model Building
    Introduction; Why Are Low-Quality Diamonds More Expensive?; What Affects the Number of Daily Flights?; Learning More About Models
20. Many Models with purrr and broom
    Introduction; gapminder; List-Columns; Creating List-Columns; Simplifying List-Columns; Making Tidy Data with broom

Part V. Communicate

21. R Markdown
    Introduction; R Markdown Basics; Text Formatting with Markdown; Code Chunks; Troubleshooting; YAML Header; Learning More
22. Graphics for Communication with ggplot2
    Introduction; Label; Annotations; Scales; Zooming; Themes; Saving Your Plots; Learning More
23. R Markdown Formats
    Introduction; Output Options; Documents; Notebooks; Presentations; Dashboards; Interactivity; Websites; Other Formats; Learning More
24. R Markdown Workflow

Index

Preface

Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. The goal of R for Data Science is to help you learn the most important tools in R that will allow you to do data science. After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.

What You Will Learn

Data science is a huge field, and there's no way you can master it by reading a single book. The goal of this book is to give you a solid foundation in the most important tools. Our model of the tools needed in a typical data science project looks something like this:

First you must import your data into R. This typically means that you take data stored in a file, database, or web API, and load it into a data frame in R. If you can't get your data into R, you can't do data science on it!

Once you've imported your data, it is a good idea to tidy it. Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.

Once you have tidy data, a common first step is to transform it. Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing velocity from speed and time), and calculating a set of summary statistics (like counts or means). Together, tidying and transforming are called wrangling, because getting your data in a form that's natural to work with often feels like a fight!

Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualization and modeling. These have complementary strengths and weaknesses, so any real analysis will iterate between them many times.

Visualization is a fundamentally human activity.
A good visualization will show you things that you did not expect, or raise new questions about the data. A good visualization might also hint that you're asking the wrong question, or that you need to collect different data. Visualizations can surprise you, but they don't scale particularly well because they require a human to interpret them.

Models are complementary tools to visualization. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.

The last step of data science is communication, an absolutely critical part of any data analysis project. It doesn't matter how well your models and visualization have led you to understand the data unless you can also communicate your results to others.

Surrounding all these tools is programming. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off, because becoming a better programmer allows you to automate common tasks and solve new problems with greater ease.

You'll use these tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play; you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.

How This Book Is Organized

The previous description of the tools of data science is organized roughly according to the order in which you use them in an analysis (although of course you'll iterate through them multiple times). In our experience, however, this is not the best way to learn them:

• Starting with data ingest and tidying is suboptimal, because 80% of the time it's routine and boring, and the other 20% of the time it's weird and frustrating. That's a bad place to start learning a new subject! Instead, we'll start with visualization and transformation of data that's already been imported and tidied. That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth it.

• Some topics are best explained with other tools. For example, we believe that it's easier to understand how models work if you already know about visualization, tidy data, and programming.

• Programming tools are not necessarily interesting in their own right, but do allow you to tackle considerably more challenging problems. We'll give you a selection of programming tools in the middle of the book, and then you'll see how they can combine with the data science tools to tackle interesting modeling problems.

Within each chapter, we try to stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details. Each section of the book is paired with exercises to help you practice what you've learned. While it's tempting to skip the exercises, there's no better way to learn than practicing on real problems.

What You Won't Learn

There are some important topics that this book doesn't cover.
We believe it's important to stay ruthlessly focused on the essentials so you can get up and running as quickly as possible. That means this book can't cover every important topic.

Big Data

This book proudly focuses on small, in-memory datasets. This is the right place to start because you can't tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1–2 Gb of data. If you're routinely working with larger data (10–100 Gb, say), you should learn more about data.table. This book doesn't teach data.table because it has a very concise interface, which makes it harder to learn since it offers fewer linguistic cues. But if you're working with large data, the performance payoff is worth the extra effort required to learn it.

If your data is bigger than this, carefully consider if your big data problem might actually be a small data problem in disguise. While the complete data might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration.

Another possibility is that your big data problem is actually a large number of small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Hadoop or Spark) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can learn new tools like sparklyr, rhipe, and ddr to solve it for the full dataset.
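As a taste of how the book's tools handle the many-small-problems case on a single machine, here is a minimal sketch that fits one model per country in the gapminder data. It assumes you've installed the purrr and gapminder packages (both appear later in the book); the splitting-and-mapping idiom is covered properly in Chapters 17 and 20:

library(purrr)
library(gapminder)

# one small data frame per country...
by_country <- split(gapminder, gapminder$country)

# ...and one small, independent model per data frame
models <- map(by_country, ~ lm(lifeExp ~ year, data = .x))

Each individual model fits comfortably in memory; at a larger scale, this same pattern is what systems like Spark distribute across machines.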
Python, Julia, and Friends

In this book, you won't learn anything about Python, Julia, or any other programming language useful for data science. This isn't because we think these tools are bad. They're not! And in practice, most data science teams use a mix of languages, often at least R and Python.

However, we strongly believe that it's best to master one tool at a time. You will get better faster if you dive deep, rather than spreading yourself thinly over many topics. This doesn't mean you should only know one thing, just that you'll generally learn faster if you stick to one thing at a time. You should strive to learn new things throughout your career, but make sure your understanding is solid before you move on to the next interesting thing.

We think R is a great place to start your data science journey because it is an environment designed from the ground up to support data science. R is not just a programming language, but it is also an interactive environment for doing data science. To support interaction, R is a much more flexible language than many of its peers. This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process. These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer.

Nonrectangular Data

This book focuses exclusively on rectangular data: collections of values that are each associated with a variable and an observation. There are lots of datasets that do not naturally fit in this paradigm, including images, sounds, trees, and text. But rectangular data frames are extremely common in science and industry, and we believe that they're a great place to start your data science journey.

Hypothesis Confirmation

It's possible to divide data analysis into two camps: hypothesis generation and hypothesis confirmation (sometimes called confirmatory analysis). The focus of this book is unabashedly on hypothesis generation, or data exploration. Here you'll look deeply at the data and, in combination with your subject knowledge, generate many interesting hypotheses to help explain why the data behaves the way it does. You evaluate the hypotheses informally, using your skepticism to challenge the data in multiple ways.

The complement of hypothesis generation is hypothesis confirmation. Hypothesis confirmation is hard for two reasons:

• You need a precise mathematical model in order to generate falsifiable predictions. This often requires considerable statistical sophistication.

• You can only use an observation once to confirm a hypothesis. As soon as you use it more than once you're back to doing exploratory analysis. This means to do hypothesis confirmation you need to "preregister" (write out in advance) your analysis plan, and not deviate from it even when you have seen the data. We'll talk a little about some strategies you can use to make this easier in Part IV.

It's common to think about modeling as a tool for hypothesis confirmation, and visualization as a tool for hypothesis generation. But that's a false dichotomy: models are often used for exploration, and with a little care you can use visualization for confirmation. The key difference is how often you look at each observation: if you look only once, it's confirmation; if you look more than once, it's exploration.

Prerequisites

We've made a few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and it's helpful if you have some programming experience already. If you've never programmed before, you might find Hands-On Programming with R by Garrett to be a useful adjunct to this book.

There are four things you need to run the code in this book: R, RStudio, a collection of R packages called the tidyverse, and a handful of other packages. Packages are the fundamental units of reproducible R code. They include reusable functions, the documentation that describes how to use them, and sample data.

R

To download R, go to CRAN, the comprehensive R archive network. CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages. Don't try and pick a mirror that's close to you: instead use the cloud mirror, https://cloud.r-project.org, which automatically figures it out for you.

A new major version of R comes out once a year, and there are 2–3 minor releases each year. It's a good idea to update regularly. Upgrading can be a bit of a hassle, especially for major versions, which require you to reinstall all your packages, but putting it off only makes it worse.
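(If you're not sure which version you're currently running, you can ask R itself from the console; the output below is just an example, and yours will reflect your own installation.)

R.version.string
#> [1] "R version 3.3.1 (2016-06-21)"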
RStudio

RStudio is an integrated development environment, or IDE, for R programming. Download and install it from http://www.rstudio.com/download. RStudio is updated a couple of times a year. When a new version is available, RStudio will let you know. It's a good idea to upgrade regularly so you can take advantage of the latest and greatest features. For this book, make sure you have RStudio 1.0.0.

When you start RStudio, you'll see two key regions in the interface. For now, all you need to know is that you type R code in the console pane, and press Enter to run it. You'll learn more as we go along!

The Tidyverse

You'll also need to install some R packages. An R package is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. The majority of the packages that you will learn in this book are part of the so-called tidyverse. The packages in the tidyverse share a common philosophy of data and R programming, and are designed to work together naturally.

You can install the complete tidyverse with a single line of code:

install.packages("tidyverse")

On your own computer, type that line of code in the console, and then press Enter to run it. R will download the packages from CRAN and install them onto your computer. If you have problems installing, make sure that you are connected to the internet, and that https://cloud.r-project.org/ isn't blocked by your firewall or proxy.

You will not be able to use the functions, objects, and help files in a package until you load it with library(). Once you have installed a package, you can load it with the library() function:

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages --------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

This tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages. These are considered to be the core of the tidyverse because you'll use them in almost every analysis.

Packages in the tidyverse change fairly frequently. You can see if updates are available, and optionally install them, by running tidyverse_update().
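That conflicts message is worth a second look: after loading the tidyverse, filter() means dplyr's filter(), and the masked base version is still reachable with the package::function() form described below. A small sketch (the stats::filter() call is just an arbitrary example of the time-series function it masks):

library(tidyverse)

filter(mpg, cyl == 8)           # dplyr's filter(), the one you'll usually want
stats::filter(1:10, rep(1, 3))  # base R's time-series filter, via package::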
Other Packages

There are many other excellent packages that are not part of the tidyverse, because they solve problems in a different domain, or are designed with a different set of underlying principles. This doesn't make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages. As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data. In this book we'll use three data packages from outside the tidyverse:

install.packages(c("nycflights13", "gapminder", "Lahman"))

These packages provide data on airline flights, world development, and baseball that we'll use to illustrate key data science ideas.

Running R Code

The previous section showed you a couple of examples of running R code. Code in the book looks like this:

1 + 2
#> [1] 3

If you run the same code in your local console, it will look like this:

> 1 + 2
[1] 3

There are two main differences. In your console, you type after the >, called the prompt; we don't show the prompt in the book. In the book, output is commented out with #>; in your console it appears directly after your code. These two differences mean that if you're working with an electronic version of the book, you can easily copy code out of the book and into the console.

Throughout the book we use a consistent set of conventions to refer to code:

• Functions are in a code font and followed by parentheses, like sum() or mean().

• Other R objects (like data or function arguments) are in a code font, without parentheses, like flights or x.

• If we want to make it clear what package an object comes from, we'll use the package name followed by two colons, like dplyr::mutate() or nycflights13::flights. This is also valid R code.

Getting Help and Learning More

This book is not an island; there is no single resource that will allow you to master R. As you start to apply the techniques described in this book to your own data you will soon find questions that I do not answer. This section describes a few tips on how to get help, and to help you keep learning.

If you get stuck, start with Google. Typically, adding "R" to a query is enough to restrict it to relevant results: if the search isn't useful, it often means that there aren't any R-specific results available. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web. (If the error message isn't in English, run Sys.setenv(LANGUAGE = "en") and re-run the code; you're more likely to find help for English error messages.)

If Google doesn't help, try stackoverflow. Start by spending a little time searching for an existing answer; including [R] restricts your search to questions and answers that use R. If you don't find anything useful, prepare a minimal reproducible example, or reprex. A good reprex makes it easier for other people to help you, and often you'll figure out the problem yourself in the course of making it. There are three things you need to include to make your example reproducible: required packages, data, and code:

• Packages should be loaded at the top of the script, so it's easy to see which ones the example needs. This is a good time to check that you're using the latest version of each package; it's possible you've discovered a bug that's been fixed since you installed the package. For packages in the tidyverse, the easiest way to check is to run tidyverse_update().

• The easiest way to include data in a question is to use dput() to generate the R code to re-create it. For example, to re-create the mtcars dataset in R, I'd perform the following steps:

  1. Run dput(mtcars) in R.
  2. Copy the output.
  3. In my reproducible script, type mtcars <- then paste.

  Try and find the smallest subset of your data that still reveals the problem.

• Spend a little bit of time ensuring that your code is easy for others to read:

  — Make sure you've used spaces and your variable names are concise, yet informative.
  — Use comments to indicate where your problem lies.
  — Do your best to remove everything that is not related to the problem.

  The shorter your code is, the easier it is to understand, and the easier it is to fix.

Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script in.
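Putting those three ingredients together, a minimal reprex might look something like this sketch (the data frame and code here are made-up placeholders; your dput() output and failing code would go in their place):

# packages at the top, so helpers can see what the example needs
library(tidyverse)

# a tiny, self-contained dataset standing in for dput() output
df <- tibble(
  x = c(1, 2, 3, 4),
  y = c(2, 4, 6, 8)
)

# the smallest amount of code that still shows the problem
ggplot(df, aes(x, y)) +
  geom_point()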
You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way is to follow what Hadley, Garrett, and everyone else at RStudio are doing on the RStudio blog. This is where we post announcements about new packages, new IDE features, and in-person courses. You might also want to follow Hadley (@hadleywickham) or Garrett (@statgarrett) on Twitter, or follow @rstudiotips to keep up with new features in the IDE.

To keep up with the R community more broadly, we recommend reading http://www.r-bloggers.com: it aggregates over 500 blogs about R from around the world. If you're an active Twitter user, follow the #rstats hashtag. Twitter is one of the key tools that Hadley uses to keep up with new developments in the community.

Acknowledgments

This book isn't just the product of Hadley and Garrett, but is the result of many conversations (in person and online) that we've had with the many people in the R community. There are a few people we'd like to thank in particular, because they have spent many hours answering our dumb questions and helping us to better think about data science:

• Jenny Bryan and Lionel Henry for many helpful discussions around working with lists and list-columns.

• The three chapters on workflow were adapted (with permission) from "R basics, workspace and working directory, RStudio projects" by Jenny Bryan.

• Genevera Allen for discussions about models, modeling, the statistical learning perspective, and the difference between hypothesis generation and hypothesis confirmation.

• Yihui Xie for his work on the bookdown package, and for tirelessly responding to my feature requests.

• Bill Behrman for his thoughtful reading of the entire book, and for trying it out with his data science class at Stanford.

• The #rstats twitter community who reviewed all of the draft chapters and provided tons of useful feedback.

• Tal Galili for augmenting his dendextend package to support a section on clustering that did not make it into the final draft.

This book was written in the open, and many people contributed pull requests to fix minor problems. Special thanks goes to everyone who contributed via GitHub (listed in alphabetical order): adi pradhan, Ahmed ElGabbas, Ajay Deonarine, @Alex, Andrew Landgraf, @batpigandme, @behrman, Ben Marwick, Bill Behrman, Brandon Greenwell, Brett Klamer, Christian G. Warden, Christian Mongeau, Colin Gillespie, Cooper Morris, Curtis Alexander, Daniel Gromer, David Clark, Derwin McGeary, Devin Pastoor, Dylan Cashman, Earl Brown, Eric Watt, Etienne B. Racine, Flemming Villalona, Gregory Jefferis, @harrismcgehee, Hengni Cai, Ian Lyttle, Ian Sealy, Jakub Nowosad, Jennifer (Jenny) Bryan, @jennybc, Jeroen Janssens, Jim Hester, @jjchern, Joanne Jang, John Sears, Jon Calder, Jonathan Page, @jonathanflint, Julia Stewart Lowndes, Julian During, Justinas Petuchovas, Kara Woo, @kdpsingh, Kenny Darrell, Kirill Sevastyanenko, @koalabearski, @KyleHumphrey, Lawrence Wu, Matthew Sedaghatfar, Mine Cetinkaya-Rundel, @MJMarshall, Mustafa Ascha, @nate-d-olson, Nelson Areal, Nick Clark, @nickelas, @nwaff, @OaCantona, Patrick Kennedy, Peter Hurford, Rademeyer Vermaak, Radu Grosu, @rlzijdeman, Robert Schuessler, @robinlovelace, @robinsones, S'busiso Mkhondwane, @seamus-mckinsey, @seanpwilliams, Shannon Ellis, @shoili, @sibusiso16, @spirgel, Steve Mortimer, @svenski, Terence Teo, Thomas Klebel, TJ Mahr, Tom Prior, Will Beasley, Yihui Xie.

Online Version

An online version of this book is available at http://r4ds.had.co.nz. It will continue to evolve in between reprints of the physical book.
The source of the book is available at https://github.com/hadley/r4ds. The book is powered by https://bookdown.org, which makes it easy to turn R Markdown files into HTML, PDF, and EPUB.

This book was built with:

devtools::session_info(c("tidyverse"))
#> Session info ------------------------------------------------
#>  setting  value
#>  version  R version 3.3.1 (2016-06-21)
#>  system   x86_64, darwin13.4.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  tz       America/Los_Angeles
#>  date     2016-10-10
#> Packages ----------------------------------------------------
#>  package      * version    date       source
#>  assertthat     0.1        2013-12-06 CRAN (R 3.3.0)
#>  BH             1.60.0-2   2016-05-07 CRAN (R 3.3.0)
#>  broom          0.4.1      2016-06-24 CRAN (R 3.3.0)
#>  colorspace     1.2-6      2015-03-11 CRAN (R 3.3.0)
#>  curl           2.1        2016-09-22 CRAN (R 3.3.0)
#>  DBI            0.5-1      2016-09-10 CRAN (R 3.3.0)
#>  dichromat      2.0-0      2013-01-24 CRAN (R 3.3.0)
#>  digest         0.6.10     2016-08-02 CRAN (R 3.3.0)
#>  dplyr        * 0.5.0      2016-06-24 CRAN (R 3.3.0)
#>  forcats        0.1.1      2016-09-16 CRAN (R 3.3.0)
#>  foreign        0.8-67     2016-09-13 CRAN (R 3.3.0)
#>  ggplot2      * 2.1.0.9001 2016-10-06 local
#>  gtable         0.2.0      2016-02-26 CRAN (R 3.3.0)
#>  haven          1.0.0      2016-09-30 local
#>  hms            0.2-1      2016-07-28 CRAN (R 3.3.1)
#>  httr           1.2.1      2016-07-03 cran (@1.2.1)
#>  jsonlite       1.1        2016-09-14 CRAN (R 3.3.0)
#>  labeling       0.3        2014-08-23 CRAN (R 3.3.0)
#>  lattice        0.20-34    2016-09-06 CRAN (R 3.3.0)
#>  lazyeval       0.2.0      2016-06-12 CRAN (R 3.3.0)
#>  lubridate      1.6.0      2016-09-13 CRAN (R 3.3.0)
#>  magrittr       1.5        2014-11-22 CRAN (R 3.3.0)
#>  MASS           7.3-45     2016-04-21 CRAN (R 3.3.1)
#>  mime           0.5        2016-07-07 cran (@0.5)
#>  mnormt         1.5-4      2016-03-09 CRAN (R 3.3.0)
#>  modelr         0.1.0      2016-08-31 CRAN (R 3.3.0)
#>  munsell        0.4.3      2016-02-13 CRAN (R 3.3.0)
#>  nlme           3.1-128    2016-05-10 CRAN (R 3.3.1)
#>  openssl        0.9.4      2016-05-25 cran (@0.9.4)
#>  plyr           1.8.4      2016-06-08 cran (@1.8.4)
#>  psych          1.6.9      2016-09-17 CRAN (R 3.3.0)
#>  purrr        * 0.2.2      2016-06-18 CRAN (R 3.3.0)
#>  R6             2.1.3      2016-08-19 CRAN (R 3.3.0)
#>  RColorBrewer   1.1-2      2014-12-07 CRAN (R 3.3.0)
#>  Rcpp           0.12.7     2016-09-05 CRAN (R 3.3.0)
#>  readr        * 1.0.0      2016-08-03 CRAN (R 3.3.0)
#>  readxl         0.1.1      2016-03-28 CRAN (R 3.3.0)
#>  reshape2       1.4.1      2014-12-06 CRAN (R 3.3.0)
#>  rvest          0.3.2      2016-06-17 CRAN (R 3.3.0)
#>  scales         0.4.0.9003 2016-10-06 local
#>  selectr        0.3-0      2016-08-30 CRAN (R 3.3.0)
#>  stringi        1.1.2      2016-10-01 CRAN (R 3.3.1)
#>  stringr        1.1.0      2016-08-19 cran (@1.1.0)
#>  tibble       * 1.2        2016-08-26 CRAN (R 3.3.0)
#>  tidyr        * 0.6.0      2016-08-12 CRAN (R 3.3.0)
#>  tidyverse    * 1.0.0      2016-09-09 CRAN (R 3.3.0)
#>  xml2           1.0.0.9001 2016-09-30 local

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
  Indicates new terms, URLs, email addresses, filenames, and file extensions.

Bold
  Indicates the names of R packages.

Constant width
  Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
  Shows commands or other text that should be typed literally by the user.

Constant width italic
  Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

Using Code Examples

Source code is available for download at https://github.com/hadley/r4ds.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation.
You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "R for Data Science by Hadley Wickham and Garrett Grolemund (O'Reilly). Copyright 2017 Garrett Grolemund, Hadley Wickham, 978-1-491-91039-9."

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O'Reilly Safari

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O'Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/r-for-data-science.

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Part I. Explore

The goal of the first part of this book is to get you up to speed with the basic tools of data exploration as quickly as possible. Data exploration is the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again. The goal of data exploration is to generate many promising leads that you can later explore in more depth.

In this part of the book you will learn some useful tools that have an immediate payoff:

• Visualization is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data. In Chapter 1 you'll dive into visualization, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.
• Visualization alone is typically not enough, so in Chapter 3 you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.

• Finally, in Chapter 5, you'll combine visualization and transformation with your curiosity and skepticism to ask and answer interesting questions about data.

Modeling is an important part of the exploratory process, but you don't have the skills to effectively learn or apply it yet. We'll come back to it in Part IV, once you're better equipped with more data wrangling and programming tools.

Nestled among these three chapters that teach you the tools of exploration are three chapters that focus on your R workflow. In Chapter 2, Chapter 4, and Chapter 6 you'll learn good practices for writing and organizing your R code. These will set you up for success in the long run, as they'll give you the tools to stay organized when you tackle real projects.

CHAPTER 1
Data Visualization with ggplot2

Introduction

The simple graph has brought more information to the data analyst's mind than any other device. —John Tukey

This chapter will teach you how to visualize your data using ggplot2. R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places.

If you'd like to learn more about the theoretical underpinnings of ggplot2 before you start, I'd recommend reading "A Layered Grammar of Graphics".

Prerequisites

This chapter focuses on ggplot2, one of the core members of the tidyverse. To access the datasets, help pages, and functions that we will use in this chapter, load the tidyverse by running this code:

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages --------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

That one line of code loads the core tidyverse, packages that you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded).

If you run this code and get the error message "there is no package called 'tidyverse'," you'll need to first install it, then run library() once again:

install.packages("tidyverse")
library(tidyverse)

You only need to install a package once, but you need to reload it every time you start a new session.

If we need to be explicit about where a function (or dataset) comes from, we'll use the special form package::function(). For example, ggplot2::ggplot() tells you explicitly that we're using the ggplot() function from the ggplot2 package.

First Steps

Let's use our first graph to answer a question: do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?

The mpg Data Frame

You can test your answer with the mpg data frame found in ggplot2 (aka ggplot2::mpg). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
mpg contains observations collected by the US Environmental Protection Agency on 38 models of cars:

mpg
#> # A tibble: 234 × 11
#>   manufacturer model displ  year   cyl      trans   drv
#>          <chr> <chr> <dbl> <int> <int>      <chr> <chr>
#> 1         audi    a4   1.8  1999     4   auto(l5)     f
#> 2         audi    a4   1.8  1999     4 manual(m5)     f
#> 3         audi    a4   2.0  2008     4 manual(m6)     f
#> 4         audi    a4   2.0  2008     4   auto(av)     f
#> 5         audi    a4   2.8  1999     6   auto(l5)     f
#> 6         audi    a4   2.8  1999     6 manual(m5)     f
#> # ... with 228 more rows, and 4 more variables:
#> #   cty <int>, hwy <int>, fl <chr>, class <chr>

Among the variables in mpg are:

• displ, a car's engine size, in liters.

• hwy, a car's fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.

To learn more about mpg, open its help page by running ?mpg.

Creating a ggplot

To plot mpg, run this code to put displ on the x-axis and hwy on the y-axis:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy). In other words, cars with big engines use more fuel. Does this confirm or refute your hypothesis about fuel efficiency and engine size?

With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = mpg) creates an empty graph, but it's not very interesting so I'm not going to show it here.

You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You'll learn a whole bunch of them throughout this chapter.

Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x- and y-axes. ggplot2 looks for the mapped variable in the data argument, in this case, mpg.

A Graphing Template

Let's turn this code into a reusable template for making graphs with ggplot2. To make a graph, replace the bracketed sections in the following code with a dataset, a geom function, or a collection of mappings:

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

The rest of this chapter will show you how to complete and extend this template to make different types of graphs. We will begin with the <DATA> component.
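For example, filling in the template with the scatterplot we just made: mpg goes in the <DATA> slot, geom_point() is the <GEOM_FUNCTION>, and x = displ, y = hwy are the <MAPPINGS>:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))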
Exercises

1. Run ggplot(data = mpg). What do you see?

2. How many rows are in mtcars? How many columns?

3. What does the drv variable describe? Read the help for ?mpg to find out.

4. Make a scatterplot of hwy versus cyl.

5. What happens if you make a scatterplot of class versus drv? Why is the plot not useful?

Aesthetic Mappings

The greatest value of a picture is when it forces us to notice what we never expected to see. —John Tukey

In the following plot, one group of points (highlighted in red) seems to fall outside of the linear trend. These cars have a higher mileage than you might expect. How can you explain these cars?

Let's hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the class value for each car. The class variable of the mpg dataset classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular).

You can add a third variable, like class, to a two-dimensional scatterplot by mapping it to an aesthetic. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point in different ways by changing the values of its aesthetic properties. Since we already use the word "value" to describe data, let's use the word "level" to describe aesthetic properties. Here we change the levels of a point's size, shape, and color to make the point small, triangular, or blue.

You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. For example, you can map the colors of your points to the class variable to reveal the class of each car:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

(If you prefer British English, like Hadley, you can use colour instead of color.)

To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside aes(). ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values.

The colors reveal that many of the unusual points are two-seater cars. These cars don't seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.

In the preceding example, we mapped class to the color aesthetic, but we could have mapped class to the size aesthetic in the same way. In this case, the exact size of each point would reveal its class affiliation. We get a warning here, because mapping an unordered variable (class) to an ordered aesthetic (size) is not a good idea:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class))
#> Warning: Using size for a discrete variable is not advised.

Or we could have mapped class to the alpha aesthetic, which controls the transparency of the points, or the shape of the points:

# Top
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

# Bottom
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))

What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use this aesthetic.
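If you really do want a shape for every class, one workaround (a sketch using scale_shape_manual(), a scale function this chapter hasn't introduced; scales are covered later in the book) is to supply seven shapes yourself:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 1:7)  # one shape number per class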
For each aesthetic, you use aes() to associate the name of the aesthetic with a variable to display. The aes() function gathers together each of the aesthetic mappings used by a layer and passes them to the layer's mapping argument. The syntax highlights a useful insight about x and y: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.

Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.

You can also set the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

Here, the color doesn't convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e., it goes outside of aes(). You'll need to pick a value that makes sense for that aesthetic:

• The name of a color as a character string.

• The size of a point in mm.

• The shape of a point as a number, as shown in Figure 1-1. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the color and fill aesthetics. The hollow shapes (0–14) have a border determined by color; the solid shapes (15–18) are filled with color; and the filled shapes (21–24) have a border of color and are filled with fill.

Figure 1-1. R has 25 built-in shapes that are identified by numbers
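Putting those three kinds of values together in a single call (the particular values here are arbitrary; pick whatever suits your plot):

ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    color = "blue",  # a color name, as a character string
    size = 3,        # point size in mm
    shape = 17       # 17 is a solid triangle (see Figure 1-1)
  )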
Exercises

1. What's gone wrong with this code? Why are the points not blue?

   ggplot(data = mpg) +
     geom_point(
       mapping = aes(x = displ, y = hwy, color = "blue")
     )

2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?

3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical versus continuous variables?

4. What happens if you map the same variable to multiple aesthetics?

5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point.)

6. What happens if you map an aesthetic to something other than a variable name, like aes(color = displ < 5)?

Common Problems

As you start to run R code, you're likely to run into problems. Don't worry—it happens to everyone. I have been writing R code for years, and every day I still write code that doesn't work!

Start by carefully comparing the code that you're running to the code in the book. R is extremely picky, and a misplaced character can make all the difference. Make sure that every ( is matched with a ) and every " is paired with another ". Sometimes you'll run the code and nothing happens. Check the left-hand side of your console: if it's a +, it means that R doesn't think you've typed a complete expression and it's waiting for you to finish it. In this case, it's usually easy to start from scratch again by pressing Esc to abort processing the current command.

One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven't accidentally written code like this:

ggplot(data = mpg)
+ geom_point(mapping = aes(x = displ, y = hwy))

If you're still stuck, try the help. You can get help about any R function by running ?function_name in the console, or selecting the function name and pressing F1 in RStudio. Don't worry if the help doesn't seem that helpful—instead skip down to the examples and look for code that matches what you're trying to do.

If that doesn't help, carefully read the error message. Sometimes the answer will be buried there! But when you're new to R, the answer might be in the error message but you don't yet know how to understand it. Another great tool is Google: try googling the error message, as it's likely someone else has had the same problem, and has received help online.

Facets

One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.

To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here "formula" is the name of a data structure in R, not a synonym for "equation"). The variable that you pass to facet_wrap() should be discrete:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

To facet your plot on the combination of two variables, add facet_grid() to your plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

If you prefer to not facet in the rows or columns dimension, use a . instead of a variable name, e.g., + facet_grid(. ~ cyl).
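Written out as a complete call (the same code reappears in exercise 3 below), that looks like:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)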
Exercises

1. What happens if you facet on a continuous variable?

2. What do the empty cells in a plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

   ggplot(data = mpg) +
     geom_point(mapping = aes(x = drv, y = cyl))

3. What plots does the following code make? What does . do?

   ggplot(data = mpg) +
     geom_point(mapping = aes(x = displ, y = hwy)) +
     facet_grid(drv ~ .)

   ggplot(data = mpg) +
     geom_point(mapping = aes(x = displ, y = hwy)) +
     facet_grid(. ~ cyl)

4. Take the first faceted plot in this section:

   ggplot(data = mpg) +
     geom_point(mapping = aes(x = displ, y = hwy)) +
     facet_wrap(~ class, nrow = 2)

   What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn't facet_grid() have nrow and ncol variables?

6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

Geometric Objects

How are these two plots similar?

Both plots contain the same x variable and the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different visual object to represent the data. In ggplot2 syntax, we say that they use different geoms.

A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom. As we see in the preceding plots, you can use different geoms to plot the same data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.

To change the geom in your plot, change the geom function that you add to ggplot(). For instance, to make the preceding plots, you can use this code:

# left
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# right
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

Every geom function in ggplot2 takes a mapping argument. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn't set the "shape" of a line. On the other hand, you could set the linetype of a line. geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype:

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

Here geom_smooth() separates the cars into three lines based on their drv value, which describes a car's drivetrain. One line describes all of the points with a 4 value, one line describes all of the points with an f value, and one line describes all of the points with an r value. Here, 4 stands for four-wheel drive, f for front-wheel drive, and r for rear-wheel drive.

If this sounds strange, we can make it more clear by overlaying the lines on top of the raw data and then coloring everything according to drv.
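The book shows that overlay only as a figure; one way to draw something like it yourself is to combine the two geoms and map drv to both color and linetype (a sketch, not necessarily the exact code behind the figure):

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(mapping = aes(linetype = drv))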
Notice that this plot contains two geoms in the same graph! If this makes you excited, buckle up. In the next section, we will learn how to place multiple geoms in the same plot.

ggplot2 provides over 30 geoms, and extension packages provide even more (see https://www.ggplot2-exts.org for a sampling). The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at http://rstudio.com/cheatsheets. To learn more about any single geom, use help: ?geom_smooth.

Many geoms, like geom_smooth(), use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms:

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE
  )

To display multiple geoms in the same plot, add multiple geom functions to ggplot():

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

This, however, introduces some duplication in our code. Imagine if you wanted to change the y-axis to display cty instead of hwy. You'd need to change the variable in two places, and you might forget to update one. You can avoid this type of repetition by passing a set of mappings to ggplot(). ggplot2 will treat these mappings as global mappings that apply to each geom in the graph. In other words, this code will produce the same plot as the previous code:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()

You can use the same idea to specify different data for each layer. Here, our smooth line displays just a subset of the mpg dataset, the subcompact cars. The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(
    data = filter(mpg, class == "subcompact"),
    se = FALSE
  )

(You'll learn how filter() works in the next chapter; for now, just know that this command selects only the subcompact cars.)

Exercises

1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions:

   ggplot(
     data = mpg,
     mapping = aes(x = displ, y = hwy, color = drv)
   ) +
     geom_point() +
     geom_smooth(se = FALSE)

3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?
4. What does the se argument to geom_smooth() do?
5. Will these two graphs look different? Why/why not?

   ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
     geom_point() +
     geom_smooth()

   ggplot() +
     geom_point(
       data = mpg,
       mapping = aes(x = displ, y = hwy)
     ) +
     geom_smooth(
       data = mpg,
       mapping = aes(x = displ, y = hwy)
     )

6. Re-create the R code necessary to generate the following graphs.

Statistical Transformations

Next, let's take a look at a bar chart. Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with geom_bar(). The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut. The diamonds dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The chart shows that more diamonds are available with high-quality cuts than with low-quality cuts:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

On the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:

• Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
• Smoothers fit a model to your data and then plot predictions from the model.
• Boxplots compute a robust summary of the distribution and display a specially formatted box.

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. The following figure describes how this process works with geom_bar().
You can learn which stat a geom uses by inspecting the default value for the stat argument. For example, ?geom_bar shows the default value for stat is "count," which means that geom_bar() uses stat_count(). stat_count() is documented on the same page as geom_bar(), and if you scroll down you can find a section called "Computed variables." That tells us that it computes two new variables: count and prop.

You can generally use geoms and stats interchangeably. For example, you can re-create the previous plot using stat_count() instead of geom_bar():

ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

This works because every geom has a default stat, and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. There are three reasons you might need to use a stat explicitly:

• You might want to override the default stat. In the following code, I change the stat of geom_bar() from count (the default) to identity. This lets me map the height of the bars to the raw values of a y variable. Unfortunately when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or the previous bar chart, where the height of the bar is generated by counting rows.

  demo <- tribble(
    ~a,      ~b,
    "bar_1", 20,
    "bar_2", 30,
    "bar_3", 40
  )

  ggplot(data = demo) +
    geom_bar(
      mapping = aes(x = a, y = b),
      stat = "identity"
    )

  (Don't worry that you haven't seen <- or tribble() before. You might be able to guess at their meaning from the context, and you'll learn exactly what they do soon!)

• You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportion, rather than count:

  ggplot(data = diamonds) +
    geom_bar(
      mapping = aes(x = cut, y = ..prop.., group = 1)
    )

  To find the variables computed by the stat, look for the help section titled "Computed variables."

• You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarizes the y values for each unique x value, to draw attention to the summary that you're computing:

  ggplot(data = diamonds) +
    stat_summary(
      mapping = aes(x = cut, y = depth),
      fun.ymin = min,
      fun.ymax = max,
      fun.y = median
    )

ggplot2 provides over 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g., ?stat_bin. To see a complete list of stats, try the ggplot2 cheatsheet.

Exercises

1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
2. What does geom_col() do? How is it different from geom_bar()?
3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
4. What variables does stat_smooth() compute? What parameters control its behavior?
5. In our proportion bar chart, we need to set group = 1. Why? In other words, what is the problem with these two graphs?

   ggplot(data = diamonds) +
     geom_bar(mapping = aes(x = cut, y = ..prop..))

   ggplot(data = diamonds) +
     geom_bar(
       mapping = aes(x = cut, fill = color, y = ..prop..)
     )
Position Adjustments

There's one more piece of magic associated with bar charts. You can color a bar chart using either the color aesthetic, or more usefully, fill:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, color = cut))

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

Note what happens if you map the fill aesthetic to another variable, like clarity: the bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))

The stacking is performed automatically by the position adjustment specified by the position argument. If you don't want a stacked bar chart, you can use one of three other options: "identity", "dodge", or "fill":

• position = "identity" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA:

  ggplot(
    data = diamonds,
    mapping = aes(x = cut, fill = clarity)
  ) +
    geom_bar(alpha = 1/5, position = "identity")

  ggplot(
    data = diamonds,
    mapping = aes(x = cut, color = clarity)
  ) +
    geom_bar(fill = NA, position = "identity")

  The identity position adjustment is more useful for 2D geoms, like points, where it is the default.

• position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups:

  ggplot(data = diamonds) +
    geom_bar(
      mapping = aes(x = cut, fill = clarity),
      position = "fill"
    )

• position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values:

  ggplot(data = diamonds) +
    geom_bar(
      mapping = aes(x = cut, fill = clarity),
      position = "dodge"
    )

There's one other type of adjustment that's not useful for bar charts, but it can be very useful for scatterplots. Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?

The values of hwy and displ are rounded so the points appear on a grid and many points overlap each other. This problem is known as overplotting. This arrangement makes it hard to see where the mass of the data is. Are the data points spread equally throughout the graph, or is there one special combination of hwy and displ that contains 109 values?

You can avoid this gridding by setting the position adjustment to "jitter": position = "jitter" adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise:

ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = "jitter"
  )

Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph more revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for geom_point(position = "jitter"): geom_jitter().

To learn more about a position adjustment, look up the help page associated with each adjustment: ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack.
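For example, the jittered scatterplot above can be written more compactly with geom_jitter(). A minimal sketch; the width and height arguments are optional, and the values 0.2 are arbitrary choices for illustration:

ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy))

# the same idea, with the amount of jitter chosen by hand
ggplot(data = mpg) +
  geom_jitter(
    mapping = aes(x = displ, y = hwy),
    width = 0.2, height = 0.2
  )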
Exercises

1. What is the problem with this plot? How could you improve it?

   ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
     geom_point()

2. What parameters to geom_jitter() control the amount of jittering?
3. Compare and contrast geom_jitter() with geom_count().
4. What's the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.

Coordinate Systems

Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system, where the x and y positions act independently to find the location of each point. There are a number of other coordinate systems that are occasionally helpful:

• coord_flip() switches the x- and y-axes. This is useful (for example) if you want horizontal boxplots. It's also useful for long labels; it's hard to get them to fit without overlapping on the x-axis:

  ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
    geom_boxplot()

  ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
    geom_boxplot() +
    coord_flip()

• coord_quickmap() sets the aspect ratio correctly for maps. This is very important if you're plotting spatial data with ggplot2 (which unfortunately we don't have the space to cover in this book):

  nz <- map_data("nz")  # map_data() needs the maps package installed

  ggplot(nz, aes(long, lat, group = group)) +
    geom_polygon(fill = "white", color = "black")

  ggplot(nz, aes(long, lat, group = group)) +
    geom_polygon(fill = "white", color = "black") +
    coord_quickmap()

• coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart:

  bar <- ggplot(data = diamonds) +
    geom_bar(
      mapping = aes(x = cut, fill = cut),
      show.legend = FALSE,
      width = 1
    ) +
    theme(aspect.ratio = 1) +
    labs(x = NULL, y = NULL)

  bar + coord_flip()
  bar + coord_polar()

Exercises

1. Turn a stacked bar chart into a pie chart using coord_polar().
2. What does labs() do? Read the documentation.
3. What's the difference between coord_quickmap() and coord_map()?
4. What does the following plot tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

   ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
     geom_point() +
     geom_abline() +
     coord_fixed()

The Layered Grammar of Graphics

In the previous sections, you learned much more than how to make scatterplots, bar charts, and boxplots. You learned a foundation that you can use to make any type of plot with ggplot2. To see this, let's add position adjustments, stats, coordinate systems, and faceting to our code template:

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

Our new template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.

The seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.
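For instance, the diamonds bar chart from earlier can be written with all seven parameters spelled out. This is a sketch with the defaults made explicit: stat = "count" and position = "stack" are geom_bar()'s defaults, and coord_cartesian() and facet_null() are what ggplot2 supplies when you omit the coordinate system and faceting:

ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut),
    stat = "count",
    position = "stack"
  ) +
  coord_cartesian() +
  facet_null()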
To see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat).

Next, you could choose a geometric object to represent each observation in the transformed data. You could then use the aesthetic properties of the geoms to represent variables in the data. You would map the values of each variable to the levels of an aesthetic.

You'd then select a coordinate system to place the geoms into. You'd use the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables. At that point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.

You could use this method to build any plot that you imagine. In other words, you can use the code template that you've learned in this chapter to build hundreds of thousands of unique plots.

CHAPTER 2
Workflow: Basics

You now have some experience running R code. I didn't give you many details, but you've obviously figured out the basics, or you would've thrown this book away in frustration! Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in that it's both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.

Before we go any further, let's make sure you've got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.

Coding Basics

Let's review some basics we've so far omitted in the interests of getting you plotting as quickly as possible. You can use R as a calculator:

1 / 200 * 30
#> [1] 0.15
(59 + 73 + 2) / 3
#> [1] 44.7
sin(pi / 2)
#> [1] 1

You can create new objects with <-:

x <- 3 * 4

All R statements where you create objects, assignment statements, have the same form:

object_name <- value

When reading that code, say "object name gets value" in your head.

You will make lots of assignments, and <- is a pain to type. Don't be lazy and use =: it will work, but it will cause confusion later. Instead, use RStudio's keyboard shortcut: Alt-minus (the minus sign). Notice that RStudio automagically surrounds <- with spaces, which is a good code formatting practice. Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.

What's in a Name?

Object names must start with a letter, and can only contain letters, numbers, _, and .. You want your object names to be descriptive, so you'll need a convention for multiple words. I recommend snake_case, where you separate lowercase words with _:

i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
And_aFew.People_RENOUNCEconvention

We'll come back to code style later, in Chapter 15.

You can inspect an object by typing its name:

x
#> [1] 12

Make another assignment:

this_is_a_really_long_name <- 2.5

To inspect this object, try out RStudio's completion facility: type "this," press Tab, add characters until you have a unique prefix, then press Return.
Oops, you made a mistake! this_is_a_really_long_name should have value 3.5, not 2.5. Use another keyboard shortcut to help you fix it. Type "this" then press Cmd/Ctrl-↑. That will list all the commands you've typed that start with those letters. Use the arrow keys to navigate, then press Enter to retype the command. Change 2.5 to 3.5 and rerun.

Make yet another assignment:

r_rocks <- 2 ^ 3

Let's try to inspect it:

r_rock
#> Error: object 'r_rock' not found
R_rocks
#> Error: object 'R_rocks' not found

There's an implied contract between you and R: it will do the tedious computation for you, but in return, you must be completely precise in your instructions. Typos matter. Case matters.

Calling Functions

R has a large collection of built-in functions that are called like this:

function_name(arg1 = val1, arg2 = val2, ...)

Let's try using seq(), which makes regular sequences of numbers, and, while we're at it, learn more helpful features of RStudio. Type se and hit Tab. A pop-up shows you possible completions. Specify seq() by typing more (a "q") to disambiguate, or by using the ↑/↓ arrows to select. Notice the floating tooltip that pops up, reminding you of the function's arguments and purpose. If you want more help, press F1 to get all the details in the help tab in the lower-right pane.

Press Tab once more when you've selected the function you want. RStudio will add matching opening (() and closing ()) parentheses for you. Type the arguments 1, 10 and hit Return:

seq(1, 10)
#> [1] 1 2 3 4 5 6 7 8 9 10

Type this code and notice similar assistance help with the paired quotation marks:

x <- "hello world"

Quotation marks and parentheses must always come in a pair. RStudio does its best to help you, but it's still possible to mess up and end up with a mismatch. If this happens, R will show you the continuation character "+":

> x <- "hello
+

The + tells you that R is waiting for more input; it doesn't think you're done yet. Usually that means you've forgotten either a " or a ). Either add the missing pair, or press Esc to abort the expression and try again.

If you make an assignment, you don't get to see the value. You're then tempted to immediately double-check the result:

y <- seq(1, 10, length.out = 5)
y
#> [1]  1.00  3.25  5.50  7.75 10.00

This common action can be shortened by surrounding the assignment with parentheses, which causes both the assignment and "print to screen" to happen:

(y <- seq(1, 10, length.out = 5))
#> [1]  1.00  3.25  5.50  7.75 10.00

Now look at your environment in the upper-right pane. Here you can see all of the objects that you've created.

Exercises

1. Why does this code not work?

   my_variable <- 10
   my_varıable
   #> Error in eval(expr, envir, enclos):
   #> object 'my_varıable' not found

   Look carefully! (This may seem like an exercise in pointlessness, but training your brain to notice even the tiniest difference will pay off when programming.)
2. Tweak each of the following R commands so that they run correctly:

   library(tidyverse)

   ggplot(dota = mpg) +
     geom_point(mapping = aes(x = displ, y = hwy))

   fliter(mpg, cyl = 8)
   filter(diamond, carat > 3)

3. Press Alt-Shift-K. What happens? How can you get to the same place using the menus?

CHAPTER 3
Data Transformation with dplyr

Introduction

Visualization is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need.
Often you'll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You'll learn how to do all that (and more!) in this chapter, which will teach you how to transform your data using the dplyr package and a new dataset on flights departing New York City in 2013.

Prerequisites

In this chapter we're going to focus on how to use the dplyr package, another core member of the tidyverse. We'll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.

library(nycflights13)
library(tidyverse)

Take careful note of the conflicts message that's printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you'll need to use their full names: stats::filter() and stats::lag().

nycflights13

To explore the basic data manipulation verbs of dplyr, we'll use nycflights13::flights. This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights:

flights
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay
#>   <int> <int> <int>    <int>          <int>     <dbl>
#> 1  2013     1     1      517            515         2
#> 2  2013     1     1      533            529         4
#> 3  2013     1     1      542            540         2
#> 4  2013     1     1      544            545        -1
#> 5  2013     1     1      554            600        -6
#> 6  2013     1     1      554            558        -4
#> # ... with 336,770 more rows, and 13 more variables:
#> #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

You might notice that this data frame prints a little differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. (To see the whole dataset, you can run View(flights), which will open the dataset in the RStudio viewer.) It prints differently because it's a tibble. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in Part II.

You might also have noticed the row of three- (or four-) letter abbreviations under the column names. These describe the type of each variable:

• int stands for integers.
• dbl stands for doubles, or real numbers.
• chr stands for character vectors, or strings.
• dttm stands for date-times (a date + a time).
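To connect these abbreviations to base R, here is a minimal sketch; the specific values are made up for illustration:

typeof(1L)         # "integer"            -> shown as <int>
typeof(1.5)        # "double"             -> shown as <dbl>
typeof("a")        # "character"          -> shown as <chr>
class(Sys.time())  # "POSIXct" "POSIXt"   -> shown as <dttm>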
These six functions pro‐ vide the verbs for a language of data manipulation. All verbs work similarly: 1. The first argument is a data frame. 2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes). 3. The result is a new data frame. Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see how these verbs work. Filter Rows with ‚lter() filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and Filter Rows with ‚lter() | 45 subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on January 1st with: filter(flights, month == 1, day == 1) #> # A tibble: 842 × 19 #> year month day dep_time sched_dep_time dep_delay #> #> 1 2013 1 1 517 515 2 #> 2 2013 1 1 533 529 4 #> 3 2013 1 1 542 540 2 #> 4 2013 1 1 544 545 -1 #> 5 2013 1 1 554 600 -6 #> 6 2013 1 1 554 558 -4 #> # ... with 836 more rows, and 13 more variables: #> # arr_time , sched_arr_time , arr_delay , #> # carrier , flight , tailnum ,origin , #> # dest , air_time , distance , hour , #> # minute , time_hour When you run that line of code, dplyr executes the filtering opera‐ tion and returns a new data frame. dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-: jan1 <- filter(flights, month == 1, day == 1) R either prints out the results, or saves them to a variable. If you want to do both, you can wrap the assignment in parentheses: (dec25 <- filter(flights, month == 12, day == 25)) #> # A tibble: 719 × 19 #> year month day dep_time sched_dep_time dep_delay #> #> 1 2013 12 25 456 500 -4 #> 2 2013 12 25 524 515 9 #> 3 2013 12 25 542 540 2 #> 4 2013 12 25 546 550 -4 #> 5 2013 12 25 556 600 -4 #> 6 2013 12 25 557 600 -3 #> # ... with 713 more rows, and 13 more variables: #> # arr_time , sched_arr_time , arr_delay , #> # carrier , flight , tailnum ,origin , #> # dest , air_time , distance , hour , #> # minute , time_hour Comparisons To use filtering effectively, you have to know how to select the obser‐ vations that you want using the comparison operators. R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal). 46 | Chapter 3: Data Transformation with dplyr When you’re starting out with R, the easiest mistake to make is to use = instead of == when testing for equality. When this happens you’ll get an informative error: filter(flights, month = 1) #> Error: filter() takes unnamed arguments. Do you need `==`? There’s another common problem you might encounter when using ==: floating-point numbers. These results might surprise you! sqrt(2) ^ 2 == 2 #> [1] FALSE 1/49 * 49 == 1 #> [1] FALSE Computers use finite precision arithmetic (they obviously can’t store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on ==, use near(): near(sqrt(2) ^ 2, 2) #> [1] TRUE near(1 / 49 * 49, 1) #> [1] TRUE Logical Operators Multiple arguments to filter() are combined with “and”: every expression must be true in order for a row to be included in the out‐ put. For other types of combinations, you’ll need to use Boolean operators yourself: & is “and,” | is “or,” and ! is “not.” The following figure shows the complete set of Boolean operations. 
The following code finds all flights that departed in November or December:

filter(flights, month == 11 | month == 12)

The order of operations doesn't work like English. You can't write filter(flights, month == 11 | 12), which you might literally translate into "finds all flights that departed in November or December." Instead it finds all months that equal 11 | 12, an expression that evaluates to TRUE. In a numeric context (like here), TRUE becomes one, so this finds all flights in January, not November or December. This is quite confusing!

A useful shorthand for this problem is x %in% y. This will select every row where x is one of the values in y. We could use it to rewrite the preceding code:

nov_dec <- filter(flights, month %in% c(11, 12))

Sometimes you can simplify complicated subsetting by remembering De Morgan's law: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y. For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:

filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)

As well as & and |, R also has && and ||. Don't use them here! You'll learn when you should use them in "Conditional Execution" on page 276.

Whenever you start using complicated, multipart expressions in filter(), consider making them explicit variables instead. That makes it much easier to check your work. You'll learn how to create new variables shortly (there's a short sketch of this idea at the end of this section).

Missing Values

One important feature of R that can make comparison tricky is missing values, or NAs ("not availables"). NA represents an unknown value, so missing values are "contagious"; almost any operation involving an unknown value will also be unknown:

NA > 5
#> [1] NA
10 == NA
#> [1] NA
NA + 10
#> [1] NA
NA / 2
#> [1] NA

The most confusing result is this one:

NA == NA
#> [1] NA

It's easiest to understand why this is true with a bit more context:

# Let x be Mary's age. We don't know how old she is.
x <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# Are John and Mary the same age?
x == y
#> [1] NA
# We don't know!

If you want to determine if a value is missing, use is.na():

is.na(x)
#> [1] TRUE

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
#> # A tibble: 1 × 1
#>       x
#>   <dbl>
#> 1     3
filter(df, is.na(x) | x > 1)
#> # A tibble: 2 × 1
#>       x
#>   <dbl>
#> 1    NA
#> 2     3

Exercises

1. Find all flights that:
   a. Had an arrival delay of two or more hours
   b. Flew to Houston (IAH or HOU)
   c. Were operated by United, American, or Delta
   d. Departed in summer (July, August, and September)
   e. Arrived more than two hours late, but didn't leave late
   f. Were delayed by at least an hour, but made up over 30 minutes in flight
   g. Departed between midnight and 6 a.m. (inclusive)
2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
4. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
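And here is the promised sketch of the "explicit variables" tip. It uses mutate(), which is covered later in this chapter, and big_delay is a made-up name for illustration:

# Name the condition once, then filter on it
delayed <- mutate(flights, big_delay = arr_delay > 120 | dep_delay > 120)
filter(delayed, !big_delay)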
Arrange Rows with arrange()

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

arrange(flights, year, month, day)
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay
#>   <int> <int> <int>    <int>          <int>     <dbl>
#> 1  2013     1     1      517            515         2
#> 2  2013     1     1      533            529         4
#> 3  2013     1     1      542            540         2
#> 4  2013     1     1      544            545        -1
#> 5  2013     1     1      554            600        -6
#> 6  2013     1     1      554            558        -4
#> # ... with 3.368e+05 more rows, and 13 more variables:
#> #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

Use desc() to reorder by a column in descending order:

arrange(flights, desc(arr_delay))
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay
#>   <int> <int> <int>    <int>          <int>     <dbl>
#> 1  2013     1     9      641            900      1301
#> 2  2013     6    15     1432           1935      1137
#> 3  2013     1    10     1121           1635      1126
#> 4  2013     9    20     1139           1845      1014
#> 5  2013     7    22      845           1600      1005
#> 6  2013     4    10     1100           1900       960
#> # ... with 3.368e+05 more rows, and 13 more variables:
#> #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

Missing values are always sorted at the end:

df <- tibble(x = c(5, 2, NA))
arrange(df, x)
#> # A tibble: 3 × 1
#>       x
#>   <dbl>
#> 1     2
#> 2     5
#> 3    NA
arrange(df, desc(x))
#> # A tibble: 3 × 1
#>       x
#>   <dbl>
#> 1     5
#> 2     2
#> 3    NA
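As a sketch of what a "more complicated expression" can look like, you can also sort by a computed value; the expression here is illustrative (roughly, the flights that made up the most time in the air):

arrange(flights, desc(dep_delay - arr_delay))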
Exercises

1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na().)
2. Sort flights to find the most delayed flights. Find the flights that left earliest.
3. Sort flights to find the fastest flights.
4. Which flights traveled the longest? Which traveled the shortest?

Select Columns with select()

It's not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you're actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

select() is not terribly useful with the flight data because we only have 19 variables, but you can still get the general idea:

# Select columns by name
select(flights, year, month, day)
#> # A tibble: 336,776 × 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> 5  2013     1     1
#> 6  2013     1     1
#> # ... with 3.368e+05 more rows

# Select all columns between year and day (inclusive)
select(flights, year:day)
#> # A tibble: 336,776 × 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> 5  2013     1     1
#> 6  2013     1     1
#> # ... with 3.368e+05 more rows

# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
#> # A tibble: 336,776 × 16
#>   dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>      <int>          <int>     <dbl>    <int>          <int>
#> 1      517            515         2      830            819
#> 2      533            529         4      850            830
#> 3      542            540         2      923            850
#> 4      544            545        -1     1004           1022
#> 5      554            600        -6      812            837
#> 6      554            558        -4      740            728
#> # ... with 3.368e+05 more rows, and 11 more variables:
#> #   arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>

There are a number of helper functions you can use within select():

• starts_with("abc") matches names that begin with "abc".
• ends_with("xyz") matches names that end with "xyz".
• contains("ijk") matches names that contain "ijk".
• matches("(.)\\1") selects variables that match a regular expression. This one matches any variables that contain repeated characters. You'll learn more about regular expressions in Chapter 11.
• num_range("x", 1:3) matches x1, x2, and x3.

See ?select for more details.

select() can be used to rename variables, but it's rarely useful because it drops all of the variables not explicitly mentioned. Instead, use rename(), which is a variant of select() that keeps all the variables that aren't explicitly mentioned:

rename(flights, tail_num = tailnum)
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay
#>   <int> <int> <int>    <int>          <int>     <dbl>
#> 1  2013     1     1      517            515         2
#> 2  2013     1     1      533            529         4
#> 3  2013     1     1      542            540         2
#> 4  2013     1     1      544            545        -1
#> 5  2013     1     1      554            600        -6
#> 6  2013     1     1      554            558        -4
#> # ... with 3.368e+05 more rows, and 13 more variables:
#> #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tail_num <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>

Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you'd like to move to the start of the data frame:

select(flights, time_hour, air_time, everything())
#> # A tibble: 336,776 × 19
#>             time_hour air_time  year month   day dep_time
#>                <dttm>    <dbl> <int> <int> <int>    <int>
#> 1 2013-01-01 05:00:00      227  2013     1     1      517
#> 2 2013-01-01 05:00:00      227  2013     1     1      533
#> 3 2013-01-01 05:00:00      160  2013     1     1      542
#> 4 2013-01-01 05:00:00      183  2013     1     1      544
#> 5 2013-01-01 06:00:00      116  2013     1     1      554
#> 6 2013-01-01 05:00:00      150  2013     1     1      554
#> # ... with 3.368e+05 more rows, and 13 more variables:
#> #   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>

Exercises

1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
2. What happens if you include the name of a variable multiple times in a select() call?
3. What does the one_of() function do? Why might it be helpful in conjunction with this vector?

   vars <- c(
     "year", "month", "day", "dep_delay", "arr_delay"
   )

4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

   select(flights, contains("TIME"))

Add New Variables with mutate()

Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns. That's the job of mutate().

mutate() always adds new columns at the end of your dataset, so we'll start by creating a narrower dataset so we can see the new variables. Remember that when you're in RStudio, the easiest way to see all the columns is View():

flights_sml <- select(flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)
mutate(flights_sml,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)
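Note that mutate() lets later expressions refer to columns created earlier in the same call. A minimal sketch building on flights_sml; hours and gain_per_hour are made-up names for illustration:

mutate(flights_sml,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours  # uses gain and hours, created just above
)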