Understanding Machine Learning
Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way. The book provides an extensive theoretical account of the fundamental ideas underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics of the field, the book covers a wide array of central topics that have not been addressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds. Designed for an advanced undergraduate or beginning graduate course, the text makes the fundamentals and algorithms of machine learning accessible to students and nonexpert readers in statistics, computer science, mathematics, and engineering.
Shai Shalev-Shwartz is an Associate Professor at the School of Computer Science and Engineering at The Hebrew University, Israel.
Shai Ben-David is a Professor in the School of Computer Science at the University of Waterloo, Canada.
UNDERSTANDING MACHINE LEARNING
From Theory to
Algorithms
Shai Shalev-Shwartz
The Hebrew University, Jerusalem
Shai Ben-David
University of Waterloo, Canada
32 Avenue of the Americas, New York, NY 10013-2473, USA
Cambridge University Press is part of the University of Cambridge.
It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning and research at the highest international levels of excellence.
www.cambridge.org
Information on this title: www.cambridge.org/9781107057135
© Shai Shalev-Shwartz and Shai Ben-David 2014
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2014
Printed in the United States of America
A catalog record for this publication is available from the British Library.
Library of Congress Cataloging in Publication Data
ISBN 978-1-107-05713-5 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet Web sites referred to in this publication, and does not guarantee that any content on such Web sites is, or will remain, accurate or appropriate.
Triple-S dedicates the book to triple-M
Contents
Preface

1 Introduction
   1.1 What Is Learning?
   1.2 When Do We Need Machine Learning?
   1.3 Types of Learning
   1.4 Relations to Other Fields
   1.5 How to Read This Book
   1.6 Notation

Part 1 Foundations

2 A Gentle Start
   2.1 A Formal Model – The Statistical Learning Framework
   2.2 Empirical Risk Minimization
   2.3 Empirical Risk Minimization with Inductive Bias
   2.4 Exercises

3 A Formal Learning Model
   3.1 PAC Learning
   3.2 A More General Learning Model
   3.3 Summary
   3.4 Bibliographic Remarks
   3.5 Exercises

4 Learning via Uniform Convergence
   4.1 Uniform Convergence Is Sufficient for Learnability
   4.2 Finite Classes Are Agnostic PAC Learnable
   4.3 Summary
   4.4 Bibliographic Remarks
   4.5 Exercises

5 The Bias-Complexity Tradeoff
   5.1 The No-Free-Lunch Theorem
   5.2 Error Decomposition
   5.3 Summary
   5.4 Bibliographic Remarks
   5.5 Exercises

6 The VC-Dimension
   6.1 Infinite-Size Classes Can Be Learnable
   6.2 The VC-Dimension
   6.3 Examples
   6.4 The Fundamental Theorem of PAC Learning
   6.5 Proof of Theorem 6.7
   6.6 Summary
   6.7 Bibliographic Remarks
   6.8 Exercises

7 Nonuniform Learnability
   7.1 Nonuniform Learnability
   7.2 Structural Risk Minimization
   7.3 Minimum Description Length and Occam’s Razor
   7.4 Other Notions of Learnability – Consistency
   7.5 Discussing the Different Notions of Learnability
   7.6 Summary
   7.7 Bibliographic Remarks
   7.8 Exercises

8 The Runtime of Learning
   8.1 Computational Complexity of Learning
   8.2 Implementing the ERM Rule
   8.3 Efficiently Learnable, but Not by a Proper ERM
   8.4 Hardness of Learning*
   8.5 Summary
   8.6 Bibliographic Remarks
   8.7 Exercises

Part 2 From Theory to Algorithms

9 Linear Predictors
   9.1 Halfspaces
   9.2 Linear Regression
   9.3 Logistic Regression
   9.4 Summary
   9.5 Bibliographic Remarks
   9.6 Exercises

10 Boosting
   10.1 Weak Learnability
   10.2 AdaBoost
   10.3 Linear Combinations of Base Hypotheses
   10.4 AdaBoost for Face Recognition
   10.5 Summary
   10.6 Bibliographic Remarks
   10.7 Exercises

11 Model Selection and Validation
   11.1 Model Selection Using SRM
   11.2 Validation
   11.3 What to Do If Learning Fails
   11.4 Summary
   11.5 Exercises

12 Convex Learning Problems
   12.1 Convexity, Lipschitzness, and Smoothness
   12.2 Convex Learning Problems
   12.3 Surrogate Loss Functions
   12.4 Summary
   12.5 Bibliographic Remarks
   12.6 Exercises

13 Regularization and Stability
   13.1 Regularized Loss Minimization
   13.2 Stable Rules Do Not Overfit
   13.3 Tikhonov Regularization as a Stabilizer
   13.4 Controlling the Fitting-Stability Tradeoff
   13.5 Summary
   13.6 Bibliographic Remarks
   13.7 Exercises

14 Stochastic Gradient Descent
   14.1 Gradient Descent
   14.2 Subgradients
   14.3 Stochastic Gradient Descent (SGD)
   14.4 Variants
   14.5 Learning with SGD
   14.6 Summary
   14.7 Bibliographic Remarks
   14.8 Exercises

15 Support Vector Machines
   15.1 Margin and Hard-SVM
   15.2 Soft-SVM and Norm Regularization
   15.3 Optimality Conditions and “Support Vectors”*
   15.4 Duality*
   15.5 Implementing Soft-SVM Using SGD
   15.6 Summary
   15.7 Bibliographic Remarks
   15.8 Exercises

16 Kernel Methods
   16.1 Embeddings into Feature Spaces
   16.2 The Kernel Trick
   16.3 Implementing Soft-SVM with Kernels
   16.4 Summary
   16.5 Bibliographic Remarks
   16.6 Exercises

17 Multiclass, Ranking, and Complex Prediction Problems
   17.1 One-versus-All and All-Pairs
   17.2 Linear Multiclass Predictors
   17.3 Structured Output Prediction
   17.4 Ranking
   17.5 Bipartite Ranking and Multivariate Performance Measures
   17.6 Summary
   17.7 Bibliographic Remarks
   17.8 Exercises

18 Decision Trees
   18.1 Sample Complexity
   18.2 Decision Tree Algorithms
   18.3 Random Forests
   18.4 Summary
   18.5 Bibliographic Remarks
   18.6 Exercises

19 Nearest Neighbor
   19.1 k Nearest Neighbors
   19.2 Analysis
   19.3 Efficient Implementation*
   19.4 Summary
   19.5 Bibliographic Remarks
   19.6 Exercises

20 Neural Networks
   20.1 Feedforward Neural Networks
   20.2 Learning Neural Networks
   20.3 The Expressive Power of Neural Networks
   20.4 The Sample Complexity of Neural Networks
   20.5 The Runtime of Learning Neural Networks
   20.6 SGD and Backpropagation
   20.7 Summary
   20.8 Bibliographic Remarks
   20.9 Exercises

Part 3 Additional Learning Models

21 Online Learning
   21.1 Online Classification in the Realizable Case
   21.2 Online Classification in the Unrealizable Case
   21.3 Online Convex Optimization
   21.4 The Online Perceptron Algorithm
   21.5 Summary
   21.6 Bibliographic Remarks
   21.7 Exercises

22 Clustering
   22.1 Linkage-Based Clustering Algorithms
   22.2 k-Means and Other Cost Minimization Clusterings
   22.3 Spectral Clustering
   22.4 Information Bottleneck*
   22.5 A High Level View of Clustering
   22.6 Summary
   22.7 Bibliographic Remarks
   22.8 Exercises

23 Dimensionality Reduction
   23.1 Principal Component Analysis (PCA)
   23.2 Random Projections
   23.3 Compressed Sensing
   23.4 PCA or Compressed Sensing?
   23.5 Summary
   23.6 Bibliographic Remarks
   23.7 Exercises

24 Generative Models
   24.1 Maximum Likelihood Estimator
   24.2 Naive Bayes
   24.3 Linear Discriminant Analysis
   24.4 Latent Variables and the EM Algorithm
   24.5 Bayesian Reasoning
   24.6 Summary
   24.7 Bibliographic Remarks
   24.8 Exercises

Part 4 Advanced Theory

26 Rademacher Complexities
   26.1 The Rademacher Complexity
   26.2 Rademacher Complexity of Linear Classes
   26.3 Generalization Bounds for SVM
   26.4 Generalization Bounds for Predictors with Low ℓ1 Norm
   26.5 Bibliographic Remarks

25 Feature Selection and Generation
   25.1 Feature Selection
   25.2 Feature Manipulation and Normalization
   25.3 Feature Learning
   25.4 Summary
   25.5 Bibliographic Remarks
   25.6 Exercises

27 Covering Numbers
   27.1 Covering
   27.2 From Covering to Rademacher Complexity via Chaining
   27.3 Bibliographic Remarks

28 Proof of the Fundamental Theorem of Learning Theory
   28.1 The Upper Bound for the Agnostic Case
   28.2 The Lower Bound for the Agnostic Case
   28.3 The Upper Bound for the Realizable Case

29 Multiclass Learnability
   29.1 The Natarajan Dimension
   29.2 The Multiclass Fundamental Theorem
   29.3 Calculating the Natarajan Dimension
   29.4 On Good and Bad ERMs
   29.5 Bibliographic Remarks
   29.6 Exercises

30 Compression Bounds
   30.1 Compression Bounds
   30.2 Examples
   30.3 Bibliographic Remarks

31 PAC-Bayes
   31.1 PAC-Bayes Bounds
   31.2 Bibliographic Remarks
   31.3 Exercises

Appendix A Technical Lemmas

Appendix B Measure Concentration
   B.1 Markov’s Inequality
   B.2 Chebyshev’s Inequality
   B.3 Chernoff’s Bounds
   B.4 Hoeffding’s Inequality
   B.5 Bennett’s and Bernstein’s Inequalities
   B.6 Slud’s Inequality
   B.7 Concentration of χ² Variables

Appendix C Linear Algebra
   C.1 Basic Definitions
   C.2 Eigenvalues and Eigenvectors
   C.3 Positive Definite Matrices
   C.4 Singular Value Decomposition (SVD)

References
Index
Preface
The term machine learning refers to the automated detection of meaningful patterns in data. In the past couple of decades it has become a common tool in almost any task that requires information extraction from large data sets. We are surrounded by machine learning based technology: Search engines learn how to bring us the best results (while placing profitable ads), antispam software learns to filter our e-mail messages, and credit card transactions are secured by software that learns how to detect fraud. Digital cameras learn to detect faces and intelligent personal assistant applications on smartphones learn to recognize voice commands. Cars are equipped with accident prevention systems that are built using machine learning algorithms. Machine learning is also widely used in scientific applications such as bioinformatics, medicine, and astronomy.
One common feature of all of these applications is that, in contrast to more traditional uses of computers, in these cases, due to the complexity of the patterns that need to be detected, a human programmer cannot provide an explicit, fine-detailed specification of how such tasks should be executed. Taking example from intelligent beings, many of our skills are acquired or refined through learning from our experience (rather than following explicit instructions given to us). Machine learning tools are concerned with endowing programs with the ability to “learn” and adapt.
The first goal of this book is to provide a rigorous, yet easy to follow, introduction to the main concepts underlying machine learning: What is learning? How can a machine learn? How do we quantify the resources needed to learn a given concept? Is learning always possible? Can we know whether the learning process succeeded or failed?
The second goal of this book is to present several key machine learning algorithms. We chose to present algorithms that on one hand are successfully used in practice and on the other hand give a wide spectrum of different learning techniques. Additionally, we pay specific attention to algorithms appropriate for large scale learning (a.k.a. “Big Data”), since in recent years, our world has become increasingly “digitized” and the amount of data available for learning is dramatically increasing. As a result, in many applications data is plentiful and computation
time is the main bottleneck. We therefore explicitly quantify both the amount of data and the amount of computation time needed to learn a given concept.

The book is divided into four parts. The first part aims at giving an initial rigorous answer to the fundamental questions of learning. We describe a generalization of Valiant’s Probably Approximately Correct (PAC) learning model, which is a first solid answer to the question “What is learning?” We describe the Empirical Risk Minimization (ERM), Structural Risk Minimization (SRM), and Minimum Description Length (MDL) learning rules, which show “how a machine can learn.” We quantify the amount of data needed for learning using the ERM, SRM, and MDL rules and show how learning might fail by deriving a “no-free-lunch” theorem. We also discuss how much computation time is required for learning. In the second part of the book we describe various learning algorithms. For some of the algorithms, we first present a more general learning principle, and then show how the algorithm follows the principle. While the first two parts of the book focus on the PAC model, the third part extends the scope by presenting a wider variety of learning models. Finally, the last part of the book is devoted to advanced theory.
We made an attempt to keep the book as self-contained as possible. However, the reader is assumed to be comfortable with basic notions of probability, linear algebra, analysis, and algorithms. The first three parts of the book are intended for first year graduate students in computer science, engineering, mathematics, or statistics. They can also be accessible to undergraduate students with an adequate background. The more advanced chapters can be used by researchers intending to gather a deeper theoretical understanding.
ACKNOWLEDGMENTS
The book is based on Introduction to Machine Learning courses taught by Shai Shalev-Shwartz at the Hebrew University and by Shai Ben-David at the University of Waterloo. The first draft of the book grew out of the lecture notes for the course that was taught at the Hebrew University by Shai Shalev-Shwartz during 2010–2013. We greatly appreciate the help of Ohad Shamir, who served as a TA for the course in 2010, and of Alon Gonen, who served as a TA for the course in 2011–2013. Ohad and Alon prepared a few lecture notes and many of the exercises. Alon, to whom we are indebted for his help throughout the entire making of the book, has also prepared a solution manual.
We are deeply grateful for the most valuable work of Dana Rubinstein. Dana has scientifically proofread and edited the manuscript, transforming it from lecture-based chapters into fluent and coherent text.
Special thanks to Amit Daniely, who helped us with a careful read of the advanced part of the book and wrote the advanced chapter on multiclass learnability. We are also grateful for the members of a book reading club in Jerusalem who have carefully read and constructively criticized every line of the manuscript. The members of the reading club are Maya Alroy, Yossi Arjevani, Aharon Birnbaum, Alon Cohen, Alon Gonen, Roi Livni, Ofer Meshi, Dan Rosenbaum, Dana Rubinstein, Shahar Somin, Alon Vinnikov, and Yoav Wald. We would also like to thank Gal Elidan, Amir Globerson, Nika Haghtalab, Shie Mannor, Amnon Shashua, Nati Srebro, and Ruth Urner for helpful discussions.
1
Introduction
The subject of this book is automated learning, or, as we will more often call it, Machine Learning (ML). That is, we wish to program computers so that they can “learn” from input available to them. Roughly speaking, learning is the process of converting experience into expertise or knowledge. The input to a learning algorithm is training data, representing experience, and the output is some expertise, which usually takes the form of another computer program that can perform some task. Seeking a formal-mathematical understanding of this concept, we’ll have to be more explicit about what we mean by each of the involved terms: What is the training data our programs will access? How can the process of learning be automated? How can we evaluate the success of such a process (namely, the quality of the output of a learning program)?
1.1 WHAT IS LEARNING?
Let us begin by considering a couple of examples from naturally occurring animal learning. Some of the most fundamental issues in ML arise already in that context, which we are all familiar with.
Bait Shyness – Rats Learning to Avoid Poisonous Baits: When rats encounter food items with novel look or smell, they will first eat very small amounts, and subsequent feeding will depend on the flavor of the food and its physiological effect. If the food produces an ill effect, the novel food will often be associated with the illness, and subsequently, the rats will not eat it. Clearly, there is a learning mechanism in play here – the animal used past experience with some food to acquire expertise in detecting the safety of this food. If past experience with the food was negatively labeled, the animal predicts that it will also have a negative effect when encountered in the future.
Inspired by the preceding example of successful learning, let us demonstrate a typical machine learning task. Suppose we would like to program a machine that learns how to filter spam e-mails. A naive solution would be seemingly similar to the way rats learn how to avoid poisonous baits. The machine will simply memorize all previous e-mails that had been labeled as spam e-mails by the human user. When a
new e-mail arrives, the machine will search for it in the set of previous spam e-mails. If it matches one of them, it will be trashed. Otherwise, it will be moved to the user’s inbox folder.
While the preceding “learning by memorization” approach is sometimes useful, it lacks an important aspect of learning systems – the ability to label unseen e-mail messages. A successful learner should be able to progress from individual examples to broader generalization. This is also referred to as inductive reasoning or inductive inference. In the bait shyness example presented previously, after the rats encounter an example of a certain type of food, they apply their attitude toward it on new, unseen examples of food of similar smell and taste. To achieve generalization in the spam filtering task, the learner can scan the previously seen e-mails, and extract a set of words whose appearance in an e-mail message is indicative of spam. Then, when a new e-mail arrives, the machine can check whether one of the suspicious words appears in it, and predict its label accordingly. Such a system would potentially be able to correctly predict the label of unseen e-mails.
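As a concrete, deliberately naive illustration of the contrast between memorization and generalization, here is a minimal Python sketch; the suspicious-word list and the example messages are invented for illustration and are not from the original text:

```python
# Toy contrast: "learning by memorization" versus generalizing
# from indicative words. Word list and messages are hypothetical.

SUSPICIOUS_WORDS = {"winner", "free", "urgent", "prize"}

def predict_by_memorization(email, seen_spam):
    # Flags only messages that exactly match previously labeled spam.
    return email in seen_spam

def predict_by_words(email):
    # Generalizes: flags any message containing a suspicious word.
    return bool(set(email.lower().split()) & SUSPICIOUS_WORDS)

seen_spam = {"you are a winner claim your prize"}
new_email = "urgent claim your free prize now"

print(predict_by_memorization(new_email, seen_spam))  # False: never seen verbatim
print(predict_by_words(new_email))                    # True: labels an unseen message
```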
However, inductive reasoning might lead us to false conclusions. To illustrate this, let us consider again an example from animal learning.
Pigeon Superstition: In an experiment performed by the psychologist B. F. Skinner, he placed a bunch of hungry pigeons in a cage. An automatic mechanism had been attached to the cage, delivering food to the pigeons at regular intervals with no reference whatsoever to the birds’ behavior. The hungry pigeons went around the cage, and when food was first delivered, it found each pigeon engaged in some activity (pecking, turning the head, etc.). The arrival of food reinforced each bird’s specific action, and consequently, each bird tended to spend some more time doing that very same action. That, in turn, increased the chance that the next random food delivery would find each bird engaged in that activity again. What results is a chain of events that reinforces the pigeons’ association of the delivery of the food with whatever chance actions they had been performing when it was first delivered. They subsequently continue to perform these same actions diligently.1
What distinguishes learning mechanisms that result in superstition from useful learning? This question is crucial to the development of automated learners. While human learners can rely on common sense to filter out random meaningless learning conclusions, once we export the task of learning to a machine, we must provide well-defined crisp principles that will protect the program from reaching senseless or useless conclusions. The development of such principles is a central goal of the theory of machine learning.
What, then, made the rats’ learning more successful than that of the pigeons? As a first step toward answering this question, let us have a closer look at the bait shyness phenomenon in rats.
Bait Shyness revisited – rats fail to acquire conditioning between food and electric shock or between sound and nausea: The bait shyness mechanism in rats turns out to be more complex than what one may expect. In experiments carried out by Garcia (Garcia & Koelling 1966), it was demonstrated that if the unpleasant stimulus that follows food consumption is replaced by, say, electrical shock (rather than nausea), then no conditioning occurs. Even after repeated trials in which the consumption
1 See: http://psychclassics.yorku.ca/Skinner/Pigeon
of some food is followed by the administration of unpleasant electrical shock, the rats do not tend to avoid that food. Similar failure of conditioning occurs when the characteristic of the food that implies nausea (such as taste or smell) is replaced by a vocal signal. The rats seem to have some “built in” prior knowledge telling them that, while temporal correlation between food and nausea can be causal, it is unlikely that there would be a causal relationship between food consumption and electrical shocks or between sounds and nausea.
We conclude that one distinguishing feature between the bait shyness learning and the pigeon superstition is the incorporation of prior knowledge that biases the learning mechanism. This is also referred to as inductive bias. The pigeons in the experiment are willing to adopt any explanation for the occurrence of food. However, the rats “know” that food cannot cause an electric shock and that the co-occurrence of noise with some food is not likely to affect the nutritional value of that food. The rats’ learning process is biased toward detecting some kind of patterns while ignoring other temporal correlations between events.
It turns out that the incorporation of prior knowledge, biasing the learning process, is inevitable for the success of learning algorithms (this is formally stated and proved as the “No-Free-Lunch theorem” in Chapter 5). The development of tools for expressing domain expertise, translating it into a learning bias, and quantifying the effect of such a bias on the success of learning is a central theme of the theory of machine learning. Roughly speaking, the stronger the prior knowledge (or prior assumptions) that one starts the learning process with, the easier it is to learn from further examples. However, the stronger these prior assumptions are, the less flexible the learning is – it is bound, a priori, by the commitment to these assumptions. We shall discuss these issues explicitly in Chapter 5.
1.2 WHEN DO WE NEED MACHINE LEARNING?
When do we need machine learning rather than directly program our computers to carry out the task at hand? Two aspects of a given problem may call for the use of programs that learn and improve on the basis of their “experience”: the problem’s complexity and the need for adaptivity.
Tasks That Are Too Complex to Program.
Tasks Performed by Animals/Humans: There are numerous tasks that we human beings perform routinely, yet our introspection concerning how we do them is not sufficiently elaborate to extract a well-defined program. Examples of such tasks include driving, speech recognition, and image understanding. In all of these tasks, state-of-the-art machine learning programs, programs that “learn from their experience,” achieve quite satisfactory results, once exposed to sufficiently many training examples.
Tasks beyond Human Capabilities: Another wide family of tasks that benefit from machine learning techniques are related to the analysis of very large and complex data sets: astronomical data, turning medical archives into medical knowledge, weather prediction, analysis of genomic data, Web search engines, and electronic commerce. With more and more available
digitally recorded data, it becomes obvious that there are treasures of meaningful information buried in data archives that are way too large and too complex for humans to make sense of. Learning to detect meaningful patterns in large and complex data sets is a promising domain in which the combination of programs that learn with the almost unlimited memory capacity and ever increasing processing speed of computers opens up new horizons.
Adaptivity. One limiting feature of programmed tools is their rigidity – once the program has been written down and installed, it stays unchanged. However, many tasks change over time or from one user to another. Machine learning tools – programs whose behavior adapts to their input data – offer a solution to such issues; they are, by nature, adaptive to changes in the environment they interact with. Typical successful applications of machine learning to such problems include programs that decode handwritten text, where a fixed program can adapt to variations between the handwriting of different users; spam detection programs, adapting automatically to changes in the nature of spam e-mails; and speech recognition programs.
1.3 TYPES OF LEARNING
Learning is, of course, a very wide domain. Consequently, the field of machine learning has branched into several subfields dealing with different types of learning tasks. We give a rough taxonomy of learning paradigms, aiming to provide some perspective of where the content of this book sits within the wide field of machine learning.
We describe four parameters along which learning paradigms can be classified.
Supervised versus Unsupervised Since learning involves an interaction between the learner and the environment, one can divide learning tasks according to the nature of that interaction. The first distinction to note is the difference between supervised and unsupervised learning. As an illustrative example, consider the task of learning to detect spam e-mail versus the task of anomaly detection. For the spam detection task, we consider a setting in which the learner receives training e-mails for which the label spam/not-spam is provided. On the basis of such training the learner should figure out a rule for labeling a newly arriving e-mail message. In contrast, for the task of anomaly detection, all the learner gets as training is a large body of e-mail messages (with no labels) and the learner’s task is to detect “unusual” messages.
More abstractly, viewing learning as a process of “using experience to gain expertise,” supervised learning describes a scenario in which the “experience,” a training example, contains significant information (say, the spam/not-spam labels) that is missing in the unseen “test examples” to which the learned expertise is to be applied. In this setting, the acquired expertise is aimed to predict that missing information for the test data. In such cases, we can think of the environment as a teacher that “supervises” the learner by providing the extra information (labels). In unsupervised learning, however, there is no distinction between training and test data. The learner processes input data with the goal
of coming up with some summary, or compressed version of that data. Clustering a data set into subsets of similar objects is a typical example of such a task.
There is also an intermediate learning setting in which, while the training examples contain more information than the test examples, the learner is required to predict even more information for the test examples. For example, one may try to learn a value function that describes for each setting of a chess board the degree by which White’s position is better than the Black’s. Yet, the only information available to the learner at training time is positions that occurred throughout actual chess games, labeled by who eventually won that game. Such learning frameworks are mainly investigated under the title of reinforcement learning.
Active versus Passive Learners Learning paradigms can vary by the role played by the learner. We distinguish between “active” and “passive” learners. An active learner interacts with the environment at training time, say, by posing queries or performing experiments, while a passive learner only observes the information provided by the environment (or the teacher) without influencing or directing it. Note that the learner of a spam filter is usually passive – waiting for users to mark the e-mails coming to them. In an active setting, one could imagine asking users to label specific e-mails chosen by the learner, or even composed by the learner, to enhance its understanding of what spam is.
Helpfulness of the Teacher When one thinks about human learning, of a baby at home or a student at school, the process often involves a helpful teacher, who is trying to feed the learner with the information most useful for achieving the learning goal. In contrast, when a scientist learns about nature, the environment, playing the role of the teacher, can be best thought of as passive – apples drop, stars shine, and the rain falls without regard to the needs of the learner. We model such learning scenarios by postulating that the training data (or the learner’s experience) is generated by some random process. This is the basic building block in the branch of “statistical learning.” Finally, learning also occurs when the learner’s input is generated by an adversarial “teacher.” This may be the case in the spam filtering example (if the spammer makes an effort to mislead the spam filtering designer) or in learning to detect fraud. One also uses an adversarial teacher model as a worst-case scenario, when no milder setup can be safely assumed. If you can learn against an adversarial teacher, you are guaranteed to succeed interacting with any odd teacher.
Online versus Batch Learning Protocol The last parameter we mention is the distinction between situations in which the learner has to respond online, throughout the learning process, and settings in which the learner has to engage the acquired expertise only after having a chance to process large amounts of data. For example, a stockbroker has to make daily decisions, based on the experience collected so far. He may become an expert over time, but might have made costly mistakes in the process. In contrast, in many data mining settings, the learner – the data miner – has large amounts of training data to play with before having to output conclusions.
In this book we shall discuss only a subset of the possible learning paradigms. Our main focus is on supervised statistical batch learning with a passive learner (for example, trying to learn how to generate patients’ prognoses, based on large archives of records of patients that were independently collected and are already labeled by the fate of the recorded patients). We shall also briefly discuss online learning and batch unsupervised learning (in particular, clustering).
1.4 RELATIONS TO OTHER FIELDS
As an interdisciplinary field, machine learning shares common threads with the mathematical fields of statistics, information theory, game theory, and optimization. It is naturally a subfield of computer science, as our goal is to program machines so that they will learn. In a sense, machine learning can be viewed as a branch of AI (Artificial Intelligence), since, after all, the ability to turn experience into expertise or to detect meaningful patterns in complex sensory data is a cornerstone of human (and animal) intelligence. However, one should note that, in contrast with traditional AI, machine learning is not trying to build automated imitation of intelligent behavior, but rather to use the strengths and special abilities of computers to complement human intelligence, often performing tasks that fall way beyond human capabilities. For example, the ability to scan and process huge databases allows machine learning programs to detect patterns that are outside the scope of human perception.
The component of experience, or training, in machine learning often refers to data that is randomly generated. The task of the learner is to process such randomly generated examples toward drawing conclusions that hold for the environment from which these examples are picked. This description of machine learning highlights its close relationship with statistics. Indeed there is a lot in common between the two disciplines, in terms of both the goals and techniques used. There are, however, a few significant differences of emphasis; if a doctor comes up with the hypothesis that there is a correlation between smoking and heart disease, it is the statistician’s role to view samples of patients and check the validity of that hypothesis (this is the common statistical task of hypothesis testing). In contrast, machine learning aims to use the data gathered from samples of patients to come up with a description of the causes of heart disease. The hope is that automated techniques may be able to figure out meaningful patterns (or hypotheses) that may have been missed by the human observer.
In contrast with traditional statistics, in machine learning in general, and in this book in particular, algorithmic considerations play a major role. Machine learning is about the execution of learning by computers; hence algorithmic issues are pivotal. We develop algorithms to perform the learning tasks and are concerned with their computational efficiency. Another difference is that while statistics is often interested in asymptotic behavior (like the convergence of sample-based statistical estimates as the sample sizes grow to infinity), the theory of machine learning focuses on finite sample bounds. Namely, given the size of available samples, machine learning theory aims to figure out the degree of accuracy that a learner can expect on the basis of such samples.
There are further differences between these two disciplines, of which we shall mention only one more here. While in statistics it is common to work under the assumption of certain presubscribed data models (such as assuming the normality of data-generating distributions, or the linearity of functional dependencies), in machine learning the emphasis is on working under a “distribution-free” setting, where the learner assumes as little as possible about the nature of the data distribution and allows the learning algorithm to figure out which models best approximate the data-generating process. A precise discussion of this issue requires some technical preliminaries, and we will come back to it later in the book, and in particular in Chapter 5.
1.5 HOW TO READ THIS BOOK
The first part of the book provides the basic theoretical principles that underlie machine learning (ML). In a sense, this is the foundation upon which the rest of the book is built. This part could serve as a basis for a minicourse on the theoretical foundations of ML.
The second part of the book introduces the most commonly used algorithmic approaches to supervised machine learning. A subset of these chapters may also be used for introducing machine learning in a general AI course to computer science, Math, or engineering students.
The third part of the book extends the scope of discussion from statistical classification to other learning models. It covers online learning, unsupervised learning, dimensionality reduction, generative models, and feature learning.
The fourth part of the book, Advanced Theory, is geared toward readers who have interest in research and provides the more technical mathematical techniques that serve to analyze and drive forward the field of theoretical machine learning.
The Appendixes provide some technical tools used in the book. In particular, we list basic results from measure concentration and linear algebra.
A few sections are marked by an asterisk, which means they are addressed to more advanced students. Each chapter is concluded with a list of exercises. A solution manual is provided in the course Web site.
1.5.1 Possible Course Plans Based on This Book
A 14 Week Introduction Course for Graduate Students:
1. Chapters 2–4.
2. Chapter 9 (without the VC calculation).
3. Chapters 5–6 (without proofs).
4. Chapter 10.
5. Chapters 7, 11 (without proofs).
6. Chapters 12, 13 (with some of the easier proofs).
7. Chapter 14 (with some of the easier proofs).
8. Chapter 15.
9. Chapter 16.
10. Chapter 18.
11. Chapter 22.
12. Chapter 23 (without proofs for compressed sensing).
13. Chapter 24.
14. Chapter 25.
A 14 Week Advanced Course for Graduate Students:
1. Chapters 26, 27.
2. (continued)
3. Chapters 6, 28.
4. Chapter 7.
5. Chapter 31.
6. Chapter 30.
7. Chapters 12, 13.
8. Chapter 14.
9. Chapter 8.
10. Chapter 17.
11. Chapter 29.
12. Chapter 19.
13. Chapter 20.
14. Chapter 21.
1.6 NOTATION
Most of the notation we use throughout the book is either standard or defined on the spot. In this section we describe our main conventions and provide a table summarizing our notation (Table 1.1). The reader is encouraged to skip this section and return to it if during the reading of the book some notation is unclear.
We denote scalars and abstract objects with lowercase letters (e.g. x and λ). Often, we would like to emphasize that some object is a vector and then we use boldface letters (e.g. x and λ). The ith element of a vector x is denoted by xi . We use uppercase letters to denote matrices, sets, and sequences. The meaning should be clear from the context. As we will see momentarily, the input of a learning algorithm is a sequence of training examples. We denote by z an abstract example and by S = z1,...,zm a sequence of m examples. Historically, S is often referred to as a training set; however, we will always assume that S is a sequence rather than a set. A sequence of m vectors is denoted by x1,...,xm. The ith element of xt is denoted by xt,i .
Throughout the book, we make use of basic notions from probability. We denote by D a distribution over some set,2 for example, Z. We use the notation z ∼ D to denote that z is sampled according to D. Given a random variable f : Z → R, its expected value is denoted by Ez∼D [ f (z)]. We sometimes use the shorthand E[ f ] when the dependence on z is clear from the context. For f : Z → {true,false} we also use Pz∼D [ f (z)] to denote D({z : f (z) = true}). In the next chapter we will also
2 To be mathematically precise, D should be defined over some σ-algebra of subsets of Z. The user who is not familiar with measure theory can skip the few footnotes and remarks regarding more formal measurability definitions and assumptions.
Table 1.1. Summary of notation

symbol : meaning
R : the set of real numbers
R^d : the set of d-dimensional vectors over R
R_+ : the set of non-negative real numbers
N : the set of natural numbers
O, o, Θ, ω, Ω, Õ : asymptotic notation (see text)
1[Boolean expression] : indicator function (equals 1 if the expression is true and 0 otherwise)
[a]_+ : = max{0, a}
[n] : the set {1, ..., n} (for n ∈ N)
x, v, w : (column) vectors
x_i, v_i, w_i : the ith element of a vector
⟨x, v⟩ : = Σ_{i=1}^d x_i v_i (inner product)
‖x‖_2 or ‖x‖ : = √⟨x, x⟩ (the ℓ2 norm of x)
‖x‖_1 : = Σ_{i=1}^d |x_i| (the ℓ1 norm of x)
‖x‖_∞ : = max_i |x_i| (the ℓ∞ norm of x)
‖x‖_0 : the number of nonzero elements of x
A ∈ R^{d,k} : a d × k matrix over R
A^T : the transpose of A
A_{i,j} : the (i, j) element of A
x x^T : the d × d matrix A s.t. A_{i,j} = x_i x_j (where x ∈ R^d)
x_1, ..., x_m : a sequence of m vectors
x_{i,j} : the jth element of the ith vector in the sequence
w^(1), ..., w^(T) : the values of a vector w during an iterative algorithm
w_i^(t) : the ith element of the vector w^(t)
X : instances domain (a set)
Y : labels domain (a set)
Z : examples domain (a set)
H : hypothesis class (a set)
ℓ : H × Z → R_+ : loss function
D : a distribution over some set (usually over Z or over X)
D(A) : the probability of a set A ⊆ Z according to D
z ∼ D : sampling z according to D
S = z_1, ..., z_m : a sequence of m examples
S ∼ D^m : sampling S = z_1, ..., z_m i.i.d. according to D
P, E : probability and expectation of a random variable
P_{z∼D}[f(z)] : = D({z : f(z) = true}) for f : Z → {true, false}
E_{z∼D}[f(z)] : expectation of the random variable f : Z → R
N(µ, C) : Gaussian distribution with expectation µ and covariance C
f'(x) : the derivative of a function f : R → R at x
f''(x) : the second derivative of a function f : R → R at x
∂f(w)/∂w_i : the partial derivative of a function f : R^d → R at w w.r.t. w_i
∇f(w) : the gradient of a function f : R^d → R at w
∂f(w) : the differential set of a function f : R^d → R at w
min_{x∈C} f(x) : = min{f(x) : x ∈ C} (minimal value of f over C)
max_{x∈C} f(x) : = max{f(x) : x ∈ C} (maximal value of f over C)
argmin_{x∈C} f(x) : the set {x ∈ C : f(x) = min_{z∈C} f(z)}
argmax_{x∈C} f(x) : the set {x ∈ C : f(x) = max_{z∈C} f(z)}
log : the natural logarithm
introduce the notation D^m to denote the probability over Z^m induced by sampling (z_1, ..., z_m) where each point z_i is sampled from D independently of the other points.

In general, we have made an effort to avoid asymptotic notation. However, we occasionally use it to clarify the main results. In particular, given f : R → R_+ and g : R → R_+ we write f = O(g) if there exist x_0, α ∈ R_+ such that for all x > x_0 we have f(x) ≤ αg(x). We write f = o(g) if for every α > 0 there exists x_0 such that for all x > x_0 we have f(x) ≤ αg(x). We write f = Ω(g) if there exist x_0, α ∈ R_+ such that for all x > x_0 we have f(x) ≥ αg(x). The notation f = ω(g) is defined analogously. The notation f = Θ(g) means that f = O(g) and g = O(f). Finally, the notation f = Õ(g) means that there exists k ∈ N such that f(x) = O(g(x) log^k(g(x))).

The inner product between vectors x and w is denoted by ⟨x, w⟩. Whenever we do not specify the vector space we assume that it is the d-dimensional Euclidean space and then ⟨x, w⟩ = Σ_{i=1}^d x_i w_i. The Euclidean (or ℓ2) norm of a vector w is ‖w‖_2 = √⟨w, w⟩. We omit the subscript from the ℓ2 norm when it is clear from the context. We also use other ℓp norms, ‖w‖_p = (Σ_i |w_i|^p)^{1/p}, and in particular ‖w‖_1 = Σ_i |w_i| and ‖w‖_∞ = max_i |w_i|.
We use the notation min_{x∈C} f(x) to denote the minimum value of the set {f(x) : x ∈ C}. To be mathematically more precise, we should use inf_{x∈C} f(x) whenever the minimum is not achievable. However, in the context of this book the distinction between infimum and minimum is often of little interest. Hence, to simplify the presentation, we sometimes use the min notation even when inf is more adequate. An analogous remark applies to max versus sup.
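As a brief worked example of the asymptotic definitions above (our illustration, not part of the original text):

```latex
f(x) = 3x^2 + 5x = O(x^2):\ \text{take } x_0 = 5,\ \alpha = 4;\ \text{for all } x > 5,
\quad f(x) = 3x^2 + 5x \le 3x^2 + x^2 = 4x^2 = \alpha x^2.
\text{Since also } f(x) \ge 3x^2 \text{ for } x > 0,\ f = \Omega(x^2),\ \text{hence } f = \Theta(x^2).
```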
PART 1
Foundations
2
A Gentle Start
Let us begin our mathematical analysis by showing how successful learning can be achieved in a relatively simplified setting. Imagine you have just arrived in some small Pacific island. You soon find out that papayas are a significant ingredient in the local diet. However, you have never before tasted papayas. You have to learn how to predict whether a papaya you see in the market is tasty or not. First, you need to decide which features of a papaya your prediction should be based on. On the basis of your previous experience with other fruits, you decide to use two features: the papaya’s color, ranging from dark green, through orange and red to dark brown, and the papaya’s softness, ranging from rock hard to mushy. Your input for figuring out your prediction rule is a sample of papayas that you have examined for color and softness and then tasted and found out whether they were tasty or not. Let us analyze this task as a demonstration of the considerations involved in learning problems.
Our first step is to describe a formal model aimed to capture such learning tasks.
2.1 A FORMAL MODEL – THE STATISTICAL LEARNING FRAMEWORK
The learner’s input: In the basic statistical learning setting, the learner has access to the following:
Domain set: An arbitrary set, X. This is the set of objects that we may wish to label. For example, in the papaya learning problem mentioned before, the domain set will be the set of all papayas. Usually, these domain points will be represented by a vector of features (like the papaya’s color and softness). We also refer to domain points as instances and to X as instance space.
Label set: For our current discussion, we will restrict the label set to be a two-element set, usually {0, 1} or {−1, +1}. Let Y denote our set of possible labels. For our papayas example, let Y be {0, 1}, where 1 represents being tasty and 0 stands for being not-tasty.
Training data: S = ((x_1, y_1), ..., (x_m, y_m)) is a finite sequence of pairs in X × Y: that is, a sequence of labeled domain points. This is the input that the
learner has access to (like a set of papayas that have been tasted and their color, softness, and tastiness). Such labeled examples are often called training examples. We sometimes also refer to S as a training set.1
The learner’s output: The learner is requested to output a prediction rule, h : X → Y. This function is also called a predictor, a hypothesis, or a classifier. The predictor can be used to predict the label of new domain points. In our papayas example, it is a rule that our learner will employ to predict whether future papayas he examines in the farmers’ market are going to be tasty or not. We use the notation A(S) to denote the hypothesis that a learning algorithm, A, returns upon receiving the training sequence S.
A simple data-generation model We now explain how the training data is generated. First, we assume that the instances (the papayas we encounter) are generated by some probability distribution (in this case, representing the environment). Let us denote that probability distribution over X by D. It is important to note that we do not assume that the learner knows anything about this distribution. For the type of learning tasks we discuss, this could be any arbitrary probability distribution. As to the labels, in the current discussion we assume that there is some “correct” labeling function, f : X → Y, and that y_i = f(x_i) for all i. This assumption will be relaxed in the next chapter. The labeling function is unknown to the learner. In fact, this is just what the learner is trying to figure out. In summary, each pair in the training data S is generated by first sampling a point x_i according to D and then labeling it by f.
Measures of success: We define the error of a classifier to be the probability that it does not predict the correct label on a random data point generated by the aforementioned underlying distribution. That is, the error of h is the probability to draw a random instance x, according to the distribution D, such that h(x) does not equal f(x).
Formally, given a domain subset,2 A ⊂ X, the probability distribution, D, assigns a number, D(A), which determines how likely it is to observe a point x ∈ A. In many cases, we refer to A as an event and express it using a function π : X → {0,1}, namely, A = {x ∈ X : π(x) = 1}. In that case, we also use the notation Px∼D [π(x)] to express D(A).
We define the error of a prediction rule, h : X → Y, to be
$$ L_{(\mathcal{D},f)}(h) \;\stackrel{\text{def}}{=}\; \mathbb{P}_{x \sim \mathcal{D}}\,[h(x) \neq f(x)] \;\stackrel{\text{def}}{=}\; \mathcal{D}(\{x : h(x) \neq f(x)\}). \qquad (2.1) $$
That is, the error of such h is the probability of randomly choosing an example x for which h(x) ≠ f(x). The subscript (D, f) indicates that the error is measured with respect to the probability distribution D and the correct labeling function f. We omit this subscript when it is clear from the context. L_{(D,f)}(h) has several synonymous names such as the generalization error, the risk, or the true error of h, and we will use these names interchangeably throughout
1 Despite the “set” notation, S is a sequence. In particular, the same example may appear twice in S and some algorithms can take into account the order of examples in S. 2 Strictly speaking, we should be more careful and require that A is a member of some σ-algebra of subsets of X, over which D is defined. We will formally define our measurability assumptions in the next chapter.
the book. We use the letter L for the error, since we view this error as the loss of the learner. We will later also discuss other possible formulations of such loss.
A note about the information available to the learner The learner is blind to the underlying distribution D over the world and to the labeling function f. In our papayas example, we have just arrived in a new island and we have no clue as to how papayas are distributed and how to predict their tastiness. The only way the learner can interact with the environment is through observing the training set.
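To make the framework concrete, here is a minimal Python sketch (ours, not the book’s): the uniform distribution standing in for D, the threshold function standing in for f, and the predictor h are all invented, and the Monte Carlo loop merely estimates the true error of Equation (2.1):

```python
import random

def sample_x():
    # D: the environment's distribution, here uniform over [0, 1].
    return random.random()

def f(x):
    # The "correct" labeling function, unknown to the learner.
    return 1 if x > 0.5 else 0

def h(x):
    # Some predictor whose true error we want to measure.
    return 1 if x > 0.6 else 0

# Training data: m points sampled i.i.d. from D and labeled by f.
m = 50
S = [(x, f(x)) for x in (sample_x() for _ in range(m))]

# Monte Carlo estimate of L_{D,f}(h) = P_{x~D}[h(x) != f(x)].
n, errors = 100_000, 0
for _ in range(n):
    x = sample_x()
    if h(x) != f(x):
        errors += 1
print(errors / n)  # approx. 0.1: the D-mass of the interval (0.5, 0.6]
```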
In the next section we describe a simple learning paradigm for the preceding setup and analyze its performance.
2.2 EMPIRICAL RISK MINIMIZATION
As mentioned earlier, a learning algorithm receives as input a training set S, sampled from an unknown distribution D and labeled by some target function f, and should output a predictor hS : X → Y (the subscript S emphasizes the fact that the output predictor depends on S). The goal of the algorithm is to find hS that minimizes the error with respect to the unknown D and f.
Since the learner does not know what D and f are, the true error is not directly available to the learner. A useful notion of error that can be calculated by the learner is the training error – the error the classifier incurs over the training sample:
$$ L_S(h) \;\stackrel{\text{def}}{=}\; \frac{\bigl|\{i \in [m] : h(x_i) \neq y_i\}\bigr|}{m}, \qquad (2.2) $$

where [m] = {1, ..., m}.
The terms empirical error and empirical risk are often used interchangeably for this error.
Since the training sample is the snapshot of the world that is available to the learner, it makes sense to search for a solution that works well on that data. This learning paradigm – coming up with a predictor h that minimizes LS(h) – is called Empirical Risk Minimization or ERM for short.
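Equation (2.2) and the ERM rule translate directly into code. A minimal sketch, assuming a toy finite class of one-dimensional threshold predictors (the class and the sample are invented for illustration):

```python
def empirical_risk(h, S):
    # L_S(h) = |{i : h(x_i) != y_i}| / m, the training error of Equation (2.2).
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H, S):
    # ERM: return a hypothesis in H minimizing the training error over S.
    return min(H, key=lambda h: empirical_risk(h, S))

def make_threshold(t):
    # A hypothetical predictor family: label 1 iff the feature exceeds t.
    return lambda x: 1 if x > t else 0

H = [make_threshold(t / 10) for t in range(11)]   # an invented finite class
S = [(0.2, 0), (0.4, 0), (0.7, 1), (0.9, 1)]      # toy labeled sample
h_S = erm(H, S)
print(empirical_risk(h_S, S))                     # 0.0 on this realizable toy sample
```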
2.2.1 Something May Go Wrong – Overfitting
Although the ERM rule seems very natural, without being careful, this approach may fail miserably.
To demonstrate such a failure, let us go back to the problem of learning to predict the taste of a papaya on the basis of its softness and color. Consider a sample as depicted in the following:
Assume that the probability distribution D is such that instances are distributed uniformly within the gray square and the labeling function, f , determines the label to be 1 if the instance is within the inner square, and 0 otherwise. The area of the gray square in the picture is 2 and the area of the inner square is 1. Consider the following predictor:
$$ h_S(x) = \begin{cases} y_i & \text{if } \exists\, i \in [m] \text{ s.t. } x_i = x \\ 0 & \text{otherwise.} \end{cases} \qquad (2.3) $$
While this predictor might seem rather artificial, in Exercise 2.1 we show a natural representation of it using polynomials. Clearly, no matter what the sample is, LS(hS) = 0, and therefore this predictor may be chosen by an ERM algorithm (it is one of the empirical-minimum-cost hypotheses; no classifier can have smaller error). On the other hand, the true error of any classifier that predicts the label 1 only on a finite number of instances is, in this case, 1/2. Thus, LD(hS) = 1/2. We have found a predictor whose performance on the training set is excellent, yet its performance on the true “world” is very poor. This phenomenon is called overfitting. Intuitively, overfitting occurs when our hypothesis fits the training data “too well” (perhaps like the everyday experience that a person who provides a perfect detailed explanation for each of his single actions may raise suspicion).
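A hedged sketch of this failure mode, with the two-dimensional picture simplified to a one-dimensional domain (uniform on [0, 2], left half labeled 1, so half the probability mass carries each label; all specifics here are our invention):

```python
import random

def f(x):
    # Correct labels: 1 on the left half of [0, 2], 0 on the right half.
    return 1 if x < 1.0 else 0

m = 100
S = [(x, f(x)) for x in (random.uniform(0, 2) for _ in range(m))]
memory = dict(S)

def h_S(x):
    # The memorizing predictor of Equation (2.3): y_i if some x_i equals x, else 0.
    return memory.get(x, 0)

train_error = sum(1 for x, y in S if h_S(x) != y) / m
print(train_error)  # 0.0: a perfect fit on the sample

n, errors = 100_000, 0
for _ in range(n):
    x = random.uniform(0, 2)
    if h_S(x) != f(x):
        errors += 1
print(errors / n)   # approx. 0.5: fresh points almost never repeat a memorized x_i
```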
2.3 EMPIRICAL RISK MINIMIZATION WITH INDUCTIVE BIAS
We have just demonstrated that the ERM rule might lead to overfitting. Rather than giving up on the ERM paradigm, we will look for ways to rectify it. We will search for conditions under which there is a guarantee that ERM does not overfit, namely, conditions under which when the ERM predictor has good performance with respect to the training data, it is also highly likely to perform well over the underlying data distribution.
A common solution is to apply the ERM learning rule over a restricted search space. Formally, the learner should choose in advance (before seeing the data) a set of predictors. This set is called a hypothesis class and is denoted by H. Each h ∈ H is a function mapping from X to Y. For a given class H, and a training sample, S, the ERMH learner uses the ERM rule to choose a predictor h ∈ H, with the lowest possible error over S. Formally,
$$ \mathrm{ERM}_{\mathcal{H}}(S) \in \operatorname*{argmin}_{h \in \mathcal{H}} L_S(h), $$
where argmin stands for the set of hypotheses in H that achieve the minimum value of LS(h) over H. By restricting the learner to choosing a predictor from H, we bias it toward a particular set of predictors. Such restrictions are often called an inductive bias. Since the choice of such a restriction is determined before the learner sees the training data, it should ideally be based on some prior knowledge about the problem to be learned. For example, for the papaya taste prediction problem we may choose the class H to be the set of predictors that are determined by axis aligned rectangles (in the space determined by the color and softness coordinates). We will later show that ERMH over this class is guaranteed not to overfit. On the other hand, the example of overfitting that we have seen previously, demonstrates that choosing H
to be a class of predictors that includes all functions that assign the value 1 to a finite set of domain points does not suffice to guarantee that ERMH will not overfit. A fundamental question in learning theory is, over which hypothesis classes ERMH learning will not result in overfitting. We will study this question later in the book.
Intuitively, choosing a more restricted hypothesis class better protects us against overfitting but at the same time might cause us a stronger inductive bias. We will get back to this fundamental tradeoff later.
2.3.1 Finite Hypothesis Classes
The simplest type of restriction on a class is imposing an upper bound on its size (that is, the number of predictors h in H). In this section, we show that if H is a finite class then ERMH will not overfit, provided it is based on a sufficiently large training sample (this size requirement will depend on the size of H).
Limiting the learner to prediction rules within some finite hypothesis class may be considered as a reasonably mild restriction. For example, H can be the set of all predictors that can be implemented by a C++ program written in at most 10⁹ bits of code. In our papayas example, we mentioned previously the class of axis aligned rectangles. While this is an infinite class, if we discretize the representation of real numbers, say, by using a 64-bit floating-point representation, the hypothesis class becomes a finite class.
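A brute-force sketch of ERM_H over such a discretized class of axis aligned rectangles; the (very coarse) grid resolution and the toy sample are invented for illustration:

```python
from itertools import combinations

# A finite hypothesis class: axis-aligned rectangles whose corners lie
# on a coarse grid. The grid and sample below are hypothetical.
GRID = [i / 4 for i in range(5)]          # {0, 0.25, 0.5, 0.75, 1}

def rect_predictor(a, b, c, d):
    # Label 1 inside the rectangle [a, b] x [c, d], 0 outside.
    return lambda p: 1 if a <= p[0] <= b and c <= p[1] <= d else 0

H = [rect_predictor(a, b, c, d)
     for a, b in combinations(GRID, 2)
     for c, d in combinations(GRID, 2)]

def empirical_risk(h, S):
    return sum(1 for x, y in S if h(x) != y) / len(S)

S = [((0.3, 0.4), 1), ((0.6, 0.5), 1), ((0.1, 0.9), 0), ((0.9, 0.1), 0)]
h_S = min(H, key=lambda h: empirical_risk(h, S))   # the ERM_H rule
print(empirical_risk(h_S, S))                      # 0.0 on this realizable toy sample
```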
Let us now analyze the performance of the ERMH learning rule assuming that H is a finite class. For a training sample, S, labeled according to some f : X → Y, let hS denote a result of applying ERMH to S, namely,
$$ h_S \in \operatorname*{argmin}_{h \in \mathcal{H}} L_S(h). \qquad (2.4) $$
In this chapter, we make the following simplifying assumption (which will be relaxed in the next chapter).
Definition 2.1 (The Realizability Assumption). There exists h⋆ ∈ H s.t. L_{(D,f)}(h⋆) = 0. Note that this assumption implies that with probability 1 over random samples, S, where the instances of S are sampled according to D and are labeled by f, we have L_S(h⋆) = 0.
The realizability assumption implies that for every ERM hypothesis we have that3 LS(hS) = 0. However, we are interested in the true risk of hS, L(D, f )(hS ), rather than its empirical risk.
Clearly, any guarantee on the error with respect to the underlying distribution, D, for an algorithm that has access only to a sample S should depend on the relationship between D and S. The common assumption in statistical machine learning is that the training sample S is generated by sampling points from the distribution D independently of each other. Formally,
3 Mathematically speaking, this holds with probability 1. To simplify the presentation, we sometimes omit the “with probability 1” specifier.
The i.i.d. assumption: The examples in the training set are independently and identically distributed (i.i.d.) according to the distribution D. That is, every xi in S is freshly sampled according to D and then labeled according to the labeling function, f . We denote this assumption by S ∼ Dm where m is the size of S, and Dm denotes the probability over m-tuples induced by applying D to pick each element of the tuple independently of the other members of the tuple.
Intuitively, the training set S is a window through which the learner gets partial information about the distribution D over the world and the labeling function, f . The larger the sample gets, the more likely it is to reflect more accurately the distribution and labeling used to generate it.
Since L(D,f)(hS) depends on the training set, S, and that training set is picked by a random process, there is randomness in the choice of the predictor hS and, consequently, in the risk L(D,f)(hS). Formally, we say that it is a random variable. It is not realistic to expect that with full certainty S will suffice to direct the learner toward a good classifier (from the point of view of D), as there is always some probability that the sampled training data happens to be very nonrepresentative of the underlying D. If we go back to the papaya tasting example, there is always some (small) chance that all the papayas we have happened to taste were not tasty, in spite of the fact that, say, 70% of the papayas in our island are tasty. In such a case, ERMH(S) may be the constant function that labels every papaya as “not tasty” (and has 70% error on the true distribution of papayas in the island). We will therefore address the probability of sampling a training set for which L(D,f)(hS) is not too large. Usually, we denote the probability of getting a nonrepresentative sample by δ, and call (1 − δ) the confidence parameter of our prediction.
On top of that, since we cannot guarantee perfect label prediction, we introduce another parameter for the quality of prediction, the accuracy parameter, commonly denoted by ε. We interpret the event L(D,f)(hS) > ε as a failure of the learner, while if L(D,f)(hS) ≤ ε we view the output of the algorithm as an approximately correct predictor. Therefore (fixing some labeling function f : X → Y), we are interested in upper bounding the probability of sampling an m-tuple of instances that will lead to failure of the learner. Formally, let S|x = (x1,..., xm) be the instances of the training set. We would like to upper bound
Dm({S|x : L(D,f)(hS) > ε}).
Let HB be the set of “bad” hypotheses, that is,
HB = {h ∈ H : L(D,f)(h) > ε}.
In addition, let
M = {S|x : ∃h ∈ HB, LS (h) = 0}
be the set of misleading samples: Namely, for every S|x ∈ M, there is a “bad” hypothesis, h ∈ HB, that looks like a “good” hypothesis on S|x. Now, recall that we would like to bound the probability of the event L(D,f)(hS) > ε. But, since the realizability assumption implies that LS(hS) = 0, it follows that the event L(D,f)(hS) > ε can only happen if for some h ∈ HB we have LS(h) = 0. In other words, this event will
only happen if our sample is in the set of misleading samples, M. Formally, we have shown that
{S|x : L(D,f)(hS) > ε} ⊆ M.

Note that we can rewrite M as

M = ∪_{h∈HB} {S|x : LS(h) = 0}. (2.5)

Hence,

Dm({S|x : L(D,f)(hS) > ε}) ≤ Dm(M) = Dm(∪_{h∈HB} {S|x : LS(h) = 0}). (2.6)
Next, we upper bound the right-hand side of the preceding equation using the union bound – a basic property of probabilities.
Lemma 2.2 (Union Bound). For any two sets A, B and a distribution D we have D(A ∪ B) ≤ D(A)+ D(B).
Applying the union bound to the right-hand side of Equation (2.6) yields
Dm({S|x : L(D,f)(hS) > ε}) ≤ Σ_{h∈HB} Dm({S|x : LS(h) = 0}). (2.7)
Next, let us bound each summand of the right-hand side of the preceding inequality. Fix some “bad” hypothesis h ∈ HB. The event LS(h) = 0 is equivalent to the event ∀i,h(xi) = f (xi). Since the examples in the training set are sampled i.i.d. we get that
Dm({S|x : LS(h) = 0}) = Dm({S|x : ∀i, h(xi) = f(xi)}) = ∏_{i=1}^{m} D({xi : h(xi) = f(xi)}). (2.8)
For each individual sampling of an element of the training set we have D({xi : h(xi) = yi}) = 1 − L(D,f)(h) ≤ 1 − ε,
where the last inequality follows from the fact that h ∈ HB. Combining the previous equation with Equation (2.8) and using the inequality 1 − ε ≤ e^{−ε} we obtain that for every h ∈ HB,

Dm({S|x : LS(h) = 0}) ≤ (1 − ε)^m ≤ e^{−εm}. (2.9)
Combining this equation with Equation (2.7) we conclude that
Dm({S|x : L(D,f)(hS) > ε}) ≤ |HB| e^{−εm} ≤ |H| e^{−εm}.
A graphical illustration which explains how we used the union bound is given in Figure 2.1.
Corollary 2.3. Let H be a finite hypothesis class. Let δ ∈ (0,1) and ε > 0 and let m be an integer that satisfies

m ≥ log(|H|/δ) / ε.
Figure 2.1. Each point in the large circle represents a possible m-tuple of instances. Each colored oval represents the set of “misleading” m-tuples of instances for some “bad” predictor h ∈ HB. The ERM can potentially overfit whenever it gets a misleading training set S. That is, for some h ∈ HB we have LS(h) = 0. Equation (2.9) guarantees that for each individual bad hypothesis, h ∈ HB, at most a (1 − ε)^m-fraction of the training sets would be misleading. In particular, the larger m is, the smaller each of these colored ovals becomes. The union bound formalizes the fact that the area representing the training sets that are misleading with respect to some h ∈ HB (that is, the training sets in M) is at most the sum of the areas of the colored ovals. Therefore, it is bounded by |HB| times the maximum size of a colored oval. Any sample S outside the colored ovals cannot cause the ERM rule to overfit.
Then, for any labeling function, f, and for any distribution, D, for which the realizability assumption holds (that is, for some h ∈ H, L(D,f)(h) = 0), with probability of at least 1 − δ over the choice of an i.i.d. sample S of size m, we have that for every ERM hypothesis, hS, it holds that

L(D,f)(hS) ≤ ε.
The preceding corollary tells us that for a sufficiently large m, the ERMH rule over a finite hypothesis class will be probably (with confidence 1 − δ) approximately (up to an error of ε) correct. In the next chapter we formally define the model of Probably Approximately Correct (PAC) learning.
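To get a feel for the numbers in Corollary 2.3, the following small calculation (an illustrative sketch, not from the book) computes a sufficient training-set size for a given class size, accuracy, and confidence:

```python
import math

def sample_size(H_size, eps, delta):
    """Smallest integer m with m >= log(|H|/delta) / eps (Corollary 2.3)."""
    return math.ceil(math.log(H_size / delta) / eps)

# E.g., |H| = 2**30 hypotheses, accuracy eps = 0.01, confidence delta = 0.05:
print(sample_size(2**30, 0.01, 0.05))  # about 2,400 examples
```

Note how mild the dependence on |H| is: the required sample size grows only logarithmically with the size of the class.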
2.4 EXERCISES
2.1 Overfitting of polynomial matching: We have shown that the predictor defined in Equation (2.3) leads to overfitting. While this predictor seems to be very unnatural, the goal of this exercise is to show that it can be described as a thresholded polynomial. That is, show that given a training set S = {(xi, f(xi))}_{i=1}^{m} ⊆ (R^d × {0,1})^m, there exists a polynomial pS such that hS(x) = 1 if and only if pS(x) ≥ 0, where hS is as defined in Equation (2.3). It follows that learning the class of all thresholded polynomials using the ERM rule may lead to overfitting.
2.2 Let H be a class of binary classifiers over a domain X. Let D be an unknown distribution over X, and let f be the target hypothesis in H. Fix some h ∈ H. Show that the expected value of LS(h) over the choice of S|x equals L(D,f)(h), namely,

E_{S|x∼Dm}[LS(h)] = L(D,f)(h).
2.3 Axis aligned rectangles: An axis aligned rectangle classifier in the plane is a classifier that assigns the value 1 to a point if and only if it is inside a certain rectangle.
2.4 Exercises 21
Figure 2.2. Axis aligned rectangles (showing the target rectangle R*, the returned rectangle R(S), and the strip R1, together with positive and negative training examples).
Formally, given real numbers a1 ≤ b1,a2 ≤ b2, define the classifier h(a1,b1,a2,b2) by
h(a1,b1,a2,b2)(x1, x2) = 1 if a1 ≤ x1 ≤ b1 and a2 ≤ x2 ≤ b2, and 0 otherwise. (2.10)
The class of all axis aligned rectangles in the plane is defined as H_rec^2 = {h(a1,b1,a2,b2) : a1 ≤ b1 and a2 ≤ b2}.
Note that this is an infinite size hypothesis class. Throughout this exercise we rely on the realizability assumption.
1. Let A be the algorithm that returns the smallest rectangle enclosing all positive examples in the training set. Show that A is an ERM. (A code sketch of such an algorithm appears after this exercise.)
2. Show that if A receives a training set of size ≥ 4 log(4/δ)/ε then, with probability of at least 1 − δ, it returns a hypothesis with error of at most ε.
Hint: Fix some distribution D over X, let R* = R(a1*, b1*, a2*, b2*) be the rectangle that generates the labels, and let f be the corresponding hypothesis. Let a1 ≥ a1* be a number such that the probability mass (with respect to D) of the rectangle R1 = R(a1*, a1, a2*, b2*) is exactly ε/4. Similarly, let b1, a2, b2 be numbers such that the probability masses of the rectangles R2 = R(b1, b1*, a2*, b2*), R3 = R(a1*, b1*, a2*, a2), R4 = R(a1*, b1*, b2, b2*) are all exactly ε/4. Let R(S) be the rectangle returned by A. See illustration in Figure 2.2.
Show that R(S) ⊆ R∗.
Show that if S contains (positive) examples in all of the rectangles R1, R2, R3, R4, then the hypothesis returned by A has error of at most ε. For each i ∈ {1,...,4}, upper bound the probability that S does not contain an example from Ri.
Use the union bound to conclude the argument.
3. Repeat the previous question for the class of axis aligned rectangles in R^d.
4. Show that the runtime of applying the algorithm A mentioned earlier is polynomial in d, 1/ε, and in log(1/δ).
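The following is a possible implementation sketch of the algorithm A from part 1 of this exercise (illustrative, not the book's solution): it returns the smallest axis aligned rectangle enclosing the positive examples, which under the realizability assumption has zero empirical risk and is therefore an ERM.

```python
def smallest_enclosing_rectangle(S):
    """Algorithm A from the exercise: tightest axis aligned rectangle
    around the positive examples; labels points inside it as 1."""
    pos = [x for x, y in S if y == 1]
    if not pos:  # no positive examples: return an "empty" rectangle
        return lambda x: 0
    a1 = min(p[0] for p in pos); b1 = max(p[0] for p in pos)
    a2 = min(p[1] for p in pos); b2 = max(p[1] for p in pos)
    return lambda x: int(a1 <= x[0] <= b1 and a2 <= x[1] <= b2)

S = [((1, 1), 0), ((2, 3), 1), ((4, 2), 1), ((5, 5), 0)]
h = smallest_enclosing_rectangle(S)
print([h(x) for x, _ in S])  # [0, 1, 1, 0] -- zero empirical risk
```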
3
A Formal Learning Model
In this chapter we define our main formal learning model – the PAC learning model and its extensions. We will consider other notions of learnability in Chapter 7.
3.1 PAC LEARNING
In the previous chapter we have shown that for a finite hypothesis class, if the ERM rule with respect to that class is applied on a sufficiently large training sample (whose size is independent of the underlying distribution or labeling function) then the output hypothesis will be probably approximately correct. More generally, we now define Probably Approximately Correct (PAC) learning.
Definition 3.1 (PAC Learnability). A hypothesis class H is PAC learnable if there exist a function mH : (0,1)^2 → N and a learning algorithm with the following property: For every ε, δ ∈ (0,1), for every distribution D over X, and for every labeling function f : X → {0,1}, if the realizability assumption holds with respect to H, D, f, then when running the learning algorithm on m ≥ mH(ε, δ) i.i.d. examples generated by D and labeled by f, the algorithm returns a hypothesis h such that, with probability of at least 1 − δ (over the choice of the examples), L(D,f)(h) ≤ ε.
The definition of Probably Approximately Correct learnability contains two approximation parameters. The accuracy parameter ε determines how far the output classifier can be from the optimal one (this corresponds to the “approximately correct”), and a confidence parameter δ indicating how likely the classifier is to meet that accuracy requirement (corresponds to the “probably” part of “PAC”). Under the data access model that we are investigating, these approximations are inevitable. Since the training set is randomly generated, there may always be a small chance that it will happen to be noninformative (for example, there is always some chance that the training set will contain only one domain point, sampled over and over again). Furthermore, even when we are lucky enough to get a training sample that does faithfully represent D, because it is just a finite sample, there may always be some fine details of D that it fails to reflect. Our accuracy parameter, ε, allows “forgiving” the learner’s classifier for making minor errors.
Sample Complexity
The function mH : (0,1)^2 → N determines the sample complexity of learning H: that is, how many examples are required to guarantee a probably approximately correct solution. The sample complexity is a function of the accuracy (ε) and confidence (δ) parameters. It also depends on properties of the hypothesis class H – for example, for a finite class we showed that the sample complexity depends on the log of the size of H.
Note that if H is PAC learnable, there are many functions mH that satisfy the requirements given in the definition of PAC learnability. Therefore, to be precise, we will define the sample complexity of learning H to be the “minimal function,” in the sense that for any ε, δ, mH(ε, δ) is the minimal integer that satisfies the requirements of PAC learning with accuracy ε and confidence δ.
Let us now recall the conclusion of the analysis of finite hypothesis classes from the previous chapter. It can be rephrased as stating:
Corollary 3.2. Every finite hypothesis class is PAC learnable with sample complexity
mH(ε, δ) ≤ ⌈log(|H|/δ)/ε⌉.
There are infinite classes that are learnable as well (see, for example, Exercise 3.3). Later on we will show that what determines the PAC learnability of a class is not its finiteness but rather a combinatorial measure called the VC dimension.
3.2 A MORE GENERAL LEARNING MODEL
The model we have just described can be readily generalized, so that it can be made relevant to a wider scope of learning tasks. We consider generalizations in two aspects:
Removing the Realizability Assumption
We have required that the learning algorithm succeeds on a pair of data distribution D and labeling function f provided that the realizability assumption is met. For practical learning tasks, this assumption may be too strong (can we really guarantee that there is a rectangle in the color-hardness space that fully determines which papayas are tasty?). In the next subsection, we will describe the agnostic PAC model in which this realizability assumption is waived.
Learning Problems beyond Binary Classification
The learning task that we have been discussing so far has to do with predicting a binary label to a given example (like being tasty or not). However, many learning tasks take a different form. For example, one may wish to predict a real valued number (say, the temperature at 9:00 p.m. tomorrow) or a label picked from a finite set of labels (like the topic of the main story in tomorrow’s paper). It turns out that our analysis of learning can be readily extended to such and many other scenarios by allowing a variety of loss functions. We shall discuss that in Section 3.2.2 later.
3.2.1 Releasing the Realizability Assumption – Agnostic PAC Learning
A More Realistic Model for the Data-Generating Distribution
Recall that the realizability assumption requires that there exists h* ∈ H such that P_{x∼D}[h*(x) = f(x)] = 1. In many practical problems this assumption does not hold. Furthermore, it is perhaps more realistic not to assume that the labels are fully determined by the features we measure on input elements (in the case of the papayas, it is plausible that two papayas of the same color and softness will have different taste). In the following, we relax the realizability assumption by replacing the “target labeling function” with a more flexible notion, a data-labels generating distribution.
Formally, from now on, let D be a probability distribution over X × Y, where, as before, X is our domain set and Y is a set of labels (usually we will consider Y = {0,1}). That is, D is a joint distribution over domain points and labels. One can view such a distribution as being composed of two parts: a distribution Dx over unlabeled domain points (sometimes called the marginal distribution) and a conditional probability over labels for each domain point, D((x, y)|x). In the papaya example, Dx determines the probability of encountering a papaya whose color and hardness fall in some color-hardness values domain, and the conditional probability is the probability that a papaya with color and hardness represented by x is tasty. Indeed, such modeling allows for two papayas that share the same color and hardness to belong to different taste categories.
The empirical and the True Error Revised
For a probability distribution, D, over X × Y, one can measure how likely h is to make an error when labeled points are randomly drawn according to D. We redefine the true error (or risk) of a prediction rule h to be
LD(h) def= P_{(x,y)∼D}[h(x) ≠ y] def= D({(x, y) : h(x) ≠ y}). (3.1)
We would like to find a predictor, h, for which that error will be minimized. However, the learner does not know the data generating D. What the learner does have access to is the training data, S. The definition of the empirical risk remains the same as before, namely,
LS(h) def= |{i ∈ [m] : h(xi) ≠ yi}| / m.
Given S, a learner can compute LS(h) for any function h : X → {0,1}. Note that LS(h) = L_{D(uniform over S)}(h).
The Goal
We wish to find some hypothesis, h : X → Y, that (probably approximately) minimizes the true risk, LD(h).
The Bayes Optimal Predictor.
Given any probability distribution D over X × {0,1}, the best label predicting function from X to {0,1} will be
fD(x) = 1 if P[y = 1 | x] ≥ 1/2, and 0 otherwise.
It is easy to verify (see Exercise 3.7) that for every probability distribution D, the Bayes optimal predictor fD is optimal, in the sense that no other classifier, g : X → {0,1}, has a lower error. That is, for every classifier g, LD( fD) ≤ LD(g).
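The following toy computation (illustrative; the three-point distribution below is invented, not from the book) checks the optimality of the Bayes predictor by computing LD(h) exactly from the marginal Dx and the conditional probabilities P[y = 1|x]:

```python
# A toy check of the Bayes predictor on a finite domain. D_x is the
# marginal over X and eta[x] = P[y = 1 | x]; both are made-up numbers.
X = ["a", "b", "c"]
D_x = {"a": 0.5, "b": 0.3, "c": 0.2}
eta = {"a": 0.9, "b": 0.4, "c": 0.5}

def bayes(x):
    return 1 if eta[x] >= 0.5 else 0

def risk(h):
    """L_D(h) = sum over x of D_x(x) * P[h(x) != y | x]."""
    return sum(D_x[x] * (1 - eta[x] if h(x) == 1 else eta[x]) for x in X)

print(risk(bayes))        # 0.27
print(risk(lambda x: 1))  # 0.33: the all-ones classifier does worse
```

Trying any other deterministic labeling of the three points gives a risk at least as large as that of the Bayes predictor, as Exercise 3.7 asks you to prove in general.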
Unfortunately, since we do not know D, we cannot utilize this optimal predictor fD. What the learner does have access to is the training sample. We can now present the formal definition of agnostic PAC learnability, which is a natural extension of the definition of PAC learnability to the more realistic, nonrealizable, learning setup we have just discussed.
Clearly, we cannot hope that the learning algorithm will find a hypothesis whose error is smaller than the minimal possible error, that of the Bayes predictor. Furthermore, as we shall prove later, once we make no prior assumptions about the data-generating distribution, no algorithm can be guaranteed to find a predictor that is as good as the Bayes optimal one. Instead, we require that the learning algorithm will find a predictor whose error is not much larger than the best possible error of a predictor in some given benchmark hypothesis class. Of course, the strength of such a requirement depends on the choice of that hypothesis class.
Definition 3.3 (Agnostic PAC Learnability). A hypothesis class H is agnostic PAC learnable if there exist a function mH : (0,1)^2 → N and a learning algorithm with the following property: For every ε, δ ∈ (0,1) and for every distribution D over X × Y, when running the learning algorithm on m ≥ mH(ε, δ) i.i.d. examples generated by D, the algorithm returns a hypothesis h such that, with probability of at least 1 − δ (over the choice of the m training examples),

LD(h) ≤ min_{h′∈H} LD(h′) + ε.
Clearly, if the realizability assumption holds, agnostic PAC learning provides the same guarantee as PAC learning. In that sense, agnostic PAC learning generalizes the definition of PAC learning. When the realizability assumption does not hold, no learner can guarantee an arbitrarily small error. Nevertheless, under the definition of agnostic PAC learning, a learner can still declare success if its error is not much larger than the best error achievable by a predictor from the class H. This is in contrast to PAC learning, in which the learner is required to achieve a small error in absolute terms and not relative to the best error achievable by the hypothesis class.
3.2.2 The Scope of Learning Problems Modeled
We next extend our model so that it can be applied to a wide variety of learning tasks. Let us consider some examples of different learning tasks.
Multiclass Classification Our classification does not have to be binary. Take, for example, the task of document classification: We wish to design a program that
will be able to classify given documents according to topics (e.g., news, sports, biology, medicine). A learning algorithm for such a task will have access to examples of correctly classified documents and, on the basis of these examples, should output a program that can take as input a new document and output a topic classification for that document. Here, the domain set is the set of all potential documents. Once again, we would usually represent documents by a set of features that could include counts of different key words in the document, as well as other possibly relevant features like the size of the document or its origin. The label set in this task will be the set of possible document topics (so Y will be some large finite set). Once we determine our domain and label sets, the other components of our framework look exactly the same as in the papaya tasting example: Our training sample will be a finite sequence of (feature vector, label) pairs, the learner’s output will be a function from the domain set to the label set, and, finally, for our measure of success, we can use the probability, over (document, topic) pairs, of the event that our predictor suggests a wrong label.
Regression In this task, one wishes to find some simple pattern in the data – a functional relationship between the X and Y components of the data. For example, one wishes to find a linear function that best predicts a baby’s birth weight on the basis of ultrasound measures of his head circumference, abdominal circumference, and femur length. Here, our domain set X is some subset of R^3 (the three ultrasound measurements), and the set of “labels,” Y, is the set of real numbers (the weight in grams). In this context, it is more adequate to call Y the target set. Our training data as well as the learner’s output are as before (a finite sequence of (x, y) pairs, and a function from X to Y respectively). However, our measure of success is different. We may evaluate the quality of a hypothesis function, h : X → Y, by the expected square difference between the true labels and their predicted values, namely,
LD(h) def= E_{(x,y)∼D}[(h(x) − y)²]. (3.2)
To accommodate a wide range of learning tasks we generalize our formalism of the measure of success as follows:
Generalized Loss Functions
Given any set H (that plays the role of our hypotheses, or models) and some domain Z, let ℓ be any function from H × Z to the set of nonnegative real numbers, ℓ : H × Z → R+. We call such functions loss functions.
Note that for prediction problems, we have that Z = X ×Y. However, our notion of the loss function is generalized beyond prediction tasks, and therefore it allows Z to be any domain of examples (for instance, in unsupervised learning tasks such as the one described in Chapter 22, Z is not a product of an instance domain and a label domain).
We now define the risk function to be the expected loss of a classifier, h ∈ H, with respect to a probability distribution D over Z, namely,
LD(h) def= E_{z∼D}[ℓ(h, z)]. (3.3)
That is, we consider the expectation of the loss of h over objects z picked randomly according to D. Similarly, we define the empirical risk to be the expected loss over a given sample S = (z1,...,zm) ∈ Z^m, namely,
LS(h) def= (1/m) Σ_{i=1}^{m} ℓ(h, zi). (3.4)
The loss functions used in the preceding examples of classification and regression tasks are as follows:
0–1 loss: Here, our random variable z ranges over the set of pairs X ×Y and the
loss function is
ℓ_{0−1}(h, (x, y)) def= 0 if h(x) = y, and 1 if h(x) ≠ y.
This loss function is used in binary or multiclass classification problems. One should note that, for a random variable, α, taking the values {0,1}, Eα∼D [α] = Pα∼D [α = 1]. Consequently, for this loss function, the definitions of LD(h) given in Equation (3.3) and Equation (3.1) coincide.
Square Loss: Here, our random variable z ranges over the set of pairs X ×Y and
the loss function is
ℓ_{sq}(h, (x, y)) def= (h(x) − y)².
This loss function is used in regression problems.
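As a quick illustration (a sketch, not from the book; all names are illustrative), the two losses and the empirical risk of Equation (3.4) can be written as:

```python
# The 0-1 loss, the square loss, and the empirical risk L_S(h) of
# Equation (3.4), written out as plain Python functions.

def zero_one_loss(h, example):
    x, y = example
    return 0 if h(x) == y else 1

def squared_loss(h, example):
    x, y = example
    return (h(x) - y) ** 2

def empirical_risk(h, S, loss):
    return sum(loss(h, z) for z in S) / len(S)

S = [(0.0, 0), (1.0, 1), (2.0, 1)]
h = lambda x: int(x >= 1.5)
print(empirical_risk(h, S, zero_one_loss))  # 1/3: h errs on x = 1.0
```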
We will later see more examples of useful instantiations of loss functions. To summarize, we formally define agnostic PAC learnability for general loss functions.
Definition 3.4 (Agnostic PAC Learnability for General Loss Functions). A hypothesis class H is agnostic PAC learnable with respect to a set Z and a loss function ℓ : H × Z → R+, if there exist a function mH : (0,1)^2 → N and a learning algorithm with the following property: For every ε, δ ∈ (0,1) and for every distribution D over Z, when running the learning algorithm on m ≥ mH(ε, δ) i.i.d. examples generated by D, the algorithm returns h ∈ H such that, with probability of at least 1 − δ (over the choice of the m training examples),

LD(h) ≤ min_{h′∈H} LD(h′) + ε,

where LD(h) = E_{z∼D}[ℓ(h, z)].
Remark 3.1 (A Note About Measurability*). In the aforementioned definition, for every h ∈ H, we view the function ℓ(h, ·) : Z → R+ as a random variable and define LD(h) to be the expected value of this random variable. For that, we need to require that the function ℓ(h, ·) is measurable. Formally, we assume that there is a σ-algebra of subsets of Z, over which the probability D is defined, and that the preimage of every initial segment in R+ is in this σ-algebra. In the specific case of binary classification with the 0−1 loss, the σ-algebra is over X × {0,1} and our assumption on ℓ is equivalent to the assumption that for every h, the set {(x, h(x)) : x ∈ X} is in the σ-algebra.
Remark 3.2 (Proper vs. Representation-Independent Learning*). In the preceding definition, we required that the algorithm will return a hypothesis from H. In some situations, H is a subset of a set H′, and the loss function can be naturally extended to be a function from H′ × Z to the reals. In this case, we may allow the algorithm to return a hypothesis h′ ∈ H′, as long as it satisfies the requirement LD(h′) ≤ min_{h∈H} LD(h) + ε. Allowing the algorithm to output a hypothesis from H′ is called representation independent learning, while proper learning occurs when the algorithm must output a hypothesis from H. Representation independent learning is sometimes called “improper learning,” although there is nothing improper in representation independent learning.
3.3 SUMMARY
In this chapter we defined our main formal learning model – PAC learning. The basic model relies on the realizability assumption, while the agnostic variant does not impose any restrictions on the underlying distribution over the examples. We also generalized the PAC model to arbitrary loss functions. We will sometimes refer to the most general model simply as PAC learning, omitting the “agnostic” prefix and letting the reader infer what the underlying loss function is from the context. When we would like to emphasize that we are dealing with the original PAC setting we mention that the realizability assumption holds. In Chapter 7 we will discuss other notions of learnability.
3.4 BIBLIOGRAPHIC REMARKS
Our most general definition of agnostic PAC learning with general loss functions follows the works of Vladimir Vapnik and Alexey Chervonenkis (Vapnik and Chervonenkis 1971). In particular, we follow Vapnik’s general setting of learning (Vapnik 1982, Vapnik 1992, Vapnik 1995, Vapnik 1998).
The term PAC learning was introduced by Valiant (1984). Valiant was named the winner of the 2010 Turing Award for the introduction of the PAC model. Valiant’s definition requires that the sample complexity will be polynomial in 1/ε and in 1/δ, as well as in the representation size of hypotheses in the class (see also Kearns and Vazirani (1994)). As we will see in Chapter 6, if a problem is at all PAC learnable then the sample complexity depends polynomially on 1/ε and log(1/δ). Valiant’s definition also requires that the runtime of the learning algorithm will be polynomial in these quantities. In contrast, we chose to distinguish between the statistical aspect of learning and the computational aspect of learning. We will elaborate on the computational aspect later on in Chapter 8. Finally, the formalization of agnostic PAC learning is due to Haussler (1992).
3.5 EXERCISES
3.1 Monotonicity of Sample Complexity: Let H be a hypothesis class for a binary classification task. Suppose that H is PAC learnable and its sample complexity is given by mH(·,·). Show that mH is monotonically nonincreasing in each of its parameters. That is, show that given δ ∈ (0,1), and given 0 < ε1 ≤ ε2 < 1, we have that
mH(ε1, δ) ≥ mH(ε2, δ). Similarly, show that given ε ∈ (0,1), and given 0 < δ1 ≤ δ2 < 1, we have that mH(ε, δ1) ≥ mH(ε, δ2).
3.2 Let X be a discrete domain, and let HSingleton = {hz : z ∈ X} ∪ {h−}, where for each z ∈ X, hz is the function defined by hz(x) = 1 if x = z and hz(x) = 0 if x ≠ z. h− is simply the all-negative hypothesis, namely, ∀x ∈ X, h−(x) = 0. The realizability assumption here implies that the true hypothesis f labels negatively all examples in the domain, perhaps except one.
1. Describe an algorithm that implements the ERM rule for learning HSingleton in the realizable setup.
2. Show that HSingleton is PAC learnable. Provide an upper bound on the sample complexity.
3.3 Let X = R^2, Y = {0,1}, and let H be the class of concentric circles in the plane, that is, H = {hr : r ∈ R+}, where hr(x) = 1[‖x‖ ≤ r]. Prove that H is PAC learnable (assume realizability), and its sample complexity is bounded by

mH(ε, δ) ≤ ⌈log(1/δ)/ε⌉.
3.4 In this question, we study the hypothesis class of Boolean conjunctions defined as follows. The instance space is X = {0,1}^d and the label set is Y = {0,1}. A literal over the variables x1,..., xd is a simple Boolean function that takes the form f(x) = xi, for some i ∈ [d], or f(x) = 1 − xi for some i ∈ [d]. We use the notation x̄i as a shorthand for 1 − xi. A conjunction is any product of literals. In Boolean logic, the product is denoted using the ∧ sign. For example, the function h(x) = x1 · (1 − x2) is written as x1 ∧ x̄2.
We consider the hypothesis class of all conjunctions of literals over the d variables. The empty conjunction is interpreted as the all-positive hypothesis (namely, the function that returns h(x) = 1 for all x). The conjunction x1 ∧ x̄1 (and similarly any conjunction involving a literal and its negation) is allowed and interpreted as the all-negative hypothesis (namely, the conjunction that returns h(x) = 0 for all x). We assume realizability: Namely, we assume that there exists a Boolean conjunction that generates the labels. Thus, each example (x, y) ∈ X × Y consists of an assignment to the d Boolean variables x1,..., xd, and its truth value (0 for false and 1 for true).
For instance, let d = 3 and suppose that the true conjunction is x1 ∧ ¯x2. Then, the training set S might contain the following instances:
((1,1,1),0), ((1,0,1),1), ((0,1,0),0), ((1,0,0),1).
Prove that the hypothesis class of all conjunctions over d variables is PAC learnable and bound its sample complexity. Propose an algorithm that implements the ERM rule, whose runtime is polynomial in d · m.
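One possible ERM implementation for this exercise (a hedged sketch, not the book's solution; names are illustrative) starts from the conjunction of all 2d literals and deletes every literal falsified by a positive example, so its runtime is O(d · m):

```python
# A possible ERM rule for Boolean conjunctions in the realizable case.
# pos_lit[i] keeps the literal x_i; neg_lit[i] keeps the literal (1 - x_i).
# If both survive for some i, h is the all-negative hypothesis, matching
# the convention for conjunctions containing a literal and its negation.

def learn_conjunction(S, d):
    pos_lit = [True] * d
    neg_lit = [True] * d
    for x, y in S:
        if y == 1:
            for i in range(d):
                if x[i] == 0: pos_lit[i] = False   # x_i contradicted
                else:         neg_lit[i] = False   # (1 - x_i) contradicted
    def h(x):
        return int(all(x[i] for i in range(d) if pos_lit[i]) and
                   all(not x[i] for i in range(d) if neg_lit[i]))
    return h

S = [((1, 1, 1), 0), ((1, 0, 1), 1), ((0, 1, 0), 0), ((1, 0, 0), 1)]
h = learn_conjunction(S, 3)
print([h(x) for x, _ in S])  # [0, 1, 0, 1] -- matches the labels
```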
3.5 Let X be a domain and let D1, D2,..., Dm be a sequence of distributions over X. Let H be a finite class of binary classifiers over X and let f ∈ H. Suppose we are getting a sample S of m examples, such that the instances are independent but are not identically distributed; the ith instance is sampled from Di and then yi is set to be f(xi). Let D̄m denote the average, that is, D̄m = (D1 + ··· + Dm)/m.
Fix an accuracy parameter ε ∈ (0,1). Show that

P[∃h ∈ H s.t. L(D̄m,f)(h) > ε and L(S,f)(h) = 0] ≤ |H| e^{−εm}.
Hint: Use the geometric-arithmetic mean inequality.
3.6 Let H be a hypothesis class of binary classifiers. Show that if H is agnostic PAC learnable, then H is PAC learnable as well. Furthermore, if A is a successful agnostic PAC learner for H, then A is also a successful PAC learner for H.
3.7 (*) The Bayes optimal predictor: Show that for every probability distribution D, the Bayes optimal predictor fD is optimal, in the sense that for every classifier g from X to {0,1}, LD( fD) ≤ LD(g).
3.8 (*) We say that a learning algorithm A is better than B with respect to some probability distribution, D, if
LD(A(S)) ≤ LD(B(S))
for all samples S ∈ (X × {0,1})^m. We say that a learning algorithm A is better than B, if it is better than B with respect to all probability distributions D over X × {0,1}. 1. A probabilistic label predictor is a function that assigns to every domain point x a probability value, h(x) ∈ [0,1], that determines the probability of predicting the label 1. That is, given such an h and an input, x, the label for x is predicted by tossing a coin with bias h(x) toward Heads and predicting 1 iff the coin comes up Heads. Formally, we define a probabilistic label predictor as a function, h : X → [0,1]. The loss of such h on an example (x, y) is defined to be |h(x) − y|, which is exactly the probability that the prediction of h will not be equal to y. Note that if h is deterministic, that is, returns values in {0,1}, then |h(x) − y| = 1[h(x) ≠ y]. Prove that for every data-generating distribution D over X × {0,1}, the Bayes optimal predictor has the smallest risk (w.r.t. the loss function ℓ(h, (x, y)) = |h(x) − y|, among all possible label predictors, including probabilistic ones). 2. Let X be a domain and {0,1} be a set of labels. Prove that for every distribution D over X × {0,1}, there exists a learning algorithm AD that is better than any other learning algorithm with respect to D.
3. Prove that for every learning algorithm A there exist a probability distribution, D, and a learning algorithm B such that A is not better than B w.r.t. D.
3.9 Consider a variant of the PAC model in which there are two example oracles: one that generates positive examples and one that generates negative examples, both according to the underlying distribution D on X . Formally, given a target function f : X → {0,1}, let D+ be the distribution over X + = {x ∈ X : f (x) = 1} defined by D+(A) = D(A)/D(X +), for every A ⊂ X +. Similarly, D− is the distribution over X − induced by D.
The definition of PAC learnability in the two-oracle model is the same as the standard definition of PAC learnability except that here the learner has access to m_H^+(ε, δ) i.i.d. examples from D+ and m_H^−(ε, δ) i.i.d. examples from D−. The learner’s goal is to output h s.t. with probability at least 1 − δ (over the choice of the two training sets, and possibly over the nondeterministic decisions made by the learning algorithm), both L(D+,f)(h) ≤ ε and L(D−,f)(h) ≤ ε.
1. (*) Show that if H is PAC learnable (in the standard one-oracle model), then H is PAC learnable in the two-oracle model.
2. (**) Define h+ to be the always-plus hypothesis and h− to be the always-minus hypothesis. Assume that h+,h− ∈ H. Show that if H is PAC learnable in the two-oracle model, then H is PAC learnable in the standard one-oracle model.
4
Learning via Uniform Convergence
The first formal learning model that we have discussed was the PAC model. In Chapter 2 we have shown that under the realizability assumption, any finite hypothesis class is PAC learnable. In this chapter we will develop a general tool, uniform convergence, and apply it to show that any finite class is learnable in the agnostic PAC model with general loss functions, as long as the range of the loss function is bounded.
4.1 UNIFORM CONVERGENCE IS SUFFICIENT FOR LEARNABILITY
The idea behind the learning condition discussed in this chapter is very simple. Recall that, given a hypothesis class, H, the ERM learning paradigm works as follows: Upon receiving a training sample, S, the learner evaluates the risk (or error) of each h in H on the given sample and outputs a member of H that minimizes this empirical risk. The hope is that an h that minimizes the empirical risk with respect to S is a risk minimizer (or has risk close to the minimum) with respect to the true data probability distribution as well. For that, it suffices to ensure that the empirical risks of all members of H are good approximations of their true risk. Put another way, we need that uniformly over all hypotheses in the hypothesis class, the empirical risk will be close to the true risk, as formalized in the following.
Definition 4.1 (ε-representative sample). A training set S is called ε-representative (w.r.t. domain Z, hypothesis class H, loss function ℓ, and distribution D) if

∀h ∈ H, |LS(h) − LD(h)| ≤ ε.
The next simple lemma states that whenever the sample is (ε/2)-representative, the ERM learning rule is guaranteed to return a good hypothesis.

Lemma 4.2. Assume that a training set S is (ε/2)-representative (w.r.t. domain Z, hypothesis class H, loss function ℓ, and distribution D). Then, any output of ERMH(S), namely, any hS ∈ argmin_{h∈H} LS(h), satisfies

LD(hS) ≤ min_{h∈H} LD(h) + ε.
Proof. For every h ∈ H,

LD(hS) ≤ LS(hS) + ε/2 ≤ LS(h) + ε/2 ≤ LD(h) + ε/2 + ε/2 = LD(h) + ε,

where the first and third inequalities are due to the assumption that S is (ε/2)-representative (Definition 4.1) and the second inequality holds since hS is an ERM predictor.
The preceding lemma implies that to ensure that the ERM rule is an agnostic PAC learner, it suffices to show that with probability of at least 1 − δ over the random choice of a training set, it will be an ε-representative training set. The uniform convergence condition formalizes this requirement.
Definition 4.3 (Uniform Convergence). We say that a hypothesis class H has the uniform convergence property (w.r.t. a domain Z and a loss function ℓ) if there exists a function mUC_H : (0,1)^2 → N such that for every ε, δ ∈ (0,1) and for every probability distribution D over Z, if S is a sample of m ≥ mUC_H(ε, δ) examples drawn i.i.d. according to D, then, with probability of at least 1 − δ, S is ε-representative.
Similar to the definition of sample complexity for PAC learning, the function mUC_H measures the (minimal) sample complexity of obtaining the uniform convergence property, namely, how many examples we need to ensure that with probability of at least 1 − δ the sample would be ε-representative.
The term uniform here refers to having a fixed sample size that works for all members of H and over all possible probability distributions over the domain. The following corollary follows directly from Lemma 4.2 and the definition of uniform convergence.
Corollary 4.4. If a class H has the uniform convergence property with a function mUC_H then the class is agnostically PAC learnable with the sample complexity mH(ε, δ) ≤ mUC_H(ε/2, δ). Furthermore, in that case, the ERMH paradigm is a successful agnostic PAC learner for H.
4.2 FINITE CLASSES ARE AGNOSTIC PAC LEARNABLE
In view of Corollary 4.4, the claim that every finite hypothesis class is agnostic PAC learnable will follow once we establish that uniform convergence holds for a finite hypothesis class.
To show that uniform convergence holds we follow a two step argument, similar to the derivation in Chapter 2. The first step applies the union bound while the second step employs a measure concentration inequality. We now explain these two steps in detail.
Fix some ε, δ. We need to find a sample size m that guarantees that for any D, with probability of at least 1 − δ over the choice of S = (z1,...,zm) sampled i.i.d. from D, we have that for all h ∈ H, |LS(h) − LD(h)| ≤ ε. That is,

Dm({S : ∀h ∈ H, |LS(h) − LD(h)| ≤ ε}) ≥ 1 − δ.
Equivalently, we need to show that
Dm({S : ∃h ∈ H, |LS(h) − LD(h)| > ε}) < δ.
Writing

{S : ∃h ∈ H, |LS(h) − LD(h)| > ε} = ∪_{h∈H} {S : |LS(h) − LD(h)| > ε},

and applying the union bound (Lemma 2.2) we obtain

Dm({S : ∃h ∈ H, |LS(h) − LD(h)| > ε}) ≤ Σ_{h∈H} Dm({S : |LS(h) − LD(h)| > ε}). (4.1)
Our second step will be to argue that each summand of the right-hand side of this inequality is small enough (for a sufficiently large m). That is, we will show that for any fixed hypothesis, h, (which is chosen in advance prior to the sampling of the training set), the gap between the true and empirical risks, |LS(h)− LD(h)|, is likely to be small.
Recall that LD(h) = E_{z∼D}[ℓ(h, z)] and that LS(h) = (1/m) Σ_{i=1}^{m} ℓ(h, zi). Since each zi is sampled i.i.d. from D, the expected value of the random variable ℓ(h, zi) is LD(h). By the linearity of expectation, it follows that LD(h) is also the expected value of LS(h). Hence, the quantity |LD(h) − LS(h)| is the deviation of the random variable LS(h) from its expectation. We therefore need to show that the measure of LS(h) is concentrated around its expected value.
A basic statistical fact, the law of large numbers, states that when m goes to infinity, empirical averages converge to their true expectation. This is true for LS(h), since it is the empirical average of m i.i.d random variables. However, since the law of large numbers is only an asymptotic result, it provides no information about the gap between the empirically estimated error and its true value for any given, finite, sample size.
Instead, we will use a measure concentration inequality due to Hoeffding, which quantifies the gap between empirical averages and their expected value.
Lemma 4.5 (Hoeffding’s Inequality). Let θ1,...,θm be a sequence of i.i.d. random variables and assume that for all i, E[θi] = µ and P[a ≤ θi ≤ b] = 1. Then, for any ε > 0,

P[ |(1/m) Σ_{i=1}^{m} θi − µ| > ε ] ≤ 2 exp(−2mε² / (b − a)²).
The proof can be found in Appendix B.
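A quick simulation (illustrative; the Bernoulli setting and parameters below are arbitrary choices, not from the book) can be used to sanity-check the inequality by comparing the empirical deviation probability with the bound 2 exp(−2mε²) for variables in [0,1]:

```python
# Numerical sanity check of Hoeffding's inequality for Bernoulli variables.
import math, random

random.seed(0)
m, mu, eps, trials = 100, 0.3, 0.1, 20000

deviations = 0
for _ in range(trials):
    mean = sum(random.random() < mu for _ in range(m)) / m
    if abs(mean - mu) > eps:
        deviations += 1

print(deviations / trials)            # empirical P[|mean - mu| > eps]
print(2 * math.exp(-2 * m * eps**2))  # Hoeffding bound: about 0.27
```

The empirical frequency comes out well below the bound, as expected: Hoeffding's inequality holds for any bounded distribution and is therefore not tight for any particular one.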
Getting back to our problem, let θi be the random variable ℓ(h, zi). Since h is fixed and z1,...,zm are sampled i.i.d., it follows that θ1,...,θm are also i.i.d. random variables. Furthermore, LS(h) = (1/m) Σ_{i=1}^{m} θi and LD(h) = µ. Let us further assume that the range of ℓ is [0,1] and therefore θi ∈ [0,1]. We therefore obtain that

Dm({S : |LS(h) − LD(h)| > ε}) = P[ |(1/m) Σ_{i=1}^{m} θi − µ| > ε ] ≤ 2 exp(−2mε²). (4.2)

Combining this with Equation (4.1) yields

Dm({S : ∃h ∈ H, |LS(h) − LD(h)| > ε}) ≤ Σ_{h∈H} 2 exp(−2mε²) = 2|H| exp(−2mε²).
Finally, if we choose

m ≥ log(2|H|/δ) / (2ε²),

then

Dm({S : ∃h ∈ H, |LS(h) − LD(h)| > ε}) ≤ δ.
Corollary 4.6. Let H be a finite hypothesis class, let Z be a domain, and let ℓ : H × Z → [0,1] be a loss function. Then, H enjoys the uniform convergence property with sample complexity

mUC_H(ε, δ) ≤ ⌈log(2|H|/δ) / (2ε²)⌉.

Furthermore, the class is agnostically PAC learnable using the ERM algorithm with sample complexity

mH(ε, δ) ≤ mUC_H(ε/2, δ) ≤ ⌈2 log(2|H|/δ) / ε²⌉.
Remark 4.1 (The “Discretization Trick”). While the preceding corollary only applies to finite hypothesis classes, there is a simple trick that allows us to get a very good estimate of the practical sample complexity of infinite hypothesis classes. Consider a hypothesis class that is parameterized by d parameters. For example, let X = R, Y = {±1}, and the hypothesis class, H, be all functions of the form hθ(x) = sign(x − θ). That is, each hypothesis is parameterized by one parameter, θ ∈ R, and the hypothesis outputs 1 for all instances larger than θ and outputs −1 for instances smaller than θ. This is a hypothesis class of an infinite size. However, if we are going to learn this hypothesis class in practice, using a computer, we will probably maintain real numbers using floating point representation, say, of 64 bits. It follows that in practice, our hypothesis class is parameterized by the set of scalars that can be represented using a 64 bits floating point number. There are at most 2^64 such numbers; hence the actual size of our hypothesis class is at most 2^64. More generally, if our hypothesis class is parameterized by d numbers, in practice we learn a hypothesis class of size at most 2^{64d}. Applying Corollary 4.6 we obtain that the sample complexity of such classes is bounded by (128d + 2 log(2/δ))/ε². This upper bound on the sample complexity has the deficiency of being dependent on the specific representation of real numbers used by our machine. In Chapter 6 we will introduce a rigorous way to analyze the sample complexity of infinite size hypothesis classes. Nevertheless, the discretization trick can be used to get a rough estimate of the sample complexity in many practical situations.
4.3 SUMMARY
If the uniform convergence property holds for a hypothesis class H then in most cases the empirical risks of hypotheses in H will faithfully represent their true risks. Uniform convergence suffices for agnostic PAC learnability using the ERM rule. We have shown that finite hypothesis classes enjoy the uniform convergence property and are hence agnostic PAC learnable.
4.4 BIBLIOGRAPHIC REMARKS
Classes of functions for which the uniform convergence property holds are also called Glivenko-Cantelli classes, named after Valery Ivanovich Glivenko and Francesco Paolo Cantelli, who proved the first uniform convergence result in the 1930s. See (Dudley, Gine & Zinn 1991). The relation between uniform convergence and learnability was thoroughly studied by Vapnik – see (Vapnik 1992, Vapnik 1995, Vapnik 1998). In fact, as we will see later in Chapter 6, the fundamental theorem of learning theory states that in binary classification problems, uniform convergence is not only a sufficient condition for learnability but is also a necessary condition. This is not the case for more general learning problems (see (Shalev-Shwartz, Shamir, Srebro & Sridharan 2010)).
4.5 EXERCISES
4.1 In this exercise, we show that the (ε, δ) requirement on the convergence of errors in our definitions of PAC learning is, in fact, quite close to a simpler looking requirement about averages (or expectations). Prove that the following two statements are equivalent (for any learning algorithm A, any probability distribution D, and any loss function whose range is [0,1]):

1. For every ε, δ > 0, there exists m(ε, δ) such that ∀m ≥ m(ε, δ),

P_{S∼Dm}[LD(A(S)) > ε] < δ.

2. lim_{m→∞} E_{S∼Dm}[LD(A(S))] = 0

(where E_{S∼Dm} denotes the expectation over samples S of size m).
4.2 Bounded loss functions: In Corollary 4.6 we assumed that the range of the loss function is [0,1]. Prove that if the range of the loss function is [a,b] then the sample complexity satisfies

mH(ε, δ) ≤ mUC_H(ε/2, δ) ≤ ⌈2 log(2|H|/δ)(b − a)² / ε²⌉.
5
The Bias-Complexity Tradeoff
In Chapter 2 we saw that unless one is careful, the training data can mislead the learner, and result in overfitting. To overcome this problem, we restricted the search space to some hypothesis class H. Such a hypothesis class can be viewed as reflecting some prior knowledge that the learner has about the task – a belief that one of the members of the class H is a low-error model for the task. For example, in our papayas taste problem, on the basis of our previous experience with other fruits, we may assume that some rectangle in the color-hardness plane predicts (at least approximately) the papaya’s tastiness.
Is such prior knowledge really necessary for the success of learning? Maybe there exists some kind of universal learner, that is, a learner who has no prior knowledge about a certain task and is ready to be challenged by any task? Let us elaborate on this point. A specific learning task is defined by an unknown distribution D over X × Y, where the goal of the learner is to find a predictor h : X → Y, whose risk, LD(h), is small enough. The question is therefore whether there exist a learning algorithm A and a training set size m, such that for every distribution D, if A receives m i.i.d. examples from D, there is a high chance it outputs a predictor h that has a low risk.
The first part of this chapter addresses this question formally. The No-Free-Lunch theorem states that no such universal learner exists. To be more precise, the theorem states that for binary classification prediction tasks, for every learner there exists a distribution on which it fails. We say that the learner fails if, upon receiving i.i.d. examples from that distribution, its output hypothesis is likely to have a large risk, say, ≥ 0.3, whereas for the same distribution, there exists another learner that will output a hypothesis with a small risk. In other words, the theorem states that no learner can succeed on all learnable tasks – every learner has tasks on which it fails while other learners succeed.
Therefore, when approaching a particular learning problem, defined by some distribution D, we should have some prior knowledge on D. One type of such prior knowledge is that D comes from some specific parametric family of distributions. We will study learning under such assumptions later on in Chapter 24. Another type of prior knowledge on D, which we assumed when defining the PAC learning model,
is that there exists h in some predefined hypothesis class H, such that LD(h) = 0. A softer type of prior knowledge on D is assuming that minh∈H LD(h) is small. In a sense, this weaker assumption on D is a prerequisite for using the agnostic PAC model, in which we require that the risk of the output hypothesis will not be much larger than minh∈H LD(h).
In the second part of this chapter we study the benefits and pitfalls of using a hypothesis class as a means of formalizing prior knowledge. We decompose the error of an ERM algorithm over a class H into two components. The first component reflects the quality of our prior knowledge, measured by the minimal risk of a hypothesis in our hypothesis class, min_{h∈H} LD(h). This component is also called the approximation error, or the bias of the algorithm toward choosing a hypothesis from H. The second component is the error due to overfitting, which depends on the size or the complexity of the class H and is called the estimation error. These two terms imply a tradeoff between choosing a more complex H (which can decrease the bias but increases the risk of overfitting) or a less complex H (which might increase the bias but decreases the potential overfitting).
5.1 THE NO-FREE-LUNCH THEOREM
In this part we prove that there is no universal learner. We do this by showing that no learner can succeed on all learning tasks, as formalized in the following theorem:
Theorem 5.1. (No-Free-Lunch) Let A be any learning algorithm for the task of binary classification with respect to the 0−1 loss over a domain X. Let m be any number smaller than |X|/2, representing a training set size. Then, there exists a distribution D over X × {0,1} such that:
1. There exists a function f : X → {0,1} with LD( f ) = 0.
2. With probability of at least 1/7 over the choice of S ∼ Dm we have that LD(A(S)) ≥ 1/8.
This theorem states that for every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner. Indeed, a trivial successful learner in this case would be an ERM learner with the hypothesis class H = {f}, or more generally, ERM with respect to any finite hypothesis class that contains f and whose size satisfies m ≥ 8 log(7|H|/6) (see Corollary 2.3).
Proof. Let C be a subset of X of size 2m. The intuition of the proof is that any learning algorithm that observes only half of the instances in C has no information on what should be the labels of the rest of the instances in C. Therefore, there exists a “reality,” that is, some target function f , that would contradict the labels that A(S) predicts on the unobserved instances in C.
Note that there are T = 2^{2m} possible functions from C to {0,1}. Denote these functions by f1,..., fT. For each such function, let Di be a distribution over C × {0,1} defined by

Di({(x, y)}) = 1/|C| if y = fi(x), and 0 otherwise.
That is, the probability to choose a pair (x, y) is 1/|C| if the label y is indeed the true label according to fi, and the probability is 0 if y ≠ fi(x). Clearly, LDi(fi) = 0. We will show that for every algorithm, A, that receives a training set of m examples from C × {0,1} and returns a function A(S) : C → {0,1}, it holds that

max_{i∈[T]} E_{S∼Di^m}[LDi(A(S))] ≥ 1/4. (5.1)

Clearly, this means that for every algorithm, A′, that receives a training set of m examples from X × {0,1} there exist a function f : X → {0,1} and a distribution D over X × {0,1}, such that LD(f) = 0 and

E_{S∼Dm}[LD(A′(S))] ≥ 1/4. (5.2)
It is easy to verify that the preceding suffices for showing that P[LD(A′(S)) ≥ 1/8] ≥ 1/7, which is what we need to prove (see Exercise 5.1).
We now turn to proving that Equation (5.1) holds. There are k = (2m)^m possible sequences of m examples from C. Denote these sequences by S1,..., Sk. Also, if Sj = (x1,..., xm) we denote by S_j^i the sequence containing the instances in Sj labeled by the function fi, namely, S_j^i = ((x1, fi(x1)),...,(xm, fi(xm))). If the distribution is Di then the possible training sets A can receive are S_1^i,..., S_k^i, and all these training sets have the same probability of being sampled. Therefore,

E_{S∼Di^m}[LDi(A(S))] = (1/k) Σ_{j=1}^{k} LDi(A(S_j^i)). (5.3)
Using the facts that “maximum” is larger than “average” and that “average” is larger than “minimum,” we have

max_{i∈[T]} (1/k) Σ_{j=1}^{k} LDi(A(S_j^i)) ≥ (1/T) Σ_{i=1}^{T} (1/k) Σ_{j=1}^{k} LDi(A(S_j^i))
= (1/k) Σ_{j=1}^{k} (1/T) Σ_{i=1}^{T} LDi(A(S_j^i))
≥ min_{j∈[k]} (1/T) Σ_{i=1}^{T} LDi(A(S_j^i)). (5.4)
Next, fix some j ∈ [k]. Denote Sj = (x1,..., xm) and let v1,...,vp be the examples in C that do not appear in Sj . Clearly, p ≥ m. Therefore, for every function h : C →
{0,1} and every i we have

LDi(h) = (1/2m) Σ_{x∈C} 1[h(x) ≠ fi(x)]
≥ (1/2m) Σ_{r=1}^{p} 1[h(vr) ≠ fi(vr)]
≥ (1/2p) Σ_{r=1}^{p} 1[h(vr) ≠ fi(vr)]. (5.5)

Hence,

(1/T) Σ_{i=1}^{T} LDi(A(S_j^i)) ≥ (1/T) Σ_{i=1}^{T} (1/2p) Σ_{r=1}^{p} 1[A(S_j^i)(vr) ≠ fi(vr)]
= (1/2p) Σ_{r=1}^{p} (1/T) Σ_{i=1}^{T} 1[A(S_j^i)(vr) ≠ fi(vr)]
≥ (1/2) · min_{r∈[p]} (1/T) Σ_{i=1}^{T} 1[A(S_j^i)(vr) ≠ fi(vr)]. (5.6)
Next, fix some r ∈ [p]. We can partition all the functions in f1,..., fT into T/2 disjoint pairs, where for a pair (fi, fi′) we have that for every c ∈ C, fi(c) ≠ fi′(c) if and only if c = vr. Since for such a pair we must have S_j^i = S_j^{i′}, it follows that

1[A(S_j^i)(vr) ≠ fi(vr)] + 1[A(S_j^{i′})(vr) ≠ fi′(vr)] = 1,
which yields

(1/T) Σ_{i=1}^{T} 1[A(S_j^i)(vr) ≠ fi(vr)] = 1/2.
Combining this with Equation (5.6), Equation (5.4), and Equation (5.3), we obtain that Equation (5.1) holds, which concludes our proof.
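For a tiny domain, the quantity in Equation (5.1) can be checked by brute force. The sketch below (illustrative, not from the book) takes m = 2 and |C| = 4, enumerates all T = 16 labeling functions and all k = 16 instance sequences, and evaluates one arbitrary learner – a memorizer that predicts 0 on unseen points; the printed maximum is at least 1/4, as the theorem guarantees for any learner:

```python
# Brute-force check of Equation (5.1) on a domain C of size 2m = 4.
from itertools import product

m = 2
C = range(2 * m)
functions = list(product([0, 1], repeat=len(C)))   # all f : C -> {0,1}
sequences = list(product(C, repeat=m))             # all S|x of length m

def A(S):
    """An arbitrary learner: memorize the sample, predict 0 elsewhere."""
    memory = dict(S)
    return lambda x: memory.get(x, 0)

def true_risk(h, f):
    """L_{D_i}(h) under the uniform distribution on C labeled by f."""
    return sum(h(x) != f[x] for x in C) / len(C)

worst = max(
    sum(true_risk(A([(x, f[x]) for x in Sx]), f) for Sx in sequences)
    / len(sequences)
    for f in functions
)
print(worst)  # >= 0.25, as Equation (5.1) predicts
```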
5.1.1 No-Free-Lunch and Prior Knowledge
How does the No-Free-Lunch result relate to the need for prior knowledge? Let us consider an ERM predictor over the hypothesis class H of all the functions f from X to {0,1}. This class represents lack of prior knowledge: Every possible function from the domain to the label set is considered a good candidate. According to the No-Free-Lunch theorem, any algorithm that chooses its output from hypotheses in H, and in particular the ERM predictor, will fail on some learning task. Therefore, this class is not PAC learnable, as formalized in the following corollary:
Corollary 5.2. Let X be an infinite domain set and let H be the set of all functions from X to {0,1}. Then, H is not PAC learnable.
Proof. Assume, by way of contradiction, that the class is learnable. Choose some ε < 1/8 and δ < 1/7. By the definition of PAC learnability, there must be some
learning algorithm A and an integer m = m(ε, δ), such that for any data-generating distribution over X × {0,1}, if for some function f : X → {0,1}, LD(f) = 0, then with probability greater than 1 − δ when A is applied to samples S of size m, generated i.i.d. by D, LD(A(S)) ≤ ε. However, applying the No-Free-Lunch theorem, since |X| > 2m, for every learning algorithm (and in particular for the algorithm A), there exists a distribution D such that with probability greater than 1/7 > δ, LD(A(S)) > 1/8 > ε, which leads to the desired contradiction.
How can we prevent such failures? We can escape the hazards foreseen by the No-Free-Lunch theorem by using our prior knowledge about a specific learning task, to avoid the distributions that will cause us to fail when learning that task. Such prior knowledge can be expressed by restricting our hypothesis class.
But how should we choose a good hypothesis class? On the one hand, we want to believe that this class includes the hypothesis that has no error at all (in the PAC setting), or at least that the smallest error achievable by a hypothesis from this class is indeed rather small (in the agnostic setting). On the other hand, we have just seen that we cannot simply choose the richest class – the class of all functions over the given domain. This tradeoff is discussed in the following section.
5.2 ERROR DECOMPOSITION
To answer this question we decompose the error of an ERMH predictor into two components as follows. Let hS be an ERMH hypothesis. Then, we can write
LD(hS) = ε_app + ε_est, where: ε_app = min_{h∈H} LD(h), ε_est = LD(hS) − ε_app. (5.7)
The Approximation Error – the minimum risk achievable by a predictor in the hypothesis class. This term measures how much risk we have because we restrict ourselves to a specific class, namely, how much inductive bias we have. The approximation error does not depend on the sample size and is determined by the hypothesis class chosen. Enlarging the hypothesis class can decrease the approximation error.
Under the realizability assumption, the approximation error is zero. In the agnostic case, however, the approximation error can be large (see footnote 1).
The Estimation Error – the difference between the approximation error and the error achieved by the ERM predictor. The estimation error results because the empirical risk (i.e., training error) is only an estimate of the true risk, and so the predictor minimizing the empirical risk is only an estimate of the predictor minimizing the true risk.
The quality of this estimation depends on the training set size and on the size, or complexity, of the hypothesis class. As we have shown, for a finite hypothesis class, ε_est increases (logarithmically) with |H| and decreases with m. We can think of the size of H as a measure of its complexity. In future chapters we will define other complexity measures of hypothesis classes.

1 In fact, it always includes the error of the Bayes optimal predictor (see Chapter 3), the minimal yet inevitable error, because of the possible nondeterminism of the world in this model. Sometimes in the literature the term approximation error refers not to min_{h∈H} LD(h), but rather to the excess error over that of the Bayes optimal predictor, namely, min_{h∈H} LD(h) − ε_Bayes.
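The decomposition can be observed numerically. Here is a Python sketch (our own illustration; the distribution and all names in it are assumptions made for the demo): under a noisy-threshold distribution over {0,...,9}, ε_app of the threshold class is computed exactly, and the average estimation error of ERM shrinks as the sample grows.

import random

X = list(range(10))                       # uniform marginal over {0,...,9}
def p1(x):                                # P[y = 1 | x]: a noisy threshold at 5
    return 0.9 if x < 5 else 0.1

def true_risk(a):                         # exact L_D of h_a(x) = 1[x < a]
    return sum((1 - p1(x)) if x < a else p1(x) for x in X) / len(X)

cuts = range(11)                          # a in {0,...,10} covers all behaviors
eps_app = min(true_risk(a) for a in cuts)             # approximation error

def sample(m):
    return [(x, int(random.random() < p1(x))) for x in random.choices(X, k=m)]

def erm(S):
    return min(cuts, key=lambda a: sum((x < a) != y for x, y in S))

random.seed(0)
for m in [10, 100, 1000]:
    est = sum(true_risk(erm(sample(m))) - eps_app for _ in range(200)) / 200
    print(f"m = {m:4d}   eps_app = {eps_app:.2f}   avg eps_est = {est:.4f}")

The approximation error stays at 0.1 no matter how much data arrives; only the estimation term responds to m, exactly as the decomposition predicts.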
Since our goal is to minimize the total risk, we face a tradeoff, called the bias-complexity tradeoff. On the one hand, choosing H to be a very rich class decreases the approximation error but at the same time might increase the estimation error, as a rich H might lead to overfitting. On the other hand, choosing H to be a very small set reduces the estimation error but might increase the approximation error or, in other words, might lead to underfitting. Of course, a great choice for H is the class that contains only one classifier – the Bayes optimal classifier. But the Bayes optimal classifier depends on the underlying distribution D, which we do not know (indeed, learning would have been unnecessary had we known D).
Learning theory studies how rich we can make H while still maintaining reasonable estimation error. In many cases, empirical research focuses on designing good hypothesis classes for a certain domain. Here, "good" means classes for which the approximation error would not be excessively high. The idea is that although we are not experts and do not know how to construct the optimal classifier, we still have some prior knowledge of the specific problem at hand, which enables us to design hypothesis classes for which both the approximation error and the estimation error are not too large. Getting back to our papayas example, we do not know how exactly the color and hardness of a papaya predict its taste, but we do know that papaya is a fruit, and on the basis of our experience with other fruits we conjecture that a rectangle in the color-hardness space may be a good predictor.
5.3 SUMMARY
The No-Free-Lunch theorem states that there is no universal learner. Every learner has to be specialized to some task, and use some prior knowledge about that task, in order to succeed. So far we have modeled our prior knowledge by restricting our output hypothesis to be a member of a chosen hypothesis class. When choosing this hypothesis class, we face a tradeoff between a larger, or more complex, class that is more likely to have a small approximation error, and a more restricted class that would guarantee that the estimation error will be small. In the next chapter we will study in more detail the behavior of the estimation error. In Chapter 7 we will discuss alternative ways to express prior knowledge.
5.4 BIBLIOGRAPHIC REMARKS
(Wolpert & Macready 1997) proved several no-free-lunch theorems for optimiza tion, but these are rather different from the theorem we prove here. The theorem we prove here is closely related to lower bounds in VC theory, as we will study in the next chapter.
5.5 EXERCISES
5.1 Prove that Equation (5.2) suffices for showing that P[LD(A(S)) ≥ 1/8] ≥ 1/7. Hint: Let θ be a random variable that receives values in [0,1] and whose expectation satisfies E[θ] ≥ 1/4. Use Lemma B.1 to show that P[θ ≥ 1/8] ≥ 1/7.
5.2 Assume you are asked to design a learning algorithm to predict whether patients are going to suffer a heart attack. Relevant patient features the algorithm may have access to include blood pressure (BP), body-mass index (BMI), age (A), level of physical activity (P), and income (I).
You have to choose between two algorithms; the first picks an axis aligned rectangle in the two dimensional space spanned by the features BP and BMI and the other picks an axis aligned rectangle in the five dimensional space spanned by all the preceding features.
1. Explain the pros and cons of each choice.
2. Explain how the number of available labeled training samples will affect your choice.
5.3 Prove that if |X| ≥ km for a positive integer k ≥ 2, then we can replace the lower bound of 1/4 in the No-Free-Lunch theorem with (k − 1)/(2k) = 1/2 − 1/(2k). Namely, let A be a learning algorithm for the task of binary classification. Let m be any number smaller than |X|/k, representing a training set size. Then, there exists a distribution D over X × {0,1} such that:

There exists a function f : X → {0,1} with LD(f) = 0.

E_{S∼D^m}[LD(A(S))] ≥ 1/2 − 1/(2k).
6
The VC-Dimension
In the previous chapter, we decomposed the error of the ERM_H rule into approximation error and estimation error. The approximation error depends on the fit of our prior knowledge (as reflected by the choice of the hypothesis class H) to the underlying unknown distribution. In contrast, the definition of PAC learnability requires that the estimation error would be bounded uniformly over all distributions.
Our current goal is to figure out which classes H are PAC learnable, and to characterize exactly the sample complexity of learning a given hypothesis class. So far we have seen that finite classes are learnable, but that the class of all functions (over an infinite size domain) is not. What makes one class learnable and the other unlearnable? Can infinite-size classes be learnable, and, if so, what determines their sample complexity?
We begin the chapter by showing that infinite classes can indeed be learnable, and thus, finiteness of the hypothesis class is not a necessary condition for learnability. We then present a remarkably crisp characterization of the family of learnable classes in the setup of binary valued classification with the zero-one loss. This characterization was first discovered by Vladimir Vapnik and Alexey Chervonenkis in 1970 and relies on a combinatorial notion called the Vapnik-Chervonenkis dimension (VC-dimension). We formally define the VC-dimension, provide several examples, and then state the fundamental theorem of statistical learning theory, which integrates the concepts of learnability, VC-dimension, the ERM rule, and uniform convergence.
6.1 INFINITE-SIZE CLASSES CAN BE LEARNABLE
In Chapter 4 we saw that finite classes are learnable, and in fact the sample complexity of a hypothesis class is upper bounded by the log of its size. To show that the size of the hypothesis class is not the right characterization of its sample complexity, we first present a simple example of an infinite-size hypothesis class that is learnable.
Example 6.1. Let H be the set of threshold functions over the real line, namely, H = {ha : a ∈ R}, where ha : R → {0,1} is a function such that ha(x) = 1[x < a]. Clearly, H is of infinite size; nevertheless, it is PAC learnable using the ERM rule, with sample complexity m_H(ε,δ) ≤ ⌈log(2/δ)/ε⌉, as we now show.

Let a* be a threshold such that the hypothesis h*(x) = 1[x < a*] achieves LD(h*) = 0. Let a0 < a* < a1 be such that the probability mass (with respect to the marginal distribution over X) of each of the intervals (a0, a*) and (a*, a1) is ε. Given a training set S, let b0 = max{x : (x,1) ∈ S} and b1 = min{x : (x,0) ∈ S}, and let bS be a threshold corresponding to an ERM hypothesis hS, which implies that bS ∈ (b0, b1). Therefore, a sufficient condition for LD(hS) ≤ ε is that both b0 ≥ a0 and b1 ≤ a1. In other words,

P_{S∼D^m}[LD(hS) > ε] ≤ P_{S∼D^m}[b0 < a0 ∨ b1 > a1],

and using the union bound we can bound the preceding by

P_{S∼D^m}[LD(hS) > ε] ≤ P_{S∼D^m}[b0 < a0] + P_{S∼D^m}[b1 > a1]. (6.1)

The event b0 < a0 happens if and only if all examples in S are not in the interval (a0, a*), whose probability mass is defined to be ε, namely,

P_{S∼D^m}[b0 < a0] = P_{S∼D^m}[∀(x, y) ∈ S, x ∉ (a0, a*)] = (1 − ε)^m ≤ e^{−εm}.

Since we assume m > log(2/δ)/ε, it follows that this is at most δ/2. In the same way it is easy to see that P_{S∼D^m}[b1 > a1] ≤ δ/2. Combining with Equation (6.1) we conclude our proof.
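The ERM rule in this proof is one line of code: take the threshold just above the largest positive example. The sketch below (ours; the target a* and all names are assumptions for the demo) draws samples of the size the lemma prescribes and estimates the failure probability P[LD(hS) > ε], which indeed stays below δ.

import math
import random

a_star = 0.3                               # realizable target: f(x) = 1[x < a_star]
eps, delta = 0.1, 0.1
m = math.ceil(math.log(2 / delta) / eps)   # the lemma's sample size (30 here)

def erm_threshold(S):
    # b0 = largest positive example; predict 1 on [0, b0] (-inf if none)
    pos = [x for x, y in S if y == 1]
    return max(pos) if pos else float("-inf")

def true_risk(b0):
    # this hypothesis errs exactly on the mass of (b0, a_star) under Uniform[0,1]
    return max(0.0, a_star - b0)

random.seed(1)
trials = 2000
failures = sum(
    true_risk(erm_threshold([(x, int(x < a_star))
                             for x in (random.random() for _ in range(m))])) > eps
    for _ in range(trials))
print(f"m = {m}: empirical P[L_D(h_S) > eps] = {failures / trials:.3f} (delta = {delta})")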
6.2 THE VC-DIMENSION
We see, therefore, that while finiteness of H is a sufficient condition for learnability, it is not a necessary condition. As we will show, a property called the VC-dimension of a hypothesis class gives the correct characterization of its learnability. To motivate the definition of the VC-dimension, let us recall the No-Free-Lunch theorem (Theorem 5.1) and its proof. There, we have shown that without restricting the hypothesis class, for any learning algorithm, an adversary can construct a distribution for which the learning algorithm will perform poorly, while there is another learning algorithm that will succeed on the same distribution. To do so, the adversary used a finite set C ⊂ X and considered a family of distributions that are concentrated on elements of C. Each distribution was derived from a "true" target function from C to {0,1}. To make any algorithm fail, the adversary used the power of choosing a target function from the set of all possible functions from C to {0,1}.
When considering PAC learnability of a hypothesis class H, the adversary is restricted to constructing distributions for which some hypothesis h ∈ H achieves a zero risk. Since we are considering distributions that are concentrated on elements of C, we should study how H behaves on C, which leads to the following definition.
Definition 6.2 (Restriction of H to C). Let H be a class of functions from X to {0,1} and let C = {c1,...,cm} ⊂ X. The restriction of H to C is the set of functions from C to {0,1} that can be derived from H. That is,
HC = {(h(c1),...,h(cm)) : h ∈ H},
where we represent each function from C to {0,1} as a vector in {0,1}^{|C|}.
If the restriction of H to C is the set of all functions from C to {0,1}, then we say that H shatters the set C. Formally:
Definition 6.3 (Shattering). A hypothesis class H shatters a finite set C ⊂ X if the restriction of H to C is the set of all functions from C to {0,1}. That is, |HC| = 2^{|C|}.
Example 6.2. Let H be the class of threshold functions over R. Take a set C = {c1}. Now, if we take a = c1 + 1, then we have ha(c1) = 1, and if we take a = c1 − 1, then we have ha(c1) = 0. Therefore, HC is the set of all functions from C to {0,1}, and H shatters C. Now take a set C = {c1,c2}, where c1 ≤ c2. No h ∈ H can account for the labeling (0,1), because any threshold that assigns the label 0 to c1 must assign the label 0 to c2 as well. Therefore not all functions from C to {0,1} are included in HC; hence C is not shattered by H.
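Shattering of a small set can be verified by brute force: enumerate every labeling of C and ask whether some hypothesis realizes it. In the Python sketch below (our own; all names are ours), it suffices to try one threshold below all points, one above, and one between each consecutive pair, since these finitely many thresholds exhibit every behavior the class has on C.

import itertools

def shatters(hypotheses, C):
    # True iff every labeling of C is realized by some h, i.e. |H_C| = 2^|C|
    realized = {tuple(h(c) for c in C) for h in hypotheses}
    return all(lab in realized
               for lab in itertools.product([0, 1], repeat=len(C)))

def thresholds_on(C):
    pts = sorted(C)
    cuts = [pts[0] - 1] + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1]
    return [lambda x, a=a: int(x < a) for a in cuts]

print(shatters(thresholds_on([5]), [5]))        # True:  VCdim >= 1
print(shatters(thresholds_on([3, 7]), [3, 7]))  # False: (0, 1) is unrealizable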
Getting back to the construction of an adversarial distribution as in the proof of the No-Free-Lunch theorem (Theorem 5.1), we see that whenever some set C is shattered by H, the adversary is not restricted by H, as they can construct a distribution over C based on any target function from C to {0,1}, while still maintaining the realizability assumption. This immediately yields:
Corollary 6.4. Let H be a hypothesis class of functions from X to {0,1}. Let m be a training set size. Assume that there exists a set C ⊂ X of size 2m that is shattered by H. Then, for any learning algorithm, A, there exist a distribution D over X × {0,1} and a predictor h ∈ H such that LD(h) = 0 but with probability of at least 1/7 over the choice of S ∼ Dm we have that LD(A(S)) ≥ 1/8.
Corollary 6.4 tells us that if H shatters some set C of size 2m then we cannot learn H using m examples. Intuitively, if a set C is shattered by H, and we receive a sample containing half the instances of C, the labels of these instances give us no information about the labels of the rest of the instances in C – every possible labeling of the rest of the instances can be explained by some hypothesis in H. Philosophically,

If someone can explain every phenomenon, his explanations are worthless.

This leads us directly to the definition of the VC-dimension.
Definition 6.5 (VC-dimension). The VC-dimension of a hypothesis class H, denoted VCdim(H), is the maximal size of a set C ⊂ X that can be shattered by H. If H can shatter sets of arbitrarily large size we say that H has infinite VC-dimension.
A direct consequence of Corollary 6.4 is therefore:
Theorem 6.6. Let H be a class of infinite VC-dimension. Then, H is not PAC learnable.
Proof. Since H has an infinite VC-dimension, for any training set size m, there exists a shattered set of size 2m, and the claim follows by Corollary 6.4.
We shall see later in this chapter that the converse is also true: A finite VC-dimension guarantees learnability. Hence, the VC-dimension characterizes PAC learnability. But before delving into more theory, we first show several examples.
6.3 EXAMPLES
In this section we calculate the VC-dimension of several hypothesis classes. To show that VCdim(H) = d we need to show that
1. There exists a set C of size d that is shattered by H.
2. Every set C of size d + 1 is not shattered by H.
6.3.1 Threshold Functions
Let H be the class of threshold functions over R. Recall Example 6.2, where we have shown that for an arbitrary set C = {c1}, H shatters C; therefore VCdim(H) ≥ 1. We have also shown that for an arbitrary set C = {c1,c2} where c1 ≤ c2, H does not shatter C. We therefore conclude that VCdim(H) = 1.
6.3.2 Intervals
Let H be the class of intervals over R, namely, H = {ha,b : a,b ∈ R,a < b}, where ha,b : R→ {0,1} is a function such that ha,b(x) = 1[x∈(a,b)]. Take the set C = {1,2}. Then, H shatters C (make sure you understand why) and therefore VCdim(H) ≥ 2. Now take an arbitrary set C = {c1,c2,c3} and assume without loss of generality that c1 ≤ c2 ≤ c3. Then, the labeling (1,0,1) cannot be obtained by an interval and therefore H does not shatter C. We therefore conclude that VCdim(H) = 2.
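The same brute-force check confirms both claims for intervals; the only care needed is to include interval endpoints that leave all of C outside, so the all-zeros labeling is represented (again, our own illustrative snippet).

import itertools

def shatters(hypotheses, C):
    realized = {tuple(h(c) for c in C) for h in hypotheses}
    return all(lab in realized
               for lab in itertools.product([0, 1], repeat=len(C)))

def intervals_on(C):
    pts = sorted(C)
    cuts = ([pts[0] - 2, pts[0] - 1]
            + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1])
    return [lambda x, a=a, b=b: int(a < x < b)
            for a, b in itertools.combinations(cuts, 2)]

print(shatters(intervals_on([1, 2]), [1, 2]))         # True:  VCdim >= 2
print(shatters(intervals_on([1, 2, 3]), [1, 2, 3]))   # False: (1, 0, 1) fails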
6.3.3 Axis Aligned Rectangles
Let H be the class of axis aligned rectangles, formally:
H = {h(a1,a2,b1,b2) : a1 ≤ a2 and b1 ≤ b2}, where

h(a1,a2,b1,b2)(x1, x2) = 1 if a1 ≤ x1 ≤ a2 and b1 ≤ x2 ≤ b2, and 0 otherwise. (6.2)

Figure 6.1. Left: 4 points that are shattered by axis aligned rectangles. Right: Any axis aligned rectangle cannot label c5 by 0 and the rest of the points by 1.
We shall show in the following that VCdim(H) = 4. To prove this we need to find a set of 4 points that are shattered by H, and show that no set of 5 points can be shattered by H. Finding a set of 4 points that are shattered is easy (see Figure 6.1). Now, consider any set C ⊂ R² of 5 points. In C, take a leftmost point (whose first coordinate is the smallest in C), a rightmost point (first coordinate is the largest), a lowest point (second coordinate is the smallest), and a highest point (second coordinate is the largest). Without loss of generality, denote C = {c1,...,c5} and let c5 be the point that was not selected. Now, define the labeling (1,1,1,1,0). It is impossible to obtain this labeling by an axis aligned rectangle. Indeed, such a rectangle must contain c1,...,c4; but in this case the rectangle contains c5 as well, because its coordinates are within the intervals defined by the selected points. So, C is not shattered by H, and therefore VCdim(H) = 4.
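Both halves of the argument can be checked mechanically, because a labeling of a point set is realizable by an axis aligned rectangle exactly when the bounding box of the positively labeled points contains no negatively labeled point. The sketch below (ours) verifies that a 4-point diamond is shattered and that no 5-point subset of a small grid is.

import itertools

def realizable(points, labels):
    # a rectangle exists iff the bounding box of the positives avoids all negatives
    pos = [p for p, y in zip(points, labels) if y == 1]
    if not pos:
        return True                      # an empty rectangle labels everything 0
    x0, x1 = min(p[0] for p in pos), max(p[0] for p in pos)
    y0, y1 = min(p[1] for p in pos), max(p[1] for p in pos)
    return not any(x0 <= p[0] <= x1 and y0 <= p[1] <= y1
                   for p, y in zip(points, labels) if y == 0)

def shattered(points):
    return all(realizable(points, lab)
               for lab in itertools.product([0, 1], repeat=len(points)))

diamond = [(0, 1), (1, 0), (2, 1), (1, 2)]     # the configuration of Figure 6.1
print(shattered(diamond))                       # True: VCdim >= 4

grid = list(itertools.product(range(4), repeat=2))
print(any(shattered(C) for C in itertools.combinations(grid, 5)))   # False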
6.3.4 Finite Classes
Let H be a finite class. Then, clearly, for any set C we have |HC| ≤ |H| and thus C cannot be shattered if |H| < 2^{|C|}. This implies that VCdim(H) ≤ log2(|H|). This shows that the PAC learnability of finite classes follows from the more general statement of PAC learnability of classes with finite VC-dimension, which we shall see in the next section. Note, however, that the VC-dimension of a finite class H can be significantly smaller than log2(|H|). For example, let X = {1,...,k}, for some integer k, and consider the class of threshold functions (as defined in Example 6.2). Then, |H| = k but VCdim(H) = 1. Since k can be arbitrarily large, the gap between log2(|H|) and VCdim(H) can be arbitrarily large.
6.3.5 VC-Dimension and the Number of Parameters
In the previous examples, the VC-dimension happened to equal the number of parameters defining the hypothesis class. While this is often the case, it is not always true. Consider, for example, the domain X = R, and the hypothesis class H = {hθ : θ ∈ R} where hθ : X → {0,1} is defined by hθ(x) = ⌈0.5 sin(θx)⌉. It is possible to prove that VCdim(H) = ∞, namely, for every d, one can find d points that are shattered by H (see Exercise 6.8).
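The shattering construction behind this fact is concrete enough to run (the snippet is ours and leans on the binary-expansion lemma quoted in Exercise 6.8): to realize labels y1,...,yd on the points 2^1,...,2^d, encode the flipped labels as binary digits of a number x and take θ = πx.

import itertools
import math

d = 4
points = [2 ** k for k in range(1, d + 1)]        # the set {2, 4, ..., 2^d}

def h(theta, x):
    return math.ceil(0.5 * math.sin(theta * x))   # the hypothesis from the text

for labels in itertools.product([0, 1], repeat=d):
    # digit k of x is 1 - y_k, plus a trailing 1 so the lemma's proviso holds
    x = sum((1 - y) * 2.0 ** -(k + 1) for k, y in enumerate(labels)) + 2.0 ** -(d + 1)
    theta = math.pi * x
    assert all(h(theta, p) == y for p, y in zip(points, labels))

print(f"one parameter theta realizes all {2 ** d} labelings of {points}")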
6.4 THE FUNDAMENTAL THEOREM OF PAC LEARNING
We have already shown that a class of infinite VC-dimension is not learnable. The converse statement is also true, leading to the fundamental theorem of statistical learning theory:
Theorem 6.7 (The Fundamental Theorem of Statistical Learning). Let H be a hypothesis class of functions from a domain X to {0,1} and let the loss function be the 0−1 loss. Then, the following are equivalent:
1. H has the uniform convergence property.
2. Any ERM rule is a successful agnostic PAC learner for H.
3. H is agnostic PAC learnable.
4. H is PAC learnable.
5. Any ERM rule is a successful PAC learner for H.
6. H has a finite VC-dimension.
The proof of the theorem is given in the next section.
Not only does the VC-dimension characterize PAC learnability; it even determines the sample complexity.
Theorem 6.8 (The Fundamental Theorem of Statistical Learning – Quantitative Version). Let H be a hypothesis class of functions from a domain X to {0,1} and let the loss function be the 0−1 loss. Assume that VCdim(H) = d < ∞. Then, there are absolute constants C1,C2 such that
1. H has the uniform convergence property with sample complexity

C1·(d + log(1/δ))/ε² ≤ m^UC_H(ε,δ) ≤ C2·(d + log(1/δ))/ε²

2. H is agnostic PAC learnable with sample complexity

C1·(d + log(1/δ))/ε² ≤ mH(ε,δ) ≤ C2·(d + log(1/δ))/ε²

3. H is PAC learnable with sample complexity

C1·(d + log(1/δ))/ε ≤ mH(ε,δ) ≤ C2·(d·log(1/ε) + log(1/δ))/ε
The proof of this theorem is given in Chapter 28.
Remark 6.3. We stated the fundamental theorem for binary classification tasks. A similar result holds for some other learning problems such as regression with the absolute loss or the squared loss. However, the theorem does not hold for all learning tasks. In particular, learnability is sometimes possible even though the uniform convergence property does not hold (we will see an example in Chapter 13, Exercise 6.2). Furthermore, in some situations, the ERM rule fails but learnability is possible with other learning rules.
6.5 PROOF OF THEOREM 6.7
We have already seen that 1 → 2 in Chapter 4. The implications 2 → 3 and 3 → 4 are trivial and so is 2 → 5. The implications 4 → 6 and 5 → 6 follow from the No-Free-Lunch theorem. The difficult part is to show that 6 → 1. The proof is based on two main claims:

If VCdim(H) = d, then even though H might be infinite, when restricting it to a finite set C ⊂ X, its "effective" size, |HC|, is only O(|C|^d). That is, the size of HC grows polynomially rather than exponentially with |C|. This claim is often referred to as Sauer's lemma, but it has also been stated and proved independently by Shelah and by Perles. The formal statement is given in Section 6.5.1 later.

In Chapter 4 we have shown that finite hypothesis classes enjoy the uniform convergence property. In Section 6.5.2 later we generalize this result and show that uniform convergence holds whenever the hypothesis class has a "small effective size." By "small effective size" we mean classes for which |HC| grows polynomially with |C|.
6.5.1 Sauer’s Lemma and the Growth Function
We defined the notion of shattering, by considering the restriction of H to a finite set of instances. The growth function measures the maximal “effective” size of H on a set of m examples. Formally:
Definition 6.9 (Growth Function). Let H be a hypothesis class. Then the growth function of H, denoted τH : N → N, is defined as

τH(m) = max_{C⊂X : |C|=m} |HC|.

In words, τH(m) is the number of different functions from a set C of size m to {0,1} that can be obtained by restricting H to C.
Obviously, if VCdim(H) = d then for any m ≤ d we have τH(m) = 2^m. In such cases, H induces all possible functions from C to {0,1}. The following beautiful lemma, proposed independently by Sauer, Shelah, and Perles, shows that when m becomes larger than the VC-dimension, the growth function increases polynomially rather than exponentially with m.
Lemma 6.10 (Sauer-Shelah-Perles). Let H be a hypothesis class with VCdim(H) ≤ d < ∞. Then, for all m, τH(m) ≤ Σ_{i=0}^{d} (m choose i). In particular, if m > d + 1 then τH(m) ≤ (em/d)^d.
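The gap between 2^m and the Sauer bound is easy to tabulate; the snippet below (ours) does so for d = 3 and also checks that for thresholds (d = 1) the bound Σ_{i≤1} (m choose i) = m + 1 is met with equality, since a threshold class produces exactly m + 1 labelings on m points.

import math

def sauer_bound(m, d):
    return sum(math.comb(m, i) for i in range(d + 1))

d = 3
print(f"{'m':>4} {'2^m':>14} {'Sauer':>8} {'(em/d)^d':>12}")
for m in [5, 10, 20, 40]:
    print(f"{m:>4} {2 ** m:>14} {sauer_bound(m, d):>8} "
          f"{(math.e * m / d) ** d:>12.1f}")

for m in [1, 2, 5, 10]:                  # thresholds: tau(m) = m + 1 exactly
    assert sauer_bound(m, 1) == m + 1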
Proof of Sauer’s Lemma*
To prove the lemma it suffices to prove the following stronger claim: For any C = {c1,...,cm} we have
∀H, |HC| ≤ |{B ⊆ C : H shatters B}|. (6.3)
The reason why Equation (6.3) is sufficient to prove the lemma is that if VCdim(H) ≤ d then no set whose size is larger than d is shattered by H and therefore
|{B ⊆ C : H shatters B}| ≤ Σ_{i=0}^{d} (m choose i).
When m > d + 1 the right-hand side of the preceding is at most (em/d)d (see Lemma A.5 in Appendix A).
We are left with proving Equation (6.3) and we do it using an inductive argument. For m = 1, no matter what H is, either both sides of Equation (6.3) equal 1 or both sides equal 2 (the empty set is always considered to be shattered by H). Assume Equation (6.3) holds for sets of size k < m and let us prove it for sets of size m. Fix H and C = {c1,...,cm}. Denote C′ = {c2,...,cm} and in addition, define the following two sets:

Y0 = {(y2,...,ym) : (0,y2,...,ym) ∈ HC ∨ (1,y2,...,ym) ∈ HC},

and

Y1 = {(y2,...,ym) : (0,y2,...,ym) ∈ HC ∧ (1,y2,...,ym) ∈ HC}.

It is easy to verify that |HC| = |Y0| + |Y1|. Additionally, since Y0 = HC′, using the induction assumption (applied on H and C′) we have that

|Y0| = |HC′| ≤ |{B ⊆ C′ : H shatters B}| = |{B ⊆ C : c1 ∉ B ∧ H shatters B}|.

Next, define H′ ⊆ H to be

H′ = {h ∈ H : ∃h′ ∈ H s.t. (1 − h′(c1), h′(c2),...,h′(cm)) = (h(c1), h(c2),...,h(cm))},

namely, H′ contains pairs of hypotheses that agree on C′ and differ on c1. Using this definition, it is clear that if H′ shatters a set B ⊆ C′ then it also shatters the set B ∪ {c1} and vice versa. Combining this with the fact that Y1 = H′C′ and using the inductive assumption (now applied on H′ and C′) we obtain that

|Y1| = |H′C′| ≤ |{B ⊆ C′ : H′ shatters B}| = |{B ⊆ C′ : H′ shatters B ∪ {c1}}| = |{B ⊆ C : c1 ∈ B ∧ H′ shatters B}| ≤ |{B ⊆ C : c1 ∈ B ∧ H shatters B}|.

Overall, we have shown that

|HC| = |Y0| + |Y1| ≤ |{B ⊆ C : c1 ∉ B ∧ H shatters B}| + |{B ⊆ C : c1 ∈ B ∧ H shatters B}| = |{B ⊆ C : H shatters B}|,

which concludes our proof.
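The stronger claim (6.3) can be stress-tested exhaustively on a small domain, where a hypothesis class is just a set of binary tuples. The sketch below (our illustration) draws random classes over 4 points and confirms |HC| ≤ |{B ⊆ C : H shatters B}| in every draw.

import itertools
import random

m = 4
C = range(m)

def shatters(H, B):
    # H: a set of m-tuples over {0,1}; B: a tuple of coordinate indices
    patterns = {tuple(h[i] for i in B) for h in H}
    return len(patterns) == 2 ** len(B)

random.seed(0)
all_funcs = list(itertools.product([0, 1], repeat=m))
for _ in range(1000):
    H = set(random.sample(all_funcs, random.randint(1, len(all_funcs))))
    n_shattered = sum(shatters(H, B)
                      for k in range(m + 1)
                      for B in itertools.combinations(C, k))
    assert len(H) <= n_shattered          # Equation (6.3)

print("Equation (6.3) held for 1000 random classes over a 4-point domain")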
6.5.2 Uniform Convergence for Classes of Small Effective Size
In this section we prove that if H has small effective size then it enjoys the uniform convergence property. Formally,
Theorem 6.11. Let H be a class and let τH be its growth function. Then, for every D and every δ ∈ (0,1), with probability of at least 1 − δ over the choice of S ∼ D^m we have

|LD(h) − LS(h)| ≤ (4 + √(log(τH(2m)))) / (δ√(2m)).
Before proving the theorem, let us first conclude the proof of Theorem 6.7.
Proof of Theorem 6.7. It suffices to prove that if the VC-dimension is finite then the uniform convergence property holds. We will prove that
m^UC_H(ε,δ) ≤ 4·(16d/(δε)²)·log(16d/(δε)²) + (16d·log(2e/d))/(δε)².
From Sauer's lemma we have that for m > d, τH(2m) ≤ (2em/d)^d. Combining this with Theorem 6.11 we obtain that with probability of at least 1 − δ,

|LS(h) − LD(h)| ≤ (4 + √(d·log(2em/d))) / (δ√(2m)).

For simplicity assume that √(d·log(2em/d)) ≥ 4; hence,

|LS(h) − LD(h)| ≤ (1/δ)·√(2d·log(2em/d)/m).

To ensure that the preceding is at most ε we need that

m ≥ (2d·log(m))/(δε)² + (2d·log(2e/d))/(δε)².

Standard algebraic manipulations (see Lemma A.2 in Appendix A) show that a sufficient condition for the preceding to hold is that

m ≥ 4·(2d/(δε)²)·log(2d/(δε)²) + (4d·log(2e/d))/(δε)².
Remark 6.4. The upper bound on m^UC_H we derived in the proof of Theorem 6.7 is not the tightest possible. A tighter analysis that yields the bounds given in Theorem 6.8 can be found in Chapter 28.
Proof of Theorem 6.11*
We will start by showing that

E_{S∼D^m}[ sup_{h∈H} |LD(h) − LS(h)| ] ≤ (4 + √(log(τH(2m)))) / √(2m). (6.4)

Since the random variable sup_{h∈H} |LD(h) − LS(h)| is nonnegative, the proof of the theorem follows directly from the preceding using Markov's inequality (see Section B.1).

To bound the left-hand side of Equation (6.4) we first note that for every h ∈ H, we can rewrite LD(h) = E_{S′∼D^m}[LS′(h)], where S′ = z′1,...,z′m is an additional i.i.d. sample. Therefore,

E_{S∼D^m}[ sup_{h∈H} |LD(h) − LS(h)| ] = E_{S∼D^m}[ sup_{h∈H} |E_{S′∼D^m}[LS′(h) − LS(h)]| ].

A generalization of the triangle inequality yields

|E_{S′∼D^m}[LS′(h) − LS(h)]| ≤ E_{S′∼D^m}|LS′(h) − LS(h)|,

and the fact that the supremum of an expectation is smaller than the expectation of the supremum yields

sup_{h∈H} E_{S′∼D^m}|LS′(h) − LS(h)| ≤ E_{S′∼D^m}[ sup_{h∈H} |LS′(h) − LS(h)| ].

Formally, the previous two inequalities follow from Jensen's inequality. Combining all we obtain

E_{S∼D^m}[ sup_{h∈H} |LD(h) − LS(h)| ] ≤ E_{S,S′∼D^m}[ sup_{h∈H} |LS′(h) − LS(h)| ]
 = E_{S,S′∼D^m}[ sup_{h∈H} (1/m)·|Σ_{i=1}^{m} (ℓ(h,z′i) − ℓ(h,zi))| ]. (6.5)

The expectation on the right-hand side is over a choice of two i.i.d. samples S = z1,...,zm and S′ = z′1,...,z′m. Since all of these 2m vectors are chosen i.i.d., nothing will change if we replace the name of the random vector zi with the name of the random vector z′i. If we do it, instead of the term (ℓ(h,z′i) − ℓ(h,zi)) in Equation (6.5) we will have the term −(ℓ(h,z′i) − ℓ(h,zi)). It follows that for every σ ∈ {±1}^m we have that Equation (6.5) equals

E_{S,S′∼D^m}[ sup_{h∈H} (1/m)·|Σ_{i=1}^{m} σi·(ℓ(h,z′i) − ℓ(h,zi))| ].

Since this holds for every σ ∈ {±1}^m, it also holds if we sample each component of σ uniformly at random from the uniform distribution over {±1}, denoted U±. Hence, Equation (6.5) also equals

E_{σ∼U±^m} E_{S,S′∼D^m}[ sup_{h∈H} (1/m)·|Σ_{i=1}^{m} σi·(ℓ(h,z′i) − ℓ(h,zi))| ],

and by the linearity of expectation it also equals

E_{S,S′∼D^m} E_{σ∼U±^m}[ sup_{h∈H} (1/m)·|Σ_{i=1}^{m} σi·(ℓ(h,z′i) − ℓ(h,zi))| ].

Next, fix S and S′, and let C be the instances appearing in S and S′. Then, we can take the supremum only over h ∈ HC. Therefore,

E_{σ∼U±^m}[ sup_{h∈H} (1/m)·|Σ_{i=1}^{m} σi·(ℓ(h,z′i) − ℓ(h,zi))| ] = E_{σ∼U±^m}[ max_{h∈HC} (1/m)·|Σ_{i=1}^{m} σi·(ℓ(h,z′i) − ℓ(h,zi))| ].

Fix some h ∈ HC and denote θh = (1/m)·Σ_{i=1}^{m} σi·(ℓ(h,z′i) − ℓ(h,zi)). Since E[θh] = 0 and θh is an average of independent variables, each of which takes values in [−1,1], we have by Hoeffding's inequality that for every ρ > 0,

P[|θh| > ρ] ≤ 2 exp(−mρ²/2).

Applying the union bound over h ∈ HC, we obtain that for any ρ > 0,

P[ max_{h∈HC} |θh| > ρ ] ≤ 2|HC|·exp(−mρ²/2).

Finally, Lemma A.4 in Appendix A tells us that the preceding implies

E[ max_{h∈HC} |θh| ] ≤ (4 + √(log(|HC|))) / √(2m).

Combining all with the definition of τH, we have shown that

E_{S∼D^m}[ sup_{h∈H} |LD(h) − LS(h)| ] ≤ (4 + √(log(τH(2m)))) / √(2m).

6.6 SUMMARY
The fundamental theorem of learning theory characterizes PAC learnability of classes of binary classifiers using VC-dimension. The VC-dimension of a class is a combinatorial property that denotes the maximal sample size that can be shattered by the class. The fundamental theorem states that a class is PAC learnable if and only if its VC-dimension is finite and specifies the sample complexity required for PAC learning. The theorem also shows that if a problem is at all learnable, then uniform convergence holds and therefore the problem is learnable using the ERM rule.
6.7 BIBLIOGRAPHIC REMARKS
The definition of VC-dimension and its relation to learnability and to uniform convergence is due to the seminal work of Vapnik and Chervonenkis (1971). The relation to the definition of PAC learnability is due to Blumer, Ehrenfeucht, Haussler, and Warmuth (1989).
Several generalizations of the VC-dimension have been proposed. For example, the fat-shattering dimension characterizes learnability of some regression problems (Kearns, Schapire & Sellie 1994; Alon, Ben-David, Cesa-Bianchi & Haussler 1997; Bartlett, Long & Williamson 1994; Anthony & Bartlett 1999), and the Natarajan dimension characterizes learnability of some multiclass learning problems (Natarajan 1989). However, in general, there is no equivalence between learnability and uniform convergence. See (Shalev-Shwartz, Shamir, Srebro & Sridharan 2010; Daniely, Sabato, Ben-David & Shalev-Shwartz 2011).
Sauer's lemma has been proved by Sauer in response to a problem of Erdős (Sauer 1972). Shelah (with Perles) proved it as a useful lemma for Shelah's theory of stable models (Shelah 1972). Gil Kalai tells us (http://gilkalai.wordpress.com/2008/09/28/extremal-combinatorics-iii-some-basic-theorems) that at some later time, Benjy Weiss asked Perles about such a result in the context of ergodic theory, and Perles, who forgot that he had proved it once, proved it again. Vapnik and Chervonenkis proved the lemma in the context of statistical learning theory.
6.8 EXERCISES
6.1 Show the following monotonicity property of VC-dimension: For every two hypothesis classes, if H′ ⊆ H then VCdim(H′) ≤ VCdim(H).
6.2 Given some finite domain set, X , and a number k ≤ |X |, figure out the VC-dimension of each of the following classes (and prove your claims):
1. H^X_{=k} = {h ∈ {0,1}^X : |{x : h(x) = 1}| = k}: that is, the set of all functions that assign the value 1 to exactly k elements of X.
2. H_{at-most-k} = {h ∈ {0,1}^X : |{x : h(x) = 1}| ≤ k or |{x : h(x) = 0}| ≤ k}.
6.3 Let X be the Boolean hypercube {0,1}^n. For a set I ⊆ {1,2,...,n} we define a parity function hI as follows. On a binary vector x = (x1, x2,...,xn) ∈ {0,1}^n,

hI(x) = (Σ_{i∈I} xi) mod 2.
(That is, hI computes parity of bits in I.) What is the VC-dimension of the class of all such parity functions, Hn-parity = {hI : I ⊆ {1,2,...,n}}?
6.4 We proved Sauer's lemma by proving that for every class H of finite VC-dimension d, and every subset A of the domain,

|HA| ≤ |{B ⊆ A : H shatters B}| ≤ Σ_{i=0}^{d} (|A| choose i).
Show that there are cases in which the previous two inequalities are strict (namely, the ≤ can be replaced by <) and cases in which they can be replaced by equalities. Demonstrate all four combinations of = and <.
6.5 VC-dimension of axis aligned rectangles in R^d: Let H^d_rec be the class of axis aligned rectangles in R^d. We have already seen that VCdim(H^2_rec) = 4. Prove that in general, VCdim(H^d_rec) = 2d.
6.6 VC-dimension of Boolean conjunctions: Let H^d_con be the class of Boolean conjunctions over the variables x1,...,xd (d ≥ 2). We already know that this class is finite and thus (agnostic) PAC learnable. In this question we calculate VCdim(H^d_con).
1. Show that |H^d_con| ≤ 3^d + 1.
2. Conclude that VCdim(H^d_con) ≤ d·log2 3.
3. Show that H^d_con shatters the set of unit vectors {ei : i ≤ d}.
4. (**) Show that VCdim(H^d_con) ≤ d.
Hint: Assume by contradiction that there exists a set C = {c1,...,c_{d+1}} that is shattered by H^d_con. Let h1,...,h_{d+1} be hypotheses in H^d_con that satisfy

∀i, j ∈ [d + 1], hi(cj) = 0 if i = j, and 1 otherwise.

For each i ∈ [d + 1], hi (or more accurately, the conjunction that corresponds to hi) contains some literal ℓi which is false on ci and true on cj for each j ≠ i. Use the Pigeonhole principle to show that there must be a pair i < j ≤ d + 1 such that ℓi and ℓj use the same xk and use that fact to derive a contradiction to the requirements from the conjunctions hi, hj.
5. Consider the class H^d_mcon of monotone Boolean conjunctions over {0,1}^d. Monotonicity here means that the conjunctions do not contain negations. As in H^d_con, the empty conjunction is interpreted as the all-positive hypothesis. We augment H^d_mcon with the all-negative hypothesis h−. Show that VCdim(H^d_mcon) = d.
6.7 We have shown that for a finite hypothesis class H, VCdim(H) ≤ ⌊log2(|H|)⌋. However, this is just an upper bound. The VC-dimension of a class can be much lower than that:
1. Find an example of a class H of functions over the real interval X = [0,1] such that H is infinite while VCdim(H) = 1.
2. Give an example of a finite hypothesis class H over the domain X = [0,1], where VCdim(H) = ⌊log2(|H|)⌋.
6.8 (*) It is often the case that the VC-dimension of a hypothesis class equals (or can be bounded above by) the number of parameters one needs to set in order to define each hypothesis in the class. For instance, if H is the class of axis aligned rectangles in R^d, then VCdim(H) = 2d, which is equal to the number of parameters used to define a rectangle in R^d. Here is an example that shows that this is not always the case. We will see that a hypothesis class might be very complex and even not learnable, although it has a small number of parameters.

Consider the domain X = R, and the hypothesis class

H = {x ↦ ⌈sin(θx)⌉ : θ ∈ R}

(here, we take ⌈−1⌉ = 0). Prove that VCdim(H) = ∞.

Hint: There is more than one way to prove the required result. One option is by applying the following lemma: If 0.x1x2x3... is the binary expansion of x ∈ (0,1), then for any natural number m, ⌈sin(2^m πx)⌉ = (1 − xm), provided that ∃k ≥ m s.t. xk = 1.
6.9 Let H be the class of signed intervals, that is, H = {h_{a,b,s} : a ≤ b, s ∈ {−1,1}}, where

h_{a,b,s}(x) = s if x ∈ [a,b], and −s if x ∉ [a,b].

Calculate VCdim(H).
6.10 Let H be a class of functions from X to {0,1}.
1. Prove that if VCdim(H) ≥ d, for any d, then for some probability distribution D over X × {0,1}, for every sample size m and every learning algorithm A,

E_{S∼D^m}[LD(A(S))] ≥ min_{h∈H} LD(h) + (d − m)/(2d).
Hint: Use Exercise 5.3.
2. Prove that for every H that is PAC learnable, VCdim(H) < ∞. (Note that this is the implication 3 → 6 in Theorem 6.7.)
6.11 VC of union: Let H1,...,Hr be hypothesis classes over some fixed domain set X . Let d = maxi VCdim(Hi) and assume for simplicity that d ≥ 3.
1. Prove that

VCdim(∪_{i=1}^{r} Hi) ≤ 4d·log(2d) + 2·log(r).

Hint: Take a set of k examples and assume that they are shattered by the union class. Therefore, the union class can produce all 2^k possible labelings on these examples. Use Sauer's lemma to show that the union class cannot produce more than r·k^d labelings. Therefore, 2^k ≤ r·k^d. Now use Lemma A.2.
2. (*) Prove that for r = 2 it holds that
VCdim(H1 ∪ H2) ≤ 2d + 1.
6.12 Dudley classes: In this question we discuss an algebraic framework for defining concept classes over R^n and show a connection between the VC-dimension of such classes and their algebraic properties. Given a function f : R^n → R we define the corresponding function, POS(f)(x) = 1[f(x)>0]. For a class F of real valued functions we define a corresponding class of functions POS(F) = {POS(f) : f ∈ F}. We say that a family, F, of real valued functions is linearly closed if for all f, g ∈ F and r ∈ R, (f + rg) ∈ F (where addition and scalar multiplication of functions are defined point wise, namely, for all x ∈ R^n, (f + rg)(x) = f(x) + rg(x)). Note that if a family of functions is linearly closed then we can view it as a vector space over the reals. For a function g : R^n → R and a family of functions F, let F + g = {f + g : f ∈ F}. Hypothesis classes that have a representation as POS(F + g) for some vector space of functions F and some function g are called Dudley classes.
1. Show that for every g : R^n → R and every vector space of functions F as defined earlier, VCdim(POS(F + g)) = VCdim(POS(F)).
2. (**) For every linearly closed family of real valued functions F, the VC-dimension of the corresponding class POS(F) equals the linear dimension of F (as a vector space). Hint: Let f1,...,fd be a basis for the vector space F. Consider the mapping x ↦ (f1(x),...,fd(x)) (from R^n to R^d). Note that this mapping induces a matching between functions over R^n of the form POS(f) and homogeneous linear halfspaces in R^d (the VC-dimension of the class of homogeneous linear halfspaces is analyzed in Chapter 9).
3. Show that each of the following classes can be represented as a Dudley class:
1. The class HS_n of halfspaces over R^n (see Chapter 9).
2. The class HHS_n of all homogeneous halfspaces over R^n (see Chapter 9).
3. The class B_d of all functions defined by (open) balls in R^d. Use the Dudley representation to figure out the VC-dimension of this class.
4. Let P^d_n denote the class of functions defined by polynomial inequalities of degree ≤ d, namely,

P^d_n = {h_p : p is a polynomial of degree ≤ d in the variables x1,...,xn},

where for x = (x1,...,xn), h_p(x) = 1[p(x)≥0] (the degree of a multivariable polynomial is the maximal sum of variable exponents over all of its terms; for example, the degree of p(x) = 3x1³x2² + 4x3x7² is 5).
1. Use the Dudley representation to figure out the VC-dimension of the class P^d_1 – the class of all d-degree polynomials over R.
2. Prove that the class of all polynomial classifiers over R has infinite VC-dimension.
3. Use the Dudley representation to figure out the VC-dimension of the class P^d_n (as a function of d and n).
7
Nonuniform Learnability
The notions of PAC learnability discussed so far in the book allow the sample sizes to depend on the accuracy and confidence parameters, but they are uniform with respect to the labeling rule and the underlying data distribution. Consequently, classes that are learnable in that respect are limited (they must have a finite VC-dimension, as stated by Theorem 6.7). In this chapter we consider more relaxed, weaker notions of learnability. We discuss the usefulness of such notions and provide characterization of the concept classes that are learnable using these definitions.
We begin this discussion by defining a notion of "nonuniform learnability" that allows the sample size to depend on the hypothesis to which the learner is compared. We then provide a characterization of nonuniform learnability and show that nonuniform learnability is a strict relaxation of agnostic PAC learnability. We also show that a sufficient condition for nonuniform learnability is that H is a countable union of hypothesis classes, each of which enjoys the uniform convergence property. These results will be proved in Section 7.2 by introducing a new learning paradigm, which is called Structural Risk Minimization (SRM). In Section 7.3 we specify the SRM paradigm for countable hypothesis classes, which yields the Minimum Description Length (MDL) paradigm. The MDL paradigm gives a formal justification to a philosophical principle of induction called Occam's razor. Next, in Section 7.4 we introduce consistency as an even weaker notion of learnability. Finally, we discuss the significance and usefulness of the different notions of learnability.
7.1 NONUNIFORM LEARNABILITY
"Nonuniform learnability" allows the sample size to be nonuniform with respect to the different hypotheses with which the learner is competing. We say that a hypothesis h is (ε,δ)-competitive with another hypothesis h′ if, with probability higher than (1 − δ),

LD(h) ≤ LD(h′) + ε.
In PAC learnability, this notion of "competitiveness" is not very useful, as we are looking for a hypothesis with an absolute low risk (in the realizable case) or with a low risk compared to the minimal risk achieved by hypotheses in our class (in the agnostic case). Therefore, the sample size depends only on the accuracy and confidence parameters. In nonuniform learnability, however, we allow the sample size to be of the form mH(ε,δ,h); namely, it depends also on the h with which we are competing. Formally,
Definition 7.1. A hypothesis class H is nonuniformly learnable if there exist a learning algorithm, A, and a function m^NUL_H : (0,1)² × H → N such that, for every ε,δ ∈ (0,1) and for every h ∈ H, if m ≥ m^NUL_H(ε,δ,h) then for every distribution D, with probability of at least 1 − δ over the choice of S ∼ D^m, it holds that LD(A(S)) ≤ LD(h) + ε.
At this point it might be useful to recall the definition of agnostic PAC learnability (Definition 3.3):

A hypothesis class H is agnostically PAC learnable if there exist a learning algorithm, A, and a function mH : (0,1)² → N such that, for every ε,δ ∈ (0,1) and for every distribution D, if m ≥ mH(ε,δ), then with probability of at least 1 − δ over the choice of S ∼ D^m it holds that

LD(A(S)) ≤ min_{h′∈H} LD(h′) + ε.

Note that this implies that for every h ∈ H

LD(A(S)) ≤ LD(h) + ε.
In both types of learnability, we require that the output hypothesis will be (ε,δ)-competitive with every other hypothesis in the class. But the difference between these two notions of learnability is the question of whether the sample size m may depend on the hypothesis h to which the error of A(S) is compared. Note that nonuniform learnability is a relaxation of agnostic PAC learnability. That is, if a class is agnostic PAC learnable then it is also nonuniformly learnable.
7.1.1 Characterizing Nonuniform Learnability
Our goal now is to characterize nonuniform learnability. In the previous chapter we have found a crisp characterization of PAC learnable classes, by showing that a class of binary classifiers is agnostic PAC learnable if and only if its VC-dimension is finite. In the following theorem we find a different characterization for nonuniform learnable classes for the task of binary classification.
Theorem 7.2. A hypothesis class H of binary classifiers is nonuniformly learnable if and only if it is a countable union of agnostic PAC learnable hypothesis classes.
The proof of Theorem 7.2 relies on the following result of independent interest:
Theorem 7.3. Let H be a hypothesis class that can be written as a countable union of hypothesis classes, H = ∪_{n∈N} Hn, where each Hn enjoys the uniform convergence property. Then, H is nonuniformly learnable.
Recall that in Chapter 4 we have shown that uniform convergence is sufficient for agnostic PAC learnability. Theorem 7.3 generalizes this result to nonuniform learnability. The proof of this theorem will be given in the next section by introducing a new learning paradigm. We now turn to proving Theorem 7.2.
Proof of Theorem 7.2. First assume that H = ∪_{n∈N} Hn where each Hn is agnostic PAC learnable. Using the fundamental theorem of statistical learning, it follows that each Hn has the uniform convergence property. Therefore, using Theorem 7.3 we obtain that H is nonuniformly learnable.

For the other direction, assume that H is nonuniformly learnable using some algorithm A. For every n ∈ N, let Hn = {h ∈ H : m^NUL_H(1/8, 1/7, h) ≤ n}. Clearly, H = ∪_{n∈N} Hn. In addition, using the definition of m^NUL_H we know that for any distribution D that satisfies the realizability assumption with respect to Hn, with probability of at least 6/7 over S ∼ D^n we have that LD(A(S)) ≤ 1/8. Using the fundamental theorem of statistical learning, this implies that the VC-dimension of Hn must be finite, and therefore Hn is agnostic PAC learnable.
The following example shows that nonuniform learnability is a strict relaxation of agnostic PAC learnability; namely, there are hypothesis classes that are nonuniformly learnable but are not agnostic PAC learnable.
Example 7.1. Consider a binary classification problem with the instance domain being X = R. For every n ∈ N let Hn be the class of polynomial classifiers of degree n; namely, Hn is the set of all classifiers of the form h(x) = sign(p(x)) where p : R → R is a polynomial of degree n. Let H = ∪_{n∈N} Hn. Therefore, H is the class of all polynomial classifiers over R. It is easy to verify that VCdim(H) = ∞ while VCdim(Hn) = n + 1 (see Exercise 7.12). Hence, H is not PAC learnable, while on the basis of Theorem 7.3, H is nonuniformly learnable.
7.2 STRUCTURAL RISK MINIMIZATION
So far, we have encoded our prior knowledge by specifying a hypothesis class H, which we believe includes a good predictor for the learning task at hand. Yet another way to express our prior knowledge is by specifying preferences over hypotheses within H. In the Structural Risk Minimization (SRM) paradigm, we do so by first assuming that H can be written as H = ∪_{n∈N} Hn and then specifying a weight function, w : N → [0,1], which assigns a weight to each hypothesis class, Hn, such that a higher weight reflects a stronger preference for the hypothesis class. In this section we discuss how to learn with such prior knowledge. In the next section we describe a couple of important weighting schemes, including Minimum Description Length.
Concretely, let H be a hypothesis class that can be written as H = ∪_{n∈N} Hn. For example, H may be the class of all polynomial classifiers where each Hn is the class of polynomial classifiers of degree n (see Example 7.1). Assume that for each n, the class Hn enjoys the uniform convergence property (see Definition 4.3 in Chapter 4) with a sample complexity function m^UC_{Hn}(ε,δ). Let us also define the function ε_n : N × (0,1) → (0,1) by

ε_n(m,δ) = min{ε ∈ (0,1) : m^UC_{Hn}(ε,δ) ≤ m}. (7.1)
In words, we have a fixed sample size m, and we are interested in the lowest possible upper bound on the gap between empirical and true risks achievable by using a sample of m examples.
From the definitions of uniform convergence and ε_n, it follows that for every m and δ, with probability of at least 1 − δ over the choice of S ∼ D^m we have that

∀h ∈ Hn, |LD(h) − LS(h)| ≤ ε_n(m,δ). (7.2)
Let w : N → [0,1] be a function such that Σ_{n=1}^{∞} w(n) ≤ 1. We refer to w as a weight function over the hypothesis classes H1, H2,.... Such a weight function can reflect the importance that the learner attributes to each hypothesis class, or some measure of the complexity of different hypothesis classes. If H is a finite union of N hypothesis classes, one can simply assign the same weight of 1/N to all hypothesis classes. This equal weighting corresponds to no a priori preference to any hypothesis class. Of course, if one believes (as prior knowledge) that a certain hypothesis class is more likely to contain the correct target function, then it should be assigned a larger weight, reflecting this prior knowledge. When H is a (countable) infinite union of hypothesis classes, a uniform weighting is not possible but many other weighting schemes may work. For example, one can choose w(n) = 6/(π²n²) or w(n) = 2^(−n). Later in this chapter we will provide another convenient way to define weighting functions using description languages.
The SRM rule follows a “bound minimization” approach. This means that the goal of the paradigm is to find a hypothesis that minimizes a certain upper bound on the true risk. The bound that the SRM rule wishes to minimize is given in the following theorem.
Theorem 7.4. Let w : N → [0,1] be a function such that Σ_{n=1}^{∞} w(n) ≤ 1. Let H be a hypothesis class that can be written as H = ∪_{n∈N} Hn, where for each n, Hn satisfies the uniform convergence property with a sample complexity function m^UC_{Hn}. Let ε_n be as defined in Equation (7.1). Then, for every δ ∈ (0,1) and distribution D, with probability of at least 1 − δ over the choice of S ∼ D^m, the following bound holds (simultaneously) for every n ∈ N and h ∈ Hn:

|LD(h) − LS(h)| ≤ ε_n(m, w(n)·δ).

Therefore, for every δ ∈ (0,1) and distribution D, with probability of at least 1 − δ it holds that

∀h ∈ H, LD(h) ≤ LS(h) + min_{n : h∈Hn} ε_n(m, w(n)·δ). (7.3)
Proof. For each n define δn = w(n)δ. Applying the assumption that uniform convergence holds for all n with the rate given in Equation (7.2), we obtain that if we fix n in advance, then with probability of at least 1 − δn over the choice of S ∼ D^m,

∀h ∈ Hn, |LD(h) − LS(h)| ≤ ε_n(m,δn).

Applying the union bound over n = 1,2,..., we obtain that with probability of at least 1 − Σ_n δn = 1 − δ·Σ_n w(n) ≥ 1 − δ, the preceding holds for all n, which concludes our proof.
Denote
n(h) = min{n : h ∈ Hn}, (7.4)
and then Equation (7.3) implies that

LD(h) ≤ LS(h) + ε_{n(h)}(m, w(n(h))·δ).
The SRM paradigm searches for h that minimizes this bound, as formalized in the following pseudocode:
Structural Risk Minimization (SRM)
prior knowledge:
 H = ∪_n Hn where Hn has uniform convergence with m^UC_{Hn}
 w : N → [0,1] where Σ_n w(n) ≤ 1
define: ε_n as in Equation (7.1); n(h) as in Equation (7.4)
input: training set S ∼ D^m, confidence δ
output: h ∈ argmin_{h∈H} [ LS(h) + ε_{n(h)}(m, w(n(h))·δ) ]
Unlike the ERM paradigm discussed in previous chapters, we no longer just care about the empirical risk, LS(h), but we are willing to trade some of our bias toward low empirical risk with a bias toward classes for which ε_{n(h)}(m, w(n(h))·δ) is smaller, for the sake of a smaller estimation error.
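To make the rule concrete, here is a minimal SRM sketch (ours throughout: the classes, the data, and the instantiation of ε_n are assumptions made for the demo). Each Hn is the finite class of thresholds on a grid of resolution n, so Chapter 4's finite-class bound gives ε_n(m,δ) = √(log(2|Hn|/δ)/(2m)), and SRM minimizes empirical risk plus this class-dependent penalty with w(n) = 6/(π²n²).

import math
import random

def eps_n(m, delta, size):
    # finite-class uniform convergence rate (Chapter 4), our instantiation
    return math.sqrt(math.log(2 * size / delta) / (2 * m))

def srm(S, delta=0.1, max_n=50):
    m, best = len(S), (float("inf"), None, None)
    for n in range(1, max_n + 1):
        w = 6 / (math.pi ** 2 * n ** 2)
        pen = eps_n(m, w * delta, n + 1)          # |H_n| = n + 1 grid thresholds
        for k in range(n + 1):
            emp = sum((x < k / n) != y for x, y in S) / m
            best = min(best, (emp + pen, n, k / n))
    return best                                    # (bound value, n(h), threshold)

def draw(m, a=0.37, noise=0.1):                    # noisy threshold data
    return [(x, int((x < a) != (random.random() < noise)))
            for x in (random.random() for _ in range(m))]

random.seed(2)
for m in [50, 500, 5000]:
    bound, n, cut = srm(draw(m))
    print(f"m = {m:4d}: chose resolution n = {n:2d}, cut = {cut:.2f}, "
          f"bound = {bound:.3f}")

With little data the penalty dominates and SRM settles for a coarse grid; as m grows it affords finer classes, trading the two error terms exactly as described in the text.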
Next we show that the SRM paradigm can be used for nonuniform learning of every class that is a countable union of uniformly converging hypothesis classes.

Theorem 7.5. Let H be a hypothesis class such that H = ∪_{n∈N} Hn, where each Hn has the uniform convergence property with sample complexity m^UC_{Hn}. Let w : N → [0,1] be such that w(n) = 6/(n²π²). Then, H is nonuniformly learnable using the SRM rule with rate

m^NUL_H(ε,δ,h) ≤ m^UC_{H_{n(h)}}(ε/2, 6δ/(πn(h))²).
Proof. Let A be the SRM algorithm with respect to the weighting function w. For every h ∈ H, ε, and δ, let m ≥ m^UC_{H_{n(h)}}(ε, w(n(h))δ). Using the fact that Σ_n w(n) = 1, we can apply Theorem 7.4 to get that, with probability of at least 1 − δ over the choice of S ∼ D^m, we have that for every h′ ∈ H,

LD(h′) ≤ LS(h′) + ε_{n(h′)}(m, w(n(h′))δ).

The preceding holds in particular for the hypothesis A(S) returned by the SRM rule. By the definition of SRM we obtain that

LD(A(S)) ≤ min_{h′} [ LS(h′) + ε_{n(h′)}(m, w(n(h′))δ) ] ≤ LS(h) + ε_{n(h)}(m, w(n(h))δ).

Finally, if m ≥ m^UC_{H_{n(h)}}(ε/2, w(n(h))δ) then clearly ε_{n(h)}(m, w(n(h))δ) ≤ ε/2. In addition, from the uniform convergence property of each Hn we have that with probability of more than 1 − δ,

LS(h) ≤ LD(h) + ε/2.

Combining all the preceding we obtain that LD(A(S)) ≤ LD(h) + ε, which concludes our proof.
Note that the previous theorem also proves Theorem 7.3.