Building Machine Learning Systems with Python Second Edition
Table of Contents
Building Machine Learning Systems with Python Second Edition
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Python Machine Learning
Machine learning and Python – a dream team
What the book will teach you (and what it will not)
What to do when you are stuck
Getting started
Introduction to NumPy, SciPy, and matplotlib
Installing Python
Chewing data efficiently with NumPy and intelligently with SciPy
Learning NumPy
Indexing
Handling nonexisting values
Comparing the runtime
Learning SciPy
Our first (tiny) application of machine learning
Reading in the data
Preprocessing and cleaning the data
Choosing the right model and learning algorithm
Before building our first model…
Starting with a simple straight line
Towards some advanced stuff
Stepping back to go forward – another look at our data
Training and testing
Answering our initial question
Summary
2. Classifying with Real-world Examples
The Iris dataset
Visualization is a good first step
Building our first classification model
Evaluation – holding out data and cross-validation
Building more complex classifiers
A more complex dataset and a more complex classifier
Learning about the Seeds dataset
Features and feature engineering
Nearest neighbor classification
Classifying with scikit-learn
Looking at the decision boundaries
Binary and multiclass classification
Summary
3. Clustering – Finding Related Posts
Measuring the relatedness of posts
How not to do it
How to do it
Preprocessing – similarity measured as a similar number of common words
Converting raw text into a bag of words
Counting words
Normalizing word count vectors
Removing less important words
Stemming
Installing and using NLTK
Extending the vectorizer with NLTK's stemmer
Stop words on steroids
Our achievements and goals
Clustering
K-means
Getting test data to evaluate our ideas on
Clustering posts
Solving our initial challenge
Another look at noise
Tweaking the parameters
Summary
4. Topic Modeling
Latent Dirichlet allocation
Building a topic model
Comparing documents by topics
Modeling the whole of Wikipedia
Choosing the number of topics
Summary
5. Classification – Detecting Poor Answers
Sketching our roadmap
Learning to classify classy answers
Tuning the instance
Tuning the classifier
Fetching the data
Slimming the data down to chewable chunks
Preselection and processing of attributes
Defining what is a good answer
Creating our first classifier
Starting with kNN
Engineering the features
Training the classifier
Measuring the classifier's performance
Designing more features
Deciding how to improve
Bias-variance and their tradeoff
Fixing high bias
Fixing high variance
High bias or low bias
Using logistic regression
A bit of math with a small example
Applying logistic regression to our post classification problem
Looking behind accuracy – precision and recall
Slimming the classifier
Ship it!
Summary
6. Classification II – Sentiment Analysis
Sketching our roadmap
Fetching the Twitter data
Introducing the Naïve Bayes classifier
Getting to know the Bayes' theorem
Being naïve
Using Naïve Bayes to classify
Accounting for unseen words and other oddities
Accounting for arithmetic underflows
Creating our first classifier and tuning it
Solving an easy problem first
Using all classes
Tuning the classifier's parameters
Cleaning tweets
Taking the word types into account
Determining the word types
Successfully cheating using SentiWordNet
Our first estimator
Putting everything together
Summary
7. Regression
Predicting house prices with regression
Multidimensional regression
Cross-validation for regression
Penalized or regularized regression
L1 and L2 penalties
Using Lasso or ElasticNet in scikit-learn
Visualizing the Lasso path
P-greater-than-N scenarios
An example based on text documents
Setting hyperparameters in a principled way
Summary
8. Recommendations
Rating predictions and recommendations
Splitting into training and testing
Normalizing the training data
A neighborhood approach to recommendations
A regression approach to recommendations
Combining multiple methods
Basket analysis
Obtaining useful predictions
Analyzing supermarket shopping baskets
Association rule mining
More advanced basket analysis
Summary
9. Classification – Music Genre Classification
Sketching our roadmap
Fetching the music data
Converting into a WAV format
Looking at music
Decomposing music into sine wave components
Using FFT to build our first classifier
Increasing experimentation agility
Training the classifier
Using a confusion matrix to measure accuracy in multiclass problems
An alternative way to measure classifier performance using receiver-operator characteristics
Improving classification performance with Mel Frequency Cepstral Coefficients
Summary
10. Computer Vision
Introducing image processing
Loading and displaying images
Thresholding
Gaussian blurring
Putting the center in focus
Basic image classification
Computing features from images
Writing your own features
Using features to find similar images
Classifying a harder dataset
Local feature representations
Summary
11. Dimensionality Reduction
Sketching our roadmap
Selecting features
Detecting redundant features using filters
Correlation
Mutual information
Asking the model about the features using wrappers
Other feature selection methods
Feature extraction
About principal component analysis
Sketching PCA
Applying PCA
Limitations of PCA and how LDA can help
Multidimensional scaling
Summary
12. Bigger Data
Learning about big data
Using jug to break up your pipeline into tasks
An introduction to tasks in jug
Looking under the hood
Using jug for data analysis
Reusing partial results
Using Amazon Web Services
Creating your first virtual machines
Installing Python packages on Amazon Linux
Running jug on our cloud machine
Automating the generation of clusters with StarCluster
Summary
A. Where to Learn More Machine Learning
Online courses
Books
Question and answer sites
Blogs
Data sources
Getting competitive
All that was left out
Summary
Index
Building Machine Learning Systems with Python Second Edition
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2013
Second edition: March 2015
Production reference: 1230315
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-277-2
www.packtpub.com
Credits
Authors
Luis Pedro Coelho
Willi Richert
Reviewers
Matthieu Brucher
Maurice HT Ling
Radim Řehůřek
Commissioning Editor
Kartikey Pandey
Acquisition Editors
Greg Wild
Richard Harvey
Kartikey Pandey
Content Development Editor
Arun Nadar
Technical Editor
Pankaj Kadam
Copy Editors
Relin Hedly
Sameen Siddiqui
Laxmi Subramanian
Project Coordinator
Nikhil Nair
Proofreaders
Simran Bhogal
Lawrence A. Herman
Linda Morris
Paul Hindle
Indexer
Hemangini Bari
Graphics
Sheetal Aute
Abhinash Sahu
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta
About the Authors
Luis Pedro Coelho is a computational biologist: someone who uses computers as a tool to understand biological systems. In particular, Luis analyzes DNA from microbial communities to characterize their behavior. Luis has also worked extensively in bioimage informatics – the application of machine learning techniques for the analysis of images of biological specimens. His main focus is on the processing and integration of large-scale datasets.
Luis has a PhD from Carnegie Mellon University, one of the leading universities in the world in the area of machine learning. He is the author of several scientific publications.
Luis started developing open source software in 1998 as a way to apply real code to what he was learning in his computer science courses at the Technical University of Lisbon. In 2004, he started developing in Python and has contributed to several open source libraries
in this language. He is the lead developer of mahotas, a popular computer vision package for Python, as well as a contributor to several machine learning codebases.
Luis currently divides his time between Luxembourg and Heidelberg.
I thank my wife, Rita, for all her love and support and my daughter, Anna, for being the best thing ever.
Willi Richert has a PhD in machine learning/robotics, where he used reinforcement learning, hidden Markov models, and Bayesian networks to let heterogeneous robots learn by imitation. Currently, he works for Microsoft in the Core Relevance Team of Bing, where he is involved in a variety of ML areas such as active learning, statistical machine translation, and growing decision trees.
This book would not have been possible without the support of my wife, Natalie, and my sons, Linus and Moritz. I am especially grateful for the many fruitful discussions with my current or previous managers, Andreas Bode, Clemens Marschner, Hongyan Zhou, and Eric Crestan, as well as my colleagues and friends, Tomasz Marciniak, Cristian Eigel, Oliver Niehoerster, and Philipp Adelt. The interesting ideas are most likely from them; the bugs belong to me.
About the Reviewers
Matthieu Brucher holds an engineering degree from the Ecole Supérieure d'Electricité (Information, Signals, Measures), France and has a PhD in unsupervised manifold learning from the Université de Strasbourg, France. He currently holds an HPC software developer position in an oil company and is working on the next-generation reservoir simulation.
Maurice HT Ling has been programming in Python since 2003. Having completed his PhD in Bioinformatics and BSc (Hons.) in Molecular and Cell Biology from The University of Melbourne, he is currently a Research Fellow at Nanyang Technological University, Singapore, and an Honorary Fellow at The University of Melbourne, Australia. Maurice is the Chief Editor for Computational and Mathematical Biology, and co-editor for The Python Papers. Recently, Maurice cofounded the first synthetic biology start-up in Singapore, AdvanceSyn Pte. Ltd., as the Director and Chief Technology Officer. His research interests lie in life – biological life, artificial life, and artificial intelligence – using computer science and statistics as tools to understand life and its numerous aspects. In his free time, Maurice likes to read, enjoy a cup of coffee, write his personal journal, or philosophize on various aspects of life. His website and LinkedIn profile are http://maurice.vodien.com and http://www.linkedin.com/in/mauriceling, respectively.
Radim Řehůřek is a tech geek and developer at heart. He founded and led the research department at Seznam.cz, a major search engine company in central Europe. After finishing his PhD, he decided to move on and spread the machine learning love, starting his own privately owned R&D company, RaRe Consulting Ltd. RaRe specializes in made-to-measure data mining solutions, delivering cutting-edge systems for clients ranging from large multinationals to nascent start-ups.
Radim is also the author of a number of popular open source projects, including gensim and smart_open.
A big fan of experiencing different cultures, Radim has lived around the globe with his wife for the past decade, with his next steps leading to South Korea. No matter where he stays, Radim and his team always try to evangelize data-driven solutions and help companies worldwide make the most of their machine learning opportunities.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packtâs online digital book library. Here, you can search, access, and read Packtâs entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface
One could argue that it is a fortunate coincidence that you are holding this book in your hands (or have it on your eBook reader). After all, there are millions of books printed every year, which are read by millions of readers. And then there is this book read by you. One could also argue that a couple of machine learning algorithms played their role in leading you to this book – or this book to you. And we, the authors, are happy that you want to understand more about the hows and whys.
Most of the book will cover the how. How does data have to be processed so that machine learning algorithms can make the most of it? How should one choose the right algorithm for a problem at hand?
Occasionally, we will also cover the why. Why is it important to measure correctly? Why does one algorithm outperform another one in a given scenario?
We know that there is much more to learn to be an expert in the field. After all, we only covered some hows and just a tiny fraction of the whys. But in the end, we hope that this mixture will help you to get up and running as quickly as possible.
What this book covers
Chapter 1, Getting Started with Python Machine Learning, introduces the basic idea of machine learning with a very simple example. Despite its simplicity, it will challenge us with the risk of overfitting.
Chapter 2, Classifying with Real-world Examples, uses real data to learn about classification, whereby we train a computer to be able to distinguish different classes of flowers.
Chapter 3, Clustering – Finding Related Posts, teaches how powerful the bag of words approach is, when we apply it to finding similar posts without really "understanding" them.
Chapter 4, Topic Modeling, moves beyond assigning each post to a single cluster and assigns them to several topics as a real text can deal with multiple topics.
Chapter 5, Classification – Detecting Poor Answers, teaches how to use the bias-variance trade-off to debug machine learning models, though this chapter is mainly about using logistic regression to find whether a user's answer to a question is good or bad.
Chapter 6, Classification II – Sentiment Analysis, explains how Naïve Bayes works, and how to use it to classify tweets to see whether they are positive or negative.
Chapter 7, Regression, explains how to use the classical topic, regression, in handling data, which is still relevant today. You will also learn about advanced regression techniques such as the Lasso and ElasticNets.
Chapter 8, Recommendations, builds recommendation systems based on customer product ratings. We will also see how to build recommendations just from shopping data without the need for ratings data (which users do not always provide).
Chapter 9, Classification – Music Genre Classification, makes us pretend that someone has scrambled our huge music collection, and our only hope to create order is to let a machine learner classify our songs. It will turn out that it is sometimes better to trust someone else's expertise than creating features ourselves.
Chapter 10, Computer Vision, teaches how to apply classification in the specific context of handling images by extracting features from data. We will also see how these methods can be adapted to find similar images in a collection.
Chapter 11, Dimensionality Reduction, teaches us what other methods exist that can help us in downsizing data so that it is chewable by our machine learning algorithms.
Chapter 12, Bigger Data, explores some approaches to deal with larger data by taking advantage of multiple cores or computing clusters. We also have an introduction to using cloud computing (using Amazon Web Services as our cloud provider).
Appendix, Where to Learn More Machine Learning, lists many wonderful resources available to learn more about machine learning.
What you need for this book
This book assumes you know Python and how to install a library using easy_install or pip. We do not rely on any advanced mathematics such as calculus or matrix algebra.
We are using the following versions throughout the book, but you should be fine with any more recent ones:
Python 2.7 (all the code is compatible with version 3.3 and 3.4 as well)
NumPy 1.8.1
SciPy 0.13
scikit-learn 0.14.0
Who this book is for
This book is for Python programmers who want to learn how to perform machine learning using open source libraries. We will walk through the basic modes of machine learning based on realistic examples.
This book is also for machine learners who want to start using Python to build their systems. Python is a flexible language for rapid prototyping, while the underlying algorithms are all written in optimized C or C++. Thus the resulting code is fast and robust enough to be used in production as well.
Conventions
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We then use poly1d() to create a model function from the model parameters."
A block of code is set as follows:
[aws info]
AWS_ACCESS_KEY_ID = AAKIIT7HHF6IUSN3OCAA
AWS_SECRET_ACCESS_KEY =
Any command-line input or output is written as follows:
>>> import numpy
>>> numpy.version.full_version
1.8.1
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Once the machine is stopped, the Change instance type option becomes available."
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book – what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to , and mention the book title via the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
The code for this book is also available on GitHub at
https://github.com/luispedro/BuildingMachineLearningSystemsWithPython. This repository is kept up-to-date so that it will incorporate both errata and any necessary updates for newer versions of Python or of the packages we use in the book.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books – maybe a mistake in the text or the code – we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Another excellent way would be to visit www.TwoToReal.com where the authors try to provide support and answer all your questions.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
Questions
You can contact us at if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. Getting Started with Python Machine Learning
Machine learning teaches machines to learn to carry out tasks by themselves. It is that simple. The complexity comes with the details, and that is most likely the reason you are reading this book.
Maybe you have too much data and too little insight. You hope that using machine learning algorithms you can solve this challenge, so you started digging into the algorithms. But after some time you were puzzled: Which of the myriad of algorithms should you actually choose?
Alternatively, maybe you are in general interested in machine learning and for some time you have been reading blogs and articles about it. Everything seemed to be magic and cool, so you started your exploration and fed some toy data into a decision tree or a support vector machine. However, after you successfully applied it to some other data, you wondered: Was the whole setting right? Did you get the optimal results? And how do you know whether there are no better algorithms? Or whether your data was the right one?
Welcome to the club! Both of us (authors) were at those stages looking for information that tells the stories behind the theoretical textbooks about machine learning. It turned out that much of that information was "black art", not usually taught in standard textbooks. So in a sense, we wrote this book to our younger selves. A book that not only gives a quick introduction into machine learning, but also teaches lessons we learned along the way. We hope that it will also give you a smoother entry to one of the most exciting fields in Computer Science.
Machine learning and Python â a dream team
The goal of machine learning is to teach machines (software) to carry out tasks by providing them a couple of examples (how to do or not do the task). Let's assume that each morning when you turn on your computer, you do the same task of moving e-mails around so that only e-mails belonging to the same topic end up in the same folder. After some time, you might feel bored and think of automating this chore. One way would be to start analyzing your brain and write down all the rules your brain processes while you are shuffling your e-mails. However, this will be quite cumbersome and always imperfect. While you will miss some rules, you will over-specify others. A better and more future-proof way would be to automate this process by choosing a set of e-mail meta info and body/folder name pairs and let an algorithm come up with the best rule set. The pairs would be your training data, and the resulting rule set (also called model) could then be applied to future e-mails that we have not yet seen. This is machine learning in its simplest form.
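To make this concrete, here is a minimal, purely illustrative sketch using scikit-learn (which we will meet properly in later chapters). The e-mail subjects and folder names are made up for this example; the point is only to show the pattern of training data in, model out, prediction on unseen data:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# our "training data": e-mail subjects paired with the folder we filed them in
subjects = ["meeting agenda for Monday", "cheap pills online",
            "project budget update", "win a free prize now"]
folders = ["work", "spam", "work", "spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(subjects)    # turn the text into word-count features
model = MultinomialNB().fit(X, folders)   # the learned "rule set"

# apply the model to an e-mail it has never seen; it prints the predicted folder
print(model.predict(vectorizer.transform(["budget meeting moved to Tuesday"])))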
Of course, machine learning (often also referred to as Data Mining or Predictive Analysis) is not a brand new field in itself. Quite the contrary, its success over the recent years can be attributed to the pragmatic way of using rock-solid techniques and insights from other successful fields like statistics. There the purpose is for us humans to get insights into the data, for example, by learning more about the underlying patterns and relationships. As you read more and more about successful applications of machine learning (you have checked out www.kaggle.com already, haven't you?), you will see that applied statistics is a common field among machine learning experts.
As you will see later, the process of coming up with a decent ML approach is never a waterfall-like process. Instead, you will see yourself going back and forth in your analysis, trying out different versions of your input data on diverse sets of ML algorithms. It is this explorative nature that lends itself perfectly to Python. Being an interpreted high-level programming language, it seems that Python has been designed exactly for this process of trying out different things. What is more, it does so quickly. Sure, it is slower than C or similar statically typed programming languages. Nevertheless, with the myriad of easy-to-use libraries that are often written in C, you don't have to sacrifice speed for agility.
What the book will teach you (and what it will not)
This book will give you a broad overview of what types of learning algorithms are currently most used in the diverse fields of machine learning, and what to watch out for when applying them. From our own experience, however, we know that doing the "cool" stuff, that is, using and tweaking machine learning algorithms such as support vector machines, nearest neighbor search, or ensembles thereof, will only consume a tiny fraction of the overall time of a good machine learning expert. Looking at the following typical workflow, we see that most of the time will be spent in rather mundane tasks:
Reading in the data and cleaning it
Exploring and understanding the input data
Analyzing how best to present the data to the learning algorithm
Choosing the right model and learning algorithm
Measuring the performance correctly
When talking about exploring and understanding the input data, we will need a bit of statistics and basic math. However, while doing that, you will see that those topics that seemed to be so dry in your math class can actually be really exciting when you use them to look at interesting data.
The journey starts when you read in the data. When you have to answer questions such as how to handle invalid or missing values, you will see that this is more an art than a precise science. And a very rewarding one, as doing this part right will open your data to more machine learning algorithms and thus increase the likelihood of success.
With the data being ready in your programâs data structures, you will want to get a real feeling of what animal you are working with. Do you have enough data to answer your questions? If not, you might want to think about additional ways to get more of it. Do you even have too much data? Then you probably want to think about how best to extract a sample of it.
Often you will not feed the data directly into your machine learning algorithm. Instead you will find that you can refine parts of the data before training. Many times the machine learning algorithm will reward you with increased performance. You will even find that a simple algorithm with refined data generally outperforms a very sophisticated algorithm with raw data. This part of the machine learning workflow is called feature engineering, and is most of the time a very exciting and rewarding challenge. You will immediately see the results of being creative and intelligent.
Choosing the right learning algorithm, then, is not simply a shootout of the three or four that are in your toolbox (there will be more, as you will see). It is more a thoughtful process of weighing different performance and functional requirements. Do you need a fast result and are willing to sacrifice quality? Or would you rather spend more time to get the best possible result? Do you have a clear idea of the future data or should you be a bit more
conservative on that side?
Finally, measuring the performance is the part where most mistakes are waiting for the aspiring machine learner. There are easy ones, such as testing your approach with the same data on which you have trained. But there are more difficult ones, when you have imbalanced training data. Again, data is the part that determines whether your undertaking will fail or succeed.
We see that only the fourth point is dealing with the fancy algorithms. Nevertheless, we hope that this book will convince you that the other four tasks are not simply chores, but can be equally exciting. Our hope is that by the end of the book, you will have truly fallen in love with data instead of learning algorithms.
To that end, we will not overwhelm you with the theoretical aspects of the diverse ML algorithms, as there are already excellent books in that area (you will find pointers in the Appendix). Instead, we will try to provide an intuition of the underlying approaches in the individual chapters – just enough for you to get the idea and be able to undertake your first steps. Hence, this book is by no means the definitive guide to machine learning. It is more of a starter kit. We hope that it ignites your curiosity enough to keep you eager in trying to learn more and more about this interesting field.
In the rest of this chapter, we will set up and get to know the basic Python libraries NumPy and SciPy and then train our first machine learning model using scikit-learn. During that endeavor, we will introduce basic ML concepts that will be used throughout the book. The rest of the chapters will then go into more detail through the five steps described earlier, highlighting different aspects of machine learning in Python using diverse application scenarios.
What to do when you are stuck
We try to convey every idea necessary to reproduce the steps throughout this book. Nevertheless, there will be situations where you are stuck. The reasons might range from simple typos to odd combinations of package versions to problems in understanding.
In this situation, there are many different ways to get help. Most likely, your problem will already be raised and solved in the following excellent Q&A sites:
http://metaoptimize.com/qa: This Q&A site is laser-focused on machine learning topics. For almost every question, it contains above average answers from machine learning experts. Even if you don't have any questions, it is a good habit to check it out every now and then and read through some of the answers.
http://stats.stackexchange.com: This Q&A site is named Cross Validated, similar to MetaOptimize, but is focused more on statistical problems.
http://stackoverflow.com: This Q&A site is much like the previous ones, but with broader focus on general programming topics. It contains, for example, more questions on some of the packages that we will use in this book, such as SciPy or matplotlib.
#machinelearning on https://freenode.net/: This is the IRC channel focused on machine learning topics. It is a small but very active and helpful community of machine learning experts.
http://www.TwoToReal.com: This is the instant Q&A site written by the authors to support you in topics that don't fit in any of the preceding buckets. If you post your question, one of the authors will get an instant message if he is online and be drawn into a chat with you.
As stated in the beginning, this book tries to help you get started quickly on your machine learning journey. Therefore, we highly encourage you to build up your own list of machine learning related blogs and check them out regularly. This is the best way to get to know what works and what doesnât.
The only blog we want to highlight right here (more in the Appendix) is http://blog.kaggle.com, the blog of the Kaggle company, which is carrying out machine learning competitions. Typically, they encourage the winners of the competitions to write down how they approached the competition, what strategies did not work, and how they arrived at the winning strategy. Even if you don't read anything else, this is a must.
Getting started
Assuming that you have Python already installed (everything at least as recent as 2.7 should be fine), we need to install NumPy and SciPy for numerical operations, as well as matplotlib for visualization.
Introduction to NumPy, SciPy, and matplotlib
Before we can talk about concrete machine learning algorithms, we have to talk about how best to store the data we will chew through. This is important as the most advanced learning algorithm will not be of any help to us if it will never finish. This may be simply because accessing the data is too slow. Or maybe its representation forces the operating system to swap all day. Add to this that Python is an interpreted language (a highly optimized one, though) that is slow for many numerically heavy algorithms compared to C or FORTRAN. So we might ask why on earth so many scientists and companies are betting their fortune on Python even in highly computation-intensive areas?
The answer is that, in Python, it is very easy to off-load number crunching tasks to the lower layer in the form of C or FORTRAN extensions. And that is exactly what NumPy and SciPy do (http://scipy.org/Download). In this tandem, NumPy provides the support of highly optimized multidimensional arrays, which are the basic data structure of most state-of-the-art algorithms. SciPy uses those arrays to provide a set of fast numerical recipes. Finally, matplotlib (http://matplotlib.org/) is probably the most convenient and feature-rich library to plot high-quality graphs using Python.
Installing Python
Luckily, for all major operating systems, that is, Windows, Mac, and Linux, there are targeted installers for NumPy, SciPy, and matplotlib. If you are unsure about the installation process, you might want to install the Anaconda Python distribution (which you can access at https://store.continuum.io/cshop/anaconda/), which is driven by Travis Oliphant, a founding contributor of SciPy. What sets Anaconda apart from other distributions such as Enthought Canopy (which you can download from https://www.enthought.com/downloads/) or Python(x,y) (accessible at http://code.google.com/p/pythonxy/wiki/Downloads), is that Anaconda is already fully Python 3 compatible – the Python version we will be using throughout the book.
Chewing data efficiently with NumPy and intelligently with SciPy
Let's walk quickly through some basic NumPy examples and then take a look at what SciPy provides on top of it. On the way, we will get our feet wet with plotting using the marvelous matplotlib package.
For an in-depth explanation, you might want to take a look at some of the more interesting examples of what NumPy has to offer at http://www.scipy.org/Tentative_NumPy_Tutorial.
You will also find the NumPy Beginner's Guide - Second Edition, Ivan Idris, by Packt Publishing, to be very valuable. Additional tutorial-style guides can be found at http://scipy-lectures.github.com, and the official SciPy tutorial at
http://docs.scipy.org/doc/scipy/reference/tutorial.
Note
In this book, we will use NumPy in version 1.8.1 and SciPy in version 0.14.0.
Learning NumPy
So let's import NumPy and play a bit with it. For that, we need to start the Python interactive shell:
>>> import numpy
>>> numpy.version.full_version
1.8.1
As we do not want to pollute our namespace, we certainly should not use the following code:
>>> from numpy import *
Because, for instance, numpy.array will potentially shadow the array package that is included in standard Python. Instead, we will use the following convenient shortcut:
>>> import numpy as np
>>> a = np.array([0,1,2,3,4,5])
>>> a
array([0, 1, 2, 3, 4, 5])
>>> a.ndim
1
>>> a.shape
(6,)
So, we just created an array like we would create a list in Python. However, the NumPy arrays have additional information about the shape. In this case, it is a one-dimensional array of six elements. No surprise so far.
We can now transform this array to a two-dimensional matrix:
>>> b = a.reshape((3,2))
>>> b
array([[0, 1],
[2, 3],
[4, 5]])
>>> b.ndim
2
>>> b.shape
(3, 2)
The funny thing starts when we realize just how much the NumPy package is optimized. For example, it avoids copies wherever possible:
>>> b[1][0] = 77
>>> b
array([[ 0, 1],
[77, 3],
[ 4, 5]])
>>> a
array([ 0, 1, 77, 3, 4, 5])
In this case, we have modified value 2 to 77 in b, and immediately see the same change reflected in a as well. Keep in mind that whenever you need a true copy, you can always
perform:
>>> c = a.reshape((3,2)).copy()
>>> c
array([[ 0, 1],
[77, 3],
[ 4, 5]])
>>> c[0][0] = -99
>>> a
array([ 0, 1, 77, 3, 4, 5])
>>> c
array([[-99, 1],
[ 77, 3],
[ 4, 5]])
Note that here, c and a are totally independent copies.
Another big advantage of NumPy arrays is that the operations are propagated to the individual elements. For example, multiplying a NumPy array will result in an array of the same size with all of its elements being multiplied:
>>> d = np.array([1,2,3,4,5])
>>> d*2
array([ 2, 4, 6, 8, 10])
Similarly, for other operations:
>>> d**2
array([ 1, 4, 9, 16, 25])
Contrast that to ordinary Python lists:
>>> [1,2,3,4,5]*2
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> [1,2,3,4,5]**2
Traceback (most recent call last):
File "", line 1, in
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
Of course, by using NumPy arrays, we sacrifice the agility Python lists offer. Simple operations such as adding or removing elements are a bit more complex for NumPy arrays. Luckily, we have both at our hands and we will use the right one for the task at hand.
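As a small sketch of what we mean (not an example from the book's text), appending to a Python list happens in place, whereas NumPy has to build a new array for every such change:
lst = [1, 2, 3]
lst.append(4)             # cheap, in-place modification of the list

arr = np.array([1, 2, 3])
arr = np.append(arr, 4)   # returns a brand new array each time
arr = np.delete(arr, 0)   # the same holds for removing elements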
Indexing
Part of the power of NumPy comes from the versatile ways in which its arrays can be accessed.
In addition to normal list indexing, it allows you to use arrays themselves as indices by performing:
>>> a[np.array([2,3,4])]
array([77, 3, 4])
Together with the fact that conditions are also propagated to individual elements, we gain a very convenient way to access our data:
>>> a>4
array([False, False, True, False, False, True], dtype=bool)
>>> a[a>4]
array([77, 5])
By performing the following command, we can use this to trim outliers:
>>> a[a>4] = 4
>>> a
array([0, 1, 4, 3, 4, 4])
As this is a frequent use case, there is the special clip function for it, clipping the values at both ends of an interval with one function call:
>>> a.clip(0,4)
array([0, 1, 4, 3, 4, 4])
Handling nonexisting values
The power of NumPyâs indexing capabilities comes in handy when preprocessing data that we have just read in from a text file. Most likely, that will contain invalid values that we will mark as not being a real number using numpy.NAN:
>>> c = np.array([1, 2, np.NAN, 3, 4]) # let's pretend we have read this from a text file
>>> c
array([ 1., 2., nan, 3., 4.])
>>> np.isnan(c)
array([False, False, True, False, False], dtype=bool)
>>> c[~np.isnan(c)]
array([ 1., 2., 3., 4.])
>>> np.mean(c[~np.isnan(c)])
2.5
Comparing the runtime
Let's compare the runtime behavior of NumPy with that of normal Python lists. In the following code, we will calculate the sum of all squared numbers from 1 to 1000 and see how much time it will take. We perform it 10,000 times and report the total time so that our measurement is accurate enough.
import timeit
normal_py_sec = timeit.timeit('sum(x*x for x in range(1000))', number=10000)
naive_np_sec = timeit.timeit(
'sum(na*na)',
setup="import numpy as np; na=np.arange(1000)",
number=10000)
good_np_sec = timeit.timeit(
'na.dot(na)',
setup="import numpy as np; na=np.arange(1000)",
number=10000)
print("Normal Python: %f sec" % normal_py_sec)
print("Naive NumPy: %f sec" % naive_np_sec)
print("Good NumPy: %f sec" % good_np_sec)
Normal Python: 1.050749 sec
Naive NumPy: 3.962259 sec
Good NumPy: 0.040481 sec
We make two interesting observations. Firstly, just using NumPy as data storage (Naive NumPy) takes 3.5 times longer, which is surprising since we believe it must be much faster as it is written as a C extension. One reason for this is that the access of individual elements from Python itself is rather costly. Only when we are able to apply algorithms inside the optimized extension code do we get speed improvements. The other observation is quite a tremendous one: using the dot() function of NumPy, which does exactly the same, allows us to be more than 25 times faster. In summary, in every algorithm we are about to implement, we should always look at how we can move loops over individual elements from Python to some of the highly optimized NumPy or SciPy extension functions.
However, the speed comes at a price. Using NumPy arrays, we no longer have the incredible flexibility of Python lists, which can hold basically anything. NumPy arrays always have only one data type.
>>> a = np.array([1,2,3])
>>> a.dtype
dtype('int64')
If we try to use elements of different types, such as the ones shown in the following code, NumPy will do its best to coerce them to be the most reasonable common data type:
>>> np.array([1, "stringy"])
array(['1', 'stringy'], dtype='<U21')
>>> np.array([1, "stringy", set([1,2,3])])
array([1, stringy, {1, 2, 3}], dtype=object)
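If you do not want to rely on this coercion, you can also fix the element type yourself when creating the array. A tiny sketch (ours, not from the book's text; the exact dtype name printed may depend on your platform):
a = np.array([1, 2, 3], dtype=float)  # force floating point storage
print(a)                              # [ 1.  2.  3.]
print(a.dtype)                        # float64 on most platforms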
Learning SciPy
On top of the efficient data structures of NumPy, SciPy offers a multitude of algorithms working on those arrays. Whatever numerically heavy algorithm you take from current books on numerical recipes, most likely you will find support for it in SciPy in one way or the other. Whether it is matrix manipulation, linear algebra, optimization, clustering, spatial operations, or even fast Fourier transformation, the toolbox is readily filled. Therefore, it is a good habit to always inspect the scipy module before you start implementing a numerical algorithm.
For convenience, the complete namespace of NumPy is also accessible via SciPy. So, from now on, we will use NumPyâs machinery via the SciPy namespace. You can check this easily comparing the function references of any base function, such as:
>>> import scipy, numpy
>>> scipy.version.full_version
0.14.0
>>> scipy.dot is numpy.dot
True
The diverse algorithms are grouped into the following toolboxes:
SciPy packages   Functionalities
cluster          Hierarchical clustering (cluster.hierarchy)
                 Vector quantization / k-means (cluster.vq)
constants        Physical and mathematical constants
                 Conversion methods
fftpack          Discrete Fourier transform algorithms
integrate        Integration routines
interpolate      Interpolation (linear, cubic, and so on)
io               Data input and output
linalg           Linear algebra routines using the optimized BLAS and LAPACK libraries
ndimage          n-dimensional image package
odr              Orthogonal distance regression
optimize         Optimization (finding minima and roots)
signal           Signal processing
sparse           Sparse matrices
spatial          Spatial data structures and algorithms
special          Special mathematical functions such as Bessel or Jacobian
stats            Statistics toolkit
The toolboxes most interesting to our endeavor are scipy.stats, scipy.interpolate, scipy.cluster, and scipy.signal. For the sake of brevity, we will briefly explore some features of the stats package and leave the others to be explained when they show up in the individual chapters.
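As a tiny, hedged taste of that stats package (this snippet is ours, not from the book's text), here is how one could summarize a random sample with a few of its functions:
import scipy as sp
from scipy import stats

sample = sp.random.randn(1000)               # 1,000 draws from a standard normal
print(stats.describe(sample))                # count, min/max, mean, variance, ...
print(stats.scoreatpercentile(sample, 95))   # value below which 95% of the sample lies
print(stats.norm.cdf(1.96))                  # probability of a standard normal being below 1.96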
Our first (tiny) application of machine learning
Let's get our hands dirty and take a look at our hypothetical web start-up, MLaaS, which sells the service of providing machine learning algorithms via HTTP. With the increasing success of our company, the demand for better infrastructure increases to serve all incoming web requests successfully. We don't want to allocate too many resources as that would be too costly. On the other hand, we will lose money if we have not reserved enough resources to serve all incoming requests. Now, the question is, when will we hit the limit of our current infrastructure, which we estimated to be at 100,000 requests per hour. We would like to know in advance when we have to request additional servers in the cloud to serve all the incoming requests successfully without paying for unused ones.
Reading in the data
We have collected the web stats for the last month and aggregated them in ch01/data/web_traffic.tsv (.tsv because it contains tab-separated values). They are stored as the number of hits per hour. Each line contains the hour consecutively and the number of web hits in that hour.
The first few lines of the file simply list the hour and the corresponding hit count, separated by a tab character.
Using SciPy's genfromtxt(), we can easily read in the data using the following code:
>>> import scipy as sp
>>> data = sp.genfromtxt("web_traffic.tsv", delimiter="\t")
We have to specify tab as the delimiter so that the columns are correctly determined.
A quick check shows that we have correctly read in the data:
>>> print(data[:10])
[[ 1.00000000e+00 2.27200000e+03]
[ 2.00000000e+00 nan]
[ 3.00000000e+00 1.38600000e+03]
[ 4.00000000e+00 1.36500000e+03]
[ 5.00000000e+00 1.48800000e+03]
[ 6.00000000e+00 1.33700000e+03]
[ 7.00000000e+00 1.88300000e+03]
[ 8.00000000e+00 2.28300000e+03]
[ 9.00000000e+00 1.33500000e+03]
[ 1.00000000e+01 1.02500000e+03]]
>>> print(data.shape)
(743, 2)
As you can see, we have 743 data points with two dimensions.
Preprocessing and cleaning the data
It is more convenient for SciPy to separate the dimensions into two vectors, each of size 743. The first vector, x, will contain the hours, and the other, y, will contain the Web hits in that particular hour. This splitting is done using the special index notation of SciPy, by which we can choose the columns individually:
x = data[:,0]
y = data[:,1]
There are many more ways in which data can be selected from a SciPy array. Check out http://www.scipy.org/Tentative_NumPy_Tutorial for more details on indexing, slicing, and iterating.
One caveat is still that we have some values in y that contain invalid values, nan. The question is what we can do with them. Let's check how many hours contain invalid data, by running the following code:
>>> sp.sum(sp.isnan(y))
8
As you can see, we are missing only 8 out of 743 entries, so we can afford to remove them. Remember that we can index a SciPy array with another array. sp.isnan(y) returns an array of Booleans indicating whether an entry is a number or not. Using ~, we logically negate that array so that we choose only those elements from x and y where y contains valid numbers:
>>> x = x[~sp.isnan(y)]
>>> y = y[~sp.isnan(y)]
To get the first impression of our data, let's plot the data in a scatter plot using matplotlib. matplotlib contains the pyplot package, which tries to mimic MATLAB's interface, a very convenient and easy-to-use one, as you can see in the following code:
>>> import matplotlib.pyplot as plt
>>> # plot the (x,y) points with dots of size 10
>>> plt.scatter(x, y, s=10)
>>> plt.title("Web traffic over the last month")
>>> plt.xlabel("Time")
>>> plt.ylabel("Hits/hour")
>>> plt.xticks([w*7*24 for w in range(10)],
['week %i' % w for w in range(10)])
>>> plt.autoscale(tight=True)
>>> # draw a slightly opaque, dashed grid
>>> plt.grid(True, linestyle='-', color='0.75')
>>> plt.show()
Note
You can find more tutorials on plotting at http://matplotlib.org/users/pyplot_tutorial.html.
In the resulting chart, we can see that while in the first weeks the traffic stayed more or less the same, the last week shows a steep increase:
Choosing the right model and learning algorithm
Now that we have a first impression of the data, we return to the initial question: How long will our server handle the incoming web traffic? To answer this we have to do the following:
1. Find the real model behind the noisy data points.
2. Following this, use the model to extrapolate into the future to find the point in time where our infrastructure has to be extended.
Before building our first model…
When we talk about models, you can think of them as simplified theoretical approximations of complex reality. As such there is always some inferiority involved, also called the approximation error. This error will guide us in choosing the right model among the myriad of choices we have. And this error will be calculated as the squared distance of the modelâs prediction to the real data; for example, for a learned model function f, the error is calculated as follows:
def error(f, x, y):
return sp.sum((f(x)-y)**2)
The vectors x and y contain the web stats data that we have extracted earlier. It is the beauty of SciPy's vectorized functions that we exploit here with f(x). The trained model is assumed to take a vector and return the results again as a vector of the same size so that we can use it to calculate the difference to y.
Starting with a simple straight line
Let's assume for a second that the underlying model is a straight line. Then the challenge is how to best put that line into the chart so that it results in the smallest approximation error. SciPy's polyfit() function does exactly that. Given data x and y and the desired order of the polynomial (a straight line has order 1), it finds the model function that minimizes the error function defined earlier:
fp1, residuals, rank, sv, rcond = sp.polyfit(x, y, 1, full=True)
The polyfit() function returns the parameters of the fitted model function, fp1. And by setting full=True, we also get additional background information on the fitting process. Of this, only residuals are of interest, which is exactly the error of the approximation:
>>> print("Model parameters: %s" % fp1)
Model parameters: [ 2.59619213 989.02487106]
>>> print(residuals)
[ 3.17389767e+08]
This means the best straight line fit is the following function
f(x) = 2.59619213 * x + 989.02487106.
We then use poly1d() to create a model function from the model parameters:
>>> f1 = sp.poly1d(fp1)
>>> print(error(f1, x, y))
317389767.34
We have used full=True to retrieve more details on the fitting process. Normally, we would not need it, in which case only the model parameters would be returned.
We can now use f1() to plot our first trained model. In addition to the preceding plotting instructions, we simply add the following code:
fx = sp.linspace(0,x[-1], 1000) # generate X-values for plotting
plt.plot(fx, f1(fx), linewidth=4)
plt.legend(["d=%i" % f1.order], loc="upper left")
This will produce the following plot:
It seems like the first 4 weeks are not that far off, although we clearly see that there is something wrong with our initial assumption that the underlying model is a straight line. And then, how good or how bad actually is the error of 317,389,767.34?
The absolute value of the error is seldom of use in isolation. However, when comparing two competing models, we can use their errors to judge which one of them is better. Although our first model clearly is not the one we would use, it serves a very important purpose in the workflow. We will use it as our baseline until we find a better one.
Whatever model we come up with in the future, we will compare it against the current baseline.
Towards some advanced stuff
Let's now fit a more complex model, a polynomial of degree 2, to see whether it better understands our data:
>>> f2p = sp.polyfit(x, y, 2)
>>> print(f2p)
array([ 1.05322215e-02, -5.26545650e+00, 1.97476082e+03])
>>> f2 = sp.poly1d(f2p)
>>> print(error(f2, x, y))
179983507.878
You will get the following plot:
The error is 179,983,507.878, which is almost half the error of the straight line model. This is good but unfortunately this comes with a price: We now have a more complex function, meaning that we have one parameter more to tune inside polyfit(). The fitted polynomial is as follows:
f(x) = 0.0105322215 * x**2 - 5.26545650 * x + 1974.76082
So, if more complexity gives better results, why not increase the complexity even more?
Let's try it for degrees 3, 10, and 100.
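The book's code repository contains the full listing for this; a minimal sketch of how such a comparison could look, reusing the error() function defined earlier, is the following (the list of degrees and the variable name models are our choices, not the book's):
degrees = [1, 2, 3, 10, 100]
models = [sp.poly1d(sp.polyfit(x, y, d)) for d in degrees]
for f in models:
    # f.order reports the degree actually used by the fit
    print("Error d=%i: %f" % (f.order, error(f, x, y)))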
Interestingly, we do not see d=100 for the polynomial that had been fitted with 100 degrees. Instead, we see lots of warnings on the console:
RankWarning: Polyfit may be poorly conditioned
This means because of numerical errors, polyfit cannot determine a good fit with 100 degrees. Instead, it figured that 53 must be good enough.
It seems like the curves capture the fitted data better the more complex they get. And also, the errors seem to tell the same story:
Error d=1: 317,389,767.339778
Error d=2: 179,983,507.878179
Error d=3: 139,350,144.031725
Error d=10: 121,942,326.363461
Error d=53: 109,318,004.475556
However, taking a closer look at the fitted curves, we start to wonder whether they also capture the true process that generated that data. Framed differently, do our models correctly represent the underlying mass behavior of customers visiting our website? Looking at the polynomial of degree 10 and 53, we see wildly oscillating behavior. It seems that the models are fitted too much to the data. So much that it is now capturing not
only the underlying process but also the noise. This is called overfitting. At this point, we have the following choices:
Choosing one of the fitted polynomial models.
Switching to another more complex model class. Splines?
Thinking differently about the data and start again.
Out of the five fitted models, the first order model clearly is too simple, and the models of order 10 and 53 are clearly overfitting. Only the second and third order models seem to somehow match the data. However, if we extrapolate them at both borders, we see them going berserk.
Switching to a more complex class seems also not to be the right way to go. What arguments would back which class? At this point, we realize that we probably have not fully understood our data.
Stepping back to go forward – another look at our data
So, we step back and take another look at the data. It seems that there is an inflection point between weeks 3 and 4. So let's separate the data and train two lines using week 3.5 as a separation point:
inflection = int(3.5*7*24) # calculate the inflection point in hours (as an integer, so it can be used as an index)
xa = x[:inflection] # data before the inflection point
ya = y[:inflection]
xb = x[inflection:] # data after
yb = y[inflection:]
fa = sp.poly1d(sp.polyfit(xa, ya, 1))
fb = sp.poly1d(sp.polyfit(xb, yb, 1))
fa_error = error(fa, xa, ya)
fb_error = error(fb, xb, yb)
print("Error inflection=%f" % (fa_error + fb_error))
Error inflection=132950348.197616
With the first line, we train with the data up to week 3.5, and with the second line we train with the remaining data.
Clearly, the combination of these two lines seems to be a much better fit to the data than anything we have modeled before. But still, the combined error is higher than the higher order polynomials. Can we trust the error at the end?
Asked differently, why do we trust the straight line fitted only at the last week of our data more than any of the more complex models? It is because we assume that it will capture future data better. If we plot the models into the future, we see how right we are (d=1 is again our initial straight line).
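A minimal plotting sketch for this extrapolation, assuming the fitted polynomials are collected in the models list from the earlier sketch (again, our variable names, not the book's), could look like this:
fx = sp.linspace(0, 6 * 7 * 24, 1000)   # extend the x axis to six weeks
plt.scatter(x, y, s=10)
for f in models:
    plt.plot(fx, f(fx), linewidth=2)
plt.legend(["d=%i" % f.order for f in models], loc="upper left")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.autoscale(tight=True)
plt.ylim(0, 10000)   # cap the y axis so the curves remain comparable
plt.show()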
The models of degree 10 and 53 don't seem to expect a bright future of our start-up. They tried so hard to model the given data correctly that they are clearly useless to extrapolate beyond it. This is called overfitting. On the other hand, the lower degree models seem not to be capable of capturing the data well enough. This is called underfitting.
So let's play fair to models of degree 2 and above and try out how they behave if we fit them only to the data of the last week. After all, we believe that the last week says more about the future than the data prior to it. The result can be seen in the following psychedelic chart, which further shows how bad the problem of overfitting is.
Still, judging from the errors of the models when trained only on the data from week 3.5 and later, we should choose the most complex one (note that we also calculate the error only on the time after the inflection point):
Error d=1: 22,143,941.107618
Error d=2: 19,768,846.989176
Error d=3: 19,766,452.361027
Error d=10: 18,949,339.348539
Error d=53: 18,300,702.038119
Training and testing
If we only had some data from the future that we could use to measure our models against, then we should be able to judge our model choice only on the resulting approximation error.
Although we cannot look into the future, we can and should simulate a similar effect by holding out a part of our data. Let's remove, for instance, a certain percentage of the data and train on the remaining one. Then we use the held-out data to calculate the error. As the model has been trained not knowing the held-out data, we should get a more realistic picture of how the model will behave in the future.
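A minimal sketch of such a hold-out split on the post-inflection data could look like the following; the 30 percent test fraction and the names train and test are our choices (the train indices are reused further below when fitting fbt2):
frac = 0.3                                   # fraction of the data used for testing
split_idx = int(frac * len(xb))
shuffled = sp.random.permutation(list(range(len(xb))))
test = sorted(shuffled[:split_idx])          # indices of the held-out data points
train = sorted(shuffled[split_idx:])         # indices used for fitting
for d in [1, 2, 3, 10, 100]:
    f = sp.poly1d(sp.polyfit(xb[train], yb[train], d))
    print("Error d=%i: %f" % (f.order, error(f, xb[test], yb[test])))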
The test errors for the models trained only on the time after the inflection point now show a
completely different picture:
Error d=1: 6397694.386394
Error d=2: 6010775.401243
Error d=3: 6047678.658525
Error d=10: 7037551.009519
Error d=53: 7052400.001761
Have a look at the following plot:
It seems that we finally have a clear winner: The model with degree 2 has the lowest test error, which is the error when measured using data that the model did not see during training. And this gives us hope that we won't get bad surprises when future data arrives.
Answering our initial question
Finally we have arrived at a model which we think represents the underlying process best; it is now a simple task of finding out when our infrastructure will reach 100,000 requests per hour. We have to calculate when our model function reaches the value 100,000.
Having a polynomial of degree 2, we could simply compute the inverse of the function and calculate its value at 100,000. Of course, we would like to have an approach that is applicable to any model function easily.
This can be done by subtracting 100,000 from the polynomial, which results in another polynomial, and finding its root. SciPy's optimize module has the function fsolve that achieves this, when providing an initial starting position with parameter x0. As every entry in our input data file corresponds to one hour, and we have 743 of them, we set the starting position to some value after that. Let fbt2 be the winning polynomial of degree 2.
>>> fbt2 = sp.poly1d(sp.polyfit(xb[train], yb[train], 2))
>>> print("fbt2(x)= \n%s" % fbt2)
fbt2(x)=
       2
0.086 x - 94.02 x + 2.744e+04
>>> print("fbt2(x)-100,000= \n%s" % (fbt2-100000))
fbt2(x)-100,000=
       2
0.086 x - 94.02 x - 7.256e+04
>>> from scipy.optimize import fsolve
>>> reached_max = fsolve(fbt2-100000, x0=800)/(7*24)
>>> print("100,000 hits/hour expected at week %f" % reached_max[0])
It is expected to have 100,000 hits/hour at week 9.616071, so our model tells us that, given the current user behavior and traction of our start-up, it will take another month until we have reached our capacity threshold.
Of course, there is a certain uncertainty involved with our prediction. To get a real picture of it, one could draw in more sophisticated statistics to find out about the variance we have to expect when looking farther and farther into the future.
And then there are the user and underlying user behavior dynamics that we cannot model accurately. However, at this point, we are fine with the current predictions. After all, we can prepare all time-consuming actions now. If we then monitor our web traffic closely, we will see in time when we have to allocate new resources.