🔙 Quay lại trang tải sách pdf ebook Efficient Learning Machines
 Ebooks
Nhóm Zalo
For your convenience Apress has placed some of the front  matter material after the index. Please use the Bookmarks  and Contents at a Glance links to access them. 
Contents at a Glance 
About the Authors����������������������������������������������������������������������������������������������������xv About the Technical Reviewers �����������������������������������������������������������������������������xvii Acknowledgments��������������������������������������������������������������������������������������������������xix 
■Chapter 1: Machine Learning �������������������������������������������������������������������������������� 1 ■Chapter 2: Machine Learning and Knowledge Discovery ������������������������������������ 19 ■Chapter 3: Support Vector Machines for Classification��������������������������������������� 39 ■Chapter 4: Support Vector Regression ���������������������������������������������������������������� 67 ■Chapter 5: Hidden Markov Model ������������������������������������������������������������������������ 81 ■Chapter 6: Bioinspired Computing: Swarm Intelligence������������������������������������ 105 ■Chapter 7: Deep Neural Networks ��������������������������������������������������������������������� 127 ■Chapter 8: Cortical Algorithms �������������������������������������������������������������������������� 149 ■Chapter 9: Deep Learning ���������������������������������������������������������������������������������� 167 ■Chapter 10: Multiobjective Optimization����������������������������������������������������������� 185 ■Chapter 11: Machine Learning in Action: Examples������������������������������������������ 209 Index��������������������������������������������������������������������������������������������������������������������� 241
v 
Chapter 1 
Machine Learning 
Nature is a self-made machine, more perfectly automated than any automated machine.  To create something in the image of nature is to create a machine, and it was by learning  the inner working of nature that man became a builder of machines. 
—Eric Hoffer, Reflections on the Human Condition 
Machine learning (ML) is a branch of artificial intelligence that systematically applies algorithms to  synthesize the underlying relationships among data and information. For example, ML systems can be  trained on automatic speech recognition systems (such as iPhone’s Siri) to convert acoustic information in a  sequence of speech data into semantic structure expressed in the form of a string of words. 
ML is already finding widespread uses in web search, ad placement, credit scoring, stock market  prediction, gene sequence analysis, behavior analysis, smart coupons, drug development, weather  forecasting, big data analytics, and many more applications. ML will play a decisive role in the development  of a host of user-centric innovations. 
ML owes its burgeoning adoption to its ability to characterize underlying relationships within large  arrays of data in ways that solve problems in big data analytics, behavioral pattern recognition, and  information evolution. ML systems can moreover be trained to categorize the changing conditions of a  process so as to model variations in operating behavior. As bodies of knowledge evolve under the influence  of new ideas and technologies, ML systems can identify disruptions to the existing models and redesign and  retrain themselves to adapt to and coevolve with the new knowledge. 
The computational characteristic of ML is to generalize the training experience (or examples) and  output a hypothesis that estimates the target function. The generalization attribute of ML allows the system  to perform well on unseen data instances by accurately predicting the future data. Unlike other optimization  problems, ML does not have a well-defined function that can be optimized. Instead, training errors serve  as a catalyst to test learning errors. The process of generalization requires classifiers that input discrete or  continuous feature vectors and output a class. 
The goal of ML is to predict future events or scenarios that are unknown to the computer. In 1959,  Arthur Samuel described ML as the “field of study that gives computers the ability to learn without  being explicitly programmed” (Samuel 1959). He concluded that programming computers to learn from  experience should eventually eliminate the need for much of this detailed programming effort. According  to Tom M. Mitchell’s definition of ML: “A computer program is said to learn from experience E with respect  to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P,  improves with experience E.” Alan Turing’s seminal paper (Turing 1950) introduced a benchmark standard  for demonstrating machine intelligence, such that a machine has to be intelligent and responsive in a  manner that cannot be differentiated from that of a human being.
1 
Chapter 1 ■ Machine Learning 
The learning process plays a crucial role in generalizing the problem by acting on its historical experience.  Experience exists in the form of training datasets, which aid in achieving accurate results on new and unseen  tasks. The training datasets encompass an existing problem domain that the learner uses to build a general  model about that domain. This enables the model to generate largely accurate predictions in new cases. 
Key Terminology 
To facilitate the reader’s understanding of the concept of ML, this section defines and discusses some key  multidisciplinary conceptual terms in relation to ML. 
•  classifier. A method that receives a new input as an unlabeled instance of an  observation or feature and identifies a category or class to which it belongs. Many  commonly used classifiers employ statistical inference (probability measure) to  
categorize the best label for a given instance. 
•  confusion matrix (aka error matrix). A matrix that visualizes the performance of  the classification algorithm using the data in the matrix. It compares the predicted  classification against the actual classification in the form of false positive, true  
positive, false negative and true negative information. A confusion matrix for  
a two-class classifier system (Kohavi and Provost, 1998) follows: 
•  accuracy (aka error rate). The rate of correct (or incorrect) predictions made by the  model over a dataset. Accuracy is usually estimated by using an independent test set  that was not used at any time during the learning process. More complex accuracy  estimation techniques, such as cross-validation and bootstrapping, are commonly  used, especially with datasets containing a small number of instances.
2 
Chapter 1 ■ Machine Learning 
 Accuracy (AC) = + 
TP TN 
TP TN FN FP (1-1) 
+ + + 
 Precision (P) = +TP 
TP FP (1-2) 
 Recall R( ) ,true positive rate = +TP 
TP FN (1-3) 
2 
 F M− = easure ( ) + ⋅ ⋅ 
β 
β 
1 P R 
P R , (1-4) 
2 
⋅ + 
where β has a value from 0 to infinity (∞) and is used to control the weight assigned  to P and R. 
•  cost. The measurement of performance (or accuracy) of a model that predicts (or  evaluates) the outcome for an established result; in other words, that quantifies the  deviation between predicted and actual values (or class labels). An optimization  function attempts to minimize the cost function. 
•  cross-validation. A verification technique that evaluates the generalization ability  of a model for an independent dataset. It defines a dataset that is used for testing the  trained model during the training phase for overfitting. Cross-validation can also be used  to evaluate the performance of various prediction functions. In k-fold cross-validation, the training dataset is arbitrarily partitioned into k mutually exclusive subsamples  (or folds) of equal sizes. The model is trained k times (or folds), where each iteration  uses one of the k subsamples for testing (cross-validating), and the remaining k-1  subsamples are applied toward training the model. The k results of cross-validation  are averaged to estimate the accuracy as a single estimation. 
•  data mining. The process of knowledge discovery (q.v.) or pattern detection in a  large dataset. The methods involved in data mining aid in extracting the accurate  data and transforming it to a known structure for further evaluation. 
•  dataset. A collection of data that conform to a schema with no ordering  requirements. In a typical dataset, each column represents a feature and each row  represents a member of the dataset. 
•  dimension. A set of attributes that defines a property. The primary functions of  dimension are filtering, classification, and grouping. 
•  induction algorithm. An algorithm that uses the training dataset to generate a  model that generalizes beyond the training dataset. 
•  instance. An object characterized by feature vectors from which the model is either  trained for generalization or used for prediction. 
•  knowledge discovery. The process of abstracting knowledge from structured or  unstructured sources to serve as the basis for further exploration. Such knowledge is  collectively represented as a schema and can be condensed in the form of a model  or models to which queries can be made for statistical prediction, evaluation, and  further knowledge discovery.
3 
Chapter 1 ■ Machine Learning 
•  model. A structure that summarizes a dataset for description or prediction. Each  model can be tuned to the specific requirements of an application. Applications  in big data have large datasets with many predictors and features that are too  complex for a simple parametric model to extract useful information. The learning  process synthesizes the parameters and the structures of a model from a given  dataset. Models may be generally categorized as either parametric (described by  a finite set of parameters, such that future predictions are independent of the new  dataset) or nonparametric (described by an infinite set of parameters, such that  the data distribution cannot be expressed in terms of a finite set of parameters).  Nonparametric models are simple and flexible, and make fewer assumptions, but  they require larger datasets to derive accurate conclusions. 
•  online analytical processing (OLAP). An approach for resolving multidimensional  analytical queries. Such queries index into the data with two or more attributes (or  dimensions). OLAP encompasses a broad class of business intelligence data and is  usually synonymous with multidimensional OLAP (MOLAP). OLAP engines facilitate  the exploration of multidimensional data interactively from several perspectives,  thereby allowing for complex analytical and ad hoc queries with a rapid execution  time. OLAP commonly uses intermediate data structures to store precalculated  results on multidimensional data, allowing fast computation. Relational OLAP (ROLAP) uses relational databases of the base data and the dimension tables. 
•  schema. A high-level specification of a dataset’s attributes and properties. 
•  supervised learning. Learning techniques that extract associations between  independent attributes and a designated dependent attribute (the label). Supervised  learning uses a training dataset to develop a prediction model by consuming  input data and output values. The model can then make predictions of the output  values for a new dataset. The performance of models developed using supervised  learning depends upon the size and variance of the training dataset to achieve  better generalization and greater predictive power for new datasets. Most induction  algorithms fall into the supervised learning category. 
•  unsupervised learning. Learning techniques that group instances without a  prespecified dependent attribute. This technique generally involves learning  structured patterns in the data by rejecting pure unstructured noise. Clustering and  dimensionality reduction algorithms are usually unsupervised. 
•  feature vector. An n-dimensional numerical vector of explanatory variables  representing an instance of some object that facilitates processing and statistical  analysis. Feature vectors are often weighted to construct a predictor function that  is used to evaluate the quality or fitness of the prediction. The dimensionality of a  feature vector can be reduced by various dimensionality reduction techniques, such  as principal component analysis (PCA), multilinear subspace reduction, isomaps,  and latent semantic analysis (LSA). The vector space associated with these vectors is  often called the feature space.
4 
Chapter 1 ■ Machine Learning 
Developing a Learning Machine 
Machine learning aids in the development of programs that improve their performance for a given task  through experience and training. Many big data applications leverage ML to operate at highest efficiency. The  sheer volume, diversity, and speed of data flow have made it impracticable to exploit the natural capability  of human beings to analyze data in real time. The surge in social networking and the wide use of Internet based applications have resulted not only in greater volume of data, but also increased complexity of data. To  preserve data resolution and avoid data loss, these streams of data need to be analyzed in real time. 
The heterogeneity of the big data stream and the massive computing power we possess today present  us with abundant opportunities to foster learning methodologies that can identify best practices for a given  business problem. The sophistication of modern computing machines can handle large data volumes,  greater complexity, and terabytes of storage. Additionally, intelligent program-flows that run on these  machines can process and combine many such complex data streams to develop predictive models and  extract intrinsic patterns in otherwise noisy data. When you need to predict or forecast a target value,  supervised learning is the appropriate choice. The next step is to decide, depending on the target value,  between clustering (in the case of discrete target value) and regression (in the case of numerical target value). 
You start the development of ML by identifying all the metrics that are critical to a decision process. The  processes of ML synthesize models for optimizing the metrics. Because the metrics are essential to developing  the solution for a given decision process, they must be selected carefully during conceptual stages. 
It is also important to judge whether ML is the suitable approach for solving a given problem. By its  nature, ML cannot deliver perfect accuracy. For solutions requiring highly accurate results in a bounded  time period, ML may not be the preferred approach. In general, the following conditions are favorable to  the application of ML: (a) very high accuracy is not desired; (b) large volumes of data contain undiscovered  patterns or information to be synthesized; (c) the problem itself is not very well understood owing to lack  of knowledge or historical information as a basis for developing suitable algorithms; and (d) the problem  needs to adapt to changing environmental conditions. 
The process of developing ML algorithms may be decomposed into the following steps: 
1. Collect the data. Select the subset of all available data attributes that might be  useful in solving the problem. Selecting all the available data may be unnecessary  
or counterproductive. Depending upon the problem, data can either be retrieved  
through a data-stream API (such as a CPU performance counters) or synthesized  
by combining multiple data streams. In some cases, the input data streams,  
whether raw or synthetic, may be statistically preprocessed to improve usage or  
reduce bandwidth. 
2. Preprocess the Data. Present the data in a manner that is understood by the  
consumer of the data. Preprocessing consists of the following three steps: 
i. Formatting. The data needs to be presented in a useable format. Using  
an industry-standard format enable plugging the solution with multiple  
vendors that in turn can mix and match algorithms and data sources such as  
XML, HTML, and SOAP. 
ii. Cleaning. The data needs to be cleaned by removing, substituting, or fixing  
corrupt or missing data. In some cases, data needs to be normalized,  
discretized, averaged, smoothened, or differentiated for efficient usage. In  
other cases, data may need to be transmitted as integers, double precisions,  
or strings. 
iii. Sampling. Data need to be sampled at regular or adaptive intervals in a  
manner such that redundancy is minimized without the loss of information  
for transmission via communication channels.
5 
Chapter 1 ■ Machine Learning 
3. Transform the data. Transform the data specific to the algorithm and the  
knowledge of the problem. Transformation can be in the form of feature scaling,  decomposition, or aggregation. Features can be decomposed to extract the useful  components embedded in the data or aggregated to combine multiple instances  into a single feature. 
4. Train the algorithm. Select the training and testing datasets from the transformed  data. An algorithm is trained on the training dataset and evaluated against the  
test set. The transformed training dataset is fed to the algorithm for extraction of  knowledge or information. This trained knowledge or information is stored as a  model to be used for cross-validation and actual usage. Unsupervised learning,  having no target value, does not require the training step. 
5. Test the algorithm. Evaluate the algorithm to test its effectiveness and performance.  This step enables quick determination whether any learnable structures can be  
identified in the data. A trained model exposed to test dataset is measured against  predictions made on that test dataset which are indicative of the performance  
of the model. If the performance of the model needs improvement, repeat the  
previous steps by changing the data streams, sampling rates, transformations,  
linearizing models, outliers’ removal methodology, and biasing schemes. 
6. Apply reinforcement learning. Most control theoretic applications require a good  feedback mechanism for stable operations. In many cases, the feedback data  
are sparse, delayed, or unspecific. In such cases, supervised learning may not be  practical and may be substituted with reinforcement learning (RL). In contrast to  supervised learning, RL employs dynamic performance rebalancing to learn from  the consequences of interactions with the environment, without explicit training. 
7. Execute. Apply the validated model to perform an actual task of prediction. If new  data are encountered, the model is retrained by applying the previous steps. The  process of training may coexist with the real task of predicting future behavior. 
Machine Learning Algorithms 
Based on underlying mappings between input data and anticipated output presented during the learning  phase of ML, ML algorithms may be classified into the following six categories: 
•  Supervised learning is a learning mechanism that infers the underlying relationship  between the observed data (also called input data) and a target variable  
(a dependent variable or label) that is subject to prediction (Figure 1-1). The learning  task uses the labeled training data (training examples) to synthesize the model  
function that attempts to generalize the underlying relationship between the feature  vectors (input) and the supervisory signals (output). The feature vectors influence  the direction and magnitude of change in order to improve the overall performance  of the function model. The training data comprise observed input (feature) vectors  and a desired output value (also called the supervisory signal or class label).  
A well-trained function model based on a supervised learning algorithm can  
accurately predict the class labels for hidden phenomena embedded in unfamiliar  or unobserved data instances. The goal of learning algorithms is to minimize the  error for a given set of inputs (the training set). However, for a poor-quality training  set that is influenced by the accuracy and versatility of the labeled examples, the  model may encounter the problem of overfitting, which typically represents poor  generalization and erroneous classification.
6 
Chapter 1 ■ Machine Learning 
Figure 1-1. High-level flow of supervised learning
•  Unsupervised learning algorithms are designed to discover hidden structures in  unlabeled datasets, in which the desired output is unknown. This mechanism has  found many uses in the areas of data compression, outlier detection, classification,  human learning, and so on. The general approach to learning involves training  through probabilistic data models. Two popular examples of unsupervised learning  are clustering and dimensionality reduction. In general, an unsupervised learning  dataset is composed of inputs x x x x 1 2 3 n , ,  , but it contains neither target outputs  (as in supervised learning) nor rewards from its environment. The goal of ML in this  case is to hypothesize representations of the input data for efficient decision making,  forecasting, and information filtering and clustering. For example, unsupervised  training can aid in the development of phase-based models in which each phase,  synthesized through an unsupervised learning process, represents a unique  condition for opportunistic tuning of the process. Furthermore, each phase can  act as a state and can be subjected to forecasting for proactive resource allocation  or distribution. Unsupervised learning algorithms centered on a probabilistic  distribution model generally use maximum likelihood estimation (MLE), maximum  a posteriori (MAP), or Bayes methods. Other algorithms that are not based on  probability distribution models may employ statistical measurements, quantization  error, variance preserving, entropy gaps, and so on. 
7 
Chapter 1 ■ Machine Learning 
•  Semi-supervised learning uses a combination of a small number of labeled and  a large number of unlabeled datasets to generate a model function or classifier.  Because the labeling process of acquired data requires intensive skilled human labor  inputs, it is expensive and impracticable. In contrast, unlabeled data are relatively  inexpensive and readily available. Semi-supervised ML methodology operates  somewhere between the guidelines of unsupervised learning (unlabeled training  data) and supervised learning (labeled training data) and can produce considerable  improvement in learning accuracy. Semi-supervised learning has recently gained  greater prominence, owing to the availability of large quantities of unlabeled data for  diverse applications to web data, messaging data, stock data, retail data, biological  data, images, and so on. This learning methodology can deliver value of practical  and theoretical significance, especially in areas related to human learning, such as  speech, vision, and handwriting, which involve a small amount of direct instruction  and a large amount of unlabeled experience. 
•  Reinforcement learning (RL) methodology involves exploration of an adaptive  sequence of actions or behaviors by an intelligent agent (RL-agent) in a given  environment with a motivation to maximize the cumulative reward (Figure 1-2).  The intelligent agent’s action triggers an observable change in the state of the  environment. The learning technique synthesizes an adaptation model by training  itself for a given set of experimental actions and observed responses to the state of  the environment. In general, this methodology can be viewed as a control-theoretic  trial-and-error learning paradigm with rewards and punishments associated with  a sequence of actions. The RL-agent changes its policy based on the collective  experience and consequent rewards. RL seeks past actions it explored that resulted  in rewards. To build an exhaustive database or model of all the possible action reward projections, many unproven actions need to be tried. These untested  actions may have to be attempted multiple times before ascertaining their strength.  Therefore, you have to strike a balance between exploration of new possible actions  and likelihood of failure resulting from those actions. Critical elements of RL include  the following: 
•  The policy is a key component of an RL-agent that maps the control-actions to  the perceived state of the environment. 
•  The critic represents an estimated value function that criticizes the actions that  are made according to existing policy. Alternatively, the critic evaluates the  performance of the current state in response to an action taken according to  current policy. The critic-agent shapes the policy by making continuous and  ongoing corrections. 
•  The reward function estimates the instantaneous desirability of the perceived  state of the environment for an attempted control-action. 
•  Models are planning tools that aid in predicting the future course of action by  contemplating possible future situations.
8 
Chapter 1 ■ Machine Learning 
Figure 1-2. High-level flow of reinforcement learning
•  Transductive learning (aka transductive inference) attempts to predict exclusive  model functions on specific test cases by using additional observations on the  training dataset in relation to the new cases (Vapnik 1998). A local model is  established by fitting new individual observations (the training data) into a single  point in space—this, in contrast to the global model, in which new data have to  fit into the existing model without postulating any specific information related  to the location of that data point in space. Although the new data may fit into the  global model to a certain extent (with some error), thereby creating a global model  that would represent the entire problem, space is a challenge and may not be  necessary in all cases. In general, if you experience discontinuities during the model  development for a given problem space, you can synthesize multiple models at  the discontinuous boundaries. In this case, newly observed data are the processed  through the model that fulfill the boundary conditions in which the model is valid. 
•  Inductive inference estimates the model function based on the relation of data  to the entire hypothesis space, and uses this model to forecast output values for  examples beyond the training set. These functions can be defined using one of the  many representation schemes, including linear weighted polynomials, logical rules,  and probabilistic descriptions, such as Bayesian networks. Many statistical learning  methods start with initial solutions for the hypothesis space and then evolve them  iteratively to reduce error. Many popular algorithms fall into this category, including  SVMs (Vapnik 1998), neural network (NN) models (Carpenter and Grossberg 1991),  and neuro-fuzzy algorithms (Jang 1993). In certain cases, one may apply a lazy learning model, in which the generalization process can be an ongoing task that effectively  develops a richer hypothesis space, based on new data applied to the existing model. 
9 
Chapter 1 ■ Machine Learning 
Popular Machine Learning Algorithms 
This section describes in turn the top 10 most influential data mining algorithms identified by the IEEE  International Conference on Data Mining (ICDM) in December 2006: C4.5, k-means, SVMs, Apriori,  estimation maximization (EM), PageRank, AdaBoost, k–nearest neighbors (k-NN), naive Bayes, and  classification and regression trees (CARTs) (Wu et al. 2008). 
C4.5 
C4.5 classifiers are one of the most frequently used categories of algorithms in data mining. A C4.5 classifier  inputs a collection of cases wherein each case is a sample preclassified to one of the existing classes. Each  case is described by its n-dimensional vector, representing attributes or features of the sample. The output  of a C4.5 classifier can accurately predict the class of a previously unseen case. C4.5 classification algorithms  generate classifiers that are expressed as decision trees by synthesizing a model based on a tree structure.  Each node in the tree structure characterizes a feature, with corresponding branches representing possible  values connecting features and leaves representing the class that terminates a series of nodes and branches.  The class of an instance can be determined by tracing the path of nodes and branches to the terminating leaf. Given a set S of instances, C4.5 uses a divide-and-conquer method to grow an initial tree, as follows: 
•  If all the samples in the list S belong to the same class, or if the list S is small, then  create a leaf node for the decision tree and label it with the most frequent class. 
•  Otherwise, the algorithm selects an attribute-based test that branches S into multiple  subbranches (partitions) (S1, S2, …), each representing the outcome of the test.  
The tests are placed at the root of the tree, and each path from the root to the leaf  becomes a rule script that labels a class at the leaf. This procedure applies to each  subbranch recursively. 
•  Each partition of the current branch represents a child node, and the test separating  S represents the branch of the tree. 
This process continues until every leaf contains instances from only one class or further partition is  not possible. C4.5 uses tests that select attributes with the highest normalized information gain, enabling  disambiguation of the classification of cases that may belong to two or more classes. 
k-Means 
The k-means algorithm is a simple iterative clustering algorithm (Lloyd 1957) that partitions N data points  into K disjoint subsets Sj so as to minimize the sum-of-squares criterion. Because the sum of squares is the  squared Euclidean distance, this is intuitively the “nearest” mean, 
K 
2 μ (1-5) 
 J = − = ∈ ∑∑| | x , n 
j 
1 j 
j n S 
where 
xn= vector representing the nth data point μj = geometric centroid of the data points in Sj
10 
Chapter 1 ■ Machine Learning 
The algorithm consists of a simple two-step re-estimation process: 
1. Assignment: Data points are assigned to the cluster whose centroid is closest to  that point. 
2. Update: Each cluster centroid is recalculated to the center (mean) of all data  
points assigned to it. 
These two steps are alternated until a stopping criterion is met, such that there is no further change in  the assignment of data points. Every iteration requires N × K comparisons, representing the time complexity  of one iteration. 
Support Vector Machines 
Support vector machines (SVMs) are supervised learning methods that analyze data and recognize patterns.  SVMs are primarily used for classification, regression analysis, and novelty detection. Given a set of  training data in a two-class learning task, an SVM training algorithm constructs a model or classification  function that assigns new observations to one of the two classes on either side of a hyperplane, making it  a nonprobabilistic binary linear classifier (Figure 1-3). An SVM model maps the observations as points in  space, such that they are classified into a separate partition that is divided by the largest distance to the  nearest observation data point of any class (the functional margin). New observations are then predicted to  belong to a class based on which side of the partition they fall. Support vectors are the data points nearest to  the hyperplane that divides the classes. Further details of support vector machines are given in Chapter 4. 
Figure 1-3. The SVM algorithm finds the hyperplane that maximizes the largest minimum distance between  the support vectors
11 
Chapter 1 ■ Machine Learning 
Apriori 
Apriori is a data mining approach that discovers frequent itemsets by using candidate generation (Agrawal  and Srikant 1994) from a transactional database and highlighting association rules (general trends) in the  database. It assumes that any subset of a frequently occurring pattern must be frequent. Apriori performs  breadth-first search to scan frequent 1-itemsets (that is, itemsets of size 1) by accumulating the count for  each item that satisfies the minimum support requirement. The set of frequent 1-itemsets is used to find the  set of frequent 2-itemsets, and so on. This process iterates until no more frequent k-itemsets can be found.  The Apriori method that identifies all the frequent itemsets can be summarized in the following three steps: 
1. Generate candidates for frequent k + 1-itemsets (of size k + 1) from the frequent  k-itemsets (of size k). 
2. Scan the database to identify candidates for frequent k + 1-itemsets, and  
calculate the support of each of those candidates. 
3. Add those itemsets that satisfy the minimum support requirement to frequent  itemsets of size k + 1. 
Thanks in part to the simplicity of the algorithm, it is widely used in data mining applications. Various  improvements have been proposed, notably, the frequent pattern growth (FP-growth) extension, which  eliminates candidate generation. Han et al. (Han, Pei, and Yin 2000) propose a frequent pattern tree  (FP-tree) structure, which stores and compresses essential information to interpret frequent patterns and  uses FP-growth for mining the comprehensive set of frequent patterns by pattern fragment growth. This  Apriori technique enhancement constructs a large database that contains all the essential information and  compresses it into a highly condensed data structure. In the subsequent step, it assembles a conditional pattern base which represents a set of counted patterns that co-occur relative to each item. Starting at the  frequent header table, it traverses the FP-tree by following each frequent item and stores the prefix paths of  those items to produce a conditional pattern base. Finally, it constructs a conditional FP-tree for each of the  frequent items of the conditional pattern base. Each node in the tree represents an item and its count. Nodes  sharing the same label but residing on different subtrees are conjoined by a node–link pointer. The position  of a node in the tree structure represents the order of the frequency of an item, such that a node closer to the  root may be shared by more transactions in a transactional database. 
Estimation Maximization 
The estimation–maximization (EM) algorithm facilitates parameter estimation in probabilistic models  with incomplete data. EM is an iterative scheme that estimates the MLE or MAP of parameters in statistical  models, in the presence of hidden or latent variables. The EM algorithm iteratively alternates between the  steps of performing an expectation (E), which creates a function that estimates the probability distribution  over possible completions of the missing (unobserved) data, using the current estimate for the parameters,  and performing a maximization (M), which re-estimates the parameters, using the current completions  performed during the E step. These parameter estimates are iteratively employed to estimate the distribution  of the hidden variables in the subsequent E step. In general, EM involves running an iterative algorithm  with the following attributes: (a) observed data, X; (b) latent (or missing) data, Z; (c) unknown parameter, θ;  and (d) a likelihood function, L(θ; X, Z) = P(X, Z|θ). The EM algorithm iteratively calculates the MLE of the  marginal likelihood using a two-step method: 
1. Estimation (E): Calculate the expected value of the log likelihood function, with  respect to the conditional distribution of Z, given X under the current estimate of  
the parameters θ(t), such that 
 Q t ( | ( )) l E L Z X t og ( ;X Z, ) . θ θ = | ,θ ( ) [ ] θ (1-6)
12 
Chapter 1 ■ Machine Learning 
2. Maximization (M): Find the parameter that maximizes this quantity: 
 θ θ ( ) t Q + = 1 arg mθ ax ( |θ(t)). (1-7) 
PageRank 
PageRank is a link analysis search algorithm that ranks the elements of hyperlinked documents on the World  Wide Web for the purpose of measuring their importance, relative to other links. Developed by Larry Page  and Sergey Bin, PageRank produces static rankings that are independent of the search queries. PageRank  simulates the concept of prestige in a social network. A hyperlink to a page counts as a vote of support.  Additionally, PageRank interprets a hyperlink from source page to target page in such a manner that the  page with the higher rank improves the rank of the linked page (the source or target). Therefore, backlinks  from highly ranked pages are more significant than those from average pages. Mathematically simple,  PageRank can be calculated as 
 r P r Q 
Q Bp Q ( ) ( ) 
| | = , ∈∑ (1-8) 
where 
r(P) = rank of the page P 
Bp= the set of all pages linking to page P 
|Q| = number of links from page Q 
r(Q) = rank of the page Q 
AdaBoost (Adaptive Boosting) 
AdaBoost is an ensemble method used for constructing strong classifiers as linear combinations of simple,  weak classifiers (or rules of thumb) (Freund and Schapire 1997). As in any ensemble method, AdaBoost  employs multiple learners to solve a problem with better generalization ability and more accurate  prediction. The strong classifier can be evaluated as a linear combination of weak classifiers, such that T 
∑β 
H x h x t t 
( ) = ⋅ ( ), 
= 
1 
t 
where 
H(x) = strong classifier 
ht(x) = weak classifier (feature) 
The Adaboost algorithm may be summarized as follows: 
Input: 
Data-Set I x y x y x y x y ={ } ( ) 1 1 ( ) 2 2 ( ) 3 3 ( ) m m , , , , , ,, , , Base learning algorithm L 
Number of learning rounds T 
Process: 
i 1 D m 
= // Initialize weight distribution 1 
FOR (t = 1 to T) DO // Run the loop for t = T iterations ht = L(I, Dt) // Train a weak learner ht from I using Dt ∈ =t ∑ − tit i i 
D | ( h x ) | y // calculate the error of hti 
13 
Chapter 1 ■ Machine Learning 
⎝⎜ ⎞⎠⎟ 121 ln // calculate the weight of ht 
⎛ 
= −∈ 
βtt 
∈ 
t 
i ti 
D DZ e t i t i y h x 
t 
− ⋅ ⋅ = ⋅ 1( ( β )) // Update the distribution, 
+ 
t 
 // Zt is the normalization factor END 
Output: 
T 
( ) = ( ) ⎛⎝⎜ ⎞⎠⎟ = 
∑β 
H x sign h x t t 
// Strong classifier 
t 
1 
The AdaBoost algorithm is adaptive, inasmuch as it uses multiple iterations to produce a strong learner  that is well correlated with the true classifier. As shown above, it iterates by adding weak learners that are  slightly correlated with the true classifier. As part of the adaptation process, the weighting vector adjusts  itself to improve upon misclassification in previous rounds. The resulting classifier has a greater accuracy  than the weak learners’ classifiers. AdaBoost is fast, simple to implement, and flexible insofar as it can be  combined with any classifier. 
k-Nearest Neighbors 
The k-nearest neighbors (k-NN) classification methodology identifies a group of k objects in the training  set that are closest to the test object and assigns a label based on the most dominant class in this  neighborhood. The three fundamental elements of this approach are 
•  an existing set of labeled objects 
•  a distance metric to estimate distance between objects 
•  the number of nearest neighbors (k) 
To classify an unlabeled object, the distances between it and labeled objects are calculated and its  k-nearest neighbors are identified. The class labels of these nearest neighbors serve as a reference for  classifying the unlabeled object. The k-NN algorithm computes the similarity distance between a training  set, (x, y) ∈ I, and the test object, x z ˆ = (x y ˆ, ˆ), to determine its nearest-neighbor list, Iz. x represents the training  object, and y represents the corresponding training class. xˆ and yˆ represent the test object and its class,  respectively. The algorithm may be summarized as follows: 
Input: 
Training object (x, y) ∈ I and test object x z ˆ = (x y ˆ, ˆ) 
Process: 
Compute distance x dˆ = (x x ˆ, ) between z and every object (x, y) ∈ I. 
Select I I z ⊆ , the set of k closest training objects to z. 
Output (Majority Class): 
∑ 
ˆ arg ( ) 
y max F v y v i 
= = 
( , ) 
∈ 
x y I i i Z 
F(.) = 1 if argument (.) is TRUE and 0 otherwise, v is the class label. 
The value of k should be chosen carefully. A smaller value can result in noisy behavior, whereas a larger  value may include too many points from other classes.
14 
Chapter 1 ■ Machine Learning 
Naive Bayes 
Naive Bayes is a simple probabilistic classifier that applies Bayes’ theorem with strong (naive) assumption  of independence, such that the presence of an individual feature of a class is unrelated to the presence of  another feature. 
Assume that input features x x x 1 2 n ,  are conditionally independent of each other, given the class  label Y, such that 
n 
=1 Π (1-9) 
 P x x xn Y P x Y 
i ( , | ) ( | ) 1 2 = 
i 
For a two-class classification (i = 0,1), we define P(i|x) as the probability that measurement vector  x x x x = n { , } 1 2 belongs to class i. Moreover, we define a classification score 
n 
Π 
f x P ( | ) ( ) 
 P x 
j 
1 1 
n ( | ) ΠΠ f x 
( | ) 
1 
= = = 
P 
( ) 1 
1 
j 
n 
1 
f x 
j 
(1-10) 
P x ( | ) 0 
f x P ( | ) ( ) 
P 
( ) 0 
j 
= 
1 
( | ) 0 
j 
j 
= 
1 
j 
0 0 
n j 
( )ln ( | ) 
 ln ( | ) 
f x 
1 
( | ), P x ( | )ln ( ) 
1 
P 
1 
∑ (1-11) 
P x 
= + 
0 
P 
0 
j f x 
= 
1 0 j 
where P(i|x) is proportional to f(x|i)P(i) and f(x|i) is the conditional distribution of x for class i objects. The naive Bayes model is surprisingly effective and immensely appealing, owing to its simplicity and  robustness. Because this algorithm does not require application of complex iterative parameter estimation  schemes to large datasets, it is very useful and relatively easy to construct and use. It is a popular algorithm  in areas related to text classification and spam filtering. 
Classification and Regression Trees 
A classification and regression tree (CART) is a nonparametric decision tree that uses a binary recursive  partitioning scheme by splitting two child nodes repeatedly, starting with the root node, which contains the  complete learning sample (Breiman et al. 1984). The tree-growing process involves splitting among all the  possible splits at each node, such that the resulting child nodes are the “purest.” Once a CART has generated  a “maximal tree,” it examines the smaller trees obtained by pruning away the branches of the maximal  tree to determine which contribute least to the overall performance of the tree on training data. The CART  mechanism is intended to yield a sequence of nested pruned trees. The right-sized, or “honest,” tree is  identified by evaluating the predictive performance of every tree in the pruning sequence. 
Challenging Problems in Data Mining Research 
Data mining and knowledge discovery have become fields of interdisciplinary research in the areas related  to database systems, ML, intelligent information systems, expert systems, control theory, and many others.  Data mining is an important and active area of research but not one without theoretical and practical  challenges from working with very large databases that may be noisy, incomplete, redundant, and dynamic  in nature. A study by Yang and Wu (2006) reviews the most challenging problems in data mining research, as  summarized in the following sections.
15 
Chapter 1 ■ Machine Learning 
Scaling Up for High-Dimensional Data and High-Speed Data Streams 
Designing classifiers that can handle very high-dimensional features extracted through high-speed data  streams is challenging. To ensure a decisive advantage, data mining in such cases should be a continuous  and online process. But, technical challenges prevent us from computing models over large amounts  streaming data in the presence of environment drift and concept drift. Today, we try to solve this problem  with incremental mining and offline model updating to maintain accurate modeling of the current data  stream. Information technology challenges are being addressed by developing in-memory databases,  high-density memories, and large storage capacities, all supported by high-performance computing  infrastructure. 
Mining Sequence Data and Time Series Data 
Efficient classification, clustering, and forecasting of sequenced and time series data remain an open  challenge today. Time series data are often contaminated by noise, which can have a detrimental effect on  short-term and long-term prediction. Although noise may be filtered, using signal-processing techniques or  smoothening methods, lags in the filtered data may result. In a closed-loop environment, this can reduce the  accuracy of prediction, because we may end up overcompensating or underprovisioning the process itself.  In certain cases, lags can be corrected by differential predictors, but these may require a great deal of tuning  the model itself. Noise-canceling filters placed close to the data I/O block can be tuned to identify and clean  the noisy data before they are mined. 
Mining Complex Knowledge from Complex Data 
Complex data can exist in many forms and may require special techniques to extract the information  useful for making real-world decisions. For example, information may exist in a graphical form, requiring  methods for discovering graphs and structured patterns in large data. Another complexity may exist in  the form of non—independent-and-identically-distributed (non-iid) data objects that cannot be mined as  an independent single object. They may share relational structures with other data objects that should  be identified. 
State-of-the-art data mining methods for unstructured data lack the ability to incorporate domain  information and knowledge interface for the purpose of relating the results of data mining to real-world  scenarios. 
Distributed Data Mining and Mining Multi-Agent Data 
In a distributed data sensing environment, it can be challenging to discover distributed patterns and  correlate the data streamed through different probes. The goal is to minimize the amount of data exchange  and reduce the required communication bandwidth. Game-theoretic methodologies may be deployed to  tackle this challenge. 
Data Mining Process-Related Problems 
Autonomous data mining and cleaning operations can improve the efficiency of data mining dramatically.  Although we can process models and discover patterns at a fast rate, major costs are incurred by  preprocessing operations such as data integration and data cleaning. Reducing these costs through  automation can deliver a much greater payoff than attempting to further reduce the cost of model-building  and pattern-finding.
16 
Chapter 1 ■ Machine Learning 
Security, Privacy, and Data Integrity 
Ensuring users’ privacy while their data are being mined is critical. Assurance of the knowledge integrity of  collected input data and synthesized individual patterns is no less essential. 
Dealing with Nonstatic, Unbalanced, and Cost-Sensitive Data 
Data is dynamic and changing continually in different domains. Historical trials in data sampling and model  construction may be suboptimal. As you retrain a current model based on new training data, you may  experience a learning drift, owing to different selection biases. Such biases need to be corrected dynamically  for accurate prediction. 
Summary 
This chapter discussed the essentials of ML through key terminology, types of ML, and the top 10 data  mining and ML algorithms. Owing to the explosion of data on the World Wide Web, ML has found  widespread use in web search, advertising placement, credit scoring, stock market prediction, gene  sequence analysis, behavior analysis, smart coupons, drug development, weather forecasting, big data  analytics, and many more such applications. New uses for ML are being explored every day. Big data  analytics and graph analytics have become essential components of cloud-based business development.  The new field of data analytics and the applications of ML have also accelerated the development of  specialized hardware and accelerators to improve algorithmic performance, big data storage, and data  retrieval performance. 
References 
Agrawal, Rakesh, and Ramakrishnan Srikant. “Fast Algorithms for Mining Association Rules in Large  Databases.” In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB ’94),  September 12–15, 1994, Santiago de Chile, Chile, edited by Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo.  San Francisco: Morgan Kaufmann (1994): 487–499. 
Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression  Trees. Belmont, CA: Wadsworth, 1984. 
Carpenter, Gail A., and Stephen Grossberg. Pattern Recognition by Self-Organizing Neural Networks.  Massachusetts: Cambridge, MA: Massachusetts Institute of Technology Press, 1991. 
Freund, Yoav, and Robert E. Schapire. “A Decision-Theoretic Generalization of On-Line Learning and an  Application to Boosting.” Journal of Computer and System Sciences 55, no. 1 (1997): 119–139. 
Han, Jiawel, Jian Pei, and Yiwen Yin. “Mining Frequent Patterns without Candidate Generation.”  In SIGMOD/PODS ’00: ACM international Conference on Management of Data and Symposium on Principles  of Database Systems, Dallas, TX, USA, May 15–18, 2000, edited by Weidong Chen, Jeffrey Naughton,  Philip A. Bernstein. New York: ACM (2000): 1–12. 
Jang, J.-S. R. “ANFIS: Adaptive-Network-Based Fuzzy Inference System.” IEEE Transactions on Systems,  Man and Cybernetics 23, no. 3 (1993): 665–685. 
Kohavi, Ron, and Foster Provost. “Glossary of Terms.” Machine Learning 30, no. 2–3 (1998): 271–274.
17 
Chapter 1 ■ Machine Learning 
Lloyd, Stuart P. “Least Squares Quantization in PCM,” in special issue on quantization, IEEE Transactions on  Information Theory, IT-28, no. 2(1982): 129–137. 
Samuel, Arthur L. “Some Studies in Machine Learning Using the Game of Checkers,” IBM Journal of Research  and Development 44:1.2 (1959): 210–229. 
Turing, Alan M. “Computing machinery and intelligence.” Mind (1950): 433–460. 
Vapnik, Vladimir N. Statistical Learning Theory. New York: Wiley, 1998. 
Wu, Xindong, Vipin Kumar, Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda,  Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand,  and Dan Steinberg. “Top 10 Algorithms in Data Mining.” Knowledge and Information Systems 14 (2008): 1–37. 
Yang, Qiang, and Xindong Wu. “10 Challenging Problems in Data Mining Research.” International Journal of  Information Technology and Decision Making 5, no. 4 (2006): 597–604.
18 
Chapter 2 
Machine Learning and Knowledge  Discovery 
When you know a thing, to hold that you know it; and when you do not know a thing,  to allow that you do not know it—this is knowledge. 
—Confucius, The Analects 
The field of data mining has made significant advances in recent years. Because of its ability to solve  complex problems, data mining has been applied in diverse fields related to engineering, biological science,  social media, medicine, and business intelligence. The primary objective for most of the applications is  to characterize patterns in a complex stream of data. These patterns are then coupled with knowledge  discovery and decision making. In the Internet age, information gathering and dynamic analysis of  spatiotemporal data are key to innovation and developing better products and processes. When datasets  are large and complex, it becomes difficult to process and analyze patterns using traditional statistical  methods. Big data are data collected in volumes so large, and forms so complex and unstructured, that  they cannot be handled using standard database management systems, such as DBMS and RDBMS. The  emerging challenges associated with big data include dealing not only with increased volume, but also the  wide variety and complexity of the data streams that need to be extracted, transformed, analyzed, stored,  and visualized. Big data analysis uses inferential statistics to draw conclusions related to dependencies,  behaviors, and predictions from large sets of data with low information density that are subject to random  variations. Such systems are expected to model knowledge discovery in a format that produces reasonable  answers when applied across a wide range of situations. The characteristics of big data are as follows: 
•  Volume: A great quantity of data is generated. Detecting relevance and value within  this large volume is challenging. 
•  Variety: The range of data types and sources is wide. 
•  Velocity: The speed of data generation is fast. Reacting in a timely manner can be  demanding. 
•  Variability: Data flows can be highly inconsistent and difficult to manage, owing to  seasonal and event-driven peaks. 
•  Complexity: The data need to be linked, connected, and correlated to infer nonlinear  relationships and causal effects.
19 
Chapter 2 ■ Machine Learning and Knowledge Discovery 
Modern technological advancements have enabled the industry to make inroads into big data and big data  analytics. Affordable open source software infrastructure, faster processors, cheaper storage, virtualization, high  throughput connectivity, and development of unstructured data management tools, in conjunction with cloud  computing, have opened the door to high-quality information retrieval and faster analytics, enabling businesses  to reduce costs and time required to develop newer products with customized offerings. Big data and powerful analytics can be integrated to deliver valuable services, such as these: 
•  Failure root cause detection: The cost of unplanned shutdowns resulting from  unexpected failures can run into billions of dollars. Root cause analysis (RCA)  
identifies the factors determinative of the location, magnitude, timing, and nature  of past failures and learns to associate actions, conditions, and behaviors that can  prevent the recurrence of such failures. RCA transforms a reactive approach to  
failure mitigation into a proactive approach of solving problems before they occur  and avoids unnecessary escalation. 
•  Dynamic coupon system: A dynamic coupon system allows discount coupons to  be delivered in a very selective manner, corresponding to factors that maximize  
the strategic benefits to the product or service provider. Factors that regulate the  
delivery of the coupon to selected recipients are modeled on existing locality,  
assessed interest in a specific product, historical spending patterns, dynamic  
pricing, chronological visits to shopping locations, product browsing patterns, and  redemption of past coupons. Each of these factors is weighted and reanalyzed as a  function of competitive pressures, transforming behaviors, seasonal effects, external  factors, and dynamics of product maturity. A coupon is delivered in real time,  
according to the recipient’s profile, context, and location. The speed, precision, and  accuracy of coupon delivery to large numbers of mobile recipients are important  
considerations. 
•  Shopping behavior analysis: A manufacturer of a product is particularly interested in  the understanding the heat-map patterns of its competitors’ products on the store  floor. For example, a manufacturer of large-screen TVs would want to ascertain  
buyers’ interest in features offered by other TV manufacturers. This can only be  
analyzed by evaluating potential buyers’ movements and time spent in proximity  
to the competitors’ products on the floor. Such reports can be delivered to the  
manufacturer on an individual basis, in real time, or collectively, at regular intervals.  The reports may prompt manufacturers to deliver dynamic coupons to influence  
potential buyers who are still in the decision-making stage as well as help the  
manufacturer improve, remove, retain, or augment features, as gauged by buyers’  interest in the competitors’ products. 
•  Detecting fraudulent behavior: Various types of fraud related to insurance, health  care, credit cards, and identity theft cost consumers and businesses billions of  
dollars. Big data and smart analytics have paved the way for developing real-time  
solutions for identifying fraud and preventing it before it occurs. Smart analytics  
generate models that validate the patterns related to spending behavior, geolocation,  peak activity, and insurance claims. If a pattern cannot be validated, a corrective,  
preventive, or punitive action is initiated. The accuracy, precision, and velocity  
of such actions are critical to the success of isolating the fraudulent behavior. For  
instance, each transaction may evaluate up to 500 attributes, using one or more  
models in real time.
20 
Chapter 2 ■ Machine Learning and Knowledge Discovery 
•  Workload resource tuning and selection in datacenter: In a cloud service management  environment, service-level agreements (SLAs) define the expectation of quality of  
service (QoS) for managing performance loss in a given service-hosting environment  composed of a pool of computing resources. Typically, the complexity of resource  
interdependencies in a server system results in suboptimal behavior, leading to  
performance loss. A well-behaved model can anticipate demand patterns and  
proactively react to dynamic stresses in a timely and optimized manner. Dynamic  
characterization methods can synthesize a self-correcting workload fingerprint  
codebook that facilitates phase prediction to achieve continuous tuning through  
proactive workload allocation and load balancing. In other words, the codebook  
characterizes certain features, which are continually reevaluated to remodel workload  behavior to accommodate deviation from an anticipated output. It is possible,  
however, that the most current model in the codebook may not have been subjected  to newer or unidentified patterns. A new workload is hosted on a compute node  
(among thousands of potential nodes) in a manner that not only reduces the thermal  hot spots, but also improves performance by lowering the resource bottleneck. The  velocity of the analysis that results in optimal hosting of the workload in real time is  critical to the success of workload load allocation and balancing. 
Knowledge Discovery 
Knowledge extraction gathers information from structured and unstructured sources to construct a  knowledge database for identifying meaningful and useful patterns from underlying large and semantically  fuzzy datasets. Fuzzy datasets are sets whose elements have a degree of membership. Degree of membership is defined by a membership function that is valued between 0 and 1. 
The extracted knowledge is reused, in conjunction with source data, to produce an enumeration of  patterns that are added back to the knowledge base. The process of knowledge discovery involves programmatic  exploration of large volumes of data for patterns that can be enumerated as knowledge. The knowledge acquired  is presented as models to which specific queries can be made, as necessary. Knowledge discovery joins the  concepts of computer science and machine learning (such as databases and algorithms) with those of statistics  to solve user-oriented queries and issues. Knowledge can be described in different forms, such as classes of  actors, attribute association models, and dependencies. Knowledge discovery in big data uses core machine  algorithms that are designed for classification, clustering, dimensionality reduction, and collaborative filtering as  well as scalable distributed systems. This chapter discusses the classes of machine learning algorithms that are  useful when the dataset to be processed is very large for a single machine. 
Classification 
Classification is central to developing predictive analytics capable of replicating human decision making.  Classification algorithms work well for problems with well-defined boundaries in which inputs follow  a specific set of attributes and in which the output is categorical. Generally, the classification process  develops an archive of experiences entailing evaluation of new inputs by matching them with previously  observed patterns. If a pattern can be matched, the input is associated with the predefined predictive  behavioral pattern. If a pattern cannot be matched, it is quarantined for further evaluation to determine if  it is an undiscovered valid pattern or an unusual pattern. Machine-based classification algorithms follow  supervised-learning techniques, in which algorithms learn through examples (also called training sets) of  accurate decision making, using carefully prepared inputs. The two main steps involved in classification are  synthesizing a model, using a learning algorithm, and employing the model to categorize new data.
21 
Chapter 2 ■ Machine Learning and Knowledge Discovery 
Clustering 
Clustering is a process of knowledge discovery that groups items from a given collection, based on similar  attributes (or characteristics). Members of the same cluster share similar characteristics, relative to those  belonging to different clusters. Generally, clustering involves an iterative algorithm of trial and error that  
operates on an assumption of similarity (or dissimilarity) and that stops when a termination criterion is  satisfied. The challenge is to find a function that measures the degree of similarity between two items  (or data points) as a numerical value. The parameters for clustering—such as the clustering algorithm, the  distance function, the density threshold, and the number of clusters—depend on the applications and the  individual dataset. 
Dimensionality Reduction 
Dimensionality reduction is the process of reducing random variables through feature selection and  feature extraction. Dimensionality reduction allows shorter training times and enhanced generalization  and reduces overfitting. Feature selection is the process of synthesizing a subset of the original variables for  model construction by eliminating redundant or irrelevant features. Feature extraction, in contrast, is the  process of transforming the high-dimensional space to a space of fewer dimensions by combining attributes. 
Collaborative Filtering 
Collaborative filtering (CF) is the process of filtering for information or patterns, using collaborative  methods between multiple data sources. CF explores an area of interest by gathering preferences from  many users with similar interests and making recommendations based on those preferences. CF algorithms  are expected to make satisfactory recommendations in a short period of time, despite very sparse data,  increasing numbers of users and items, synonymy, data noise, and privacy issues. 
Machine learning performs predictive analysis, based on established properties learned from the  training data (models). Machine learning assists in exploring useful knowledge or previously unknown  knowledge by matching new information with historical information that exists in the form of patterns.  These patterns are used to filter out new information or patterns. Once this new information is validated  against a set of linked behavioral patterns, it is integrated into the existing knowledge database. The new  information may also correct existing models by acting as additional training data. The following sections  look at various machine learning algorithms employed in knowledge discovery, in relation to clustering,  classification, dimensionality reduction, and collaborative filtering. 
Machine Learning: Classification Algorithms 
Logistic Regression 
Logistic regression is a probabilistic statistical classification model that predicts the probability of the occurrence  of an event. Logistic regression models the relationship between a categorical dependent variable X and a  dichotomous categorical outcome or feature Y. The logistic function can be expressed as 
β β 
0 1 
 P Y X eeXX ( | ) . = +++ 
0 1 1 (2-1)
β β 
22 
Chapter 2 ■ Machine Learning and Knowledge Discovery 
The logistic function may be rewritten and transformed as the inverse of the logistic function—called  logit or log-odds—which is the key to generating the coefficients of the logistic regression, 
⎛ 
⎝⎜ ⎞⎠⎟ = + 
 logit P( ( | )) ln ( | ) 
Y X . P Y X 
1 β β 0 1 (2-2) 
P Y X = X − 
( | ) 
As depicted in Figure 2-1, the logistic function can receive a range of input values (β0+β1X) between  negative infinity and positive infinity, and the output (P(Y|X) is constrained to values between 0 and 1. 
Figure 2-1. The logistic function 
The logit transform of P(Y|X) provides a dynamic range for linear regression and can be converted  back into odds. The logistic regression method fits a regression curve, using the regression coefficients β0 and β1, as shown in Equation 2-1, where the output response is a binary (dichotomous) variable, and X is  numerical. Because the logistic function curve is nonlinear, the logit transform (see Equation 2-2) is used to  perform linear regression, in which P(Y |X) is the probability of success (Y) for a given value of X. Using the  generalized linear model, an estimated logistic regression equation can be formulated as 
n 
1 1 2 3 0 ∑
 β β (2-3) 
 logit( ( P Y | , X X , ) X Xn k ) . Xk 
= = + = 
k 
1 
The coefficients β0 and βk (k = 1, 2, ..., n) are estimated, using maximum likelihood estimation (MLE)  to model the probability that the dependent variable Y will take on a value of 1 for given values of  Xk (k = 1, 2, ..., n). 
Logistic regression is widely used in areas in which the outcome is presented in a binary format. For  example, to predict blood cholesterol based on body mass index (BMI), you would use linear regression,  because the outcome is continuous. If you needed to predict the odds of being diabetic based on BMI, you  would use logistic regression, because the outcome is binary.
23 
Chapter 2 ■ Machine Learning and Knowledge Discovery 
Random Forest 
Random forest (Breiman 2001) is an ensemble learning approach for classification, in which “weak learners”  collaborate to form “strong learners,” using a large collection of decorrelated decision trees (the random  forest). Instead of developing a solution based on the output of a single deep tree, however, random forest  aggregates the output from a number of shallow trees, forming an additional layer to bagging. Bagging constructs n predictors, using independent successive trees, by bootstrapping samples of the dataset. The  n predictors are combined to solve a classification or estimation problem through averaging. Although  individual classifiers are weak learners, all the classifiers combined form a strong learner. Whereas single  decision trees experience high variance and high bias, random forest averages multiple decision trees to  improve estimation performance. A decision tree, in ensemble terms, represents a weak classifier. The term  forest denotes the use of a number of decision trees to make a classification decision. The random forest algorithm can be summarized as follows: 
1. To construct B trees, select n bootstrap samples from the original dataset. 
2. For each bootstrap sample, grow a classification or regression tree. 
3. At each node of the tree: 
– m predictor variables (or subset of features) are selected at random from all the  predictor variables (random subspace). 
– The predictor variable that provides the best split performs the binary split on  that node. 
– The next node randomly selects another set of m variables from all predictor  
variables and performs the preceding step. 
4. Given a new dataset to be classified, take the majority vote of all the B subtrees. 
By averaging across the ensemble of trees, you can reduce the variance of the final estimation. Random  forest offers good accuracy and runs efficiently on large datasets. It is an effective method for estimating  missing data and maintains accuracy, even if a large portion of the data is missing. Additionally, random  forest can estimate the relative importance of a variable for classification. 
Hidden Markov Model 
A hidden Markov model (HMM) is a doubly stochastic process, in which the system being modeled is a  Markov process with unobserved (hidden) states. Although the underlying stochastic process is hidden  and not directly observable, it can be seen through another set of stochastic processes that produces the  sequence of observed symbols. In traditional Markov models, states are visible to an observer, and state  transitions are parameterized, using transition probabilities. Each state has a probability distribution over  output emissions (observed variables). HMM-based approaches correlate the system observations and state  transitions to predict the most probable state sequence. The states of the HMM can only be inferred from  the observed emissions—hence, the use of the term hidden. The sequence of output emissions generated  by an HMM is used to estimate the sequence of states. HMMs are generative models, in which the joint  distribution of observations and hidden states is modeled. To define a hidden Markov model, the following  attributes have to be specified (see Figure 2-2): 
•  Set of states: {S1,S2...,Sn} 
•  Sequence of states: Q=q1,q2,...,qt 
•  Markov chain property: P P ( | q S q S , , q S , ) q S ( | q S q S ) t j + − 1 1 = =t i t k = =  0 0 = = t j +1 t i =
24 
Chapter 2 ■ Machine Learning and Knowledge Discovery 
•  Set of observations: O={o1,o2,o3,...,oM} 
•  Transition probability matrix: P = { } p p ij ij , ( = = P q S t j +1 | ) q S t i = 
•  Emission probability matrix: B = = { ( bk bk j j )}, ( ) ( P x o t k = = | ) q S t j 
•  Initial probability matrix: π π = = { }, ( π = ) i i P q S 1 i 
•  HMM: M = (A,B,π) 
Figure 2-2. Attributes of an HMM 
The three fundamental problems addressed by HMMs can be summarized as follows: 
•  Model evaluation: Evaluate the likelihood of a sequence of observations for a given  HMM (M=(A,B,π)). 
•  Path decoding: Evaluate the optimal sequence of model states (Q) (hidden states) for  a given sequence of observations and HMM model M=(A,B,π). 
•  Model training: Determine the set of model parameters that best accounts for the  observed signal. 
HMMs are especially known for their application in temporal pattern recognition, such as speech,  handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges, and  bioinformatics. For further details on the HMM, see Chapter 5. 
Multilayer Perceptron 
A multilayer perceptron (MLP) is a feedforward network of simple neurons that maps sets of input data onto  a set of outputs. An MLP comprises multiple layers of nodes fully connected by directed graph, in which  each node (except input nodes) is a neuron with a nonlinear activation function. 
The fundamental component of an MLP is the neuron. In an MLP a pair of neurons is connected in two  adjacent layers, using weighted edges. As illustrated in Figure 2-3, an MLP comprises at least three layers of  neurons, including one input layer, one or more hidden layers, and one output layer. The number of input 
25 
Chapter 2 ■ Machine Learning and Knowledge Discovery 
neurons depends on the dimensions of the input features; the number of output neurons is determined  by the number of classes. The number of hidden layers and the number of neurons in each hidden layer  depend on the type of problem being solved. Fewer neurons result in inefficient learning; a larger number  of neurons results in inefficient generalization. An MLP uses a supervised-learning technique called  backpropagation for training the network. In its simple instantiation the perceptron computes an output y by  processing a linear combination of weighted real-valued inputs through a nonlinear activation function, 
n 
= +  = 
ϕ ∑
 y wi i x b 
, (2-4) 
i 
1 
where w represents the weights vector, x is the input vector, b is the bias, and ϕ is the activation function.  Generally, MLP systems choose the logistic sigmoid function 1/(1+e–x) or the hyperbolic tangent tanh(x) as  the activation functions. These functions offer statistical convenience, because they are linear near the origin  and saturate quickly when moved away from the origin. 
Figure 2-3. The MLP is fed the input features to the input layer and gets the result from the output layer; the  results are calculated in a feedforward approach from the input layer to the output layer 
The MLP learning process adjusts the weights of the hidden layer, such that the output error is reduced.  Starting with the random weights, MLP feeds forward the input pattern signals through the network and  backpropagates the error signal, starting at the output. The backpropagating error signal is made up of of the  difference between actual (On(t)) and desired (Tn) values. Error function may be summarized as  E O t T O t n n n ( ( )) = − ( ). (2-5) 
The goal of the learning process is to minimize the error function. To find the minimum value of the  error function, differentiate it, with respect to the weight matrix. The learning algorithm comprises the  following steps: 
1. Initialize random weights within the interval [1, –1]. 
2. Send an input pattern to the network.
26 
Chapter 2 ■ Machine Learning and Knowledge Discovery 
3. Calculate the output of the network. 
4. For each node n in the output layer: 
a. Calculate the error on output node n: E(On(t))=Tn–On(t). 
b. Add E(On(t)) to all the weights that connect to node n. 
5. Repeat step 2. 
To influence the convergence rate and thereby reduce the step sizes at which weights undergo an  adaptive change, a learning parameter η (< 1) is used. The i-th weight connected to j-th output can be  updated by the following rule: 
 w t w t E O t ij ij j ( ) + − 1 ( ) =η ( ( )). (2-6) 
Equation 2-6 represents an iterative weight adaptation, in which a fraction of output error at iteration  (t + 1) is added to the existing weight from iteration t. 
MLPs are commonly used for supervised-learning pattern recognition processes. There is renewed  interest in MLP backpropagation networks, owing to the successes of deep learning. Deep learning is an  approach for effectively training an MLP, using multiple hidden layers. With modern advancements in  silicon technology, deep learning is being developed to unlock the enormous big data analytics potential in  areas in which highly varying functions can be represented by deep architecture. 
Machine Learning: Clustering Algorithms 
k-Means Clustering 
k-means clustering is an unsupervised-learning algorithm of vector quantization that partitions  n observations into k clusters. The algorithm defines k centroids, which act as prototypes for their respective  clusters. Each object is assigned to a cluster with the nearest centroid when measured with a specific  distance metric. The step of assigning objects to clusters is complete when all the objects have been applied  to one of the k clusters. The process is repeated by recalculating centroids, based on previous S={S1,S1,...,Sk}  allocations, and reassigning objects to the nearest new centroids. The process continues until there is  no movement of centroids of any k cluster. Generally, a k-means clustering algorithm classifies objects  according to their features into k groups (or clusters) by minimizing the sum of squares of the distances  between the object data and the cluster centroid. 
For a given set of d-dimensional observations vectors (x1,x2,...,xn), k-means clustering partitions  n observations into k(≤n) cluster sets so as to minimize the sum of squares, 
k 
2 
∑∑ (2-7) 
 argmin || || , 
S 
where μi is the mean of the points in Si. 
=1 x∈ 
i S i 
x − μi 
The k-means clustering algorithm is easy to implement on large datasets. It has found many uses  in areas such as market segmentation, computer vision, profiling applications and workloads, optical  character recognition, and speech synthesis. The algorithm is often used as the preprocessing step for other  algorithms in order to find the initial configuration.
27 
Chapter 2 ■ Machine Learning and Knowledge Discovery 
Fuzzy k-Means (Fuzzy c-Means) 
Fuzzy k-means (also called fuzzy c-means [FCM]) (Dunn 1973; Bezdek 1981) is an extension of the k-means  algorithm that synthesizes soft clusters, in which an object can belong to more than one cluster with a  certain probability. This algorithm provides increased flexibility in assigning data objects to clusters and  allowing the data objects to maintain partial membership in multiple neighboring clusters. FCM uses the  fuzzification parameter m in range [1, n], which determines the degree of fuzziness in the clusters. Whereas  m=1 signifies crisp clustering, m>1 suggests a higher degree of fuzziness among data objects in decision  space. The FCM algorithm is based on minimization of the objective function 
C 
∑∑ ( ) || || ,2 
 J w x c x m kmj 
= − 
 (2-8) 
x 
j 
= 
1 
where x is the d-dimensional data object, cj is the d-dimensional centroid of the cluster j (see Equation 2-10),  and wk(x) is the degree of membership of x in the cluster k dependent on the fuzzification parameter m,  which controls the weighting accorded the closest centroid: 
1 
 w xc x 
k 
( ) 
. /( ) =− ⎜⎜⎞⎠⎟⎟− 
2 1 
 (2-9) 
C 
∑ 
⎛ 
|| || 
k 
|| || c x 
m 
j 
= 
1 
⎝ 
j 
− 
With FCM the d-dimensional centroid of a kth cluster (ck) is the mean of all points, weighted by their  degree of membership to that cluster: 
∑ 
m 
w x x 
k 
x 
 c = 
( ) 
( ) . (2-10) 
w x k 
m 
∑ x 
k 
The c-means clustering algorithm synthesizes cluster centers and the degree to which data objects are  assigned to them. This does not translate into hard membership functions. FCM is used in image processing  for clustering objects in an image. 
Streaming k-Means 
Streaming k-means is a two-step algorithm, consisting of a streaming step and a ball k-means step. A streaming  step traverses the data objects of size n in one pass and generates an optimal number of centroids—which  amounts to klog(n) clusters, where k is expected number of clusters. The attributes of these clusters are passed  on to the ball k-means step, which reduces the number of clusters to k. 
Streaming Step 
A streaming-step algorithm steps through the data objects one at a time and makes a decision to either add  the data object to an existing cluster or create a new one. If the distance between the centroid of the cluster  and a data point is smaller than the distance cutoff threshold, the algorithm adds the data to an existing  cluster or creates a new cluster with a probability of d/(distancecutoff). If the distance exceeds the cutoff, the  algorithm creates a new cluster with a new centroid. As more data objects are processed, the centroids of  the existing clusters may change their position. This process continues to add new clusters until the number  of existing clusters reaches a cluster cutoff limit. The number of clusters can be reduced by increasing the  distance cutoff threshold. This step is mainly used for dimensionality reduction. The output of this step is a  reduced dataset in the form of multiple clusters that are proxies for a large amount of the original data.
28 
Chapter 2 ■ Machine Learning and Knowledge Discovery 
Ball K-Means Step 
A ball k-means algorithm consumes the output of a streaming step (X = set of centroids > k) and performs  multiple independent runs to synthesize k clusters by selecting the best solution. Each run selects k centroids, using a seeding mechanism, and runs the ball k-means algorithm iteratively to refine the solution. 
The seeding process may invoke the k-means++ algorithm for optimal spreading of k clusters.  The k-means++ seeding algorithm is summarized as follows: 
1. Choose center c1 uniformly at random from X. 
2. Select a new center ci by choosing x∈X with probability, P(x), and add it to X , 2 
P x D xD i 
( ) ( )( ) = , 
∑ i X 
∈ 
2 
where D(x) is the distance between x and the nearest center that has already been chosen. 
3. Repeat step 2 until k centers c c 1 2 c X k , ,, ∈ are selected. 
4. Randomly pick two centers c c ˆ1 2 ,ˆ ∈X with probability proportional to cnor ˆ m c || ˆ ˆc || 1 22 − . 5. For each cˆi , create a ball of radius c c ˆ|| ˆ ˆc || / 1 2 − 3 around it. 
6. Recompute the new centroids c c 1 2 , by using the elements of X contained within  the ball. 
This algorithm is particularly useful in applications with a large number of data objects. The algorithm  reduces the dimensionality of the original dataset by employing the streaming operation and replacing that  data with a reduced proxy data composed of k·log(n) centroids of the original data. The reduced data act as  input to the ball k-means algorithm, which synthesizes and refines k centroids for their respective clusters. 
Machine Learning: Dimensionality Reduction 
Machine learning works through a large number of features to train most regression or classification problems.  This compounds the complexity, raises the computational requirement, and increases the time needed to  converge to a solution. A useful approach for mitigating these problems is to reduce the dimensional space of  the original features by synthesizing a lower-dimensional space. In this new, lower-dimensional space the most  important features are retained, hidden correlations between features are exposed, and unimportant features are  discarded. One of the simplest, most straightforward, and least supervised feature-reduction approaches involves  variants of matrix decomposition: singular value decomposition, eigen decomposition, and nonnegative matrix  factorization. The following sections consider some of the methods commonly used in statistical dimensionality  reduction. 
Singular Value Decomposition 
Singular value decomposition (SVD) performs matrix analysis to synthesize low-dimensional representation  of a high-dimensional matrix. SVD assists in eliminating less important parts of matrix representation,  leading to approximate representation with the desired number of dimensions. This helps in creating  a smaller representation of a matrix that closely resembles the original. SVD is useful in dimensionality  reduction, owing to the following characteristics: 
•  SVD transforms correlated variables into a set of uncorrelated ones that exposes  corresponding relationships between the data items. 
•  SVD identifies dimensions along which data points exhibit the most variation.
29 
Chapter 2 ■ Machine Learning and Knowledge Discovery 
Once you identify the points with distinct variations, you can approximate original data points with  fewer dimensions. You can define thresholds below which variations can be ignored, thereby leading to  a highly reduced dataset without degradation of the information related to inherent relationships and  interests within data points. 
If M is an m × n matrix , then you can break it down into the product of three matrices U, ∑, and VT with  the following characteristics: 
•  U is a column-orthogonal matrix. The columns of U are orthonormal  
eigenvectors of MM T. 
•  V T is a transpose of orthogonal matrix V. The columns of V are orthonormal  
eigenvectors of M TM. 
•  ∑ is a diagonal matrix, where all elements except diagonal are 0. ∑ contains square  roots of eigenvalues from U or V, in descending order. 
In its exact form, M can be rewritten as 
 M U VT = ∑ . (2-11) 
In the process of dimensionality reduction, you synthesize U and V, such that they contain elements  accounted for in the original data, in descending order of variation. You may delete elements representing  dimensions that do not exhibit meaningful variation. This can be done by setting the smallest eigenvalue  to 0. Equation 2-11 can be rewritten in its best rank-l approximate form as 
l 
 ˆM u v , , , i i iT 
= ⋅ l ∑ λ λ ⋅ ≥1 2 λ λ ≥ (2-12) 
i 
where ui and vi are the ith columns of U and V, respectively, and λi is the ith element of the diagonal matrix ∑. 
Principal Component Analysis 
When you have a swarm of points in space, the coordinates and axes you use to represent such points are  arbitrary. The points have certain variances, relative to the direction of axes chosen, indicating the spread  around the mean value in that direction. In a two-dimensional system the model is constrained by the  perpendicularity of the second axis to the first axis. But, in three-dimensional cases and higher, you can  position the nth axis perpendicular to the plane constructed by any two axes. The model is constrained by  the position of the first axis, which is positioned in the direction with the highest variance. This results in a  new feature space that compresses the swarm of points into the axes of high variance. You may select the  axes with higher variances and eliminate the axes with lower variances. Figure 2-4 illustrates the new feature  space, reduced from a dataset with 160 featuresto 59 components (axes). Each component is associated  with a certain percentage of variance, relative to other components. The first component has the highest  variance, followed by second component, and so on.
30 
Chapter 2 ■ Machine Learning and Knowledge Discovery 
Figure 2-4. The percentage of variance of a principal component transform of a dataset with 160 features  reduced to 59 components
Principal component analysis (PCA) is a widely used analytic technique that identifies patterns to  reduce the dimensions of the dataset without significant loss of information. The goal of PCA is to project a  high-dimensional feature space into a smaller subset to decrease computational cost. PCA computes new  features, called principal components (PCs), which are uncorrelated linear combinations of the original  features projected in the direction of greater variability. The key is to map the set of features into a matrix M and synthesize the eigenvalues and eigenvectors for MM T or M TM. Eigenvectors facilitate simpler solutions  to problems that can be modeled using linear transformations along axes by stretching, compressing, or  flipping. Eigenvalues provide a factor (length and magnitude of eigenvectors) whereby such transformation  occurs. Eigenvectors with larger eigenvalues are selected in the new feature space because they enclose  more information than eigenvectors with lower eigenvalues for a data distribution. The first PC has the  greatest possible variance (i.e., the largest eigenvalues) compared with the next PC (uncorrelated, relative to  the first PC), which is computed under the constraint of being orthogonal to the first component. Essentially,  the ith PC is the linear combination of the maximum variance that is uncorrelated with all previous PCs. PCA comprises the following steps: 
1. Compute the d-dimensional mean of the original dataset. 
2. Compute the covariance matrix of the features. 
3. Compute the eigenvectors and eigenvalues of the covariance matrix. 
4. Sort the eigenvectors by decreasing eigenvalue. 
5. Choose k eigenvectors with the largest eigenvalues. 
Eigenvector values represent the contribution of each variable to the PC axis. PCs are oriented in the  direction of maximum variance in m-dimensional points. 
31 
Chapter 2 ■ Machine Learning and Knowledge Discovery 
PCA is one of the most widely used multivariate methods for uncovering new, informative, uncorrelated  features; it reduces dimensionality by rejecting low-variance features and is useful in reducing the  computational requirements for classification and regression analysis. 
Lanczos Algorithm 
The Lanczos algorithm is a low-cost eigen-decomposition technique identical to truncated SVD, except that  it does not explicitly compute singular values/vectors of the matrix. The Lanczos algorithm uses a small  number of Lanczos vectors that are eigenvectors of MTM or MMT, where M is a symmetrical n × n matrix. 
Lanczos starts by seeding an arbitrary nonzero vector x0 with cardinality equal to the number of  columns of matrix M. The mth (m< 0 
•  Hyperbolic tangent (sigmoid): K x u x uT ( ) , t = + anh(β γ ) 
2 
•  Gaussian radial basis function (RBF): K x u x u ( ) , e = − xp −  
 
 
2 σ 
 
 
•  Laplacian radial basis function: K x u x u ( ) , e = − xp −  
 
 σ 
•  Randomized blocks analysis of variance (ANOVA RB) kernel: n 
k k d ( ) , ( = − ( ) − ) 
∑
2 exp σ 
K x u x u 
k 
= 
1 
•  Linear spline kernel in 1D: 
1 
K x u x u ( ) , . = + .min( ) , ( − min( , ) ( , ) ) + 2 3 x u x u x u min x u 1 + 
2 
3 
Kernel selection is heavily dependent on the data specifics. For instance, the linear kernel—the simplest  of all—is useful in large sparse data vectors. However, it ranks behind the polynomial kernel, which avoids  zeroing the Hessian. The polynomial kernel is widely used in image processing, whereas the ANOVA RB  kernel is usually reserved for regression tasks. The Gaussian and Laplace RBFs are general-purpose kernels  that are mostly applied in the absence of prior knowledge. A kernel matrix that ends up being diagonal  indicates that the feature space is redundant and that another kernel should be tried after feature reduction. 
Note that when kernels are used to transform the feature vectors from input space to kernel space for  linearly nonseparable datasets, the kernel matrix computation requires massive memory and computational  resources, for big data.
48 
Chapter 3 ■ Support Vector Machines for Classification 
Figure 3-6 displays the two-dimensional exclusive OR (XOR) data, a linearly nonseparable distribution  in input space (upper-left) as well as in the feature space. In the latter, 16 points (for different sets) are  created for the four inputs when the kernel is applied. The choice of the Gaussian RBF kernel-smoothing  parameter σ2 affects the distribution of the data in the kernel space. Because the choice of parameter value  is essential for transforming the data from a linearly nonseparable space to a linearly separable one, grid  searches are performed to find the most suitable values. 
Figure 3-6. Two-dimensional XOR data, from input space to kernel space
The primal formulation of the kernel SVM is 
N 
w w C i ξξ 12 1 
T 
∑ 
min , 
+ 
w, 
= 
i 
subject to y wi x b T ( ϕ ξ ( )i i + ) ≥ −1 and ξi ≥ ∀ 0, i, where ϕ(xi) is such that K x x x x i j i j ( , . ) = ϕ ϕ ( ) ( ) . 
Again, the SVM solution should satisfy the KKT conditions, as follows: 1. w y x iN 
N 
= ∑ i i i =1λ ϕ( ) 
2. ∑ = 
i = 
i i y 1λ 0 
3. C i − − ∝ λ i i = = 0 1, , 2 ..., N 
4. λ ϕ ξ i iTi i y w( ( ) x b + )− + i N , ,...,   1 0 = =1 2 5. ∝ ξi i = = 0 1 i N , , 2 ..., 
6. ∝ ξ i i , , > = 0 1 i N 2,..., 
49 
Chapter 3 ■ Support Vector Machines for Classification 
As mentioned earlier, the dual formulation of this problem is more efficient to solve and is used in most  
implementations of SVM: 
N 
N 
1 
∑ ∑ −   1 1 
max , λ λ i iλj i j i j 
λ 
i 
subject to 
2 
= = i 
i ∑ y 
y y x x 
i 
λi 
■ Note For a dataset size of N, the kernel matrix has N2 entries. Therefore, as N increases, computing the  kernel matrix becomes inefficient and even unfeasible, making SVM impractical to solve. However, several  algorithms have alleviated this problem by breaking the optimization problem into a number of smaller  problems. 
Multiclass SVM  
The early extensions of the SVM binary classification to the multiclass case were the work of Weston and  Watkins (1999) and Platt (2000). Researchers devised various strategies to address the multiclassification  problem, including one-versus-the-rest, pair-wise classification, and the multiclassification formulation,  discussed in turn here. 
•  One-versus-the-rest (also called one-against-all [OAA]) is probably the earliest SVM  multiclass implementation and is one of the most commonly used multiclass SVMs.  It constructs c binary SVM classifiers, where c is the number of classes. Each classifier  distinguishes one class from all the others, which reduces the case to a two-class  problem. There are c decision functions: w x b w x b Ti cT 
1 1 ϕ ϕ ( ) + ; ...; ( )i c + . The initial  
formulation of the OAA method assigns a data point to a certain class if and only if  that class has accepted it, while all other classes have not, which leaves undecided  regions in the feature space when more than one class accepts it or when all  
classes reject it. Vapnik (1998) suggested assigning data points to the class with the  highest value, regardless of sign. The final label output is given to the class that has  demonstrated the highest output value: 
class of x max w i c i x b Ti arg ( ( ) ). ≡ + =1,..., ϕ 
•  Proposed by Knerr, Personnaz, and Dreyfus (1990), and first adopted in SVM by  Friedman (1996) and Kressel (1999), pair-wise classification (also called one-against one [OAO]) builds c(c – 1)/2 binary SVMs, each of which is used to discriminate two  of the c classes only and requires evaluation of (c – 1) SVM classifiers. For training  data from the kth and jth classes, the constraints for (x y t t , ) are 
w b kjTkj kjt ϕ ξ x ( ( )t + ) ≥ −1 , for y k t = , 
w b kjTkj kjt ϕ ξ x ( ( )t + ) ≤ −1 , + for y j t = , 
ξkjt ≥ 0.
50 
Chapter 3 ■ Support Vector Machines for Classification 
•  The multiclassification objective function probably has the most compact form, as it  optimizes the problem in a single step. The decision function is the same as that of  the OAA technique. The multiclassification objective function constructs c two-class  rules, and c decision functions solve the following constraints: 
w b y w b Ty mTm im 
ϕ ϕ x x ξ i i ( ) + ≥ ( ) + + 2 − , ξim ≥ 0 . 
i i 
For reasonable dataset sizes, the accuracy of the different multiclassification techniques is comparable.  For any particular problem, selection of the optimal approach depends partly on the required accuracy and  partly on the development and training time goals. For example, from a computational cost perspective,  OAA and OAO are quite different. Let’s say, for instance, that there are c different classes of N instances and  that T(N1) represents the time for learning one binary classifier. Using N1 examples, OAA will learn in cN3,  whereas OAO will require 4(c – 1)N3/ c2. 
Although the SVM parametric model allows for adjustments when constructing the discriminant  function, for multiclass problems these parameters do not always fit across the entire dataset. For this  reason, it is sometimes preferable to partition the data into subgroups with similar features and derive the  classifier parameters separately. This process results in a multistage SVM (MSVM), or hierarchical SVM,  which can produce greater generalization accuracy and reduce the likelihood of overfitting, as shown by  Stockman (2010). A graphical representation of a single SVM and an MSVM is presented in Figure 3-7. 
C1 
SVM 
. . . 
C1 C2 
Cn 
MSVM 
C2, C3, … ,Cn 
MSVM 
C1 
C3...Cn 
Single Multiclass SVM Multistage SVM 
Figure 3-7. Single multiclass SVM and MSVM flows 
With a multistage approach, different kernel and tuning parameters can be optimized for each stage  separately. The first-stage SVM can be trained to distinguish between a single class and the rest of the classes.  At the next stage, SVM can tune a different kernel to further distinguish among the remaining classes. Thus,  there will be a binary classifier, with one decision function to implement at each stage. 
Hierarchical SVM as an alternative for multiclass SVM has merit in terms of overall model error. SVM  accuracy approaches the Bayes optimal rule as an appropriate kernel choice and in smoothing metaparameter  M 
values. Also, by definition, for a multiclass problem with M ci classes, and an input vector x,  
∑ ( ) = 
1| ,  
i 
= 
1 
P ci x 
because classes should cover all the search space. When the classes being considered are not equiprobable,  the maximum P c xi ( ) | has to be greater than 1/M; otherwise, the sum will be less than 1. Let’s say, for  example, that the probability of correct classification is 
M 
M 
∑ ∑ ∫ 
= ∈ = ( ) ( ) P P x R c P c p x c dx c ( , ) , |
i 
i i 
i 
i 
i 
= = 
1 1 
R 
i 
51 
Chapter 3 ■ Support Vector Machines for Classification 
where Ri is the region of the feature space in which the decision is in favor of ci. Because of the definition of  region Ri, 
M 
M 
1 | , 
∑∫ ∑∫ 
= ( ) ( ) ≥ ( ) P P x c p x dxM p x dx ci 
i 
= = 
1 1 
i 
R 
i i R ➩ PM c ≥ 1; 
hence, the probability of multiclassification error is 
P PMMM e c = − ≤ − = − 1 11 1. 
As the number of classes M increases, Pe increases for a multiclassification flat formulation.  For a hierarchical classification the multiclassification task is reduced at each stage to a binary one,  with Pe = 12. Thus, the cumulative error for the hierarchical task is expected to converge asymptotically to  a lower value than with a flat multiclassification task. 
SVM with Imbalanced Datasets 
In many real-life applications and nonsynthetic datasets, the data are imbalanced; that is, the important  class—usually referred to as the minority class—has many fewer samples than the other class, usually  referred to as the majority class. Class imbalance presents a major challenge for classification algorithms  whenever the risk loss for the minority class is higher than for the majority class. When the minority data  points are more important than the majority ones, and the main goal is to classify those minority data points  correctly, standard machine learning that is geared toward optimized overall accuracy is not ideal; it will  result in hyperplanes that favor the majority class and thus generalize poorly. 
When dealing with imbalanced datasets, overall accuracy is a biased measure of classifier goodness.  Instead, the confusion matrix, and the information on true positive (TP) and false positive (FP) that it holds,  are a better indication of classifier performance. Referred to as matching matrix in unsupervised learning,  and as error matrix or contingency matrix in fields other than machine learning, a confusion matrix provides  a visual representation of actual versus predicted class accuracies. 
ACCURACY METRICS  
A confusion matrix is as follows: 
Predicted/Actual Class Positive Class Negative Class 
Positive Class TP FP 
Negative Class FN TN 
Accuracy is the number of data points correctly classified by the classification algorithm: AccuracyTP TN 
TP TN FN FP = + 
+ + + . 
The positive class is the class that is of utmost importance to the designer and usually is the  minority class.
52 
Chapter 3 ■ Support Vector Machines for Classification 
True positive (TP) (also called recall in some fields) is the number of data points correctly classified from  the positive class. 
False positive (FP) is the number of data points predicted to be in the positive class but in fact belonging  to the negative class. 
True negative (TN) is the number of data points correctly classified from the negative class. 
False negative (FN) is the number of data points predicted to be in the negative class but in fact  belonging to the positive class. 
Sensitivity (also called true positive rate [TPR] or recall rate [RR]) is a measure of how well a  classification algorithm classifies data points in the positive class: 
SensitivityTP 
TP FN = + . 
Specificity (also called true negative rate [TNR ]) is a measure of how well a classification algorithm  classifies data points in the negative class: 
SpecificityTN 
TN FP = + . 
Receiver operating characteristic (ROC) curves offer another useful graphical representation for  classifiers operating on imbalanced datasets. Originally developed during World War II by radar and  electrical engineers for communication purposes and target prediction, ROC is also embraced by diagnostic  decision making. Fawcett (2006) provided a comprehensive introduction to ROC analysis, highlighting  common misconceptions. 
The original SVM formulation did not account for class imbalance during its supervised learning phase.  But, follow-up research proposed modifications to the SVM formulation for classifying imbalanced datasets. Previous work on SVM addressed class imbalance either by preprocessing the data or by proposing  algorithmic modification to the SVM formulation. Kubat (1997) recommended balancing a dataset by  randomly undersampling the majority class instead of oversampling the minority class. However, this results  in information loss for the majority class. Veropoulos, Campbell, and Cristianini (1999) introduced different  loss functions for the positive and negative classes to penalize the misclassification of minority data points.  Tax and Ruin (1999) solved the class imbalance by using the support vector data description (SVDD), which  aims at finding a sphere that encompasses the minority class and separates it from the outliers as optimally  as possible. Feng and Williams (1999) suggested general scaled SVM (GS-SVM), another variation of SVM,  which introduces a translation of the hyperplane after training the SVM. The translation distance is added  to the SVM formulation; translation distance is computed by projecting the data points on the normal vector  of the trained hyperplane and finding the distribution scales of the whole dataset (Das 2012). Chang and  Lin (2011) proposed weighted scatter degree SVM (WSD-SVM), which embeds the global information in the  GS-SVM by using the scatter of the data points and their weights, based on their location. Many efforts have been made to learn imbalanced data at the level of both the data and the algorithm.  Preprocessing the data before learning the classifier was done through oversampling of the minority class to  balance the class distribution by replication or undersampling of the larger class, which balances the data by  eliminating samples randomly from that class (Kotsiantis, Kanellopoulos, and Pintelas 2006). Tang et al. (2009)  recommended the granular SVM repetitive undersampling (GSVM-RU) algorithm, which, instead of using  random undersampling of the majority class to obtain a balanced dataset, uses SVM itself—the idea being to  form multiple majority information granules, from which local majority support vectors are extracted and then  aggregated with the minority class. Another resampling method for learning classifiers from imbalanced data  was suggested by Ou, Hung, and Oyang (2006) and Napierała, Stefanowski, and Wilk (2010). These authors  concluded that only when the data suffered severely from noise or borderline examples would their proposed 
53 
Chapter 3 ■ Support Vector Machines for Classification 
resampling methods outperform the known oversampling methods. The synthetic minority oversampling  technique (SMOTE) algorithm (Chawla et al. (2002) oversamples the minority class by introducing artificial  minority samples between a given minority data point and its nearest minority neighbors. Extensions of the  SMOTE algorithm have been developed, including one that works in the distance space (Koknar-Tezel and  Latecki 2010). Cost-sensitive methods for imbalanced data learning have also been used. These methods  define a cost matrix for misclassifying any data sample and fit the matrix into the classification algorithm  (He and Garcia 2009). 
Tax and Duin (2004) put forward the one-class SVM, which tends to learn from the minority class only.  The one-class SVM aims at estimating the probability density function, which gives a positive value for the  elements in the minority class and a negative value for everything else. 
By introducing a multiplicative factor z to the support vector of the minority class, Imam, Ting, and  Kamruzzaman (2006) posited that the bias of the learned SVM will be reduced automatically, without  providing any additional parameters and without invoking multiple SVM trainings. 
Akbani, Kwek, and Japkowicz (2004) proposed an algorithm based on a combination of the SMOTE  algorithm and the different error costs for the positive and negative classes. Wang and Japkowicz (2010)  also aggregated the different penalty factors as well as using an ensemble of SVM classifiers to improve  the error for a single classifier and treat the problem of the skewed learned SVM. In an attempt to improve  classification of imbalanced datasets using SVM standard formulation, Ajeeb, Nayal, and Awad (2013)  suggested a novel minority SVM (MinSVM), which, with the addition of one constraint to the SVM objective  function, separates boundaries that are closer to the majority class. Consequently, the minority data points  are favored, and the probability of being misclassified is smaller. 
Improving SVM Computational Requirements  
Despite the robustness and optimality of the original SVM formulation, SVMs do not scale well  computationally. Suffering from slow training convergence on large datasets, SVM online testing time can be  suboptimal; SVMs write the classifier hyperplane model as a sum of support vectors whose number cannot  be estimated ahead of time and may total as much as half the datasets. Thus, it is with larger datasets that  SVM fails to deliver efficiently, especially in the case of nonlinear classification. Large datasets impose heavy  computational time and storage requirements during training, sometimes rendering SVM even slower than  ANN, itself notorious for slow convergence. For this reason, support vector set cardinality may be a problem  when online prediction requires real-time performance on platforms with limited computational and power  supply capabilities, such as mobile devices. 
Many attempts have been made to speed up SVM. A survey related to SVM and its variants reveals a  dichotomy between speedup strategies. The first category of techniques applies to the training phase of the  SVM algorithm, which incurs a heftier computational cost in its search for the optimal separator. The intent of  these algorithms is to reduce the cardinality of the dataset and speed up the optimization solver. The second  category of techniques aims to accelerate the testing cycle. With the proliferation of power-conscious mobile  devices, and the ubiquity of computing pushed from the cloud to these terminals, reducing the SVM testing  cycle can be useful in applications in which computational resources are limited and real-time prediction  is necessary. For example, online prediction on mobile devices would greatly benefit from reducing the  computations required to perform a prediction. 
To reduce the computational complexity of the SVM optimization problem, Platt (1998) developed  the sequential minimal optimization (SMO) method, which divides the optimization problem into two  quadratic program (QP) problems. This decomposition relieves the algorithm of large memory requirements  and makes it feasible to train SVM on large datasets. Therefore, this algorithm grows alternately linearly and  quadratically, depending on dataset size. SMO speeds up the training phase only, with no control over the  number of support vectors or testing time. To achieve additional acceleration, many parallel implementations 
54 
Chapter 3 ■ Support Vector Machines for Classification 
of SMO (Zeng et al. 2008; Peng, Ma, and Hong 2009; Catanzaro et al. 2008; Alham et al. 2010; Cao et al. 2006)  were developed on various parallel programming platforms, including graphics processing unit (GPU)  (Catanzaro et al. 2008), Hadoop MapReduce (Alham et al. 2010), and message passing interface (MPI)  (Cao et al. 2006). 
Using the Cholesky factorization (Gill and Murray 1974), Fine (2002) approximated the kernel matrix by  employing a low-rank matrix that requires updates that scale linearly with the training set size. The matrix  is then fed to a QP solver to obtain an approximate solution to the SVM classification problem. Referred  to as the Cholesky product form QP, this approach showed significant training time reduction, with its  approximation of the optimal solution provided by SMO. However, if the training set contains redundant  features, or if the support vectors are scaled by a large value, this method fails to converge (Fine and  Scheinberg 2002). 
Instead of decomposing the optimization problem, Lee (2001a) reformulated the constraint  optimization as an unconstrained, smooth problem that can be solved using the Newton-Armijo  algorithm in quadratic time. This reformulation resulted in improved testing accuracy of the standard  SVM formulation (Vapnik 1999) on several databases (Lee 2001). Furthermore, Lee (2001) argued that this  reformulation allows random selection of a subset of vectors and forces creation of more support vectors,  without greatly affecting the prediction accuracy of the model. 
Margin vectors were identified by Kong and Wang (2010) by computing the self and the mutual center  distances in the feature space and eliminating the statistically insignificant points, based on the ratio and  center distance of those points. The training set was forced to be balanced, and results were compared with  those found using reduced SVM (RSVM) on three datasets from the University of California, Irvine, Machine  Learning Repository (Frank and Asuncion 2010). The authors found that the model resulted in better  generalization performance than with RSVM but that it required slightly more training time, owing to the  overhead of computing the ratios and center distances. 
Zhang (2008) identified boundary vectors, using the k-nearest neighbors (k-NN algorithm. With this  method the distance between each vector and all other vectors is computed, and the vectors that have  among their k-NN a vector of opposing class are retained. For linearly nonseparable problems, k-NN is  applied in the kernel space, where the dataset is linearly separable. The preextract boundary vectors are  used to train SVM. Because this subset is much smaller than the original dataset, training will be faster, and  the support vector set will be smaller. 
Downs, Gates, and Masters (2002) attempted to reduce the number of support vectors used in the  prediction stage by eliminating vectors from the support vector set produced by an SMO solver that are  linearly dependent on other support vectors. Hence, the final support vector set is formed of all linearly  independent support vectors in the kernel space obtained by using row-reduced echelon form. Although  this method produced reduction for polynomial kernels, and RBF with large sigma values, the number of,  support vectors reduced could not be predicted ahead of time and was dependent on the kernel and the  problem. 
Nguyen (2006) reduced the support vector set by iteratively replacing the two nearest support vectors  belonging to the same class, using a constructed support vector that did not belong to the original training  set. The algorithm was applied after training the SVM on the training set and obtaining the support vector  set. The algorithm was tested on the United States Postal Service database (Le Cun 1990) and achieved  
significant reduction in support vector set cardinality, with little reduction in prediction accuracy. Rizk, Mitri, and Awad (2013) proposed a local mixture–based SVM (LMSVM), which exploits the  increased separability provided by the kernel trick, while introducing a one-time computational cost.  LMSVM applies kernel k-means clustering to the data in kernel space before pruning unwanted clusters,  based on a mixture measure for label heterogeneity. Extending this concept, Rizk, Mitri, and Awad (2014)  put forward knee-cut SVM (KCSVM) and knee-cut ordinal optimization–inspired SVM (KCOOSVM), with  a soft trick of ordered kernel values and uniform subsampling to reduce the computational complexity of  SVM, while maintaining an acceptable impact on its generalization capability.
55 
Chapter 3 ■ Support Vector Machines for Classification 
Case Study of SVM for Handwriting Recognition  
Automated handwriting recognition (HWR) is becoming popular in several offline and online sensing  tasks. Developing robust yet computationally efficient algorithms is still a challenging problem, given  the increased awareness of energy-aware computing. Offline sensing occurs by optically scanning words  and then transforming those images to letter code usable in the computer software environment. Online  recognition automatically converts the writing on a graphics tablet or pen-based computer screen into  letter code. HWR systems can also be classified as writer dependent or writer independent, with dependent  systems’ having a higher recognition rate, owing to smaller variance in the provided data. 
Because isolated-letter HWR is an essential step for online HWR, we present here a case study on  developing an efficient writer-independent HWR system for isolated letters, using pen trajectory modeling  for feature extraction and an MSVM for classification (Hajj and Awad 2012). In addition to underlining the  
importance of the application, this case study illustrates how stationary features are created from sequential  data and how a multiclass task is converted into a hierarchical one. Usually, hidden Markov models (HMM)  are better for modeling and recognizing sequential data, but with an appropriate feature generation scheme,  
an SVM model can be used to model variable sequence length for moderate handwriting vocabularies. The proposed HWR workflow is composed of preprocessing; feature extraction; and a hierarchical,  three-stage classification phase. 
Preprocessing 
The UJIpenchars database can be transformed into a sequence of points suitable for feature extraction in  a way similar to preprocessing performed a step typically found in many HWR systems. The preprocessing  comprises correcting the slant; normalizing the dimensions of the letter; and shifting the coordinates, with  respect to the center of mass. 
To correct the slant, the input, consisting of a sequence of collected points, is first written in the form  of a series of vectors with polar coordinates, and then only vectors with an angle equal to or less than  50 degrees with the vertical are considered. The slant is computed by averaging the angles of the significant  vectors. Next, the letter is rotated by the slant angle, and the data are normalized so that all letters have the  same dimensions. Finally, the shifting of the coordinates, with respect to the center of mass, fits the letter  into a square of unit dimension with a centroid with the coordinates (0, 0). 
Figure 3-8 shows two letters before (left) and after (right) the preprocessing stage. 
Figure 3-8. Examples of letters before (left) and after (right) preprocessing
56 
Chapter 3 ■ Support Vector Machines for Classification 
Feature Extraction  
To obtain different representations of the letters, a set of feature vectors of fixed length should be computed.  The preprocessed data, consisting of strokes of coordinate pairs [x(t), y(t)], can be modeled, using a pen  trajectory technique (Jaeger 2008), and the set of features is obtained after averaging the following functions: 
•  Writing direction: Defined by 
s t ( ) = ( )( ) ( ) = ( )( ) 
∆ 
x t 
∆ 
s tt y t 
cos ; α α t sin , 
∆ 
∆ 
where Δx, Δy, and Δs are defined as 
∆x t( ) = − x t( ) 1 1 − + x t( ), 
∆y t( ) = − y t( ) 1 1 − + y t( ), 
∆ ∆ s t( ) = x t( ) + ∆y t( ) 2 2. 
•  Curvature: Defined by the sine and cosine of the angle defined by the points (x(t - 2),  y(t - 2)); (x(t), y(t)); and (x(t + 2), y(t + 2)). Curvature can be calculated from the writing  direction, using the following equations: 
cos c β α ( )t t = − os ( ) 1 1 cos s α α ( ) t t + + − in ( ) 1 1 sin , α( ) t + 
sinβ α ( )t t = − cos ( 1 1 )sinα α ( ) t t + − − sin ( 1 1 )cosα( ) t + . 
•  Aspect of the trajectory: Computed according to the equation 
y t x t ( ) = ( ) ( ) − ( ) 
A t y t x t 
∆ ∆ 
∆ ∆ . 
( ) ( ) + ( ) 
•  Curliness: Describes the deviation of the points from a straight line formed by the  previous and following points in the sequence by the equation 
C t( ) = L t( )/ , max( ) ∆ ∆ x y − 2, 
where L(t) represents the length of the trajectory from point (x(t - 1), y(t - 1)) to point  (x(t + 1), y(t + 1)). 
In addition to the previous functions, the following global features are computed: 
•  Linearity: Measured by the average distance from each point of the sequence to the  straight line joining the first and last points in the sequence: 
1 . 
LNN= ∑di 
•  Slope of the sequence: Measured by the cosine and sine of the angle formed by the  straight line joining the first and last points in the sequence and a horizontal line. 
•  Ascenders and descenders: Describes the number of points of the sequence below  (descenders) or above (ascenders) the baseline (the straight horizontal line on which  the letter is written), each weighted by its distance to the baseline. 
•  Variance of coordinates (for both dimensions): Measures the expansion of the points  around the center of mass. 
•  Ratio of variances: Represents the proportion of the width to the height of the letter.
57 
Chapter 3 ■ Support Vector Machines for Classification 
•  Cumulative distance: The sum of the length of the segments of line joining  
consecutive points of the sequence. 
•  Average distance to the center, The mean of the distances from each point of the  sequence to the center of mass of the letter. 
Hierarchical, Three-Stage SVM 
After the preprocessing and feature extraction stages, a three-stage classifier recognizes one of the 52 classes  (26 lowercase and 26 uppercase letters). 
•  Using a binary SVM classifier, the first stage classifies the instance as one of two  classes: uppercase or lowercase letter. 
•  Using OAA SVM, the second stage classifies the instance as one of the manually  determined clusters shown in Table 3-1. 
Table 3-1. Lower- and Uppercase Clusters 
Lowercase Clusters Uppercase Clusters 
Cluster 1: a c e o Cluster 2: b d l t Cluster 3: f h k Cluster 4: g z j Cluster 5: p q Cluster 6: i r s Cluster 7: u v w x Cluster 8: m n 
Cluster 9: A B P R Cluster 10: C D G O Q Cluster 11: E F I L Cluster 12: J K T 
Cluster 13: M N H Cluster 14: S Y Z X Cluster 15: U V W
•  Using OAA SVM, with a simple majority vote, the third stage identifies the letter  as one of the 52 classes (or subclusters). Figure 3-9 displays the hierarchy of the  three-stage system. 
58 
Chapter 3 ■ Support Vector Machines for Classification 
STAGE 1  
SVM 
Lower  
STAGE 2  
SVM 
C1 - C8? 
C 1 
STAGE 3  
SVM 
a, e,  
c, o? ae 
Input letter Lower or  Upper Case 
case 
Upper  case 
C 2 
. 
. 
. 
C 8 
C9 - C15? 
. 
. 
. 
A, B,  
c 
. o . 
. 
Output letter 
C 9 
P, R? AB 
P 
. 
R 
C 10 
. 
. 
. 
. 
. 
. 
. 
. 
C 15 
Figure 3-9. Hierarchical, three-stage SVM 
Experimental Results 
Experimental results, implemented with the MATLAB R2011a SVM toolbox, showed (using a four-fold  cross-validation) an average accuracy of 91.7 percent—or, an error rate of 8.3 percent, compared with  an error rate of 10.85 percent, using 3NN (Prat et al. 2009). The three stages of the classifier achieved,  respectively, 99.3 percent, 95.7 percent, and 96.5 percent accuracy. The kernel used for the three stages was  an RBF with parameters tuned using a grid search algorithm. Our proposed preprocessing helped improve  the general accuracy of the recognizer by approximately 1.5 percent to 2 percent. 
Figure 3-10 presents a confusion histogram demonstrating the occurrence of the predicted classified  labels, along with their true labels. For example, in the first column, of the six letter a’s, five were correctly  recognized, and one was mistaken for c. Generally, no particular trend was observed in this confusion  matrix, and the error may be assumed to be randomly distributed among all classes.
59 
Chapter 3 ■ Support Vector Machines for Classification 
Figure 3-10. Confusion plot for classified label versus true label 
Because a flat SVM architecture may seem computationally less expensive, it was compared with  the proposed three-stage SVM, using OAO and OAA SVM techniques. Table 3-2 shows the recognition  rates obtained using the proposed architecture, compared with a flat SVM technique as well as the3NN  algorithm. The accuracy attained ranged from 65 percent, using OAA, to 82 percent, using OAO, whereas  the hierarchical SVM structure reached 91.7 percent. This is due to the fact that, with a three-stage SVM,  both the metaparameters of SVM (i.e., the regularization parameter between the slack and hyperplane  parameters) and the kernel specifics can be better modified independently during each phase of training  and better tailored to the resulting data subsets than a flat SVM model can be for the whole dataset.
Table 3-2. Recognition Rate Comparison 
Architecture Recognition Rate (%) 
Flat SVM OAA 65 
Flat SVM OAO 82 
3NN (Prat et al. 2009) 89.15 
Three-Stage SVM 91.8 
60 
Chapter 3 ■ Support Vector Machines for Classification 
Complexity Analysis 
Tables 3-3 and 3-4, respectively, provide the required operations for the preprocessing and feature extraction  stages of the three-stage SVM, where a letter is represented by a sequence of strokes of length N, with M being the number of significant vectors, and K, the data size. 
Table 3-3. Required Operations for the Preprocessing Stage 
Step Total Operations 
Representing letter in a sequence of vector 8N 
Computing slant M + 1 
Rotating letter N 
Normalizing dimensions 2N 
Shifting to center of mass 4N + 2 
Table 3-4. Required Operations for the Feature Extraction Stage 
Feature Total Operations 
Writing direction 7N 
Curvature 6N 
Aspect 2N 
Curliness 14N 
Linearity 6N + 1 
Slope 7 
Ascenders and descenders 6N 
Variance 8N + 4 
Ratio of variances 1 
Cumulative distance 5N - 5 
Average distance 4N 
Table 3-5 compares the required operations for the classification process using three-stage SVM and  the 3NN algorithm . Both SVM optimal hyperplane coefficients and support vectors were computed during  the training process. Given an input pattern represented by a multidimensional (11) vector x and a w vector  representing the decision boundary (hyperplane), the decision function for the classification phase is  reduced to a sign function.
61 
Chapter 3 ■ Support Vector Machines for Classification 
Table 3-5. Comparison of Three-Stage SVM and 3NN Classifiers 
Classifier Decision Function Total Operations Three-Stage SVM C w x x w T ( ) = + 0 12 operations per classifier; in total,  168 operations (the class requiring the  
most classifiers) 
3NN (Prat et al. 2009) D x( ) , z x = − ( ) z x + + ... ( ) − z 1 1250 502 150 operations per distance measure; in  total, 3 50 * K = 150 * K
The online classification task is much costlier using a 3NN classifier compared with a hierarchical SVM. In  fact, every classification task requires the Euclidian distance calculation to all points in the dataset, which would  be an expensive cost to incur in the presence of a large dataset. Additionally, with the lack of a classification  model, the k-NN technique is a non parametric approach and requires access to all the data each time an  instance is recognized. With SVM, in contrast, separating class boundaries is learned offline, during the training  phase, and at runtime the computational cost of SVM training is not present. Only preprocessing, feature  extraction, and a simple multiplication operation with the hyperplane parameters are involved in the online  testing process. An advantage of 3NN, however, is that no training is required, as opposed to the complex SVM  classification step. 
References 
Aizerman, M., E. Braverman, and L. Rozonoer. “Theoretical Foundations of the Potential Function Method  in Pattern Recognition Learning.” Automation and Remote Control 25 (1964): 821–837. 
Ajeeb, N., A. Nayal, and M. Awad. “Minority SVM for Linearly Separable and Imbalanced Datasets.” In  IJCNN 2013: Proceedings of the 2013 International Joint Conference on Neural Networks, 1–5. Piscataway, NJ:  Institute for Electrical and Electronics Engineers, 2013. 
Akbani, Rehan, Stephen Kwek, and Nathalie Japkowicz. “Applying Support Vector Machines to Imbalanced  Datasets.” In Machine Learning: ECML 2004: 15th European Conference on Machine Learning, Pisa, Italy,  September 2004, edited by Jean-François Boulicaut, Floriana Esposito, Fosca Giannotti, and Dino Pedreschi,  39–50. Berlin: Springer, 2004. 
Alham, N. K., Maozhen Li, S. Hammoud, Yang Liu, and M. Ponraj. “A Distributed SVM for Image  Annotation.” In FSKD 2010: Proceedings of the Seventh International Conference on Fuzzy Systems and  Knowledge Discovery, edited by Maozhen Li, Qilian Liang, Lipo Wang and Yibin Song, 2983–2987.  Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2010. 
Aronszajn, N. “Theory of Reproducing Kernels.” Transactions of the American Mathematical Society 68,  no. 3 (1950): 337–404. 
Bartlett, P. and J. Shawe-Taylor, “Generalization Performance of Support Vector Machines and Other Pattern  Classifiers”, Advances in Kernel Methods: Support Vector Learning, 1999. 
Ben-Hur, Asa, and Jason Weston. “A User’s Guide to Support Vector Machines.” Data Mining Techniques for  Life Sciences, edited by Oliviero Carugo and Frank Eisenhaber, 223-239. New York: Springer, 2010. 
Bennett, Kristen P., and O. L. Mangasarian. “Robust Linear Programming Discrimination of Two Linearly  Inseparable Sets.” Optimization Methods and Software 1, no. 1: (1992): 23–34. 
Boser, Bernard E., Isabelle M. Guyon, and Vladimir N. Vapnik. “A Training Algorithm for Optimal Margin  Classifiers.” In COLT ’92: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, edited by David Haussler, 144–152. New York: ACM, 1992. 
62 
Chapter 3 ■ Support Vector Machines for Classification 
Cao, L. J., S. S. Keerthi, Chong-Jin Ong, J. Q. Zhang, U. Periyathamby, Xiu Ju Fu, and H. P. Lee. “Parallel  Sequential Minimal Optimization for the Training of Support Vector Machines.” IEE Transactions on Neural  Networks 17, no. 4 (2006): 1039–1049. 
Catanzaro, Bryan Christopher, Narayanan Sundaram, and Kurt Keutzer. “Fast Support Vector Machine  Training and Classification on Graphics Processors.” In ICML ’08: Proceedings of the 25th International  Conference on Machine Learning, edited by William Cohen, Andrew McCallum, and Sam Roweis, 104–111.  New York: ACM, 2008. 
Chang, Chih-Chung, and Chih-Jen Lin. “LIBSVM: A Library for Support Vector Machines,” in “Large-Scale  Machine Learning,” edited by Huan Liu and Dana Nau, special issue, ACM Transactions on Intelligent  Systems and Technology 2, no. 3 (2011). 
Chawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. “SMOTE: Synthetic  Minority Over-Sampling Technique.” Journal of Artificial Intelligence Research 16 (2002): 321–357. 
Cover, Thomas M. “Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications  in Pattern Recognition.” IEEE Transactions on Electronic Computers 14 (1995): 326–334. 
Das, Barnan. “Implementation of SMOTEBoost Algorithm Used to Handle Class Imbalance Problem in  Data,” 2012. www.mathworks.com/matlabcentral/fileexchange/37311-smoteboost. 
Downs, Tom, Kevin E. Gates, and Annette Masters. “Exact Simplification of Support Vector Solutions.”  Journal of Machine Learning Research 2 (2002): 293–297. 
Duda, Richard O., and Peter E. Hart. Pattern Classification and Scene Analysis. New York: Wiley, 1973. Fawcett, Tom. “An Introduction to ROC Analysis. “Pattern Recognition Letters 27 (2006): 861–874. 
Feng, Jianfeng, and P. Williams. “The Generalization Error of the Symmetric and Scaled Support Vector  Machines.” IEEE Transactions on Neural Networks 12, no. 5 (1999): 1255–1260. 
Fine, Shai, and Katya Scheinberg. “Efficient SVM Training Using Low-Rank Kernel Representations.” Journal  of Machine Learning Research 2 (2002): 243–264. 
Fisher, Marshall L. “The Lagrangian Relaxation Method for Solving Integer Programming Problems.  “Management Science 50, no. 12 (2004):1861–1871. 
Frank, A., and A. Asuncion. University of California, Irvine, Machine Learning Repository. Irvine: University  of California, 2010. www.ics.uci.edu/`mlearn/MLRepository.html. 
Friedman, J. “Another Approach to Polychotomous Classification.” Technical report, Stanford University, 1996. 
Hajj, N., and M. Awad.” Isolated Handwriting Recognition via Multi-Stage Support Vector Machines.” In  Proceedings of the 6th IEEE International Conference on Intelligent Systems,” edited by Vladimir Jotsov,  Krassimir Atanassov, 152–157. Piscataway, NJ: Institute for Electrical and Electronic Engineers, 2012. 
He, Heibo, and Edwardo A. Garcia. “Learning from Imbalanced Data.” IEEE Transactions on Knowledge and  Data Engineering 21, no. 9(2009):1263–1284. 
Gill, Philip E., and Walter Murray. “Newton-Type Methods for Unconstrained and Linearly Constrained  Optimization.” Mathematical Programming 7 (1974): 311–350. 
Imam, Tasadduq, Kai Ming Ting, and Joarder Kamruzzaman. “z-SVM: An SVM for Improved Classification  of Imbalanced Data.” In AI 2006: Advances in Artificial Intelligence; Proceedings of the 19th Australian Joint  Conference on Artificial Intelligence, Hobart, Australia, December 4–8, 2006, edited by Abdul Sattar and  Byeong-Ho Kang, 264–273, 2006. Berlin: Springer, 2006. 
S. Jaeger, S. Manke, J. Reichert, A. Waibel, “Online Handwriting Recognition: the NPen++ Recognizer”,  International Journal on Document Analysis and Recognition, vol.3, no.3, 169-180, March 2008.
63 
Chapter 3 ■ Support Vector Machines for Classification 
Jin, A-Long, Xin Zhou, and Chi-Zhou Ye. “Support Vector Machines Based on Weighted Scatter Degree.” In  Artificial Intelligence and Computational Intelligence: Proceedings of the AICI Third International Conference,  Taiyuan, China, September 24–25, 2011, Part III, edited by Hepu Deng, Duoqian Miao, Jingsheng Lei, and Fu  Lee Wang, 620–629. Berlin: Springer, 2011. 
Karush, William. “Minima of Functions of Several Variables with Inequalities as Side Constraints.” Master’s  thesis, University of Chicago, 1939. 
Knerr, S., L. Personnaz, and G. Dreyfus. “Single-Layer Learning Revisited: A Stepwise Procedure for Building  and Training a Neural Network.” In Neurocomputing: Algorithms, Architectures and Applications; NATO  Advanced Workshop on Neuro-Computing, Les Arcs, Savoie, France, 1989, edited by Françoise Fogelman  Soulié and Jeanny Hérault, 41–50. Berlin: Springer, 1990. 
Koknar-Tezel, S., and L. J. Latecki. “Improving SVM Classification on Imbalanced Data Sets in Distance  Spaces.” In ICDM ’09: Proceedings of the Ninth IEEE International Conference on Data Mining, edited by Wei  Wang, Hillol Kargupta, Sanjay Ranka, Philip S. Yu, and Xindong Wu, 259–267. Piscataway, NJ: Institute for  Electrical and Electronics Engineers, 2010. 
Kong, Bo, and Hong-wei Wang. “Reduced Support Vector Machine Based on Margin Vectors.” In CiSE 2010  International Conference on Computational Intelligence and Software Engineering,” 1–4. Piscataway,  NJ: Institute for Electrical and Electronic Engineers, 2010. 
Kotsiantis, Sotiris, Dimitris Kanellopoulos, and Panayiotis Pintelas. “Handling Imbalanced Datasets: A  Review.”GESTS International Transactions on Computer Science and Engineering 30, no. 1 (2006): 25–36. 
Kressel, Ulrich H.-G. “Pairwise Classification and Support Vector Machines.” In Advances in Kernel Methods:  Support Vector Learning, edited by Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola,  255–268. Cambridge, MA: Massachusetts Institute of Technology Press, 1999. 
Kuhn, H. W., and A. W. Tucker. “Nonlinear Programming.” In Proceedings of the Second Berkeley Symposium  on Mathematical Statistics and Probability, edited by Jerzy Neyman, 481–492. Berkeley: University of  California Press, 1951. 
Le Cun, Y., B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel. “Handwritten  Digit Recognition with a Back-Propagation Network.” In Advances in Neural Information Processing Systems,  edited by D. S. Touretzky, 396–404. San Mateo, CA: Morgan Kaufmann, 1990. 
Lee, Yuh-Jye, and O. L. Mangasarian. “SSVM: A Smooth Support Vector Machine for Classification.”  Computational Optimization and Applications 20 (2001a): 5–22. 
Lee, Yuh-Jye, and Olvi L. Mangasarian. “RSVM: Reduced Support Vector Machines.” In Proceedings of the  First SIAM International Conference on Data Mining, edited by Robert Grossman and Vipin Kumar, 5–7.  Philadelphia: Society for Industrial and Applied Mathematics, 2001b . 
Liu, Xin, and Ying Ding. “General Scaled Support Vector Machines.” In ICMLC 2011: Proceedings of the 3rd  International Conference on Machine Learning and Computing. Piscataway, NJ: Institute of Electrical and  Electronics Engineers, 2011. 
Mangasarian, O. L. “Linear and Nonlinear Separation of Patterns by Linear Programming.” Operations  Research 13, no. 3 (1965): 444–452. 
Meyer, David, Friederich Leisch, and Kurt Hornik.” The Support Vector Machine Under Test.”  Neurocomputing 55, nos. 1–2 (2003): 169–186. 
Napierała, Krystyna, Jerzy Stefanowski, and Szyman Wilk. “Learning from Imbalanced Data in Presence  of Noisy and Borderline Examples.” In Rough Sets and Current Trends in Computing: Proceedings of the  7th RSCTC International Conference, Warsaw, Poland, June 2010,edited by Marcin Szczuka, Marzena  Kryszkiewicz, Sheela Ramanna, Richard Jensen, and Qinghua Hu, 158–167. Berlin: Springer, 2010.
64 
Chapter 3 ■ Support Vector Machines for Classification 
Nguyen, Duc Dung, and Tuo Bao Ho. “A Bottom-Up Method for Simplifying Support Vector Solutions.” IEEE  Transactions on Neural Networks. 17, no. 3 (2006): 792–796. 
Ou, Yu-Yen, Hao-Geng Hung, and Yen-Jen Oyang, “A Study of Supervised Learning with Multivariate  Analysis on Unbalanced Datasets.” In IJCNN ’06: Proceedings of the 2006 International Joint Conference on  Neural Networks, 2201–2205. Piscataway, NJ: Institute for Electrical and Electronic Engineers, 2006. 
Peng, Peng, Qian-Lee Ma, and Lei-Ming Hong. “The Research of the Parallel SMO Algorithm for Solving  SVM.” In ICMLC 2009: Proceedings of the 2009 International Conference on Machine Learning and  Cybernetics, 1271–1274. Piscataway, NJ: Institute for Electrical and Electronics Engineers, 2009. 
Platt, John C. “Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines.”  Technical report MSR-TR-98-14, 1998. 
Platt, John C., Nello Cristianini, and John Shawe-Taylor. “Large Margin DAGs for Multiclass Classification.”  In Advances in Neural Information Processing Systems 12 (NIPS ‘99), edited S. A. Solla, T. K. Leen, and  K.-R. Müller, 547–553. Cambridge, MA: Massachusetts Institute of Technology Press, 2000. 
Poggio, Tomaso, and Federico Girosi. “Networks for Approximation and Learning.” Proceedings of the IEEE 78,  no. 9 (1990): 1481–1497. 
Prat, Federico, Andrés Marzal, Sergio Martín, Rafael Ramos-Garijo, and María José Castro. “A Template Based Recognition System for On-line Handwritten Characters.” Journal of Information Science and  Engineering 25 (2009): 779–791. 
Rizk, Y., N. Mitri, and M. Awad. “A Local Mixture Based SVM for an Efficient Supervised Binary  Classification.” In IJCNN 2013: Proceedings of the International Joint Conference on Neural Networks, 1–8.  Piscataway, NJ: Institute for Electrical and Electronics Engineers, 2013. 
Rizk, Y., N. Mitri, and M. Awad. “An Ordinal Kernel Trick for a Computationally Efficient Support Vector  Machine.” In IJCNN 2014: Proceedings of the 2014 International Joint Conference on Neural Networks,  3930–3937. Piscataway, NJ: Institute for Electrical and Electronics Engineers, 2014. 
Schölkopf, B., John C. Platt, John C. Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. “Estimating the  Support of a High-Dimensional Distribution.” Neural Computation 13, no. 7 (2001):1443–1471. 
Smith, F. W. “Pattern Classifier Design by Linear Programming.” IEEE Transactions on Computers, C-17.  no. 4 (1968): 367–372. 
Stockman, M., and M. Awad. “Multistage SVM as a Clinical Decision Making Tool for Predicting Post  Operative Patient Status.” IKE ’10: Proceedings of the 2010 International Conference on Information and  Knowledge Engineering. Athens, GA: CSREA, 2010. 
Suykens, J. A. K. Suykens, and J. Vandewalle. “Least Squares Support Vector Machine Classifiers.” Neural  Processing Letters 9, no. 3 (1999): 293–300. 
Tang, Yuchun, Yan-Qing Zhang, Nitesh V. Chawla, and Sven Krasser. “SVMs Modeling for Highly Imbalanced  Classification.” Journal of Latex Class Files 1, no. 11 (2002). www3.nd.edu/~dial/papers/SMCB09.pdf. 
Tang, Yuchun, Yan-Qing Zhang, N. V. Chawla, and Sven Krasser. “SVMs Modeling for Highly Imbalanced  Classification.” Systems, Man, and Cybernetics B: IEEE Transactions on Cybernetics 39, no. 1 (2009): 281–288. 
Tax, David M. J., and Robert P. W. Ruin. “Support Vector Domain Description.” Pattern Recognition Letters  20 (1999): 1191–1199. 
Tax, David M. J., and Robert P. W. Duin. “Support Vector Data Description.” Machine Learning 54 (2004): 45–66. Vapnik, Vladimir N. The Nature of Statistical Learning Theory. New York: Springer, 1995. Vapnik, Vladimir N. Statistical Learning Theory. New York: Wiley, 1998.
65 
Chapter 3 ■ Support Vector Machines for Classification 
Vapnik, Vladimir N. The Nature of Statistical Learning Theory, Second Edition. New York: Springer, 1999. Vapnik, V., and A. Chervonenkis. “A Note on One Class of Perceptrons.” Automation and Remote Control 25 (1964). 
Vapnik, V., and A. Lerner. “Pattern Recognition Using Generalized Portrait Method.” Automation and Remote  Control 24 (1963): 774–780. 
Veropoulos, K., C. Campbell, and N. Cristianini. “Controlling the Sensitivity of Support Vector Machines.”  In IJCAI ‘99: Proceedings of the 16th International Joint Conference on Artificial Intelligence, edited by  Thomas Dean, 55–60. San Francisco: Morgan Kaufmann, 1999. 
Wahba, Grace. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied  Mathematics 59. Philadelphia: Society for Industrial and Applied Mathematics, 1990. 
Wang, Benjamin X., and Nathalie Japkowicz. “Boosting Support Vector Machines for Imbalanced Data Sets.”  Knowledge and Information Systems 25, no. 1 (2010): 1–20. 
Weston, J., and C. Watkins. Support Vector Machines for Multi-Class Pattern Recognition. In ESANN 1999:  Proceedings of the 7th European Symposium on Artificial Neural Networks, Bruges, Belgium, 21–23 April 1999,  219–224. 1999. https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es1999-461.pdf. 
Zeng, Zhi-Qiang, Hong-Bin Yu, Hua-Rong Xu, Yan-Qi Xie, and Ji Gao. “Fast Training Support Vector  Machines Using Parallel Sequential Minimal Optimization.” In ISKE 2008: Proceedings of the 3rd  International Conference on Intelligent System and Knowledge Engineering, edited by Shaozi Li, Tianrui Li,  and Da Ruan, 997–1001. Piscataway, NJ: Institute for Electrical and Electronics Engineers, 2008. 
Zhang, Li, Ning Ye, Weida Zhou, and Licheng Jiao. “Support Vectors Pre-Extracting for Support Vector  Machine Based on K Nearest Neighbour Method.” In ICIA 2008: Proceedings of the 2008 International  Conference on Information and Automation, 1353–1358. Piscataway, NJ: Institute of Electrical and  Electronics Engineers, 2008. 
Zhang, Xuegong. “Using Class-Center Vectors to Build Support Vector Machines.” Neural Networks for Signal  Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, edited by Yu-Hen Hu, Jan  Larsen, Elizabeth Wilson, and Scott Douglas, 3–11. Piscataway, NJ: Institute for Electrical and Electronic  Engineers, 1999. 
Zhuang, Ling, and Honghua Dai. “Parameter Optimization of Kernel-Based One-Class Classifier on  Imbalance Text Learning.” Journal of Computers 1, no. 7 (2006): 32–40.
66 
Chapter 4 
Support Vector Regression 
The key to artificial intelligence has always been the representation. 
—Jeff Hawkins 
Rooted in statistical learning or Vapnik-Chervonenkis (VC) theory, support vector machines (SVMs) are  well positioned to generalize on yet-to-be-seen data. The SVM concepts presented in Chapter 3 can be  generalized to become applicable to regression problems. As in classification, support vector regression 
(SVR) is characterized by the use of kernels, sparse solution, and VC control of the margin and the number  of support vectors. Although less popular than SVM, SVR has been proven to be an effective tool in real-value  function estimation. As a supervised-learning approach, SVR trains using a symmetrical loss function,  which equally penalizes high and low misestimates. Using Vapnik’s ε-insensitive approach, a flexible tube  of minimal radius is formed symmetrically around the estimated function, such that the absolute values  of errors less than a certain threshold ε are ignored both above and below the estimate. In this manner,  points outside the tube are penalized, but those within the tube, either above or below the function, receive  no penalty. One of the main advantages of SVR is that its computational complexity does not depend on  the dimensionality of the input space. Additionally, it has excellent generalization capability, with high  prediction accuracy. 
This chapter is designed to provide an overview of SVR and Bayesian regression. It also presents a case  study of a modified SVR applicable to circumstances in which it is critically necessary to eliminate or strictly  limit underestimating a function. 
SVR Overview 
The regression problem is a generalization of the classification problem, in which the model returns a  continuous-valued output, as opposed to an output from a finite set. In other words, a regression model  estimates a continuous-valued multivariate function. 
SVMs solve binary classification problems by formulating them as convex optimization problems  (Vapnik 1998). The optimization problem entails finding the maximum margin separating the hyperplane,  while correctly classifying as many training points as possible. SVMs represent this optimal hyperplane with  support vectors. The sparse solution and good generalization of the SVM lend themselves to adaptation to  regression problems. SVM generalization to SVR is accomplished by introducing an ε-insensitive region  around the function, called the ε-tube. This tube reformulates the optimization problem to find the tube that  best approximates the continuous-valued function, while balancing model complexity and prediction error.  More specifically, SVR is formulated as an optimization problem by first defining a convex ε-insensitive loss  function to be minimized and finding the flattest tube that contains most of the training instances. Hence, a  multiobjective function is constructed from the loss function and the geometrical properties of the tube. 
67 
Chapter 4 ■ Support Vector Regression 
Then, the convex optimization, which has a unique solution, is solved, using appropriate numerical  optimization algorithms. The hyperplane is represented in terms of support vectors, which are training  samples that lie outside the boundary of the tube. As in SVM, the support vectors in SVR are the most  influential instances that affect the shape of the tube, and the training and test data are assumed to  be independent and identically distributed (iid), drawn from the same fixed but unknown probability  distribution function in a supervised-learning context. 
SVR: Concepts, Mathematical Model, and Graphical  Representation 
SVR problem formulation is often best derived from a geometrical perspective, using the one-dimensional  example in Figure 4-1. The continuous-valued function being approximated can be written as in Equation 4-1.  For multidimensional data, you augment x by one and include b in the w vector to simply the mathematical  notation, and obtain the multivariate regression in Equation 4-2. 
M 
 y f x w x b w xj j b y b x w M = = < >+ = + ∈ ∈ ∑ = ( ) , , , ,   , 1(4-1) 
j 
T 
 f x wbx w x b x w 
T M ( ) = ,     = + ∈ + 
1  (4-2) 
1 
Figure 4-1. One-dimensional linear SVR
SVR formulates this function approximation problem as an optimization problem that attempts to find  the narrowest tube centered around the surface, while minimizing the prediction error, that is, the distance  between the predicted and the desired outputs. The former condition produces the objective function in  Equation 4-3, where   w is the magnitude of the normal vector to the surface that is being approximated: 
 min . w w 122 (4-3) 
68 
Chapter 4 ■ Support Vector Regression 
To visualize how the magnitude of the weights can be interpreted as a measure of flatness, consider the  following example: 
M 
f x w wix x w i M 
( , ) = , , ∈ ∈ . ∑ =   1 
i 
Here, M is the order of the polynomial used to approximate a function. As the magnitude of the vector w increases, a greater number of wi are nonzero, resulting in higher-order solutions, as shown in Figure 4-2.  The horizontal line is a 0th-order polynomial solution and has a very large deviation from the desired  outputs, and thus, a large error. The linear function, a 1st-order polynomial, produces better approximations  for a portion of the data but still underfits the training data. The 6th-order solution produces the best  tradeoff between function flatness and prediction error. The highest-order solution has zero error but a  high complexity and will most likely overfit the solution on yet to be seen data. The magnitude of w acts as a  regularizing term and provides optimization problem control over the flatness of the solution. 
Figure 4-2. Solutions with various orders
The constraint is to minimize the error between the predicted value of the function for a given input  and the actual output. SVR adopts an ε-insensitive loss function, penalizing predictions that are farther  than ε from the desired output. The value of ε determines the width of the tube; a smaller value indicates  a lower tolerance for error and also affects the number of support vectors and, consequently, the solution  sparsity. Intuitively, the latter can be visualized for Figure 4-1. If ε is decreased, the boundary of the tube is  shifted inward. Therefore, more datapoints are around the boundary, which indicates more support vectors.  Similarly, increasing ε will result in fewer points around the boundary. 
Because it is less sensitive to noisy inputs, the ε-insensitive region makes the model more robust. Several  loss functions can be adopted, including the linear, quadratic, and Huber ε, as shown in Equations 4-4, 4-5,  and 4-6, respectively. As demonstrated in Figure 4-3, the Huber loss function is smoother than the linear  and quadratic functions, but it penalizes all deviations from the desired output, with greater penalty as the  error increases. The choice of loss function is influenced by a priori information about the noise distribution  affecting the data samples (Huber 1964), the model sparsity sought, and the training computational  complexity. The loss functions presented here are symmetrical and convex. Although asymmetrical loss  functions can be adopted to limit either underestimation or overestimation, the loss functions should be  convex to ensure that the optimization problem has a unique solution that can be found in a finite number of  steps. Throughout this chapter, the derivations will be based on the linear loss function of Equation 4-4. 
69 
Chapter 4 ■ Support Vector Regression 
⎧⎨⎪⎩⎪0 (4-4) , , ( ) ( ) = − ( ) ≤ 
ε 
 L y f x w y f x w 
ε , , , ; y f x w otherwise ε − ( ) − 
⎧ 
, , ( ) ( ) =− ( ) ≤ 
y f x w 
ε 
0 
⎨⎪ 
 L y f x w 
, ; 
(4-5) 
ε , , 2 
y f x w otherwise ε 
⎩⎪ 
⎧ 
( ) − ( ) − 2 
⎪⎪ 
( ) ( ) = 
 L y f x w 
c y f x w c y f x w c − ( ) − − ( ) > 
, , 
2 
(4-6) 
, , 
⎨ 
⎪⎪ 
⎩ 
2 
1 
− ( ) − ( ) ≤ y f x w y f x w c 
, , 
2 
Figure 4-3. Loss function types: (a) linear, (b) quadratic, and (c) Huber 
ASYMMETRICAL LOSS FUNCTIONS 
Some researchers have proposed modification to loss functions to make them asymmetrical. Shim,  Yong, and Hwang (2011) used an asymmetrical ε-insensitive loss function in support vector quantile  regression (SVQR) in an attempt to decrease the number of support vectors. The authors altered the  insensitivity according to the quantile and achieved a sparser model. Schabe (1991) proposed a  
two-sided quadratic loss function and a quasi-quadratic s-loss function for Bayes parameter estimation,  and Norstrom (1996) replaced the quadratic loss function with an asymmetrical loss function to  derive a general class of functions that approach infinity near the origin for Bayesian risk analysis.  Nath and Bhattacharyya (2007) presented a maximum margin classifier that bounds misclassification  for each class differently, thus allowing for different tolerances levels. Lee, Hsieh, and Wang (2005)  reformulated the typical SVR approach into a nonconstrained problem, thereby only solving a system  of linear equations rather than a convex quadratic one. Pan and Pan (2006) compared three* different  loss functions for economic tolerance design: Taguchi’s quadratic loss function, inverted normal loss  function, and revised inverted normal loss function. 
Adopting a soft-margin approach similar to that employed in SVM, slack variables ξ, ξ* can be added  to guard against outliers. These variables determine how many points can be tolerated outside the tube  illustrated in Figure 4-1. 
Based on Equations 4-3 and 4-4, the optimization problem in Equation 4-7 is obtained; C is a  regularization—thus, a tuneable parameter that gives more weight to minimizing the flatness, or the error, for  this multiobjective optimization problem. For example, a larger C gives more weight to minimizing the error.  This constrained quadratic optimization problem can be solved by finding the Lagrangian (see Equation 4-8).  The Lagrange multipliers, or dual variables, are λ, λ*, α,α* and are nonnegative real numbers.
70 
Chapter 4 ■ Support Vector Regression 
subject to  
1 * 
21   w C i i iN 
+ + ∑ = ξ ξ (4-7) 
 min , 
2 
y w i x i N T − ≤i i ε ξ + = * 1... 
w x y i N Ti i − ≤ + = i ε ξ 1... 
ξ ξi i , i N... * ≥ = 0 1 
1 
  w w C y w x i i iNi iNiTi , , , , , , * * * * * ( ) ξ ξ λ λ α α = + ξ ξ + + α ε − − ∑ ∑ = = 
2 
* 
1 1   ( ) − 
ξ 
i 
2 
i iNiTi i i i i i iN 
(4-8) 
+ −( ) + − − − + ∑ ∑ = = 
* * 
α ε ξ λξ λ ξ 
y w x 
1 1 
The minimum of Equation 4-8 is found by taking its partial derivatives with respect to the variables  and setting them equal to zero, based on the Karush-Kuhn-Tucker (KKT) conditions. The partial derivatives  with respect to the Lagrange multipliers return the constraints, which have to be less than or equal to zero,  as illustrated in Equation 4-9. The final KKT condition states that the product of the Lagrange multipliers  and the constraints is equal to zero (see Equation 4-10). The Lagrange multipliers that are equal to zero  correspond to data inside the tube, whereas the support vectors have nonzero-valued Lagrange multipliers.  The solution is written in terms of the support vector only—hence, the solution sparsity. The function  approximation is represented in Equation 4-12. By replacing Equation 4-9 in Equation 4-8, the dual form of  the optimization problem can be written as shown in Equation 4-13. 
δ 
 
i i i iN ∑ = ( ) * 
δ α α 
= − − = 0 
w w x 1 
δ 
 
* * 
λ α 
= − − = 
0 
δξ 
* 
C 
i i 
δ 
i 
 
λ α 
= − − = 
0 
δξ 
i 
δ 
C 
 
∑ 
i i i iN 
* 
(4-9) 
δλ ξ 
  
= ≤ * 
0 
i 
= 
1 
δ 
 
∑ 
i iN 
δλ ξ = ≤ 
0 
δ 
 
i 
T 
= 
1 
* 
δα ε ξ 
i i y w x 
= − − − ≤ 0 
δ 
 
* 
i 
i 
T 
δα i 
i i = −y w+ − x ε ξ − ≤ 0 i 
T 
( ) − + − − = α ε ξ y w x 
0 
i i 
T 
i i 
* * 
( ) − − − = 
α ε ξ 
y w x i 
(4-10) 
  
i i 
i i 
0 
∀ 
λ ξ 
i i 
* * 
= 
0 
, 
λ ξ i i 
= 
0 
w x i i i iNSV 
= − ( ) ∑ = α α * 
1 (4-11)
71 
Chapter 4 ■ Support Vector Regression 
 f x i i x xi C Ti i iNSV ( ) = − ( ) ∈ ∑ = α α α α * * , , [ , ]0 1(4-12) SV SVy 1 112 j iTj iN 
Nx x SV SV ( ) ∑∑ == 11 , (4-13) 
 max , ** * * * 
α α − + ε α( ) α α + − ( ) α α − − ( ) α α −α ∑ ∑ = = i i iNi i i iNi i j 
j 
subject to 
α α i i α αi i iNSV * * ( ) − = ∈ , , [ ] , ∑ = 0 0 1 C 
At the beginning of this section, the weights vector w was augmented with the scalar b, and the  derivation of the SVR’s mathematical formulation was carried out, disregarding the explicit computation of b (see Equation 4-2). However, b could have been calculated from the KKT conditions, as shown next. 
Training data that belong to the outside of the boundary of the tube will have nonzero αi or αi*; they  cannot both be zero, because that would mean that the instance (xi, yi) belongs to the lower and upper  boundary, which is not possible. Therefore, the corresponding constraints will be satisfied with equality, as  demonstrated in Equation 4-14. Furthermore, because the point is not outside the tube, ξi = 0 , leading to  the result in Equation 4-15 when α ∈( ,C)0 . Equation 4-16 computes b. Performing the same analysis for αi*,  one gets Equations 4-17 and 4-18. 
 y w i x b T − −i i − − ε ξ = 0 (4-14)  y w i x b T − −i − = ε 0 (4-15) 
 b y w x iT = − i −ε (4-16) 
 − + y w i x b − − = Ti ε 0 (4-17) b y w x iT = − + −i ε (4-18) 
Instead of using the KKT conditions, one could have also computed b, while solving the optimization  problem, using the interior-point method, which can converge to an optimal solution in logarithmic time by  navigating along the central path of the feasible region. The central path is determined by solving the primal  and dual optimization problems simultaneously. 
Kernel SVR and Different Loss Functions: Mathematical  Model and Graphical Representation 
The previous section dealt with data in the feature space, assuming f(x) is linear. For non linear functions,  the data can be mapped into a higher dimensional space, called kernel space, to achieve a higher accuracy,  using kernels that satisfy Mercer’s condition (see Figure 4-4), as discussed previously for classification.  Therefore, replacing all instances of x in Equations 4-1–4-18 with k(xi, xj) yields the primal formulation  shown in Equation 4-19, where ϕ(.) is the transformation from feature to kernel space. Equation 4-20  describes the new weight vector in terms of the transformed input. The dual problem is represented in  Equation 4-21, and the function approximation f(x) is in Equation 4-22, where k(.,.), the kernel, is as  illustrated in Equation 4-23.
72 
Chapter 4 ■ Support Vector Regression 
1 * 
21   w C i i iN 
+ + ∑ = ξ ξ (4-19) 
 min , 
2 
subject to  
y w i x i N T − ϕ ε ( )i i ≤ +ξ = * 1,..., 
w x y i N T 
ϕ ε i i i ( ) − ≤ + = ξ 1,..., 
ξ ξi i , , i N ..., * ≥ = 0 1 
N 
SV 
∑ ( ) * 
α α ϕ (4-20) 
= − ( ) 
 w x 
N 
i 
N 
= 
i i i 
1 
N 
SV SV SV NSV 
1 
2 ∑( ) α α i i − ( ) α α j j − k x( ) i j x * * , (4-21) 
 max ,* * * α α − + ε α( ) α α + − ( ) α − = = = = 
∑ ∑ ∑ 
i 
i i 
y 
i i i 
1 1 1 1 
i 
N 
j 
SV 
i 
, , , ,..., , * * ∈[ ] = − ( ) = 0 1 ∑ 0 α αi i SV α α C i N i i 
N 
SV 
i 
= 
1 
∑ , * 
α α (4-22) 
( ) = − ( ) ( ) 
 f x k x x 
i 
= 
i i i 
1 
 k x x x x i i ( ) , . =ϕ ϕ ( ) ( ) (4-23) 
2.5 
2 
1.5 
1 
0.5 
0 
-0.5 
-1 
-1.5 
-2 
-2.5 
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 
Figure 4-4. Nonlinear regression
73 
Chapter 4 ■ Support Vector Regression 
Bayesian Linear Regression  
Unlike SVR, Bayesian linear regression is a generative, as opposed to discriminant, method, that builds  linear regression models based on Bayesian inference. After specifying a model, the method computes the  posterior distribution of parameters and model predictions. This statistical analysis allows the method to  determine model complexity during training, which results in a model that is less likely to overfit. 
For simplicity, assume that a single output yp ∈ are predicted using the model parameters w learned from a set of predictor variables X sized k ×1 and observations Y sized n×1 . The observations Y are assumed to have the distribution in Equation 4-24, where σ2 is the variance of the uncertainty in the  observations: 
 P Y | , w X σ σ , ~ Xw, I 2 2 ( )  ( ) (4-24) 
Once the model has been specified, the model parameters’ posterior distributions can be estimated.  This is done by first assuming a prior distribution of the model parameters (see Equation 4-25). Given the  model variance and observations, the posterior distribution of the model parameters (which is Gaussian)  is as shown in Equation 4-26, with the mean computed in Equation 4-27, and the standard deviation scale  
factor, in Equation 4-28. The mean is simply the Moore-Penrose pseudoinverse of the predictive variables  multiplied by the observations. Given some observations, the posterior probability of the model variance is  computed, and an inverse chi-squared distribution (see Equation 4-29), with n k − degrees of freedom and  a scale factor s2 (see Equation 4-30), is obtained. The scale factor is the error between the model’s predicted  output and an observation. 
1 ( ) ∝ (4-25) 
2 
 P w,σ σ 2 
2 2 
2 , | , , ( ) = ~ , ( ) ( ) 
| σ σ σ 
 P w YP Y w X P w 
P Y w v E w | | 
σ σ 2 
( )  ( ) (4-26) 
2 
w X E X X Y T T = ( )−1 (4-27) 
 v X w XT = ( )−1 (4-28) 
P Y σ inv n k s σ σ 22 22 2 | | ( ) = ( ) ( ) 
 P YP Y P 
( ) ~ , − −  ( ) (4-29) 
T 
2 E = ( ) − ( ) − 
 s Y Xw Y Xw 
E 
− 
n k 
(4-30) 
The marginal posterior distribution of the model parameters, given the observations, is a multivariate  Student’s t-distribution, shown in Equation 4-31 and computed in Equation 4-32, with n k − degrees of  freedom, wE mean, and s2 scale factor, as P w|σ Y2 ( ) , has a normal distribution, and P Y σ2 ( ) | has an inverse  chi-squared distribution. 
 P w Y t n k w s E ( ) | ~ , ( ) − , 2 (4-31) 2 2 2 , (4-32)
 P w( ) | | Y P = ( ) w Y P Y ( ) | d ∫ 
σ 
σ σ σ 2 
74 
Chapter 4 ■ Support Vector Regression 
Given the model parameter probability distributions and a set of predictive variables Xp, the marginal  posterior predictive distribution Yp, which is a multivariate Student’s t-distribution (see Equation 4-33) can  be determined. The mean is computed in Equation 4-34, and the variance, in Equation 4-35. The predictive  distribution variance depends on the uncertainty in the observed data and the model parameters. 
 P YP p Y t n k E Y Y var Y Y p ( ) | | ~ , ( ) − ( ), , ( ) |σ2 (4-33) E Yp p Y X wE ( ) | = (4-34) 
 var Yp p Y I X vw p XT |σ σ 2 2 ( ) , = + ( ) (4-35) 
The concept of Bayesian regression is displayed in Figure 4-5, in which the sample input data available  during training would have been generated by a Gaussian distribution. If these instances represent their  population well, the regression model is expected to generalize well. 
Figure 4-5. One-dimensional regression example illustrating the Gaussian conditional probability  distributions of the output on the input and model parameters 
DISCRIMINANT VS. GENERATIVE MODELS 
A generative approach models the joint probability distribution of the data and class labels p(x, Ck),  based on the prior probability distributions of the class labels p(Ck) and the likelihood probability  distribution p x Ck ( ) | . The joint distribution computes the posterior probability distributions p C( ) k |k , which will be used to map datapoints to class labels. 
A discriminant approach directly computes the posterior probability distributions p C x k ( ) | without  computing the joint probability distribution p(x,Ck). A discriminant approach produces a mapping from  the datapoints to the class labels without computing probability distributions. Therefore, this approach  performs the inference and decision stages in one step.
75 
Chapter 4 ■ Support Vector Regression 
Advantages Disadvantages 
Generative •  Robust to outliers 
•  Can easily update decision model 
•  Allows combination of classifiers trained  
on different types of data by applying  
probability rules 
•  Can improve prediction accuracy by  
measuring confidence in classification based  
on posterior distributions and not making  
predictions when confidence is low 
Discriminant •  Computationally less demanding •  Simple to implement 
•  Computationally demanding •  Requires a lot of training data •  Suffers from the curse of  
dimensionality 
•  Sensitive to noisy data and outliers •  Requires retraining for any changes  in the decision model 
Asymmetrical SVR for Power Prediction: Case Study 
Justification: In many instances of approximation, there is an uneven consequence of misprediction,  based on whether the error is above or below the target value (Stockman et al. 2012a, 2012b). For example,  in power prediction an incorrect low estimate may be of much more concern than an overestimate.  Underpredicting can lead to insufficient cooling of datacenters, inadequate uninterruptible power supply  (UPS), unavailable processor resources, needless powering down of chip components, and so on. In the case  of forest fire behavior prediction, a lower estimate of the threat can lead to greater property damage as well  as loss of life, owing to a lack of adequate supply of personnel and equipment. 
In these instances, it is crucial to minimize misestimates on one side of a boundary, even at the risk of  reducing the accuracy of the entire estimation. It is necessary to restrict the loss function so that a minimal  number of under- or overestimates occur. This leads to an asymmetrical loss function for training, in which a  greater penalty is applied when the misestimate is on the wrong side of the boundary. 
Approach: Asymmetrical and lower-bounded SVR (ALB-SVR) was proposed by Stockman, Awad, and  Khanna (2012a). This approach modifies the SVR loss functions and corresponding error functions, such  that the ε-tube is only above the function, as demonstrated in Figure 4-6. The penalty parameter C is split  into C+ and C- so that different penalties can be applied to the upper and lower mispredictions. 
Figure 4-6. (a) SVR and (b) ALB-SVR (Source: Intel, 2012)
76 
Chapter 4 ■ Support Vector Regression 
ALB-SVR uses the Huber insensitive loss function (Popov and Sautin 2008). This function is similar to  the ε-insensitive loss function; however, it increases quadratically for small errors outside the ε-bound but  below a certain threshold ∂ > ε and then linearly beyond ∂. This makes it robust with respect to outliers.  The Huber insensitive loss function is represented by: 
⎧ ⎪⎪ 
0 
2 
if t y 
− ≤ 
ε 
∂ ( ) = 
L t, y ⎨ 
ε HuberSVR 
( ) − − < − < ∂ ε ε 
t y if t y 
⎪⎪ ε if t y . ( ) ∂ − − ≥ ∂ ( ) ∂ − − − 
⎩ 
ε 
2 
t y 
ALB-SVR modifies the Huber insensitive loss function as follows: 
≥ − ( ) ≤ 
ε 
if t y 
⎧ 
⎪⎪⎪ 
∂ − ( ) = L t y 
0 0 0 2 
( ) − ( ) − < t y if t y 
− , ( ) 
⎨ 
t y ε HuberALB SVR 2 
( ) − < − < ∂ ε ε 
if t y 
( ) 
⎪⎪⎪ 
( ) ∂ − ( ) − − ∂ − − ≥ ∂ ε ε 
2 
. 
⎩ 
Thus, the solution is: 
t y if t y 
L 
L 
⎢⎢⎢⎢⎢⎤⎦⎥⎥⎥⎥⎥ )( ), 
⎡ 
+ − 
1 2 2 
∑ ∑ 
+ − 
α α εα α ( ) ( ) − − − 
t 
i i i 
1 C 1 
i i 
max 
+ − 
α α 
i 
= 
2 
i 
= 
, 
1 
∑ 
− + − − ⋅ 
+ 
2 α α i i αi i j 
( 
α 
x x 
− − 
i 
, 
i j 
⎣ 
and the resulting optimization problem: L 
⎡ 
L 
1 2 2 
⎢⎢⎢⎢⎤⎦⎥⎥⎥⎥ − + − α α i i αi i j )( )x x 
+ − 
∑ ∑ 
+ − 
α α εα α − − + − t 
( ) ( ) 
i i i 
1 1 C 
i i 
max 
+ − 
α α 
i 
= 
2 
i 
= 
, 
+ 
1 
∑ 
( 
α 
+ 
2 − − ⋅ 
⎣ 
, 
i j 
i 
− ≤ − ≤ = + − C C ( ) α α i i i L 1.. 
L 
+ − ( ) α α − = . 
∑ i i 1 
0 
By substituting the new loss function, ALB-SVR’s empirical risk becomes 
1 
L 
∑ − − 
( ) = ALB SVR i i R y LL t y empi 
= 
ε ( , ). 1 
The maximum additional empirical risk for ALB-SVR can be computed as 
L 
L 
∑ ∑ ( ) − + 
ε. 
i y t 
y t 
∈ − ( )≤ ∈( ) − > 
ε ε 
i y t 
Validation: ALB-SVR was tested on a dataset used by David et al. (2010) and Stockman et al. (2010)  that consists of 17,765 samples of five attributes of memory activity counters, with the actual corresponding  power consumed in watts, as measured directly by a memory power riser. The memory power model  attributes are activity, read, write, CKE = high, and CKE = low. ALB-SVR was implemented with a modified 
77