Handbook
Statistical foundations of machine learning
Second edition
Gianluca Bontempi
Machine Learning Group
Computer Science Department
ULB, Université Libre de Bruxelles,
Brussels, Belgium
mlg.ulb.ac.be
September 17, 2021
And indeed all things that are known have number. For it is not possible that
anything whatsoever be understood or known without this.
— Philolaus, 400 BC

Not everything that can be counted counts, and not everything that counts can
be counted.
— W. B. Cameron, 1963
Preface to the 2021 edition
The book is dedicated to all students interested in machine learning who are not
content with only running lines of (deep-learning) code but who are eager to learn
about this discipline's assumptions, limitations, and perspectives. When I was a
student, my dream was to become an AI researcher and save humankind with in-
telligent robots. For several reasons, I abandoned such ambitions (but you never
know). In exchange, I discovered that machine learning is much more than a con-
ventional research domain since it is intimately associated with the scientific process
transforming observations into knowledge.
The first version of this book was made publicly available in 2004 with two
objectives and one ambition. The first objective was to provide a handbook to ULB
students since I was (and still am) strongly convinced that a decent course should
come with a decent handbook. The second objective was to group together all the
material that I consider fundamental (or at least essential) for a Ph.D. student to
undertake a thesis in my lab. At that time, there were already plenty of excellent
machine learning reference books. However, most of the existing work did not
sufficiently acknowledge what machine learning owes to statistics and concealed (or
did not make explicit enough, notably because of incomplete or implicit notation)
important assumptions underlying the process of inferring models from data.
The ambition was to make a free academic reference on the foundations of ma-
chine learning available on the web. There are several reasons for providing free
access to this work: I am a civil servant in an institution that already takes care
of my salary; most of the material is not original (though its organisation, notation
definition, exercises, code and structure represent the primary added value of the
author); in many parts of the world access to expensive textbooks or reference ma-
terial is still difficult for the majority of students; most of the knowledge underlying
this book was obtained by the author thanks to free (or at least non-charged) refer-
ences and, last but not least, education seems to be the last societal domain where
a communist approach may be as effective as rewarding. Personally, I would be de-
lighted if this book could be used to facilitate the access of underfunded educational
and research communities to state-of-the-art scientific notions.
Though machine learning was already a hot topic at the end of the 20th century,
nowadays, it is definitely surrounded by a lot of hype and excitement. The number
of publications describing or using a machine learning approach in the last decades
is countless, making it impossible to address the heterogeneity of the domain in
a single book. Therefore, it is interesting to check how much material from the
first edition is still useful: reassuringly enough, the more fundamental the content,
the less prone it is to obsolescence. Nevertheless, a lot of new
things (not only deep learning) happened in the domain, and, more specifically, I
realised the importance of some fundamental concepts that were neglected in the
first edition.
In particular, during those years, I realised the importance of exposing young
researchers to notions of multivariate dependency and independence. These notions
are brilliantly summarised in the topic of graphical models whose knowledge is es-
sential to grasp aspects of dimensionality reduction and feature selection. Secondly,
I (re)discovered that the foundations of machine learning lie in epistemology, the
branch of philosophy aiming to explain the meaning of knowledge and the process of
discovering it. Third, I became convinced that a process of discovering knowledge
from data should not be limited to modelling associations but aimed at discovering
causal mechanisms. Finally, I added a number of exercises, R scripts, and Shiny
dashboards to visualise and illustrate (sometimes too abstract) probabilistic and
estimation notions. In this sense, I am convinced that the adoption of Monte Carlo
simulation to introduce probabilistic concepts should be a more common habit in
introductory statistics classes.
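To give a concrete flavour of this habit, here is a minimal R sketch (illustrative only, not taken from the companion gbcode package) of the kind of Monte Carlo simulation meant here: the empirical frequency of an event converging to its probability.

## Monte Carlo illustration: the empirical frequency of the event
## "the sum of two fair dice equals 7" converges to its probability 1/6.
set.seed(0)                                   # for reproducibility
n <- 10000                                    # number of simulated trials
rolls <- replicate(n, sum(sample(1:6, 2, replace = TRUE)))
freq <- cumsum(rolls == 7) / (1:n)            # running empirical frequency
plot(freq, type = "l", xlab = "Number of trials",
     ylab = "Frequency of {sum = 7}")
abline(h = 1/6, lty = 2)                      # the exact probability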
For sure, I am strongly indebted to a lot of authors and their publications. I
hope I acknowledged them adequately in the bibliography. If I did not give enough
credit to some of the existing works, please do not hesitate to contact me. Last
but not least, the book is dedicated to all my ULB students and MLG researchers
in whom I have tried for many years to inculcate complex concepts of statistical
learning. Their eyes staring at my hand-waving, while I was trying to elucidate
some abstruse notions, were the best indicators of how to adapt, select and improve
the book's content.
To all those who want to send a note or continue to follow my machine learning
journey, see you on my blog https://datascience741.wordpress.com.
Acknowledgements
Though the book is not peer-reviewed, the added value of writing a handbook for
students and researchers is that they are typically very careful readers and willing to
pinpoint mistakes, inconsistencies, bad English and (a lot of) typos. First, I would
like to thank (in random order) the MLG researchers who sent me very useful
comments: Abhilash Miranda, Yann-aël Le Borgne, Souhaib Ben Taieb, Jacopo
De Stefani, Patrick Meyer, Olivier Caelen, Liran Lerman. Thanks as well to the
following students and readers (in random order) for their comments and remarks:
Robin de Haes, Mourad Akandouch, Zheng Liangliang, Olga Ibanez Solé, Maud
Destree, Wolf De Wulf, Dieter Vandesande, Miro-Manuel Matagne, Henry Morgan,
Pascal Tribel. A big thank you to all of you! And do not hesitate to drop me an email
if you have comments or remarks!
Contents

1 Introduction
1.1 Notations

2 Setting the foundations
2.1 Deductive logic
2.2 Formal and empirical science
2.3 Induction, projection, and abduction
2.4 Hume and the induction problem
2.5 Logical positivism and verificationism
2.6 Popper and the problem of induction
2.7 Instrumentalism
2.8 Epistemology and machine learning: the cross-fertilisation

3 Foundations of probability
3.1 The random model of uncertainty
3.1.1 Axiomatic definition of probability
3.1.2 Visualisation of probability measures
3.1.3 Symmetrical definition of probability
3.1.4 Frequentist definition of probability
3.1.5 The Law of Large Numbers
3.1.6 Independence and conditional probability
3.1.7 The chain rule
3.1.8 The law of total probability and the Bayes' theorem
3.1.9 Direct and inverse conditional probability
3.1.10 Logics and probabilistic reasoning
3.1.11 Combined experiments
3.1.12 Array of joint/marginal probabilities
3.2 Random variables
3.3 Discrete random variables
3.3.1 Parametric probability function
3.3.2 Expected value, variance and standard deviation of a discrete r.v.
3.3.3 Entropy and relative entropy
3.4 Continuous random variable
3.4.1 Mean, variance, moments of a continuous r.v.
3.4.2 Univariate Normal (or Gaussian) distribution
3.5 Joint probability
3.5.1 Marginal and conditional probability
3.5.2 Independence
3.5.3 Chain rule
3.5.4 Conditional independence
3.5.5 Entropy in the continuous case
3.5.5.1 Joint and conditional entropy
3.6 Bivariate continuous distribution
3.6.1 Correlation
3.7 Normal distribution: the multivariate case
3.7.1 Bivariate normal distribution
3.7.2 Gaussian mixture distribution
3.7.3 Linear transformations of Gaussian variables
3.8 Mutual information
3.8.1 Conditional mutual information
3.8.2 Joint mutual information
3.8.3 Partial correlation coefficient
3.9 Functions of random variables and Monte Carlo simulation
3.10 Linear combinations of r.v.
3.10.1 The sum of i.i.d. random variables
3.11 Conclusion
3.12 Exercises

4 Graphical models
4.1 Conditional independence and multivariate distributions
4.2 Directed acyclic graphs
4.3 Bayesian networks
4.3.1 Bayesian network and d-separation
4.3.2 D-separation and I-map
4.3.2.1 D-separation and faithfulness
4.3.3 Skeleton and I-equivalence
4.3.4 Stable distributions
4.4 Markov networks
4.4.1 Separating vertices, separated subsets and independence
4.4.2 Directed and undirected representations
4.5 Conclusions

5 Parametric estimation
5.1 Classical approach
5.1.1 Point estimation
5.2 Empirical distributions
5.3 Plug-in principle to define an estimator
5.3.1 Sample average
5.3.2 Sample variance
5.4 Sampling distribution
5.4.1 Shiny dashboard
5.5 The assessment of an estimator
5.5.1 Bias and variance
5.5.2 Estimation and the game of darts
5.5.3 Bias and variance of µ̂
5.5.4 Bias of the estimator σ̂²
5.5.5 A tongue-twister exercise
5.5.6 Bias/variance decomposition of MSE
5.5.7 Consistency
5.5.8 Efficiency
5.6 The Hoeffding's inequality
5.7 Sampling distributions for Gaussian r.v.s
5.8 The principle of maximum likelihood
5.8.1 Maximum likelihood computation
5.8.2 Maximum likelihood in the Gaussian case
5.8.3 Cramer-Rao lower bound
5.8.4 Properties of m.l. estimators
5.9 Interval estimation
5.9.1 Confidence interval of µ
5.10 Combination of two estimators
5.10.1 Combination of m estimators
5.10.1.1 Linear constrained combination
5.11 Testing hypothesis
5.11.1 Types of hypothesis
5.11.2 Types of statistical test
5.11.3 Pure significance test
5.11.4 Tests of significance
5.11.5 Hypothesis testing
5.11.6 The hypothesis testing procedure
5.11.7 Choice of test
5.11.8 UMP level-α test
5.11.9 Likelihood ratio test
5.12 Parametric tests
5.12.1 z-test (single and one-sided)
5.12.2 t-test: single sample and two-sided
5.13 A posteriori assessment of a test
5.14 Conclusion
5.15 Exercises

6 Nonparametric estimation and testing
6.1 Nonparametric methods
6.2 Estimation of arbitrary statistics
6.3 Jackknife
6.3.1 Jackknife estimation
6.4 Bootstrap
6.4.1 Bootstrap sampling
6.4.2 Bootstrap estimate of the variance
6.4.3 Bootstrap estimate of bias
6.4.4 Bootstrap confidence interval
6.4.5 The bootstrap principle
6.5 Randomisation tests
6.5.1 Randomisation and bootstrap
6.6 Permutation test
6.7 Considerations on nonparametric tests
6.8 Exercises

7 Statistical supervised learning
7.1 Introduction
7.2 Estimating dependencies
7.3 Dependency and classification
7.3.1 The Bayes classifier
7.3.2 Inverse conditional distribution
7.4 Dependency and regression
7.5 Assessment of a learning machine
7.5.1 An illustrative example
7.6 Functional and empirical risk
7.6.1 Consistency of the ERM principle
7.6.2 Key theorem of learning
7.6.2.1 Entropy of a set of functions
7.6.2.2 Distribution independent consistency
7.6.3 The VC dimension
7.7 Generalisation error
7.7.1 The decomposition of the generalisation error in regression
7.7.2 The decomposition of the generalisation error in classification
7.8 The hypothesis-based vs the algorithm-based approach
7.9 The supervised learning procedure
7.10 Validation techniques
7.10.1 The resampling methods
7.11 Concluding remarks
7.12 Exercises

8 The machine learning procedure
8.1 Introduction
8.2 Problem formulation
8.3 Experimental design
8.4 Data pre-processing
8.5 The dataset
8.6 Parametric identification
8.6.1 Error functions
8.6.2 Parameter estimation
8.6.2.1 The linear least-squares method
8.6.2.2 Iterative search methods
8.6.2.3 Gradient-based methods
8.6.2.4 Gradient descent
8.6.2.5 The Newton method
8.6.2.6 The Levenberg-Marquardt algorithm
8.6.3 Online gradient-based algorithms
8.6.4 Alternatives to gradient-based methods
8.7 Regularisation
8.8 Structural identification
8.8.1 Model generation
8.8.2 Validation
8.8.2.1 Testing
8.8.2.2 Holdout
8.8.2.3 Cross-validation in practice
8.8.2.4 Bootstrap in practice
8.8.2.5 Complexity based criteria
8.8.2.6 A comparison of validation methods
8.8.3 Model selection criteria
8.8.3.1 The winner-takes-all approach
8.8.3.2 The combination of estimators approach
8.9 Partition of dataset in training, validation and test
8.10 Evaluation of a regression model
8.11 Evaluation of a binary classifier
8.11.1 Balanced Error Rate
8.11.2 Specificity and sensitivity
8.11.3 Additional assessment quantities
8.11.4 Receiver Operating Characteristic curve
8.11.5 Precision-recall curves
8.12 Multi-class problems
8.13 Concluding remarks
8.14 Exercises

9 Linear approaches
9.1 Linear regression
9.1.1 The univariate linear model
9.1.2 Least-squares estimation
9.1.3 Maximum likelihood estimation
9.1.4 Partitioning the variability
9.1.5 Test of hypotheses on the regression model
9.1.5.1 The t-test
9.1.6 Interval of confidence
9.1.7 Variance of the response
9.1.8 Coefficient of determination
9.1.9 Multiple linear dependence
9.1.10 The multiple linear regression model
9.1.11 The least-squares solution
9.1.12 Least-squares and non full-rank configurations
9.1.13 Properties of least-squares estimators
9.1.14 Variance of the prediction
9.1.15 The HAT matrix
9.1.16 Generalisation error of the linear model
9.1.16.1 The expected empirical error
9.1.16.2 The PSE and the FPE
9.1.17 The PRESS statistic
9.1.18 Dual linear formulation
9.1.19 The weighted least-squares
9.1.20 Recursive least-squares
9.1.20.1 1st Recursive formulation
9.1.20.2 2nd Recursive formulation
9.1.20.3 RLS initialisation
9.1.20.4 RLS with forgetting factor
9.2 Linear approaches to classification
9.2.1 Linear discriminant analysis
9.2.1.1 Discriminant functions in the Gaussian case
9.2.1.2 Uniform prior case
9.2.1.3 LDA parameter identification
9.2.2 Perceptrons
9.2.3 Support vector machines
9.3 Conclusion
9.4 Exercises

10 Nonlinear approaches
10.1 Nonlinear regression
10.1.1 Artificial neural networks
10.1.1.1 Feed-forward architecture
10.1.1.2 Back-propagation
10.1.1.3 Approximation properties
10.1.2 From shallow to deep learning architectures
10.1.3 From global modelling to divide-and-conquer
10.1.4 Classification and Regression Trees
10.1.4.1 Learning in Regression Trees
10.1.4.2 Parameter identification
10.1.4.3 Structural identification
10.1.5 Basis Function Networks
10.1.6 Radial Basis Functions
10.1.7 Local Model Networks
10.1.8 Neuro-Fuzzy Inference Systems
10.1.9 Learning in Basis Function Networks
10.1.9.1 Parametric identification: basis functions
10.1.9.2 Parametric identification: local models
10.1.9.3 Structural identification
10.1.10 From modular techniques to local modelling
10.1.11 Local modelling
10.1.11.1 Nadaraya-Watson estimators
10.1.11.2 Higher order local regression
10.1.11.3 Parametric identification in local regression
10.1.11.4 Structural identification in local regression
10.1.11.5 The kernel function
10.1.11.6 The local polynomial order
10.1.11.7 The bandwidth
10.1.11.8 The distance function
10.1.11.9 The selection of local parameters
10.1.11.10 Bias/variance decomposition of the local constant model
10.2 Nonlinear classification
10.2.1 Direct estimation via regression techniques
10.2.1.1 The nearest-neighbour classifier
10.2.2 Direct estimation via cross-entropy
10.2.3 Density estimation via the Bayes theorem
10.2.3.1 Naive Bayes classifier
10.2.3.2 SVM for nonlinear classification
10.3 Is there a best learner?
10.4 Conclusions
10.5 Exercises

11 Model averaging approaches
11.1 Stacked regression
11.2 Bagging
11.3 Boosting
11.3.1 The Ada Boost algorithm
11.3.2 The arcing algorithm
11.3.3 Bagging and boosting
11.4 Random Forests
11.4.1 Why are Random Forests successful?
11.5 Gradient boosting trees
11.6 Conclusion
11.7 Exercises

12 Feature selection
12.1 Curse of dimensionality
12.2 Approaches to feature selection
12.3 Filter methods
12.3.1 Principal component analysis
12.3.1.1 PCA: the algorithm
12.3.2 Clustering
12.3.3 Ranking methods
12.4 Wrapping methods
12.4.1 Wrapping search strategies
12.4.2 The Cover and van Campenhout theorem
12.5 Embedded methods
12.5.1 Shrinkage methods
12.5.1.1 Ridge regression
12.5.1.2 Lasso
12.5.2 Kernel methods
12.5.3 Dual ridge regression
12.5.4 Kernel function
12.6 Similarity matrix and non numeric data
12.7 Averaging and feature selection
12.8 Information-theoretic perspective
12.8.1 Relevance, redundancy and interaction
12.8.2 Information-theoretic filters
12.8.3 Information-theoretic notions and generalisation
12.9 Assessment of feature selection
12.10 Conclusion
12.11 Exercises

13 From prediction to causal knowledge
13.1 About the notion of cause
13.2 Causality and dependencies
13.2.1 Simpson's paradox
13.3 Causal vs associational knowledge
13.4 The two main problems in causality
13.5 Causality and potential outcomes
13.5.1 Causal effect
13.5.2 Estimation of causal effect
13.5.3 Assignment mechanisms assumptions
13.5.4 About unconfoundness
13.5.5 Randomised designs
13.5.5.1 Estimation of the treatment effect
13.5.5.2 Stratified (or conditionally) randomised experiments
13.5.6 Observational study
13.5.7 Strategies for estimation in observational studies
13.6 From potential outcomes to graphical models
13.7 Causal Bayesian network
13.7.1 Causal networks and Structural Causal Models
13.7.2 Pre and post-intervention distributions
13.7.3 Causal effect estimation and identification
13.7.3.1 Backdoor criterion
13.7.3.2 Beyond sufficient set: do-calculus
13.7.4 Selection bias
13.8 Counterfactual
13.9 Causal structure identification
13.9.1 Constraint-based approaches
13.9.1.1 Normal conditional independence test
13.9.1.2 Skeleton discovery
13.9.1.3 Dealing with immoralities in the skeleton
13.9.1.4 Limitations
13.10 Beyond conditional independence
13.10.1 Causality and feature selection
13.10.2 Beyond observational equivalence
13.10.2.1 Learning directionality in bivariate associations
13.11 Concluding remarks

14 Conclusions
14.1 About ML limitations
14.2 A bit of ethics
14.3 Take-home notions
14.4 Recommendations

A Unsupervised learning
A.1 Probability density estimation
A.1.1 Nonparametric density estimation
A.1.1.1 Kernel-based methods
A.1.1.2 k-Nearest Neighbors methods
A.1.2 Semi-parametric density estimation
A.1.2.1 Mixture models
A.1.2.2 The EM algorithm
A.1.2.3 The EM algorithm for the mixture model
A.2 K-means clustering

B Linear algebra notions
B.1 Rank of a matrix
B.2 Inner product
B.3 Diagonalisation
B.4 QR decomposition
B.5 Singular Value Decomposition
B.6 Chain rules of differential calculus
B.7 Quadratic norm
B.8 Quadratic programming
B.9 The matrix inversion formula

C Probabilistic notions
C.1 Common univariate discrete probability functions
C.1.1 The Bernoulli trial
C.1.2 The Binomial probability function
C.2 Common univariate continuous distributions
C.2.1 Uniform distribution
C.2.2 The chi-squared distribution
C.2.3 Student's t-distribution
C.2.4 F-distribution
C.3 Common statistical hypothesis tests
C.3.1 χ²-test: single sample and two-sided
C.3.2 t-test: two samples, two sided
C.3.3 F-test: two samples, two sided
C.4 Transformation of random variables and vectors
C.5 Correlation and covariance matrices
C.6 Convergence of random variables
C.6.1 Example
C.7 The central limit theorem
C.8 The Chebyshev's inequality
C.9 Empirical distribution properties
C.10 Useful relations
C.11 Minimum of expectation vs. expectation of minimum
C.12 Taylor expansion of function
C.13 Proof of Eq. (7.5.28)
C.14 Biasedness of the quadratic empirical risk

D Plug-in estimators

E Kernel functions

F Companion R package

G Companion R Shiny dashboards
G.1 List of Shiny dashboards
Chapter 1
Introduction
Over the last decades, a growing number of organisations have been allocating a
vast amount of resources to construct and maintain databases and data warehouses.
In scientific endeavours, data refers to carefully collected observations about some
phenomenon under study. In business, data capture information about economic
trends, critical markets, competitors, and customers. In manufacturing, data record
machinery performance and production rates in different conditions. There are
essentially two reasons why people gather increasing volumes of data. First, they
think some valuable assets are implicitly coded within them, and, second, computer
technology enables effective data storage and processing at reduced costs.
The idea of extracting useful knowledge from volumes of data is common to many
disciplines, from statistics to physics, from econometrics to system identification
and adaptive control. The procedure for finding useful patterns in data is known
by different names in different communities, viz., knowledge extraction, pattern
analysis, data processing. In the artificial intelligence community, the most common
name is machine learning [71]. More recently, the set of computational techniques
and tools to support the modelling of large amounts of data has been grouped under
the more general label of data science.
The need for programs that can learn was stressed by Alan Turing, who argued
that it might be too ambitious to write from scratch programs for tasks that even
humans must learn to perform. This handbook aims to present the statistical
foundations of machine learning intended as the discipline which deals with the
automatic design of models from data. In particular, we focus on supervised learning
problems (Figure 1.1), where the goal is to model the relation between a set of input
variables and one or more output variables, which are considered to be dependent
on the inputs in some manner.
Since the handbook deals with artificial learning methods, we do not take into
consideration any argument of biological or cognitive plausibility of the learning
methods we present. Learning is postulated here as a problem of statistical estima-
tion of the dependencies between variables on the basis of empirical data.
The relevance of statistical analysis arises as soon as there is a need to extract
useful information from data records obtained by repeatedly measuring an observed
phenomenon. Suppose we are interested in learning about the relationship¹ between
two observed variables x (e.g. the height of a child) and y (e.g. the weight of a
child), which are quantitative observations of some phenomenon of interest (e.g.
obesity during childhood). Sometimes, the a priori knowledge that describes the
relation between x and y is available. In other cases, no satisfactory theory exists,
and all that we can use are repeated measurements of x and y.
¹ Note that the term relation simply denotes the statistical association (due to a probabilistic
dependency) between the two variables and has no causal connotation.
Figure 1.1: The supervised learning setting. Machine learning aims to infer from
observed data the best model of the stochastic input/output dependency.
In this book, our focus is the second situation where we assume that only a set of
observed data is available. The reasons for addressing this problem are essentially
two. First, the more complex the input/output relation is, the less effective the
contribution of a human expert in extracting a model of the relation will be. Second,
data-driven modelling may be a valuable support for the designer also in modelling
tasks where she can take advantage of existing knowledge.
Though machine learning is becoming a central component in many (so-called)
intelligent applications, we deem that simply considering it as a powerful computa-
tional technology would be utterly reductive. The process of extracting knowledge
from observations lies at the root of the modern scientific process, and the most
challenging issues in machine learning relate to well-established philosophical and
epistemological problems, notably induction or the notion of truth. This is the
reason why we added in this new version of the handbook a preliminary chapter to
situate the machine learning problem in the broader context of human knowledge
acquisition.
Modelling from data
Modelling from data is often viewed as an art, mixing an expert's insight with the
information contained in the observations. A typical modelling process cannot be
considered as a sequential process but is better represented as a loop with many
feedback paths and interactions with the model designer. Various steps are repeated
several times aiming to reach, through continuous refinements, a good description
of the phenomenon underlying the data.
The modelling process consists of a preliminary phase that brings the data from
their original form to a structured configuration and a learning phase that aims to
select the model, or hypothesis, that best approximates the data (Figure 1.2).
The preliminary phase can be decomposed into the following steps:
Problem formulation. Here the model designer chooses a particular application
domain, a phenomenon to be studied, a number of descriptive variables and
hypothesises the existence of a (stochastic) relation (or dependency) between
the measurable variables. The definition of the input variables (and where
necessary their transformations) is a very crucial step and is called feature
engineering. It is important to stress here the proactive role played by the
human (in contrast to a tabula rasa approach), and that this role is a necessary
condition for any knowledge process.

Figure 1.2: The modelling process and its decomposition in the preliminary phase
and learning phase.

Figure 1.3: A training set for a simple supervised learning problem with one input
variable x and one output variable y. The dots represent the observed samples.
Experimental design. This step aims to return a dataset which, ideally, should
be made of observations that are well-representative of the phenomenon in
order to maximise the performance of the modelling process [55].
Pre-processing. In this step, raw data are cleaned to make learning easier. Pre-
processing includes a large set of actions on the observed data, such as noise
filtering, outlier removal, missing data treatment [124], feature selection, and
so on.
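As an illustration of these pre-processing actions, the following minimal R sketch (illustrative only; the function name and thresholds are assumptions, not the book's code) cleans a two-column data frame D with an input x and an output y:

## Hypothetical pre-processing of a data frame D with columns x and y
clean <- function(D) {
  D <- na.omit(D)                          # missing data treatment (here: removal)
  z <- as.numeric(scale(D$y))              # standardised output values
  D <- D[abs(z) < 3, ]                     # crude outlier removal (|z| >= 3 dropped)
  D$x <- (D$x - min(D$x)) / diff(range(D$x))  # rescale the input to [0, 1]
  D
}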
Once the preliminary phase has returned the dataset in a structured input/output
form (e.g. a two-column table), called training set, the learning phase begins. A
graphical representation of a training set for a simple learning problem with one
input variable x and one output variable y is given in Figure 1.3. This manuscript
will mostly focus on this second phase assuming that the preliminary steps have
already been performed by the model designer.
Suppose that, on the basis of the collected data, we wish to learn the unknown
dependency existing between the x variable and the y variable. The knowledge of
this dependency could shed light on the observed phenomenon and let us predict
the value of the output y for a given input (e.g. what is the expected weight of a
child who is 120cm tall?). What is difficult and tricky in this task is the finiteness
and the random nature of data. For instance, a second set of observations of the
same pair of variables could produce a dataset (Figure 1.4) that is not identical to
the one in Figure 1.3 though both originate from the same measurable phenomenon.
This fact suggests that a simple interpolation of the observed data would
not produce an accurate model of the underlying phenomenon.
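The following R sketch (with an assumed, purely illustrative data-generating process) mimics this situation: two independent samples of the same stochastic phenomenon produce two different datasets.

## Two realisations of the same stochastic input/output phenomenon
set.seed(1)
f <- function(x) 0.5 * x - 35       # assumed "true" dependency (unknown in practice)
x <- runif(20, 100, 140)            # e.g. heights (cm) of 20 children
y1 <- f(x) + rnorm(20, sd = 3)      # first set of observed weights (kg)
y2 <- f(x) + rnorm(20, sd = 3)      # second set: same phenomenon, different data
## A curve interpolating (x, y1) exactly would fit (x, y2) poorly,
## though both datasets originate from the same process.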
The goal of machine learning is to formalise and optimise the procedure which
brings from data to model and consequently from data to predictions. A learning
procedure can be concisely defined as a search, in a space of possible model config-
urations, of the model which best represents the phenomenon underlying the data.
As a consequence, a learning procedure requires both a search space where possible
solutions may be found and an assessment criterion that measures the quality of
the solutions in order to select the best one.
The search space is defined by the designer using a set of nested classes with
increasing capacity (or representation power). For our introductory purposes, it is
Figure 1.4: A second realisation of the training set for the same phenomenon ob-
served in Figure 1.3. The dots represent the observed examples.
Figure 1.5: Training set and three parametric models which belong to the class of
first-order polynomials.
sufficient to consider here a class as a set of input/output models (e.g. the set of
polynomial models) with the same model structure (e.g. second-order degree) and
the capacity of the class as a measure of the set of input/output mappings which
can be approximated by the models belonging to the class.
Figure 1.5 shows the training set of Figure 1.3 together with three parametric
models which belong to the class of first-order polynomials. Figure 1.6 shows the
same training set with three parametric models, which belong to the class of second-
order polynomials.
The reader could visually decide whether or not the class of second-order models is
more suitable than the first-order class to model the dataset. At the same
time, she could guess which among the three plotted models is the one that produces
the best fit.
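In R, the parametric identification within the two classes can be sketched as follows (the data-generating process is an assumption chosen for illustration, not the book's example):

## Fitting first- and second-order polynomial classes to a training set
set.seed(2)
x <- runif(30, -1, 1)
y <- 1 + x - 2 * x^2 + rnorm(30, sd = 0.2)    # assumed stochastic process
m1 <- lm(y ~ poly(x, degree = 1, raw = TRUE)) # first-order class
m2 <- lm(y ~ poly(x, degree = 2, raw = TRUE)) # second-order class
mean(residuals(m1)^2)   # empirical (training) error of the first-order fit
mean(residuals(m2)^2)   # lower, since the class has larger capacity

Note that the empirical error alone always favours the class with larger capacity, which is precisely why a data-driven criterion of generalisation, and not the fit on the training set, must guide the choice.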
In real high-dimensional settings, however, a visual assessment of the quality
of a model is neither possible nor sufficient. Data-driven quantitative criteria are
therefore required. We will assume that the goal of learning is to achieve a good
statistical generalisation. This means that the learned model is expected to return
an accurate prediction of the dependent (output) variable for new (unseen) values
of the independent (input) variables. By new values we mean values which are not
part of the training set but are generated by the same stochastic process.
Once the classes of models and the assessment criteria are fixed, the goal of
a learning algorithm is to search i) for the best class of models and ii) for the
best parametric model within such a class. Any supervised learning algorithm is
then made of two nested loops denoted as the structural identification loop and the
parametric identification loop.
Figure 1.6: Training set and three parametric models which belong to the class of
second-order polynomials.
Structural identification is the outer loop that seeks the model structure which
is expected to have the best accuracy. It is composed of a validation phase, which
assesses each model structure on the basis of the chosen assessment criterion, and a
selection phase which returns the best model structure on the basis of the validation
output. Parametric identification is the inner loop that returns the best model for a
fixed model structure. We will show that the two procedures are intertwined since
the structural identification requires the outcome of the parametric step in order to
assess the goodness of a class.
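A minimal R sketch of these two nested loops follows (names and data are illustrative assumptions; leave-one-out cross-validation plays here the role of the validation phase):

## Inner loop: parametric identification of a polynomial of degree d,
## assessed by leave-one-out cross-validation.
loo.mse <- function(x, y, d) {
  N <- length(x)
  e <- numeric(N)
  for (i in 1:N) {                  # leave the ith example aside
    m <- lm(y ~ poly(x, degree = d, raw = TRUE),
            data = data.frame(x = x[-i], y = y[-i]))
    e[i] <- y[i] - predict(m, newdata = data.frame(x = x[i]))
  }
  mean(e^2)                         # estimate of the generalisation error
}
## Outer loop: structural identification selects the best class.
set.seed(3)
x <- runif(30, -1, 1)
y <- 1 + x - 2 * x^2 + rnorm(30, sd = 0.2)
G.hat <- sapply(1:5, function(d) loo.mse(x, y, d))
which.min(G.hat)                    # degree of the selected model structure

This toy loop makes visible why the two procedures are intertwined: every structural candidate is scored using the outcome of its own parametric identification.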
Statistical machine learning
On the basis of the previous section, we could argue that learning is nothing more
than a standard problem of optimisation. Unfortunately, the reality is far more
complex. In fact, because of the finite amount of data and their random nature,
there exists a strong correlation between the parametric and structural identification
steps, which makes the problem of assessing and, finally, choosing the
prediction model non-trivial. Moreover, the random nature of the data demands a definition of
the problem in stochastic terms and the adoption of statistical procedures to choose
and assess the quality of a prediction model. In this context, a challenging issue is
how to determine the class of models most appropriate to our problem. Since the
results of a learning procedure are found to be sensitive to the class of models chosen
to fit the data, statisticians and machine learning researchers have proposed over
the years a number of machine learning algorithms. Well-known examples are linear
models, neural networks, local modelling techniques, support vector machines, and
regression trees. The aim of such learning algorithms, many of which are presented
in this book, is to combine high generalisation with an effective learning procedure.
However, the ambition of this handbook is to present machine learning as a sci-
entific domain that goes beyond the mere collection of computational procedures.
Since machine learning is deeply rooted in conventional statistics, any introduc-
tion to this topic must include some introductory chapters to the foundations of
probability, statistics and estimation theory. At the same time, we intend to show
that machine learning widens the scope of conventional statistics by focusing on a
number of topics often overlooked by statistical literature, like nonlinearity, large
dimensionality, adaptivity, optimisation and analysis of massive datasets.
It is important to remark, also, that the recent adoption of machine learning
models is showing the limitation of pure black-box approaches, targeting accuracy
at the cost of interpretability. This is made evident by the embedding of automatic
approaches in decision-making processes with an impact on ethical, social, political, or
juridical aspects. While we are personally skeptical about gaining any interpretabil-
ity from a large number of parameters and hyperparameters underlying a supervised
learner, we are confident that human insight can be obtained by techniques able
to reduce or modularise large variate tasks. In this direction, feature selection and
causal inference techniques are promising approaches to master the complexity of
data-driven modelling and return human-accessible descriptions (e.g. in the form
of mechanisms).
This manuscript aims to find a good balance between theory and practice by
situating most of the theoretical notions in a real context with the help of practical
examples and real datasets. All the examples are implemented in the statistical
programming language R [160] made available by the companion package gbcode
(Appendix F). In this second edition, we provide as well a number of Shiny dash-
boards (Appendix G) to give the reader a more tangible idea of somewhat abstract
concepts. For an introduction to R we refer the reader to [53, 189]. This prac-
tical connotation is particularly important since machine learning techniques are
nowadays more and more embedded in plenty of technological domains, like bioin-
formatics, robotics, intelligent control, speech and image recognition, multimedia,
web and data mining, computational finance, business intelligence.
Outline
The outline of the book is as follows. Chapter 2 is one of the novelties of the second
edition. Its aim is to situate the process of modelling from data in a larger epistemo-
logical domain dealing with the problem of extracting knowledge from observations.
We deem it interesting to show how some of the formal problems addressed in the
book date back to old philosophical disputes and works. Chapter 3 summarises the
relevant background material in probability. Chapter 4 has been added to introduce
graphical modelling, a flexible and interpretable way of representing large variate
problems in probabilistic terms. In particular, this formalism puts into evidence the
importance of conditional independence as a key notion to illustrate the properties
of dependencies and simplify the modelling of large dimensional tasks. Chapter 5 in-
troduces the parametric approach to estimation and hypothesis testing.
Chapter 6 presents some nonparametric alternatives to the parametric techniques
discussed in Chapter 5. Chapter 7 introduces supervised learning as the statistical
problem of assessing and selecting a hypothesis function on the basis of input/output
observations. Chapter 8 reviews the steps which lead from raw observations to a
final model. This is a methodological chapter that introduces some algorithmic
procedures underlying most of the machine learning techniques. Chapter 9 presents
conventional linear approaches to regression and classification. Chapter 10 intro-
duces some machine learning techniques which deal with nonlinear regression and
classification tasks. Chapter 11 presents the model averaging approach, a recent and
powerful way for obtaining improved generalisation accuracy by combining several
learning machines. Chapter 12 deals with the problem of dimensionality reduction
and in particular with feature selection strategies. Chapter 13 has been added in the
2nd edition to make clear the limitations of associational approaches and to stress
the risk of wrong extrapolation and biases if pure statistical results are interpreted
in a causal manner. We believe that causal reasoning represents the ultimate step
in the data analytics process going from data to knowledge.
Although the book focuses on supervised learning, some related notions of un-
supervised learning and density estimation are presented in Appendix A.
1.1 Notations
Throughout this manuscript, boldface denotes random variables and normal font is
used for instances (realisations) of random variables. Strictly speaking, one should
always distinguish in notation between a random variable and its realisation. How-
ever, we will adopt this extra notational burden only when the meaning is not
clear from the context. Then we will use Prob{z} (or p(z)) as a shorthand for
Prob{z = z} (or p_z(z)) when the identity of the random variable is clear from the
context.
As far as variables are concerned, lowercase letters denote scalars or vectors of
observables, greek letters denote parameter vectors, and uppercase denotes matrices.
Uppercase in italics denotes generic sets while uppercase in greek letters denotes
sets of parameters.
Gender-neutral pronoun: computer sciences suffer from the gender issue and
probably much more than other sciences. Of course, you won't find any solution in
this book but the author (a man) felt odd in referring to a generic reader by using
a masculine pronoun only. He then decided to use as much as possible a "(s)he"
notation or, alternatively, a (balanced) random gender choice.
Generic notation
- θ: Parameter vector.
- θ (boldface): Random parameter vector.
- M: Matrix.
- [N × n] or [N, n]: Dimensionality of a matrix with N rows and n columns.
- M^T: Transpose of the matrix M.
- diag[m_1, . . . , m_N]: Diagonal matrix with diagonal [m_1, . . . , m_N].
- M (boldface): Random matrix.
- θ̂: Estimate of θ.
- θ̂ (boldface): Estimator of θ.
- τ: Index in an iterative algorithm.
Probability Theory notation
- Ω: Set of possible outcomes.
- ω: Outcome (or elementary event).
- {E}: Set of possible events.
- E: Event.
- Prob{E}: Probability of the event E.
- (Ω, {E}, Prob{·}): Probabilistic model of an experiment.
- Z: Domain of the random variable z.
- P(z): Probability distribution of a discrete random variable z. Also P_z(z).
- F(z) = Prob{z ≤ z}: Distribution function of a continuous random variable z. Also F_z(z).
- p(z): Probability density of a continuous r.v. Also p_z(z).
- E[z]: Expected value of the random variable z.
- E_x[z] = ∫_X z(x, y) p(x) dx: Expected value of the random variable z averaged over x.
- Var[z]: Variance of the random variable z.
- L_N(θ): Likelihood of a parameter θ given the dataset D_N.
- l_N(θ): Log-likelihood of a parameter θ given the dataset D_N.
- U(a, b): Univariate uniform probability density between a and b ≥ a.
- N(µ, σ²): Univariate Normal probability density with mean µ and variance σ² (Section 3.4.2).
- z ∼ p_z(z): Random variable z with probability density p_z(z).
- z ∼ N(µ, σ²): Random variable z with Normal density with mean µ and variance σ².
Learning Theory notation
- x: Multidimensional random input variable.
- x_j: jth component of the multidimensional input variable.
- X ⊂ R^n: Input space.
- y: Multidimensional output variable.
- Y ⊂ R: Output space.
- x_i: ith observation of the random vector x.
- x_ij: ith observation of the jth component of the random vector x.
- f(x): Target regression function.
- w: Random noise variable.
- z_i = ⟨x_i, y_i⟩: Input-output example (also observation or data point): ith case in the training set.
- N: Number of observed examples in the training set.
- D_N = {z_1, z_2, . . . , z_N}: Training set.
- Λ: Class of hypothesis.
- α: Hypothesis parameter vector.
- h(x, α): Hypothesis function.
- Λ_s: Hypothesis class of capacity (or complexity) s.
- L(y, f(x, α)): Loss function.
- R(α): Functional risk.
- α_0: arg min_{α∈Λ} R(α).
- R_emp(α): Empirical functional risk.
- α_N: Parameter which minimises the empirical risk of D_N.
- G_N: Mean integrated squared error (MISE).
- l: Number of folds in cross-validation.
- Ĝ_cv: Cross-validation estimate of G_N.
- Ĝ_loo: Leave-one-out estimate of G_N.
- N_tr: Number of examples used for training in cross-validation.
- N_ts: Number of examples used for test in cross-validation.
- D^(i): Training set with the ith example set aside.
- α_N^(i): Parameter which minimises the empirical risk of D^(i).
- Ĝ_bs: Bootstrap estimate of G_N.
- D^(b): Bootstrap training set of size N generated from D_N with replacement.
- α^(b): Parameter which minimises the empirical risk of the bootstrap set D^(b).
- B: Number of bootstrap examples.
Data analysis notation
- x_i: i-th row of matrix X.
- x_·j: j-th column of matrix X.
- x_ij: j-th element of vector x_i.
- X_ij: ij-th element of matrix X.
- q: Query point (point in the input space where a prediction is required).
- ŷ_q: Prediction in the query point.
- ŷ_i^(−j): Leave-one-out prediction in x_i with the j-th example set aside.
- e_j^loo = y_j − ŷ_j^(−j): Leave-one-out error with the j-th example set aside.
- K(·): Kernel function.
- B: Bandwidth.
- β: Linear coefficients vector.
- β̂: Least-squares parameter vector.
- β̂^(−j): Least-squares parameter vector with the j-th example set aside.
- h_j(x, α): j-th (j = 1, ..., m) local model in a modular architecture.
- ρ_j: Activation or basis function.
- η_j: Set of parameters of the activation function.
Chapter 2
Setting the foundations: machine learning and epistemology
Machine learning is a relatively new discipline, but its foundations rest on much older notions like modelling, reasoning, information, truth, knowledge, uncertainty, and induction. Nowadays, many of those notions have a mathematical and/or computational interpretation, also thanks to machine learning. Nevertheless, before reaching a mathematical formalisation, they were the object of extensive philosophical inquiry and discussion. The aim of this chapter (primarily inspired by the book [81]) is to provide a rapid historical journey over the most important contributions of philosophy to epistemology, the branch of the philosophy of science that investigates how humans extract and attain knowledge in the scientific process.
The two main phases of human reasoning are the acquisition of true knowledge
and its manipulation in a truth-preserving manner. Induction is concerned with the
first part, while deductive logic addresses the second one. In ancient times, logic
was the only aspect of knowledge that deserved the attention of philosophers and
epistemologists. A possible reason was that, until the scientific revolution, it was a
common belief that either truth was inaccessible (e.g. the allegory of Plato's cave)
or could be attained only by an initiatory process of inspiration, made possible by
the benevolence of God.
2.1 Deductive logic
The most ancient discipline formalising the notions of truth, reasoning, and knowledge is logic, whose origin dates back to Aristotle. Logic is concerned with defining the properties that reasoning mechanisms should have in order to consistently transform true statements into other true statements. The objects of reasoning are arguments, i.e. groups of propositions, where a proposition is a statement that can be either true or false. According to [106], an argument (or inference) is made of two groups of statements, one of which (premises) is claimed to provide support for
the other (conclusions). For instance

If A, then C.
A.
∴ C
is an argument where the group of premises is made of the two propositions ("If A, then C" and "A") and the conclusion is the proposition "C". Premises are the statements that define the evidence, while the conclusion is the statement that the evidence is supposed to imply. An argument consisting of exactly two premises and one conclusion, like the one above, is called a syllogism. If one of the two premises is in conditional form (as in the example above), it is called a hypothetical syllogism.
Logic cannot, in general, tell whether premises are true or false (factual claim). It is instead concerned with the quality of the reasoning process which links premises to conclusions (inferential claim). Its purpose is to develop methods and techniques that allow us to distinguish good arguments (where the premises do support the conclusion) from bad ones. In particular, logic distinguishes between valid and sound arguments. An argument is valid if
• it is logically impossible for the conclusion to be false when the premises are true,
• the conclusion is a logical consequence of (it follows from) the premises,
• it is truth-preserving, i.e. the conclusion is implicitly contained in the premises.
Two examples of valid arguments are
1. Premises: "If A, then C." and "A is true". Conclusion: "C is true".
2. Premises: "Every F is G." and "b is F". Conclusion: "b is G".
Validity is determined by the relationship between premises and conclusion ("do the premises support the conclusion?") and not by the actual truth of premises and/or conclusions¹. It follows that valid arguments are risk-free arguments. Note also that the validity of an argument depends only on its form (or pattern) and not on its content (i.e. no matter what the substitutes for A and C are in the first argument). A valid argument is also called a deductive argument. Examples of deductive arguments are arguments in which the conclusion depends on some arithmetic or geometric computation or a mathematical demonstration. All arguments in pure mathematics are deductive.
¹ With the exception that a deductive argument with true premises and a false conclusion is necessarily invalid.
An argument is sound if it is valid and its premises are true. Soundness in deductive logic has to do with both the validity of the argument and the truth of the premises. Every sound argument, by definition, will have a true conclusion as well. For instance, the argument

All Italians play pretty good football.
Gianluca is Italian.
∴ Gianluca plays pretty good football.

is valid, since the conclusion follows necessarily from the premises, but not sound (otherwise, Gianluca would have been playing for Fiorentina AC).
2.2 Formal and empirical science
Epistemologists are used to distinguishing between formal and empirical sciences. Deductive arguments are the workhorse of formal sciences like geometry and mathematics. Those disciplines are built on a number of axioms taken as true, and
on an effective truth-preserving mechanism. As such, they reason about a conceptual world, not necessarily in relation to the material world, where it is possible to define notions of truth, correctness, and soundness. On the empirical side, we find disciplines like physics, biology, and economics, whose statements are supposed to have a strong relationship with (some aspects of) sensible human experience. Though empirical sciences often rely on formal sciences to define notions, concepts, and models, the validity of an empirical science proposition does not derive exclusively from its formal truth but essentially from the fact that its predictions are in accordance with experimental observations. Empirical sciences then make use of inductive arguments, where the content of the conclusion is in some way intended to go beyond the content of the premises: a typical example is a prediction about a future event based on the observation of some events, i.e. the supervised learning scenario illustrated in Figure 1.1.
Modern empirical science, and the critical analysis of its inductive basis, began around the 16th and 17th centuries, when the demand for new technologies (e.g. for military or exploration reasons) stimulated the inquiry into the origins of knowledge. In 1620, Francis Bacon, an English philosopher (1561-1626), published the Novum Organum, which presented an inductivist view of science. According to Bacon, scientific reasoning consists of making generalisations, or inductions, from observations to general laws of nature (e.g. moving to the conclusion that all swans are white after a number of historical observations). In other terms, the observations are supposed to induce the formulation of natural laws in the mind of the scientist.
2.3 Induction, projection, and abduction
Induction is defined as an inference in which one takes the past as grounds for beliefs about the future, or the observed as grounds for beliefs about the unobserved. In other words, an inductive inference is ampliative, i.e. it has more content in the conclusion than in the premises, unlike logical reasoning, which is deductive and non-ampliative. In inductive inference, the premises or departure points are called data or observations, and the conclusions are referred to as hypotheses². A probabilistic language is usually adopted to express a hypothesis derived from induction. Induction has the following properties that contrast with the deductive pattern of inference [19]:
1. The conclusion (e.g. hypothesis h(D)) follows non-monotonically from the premises (e.g. the dataset D). The addition of an extra premise (i.e. more data) might change the conclusion even when the extra premise does not contradict any of the other premises. In other terms, D_1 ⊂ D_2 ⇏ h(D_1) ⊂ h(D_2), where h(D) is the inductive consequence of the set of observations D.
2. The truth of the premises is not enough to guarantee the truth of the conclusion, as there is no correspondence to the notion of deductive validity.
3. There is an information gain in induction, since a hypothesis asserts more than the data alone.
Another substantial difference is that, while logical arguments derive their validity from their form, this does not apply to inductive arguments: two inductive arguments may have the same form, but one may be good and the other not. So inductive inference is both useful and unsafe: no conclusion is a guaranteed truth, and it can dissolve even if no premise is removed.
² Note that this should not be confused with what mathematicians call mathematical induction, which is a kind of deduction.
There are several forms of inductive arguments:
1. Statement about a sample drawn from a population ⇒ Statement about the population as a whole
2. Statement about a population ⇒ Statement about a sample
3. Statement about a sample ⇒ Statement about a new sample
4. Observation of facts ⇒ Hypothesis
The third form of inference is also called projection [81] and is implemented in statistical learning by memory-based (e.g. lazy learning, Section 10.1.11) or transduction algorithms. The fourth form is also known as abduction, explanatory inference, deduction in reverse, or inference to the best explanation. Abduction is a less ambitious form of induction since it does not infer a generalisation but a hypothesis that explains the data. In abduction, given h → D and the observation of D, we infer the condition h. The rationale is that explanatory considerations are a guide to inference: in other words, the hypothesis that would (if correct) best explain the evidence is the hypothesis that is most likely to be correct. Note that this is the mechanism typically used in statistical hypothesis testing (Section 5.11).
An example of abduction is Darwin's theory. In his time, Darwin inferred the hypothesis of natural selection because, though not entailed by the biological evidence, natural selection would provide the best explanation of that evidence. Darwin did not witness specific cases of evolution but formulated his hypothesis as an explanation of the available observations.
2.4 Hume and the induction problem
The downside of induction's success is its problematic and unsafe aspect, i.e. the projection of regularity onto unseen cases. The main problem of induction is how to justify the inference from the observed (data) to the unobserved (laws of nature), from the past (historical time series) to the future (e.g. prediction).
David Hume (1711-1776) was a Scottish philosopher who studied the problem of induction from a philosophical perspective. In 1739 he published A treatise of human nature, one of the most influential books of Western philosophy. According to Hume, all reasonings concerning nature are founded on experience, and all reasonings from experience are founded on the supposition that the course of nature will continue uniformly the same or, in other terms, that the future will be like the past. Any attempt to show, based on experience, that a regularity that has held in the past will hold in the future too will be circular (since it is based on the principle of regularity itself).
So empirical sciences rely on a supposition that, as shown by Hume, has no logical necessity. In other words, there is no contradiction in supposing that the future could be totally unlike the past (Figure 2.1), since we have no logical reason to expect that the past resembles the future.
So why do humans expect the future to be like the past? According to Hume, this is part of human nature: we have inductive habits, but we cannot justify them. The principle of uniformity of nature is not a priori true, nor can it be proved empirically. There is no reason beyond induction itself to justify inductive reasoning. Thus, Hume offers a naturalistic explanation of the psychological mechanism by which empirical predictions are made, but not any rational justification for this practice. Our inductive practices rest on habit and custom and cannot be justified by rational argument. Induction is psychologically natural to us [81].
Figure 2.1: Falsification of the inductive hypothesis "Are all swans white?"
2.5 Logical positivism and verificationism
Logical positivism is a philosophical movement belonging to the wider family of empiricism, which developed in Europe after World War I and was established by a group of people (including Schlick, Neurath, and Carnap) also known as the Vienna Circle. They were inspired by the developments in the sciences at the beginning of the XXth century, notably the work of Einstein. There are two central ideas (or dogmas) of logical positivism: the distinction between analytic and synthetic sentences and the verifiability theory of meaning [81].
Analytic sentences are true or false whatever the state of the world. Analytical truths (e.g. in mathematics and logic) are necessary but somewhat empty. Mathematics does not describe the world and is independent of experience: it is a convention for using symbols in a particular way.
A synthetic sentence is true or false according to the actual state of the world. The value of synthetic sentences resides then in their method of verification. In other words, knowing the meaning of a sentence boils down to knowing how to verify it through observation. Verificationism is a strong empiricist principle: the only source of knowledge and the only source of meaning is observation. There are two categories of verifiable statements: i) observation statements (e.g. the temperature is below zero), which are directly verifiable, and ii) theoretical statements (indirectly verifiable), from which we can deduce observation statements.
Verificationists reject as "meaningless" statements specific to entire fields such as metaphysics, theology, and ethics, since they do not imply verifiable observations. Such statements may be meaningful in influencing emotions or human behaviour but provide no truth value, information, or factual content.
Science then consists of verifiable, and therefore meaningful, claims. According to the philosophy of logical positivism, a general statement or theory can be arrived at by inductive reasoning. Moreover, if such a theory is verified by observation or experiment, it can be promoted to a law. It follows that verifiability is the criterion of what is and what is not science (demarcation criterion).
Logical positivists stress that almost none of the evidence in everyday life and science can have the same degree of necessity as deductive logic. No evidence for a scientific theory is ultimately decisive, since there is always the possibility of error, but this does not prevent science from being supported by evidence. The great aim of science is to discover and establish generalisations, since there is no alternative to knowledge besides experience.
However, the verificationist ambition of grounding scientific truth in experience encountered some major problems related to the real possibility of verifying hypotheses in practice:
1. Pure observations do not exist: observations are always theory-laden, i.e. they are inevitably affected by the theoretical beliefs (or expectations) of the investigator. Observations are neither neutral nor exhaustive, even in a big data world. To observe means to select what seems to be pertinent for the analysis, and this demands a specific and voluntary action from the experimenter (e.g. the selection of the instrumentation or of the language used to communicate the results). Unfortunately, the analyst is often unaware of such selection, which then induces dangerous bias in the possible conclusions (Section 13.7.4).
2. No scientific assumption is testable in complete isolation (also known as the problem of holism about testing): the dogma of verificationism is naive since, in practice, only whole complex structured hypotheses may be submitted to empirical tests. Our ideas and hypotheses have contact with experience only as a whole. Whenever we assess a theory by comparing it with observations, we need many additional assumptions to put a theoretical statement at the same level as the observations.
3. Unobservable entities escape verification: one of the basic claims of logical positivists is that all aspects of science can be reduced to observational statements and submitted to verification (in science, there are no depths; there is surface everywhere). However, many successful and universally accepted scientific formulations rely on hidden structures and notions that are not directly observable (or not mapped to observations in a univocal manner). Consider, for instance, the notions of gene or electron and the significant impact they have on the human understanding of reality.
Such criticisms contributed to the decline of the positivist program and opened the
way to alternative interpretations of the knowledge discovery process.
2.6 Popper and the problem of induction
Karl Popper (1902-1994) is generally regarded as one of the greatest philosophers of science of the 20th century. His first achievement was an original definition of science based on the distinction between scientific and pseudo-scientific statements (also known as the demarcation problem). The solution he proposed is called falsificationism, in opposition to the verificationism of the positivists. Falsificationism claims that a hypothesis is scientific if and only if it has the potential to be refuted by some possible observation. To be scientific, a hypothesis has to entail testable predictions; in other words, it has to be bold, to take a risk. For instance, "All F is G" is a scientific statement while "Some F is G" is not. All scientific theories are universal in nature, and no finite collection of observation statements, however great, is logically equivalent to or can justify an unrestricted universal proposition. At the same time, we are never completely sure that a theory is true (aka fallibilism). A well-known example is Newton's physics, which was long considered a gold standard of scientific theory until it was shown to be false in several respects.
Popper was sceptical about all forms of confirmation, notably about the theory of confirmation proposed by empiricists. According to him, the only good reasoning is deductively valid reasoning. According to Popper, humans or scientists do not make inductions; they make conjectures (or hypotheses) and test them (or their logical consequences obtained by deduction). If the test is successful, the conjecture is corroborated but never verified or proven. Confirmation is thus a myth: no theory or belief about the world can be proven. Though no number of positive experimental outcomes can demonstrate the truth of a scientific theory, a single genuine counter-instance can refute it (modus tollens). It follows that we learn something by deduction and not by induction. If the empirical test of the conjecture is not successful, the conjecture is refuted. The refutation of a hypothesis leads us (or the scientific community) to revise it or devise a more robust one. The final result is that scientific laws are falsifiable yet strictly unverifiable.
Scientific knowledge evolves via a two-step cycle that repeats endlessly: the first stage consists of conjecture making; the second stage is attempted refutation, when the hypothesis is submitted to critical testing. The most important qualities of a scientist are then imaginative (almost artistic) creativity and rigorous testing.
Also, according to Popper, there are no "pure" or theory-free observations. Observation is always selective: it needs a chosen object, a definite task, an interest, a point of view, a problem. Observation is theory-laden and involves applying theoretical terms, descriptive language, and a conceptual scheme to particular experimental situations.
2.7 The hypothetico-deductive method and instrumentalism
Nowadays, the most commonly agreed vision of science (hypothetico-deductivism) merges the main ideas of logic and induction, of realism and empiricism, of verificationism and falsificationism. According to this vision, science is a process where scientists formulate hypotheses (e.g. an inductive step after a preliminary stage where observations were collected) and then deduce observational predictions from them. If predictions are accurate, then the theory is supported (in agreement with logical positivists) or (e.g. in Bayesian terms) its degree of truth increases. If predictions are not accurate, the theory is disconfirmed (this is coherent with Popper). The more tests a theory passes, the more confidence we can have in its truth³.
³ Note also that the predominant use of probabilistic hypotheses to take noisy observations into account is in contradiction with Popper's restrictive vision of logical deduction.
If the value of scientific models is intimately related to the quality of their predictions, they should be seen more as useful tools (or instruments) than as faithful representations of reality. The notion of instrumentalism was introduced by Van Fraassen [184]. An instrumentalist does not worry about whether a theory is a true description of the world (e.g. whether electrons really exist). The role of a theory is to establish good predictions. The question of whether our theory has some deeper match in the real world will never have an answer, so we should stop asking it.
Van Fraassen thinks that the only aim of theories is to accurately describe the observable parts of the world. If this happens, they are empirically adequate. Trying to address the hidden nature of reality is of no interest to science.
2.8 Epistemology and machine learning: the cross-fertilisation
This chapter sketched some contributions of epistemology to the understanding of
how humans extract and attain knowledge from observations, in particular during
the scientific endeavour.
Machine learning, the topic of this book, is a computationally based approach aiming to produce knowledge from observed data. If we make the basic assumption that both epistemology and machine learning refer to the same notion of knowledge (i.e. knowledge useful for human beings), an epistemological approach can be useful to understand both the limits and the potential of machine learning. The author is convinced that a fruitful cross-fertilisation can derive from a stronger synergy between epistemology and machine learning. In particular, he expects the following contributions from a machine learning approach to the study of knowledge discovery:
• machine learning deals intimately with induction, i.e. how observations can induce and/or confirm a theory, one of the most fundamental problems of the philosophy of science. Also, it implements in a reproducible and testable way the mechanisms of learning, hypothesis generation, and testing.
• machine learning is today unavoidable in supporting discovery in scientific domains where human experts would be overwhelmed by complexity and dimensionality.
• machine learning is a key factor in the revolution transforming all empirical sciences into data sciences, i.e. inductive disciplines where the quality and the accuracy of the discoveries are strictly dependent on the capacity to extract accurate information, predictions, or models from large amounts of observed data.
• machine learning generalises and democratises the notion of observed evidence by making it converge with the notion of data. Every instrument (or tool or simulator) producing data can be taken as the starting point of a knowledge discovery process. This extends the common notion of experimental evidence adopted in conventional sciences, like physics. A financial transaction, a tweet, or a GPS trace may be, for some domains, as informative as a multi-million CERN experiment in physics.
• machine learning is the ultimate step in the scientific process moving from the optimistic objective of finding true descriptions of reality to the more realistic goal of attaining accurate models of observations.
At the same time, there are a number of lessons that young data scientists could learn from ancient and recent philosophers of science:
• A critical analysis of the role of observations and data: all empirical sciences derive their justification from the fact of being firmly founded on experiments. The distinctive nature of machine learning, and a reason for its success, is the automatic process of extracting knowledge from data. Observations and data are then necessary conditions for triggering any knowledge discovery procedure. There is, however, the risk of sanctifying the role of data (or facts) as an unquestionable and objective foundation of truth. This excess has been discussed and criticised several times by epistemologists (notably by the critics of logical positivism). Pure facts and theory-neutral observations do not exist, not even in a big-data world. Observations (and more specifically experiments) are never passive or beyond suspicion: they are the results of a specific human initiative (or intervention) that can be dictated by specific objectives, constraints, and motivations. The presupposition that the truth of empirical statements can be securely established only by observation is a naive attitude that could lead to disastrous consequences (e.g. sexist or racist AI applications due to sampling bias) [41].
• Skepticism about induction: Hume's analysis, confirmed by theoretical results in machine learning (notably the no-free-lunch theorem), reminds us that a tabula rasa approach going from data to knowledge is not possible. There is no univocal (or optimal) way of proceeding from observations to models, since every learning process relies on (explicit or, more often, implicit) assumptions. This is also related to the notion of underdetermination of theory by evidence, which means that there will always be a range of alternative theories compatible with the observations.
• Importance of hypothesis generation and validation: this important lesson comes straight from Popper and associates the scientific character of a knowledge discovery process with the possibility of falsification. In that sense, machine learning complies with Popper's interpretation of science and goes further by proposing a set of strategies for automatically generating hypotheses and validating them against empirical evidence. In more current terms, the best way to ensure falsifiability in the computational sciences is reproducibility and interpretability. These two aspects are essential to guarantee respect for high standards of quality and rigour in computational approaches to knowledge discovery. Forgetting the assumptions underlying any data-driven effort may lead to accepting biased conclusions and misinterpretations (e.g. from a causal perspective), which are dangerously endorsed by the size of the dataset or the complexity of the algorithmic approach.
• Models as tools: the adoption of complex representations of reality (though characterised by high-level notions and principles) makes the validation of all the components of a model difficult, if not unrealistic. As a consequence, a model should not be considered a faithful copy of reality but a convenient abstraction which, if confirmed by experimental validation, becomes a useful instrument for prediction and decision making.
• The confirmation of a hypothesis requires taking into account the procedures involved in generating the data: the confirmation of a hypothesis by observations is not a go/no-go process. Since new evidence changes degrees of validity (or degrees of belief), a probabilistic approach is necessary. This is why any introduction to machine learning first needs an introduction to probability and probabilistic reasoning, and then to statistics.
Chapter 3
Foundations of probability
Uncertainty is inescapable in the real world. Even without resorting to indeterminism, its pervasiveness is due to the complexity of reality and the limitations of human observational skills and modelling capabilities. According to [119], uncertainty arises because of limitations in our ability to observe the world, limitations in our ability to model it, and possibly even because of innate nondeterminism. Probability theory is one of many disciplines [143] concerned with the study of uncertain (or random) phenomena. It is also, according to the author, one of the most successful ones in terms of formalisation, theoretical and algorithmic developments, and practical applications. For this reason, in this book we will adopt probability as the mathematical language to describe and quantify uncertainty. Uncertain phenomena, although not predictable in a deterministic fashion, may present some regularities and consequently be described mathematically by idealised probabilistic models. These models consist of a list of all possible outcomes together with the respective probabilities. The theory of probability makes it possible to infer from these models the patterns of future behaviour.
This chapter presents the basic notions of probability which serve as a necessary
background to understand the statistical aspects of machine learning. We ask the
reader to become acquainted with two aspects: the notion of a random variable
as a compact representation of uncertain knowledge and the use of probability as
an effective formal tool to manipulate and process such uncertain information. In
particular, we suggest the reader give special attention to the notions of conditional
and joint probability. As we will see in the following, these two related notions
are extensively used by statistical modelling and machine learning to define the
dependence and the relationships between random variables.
3.1 The random model of uncertainty
We define a random experiment as any action or process which generates results or observations that cannot be predicted with certainty. Uncertainty stems from the existence of alternatives. In other words, each uncertain phenomenon is characterised by a multiplicity of possible configurations or outcomes. Weather is uncertain since it can take multiple forms (e.g. sunny, rainy, cloudy, ...). Other examples of random experiments are tossing a coin, rolling dice, passing an exam or measuring the time to reach home.
A random experiment is then characterised by a sample space Ω, that is, a (finite or infinite) set of all the possible outcomes (or configurations) ω of the experiment. The elements of the set Ω are called experimental outcomes or realisations. For example, in the die experiment, Ω = {ω_1, ω_2, ..., ω_6} and ω_i stands for the outcome corresponding to getting the face with the number i. If ω is the outcome of a measurement of some physical quantity, e.g. pressure, then we could have Ω = R^+.
The representation of an uncertain phenomenon is the result of a modelling activity and, as such, it is not necessarily unique. In other terms, different representations of a random experiment are possible. In the die experiment, we could define an alternative sample space made of two sole outcomes: numbers equal to 1 and numbers different from 1. Also, we could be interested in representing the uncertainty of two consecutive tosses. In that case, the outcome would be the pair (ω^(t), ω^(t+1)), where ω^(t) is the outcome at time t.
Uncertainty stems from variability. Each time we observe a random phenomenon, we may observe different outcomes. In probabilistic jargon, observing a random phenomenon is interpreted as the realisation of a random experiment. A single performance of a random experiment is called a trial. This means that after each trial, we observe one outcome ω_i ∈ Ω.
A subset of experimental outcomes is called an event. Consider a trial that generated the outcome ω_i: we say that an event E occurred during the trial if the set E contains the element ω_i. For example, in the die experiment, an event (denoted odd number) is the set of odd values E = {ω_1, ω_3, ω_5}. This means that when we observe the outcome ω_5, the event odd number takes place.
An event composed of a single outcome, e.g. E = {ω_1}, is called an elementary event.
Note that since events E are subsets, we can apply to them the terminology of set theory:
• Ω refers to the certain event, i.e. the event that occurs in every trial.
• The notation
E^c = {ω : ω ∉ E}
denotes the complement of E.
• The notation
E_1 ∪ E_2 = {ω ∈ Ω : ω ∈ E_1 OR ω ∈ E_2}
refers to the event that occurs when E_1 or E_2 or both occur.
• The notation
E_1 ∩ E_2 = {ω ∈ Ω : ω ∈ E_1 AND ω ∈ E_2}
refers to the event that occurs when both E_1 and E_2 occur.
• Two events E_1 and E_2 are mutually exclusive or disjoint if
E_1 ∩ E_2 = ∅    (3.1.1)
that is, each time that E_1 occurs, E_2 does not occur.
• A partition of Ω is a set of disjoint sets E_j, j = 1, ..., J (i.e. E_{j1} ∩ E_{j2} = ∅ for all j_1 ≠ j_2) such that
∪_{j=1}^J E_j = Ω
• Given an event E, we define the indicator function of E by
I_E(ω) = 1 if ω ∈ E, 0 if ω ∉ E    (3.1.2)
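These set-theoretic operations are straightforward to reproduce numerically. The following minimal R sketch (the two event definitions are illustrative and not taken from the text) encodes the die sample space and two events as vectors of outcomes and applies the operations above, including the indicator function (3.1.2):

# Die experiment: sample space and events encoded as vectors of outcomes
Omega <- 1:6
E1 <- c(1, 3, 5)                 # illustrative event: "odd number"
E2 <- c(4, 5, 6)                 # illustrative event: "4, 5 or 6"
union(E1, E2)                    # E1 OR E2
intersect(E1, E2)                # E1 AND E2
setdiff(Omega, E1)               # complement of E1 in Omega
# Indicator function of an event (Equation (3.1.2))
I.E <- function(omega, E) as.integer(omega %in% E)
I.E(5, E1)                       # 1: the outcome 5 realises the event "odd number"
I.E(2, E1)                       # 0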
Let us consider now the notion of a class of events. Not every arbitrary collection of subsets of Ω qualifies as a class of events. We require that if E_1 and E_2 are events, the same also holds for the intersection E_1 ∩ E_2 and the union E_1 ∪ E_2. A set of events that satisfies these conditions is called, in mathematical terms, a Borel field [142]. We will consider only Borel fields, since we want to deal not only with the probabilities of single events but also with the probabilities of their unions and intersections.
3.1.1 Axiomatic definition of probability
Probability is a measure of uncertainty. Once a random experiment is defined, this measure associates a number between 0 and 1 to each possible outcome ω. It follows that we can assign to each event E a real number Prob{E} ∈ [0, 1], which denotes the probability of the event E. The measure associated with the event including all possibilities is 1. The function Prob{·} : 2^Ω → [0, 1] is called a probability measure or probability distribution and must satisfy the following three axioms:
1. Prob{E} ≥ 0 for any E.
2. Prob{Ω} = 1.
3. Prob{E_1 ∪ E_2} = Prob{E_1} + Prob{E_2} if E_1 and E_2 are mutually exclusive (Equation (3.1.1)).
These conditions are known as the axioms of the theory of probability [120]. The first axiom states that all probabilities are nonnegative real numbers. The second axiom attributes a probability of unity to the universal event Ω, thus providing a normalisation of the probability measure. The third axiom states that the probability function must be additive for disjoint events, consistently with the intuitive idea of how probabilities behave.
So, from a mathematician's perspective, probability is easy to define: it is a countably additive set function defined on a Borel field, with a total mass of one. Every probabilistic property, for instance E_1 ⊂ E_2 ⇒ Prob{E_1} ≤ Prob{E_2} or Prob{E^c} = 1 − Prob{E}, can be derived directly or indirectly from the axioms (and only the axioms).
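As a minimal illustration (the probability values below are arbitrary, not taken from the text), a probability measure on a finite sample space can be encoded in R as a vector of outcome probabilities; event probabilities then follow by additivity, and the derived properties above can be checked numerically:

# A probability measure on a finite sample space (biased die, arbitrary values)
Omega <- 1:6
p <- c(0.1, 0.1, 0.2, 0.2, 0.2, 0.2)     # nonnegative and summing to one (axioms 1-2)
P <- function(E) sum(p[Omega %in% E])    # event probability by additivity (axiom 3)
E1 <- c(1, 3); E2 <- c(1, 3, 5)          # E1 is a subset of E2
P(E1) <= P(E2)                           # TRUE: monotonicity
isTRUE(all.equal(P(setdiff(Omega, E1)), 1 - P(E1)))  # TRUE: complement rule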
There are many interpretations and justifications of these axioms; we briefly discuss the frequentist and the Bayesian interpretations in Section 3.1.4. What is relevant here is that the probability function is a formalisation of uncertainty and that most of its properties and results appear to be coherent with the human perception of uncertainty [110].
3.1.2 Visualisation of probability measures
Since probabilistic events are sets of outcomes, Venn diagrams are a convenient
manner to illustrate the relations between events and the notion of probability
measure. Suppose that you are a biker and you are interested in representing
the variability of weather and traffic conditions in your town in the morning. In
particular, you are interested in the probability that the morning will be sunny (or
not) and the road busy (or not). In order to formalise your practical issue, you
could define the uncertainty about the morning state by defining a sample space
which is the set of all possible morning conditions. Two events are of interest here:
sunny mornings and traffic conditions. What is the relationship between these two events, and what are their probabilities? Figure 3.1 illustrates the sample space, the two events, and the (hypothetical) probability measures by means of a Venn diagram and two different tabular representations. The three representations in Figure 3.1 convey the same information in different manners. Notwithstanding, they do not necessarily scale up in the same manner if we take into consideration a larger number of events.
Figure 3.1: Visualisation of two events and probability measures: Venn diagram
(left), two-way table (center), probability distribution table (right)
Figure 3.2: Visualisation of three events and related probability measures: Venn
diagram (left), probability distribution table (right)
For instance, for n events the Venn diagram should contain all 2^n hypothetically possible zones¹.
¹ See Wikipedia: https://en.wikipedia.org/wiki/Venn_diagram
Suppose that you are also interested in another type of event, i.e. the air quality. Adding such an event to your probability representation would make your Venn representation more complicated and the two-way table inadequate (Figure 3.2). The visualisation becomes still more difficult to handle and interpret if we deal with more than three events.
Given their difficulty in encoding information in realistic probabilistic settings, Venn diagrams are a pedagogical yet very limited tool for representing uncertainty.
Now that the notion of probability has been introduced, a major question remains open: how to compute the probability value Prob{E} for a generic event E? The assignment of probabilities is perhaps the most difficult aspect of constructing probabilistic models. Although the theory of probability is neutral, that is, it can make inferences regardless of the actual probability values, its results will be strongly affected by the choice of a particular assignment. This means that if the assignments are inaccurate, the predictions of the model will be misleading and will not reflect the real behaviour of the modelled phenomenon. In the following sections, we present some procedures that are typically adopted in practice.
3.1.3 Symmetrical definition of probability
Consider a random experiment where the sample space is made of a finite number M of symmetric outcomes (i.e., they are equally likely to occur). Let the number of outcomes that are favourable to the event E (i.e. the event E takes place if one of them occurs) be M_E.
An intuitive definition of the probability (also known as the classical definition) of the event E, which adheres to the axioms, is
Prob{E} = M_E / M    (3.1.3)
In other words, according to the principle of indifference (a term popularised by J.M. Keynes in 1921), the probability of an event equals the ratio of its favourable outcomes to the total number of outcomes, provided that all outcomes are equally likely [142]. The computation of this quantity requires combinatorial methods for counting the favourable outcomes. This is typically the approach adopted for a fair die. Also, in most cases, the symmetry hypothesis is accepted as self-evident: if a ball is selected at random from a bowl containing W white balls and B black balls, the probability that we select a white one is W/(W + B).
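A minimal R sketch of the classical definition (the urn counts below are illustrative): the probability of an event is obtained by counting favourable outcomes, as in Equation (3.1.3).

# Classical definition: favourable outcomes over total outcomes (Eq. (3.1.3))
Omega <- 1:6
E <- c(1, 3, 5)                  # event "odd number" on a fair die
length(E) / length(Omega)       # 0.5
W <- 3; B <- 7                   # illustrative urn: 3 white and 7 black balls
W / (W + B)                      # 0.3: probability of drawing a white ball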
Note that this number is determined without any experimentation and is based on symmetry and finite-space assumptions. But how can we be sure that the symmetry hypothesis holds, and that it is invariant? Think, for instance, of the probability that a newborn is a boy. Is this a symmetric case? More generally, how would one define the probability of an event if the symmetry hypothesis does not necessarily hold or the space is not finite?
3.1.4 Frequentist definition of probability
Let us consider a random experiment and an event E. Suppose we repeat the experiment N times and that we record the number of times N_E that the event E occurs. The quantity
N_E / N    (3.1.4)
comprised between 0 and 1, is known as the relative frequency of E. It can be observed that if the experiment is carried out a large number of times under exactly the same conditions, the frequency converges to a fixed value for increasing N. This observation led von Mises to use the notion of frequency as a foundation for the notion of probability.
Definition 1.1 (von Mises). The probability Prob{E} of an event E is the limit
Prob{E} = lim_{N→∞} N_E / N    (3.1.5)
where N is the number of observations and N_E is the number of times that E occurred.
This definition appears reasonable, and it is compatible with the axioms in Section 3.1.1. However, in practice, in any physical experiment the number N is finite², and the limit has to be accepted as a hypothesis, not as a number that can be determined experimentally [142].
² As Keynes said, "In the long run we are all dead".
Moreover, the assumption under exactly the same conditions is not as innocuous as it seems. How could you ensure that two experiments occur under exactly the same conditions? And what do those conditions refer to? Temperature, humidity, obsolescence of the equipment? Are humans really able to control all of them exactly? Would you be able to reproduce the exact same conditions of an experiment?
Notwithstanding, the frequentist interpretation is very important to show the links between theory and application. At the same time, it appears inadequate to represent probability when it is used to model a subjective degree of belief. Think, for instance, of the probability that your professor wins a Nobel Prize: how would one define in such a case a number N of repetitions?
An important alternative interpretation of the probability measure comes then from the Bayesian approach. This approach proposes a degree-of-belief interpretation of probability, according to which Prob{E} measures an observer's strength of belief that E is or will be true [192]. This manuscript will not cover the Bayesian
approach to statistics and data analysis for the sake of compactness, though the
author is well aware that Bayesian machine learning approaches are more and more
common and successful. Readers interested in the foundations of the Bayesian in-
terpretation of probability are referred to [110]. Readers interested in introductions
to Bayesian machine learning are referred to [78, 13].
3.1.5 The Law of Large Numbers
A well-known justification of the frequentist approach is provided by the Weak Law of Large Numbers, proposed by Bernoulli.
Theorem 1.2. Let Prob{E} = p and suppose that the event E occurs N_E times in N trials. Then N_E/N converges to p in probability, that is, for any ε > 0,
Prob{ |N_E/N − p| ≤ ε } → 1   as N → ∞
According to this theorem, the ratio N_E/N is close to p in the sense that, for any ε > 0, the probability that |N_E/N − p| ≤ ε tends to 1 as N → ∞. This result justifies the widespread use of the frequentist approach (e.g. in Monte Carlo simulation) to illustrate or numerically solve probability problems. The relation between frequency and probability is illustrated by the Shiny dashboard lawlarge.R (package gbcode).
Note that this result does not imply that the number N_E will be close to Np, as one could naively infer from (3.1.5). In fact,
Prob{N_E = Np} ≈ 1/√(2πNp(1 − p)) → 0   as N → ∞    (3.1.6)
For instance, in a fair coin-tossing game, this law does not imply that the absolute difference between the number of heads and tails should oscillate close to zero [180] (Figure 3.3). On the contrary, it could happen that the absolute difference keeps growing (though at a slower rate than the number of tosses), as shown in the R script freq.R and the Shiny dashboard lawlarge.R.
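The behaviour in Figure 3.3 can be reproduced with a few lines of R. The following minimal sketch is written in the spirit of the freq.R script from gbcode (it is not that script itself): it simulates fair coin tosses and tracks both the relative frequency of heads, which converges to 0.5, and the absolute difference between heads and tails, which may keep growing.

# Simulated fair coin tossing: relative frequency vs absolute difference
set.seed(0)
N <- 1e5
tosses <- sample(c(0, 1), N, replace = TRUE)   # 1 = head, 0 = tail
heads <- cumsum(tosses)
n <- 1:N
rel.freq <- heads / n                          # converges to p = 0.5
abs.diff <- abs(heads - (n - heads))           # |#heads - #tails| may keep growing
par(mfrow = c(1, 2))
plot(n, rel.freq, type = "l", log = "x", ylab = "relative frequency of heads")
abline(h = 0.5, lty = 2)
plot(n, abs.diff, type = "l", ylab = "|#heads - #tails|")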
3.1.6 Independence and conditional probability
Let us consider two different events. We have already introduced the notions of complementary and disjoint events. Another important definition is that of independent events and the related notion of conditional probability. This notion is essential in machine learning, since supervised learning aims to detect and model (in)dependencies by estimating conditional probabilities.
Definition 1.3 (Independent events). Two events E_1 and E_2 are independent if and only if
Prob{E_1 ∩ E_2} = Prob{E_1} Prob{E_2}    (3.1.7)
and we write E_1 ⊥⊥ E_2.
The probability Prob{E_1 ∩ E_2} of seeing two events occurring together is also known as the joint probability and is often noted Prob{E_1, E_2}. If two events are independent, the joint probability depends only on the two individual probabilities. As an example of two independent events, think of two outcomes of a roulette wheel or of two coins tossed simultaneously.
From an uncertain-reasoning perspective, independence is a very simplistic assumption, since the occurrence (or the observation) of one event has no influence on the occurrence of the other or, similarly, the second event has no memory of the first.
Figure 3.3: Fair coin-tossing random experiment: evolution of the relative frequency
(left) and of the absolute difference (right) between the number of heads and tails
(R script freq.R in gbcode).
In other words, independence expresses the uncertainty of a complex joint event as a function of the uncertainties of its components³. This makes the reasoning much simpler but, at the same time, too rough.
Exercise
Suppose that a fair die is rolled and that the number ω appears. Let E_1 be the event that the number ω is even, E_2 the event that the number ω is greater than or equal to 3, and E_3 the event that the number ω is a 4, 5 or 6.
Are the events E_1 and E_2 independent? Are the events E_1 and E_3 independent? (A numerical check is sketched below.)
•
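A sketch of how the exercise can be checked by enumeration in R, exploiting the fact that the six outcomes are equally likely (the reader should first derive the answers by hand):

# Checking independence by enumeration on the fair-die sample space
Omega <- 1:6
P <- function(E) length(E) / length(Omega)   # equally likely outcomes
indep <- function(A, B) isTRUE(all.equal(P(intersect(A, B)), P(A) * P(B)))
E1 <- c(2, 4, 6)   # even number
E2 <- 3:6          # greater than or equal to 3
E3 <- 4:6          # a 4, 5 or 6
indep(E1, E2)      # answers the first question
indep(E1, E3)      # answers the second question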
Let E_1 be an event such that Prob{E_1} > 0 and E_2 a second event. We define the conditional probability of E_2, given that E_1 has occurred, as the revised probability of E_2 after we learn of E_1's occurrence:
Definition 1.4 (Conditional probability). If Prob{E_1} > 0, then the conditional probability of E_2 given E_1 is
Prob{E_2|E_1} = Prob{E_1 ∩ E_2} / Prob{E_1}    (3.1.8)
The following result derives from the definition of conditional probability.
Lemma 1. If E_1 and E_2 are independent events, then
Prob{E_1|E_2} = Prob{E_1}    (3.1.9)
In qualitative terms, the independence of two events means that observing (or knowing) that one of these events (e.g. E_1) occurred does not change the probability that the other (e.g. E_2) will occur.
³ We refer the interested reader to the distinction between extensional and intensional reasoning in [147]. Extensional reasoning (e.g. logic) always makes an assumption of independence, while intensional reasoning (e.g. probability) considers independence as an exception.
Example
Let E_1 and E_2 be two disjoint events with positive probability. Can they be independent? The answer is no, since
Prob{E_1 ∩ E_2} = Prob{∅} = 0 ≠ Prob{E_1} Prob{E_2} > 0
or, equivalently, Prob{E_1|E_2} = 0. We can interpret this result by noting that if two events are disjoint, the realisation of one of them is highly informative about the realisation of the other. For instance, though it is very probable that Italy will win the next football World Cup (Prob{E_1} >> 0), this probability goes to zero if the (rare yet possible) event E_2 ("World Cup won by Belgium") occurs (Prob{E_1|E_2} = 0). The two events are then dependent.
•
Exercise
Let E_1 and E_2 be two independent events, and E_1^c the complement of E_1. Are E_1^c and E_2 independent?
•
Exercise
Consider the sample space Ω and the two events E_1 and E_2 in Figure 3.4. Suppose that the probability of the two events is proportional to the surface of the regions. From the figure we compute
Prob{E_1} = 9/100 = 0.09    (3.1.10)
Prob{E_2} = 20/100 = 0.2    (3.1.11)
Prob{E_1 ∩ E_2} = 1/100 = 0.01 ≠ Prob{E_1} Prob{E_2}    (3.1.12)
Prob{E_1 ∪ E_2} = 0.28 = Prob{E_1} + Prob{E_2} − Prob{E_1 ∩ E_2}    (3.1.13)
Prob{E_1|E_2} = 1/20 = 0.05 ≠ Prob{E_1}    (3.1.14)
Prob{E_2|E_1} = 1/9 ≈ 0.11 ≠ Prob{E_2}    (3.1.15)
and then derive the following conclusions: the events E_1 and E_2 are neither disjoint nor independent. Also, it is more probable that E_2 occurs given that E_1 occurred rather than the opposite. (A numerical verification is sketched below.)
•
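The computations (3.1.10)-(3.1.15) can be verified with a few lines of R (a minimal sketch; the three surface proportions are the ones read off Figure 3.4):

# Verifying (3.1.10)-(3.1.15) from the surface proportions of Figure 3.4
P.E1 <- 9 / 100; P.E2 <- 20 / 100; P.E12 <- 1 / 100   # joint: Prob{E1 AND E2}
P.E12 == P.E1 * P.E2            # FALSE: the events are not independent
P.E1 + P.E2 - P.E12             # 0.28 = Prob{E1 OR E2}
P.E12 / P.E2                    # 0.05 = Prob{E1 | E2}
P.E12 / P.E1                    # 0.111... = Prob{E2 | E1}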
From (3.1.8) we derive
Prob{E_1, E_2} = Prob{E_1} Prob{E_2|E_1}    (3.1.16)
If we replace the event E_2 with the intersection of two events E_2 and E_3, from (3.1.16) we obtain
Prob{E_1, E_2, E_3} = Prob{E_1} Prob{E_2, E_3|E_1} = Prob{E_1} Prob{E_2|E_3, E_1} Prob{E_3|E_1} = Prob{E_1, E_3} Prob{E_2|E_3, E_1}
If we divide both terms by Prob{E_3}, we obtain
Prob{E_1, E_2|E_3} = Prob{E_1|E_3} Prob{E_2|E_1, E_3}    (3.1.17)
which is the conditioned version of (3.1.16).
Figure 3.4: Events in a sample space.
3.1.7 The chain rule
Equation (3.1.16) shows that a joint probability can be factorised as the product of a conditional and an unconditional probability. In more general terms, the following rule holds.
Definition 1.5 (Chain rule). For any sequence of events E_1, E_2, ..., E_n,
Prob{E_1, E_2, ..., E_n} = Prob{E_1} Prob{E_2|E_1} Prob{E_3|E_1, E_2} · · · Prob{E_n|E_1, E_2, ..., E_{n−1}}
We will see in Chapter 4 that the chain rule factorisation and the notion of conditional independence play a major role in the adoption of graphical models to represent probability distributions.
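A quick numerical sanity check of the chain rule for n = 3 (a minimal sketch using an arbitrary, randomly generated joint distribution over three binary events, where index 1 codes the occurrence of an event):

# Chain rule check on a random joint distribution of three binary events
set.seed(1)
p <- array(runif(8), dim = c(2, 2, 2))
p <- p / sum(p)                          # normalised joint Prob{E1, E2, E3}
lhs <- p[1, 1, 1]                        # Prob{E1, E2, E3}
P1 <- sum(p[1, , ])                      # Prob{E1}
P2.1 <- sum(p[1, 1, ]) / P1              # Prob{E2 | E1}
P3.12 <- p[1, 1, 1] / sum(p[1, 1, ])     # Prob{E3 | E1, E2}
isTRUE(all.equal(lhs, P1 * P2.1 * P3.12))  # TRUE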
3.1.8 The law of total probability and Bayes' theorem
Let us consider an indeterminate practical situation where a set of events E_1, E_2, ..., E_k may occur. Suppose that no two such events may occur simultaneously, but at least one of them must occur. This means that E_1, E_2, ..., E_k are mutually exclusive and exhaustive or, in other terms, that they form a partition of Ω. The following two theorems can be proven.
Theorem 1.6 (Law of total probability). Let Prob{E_i}, i = 1, ..., k denote the probability of the i-th event E_i and Prob{E|E_i}, i = 1, ..., k the conditional probability of a generic event E given that E_i has occurred. It can be shown that
Prob{E} = Σ_{i=1}^k Prob{E|E_i} Prob{E_i} = Σ_{i=1}^k Prob{E ∩ E_i}    (3.1.18)
The quantity Prob{E} is referred to as the marginal probability and denotes the probability of the event E irrespective of the occurrence of other events. A common-sense interpretation of this theorem is that, if an event E (e.g. an effect) depends on the realisation of k disjoint events (e.g. causes), the probability of observing E is a weighted average of the single conditional probabilities Prob{E|E_i}, where the weights are given by the marginal probabilities of each event E_i, i = 1, ..., k. For instance, we can compute the probability that the highway is busy once we know the probability that an accident occurred or not (two disjoint events) and the conditional probabilities of traffic given the occurrence (or not) of an accident.
Theorem 1.7 (Bayes' theorem). The conditional ("inverse") probability of any E_i, i = 1, ..., k, given that E has occurred, is given by
Prob{E_i|E} = Prob{E|E_i} Prob{E_i} / Σ_{j=1}^k Prob{E|E_j} Prob{E_j} = Prob{E, E_i} / Prob{E},   i = 1, ..., k    (3.1.19)
It follows that Bayes' theorem is the only sound way to derive from a conditional probability Prob{E_2|E_1} its inverse
Prob{E_1|E_2} = Prob{E_2|E_1} Prob{E_1} / Prob{E_2}    (3.1.20)
Any alternative derivation (or shortcut) will inevitably lead to fallacious reasoning and inconsistent results (see the prosecutor's fallacy discussion in Section 3.1.9).
It may be useful also to write a conditional version of the law of total probability. Given an event E_0 and the set E_1, E_2, ..., E_k of mutually exclusive events:
Prob{E|E_0} = Σ_{i=1}^k Prob{E|E_i, E_0} Prob{E_i|E_0}    (3.1.21)
From (3.1.20), by conditioning on a third event E_3, we obtain a conditional version of Bayes' theorem
Prob{E_1|E_2, E_3} = Prob{E_2|E_1, E_3} Prob{E_1|E_3} / Prob{E_2|E_3}    (3.1.22)
as long as Prob{E_2|E_3} > 0.
Example
Suppose that k = 2 and
• E_1 is the event: "Tomorrow is going to rain".
• E_2 is the event: "Tomorrow is not going to rain".
• E is the event: "Tonight is chilly and windy".
The knowledge of Prob{E_1}, Prob{E_2} and Prob{E|E_k}, k = 1, 2, makes possible the computation of Prob{E_k|E}, as sketched numerically below.
•
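The example can be made concrete with a short R sketch (all the probability values below are hypothetical, chosen only for illustration): the law of total probability (3.1.18) gives the marginal Prob{E}, and Bayes' theorem (3.1.19) the inverse probabilities.

# Bayes' theorem for the rain example (hypothetical values)
P.E1 <- 0.3                 # Prob{rain tomorrow}
P.E2 <- 1 - P.E1            # Prob{no rain}: E1 and E2 form a partition
P.E.g.E1 <- 0.8             # Prob{chilly and windy tonight | rain}
P.E.g.E2 <- 0.2             # Prob{chilly and windy tonight | no rain}
P.E <- P.E.g.E1 * P.E1 + P.E.g.E2 * P.E2   # total probability (3.1.18)
P.E.g.E1 * P.E1 / P.E                      # Prob{E1 | E} ~ 0.63 (Bayes, (3.1.19))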
Exercise
Verify the validity of the law of total probability and of Bayes' theorem for the problem in Figure 3.5.
•
Figure 3.5: Events in a sample space.
3.1.9 Direct and inverse conditional probability
The notion of conditional probability is central in probability and machine learning, but it is often prone to dangerous misunderstandings, for instance when inappropriately used in domains like the medical sciences or law. The most common error consists of taking a conditional probability Prob{E_1|E_2} for its inverse Prob{E_2|E_1}. This is also known as the prosecutor's fallacy, as discussed in an example later.
The first important element to keep in mind is that, for any fixed E_1, the quantity Prob{·|E_1} still satisfies the axioms of probability, i.e. the function Prob{·|E_1} is itself a probability measure. Conditional probabilities are probabilities [27]. However, this does not generally hold for Prob{E_1|·}, which corresponds to fixing the term E_1 on the left of the conditional bar. For instance, if E_2, E_3 and E_4 are disjoint events, we have
Prob{E_2 ∪ E_3 ∪ E_4|E_1} = Prob{E_2|E_1} + Prob{E_3|E_1} + Prob{E_4|E_1}
in agreement with the third axiom (Section 3.1.1), but
Prob{E_1|E_2 ∪ E_3 ∪ E_4} ≠ Prob{E_1|E_2} + Prob{E_1|E_3} + Prob{E_1|E_4}
Also, it is generally not the case that Prob{E_2|E_1} = Prob{E_1|E_2}. As a consequence, if E_1 and E_2 are not independent, then
Prob{E_1^c|E_2} = 1 − Prob{E_1|E_2}
but
Prob{E_1|E_2^c} ≠ 1 − Prob{E_1|E_2}    (3.1.23)
where E^c denotes the complement of E.
Another remarkable property of conditional probability, which is also a distinctive aspect of probabilistic reasoning, is its non-monotonicity. Given an unconditional probability Prob{E_1} > 0 a priori, we cannot say anything about the conditional term Prob{E_1|E_2}. This term can be larger than, equal to or smaller than Prob{E_1}. For instance, if observing the event E_2 makes the event E_1 more (less) probable, then Prob{E_1|E_2} > Prob{E_1} (Prob{E_1|E_2} < Prob{E_1}). If the two events are independent, then the probability of E_1 does not change by conditioning. It follows that the degree of belief in an event (or statement) depends on the context. Note that this does not apply to conventional logical reasoning, where the validity of a statement is context-independent.
Figure 3.6: Italians and football fans (nested regions within the world population).
In more general terms, it is possible to say that any probability statement is conditional,
since it has been formulated on the basis of some (often implicit) background
knowledge K. For instance, if we say that the probability of the event E = "rain
tomorrow" is Prob{E} = 0.9, we are implicitly taking into consideration the season,
our location and probably the weather today. So we should rather write it as
Prob{E|K} = 0.9. As succinctly stated in [27], all probabilities are conditional, and
conditional probabilities are probabilities.
Exercise
Consider as sample space Ω the set of all human beings. Let us define two events:
the set E1 of Italians and the set E2 of football supporters. Suppose that the prob-
ability of the two events is proportional to the surface of the regions in Figure 3.6.
Are these events disjoint? Are they independent? What about Prob {E1|E2 } and
Prob {E2|E1 } ? Are they equal? If not, which one is the largest?
•
The prosecutor fallacy
Consider the following story: A crime occurs in a big city (1M of inhabitants), and
a deteriorated DNA trace of the murderer is collected. The DNA profile matches
the profile of a person in a police database. A geneticist is contacted, and she states
that the probability of finding a person with the same DNA profile is one out of
100 thousand (i.e. 1e−5). The prosecution lawyer asks for a conviction with
the following argument: "since the chance of finding an innocent man with such
characteristics is so tiny, the probability that he is innocent is tiny as well".
The jury is impressed and ready to proceed with a life sentence. Then the defence
replies: "Do you know that the population of the city is 1M? So the average number
of persons matching such a DNA profile is 10. The accused's chance of being innocent
is not so tiny, since it is 9/10 and not one in 100000." Lacking any additional evidence,
the suspect is acquitted.
This short story is inspired by a number of real court cases that were confronted
with the serious error of confounding direct and inverse conditional probability [169].
The impact of such false reasoning is so relevant in law that it is known as
the Prosecutor's fallacy: a common flaw of reasoning whenever the probability of the
collected evidence, under the hypothesis that the accused is innocent, is tiny.
Let us analyse in probabilistic terms the fallacious reasoning that occurred in
the example above. Let us consider a criminal case for which we have 10 suspects, i.e.
the responsible person and 9 innocent persons (out of a population of one million)
matching the DNA profile. The probability of matching evidence (M) given that
someone is innocent (I) is very low:
$$\text{Prob}\{M \mid I\} = \frac{9}{999999} \approx 1e{-}5$$
However, what is relevant here is not the probability of the evidence given that the
suspect is innocent (Prob{M|I}) but the probability that he is innocent given the evidence:
$$\text{Prob}\{I \mid M\} = \frac{\text{Prob}\{M \mid I\}\,\text{Prob}\{I\}}{\text{Prob}\{M\}} = \frac{9/999999 \times 999999/1000000}{10/1000000} = 9/10.$$
We can rephrase the issue in the following frequentist terms. Given N inhabitants,
m persons with matching DNA profiles and a single murderer, the following
table shows the distribution of persons:

                Match     No match
    Innocent    m − 1     N − m
    Guilty      1         0
From the table above, it is easy to derive the inconsistency of the prosecutor fallacy
reasoning, since
$$\text{Prob}\{M \mid I\} = \frac{m-1}{N-1} \approx \text{Prob}\{M\} = \frac{m}{N}$$
$$\text{Prob}\{I \mid M\} = \frac{m-1}{m} \gg \text{Prob}\{M \mid I\}$$
•
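A minimal R sketch of the computation above (the population size and the number of matching profiles are those of the story):

    N <- 1e6                           # city population
    m <- 10                            # persons matching the DNA profile (1 guilty, 9 innocent)
    p_M_given_I <- (m - 1) / (N - 1)   # Prob{M|I}: about 1e-5
    p_I_given_M <- (m - 1) / m         # Prob{I|M}: 9/10
    c(p_M_given_I, p_I_given_M)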
3.1.10 Logics and probabilistic reasoning
This section aims to present some interesting relationships between logic deduction
and probabilistic reasoning.
First, we show that we can write down a probabilistic version of the deductive
modus ponens rule of propositional logic (Section 2.1):
If E1 ⇒ E2 and E1 is true, then E2 is true as well.
Since E1 ⇒ E2 is equivalent in set terms to E1 ⊂ E2 we obtain
$$\text{Prob}\{E_2 \mid E_1\} = \frac{\text{Prob}\{E_1, E_2\}}{\text{Prob}\{E_1\}} = \frac{\text{Prob}\{E_1\}}{\text{Prob}\{E_1\}} = 1$$
i.e. a translation of the modus ponens argument in the probabilistic language.
Interestingly enough, probability theory also provides a result in the case where E2 is
true. It is well known that in propositional logic, if E1 ⇒ E2 and E2 is
true, then nothing can be inferred about E1. Probability theory is more informative,
since in this case we may derive from E1 ⊂ E2 that
$$\text{Prob}\{E_1 \mid E_2\} = \frac{\text{Prob}\{E_1\}}{\text{Prob}\{E_2\}} \ge \text{Prob}\{E_1\}$$
Note that this is a probabilistic formulation of the abduction principle (Section 2.3).
In other words, probability supports the following common-sense reasoning: if both
E1 ⇒ E2 and E2 apply, then the conditional probability of E1 (i.e. the probability
of E1 once we know that E2 occurred) cannot be smaller than the unconditional
probability (i.e. the probability of E1 if we knew nothing about E2 ).
Also the properties of transitivity and inverse modus ponens hold in probability.
Let us consider three events E1 ,E2 ,E3 . The transitivity principle in logics states
that
If E1 ⇒ E2 and E2 ⇒ E3 then E1 ⇒ E3
In probabilistic terms we can rewrite E1 ⇒ E2 as
Prob {E2|E1 } = 1
and E2 ⇒ E3 as
Prob {E3|E2 } = 1
respectively. From the law of total probability (Equation (3.1.18)) we obtain
$$\text{Prob}\{E_3 \mid E_1\} = \text{Prob}\{E_3 \mid E_2^c, E_1\}\,\underbrace{\text{Prob}\{E_2^c \mid E_1\}}_{0} + \underbrace{\text{Prob}\{E_3 \mid E_2, E_1\}}_{1}\,\underbrace{\text{Prob}\{E_2 \mid E_1\}}_{1} = 1$$
Inverse modus ponens in logics states that
If E1 ⇒ E2 then ¬E2 ⇒ ¬E1
In probabilistic terms, from Prob{E2|E1} = 1 it follows that
$$\text{Prob}\{E_1^c \mid E_2^c\} = 1 - \text{Prob}\{E_1 \mid E_2^c\} = 1 - \frac{\underbrace{\text{Prob}\{E_2^c \mid E_1\}}_{0}\,\text{Prob}\{E_1\}}{\text{Prob}\{E_2^c\}} = 1$$
These results show that deductive logic rules can be seen as limiting cases of probabilistic
reasoning and confirm the compatibility of probabilistic reasoning with human
common sense.
3.1.11 Combined experiments
So far we assumed that all the events belong to the same sample space. However,
the most interesting use of probability concerns combined (or multivariate) random
experiments whose sample space
$$\Omega = \Omega_1 \times \Omega_2 \times \dots \times \Omega_n$$
is the Cartesian product of the spaces Ωi, i = 1, . . . , n. For instance, if we want to
study the probabilistic dependence between the height and the weight of a child, we
define a joint sample space
$$\Omega = \{(w, h) : w \in \Omega_w, h \in \Omega_h\}$$
made of all pairs (w, h), where Ωw is the sample space of the random experiment
describing the weight and Ωh is the sample space of the random experiment describing
the height.
Note that all the properties studied so far also hold for events that do not belong
to the same univariate sample space. For instance, given a combined experiment
Ω = Ω1 × Ω2 two events E1 ∈ Ω1 and E2 ∈ Ω2 are independent iff Prob {E1|E2 } =
Prob {E1}.
Some examples of real problems modelled by random combined experiments are
presented in the following.
Gambler's fallacy
Consider a fair coin-tossing game. The outcome of two consecutive tosses can be
considered independent. Now, suppose that we observe a sequence of 10 consecutive
tails. We could be tempted to think that the chances that the next toss will be head
are now very large. This is known as the gambler's fallacy [180]. In fact, witnessing
a very rare event (like 10 consecutive tails) does not imply that the probability of
the next outcome will change, or that it will suddenly become dependent on the
past.
•
Example [192]
Let us consider a medical study about the relationship between the outcome of a
medical test and the presence of a disease. We model this study as a combination
of two random experiments:
1. the random experiment which models the state of the patient. Its sample
space is Ωs ={ H, S } where H and S stand for a healthy and a sick patient,
respectively.
2. the random experiment which models the outcome of the medical test. Its
sample space is Ωo = {+, −} where + and − stand for a positive and a
negative outcome of the test, respectively.
The dependency between the state of the patient and the outcome of the test
can be studied in terms of conditional probability.
Suppose that out of 1000 patients, 108 respond positively to the test and that,
among them, 9 turn out to be affected by the disease. Also, among the 892 patients
who responded negatively to the test, only 1 is sick. According to the frequentist
interpretation, the probabilities of the joint events Prob{Es, Eo} can be approximated
according to expression (3.1.5) by
              Es = S             Es = H
    Eo = +    9/1000 = .009      (108 − 9)/1000 = .099
    Eo = −    1/1000 = .001      (892 − 1)/1000 = .891
Doctors are interested in answering the following questions. What is the proba-
bility of having a positive (negative) test outcome when the patient is sick (healthy)?
What is the probability of being in front of a sick (healthy) patient when a positive
(negative) outcome is obtained? From the definition of conditional probability we
derive
$$\text{Prob}\{E_o = + \mid E_s = S\} = \frac{\text{Prob}\{E_o = +, E_s = S\}}{\text{Prob}\{E_s = S\}} = \frac{.009}{.009 + .001} = .9$$
$$\text{Prob}\{E_o = - \mid E_s = H\} = \frac{\text{Prob}\{E_o = -, E_s = H\}}{\text{Prob}\{E_s = H\}} = \frac{.891}{.891 + .099} = .9$$
According to these figures, the test appears to be accurate. Does this mean that
we should be scared if we test positive? Though the test is accurate, the answer is
negative, as shown by the quantity
$$\text{Prob}\{E_s = S \mid E_o = +\} = \frac{\text{Prob}\{E_o = +, E_s = S\}}{\text{Prob}\{E_o = +\}} = \frac{.009}{.009 + .099} \approx .08$$
This example confirms that sometimes humans tend to confound Prob {E s|E o } with
Prob {E o|E s } and that the most intuitive response is not always the right one (see
example in Section 3.1.9).
•
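One possible R transcription of the computation, using the joint probabilities of the table above:

    joint <- matrix(c(0.009, 0.099,
                      0.001, 0.891),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(outcome = c("+", "-"), state = c("S", "H")))
    joint["+", "S"] / sum(joint[, "S"])   # Prob{Eo=+|Es=S} = 0.9
    joint["-", "H"] / sum(joint[, "H"])   # Prob{Eo=-|Es=H} = 0.9
    joint["+", "S"] / sum(joint["+", ])   # Prob{Es=S|Eo=+} ~ 0.08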
3.1.12 Array of joint/marginal probabilities
Let us consider the combination of two random experiments whose sample spaces are
ΩA ={A1 , ··· , An } and ΩB ={B1 , ··· , Bm } , respectively. Assume that for each
pair of events (Ai , Bj ), i = 1, . . . , n ,j = 1, . . . , m we know the joint probability value
Prob {Ai , Bj } . The joint probability array contains all the necessary information
for computing all marginal and conditional probabilities by means of (3.1.18) and
(3.1.8).
                B1              B2              ···   Bm              Marginal
    A1          Prob{A1, B1}    Prob{A1, B2}    ···   Prob{A1, Bm}    Prob{A1}
    A2          Prob{A2, B1}    Prob{A2, B2}    ···   Prob{A2, Bm}    Prob{A2}
    ...         ...             ...             ···   ...             ...
    An          Prob{An, B1}    Prob{An, B2}    ···   Prob{An, Bm}    Prob{An}
    Marginal    Prob{B1}        Prob{B2}        ···   Prob{Bm}        Sum = 1
where $\text{Prob}\{A_i\} = \sum_{j=1}^{m}\text{Prob}\{A_i, B_j\}$ and $\text{Prob}\{B_j\} = \sum_{i=1}^{n}\text{Prob}\{A_i, B_j\}$.
Using an entry of the joint probability matrix and the sum of the corresponding
row/column, we may use (3.1.8) to compute the conditional probability as shown
in the following example.
Example: dependent/independent scenarios
Let us model the commute time to go back home for a ULB student living in
St. Gilles as a random experiment. Suppose that its sample space is Ωt = { LOW,
MEDIUM, HIGH}. Consider also an (extremely :-) random experiment representing
the weather in Brussels, whose sample space is Ωw = {G = GOOD, B = BAD}.
Suppose that the array of joint probabilities is

                G (in Bxl)       B (in Bxl)       Marginal
    LOW         0.15             0.05             Prob{LOW} = 0.2
    MEDIUM      0.1              0.4              Prob{MEDIUM} = 0.5
    HIGH        0.05             0.25             Prob{HIGH} = 0.3
    Marginal    Prob{G} = 0.3    Prob{B} = 0.7    Sum = 1
According to the above probability function, is the commute time dependent on
the weather in Bxl? Note that if the weather is good,

                 LOW                MEDIUM             HIGH
    Prob{·|G}    0.15/0.3 = 0.5     0.1/0.3 ≈ 0.33     0.05/0.3 ≈ 0.17

while if the weather is bad,

                 LOW                MEDIUM             HIGH
    Prob{·|B}    0.05/0.7 ≈ 0.07    0.4/0.7 ≈ 0.57     0.25/0.7 ≈ 0.36

Since Prob{·|G} ≠ Prob{·|B}, i.e. the probability of having a certain commute time
changes according to the value of the weather, the relation (3.1.9) is not satisfied.
Consider now the dependency between an event representing the commute time
and an event describing the weather in Rome.
                G (in Rome)      B (in Rome)      Marginal
    LOW         0.18             0.02             Prob{LOW} = 0.2
    MEDIUM      0.45             0.05             Prob{MEDIUM} = 0.5
    HIGH        0.27             0.03             Prob{HIGH} = 0.3
    Marginal    Prob{G} = 0.9    Prob{B} = 0.1    Sum = 1
Our question now is: is the commute time dependent on the weather in Rome?
If the weather in Rome is good, we obtain

                 LOW               MEDIUM             HIGH
    Prob{·|G}    0.18/0.9 = 0.2    0.45/0.9 = 0.5     0.27/0.9 = 0.3
E1        E2         E3     P(E1, E2, E3)
CLEAR RISING DRY 0.4
CLEAR RISING WET 0.07
CLEAR FALLING DRY 0.08
CLEAR FALLING WET 0.10
CLOUDY RISING DRY 0.09
CLOUDY RISING WET 0.11
CLOUDY FALLING DRY 0.03
CLOUDY FALLING WET 0.12
Table 3.1: Joint probability distribution of the three-variable probabilistic model of
the weather
while if the weather in Rome is bad,

                 LOW               MEDIUM             HIGH
    Prob{·|B}    0.02/0.1 = 0.2    0.05/0.1 = 0.5     0.03/0.1 = 0.3
Note that the probability of a commute time event does NOT change according to
the value of the weather in Rome, e.g. Prob { LOW| B} = Prob { LOW} . Try to
answer now the following question. If you would like to predict the commute time
in Brussels, which event would return more information on it: the weather in Rome
or in Brussels?
•
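The derivation of the conditional terms from a joint probability array can be sketched in R; the numbers below are those of the Brussels table above:

    joint_bxl <- matrix(c(0.15, 0.05,
                          0.10, 0.40,
                          0.05, 0.25),
                        nrow = 3, byrow = TRUE,
                        dimnames = list(time = c("LOW", "MEDIUM", "HIGH"),
                                        weather = c("G", "B")))
    joint_bxl[, "G"] / sum(joint_bxl[, "G"])   # Prob{.|G}
    joint_bxl[, "B"] / sum(joint_bxl[, "B"])   # Prob{.|B}
    rowSums(joint_bxl)                         # marginal Prob{.}: differs from both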
Example: three sample spaces
Consider a probabilistic model of the day's weather based on the combination of
the following random descriptors where
1. the first represents the sky condition and its sample space is Ωs = {CLEAR,
CLOUDY}.
2. the second represents the barometer trend and its sample space is Ωb = {RISING,
FALLING}.
3. the third represents the humidity in the afternoon and its sample space is
Ωh ={ DRY,WET}.
Let the joint probability values be given by Table 3.1. From the joint values we
can calculate the probabilities P(CLEAR, RISING) = 0.47 and P(CLOUDY) = 0.35,
and the conditional probability value
$$P(\text{DRY} \mid \text{CLEAR}, \text{RISING}) = \frac{P(\text{DRY}, \text{CLEAR}, \text{RISING})}{P(\text{CLEAR}, \text{RISING})} = \frac{0.40}{0.47} \approx 0.85$$
Take the time now to compute other probabilities yourself: for instance, what is
the probability of having a cloudy sky in wet conditions? Does a rising barometer
increase this probability or not? Is the event "clear sky and falling barometer"
independent of the event "dry weather"?
•
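One possible way to carry out these computations in R, with the joint values of Table 3.1:

    w <- expand.grid(sky = c("CLEAR", "CLOUDY"),
                     baro = c("RISING", "FALLING"),
                     hum = c("DRY", "WET"))
    w$p <- c(0.40, 0.09, 0.08, 0.03, 0.07, 0.11, 0.10, 0.12)
    p_cr <- sum(w$p[w$sky == "CLEAR" & w$baro == "RISING"])       # 0.47
    p_cloudy <- sum(w$p[w$sky == "CLOUDY"])                       # 0.35
    sum(w$p[w$sky == "CLEAR" & w$baro == "RISING" &
            w$hum == "DRY"]) / p_cr                               # ~0.85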
3.2 Random variables
Machine learning and statistics are concerned with numeric data and measurements
while so far we have mainly been dealing with categories. What is then the link
between the notion of random experiment and data? The answer is provided by the
concept of random variable.
Consider a random experiment and the associated triple (Ω, {E}, Prob{·}). Suppose
that we have a mapping rule z : Ω → Z ⊂ R such that we can associate with
each experimental outcome ω a real value z = z(ω) in the domain Z. We say that
z is the value taken by the random variable z when the outcome of the random
experiment is ω. Henceforth, in order to clarify the distinction between a random
variable and its value, we will use boldface notation for denoting a random
variable (as in z) and normal face notation for the eventually observed value
(as in z = 11).
Since there is a probability associated with each event E and we have a mapping
from events to real values, a probability distribution can be associated with z.
Definition 2.1 (Random variable). Given a random experiment (Ω, {E}, Prob{·}),
a random variable z is the result of a mapping z : Ω → Z that assigns a number z
to every outcome ω. This mapping must satisfy the following two conditions:
• the set {z ≤ z} is an event for every z;
• the probabilities
$$\text{Prob}\{z = \infty\} = 0, \qquad \text{Prob}\{z = -\infty\} = 0$$
Given a random variable z ∈ Z and a subset I ⊂ Z, we define the inverse
mapping
$$z^{-1}(I) = \{\omega \in \Omega \mid z(\omega) \in I\} \qquad (3.2.24)$$
where z^{-1}(I) ∈ {E} is an event. On the basis of the above relation, we can associate
a probability measure to z according to
$$\text{Prob}\{z \in I\} = \text{Prob}\{z^{-1}(I)\} = \text{Prob}\{\omega \in \Omega \mid z(\omega) \in I\} \qquad (3.2.25)$$
$$\text{Prob}\{z = z\} = \text{Prob}\{z^{-1}(z)\} = \text{Prob}\{\omega \in \Omega \mid z(\omega) = z\} \qquad (3.2.26)$$
In other words, a random variable is a numerical quantity, linked to some experiment
involving some degree of randomness, which takes its value from some set Z
of possible real values. The notion of r.v. formalises the notion of numeric measurement,
which is indeed a mapping between an event (e.g. your body temperature)
and a number (e.g. in the range Z = {35, . . . , 41}) returned by the thermometer.
Another experiment might be the rolling of two six-sided dice, and the r.v. z might
be the sum (or the maximum) of the two numbers showing on the dice. In this case,
the set of possible values is Z = {2, . . . , 12} (or Z = {1, . . . , 6}).
Example
Suppose that we have to decide when to go home and watch Fiorentina AC playing
the Champions League final match against Anderlecht. In order to make such a
decision, a quantity of interest is the (random) commute time z for getting from
ULB to home. Our personal experience is that this time is a positive number that
is not constant: for example, z1 = 10 minutes, z2 = 23 minutes, z3 = 17 minutes,
where zi is the time taken on the ith day of the week. The variability of this quantity
is related to a complex random process with a large sample space Ω (depending, for
example, on the weather condition, the weekday, the sports events in town, and so
on). The probabilistic approach uses a random variable to represent this uncertainty
and considers each measure zi as the consequence of a random outcome ωi. The use
of a random variable z to represent the commute time then becomes a compact (and
approximate) way of modelling the disparate set of causes underlying the uncertainty
of this phenomenon. Whatever its limits, the probabilistic representation provides
us with a computational way to decide when to leave if we want to bound the
probability of missing the start of the game.
•
3.3 Discrete random variables
The probability (mass) function of a discrete r.v. z is the combination of
1. the countable set Z of values that the r.v. can take (also called the range),
2. the set of probabilities associated with each value of Z.
This means that we can attach to the random variable some specific mathematical
function Pz(z) that gives for each z ∈ Z the probability that z assumes the
value z:
$$P_z(z) = \text{Prob}\{z = z\} \qquad (3.3.27)$$
This function is called the probability function or probability mass function. Note that
henceforth we will use P(z) as a shorthand for Prob{z = z} when the identity of the
random variable is clear from the context.
As depicted in the following example, the probability function can be tabulated
for a few sample values of z. If we toss a fair coin twice, and the random variable
z is the number of heads that eventually turn up, the probability function can be
tabulated as follows:

    Values of the random variable z    0       1       2
    Associated probabilities           0.25    0.50    0.25
3.3.1 Parametric probability function
Sometimes the probability function is not precisely known but can be expressed as
a function of z and a quantity θ. An example is the discrete r.v. z that takes its
value from Z = {1, 2, 3} and whose probability function is
$$P_z(z, \theta) = \frac{\theta^{2z}}{\theta^2 + \theta^4 + \theta^6}$$
where θ is some fixed nonzero real number.
Whatever the value of θ, Pz(z) > 0 for z = 1, 2, 3 and Pz(1) + Pz(2) + Pz(3) = 1.
Therefore z is a well-defined random variable, even if the value of θ is unknown.
We call θ a parameter, that is, some constant, usually unknown, involved in the
analytical expression of a probability function. We will see in the following that the
analytical expression of a probability function. We will see in the following that the
parametric form is a convenient way to formalise a family of probabilistic models
and that the problem of estimation can be seen as a parameter identification task.
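A short R sketch of the parametric probability function above; the value of θ is an arbitrary assumption:

    pmf <- function(z, theta) theta^(2 * z) / (theta^2 + theta^4 + theta^6)
    theta <- 0.7          # arbitrary nonzero value
    p <- pmf(1:3, theta)
    p                     # all values positive
    sum(p)                # equal to 1, whatever theta is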
3.3.2 Expected value, variance and standard deviation of a
discrete r.v.
Though the probability function Pz provides a complete description of the uncertainty
of z, it is often not practical to use, since it requires keeping in mind (or in
memory) as many values as the size of Z. Therefore, it is more convenient to deal
with some compact representation of Pz obtained by computing a functional (i.e.
a function of a function) of Pz . The most common single-number summary of the
distribution Pz is the expected value which is a measure of central tendency4.
Definition 3.1 (Expected value). The expected value of a discrete random variable
z is
$$E[\mathbf{z}] = \mu = \sum_{z \in Z} z\,P_z(z) \qquad (3.3.28)$$
assuming that the sum is well-defined.
An interesting property of the expected value is that it is the value which minimises
the squared deviation:
$$\mu = \arg\min_{m} E[(\mathbf{z} - m)^2] \qquad (3.3.29)$$
Note that the expected value is not necessarily a value that belongs to the domain
Z of the random variable. It is also important to remark that while the term
mean is used as a synonym of expected value, this is not the case for the term
average. We will discuss in detail the difference between mean and sample average
in Section 5.3.2.
Example [180]
Let us consider a European roulette with numbers 0, 1, . . . , 36, where the number
0 is considered as winning for the house. The gain of a player who places a 1$ bet
on a single number is a random variable z whose sample space is Z = {−1, 35}. In
other words, only two outcomes are possible: either the player loses 1$ (z1 = −1)
with probability p1 = 36/37, or wins 35$ (z2 = 35) with probability p2 = 1/37.
The expected gain is then
$$E[\mathbf{z}] = p_1 z_1 + p_2 z_2 = p_1 \cdot (-1) + p_2 \cdot 35 = -36/37 + 35/37 = -1/37 \approx -0.027$$
This means that while casinos gain on average 2.7 cents for every staked dollar,
players on average give away 2.7 cents (however sophisticated their betting
strategy is).
•
A common way to summarise the spread of a distribution is provided by the
variance.
Definition 3.2 (Variance). The variance of a discrete random variable z is
$$\text{Var}[\mathbf{z}] = \sigma^2 = E[(\mathbf{z} - E[\mathbf{z}])^2] = \sum_{z \in Z}(z - E[\mathbf{z}])^2\,P_z(z)$$
The variance is a measure of the dispersion of the probability function of the
random variable around its mean μ. Note that the following relation holds
$$\sigma^2 = E[(\mathbf{z} - E[\mathbf{z}])^2] = E[\mathbf{z}^2 - 2\mathbf{z}E[\mathbf{z}] + (E[\mathbf{z}])^2] \qquad (3.3.30)$$
$$= E[\mathbf{z}^2] - (E[\mathbf{z}])^2 = E[\mathbf{z}^2] - \mu^2 \qquad (3.3.31)$$
whatever the probability function of z is. Figure 3.7 illustrates two example discrete
r.v. probability functions that have the same mean but different variance. Note
that the variance Var[z] does not have the same dimension as the values of z: for
instance, if z is measured in the unit [m], Var[z] is expressed in the unit [m]².
The standard deviation is a measure of spread that has the same dimension as z.
An alternative measure of spread is E[|z − μ|], but this quantity is less used, since
it is more difficult to manipulate analytically than the variance.
⁴This concept was first introduced in the 17th century by C. Huygens in order to study games of chance.
Figure 3.7: Two discrete probability functions with the same mean and different
variance
Definition 3.3 (Standard deviation) . The standard deviation of a discrete random
variable z is the positive square root of the variance.
$$\text{Std}[\mathbf{z}] = \sqrt{\text{Var}[\mathbf{z}]} = \sigma$$
Example
Let us consider a binary random variable z ∈ Z = {0, 1} where Pz(1) = p, 0 ≤ p ≤ 1,
and Pz(0) = 1 − p. In this case
$$E[\mathbf{z}] = p \cdot 1 + 0 \cdot (1 - p) = p \qquad (3.3.32)$$
$$E[\mathbf{z}^2] = p \cdot 1 + 0 \cdot (1 - p) = p \qquad (3.3.33)$$
$$\text{Var}[\mathbf{z}] = E[\mathbf{z}^2] - (E[\mathbf{z}])^2 = p - p^2 = p(1 - p) \qquad (3.3.34)$$
•
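These relations can be checked by simulation, as in the following R sketch, where the value of p is an arbitrary assumption:

    set.seed(0)
    p <- 0.3
    z <- rbinom(1e5, size = 1, prob = p)
    c(mean(z), p)             # sample average vs E[z] = p
    c(var(z), p * (1 - p))    # sample variance vs Var[z] = p(1-p)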
Definition 3.4 (Moment). For any positive integer r, the r-th moment of the probability
function is
$$\mu_r = E[\mathbf{z}^r] = \sum_{z \in Z} z^r\,P_z(z) \qquad (3.3.35)$$
Note that the first moment coincides with the mean µ , while the second moment
is related to the variance according to Equation (3.3.30). Higher-order moments
provide additional information, other than the mean and the spread, about the
shape of the probability function.
Definition 3.5 (Skewness). The skewness of a discrete random variable z is defined
as
$$\gamma = \frac{E[(\mathbf{z} - \mu)^3]}{\sigma^3} \qquad (3.3.36)$$
Skewness is a parameter that describes asymmetry in a random variable's prob-
ability function. Probability functions with positive skewness have long tails to the
right, and functions with negative skewness have long tails to the left (Figure 3.8).
Definition 3.6 (Kurtosis). The kurtosis of a discrete random variable z is defined
as
$$\kappa = \frac{E[(\mathbf{z} - \mu)^4]}{\sigma^4} \qquad (3.3.37)$$
Kurtosis is always positive. Its interpretation is that the probability function
of a distribution with large kurtosis has fatter tails, compared with the probability
function of a distribution with smaller kurtosis.
Figure 3.8: A discrete probability function with positive skewness (left) and one
with negative skewness (right).
3.3.3 Entropy and relative entropy
Definition 3.7 (Entropy). Given a discrete r.v. z, the entropy of the probability
function Pz(z) is defined by
$$H(\mathbf{z}) = -\sum_{z \in Z} P_z(z)\log P_z(z)$$
H(z) is a measure of the unpredictability of the r.v. z. Suppose that there are
M possible values for the r.v. z. The entropy is maximised (and takes the value
log M) if Pz(z) = 1/M for all z. It is minimised iff P(z) = 1 for a single value of z
(i.e. all other probability values are null).
Although entropy, like variance, measures the uncertainty of a r.v., it differs
from the variance since it depends only on the probabilities of the different values
and not on the values themselves. In other terms, H can be seen as a function of
the probability function Pz rather than of z.
Let us now consider two different discrete probability functions on the same set
of values,
$$P_0 = P_{z_0}(z), \qquad P_1 = P_{z_1}(z)$$
where P0(z) > 0 if and only if P1(z) > 0. The relative entropies (or Kullback-Leibler
divergences) associated with these two functions are
$$H(P_0 \| P_1) = \sum_z P_0(z)\log\frac{P_0(z)}{P_1(z)} = \sum_z P_0(z)\log P_0(z) - \sum_z P_0(z)\log P_1(z) \qquad (3.3.38)$$
$$H(P_1 \| P_0) = \sum_z P_1(z)\log\frac{P_1(z)}{P_0(z)} = \sum_z P_1(z)\log P_1(z) - \sum_z P_1(z)\log P_0(z) \qquad (3.3.39)$$
where the term
$$-\sum_z P_0(z)\log P_1(z) = -E_z[\log P_1] \qquad (3.3.40)$$
is also called the cross-entropy. These asymmetric quantities measure the dissimilarity
between the two probability functions. A symmetric formulation of the dissimilarity
is provided by the divergence quantity
$$J(P_0, P_1) = H(P_0 \| P_1) + H(P_1 \| P_0).$$
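A minimal R sketch of entropy, Kullback-Leibler divergence and their symmetrised version; the two probability functions below are arbitrary assumptions:

    entropy <- function(P) -sum(P * log(P))
    KL <- function(P0, P1) sum(P0 * log(P0 / P1))
    P0 <- c(0.5, 0.3, 0.2)
    P1 <- c(0.2, 0.3, 0.5)
    entropy(P0)               # H(P0)
    KL(P0, P1)                # H(P0||P1): asymmetric, differs from KL(P1, P0)
    KL(P0, P1) + KL(P1, P0)   # symmetric divergence J(P0, P1)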
3.4 Continuous random variable
An r.v. z is said to be a continuous random variable if it can assume any of the
infinite values within a range of real numbers. The following quantities can be
defined:
Definition 4.1 (Cumulative distribution function). The (cumulative) distribution
function of z is the function Fz : R → [0, 1] defined by
$$F_z(z) = \text{Prob}\{\mathbf{z} \le z\} \qquad (3.4.41)$$
This function satisfies the following three conditions:
1. it is right-continuous: $F_z(z) = \lim_{y \to z^+} F_z(y)$;
2. it is non-decreasing: z1 < z2 implies Fz(z1) ≤ Fz(z2);
3. it is normalised, i.e.
$$\lim_{z \to -\infty} F_z(z) = 0, \qquad \lim_{z \to \infty} F_z(z) = 1$$
Definition 4.2 (Density function). The density function of a real random variable
z is the derivative of the distribution function
$$p_z(z) = \frac{dF_z(z)}{dz} \qquad (3.4.42)$$
at all points z where Fz(·) is differentiable.
Probabilities of continuous r.v.s are allocated not to specific values but rather to
intervals of values. Specifically,
$$\text{Prob}\{a \le \mathbf{z} \le b\} = \int_a^b p_z(z)\,dz, \qquad \int_Z p_z(z)\,dz = 1$$
Some considerations about continuous r.v.s are worth mentioning:
• the quantity Prob{z = z} = 0 for all z;
• the quantity pz(z) can be bigger than one (since it is a density and not a
probability) and even unbounded;
• two r.v.s z1 and z2 with the same domain Z are equal in distribution if
Fz1(z) = Fz2(z) for all z ∈ Z.
Note that henceforth we will use p(z) as a shorthand for pz(z) when the identity
of the random variable is clear from the context.
3.4.1 Mean, variance, moments of a continuous r.v.
Consider a continuous scalar r.v. with range Z = (l, h ) and density function p (z).
We may define the following quantities.
Definition 4.3 (Expectation or mean). The mean of a continuous scalar r.v. z is
the scalar value
$$\mu = E[\mathbf{z}] = \int_l^h z\,p(z)\,dz \qquad (3.4.43)$$
Figure 3.9: Cumulative distribution function and upper critical point.
Definition 4.4 (Variance). The variance of a continuous scalar r.v. z is the scalar
value
$$\sigma^2 = E[(\mathbf{z} - \mu)^2] = \int_l^h (z - \mu)^2 p(z)\,dz \qquad (3.4.44)$$
Definition 4.5 (Moments). The r-th moment of a continuous scalar r.v. z is the
scalar value
$$\mu_r = E[\mathbf{z}^r] = \int_l^h z^r p(z)\,dz \qquad (3.4.45)$$
Note that the moment of order r= 1 coincides with the mean of z.
Definition 4.6 (Quantile function). Given the cumulative function Fz, the quantile
(or inverse cumulative) function is the function $F_z^{-1} : [0, 1] \to R$ such that
$$F_z^{-1}(q) = \inf\{z : F_z(z) > q\}$$
The quantities $F_z^{-1}(1/4)$, $F_z^{-1}(1/2)$, $F_z^{-1}(3/4)$ are called the first quartile, the median
and the third quartile, respectively.
Definition 4.7 (Upper critical point). For a given 0 ≤ α ≤ 1, the upper critical
point of a continuous r.v. z is the value z_α such that
$$1 - \alpha = \text{Prob}\{\mathbf{z} \le z_\alpha\} = F(z_\alpha) \iff z_\alpha = F^{-1}(1 - \alpha)$$
Figure 3.9 shows an example of cumulative distribution together with the upper
critical point. A compact review of univariate discrete and continuous distributions
is available in Appendix C.1. In what follows we will detail only the univariate
normal case.
3.4.2 Univariate Normal (or Gaussian) distribution
A continuous scalar random variable x is said to be normally distributed with
parameters μ and σ² (also x ∼ N(μ, σ²)) if its probability density function is Normal
(or Gaussian). The analytical form of a Normal probability density function is
$$p_x(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (3.4.46)$$
where the coefficient before the exponential ensures that ∫ px(x) dx = 1. The mean
of the Normal random variable x is μ and its variance is σ². An interesting property
of a normal r.v. is that the probability that an observation x falls within 1 (2) standard
deviations of the mean is 0.68 (0.95). You may find more probabilistic relationships
in Table 3.2. When μ = 0 and σ² = 1, the distribution is called standard
normal (Figure 3.10) and its distribution function is denoted Fz(z) = Φ(z).
Figure 3.10: Density of a standard r.v. N (0,1)
    Prob{μ − σ ≤ x ≤ μ + σ} ≈ 0.683
    Prob{μ − 1.282σ ≤ x ≤ μ + 1.282σ} ≈ 0.8
    Prob{μ − 1.645σ ≤ x ≤ μ + 1.645σ} ≈ 0.9
    Prob{μ − 1.96σ ≤ x ≤ μ + 1.96σ} ≈ 0.95
    Prob{μ − 2σ ≤ x ≤ μ + 2σ} ≈ 0.954
    Prob{μ − 2.57σ ≤ x ≤ μ + 2.57σ} ≈ 0.99
    Prob{μ − 3σ ≤ x ≤ μ + 3σ} ≈ 0.997

Table 3.2: Some probabilistic relations holding for x ∼ N(μ, σ²)
All random variables x ∼ N(μ, σ²) are linked to a standard normal variable z by the
relation
$$z = (x - \mu)/\sigma. \qquad (3.4.47)$$
It follows that z ∼ N(0, 1) ⇒ x = μ + σz ∼ N(μ, σ²).
The practitioner might now wonder why the Normal distribution is so ubiqui-
tous in statistics books and literature. There are plenty of reasons both from the
theoretical and the practical side. From a theoretical perspective, the adoption of a
Normal distribution is justified by the Central Limit theorem (Appendix C.7) which
states that, under conditions almost always satisfied in practice, a linear combina-
tion of random variables converges to a Normal distribution. This is particularly
useful if we wish to represent in a compact, lumped form the variability that escapes
a modelling effort (e.g. the regression-plus-noise form in Section 10.1). Another
relevant property of Gaussian distributions is that they are closed under linear
transformations, i.e. a linear transformation of a Gaussian r.v. is still Gaussian, and its
mean (variance) depends on the mean (variance) of the original r.v. From a more
pragmatic perspective, an evident asset of a Gaussian representation is that only a
finite number of parameters (two in the univariate case) are sufficient to characterise
the entire distribution.
Exercise
Test the relations in Table 3.2 yourself by random sampling and simulation, using
the script norm.R.
•
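A minimal sketch (not necessarily the content of norm.R) checking one row of Table 3.2, both by sampling and by the exact cumulative function Φ; the parameter values are arbitrary assumptions:

    set.seed(0)
    mu <- 5; sigma <- 2
    x <- rnorm(1e6, mean = mu, sd = sigma)
    mean(abs(x - mu) <= 1.96 * sigma)   # ~0.95 by sampling
    pnorm(1.96) - pnorm(-1.96)          # 0.95 exactly, via Phi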
3.5 Joint probability
So far, we considered scalar random variables only. However, the most interesting
probabilistic (and machine learning) applications are multivariate, i.e. concerning a
number of variables larger than one. Let us consider a probabilistic model described
by n discrete random variables. A fully-specified probabilistic model gives the joint
probability for every combination of the values of the n r.v.s. In other terms, the
joint probability contains all the information about the random variables.
In the discrete case, the model is specified by the values of the probabilities
Prob {z1 = z1 ,z2 = z2 ,...,zn = zn } =P (z1 , z2, . . . , zn ) (3.5.48)
for every possible assignment of values z1 , . . . , zn to the variables.
Spam mail example
Let us consider a bivariate probabilistic model describing the relation between the
validity of a received email and the presence of the word Viagra in the text. Let
z1 be the random variable describing the validity of the email (z1 = 0 for no-spam
and z1 = 1 for spam) and z2 the r.v. describing the presence (z2 = 1) or the
absence (z2 = 0) of the word Viagra. The stochastic relationship between these two
variables can be defined by the joint probability distribution given by the table
              z1 = 0    z1 = 1    P(z2)
    z2 = 0    0.8       0.08      0.88
    z2 = 1    0.01      0.11      0.12
    P(z1)     0.81      0.19      1
•
In the case of n continuous random variables, the model is specified by the joint
distribution function
Prob {z1 ≤ z1 ,z2 ≤ z2 ,...,zn ≤ zn } =F ( z1 , z2, . . . , zn )
which returns a value for every possible assignment of values z1 , . . . , zn to the vari-
ables.
3.5.1 Marginal and conditional probability
Let {z1 ,...,zm } be a subset of size m of the n discrete r.v.s for which a joint
probability function (3.5.48) is defined. The marginal probabilities for the subset
can be derived from expression (3.5.48) by summing over all possible combinations
of values for the remaining variables.
$$P(z_1, \dots, z_m) = \sum_{\tilde z_{m+1}} \cdots \sum_{\tilde z_n} P(z_1, \dots, z_m, \tilde z_{m+1}, \dots, \tilde z_n) \qquad (3.5.49)$$
Exercise
Compute the marginal probabilities P (z1 = 0) and P (z1 = 1) from the joint prob-
ability of the spam mail example.
•
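One possible R check, applying the sum rule to the joint table of the spam example:

    joint <- matrix(c(0.80, 0.08,
                      0.01, 0.11),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(z2 = c("0", "1"), z1 = c("0", "1")))
    colSums(joint)   # P(z1 = 0) = 0.81, P(z1 = 1) = 0.19
    rowSums(joint)   # P(z2 = 0) = 0.88, P(z2 = 1) = 0.12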
For continuous random variables, the marginal density is
$$p(z_1, \dots, z_m) = \int p(z_1, \dots, z_m, z_{m+1}, \dots, z_n)\,dz_{m+1} \dots dz_n \qquad (3.5.50)$$
This is also known as the sum rule or the marginalisation property.
The following definition for r.v. derives directly from Equation (3.1.8).
Definition 5.1 (Conditional probability function). The conditional probability function
for one subset of discrete variables {zi : i ∈ S1}, given values for another disjoint
subset {zj : j ∈ S2} where S1 ∩ S2 = ∅, is defined as the ratio
$$P(\{z_i : i \in S_1\} \mid \{z_j : j \in S_2\}) = \frac{P(\{z_i : i \in S_1\}, \{z_j : j \in S_2\})}{P(\{z_j : j \in S_2\})}$$
Definition 5.2 (Conditional density function). The conditional density function
for one subset of continuous variables {zi : i ∈ S1}, given values for another disjoint
subset {zj : j ∈ S2} where S1 ∩ S2 = ∅, is defined as the ratio
$$p(\{z_i : i \in S_1\} \mid \{z_j : j \in S_2\}) = \frac{p(\{z_i : i \in S_1\}, \{z_j : j \in S_2\})}{p(\{z_j : j \in S_2\})} \qquad (3.5.51)$$
where p({zj : j ∈ S2}) is the marginal density of the set S2 of variables. When
p({zj : j ∈ S2}) = 0, this quantity is not defined.
The simplified version of (3.5.51) for two r.v.s z1 and z2 is
$$p(z_1 = z_1, z_2 = z_2) = p(z_2 = z_2 \mid z_1 = z_1)\,p(z_1 = z_1) = p(z_1 = z_1 \mid z_2 = z_2)\,p(z_2 = z_2) \qquad (3.5.52)$$
which is also known as the product rule.
By combining (3.4.43), the sum rule (3.5.50) and the product rule (3.5.52), we
obtain
$$p(z_1) = \int p(z_1, z_2)\,dz_2 = \int p(z_1 \mid z_2)\,p(z_2)\,dz_2 = E_{z_2}[p(z_1 \mid z_2)]$$
where the subscript z2 makes clear that the expectation is computed with respect
to the distribution of z2 only (while z1 is fixed).
3.5.2 Independence
Having defined the joint and the conditional probability, we can now define when
two random variables are independent.
Definition 5.3 (Independent discrete random variables) . Let x and y be two dis-
crete random variables. Two variables x and y are defined to be statistically inde-
pendent (written as x ⊥⊥ y ) if the joint probability
Prob {x = x, y =y} = Prob {x =x} Prob {y =y} , ∀x, y (3.5.53)
The definition can be easily extended to the continuous case.
Definition 5.4 (Independent continuous random variables). Two continuous vari-
ables x and y are defined to be statistically independent (written as x ⊥⊥ y ) if the
joint density
p(x = x, y= y ) = p(x = x ) p(y = y) ,∀ x, y (3.5.54)
From the definition of independence and conditional density it follows that
x ⊥⊥ y ⇔ p(x = x | y = y) = p(x = x) ∀x, y  (3.5.55)
In layman's terms, the independence of two variables means that we do not
expect that the observed outcome of one variable will affect the probability of ob-
serving the other, or equivalently that knowing something about one variable adds
no information about the other. For instance, hair colour and gender are indepen-
dent. Knowing someone's hair colour adds nothing to the knowledge of his gender.
Height and weight are dependent, however. Knowing someone's height does not
determine their weight precisely; nevertheless, you have less uncertainty about their
probable weight after you have been told their height.
Though independence is symmetric,
x ⊥⊥ y ⇔ y ⊥⊥ x
it is neither reflexive (i.e. a variable is not independent of itself) nor transitive. In
other terms, if x and y are independent and y and z are independent, then x and
z need not be independent.
If we consider three instead of two variables, they are said to be mutually independent
if and only if each pair of r.v.s is independent and
$$p(x, y, z) = p(x)\,p(y)\,p(z)$$
Also the relationship
x ⊥⊥ (y, z) ⇒ x ⊥⊥ z, x ⊥⊥ y
holds, but not the one in the opposite direction.
Note that in mathematical terms an independence assumption implies that a
bivariate density function can be written in a simple form, i.e. as the product of
two univariate densities. This results in an important benefit in terms of the size of
the parametrisation. For instance, consider two discrete random variables z1 ∈ Z1,
z2 ∈ Z2 such that the cardinalities of the two ranges are k1 and k2, respectively. In the
generic case, if z1 and z2 are not independent, the definition of the joint probability
requires the definition of k1 k2 − 1 terms⁵ (or parameters). In the independent case,
because of the property (3.5.54), the definition requires k1 − 1 terms for z1 and
k2 − 1 terms for z2, so overall k1 + k2 − 2. This makes a big difference in the case of
large values of k1 and k2.
Independence allows an economic parametrisation in the multivariate case as
well. Consider the case of a large number n of binary discrete r.v.s, i.e. each
having a range made of two values. If we need to define the joint probability, we
require 2^n − 1 terms (or parameters) in the generic case. If the n variables are
independent, this number is reduced to n.
Exercise
Check whether the variables z1 and z2 of the spam mail example are independent.
•
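One possible check in R: under independence, the joint table should equal the outer product of the marginals.

    joint <- matrix(c(0.80, 0.08,
                      0.01, 0.11), nrow = 2, byrow = TRUE)
    indep <- outer(rowSums(joint), colSums(joint))  # P(z2) P(z1)
    max(abs(joint - indep))   # far from 0: z1 and z2 are dependent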
Note that henceforth, for the sake of brevity, we will limit ourselves to introducing
definitions for continuous random variables only. All of them can, however, be
extended to the discrete case too.
3.5.3 Chain rule
Given a set of n random variables, the chain rule (also called the general product
rule) returns the joint density as a function of conditional densities:
$$p(z_n, \dots, z_1) = p(z_n \mid z_{n-1}, \dots, z_1)\,p(z_{n-1} \mid z_{n-2}, \dots, z_1) \cdots p(z_2 \mid z_1)\,p(z_1) \qquad (3.5.56)$$
This rule is convenient for simplifying the representation of large multivariate
distributions by describing them in terms of conditional probabilities.
⁵Minus one because of the normalisation constraint.
3.5.4 Conditional independence
Independence is not a stable relation. Though x ⊥⊥ y, the r.v. x may become
dependent on y once we observe the value z of a third variable z. In the same
way, two dependent variables x and y may become independent once the value of
z is known. This leads us to introduce the notion of conditional independence.
Definition 5.5 (Conditional independence) . Two r.v.s x and y are conditionally
independent given the value z=z ( x ⊥⊥ y |z=z ) iff
p(x = x, y= y |z= z ) = p(x = x |z= z ) p (y = y|z= z) ∀ x, y (3.5.57)
Two r.v.s x and y are conditionally independent given z (x ⊥⊥ y |z ) iff they are
conditionally independent for all values of z.
Since from the chain rule (3.5.56) we may write
p(x = x, y= y|z= z ) = p(x = x|z= z) p(y = y |x= x, z= z )
it follows that x ⊥⊥ y |z=z implies the relation
p(y = y|x= x, z= z ) = p(y = y|z= z) (3.5.58)
In plain words, the notion of conditional (in)dependence makes formal the intuition
that a variable may bring (or not) information about a second one, according to
the context.
Note that the statement x ⊥⊥ y | z = z means that x and y are independent
if z = z occurs, but it does not say anything about the relation between x and y if
z = z does not occur. It follows that two variables may be independent but not
conditionally independent (or the other way round). In general, independence does
not imply conditional independence, and conditional independence does not imply
independence [27] (as in the example below).
Example: pizzas, dependence and conditional independence
Let y be a variable representing the quality of a pizza restaurant and x a variable
quantifying the Italian assonance of the restaurant name. Intuitively, you would
prefer (because of higher quality y) a pizza served in the restaurant "Sole Mio"
(large x) rather than in the restaurant "Tot Straks" (low x). In probabilistic
terms, this means that x and y are dependent (not x ⊥⊥ y), i.e. knowing x reduces the
uncertainty we have about y. However, it is not the restaurant owner who makes
your pizza, but the cook (pizzaiolo). Let z represent the assonance of his name.
Now you would prefer eating a pizza in a Belgian restaurant where the pizzaiolo
has Italian origins rather than in an Italian restaurant with a Flemish cook. In
probabilistic terms, x and y become independent once z (the pizzaiolo's name) is
known (x ⊥⊥ y | z).
•
It can be shown that the following two assertions are equivalent
(x ⊥⊥ (z1 ,z2 ) |y) ⇔ (x ⊥⊥ z1 |(y, z2 )),(x ⊥⊥ z2 |(y, z1 ))
Also
(x⊥⊥ y |z ),(x ⊥⊥ z |y) ⇒ (x ⊥⊥ (y, z ))
If (x ⊥⊥ y |z ), (z ⊥⊥ y | x), ( z ⊥⊥ x |y ) then x ,y ,z are mutually independent.
If z is a random vector, the order of the conditional independence is equal to the
number of variables in z.
3.5.5 Entropy in the continuous case
Consider a continuous r.v. y. The (differential) entropy of y is defined by
$$H(\mathbf{y}) = -\int \log(p(y))\,p(y)\,dy = E_y[-\log(p(y))] = E_y\left[\log\frac{1}{p(y)}\right]$$
with the convention that 0 log 0 = 0. Entropy is a functional of the distribution of
y and is a measure of the predictability of a r.v. y. The higher the entropy, the
less reliable are our predictions about y. For a scalar normal r.v. y ∼ N(μ, σ²)
$$H(\mathbf{y}) = \frac{1}{2}\left(1 + \ln 2\pi\sigma^2\right) = \frac{1}{2}\ln 2\pi e\sigma^2 \qquad (3.5.59)$$
In the case of a normal random vector Y = {y1, . . . , yn} ∼ N(0, Σ)
$$H(Y) = \frac{1}{2}\ln\left((2\pi e)^n \det(\Sigma)\right)$$
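The closed form (3.5.59) can be checked by a simple R simulation, estimating E[−log p(y)] by sampling; the value of σ is an arbitrary assumption:

    set.seed(0)
    sigma <- 2
    y <- rnorm(1e6, mean = 0, sd = sigma)
    mean(-log(dnorm(y, mean = 0, sd = sigma)))   # Monte Carlo estimate of H(y)
    0.5 * log(2 * pi * exp(1) * sigma^2)         # closed form (3.5.59)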
3.5.5.1 Joint and conditional entropy
Consider two continuous r.v.s x and y and their joint density p(x, y). The joint
entropy of x and y is defined by
$$H(\mathbf{x}, \mathbf{y}) = -\int\!\!\int \log(p(x, y))\,p(x, y)\,dx\,dy = E_{x,y}[-\log(p(x, y))] = E_{x,y}\left[\log\frac{1}{p(x, y)}\right]$$
The conditional entropy is defined as
$$H(\mathbf{y} \mid \mathbf{x}) = -\int\!\!\int \log(p(y \mid x))\,p(x, y)\,dx\,dy = E_{x,y}[-\log(p(y \mid x))] = E_{x,y}\left[\log\frac{1}{p(y \mid x)}\right] = E_x[H(\mathbf{y} \mid x)]$$
This quantity measures the remaining uncertainty of y once x is known. Note that
in general H(y|x) ≠ H(x|y), that H(y) − H(y|x) = H(x) − H(x|y), and that the chain
rule holds:
$$H(\mathbf{y}, \mathbf{x}) = H(\mathbf{y} \mid \mathbf{x}) + H(\mathbf{x}) \qquad (3.5.60)$$
Also, conditioning reduces entropy:
$$H(\mathbf{y} \mid \mathbf{x}) \le H(\mathbf{y})$$
with equality if x and y are independent, i.e. x ⊥⊥ y. This property formalises a
fundamental principle underlying machine learning, data science and prediction in
general, i.e. that by conditioning on some variables x (e.g. inputs) we may reduce
the uncertainty about a variable y (target). Another interesting property is the
independence bound
$$H(\mathbf{y}, \mathbf{x}) \le H(\mathbf{y}) + H(\mathbf{x})$$
with equality if x ⊥⊥ y.
Figure 3.11: 3D visualisation of a bivariate joint density.
3.6 Bivariate continuous distribution
Let us consider two continuous r.v. x and y and their bivariate joint density func-
tion px,y (x, y ). An example of bivariate joint density function is illustrated in
Figure 3.11. From (3.5.50), we define the marginal density as
$$p_x(x) = \int_{-\infty}^{\infty} p_{x,y}(x, y)\,dy$$
and the conditional density as
$$p_{y|x}(y \mid x) = \frac{p(x, y)}{p(x)} \qquad (3.6.61)$$
which is, in loose terms, the probability that y belongs to an interval dy about y,
assuming that x = x. Note that, if x and y are independent,
$$p_{x,y}(x, y) = p_x(x)\,p_y(y), \qquad p(y \mid x) = p_y(y)$$
The definition of conditional expectation is obtained from (3.6.61) and (3.4.43).
Definition 6.1 (Conditional expectation). The conditional expectation of y given
x = x is
$$E_y[\mathbf{y} \mid x = x] = \int y\,p_{y|x}(y \mid x)\,dy = \mu_{y|x}(x) \qquad (3.6.62)$$
From (3.3.29) we may derive that
$$E_y[\mathbf{y} \mid x = x] = \arg\min_m E_y[(\mathbf{y} - m)^2 \mid x = x] \qquad (3.6.63)$$
Note that Ey[y | x = x] is a function of x, also known as the regression function.
The definition of conditional variance derives from (3.6.61) and (3.4.44).
Definition 6.2 (Conditional variance).
$$\text{Var}[\mathbf{y} \mid x = x] = \int (y - \mu_{y|x}(x))^2\,p_{y|x}(y \mid x)\,dy \qquad (3.6.64)$$
Figure 3.12: Bivariate distribution: the figure shows the two marginal distribu-
tions (beside the axis), the conditional expectation function (dashed line) and some
conditional distributions (dotted).
Note that both these quantities are functions of x. If we replace the given value
x by the r.v. x, the terms Ey[y|x] and Var[y|x] are random, too.
Some important results on their expectation are contained in the following the-
orems [192].
Theorem 6.3. For two r.v.s x and y , assuming their expectations exist, we have
that
Ex [Ey [y |x = x]] = Ey [y ] (3.6.65)
and
Var [y ] = Ex [Var [y |x = x ]] + Var [Ey [y |x =x ]] (3.6.66)
where Var [y| x =x ] and Ey [y |x =x ] are functions of x.
We recall that for a bivariate function f(x, y)
$$E_y[f(x, \mathbf{y})] = \int f(x, y)\,p_y(y)\,dy, \qquad E_x[f(\mathbf{x}, y)] = \int f(x, y)\,p_x(x)\,dx.$$
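The two relations of Theorem 6.3 can be checked by simulation. The bivariate model below (x standard normal and y|x normal with mean 2x and unit variance, so that E[y] = 0 and Var[y] = 1 + 4 = 5) is an assumption made only for illustration:

    set.seed(0)
    x <- rnorm(1e6)
    y <- rnorm(1e6, mean = 2 * x, sd = 1)
    mean(y)           # ~0 = E_x[E_y[y|x]], relation (3.6.65)
    var(y)            # ~5 = E_x[Var[y|x]] + Var[E_y[y|x]], relation (3.6.66)
    1 + var(2 * x)    # the two terms of the decomposition: 1 + ~4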
A 2D representation of a bivariate continuous distribution is illustrated in Figure
3.12. It is worth noting that, although the conditional distributions are bell-shaped,
this is not necessarily the case for the marginal distributions.
3.6.1 Correlation
Consider two random variables x and y with means µx and µy and standard devi-
ations σx and σy .
Definition 6.4 (Covariance). The covariance between x and y is defined as
Cov[x, y ] = E [(x− µx )(y− µy )] = E [xy ]− µxµy (3.6.67)
Figure 3.13: Dependent but uncorrelated random variables
A positive (negative) covariance means that the two variables are positively
(inversely) related, i.e. that once one is above its mean, the other tends to
be above (below) its mean as well. The covariance can take any real value.
A limitation of covariance is that it depends on the variables' scales and
units: for instance, if the variables were measured in metres instead of centimetres,
this would change their covariance. For this reason, it is common to replace
covariance with correlation, a dimensionless measure of linear association.
Definition 6.5 (Correlation). The correlation coefficient is defined as
$$\rho(\mathbf{x}, \mathbf{y}) = \frac{\text{Cov}[\mathbf{x}, \mathbf{y}]}{\sqrt{\text{Var}[\mathbf{x}]\,\text{Var}[\mathbf{y}]}} \qquad (3.6.68)$$
It is easily shown that − 1≤ρ (x, y )≤ 1. For this reason, the correlation is
sometimes expressed as a percentage.
Definition 6.6 (Uncorrelated variables) . Two r.v.s x and y are said to be uncor-
related if ρ (x, y ) = 0 or equivalently if
E[xy ] = E[x]E[y ] (3.6.69)
Note that if x and y are two independent random variables, then
$$E[\mathbf{x}\mathbf{y}] = \int\!\!\int xy\,p(x, y)\,dx\,dy = \int\!\!\int xy\,p(x)\,p(y)\,dx\,dy = \int x\,p(x)\,dx \int y\,p(y)\,dy = E[\mathbf{x}]\,E[\mathbf{y}]$$
This means that independence implies uncorrelation. However, the converse does
not hold for a generic distribution. The equivalence between independence and
uncorrelation
ρ(x, y) = 0 ⇔ x ⊥⊥ y  (3.6.70)
holds only if x and y are jointly Gaussian.
See Figure 3.13 for an example of uncorrelated but dependent variables.
Exercises
1. Let x and y be two discrete independent r.v.s such that
Px(−1) = 0.1, Px(0) = 0.8, Px(1) = 0.1
and
Py(1) = 0.1, Py(2) = 0.8, Py(3) = 0.1
If z = x + y, show that E[z] = E[x] + E[y].

2. Let x be a discrete r.v. which assumes the values {−1, 0, 1}, each with
probability 1/3, and let y = x². Let z = x + y. Show that
• E[z] = E[x] + E[y];
• x and y are uncorrelated but dependent random variables.
•
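The second exercise can be verified directly in R:

    x <- c(-1, 0, 1); P <- rep(1/3, 3)
    y <- x^2
    sum(x * y * P)             # E[xy] = 0
    sum(x * P) * sum(y * P)    # E[x] E[y] = 0: uncorrelated by (3.6.69)
    # yet y is a deterministic function of x, hence the two are dependent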
3.7 Normal distribution: the multivariate case
Let z = [z1, . . . , zn]^T be an [n, 1] random vector. The vector is said to be normally
distributed with parameters μ and Σ (also z ∼ N(μ, Σ)) if its probability density
function is given by
$$p_z(z) = \frac{1}{(\sqrt{2\pi})^n\sqrt{\det(\Sigma)}}\exp\left(-\frac{1}{2}(z - \mu)^T\Sigma^{-1}(z - \mu)\right) \qquad (3.7.71)$$
where det(Σ) denotes the determinant of the matrix Σ. It follows that
•the mean E [z ] = µ is an [n, 1] vector,
• the matrix
$$\Sigma = E[(\mathbf{z} - \mu)(\mathbf{z} - \mu)^T] \qquad (3.7.72)$$
is the [n, n] covariance matrix. This matrix is symmetric and positive semidefinite.
It has n(n + 1)/2 parameters: the diagonal terms Σjj are the variances
Var[zj] of the vector components, and the off-diagonal terms Σjk, j ≠ k, are the
covariance terms Cov[zj, zk]. The inverse Σ⁻¹ is also called the concentration
matrix.
The quantity
$$\Delta^2 = (z - \mu)^T\Sigma^{-1}(z - \mu) \qquad (3.7.73)$$
which appears in the exponent of p_z, is the square of the Mahalanobis distance Δ from z
to μ. It can be shown that the n-dimensional surfaces of constant probability density
• are hyper-ellipsoids on which Δ² is constant;
• have principal axes given by the eigenvectors uj, j = 1, . . . , n of Σ, which
satisfy
$$\Sigma u_j = \lambda_j u_j, \qquad j = 1, \dots, n$$
where λj are the corresponding eigenvalues;
• have variances along the principal directions given by the eigenvalues λj (Figure
3.14).
If the covariance matrix Σ is diagonal, then
• the contours of constant density are hyper-ellipsoids with the principal directions
aligned with the coordinate axes;
• the components of z are statistically independent, since the distribution of
z can be written as the product of the distributions of its components:
$$p_z(z) = \prod_{j=1}^{n} p_{z_j}(z_j)$$
• the total number of independent parameters in the distribution is 2n (n for
the mean vector and n for the diagonal covariance matrix);
• if σj = σ for all j, the contours of constant density are hyper-spheres.
Figure 3.14: Contour curves of normal distribution for n = 2.
3.7.1 Bivariate normal distribution
Let us consider a bivariate (n = 2) normal density whose mean is μ = [μ1, μ2]^T and
whose covariance matrix is
$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix}$$
The correlation coefficient is
$$\rho = \frac{\sigma_{12}}{\sigma_1\sigma_2}$$
It can be shown that the general bivariate normal density has the form
$$p(z_1, z_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}}\exp\left[-\frac{1}{2(1 - \rho^2)}\left[\left(\frac{z_1 - \mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{z_1 - \mu_1}{\sigma_1}\right)\left(\frac{z_2 - \mu_2}{\sigma_2}\right) + \left(\frac{z_2 - \mu_2}{\sigma_2}\right)^2\right]\right]$$
A plot of a bivariate normal density with μ = [0, 0]^T and Σ = [1.2919, 0.4546; 0.4546, 1.7081],
together with a corresponding contour curve, is traced in Figure 3.15 by means of the script
gaussXYZ.R.
We suggest the reader play with the Shiny dashboard gaussian.R in order
to visualise the impact of the parameters on the Gaussian distribution.
One of the important properties of the multivariate normal density is that all
conditional and marginal probabilities are also normal. Using the relation
$$p(z_2 \mid z_1) = \frac{p(z_1, z_2)}{p(z_1)}$$
we find that p(z2|z1) is a normal distribution $N(\mu_{2|1}, \sigma_{2|1}^2)$, where
$$\mu_{2|1} = \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(z_1 - \mu_1), \qquad \sigma_{2|1}^2 = \sigma_2^2(1 - \rho^2)$$
Note that
Note that
•µ2|1 is a linear function of z1 : if the correlation coefficient ρ is positive, the
larger z1 , the larger µ2|1 .
•if there is no correlation between z1 and z2 , the two variables are independent,
i.e. we can ignore the value of z1 to estimate µ2 .
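These properties may be checked by sampling, e.g. with the following R sketch; it relies on the function mvrnorm of the MASS package, and the covariance values are those of Figure 3.15:

    library(MASS)
    set.seed(0)
    Sigma <- matrix(c(1.2919, 0.4546,
                      0.4546, 1.7081), 2, 2)
    Z <- mvrnorm(1e5, mu = c(0, 0), Sigma = Sigma)
    rho <- Sigma[1, 2] / sqrt(Sigma[1, 1] * Sigma[2, 2])
    rho * sqrt(Sigma[2, 2] / Sigma[1, 1])   # theoretical slope of mu_{2|1} in z1
    coef(lm(Z[, 2] ~ Z[, 1]))[2]            # empirical slope: close to it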
Figure 3.15: Bivariate normal density function
3.7.2 Gaussian mixture distribution
A continuous r.v. z has a Gaussian mixture distribution with m components if
$$p(\mathbf{z} = z) = \sum_{k=1}^{m} w_k\,N(z;\, \mu_k, \Sigma_k) \qquad (3.7.74)$$
where N(z; μk, Σk) denotes the Normal density with mean μk and covariance Σk,
and the mixture weights wk satisfy
$$\sum_{k=1}^{m} w_k = 1, \qquad 0 \le w_k \le 1$$
A Gaussian mixture is a linear superposition of m Gaussian components and, as
such, has a higher expressive power than a unimodal Gaussian distribution: for
instance, it can be used to model multimodal density distributions.
The script gmm.R samples a bidimensional mixture of Gaussians with 3 components
and diagonal covariances. The density and the sampled points are shown in Figure
3.16. An interesting property of Gaussian mixtures is that they are universal
approximators of densities, which means that any smooth density can be approximated
with any specific nonzero amount of error by a Gaussian mixture model
(GMM) with enough components.
3.7.3 Linear transformations of Gaussian variables
If z1 ∼ N(μ1, Σ1) and z2 ∼ N(μ2, Σ2) are independent Gaussian r.v.s, then the
sum z = z1 + z2 is a Gaussian r.v. z ∼ N(μ1 + μ2, Σ1 + Σ2).
Given two real constants c1 and c2, the linear combination z = c1 z1 + c2 z2 is a
Gaussian r.v. $\mathbf{z} \sim N(c_1\mu_1 + c_2\mu_2,\; c_1^2\Sigma_1 + c_2^2\Sigma_2)$.
If z ∼ N(μ, Σ) is an [n, 1] Gaussian random vector and y = Az, with A an [n, n]
real matrix, then y ∼ N(Aμ, AΣA^T) is a Gaussian vector.
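A hedged R check of the last property; the mean, covariance and transformation matrix below are arbitrary assumptions:

    library(MASS)
    set.seed(0)
    mu <- c(1, -1)
    Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
    A <- matrix(c(1, 1, 1, -1), 2, 2, byrow = TRUE)
    Y <- t(A %*% t(mvrnorm(1e5, mu, Sigma)))   # y = A z, sample by sample
    colMeans(Y); A %*% mu                      # empirical mean vs A mu
    cov(Y); A %*% Sigma %*% t(A)               # empirical covariance vs A Sigma A^T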
Figure 3.16: Density and observations of a bidimensional mixture of Gaussians with
3 components. Each colour corresponds to a different component.
3.8 Mutual information
Mutual information is one of the most widely used measures to convey the depen-
dency of variables. It is a measure of the amount of information that one random
variable contains about another random variable. It can also be considered as the
distance from independence between the two variables. This quantity is always non-
negative and zero if and only if the two variables are stochastically independent.
Given two random variables x and y, their mutual information is defined in
terms of their marginal density functions px(x), py(y) and their joint density
p_{x,y}(x, y):
$$I(\mathbf{x}; \mathbf{y}) = \int\!\!\int \log\frac{p(x, y)}{p(x)\,p(y)}\,p(x, y)\,dx\,dy = H(\mathbf{y}) - H(\mathbf{y} \mid \mathbf{x}) = H(\mathbf{x}) - H(\mathbf{x} \mid \mathbf{y}) \qquad (3.8.75)$$
with the convention that 0 log(0/0) = 0. From (3.5.60), we derive
$$I(\mathbf{x}; \mathbf{y}) = H(\mathbf{y}) - H(\mathbf{y} \mid \mathbf{x}) = H(\mathbf{y}) + H(\mathbf{x}) - H(\mathbf{x}, \mathbf{y}) \qquad (3.8.76)$$
Mutual information is null if and only if x and y are independent, i.e.
I(x; y) = 0 ⇔ x ⊥⊥ y.  (3.8.77)
In other words, the larger the mutual information term, the stronger is the degree
of dependency between two variables.
In the Gaussian case, an analytical link between correlation and mutual information
exists. Let (x, y) be a normally distributed random vector with correlation
coefficient ρ. The mutual information between x and y is given by
$$I(\mathbf{x}; \mathbf{y}) = -\frac{1}{2}\log(1 - \rho^2)$$
Equivalently, the correlation coefficient (3.6.68) can be written as
$$\rho = \sqrt{1 - \exp(-2 I(\mathbf{x}; \mathbf{y}))}$$
In agreement with (3.8.77) and (3.6.70), it follows that in the Gaussian case
ρ(x, y) = 0 ⇔ I(x; y) = 0  (3.8.78)
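In R, the link between the two quantities amounts to one line in each direction; the value of ρ is an arbitrary assumption:

    rho <- 0.8
    I <- -0.5 * log(1 - rho^2)   # mutual information (in nats)
    sqrt(1 - exp(-2 * I))        # recovers |rho| = 0.8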
3.8.1 Conditional mutual information
Consider three r.v.s x, y and z. The conditional mutual information is defined by
$$I(\mathbf{y}; \mathbf{x} \mid \mathbf{z}) = H(\mathbf{y} \mid \mathbf{z}) - H(\mathbf{y} \mid \mathbf{x}, \mathbf{z}) \qquad (3.8.79)$$
It can also be written as
$$I(\mathbf{y}; \mathbf{x} \mid \mathbf{z}) = \int\!\!\int\!\!\int \log\frac{p(x, y \mid z)}{p(x \mid z)\,p(y \mid z)}\,p(x, y, z)\,dx\,dy\,dz$$
While mutual information quantifies the degree of (in)dependence between two
variables, conditional mutual information quantifies the degree of conditional
(in)dependence (Section 3.5.4) between three variables. The conditional mutual
information is null iff x and y are conditionally independent given z, i.e.
I(x; y | z) = 0 ⇔ x ⊥⊥ y | z  (3.8.80)
Note that I(x; y|z) can be null though I(x; y) > 0, as in the pizzas example
in Section 3.5.4. The symmetric configuration is also possible, e.g. I(x; y) = 0 but
I(x; y|z) > 0, as in the case of complementary variables, which will be discussed in
Section 12.8.
3.8.2 Joint mutual information
This section derives the information that a pair of variables (x1, x2) brings about a
third one, y.
From (3.8.79) and (3.5.60) it follows that
$$I(\mathbf{x}; \mathbf{y} \mid \mathbf{z}) = H(\mathbf{y} \mid \mathbf{z}) - H(\mathbf{y} \mid \mathbf{x}, \mathbf{z}) = H(\mathbf{y} \mid \mathbf{z}) + H(\mathbf{x} \mid \mathbf{z}) - H((\mathbf{x}, \mathbf{y}) \mid \mathbf{z}) = H((\mathbf{x}, \mathbf{z})) + H((\mathbf{y}, \mathbf{z})) - H(\mathbf{z}) - H((\mathbf{x}, \mathbf{y}, \mathbf{z})) \qquad (3.8.81)$$
From (3.8.76) it follows that
$$I((\mathbf{x}_1, \mathbf{x}_2); \mathbf{y}) = H(\mathbf{x}_1, \mathbf{x}_2) + H(\mathbf{y}) - H(\mathbf{x}_1, \mathbf{x}_2, \mathbf{y})$$
and
$$I(\mathbf{x}_1; \mathbf{y}) = H(\mathbf{x}_1) + H(\mathbf{y}) - H(\mathbf{x}_1, \mathbf{y})$$
From (3.8.81) it follows that
$$I(\mathbf{x}_2; \mathbf{y} \mid \mathbf{x}_1) = H(\mathbf{y} \mid \mathbf{x}_1) - H(\mathbf{y} \mid \mathbf{x}_1, \mathbf{x}_2) = H(\mathbf{y}, \mathbf{x}_1) - H(\mathbf{x}_1) - H(\mathbf{y}, \mathbf{x}_1, \mathbf{x}_2) + H(\mathbf{x}_1, \mathbf{x}_2)$$
On the basis of the results above, we derive the chain rule of mutual information:
$$I(\mathbf{x}_1; \mathbf{y}) + I(\mathbf{x}_2; \mathbf{y} \mid \mathbf{x}_1) = H(\mathbf{x}_1) + H(\mathbf{y}) - H(\mathbf{x}_1, \mathbf{y}) + H(\mathbf{y}, \mathbf{x}_1) - H(\mathbf{x}_1) - H(\mathbf{y}, \mathbf{x}_1, \mathbf{x}_2) + H(\mathbf{x}_1, \mathbf{x}_2) = H(\mathbf{y}) - H(\mathbf{y}, \mathbf{x}_1, \mathbf{x}_2) + H(\mathbf{x}_1, \mathbf{x}_2) = I((\mathbf{x}_1, \mathbf{x}_2); \mathbf{y}) \qquad (3.8.82)$$
This formula shows that the information that a pair of variables (x1, x2) brings
about a third variable y is not simply the sum of the two mutual information terms
I(x1; y) and I(x2; y), but the sum of I(x1; y) and the conditional information of
x2 and y given x1. This aspect is particularly important in the feature selection
context (Section 12.8), where simplistic assumptions of monotonicity and additivity
do not hold.
For n > 2 variables X = {x1, . . . , xn} the chain rule formulation is

I(X; y) = I(X−i; y|xi) + I(xi; y) = I(xi; y|X−i) + I(X−i; y),   i = 1, . . . , n   (3.8.83)

where X−i denotes the set X with the i-th term set aside.
3.8.3 Partial correlation coefficient
We have seen in Section 3.6.1 that correlation is a good measure of independence
in the case of Gaussian distributions. The same role for conditional independence
is played by partial correlation.
Definition 8.1 (First-order partial correlation). Let us consider three r.v.s x, y
and z. The first-order partial correlation is

ρxy|z = (ρxy − ρxz ρzy) / √((1 − ρ²xz)(1 − ρ²zy))

where ρxy is defined in (3.6.68).
This quantity returns a measure of the correlation between x and y once the
value of z is known. It is possible to extend the partial correlation to the condi-
tioning on two variables.

Definition 8.2 (Second-order partial correlation).

ρx1y|zx2 = (ρx1y|z − ρx1x2|z ρyx2|z) / √((1 − ρ²x1x2|z)(1 − ρ²yx2|z))

This can also be used to define a recurrence relationship where q-th order partial
correlations are computed from (q − 1)-th order partial correlations.
Another interesting property is the link between partial correlation and the con-
centration matrix (Section 3.7). Let Σ and Ω = Σ⁻¹ denote the covariance and
the concentration matrix of the normal set of variables Z ∪ {x, y}. The partial
correlation coefficient ρxy|Z can be obtained by matrix inversion:

ρxy|Z = −ωxy / √(ωxx ωyy)

where ωxy is the element of the concentration matrix corresponding to x and y.
Consider a multivariate normal vector X, such that xi, xj ∈ X, XS ⊂ X and s
is the dimension of XS. Then

ρxixj|XS = 0 ⇔ I(xi; xj | XS) = 0

Note that this is the conditional version of relation (3.8.78).
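Both characterisations of the partial correlation can be checked numerically. The
following R fragment is a sketch (not one of the book's scripts); the covariance
values below are arbitrary assumptions chosen for illustration:

library(MASS)   # for mvrnorm
set.seed(0)
Sigma <- matrix(c(1.0, 0.6, 0.4,
                  0.6, 1.0, 0.5,
                  0.4, 0.5, 1.0), 3, 3)        # covariance of (x, y, z)
Z <- mvrnorm(1e5, mu = rep(0, 3), Sigma = Sigma)
R <- cor(Z)
(R[1, 2] - R[1, 3] * R[3, 2]) /
  sqrt((1 - R[1, 3]^2) * (1 - R[2, 3]^2))      # first-order formula (Def. 8.1)
W <- solve(cov(Z))                             # concentration matrix
-W[1, 2] / sqrt(W[1, 1] * W[2, 2])             # matrix-inversion formula

Both expressions return approximately 0.50, the exact value implied by this Sigma.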
3.9 Functions of random variables and Monte Carlo
simulation
For any function g(·) of the random variable z

E[g(z)] = ∫ g(z) pz(z) dz   (3.9.84)

This is also known as the law of the unconscious statistician (LOTUS). Note that in
general E[g(z)] ≠ g(E[z]), with the exception of the linear function g(z) = az + b,
which will be discussed in the following section.

Exercise
Let z be a scalar r.v. and

g(z) = 1 if z ∈ [a, b], 0 otherwise

with a < b. Compute E[g(z)].
•
For a generic g, the analytical computation or numerical integration of (3.9.84) may
be extremely complex. A numerical alternative is represented by Monte Carlo
simulation, which requires a pseudo-random generator of examples distributed
according to the distribution of z. In a nutshell, Monte Carlo computes E[g(z)] by
1. generating a large number S of sample points zi ∼ Fz, i = 1, . . . , S,
2. computing g(zi),
3. returning the estimation

E[g(z)] ≈ (g(z1) + · · · + g(zS)) / S

If S is sufficiently large, we may consider such an approximation reliable. The same
procedure may be used to approximate other parameters of the distribution (e.g.
the variance). In this book, we will have recourse to Monte Carlo simulation to
provide numerical illustrations of probabilistic formulas or concepts (e.g. bias,
variance and generalisation error), which otherwise might appear too abstract for
the reader.
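As a minimal illustration of the three steps above (a sketch, not one of the book's
scripts), assume z ∼ N(0, 1):

set.seed(0)
S <- 1e6                # number of sample points
z <- rnorm(S)           # step 1: generate z_i from F_z
mean(z^2)               # steps 2-3 with g(z) = z^2: close to E[z^2] = 1
mean(abs(z))            # with g(z) = |z|: close to sqrt(2/pi) = 0.798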
Monte Carlo computation
The script mcarlo.R contains the Monte Carlo computation of the mean and vari-
ance of z ∼ N(µ, σ²), as well as the computation of E[z²] and E[|z|].
The Shiny dashboard mcarlo.R visualises the result of some operations on one
or two random variables by using a Monte Carlo simulation.
•
3.10 Linear combinations of r.v.
The expected value of a linear combination of r.v.s is simply the linear combination
of their respective expected values

E[ax + by] = a E[x] + b E[y],   a ∈ R, b ∈ R

i.e., expectation is a linear statistic. On the contrary, the variance is not a linear
statistic. We have

Var[ax + by] = a² Var[x] + b² Var[y] + 2ab (E[xy] − E[x] E[y])   (3.10.85)
             = a² Var[x] + b² Var[y] + 2ab Cov[x, y]   (3.10.86)

where the quantity Cov[x, y] is defined in (3.6.67).
Given n r.v.s zj, j = 1, . . . , n,

Var[Σ_{j=1}^{n} cj zj] = Σ_{j=1}^{n} cj² Var[zj] + 2 Σ_{i<j} ci cj Cov[zi, zj]   (3.10.87)

Let us now consider n random variables with the same variance σ² and mutual
correlation ρ. Then the variance of their average is

Var[(z1 + · · · + zn)/n] = nσ²/n² + 2 (1/n²) (n(n − 1)/2) ρσ² =
                        = σ²/n + ρσ² − ρσ²/n = (1 − ρ) σ²/n + ρσ²   (3.10.88)
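Equation (3.10.88) can be checked by simulation. The R fragment below is a
sketch (not one of the book's scripts); the common-factor construction is only an
assumption used to induce variance σ² and mutual correlation ρ:

set.seed(0)
n <- 10; sigma <- 2; rho <- 0.3; R <- 1e5
u <- rnorm(R)   # factor shared by the n variables
Z <- sigma * (sqrt(rho) * u + sqrt(1 - rho) * matrix(rnorm(R * n), R, n))
var(rowMeans(Z))                          # simulated variance of the average
(1 - rho) * sigma^2 / n + rho * sigma^2   # theoretical value from (3.10.88)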
3.10.1 The sum of i.i.d. random variables
Suppose that z1, z2, . . . , zN are i.i.d. (identically and independently distributed)
random variables, discrete or continuous, each having a probability distribution
with mean µ and variance σ². Let us consider two derived r.v.s, namely the sum

SN = z1 + z2 + · · · + zN

and the average

z̄ = (z1 + z2 + · · · + zN)/N   (3.10.89)

The following relations hold:

E[SN] = Nµ,   Var[SN] = Nσ²   (3.10.90)
E[z̄] = µ,   Var[z̄] = σ²/N   (3.10.91)

An illustration of these relations by simulation can be obtained by running the R
script sum rv.R.
3.11 Conclusion
The reader (in particular, if a practitioner) might think that a chapter on probabil-
ity theory is an unnecessary frill in a book on machine learning. The author has
a different opinion. Probability extends the logical formalism and gives formal
expression to human patterns of reasoning under uncertainty (e.g. abduction). Also, probability
provides an effective language to formalise the task of machine learning, i.e. using
some variables (e.g. inputs) to explain, provide information about (or reduce the
uncertainty of) other ones (e.g. targets). According to Aristotle, philosophy begins with
wonder. From a scientific perspective, wonder originates from uncertainty, and sci-
ence has the role of reducing it by explanation. The author hopes that this chapter
showed that uncertainty and information are not only philosophical concepts but
quantities whose nature and relationship can be described in probabilistic terms.
So far, we only considered low variate settings, although the ambition of statis-
tical machine learning is to attack complex high variate problems. For this reason,
the next chapter will provide a probabilistic formalism to deal with high variate
(and thus complex) settings. What is still missing for the moment is the second
major ingredient (besides uncertainty) of machine learning: data. Please be patient:
the relation between uncertainty and observations will be discussed in Chapter 5,
which introduces estimation as the statistical way of combining probabilistic models
with real-world data.
3.12 Exercises
1. Suppose you collect a dataset about spam in emails. Let the binary variables x1,
x2 and x3 represent the occurrence of the words "Viagra", "Lottery" and "Won",
respectively, in an email. Let the dataset of 20 emails be summarised as follows
Document   x1 (Viagra)   x2 (Lottery)   x3 (Won)   y (Class)
E1 0 0 0 NOSPAM
E2 0 1 1 SPAM
E3 0 0 1 NOSPAM
E4 0 1 1 SPAM
E5 1 0 0 SPAM
E6 1 1 1 SPAM
E7 0 0 1 NOSPAM
E8 0 1 1 SPAM
E9 0 0 0 NOSPAM
E10 0 1 1 SPAM
E11 1 0 0 NOSPAM
E12 0 1 1 SPAM
E13 0 0 0 NOSPAM
E14 0 1 1 SPAM
E15 0 0 1 NOSPAM
E16 0 1 1 SPAM
E17 1 0 0 SPAM
E18 1 1 1 SPAM
E19 0 0 1 NOSPAM
E20 0 1 1 SPAM
where
•0 stands for the case-insensitive absence of the word in the email.
•1 stands for the case-insensitive presence of the word in the email.
Let y = 1 denote a spam email and y = 0 a no-spam email.
The student should estimate on the basis of the frequency of the data above
•Prob {x1 = 1, x2 = 1}
•Prob {y= 0 |x2 = 1, x3 = 1}
•Prob {x1 = 0 |x2 = 1}
•Prob {x3 = 1 |y= 0, x2 = 0}
•Prob {y= 0 |x1 = 0,x2 = 0,x3 = 0}
•Prob {x1 = 0 |y= 0}
•Prob {y= 0}
Solution:
•Prob {x1 = 1, x2 = 1} = 0.1
•Prob {y = 0 | x2 = 1, x3 = 1} = 0
•Prob {x1 = 0 | x2 = 1} = 0.8
•Prob {x3 = 1 | y = 0, x2 = 0} = 0.5
•Prob {y = 0 | x1 = 0, x2 = 0, x3 = 0} = 1
•Prob {x1 = 0 | y = 0} = 0.875
•Prob {y = 0} = 0.4
2. Let us consider a fraud detection problem. Suppose we collect the following trans-
actional dataset where v = 1 means that the transaction came from a suspicious
web site and f= 1 means that the transaction is fraudulent.
        f = 1   f = 0
v = 1   500     1000
v = 0   1       10000
Estimate the following quantities by using the frequency as estimator of probability:
•Prob {f= 1}
•Prob {v= 0}
•Prob {f= 1 |v= 1}
•Prob {v= 1 |f= 1}
Use the Bayes theorem to compute Prob {v = 1|f = 1} and show that the result is
identical to the one computed before.
Solution:
•Prob {f = 1} = 501/11501 = 0.043
•Prob {v = 0} = 10001/11501 = 0.869
•Prob {f = 1 | v = 1} = 500/1500 = 1/3
•Prob {v = 1 | f = 1} = 500/501
By Bayes theorem:

Prob {v = 1 | f = 1} = Prob {f = 1 | v = 1} Prob {v = 1} / Prob {f = 1} =
                     = (1/3)(1500/11501) / (501/11501) = 500/501
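The same numbers can be obtained with a few lines of R (a sketch, not part of
the book's scripts), using the counts of the contingency table:

n11 <- 500; n10 <- 1000    # v = 1: f = 1 / f = 0
n01 <- 1;   n00 <- 10000   # v = 0: f = 1 / f = 0
N <- n11 + n10 + n01 + n00                  # 11501 transactions
(n11 + n01) / N                             # Prob {f = 1} = 0.043
(n01 + n00) / N                             # Prob {v = 0} = 0.869
n11 / (n11 + n10)                           # Prob {f = 1 | v = 1} = 1/3
n11 / (n11 + n01)                           # Prob {v = 1 | f = 1} = 500/501
# Bayes theorem returns the same value:
(n11 / (n11 + n10)) * ((n11 + n10) / N) / ((n11 + n01) / N)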
3. Let us consider a dataset with 4 binary variables
x1 x2 x3   y
1 1 0 1
0 0 1 0
0 1 0 0
1 1 1 1
0 0 0 0
0 1 0 0
0 1 1 0
0 0 1 0
0 0 0 0
0 1 0 0
1 1 1 1
Estimate the following quantities by using the frequency as estimator of probability
•Prob {y= 1}
•Prob {y= 1 |x1 = 0}
•Prob {y= 1 |x1 = 0, x2 = 0, x3 = 0}
Solution:
•Prob {y = 1} = 3/11
•Prob {y = 1 | x1 = 0} = 0
•Prob {y = 1 | x1 = 0, x2 = 0, x3 = 0} = 0
4. Let us consider a task with three binary inputs and one binary target where the
input distribution is
x1 x2 x3   P(x1, x2, x3)
0 0 0 0.2
0 0 1 0.1
0 1 0 0.1
0 1 1 0.1
1 0 0 0.1
1 0 1 0.1
1 1 0 0.1
1 1 1 0.2
and the conditional probability is
x1 x2 x3   P(y = 1 | x1, x2, x3)
0 0 0 0.8
0 0 1 0.1
0 1 0 0.5
0 1 1 0.9
1 0 0 0.05
1 0 1 0.1
1 1 0 0.05
1 1 1 0.5
Compute
•Prob {x1 = 1, x2 = 1}
•Prob {y= 0 |x2 = 1, x3 = 0}
•Prob {x1 = 0 |x2 = 1}
•Prob {x3 = 1 |y= 0, x2 = 1}
•Prob {y= 0 |x1 = 0, x2 = 0, x3 = 0}
•Prob {x1 = 0 |y= 0}
Solution:
•Prob {x1 = 1, x2 = 1} = 0.1 + 0.2 = 0.3
•By using (3.1.21) (where E0 stands for x2 = 1, x3 = 0) we obtain:
Prob {y = 0 | x2 = 1, x3 = 0} = Prob {y = 0 | x1 = 0, x2 = 1, x3 = 0} · Prob {x1 = 0 | x2 = 1, x3 = 0} +
+ Prob {y = 0 | x1 = 1, x2 = 1, x3 = 0} · Prob {x1 = 1 | x2 = 1, x3 = 0} =
= 0.5 · 0.5 + 0.95 · 0.5 = 0.725
•Prob {x1 = 0 | x2 = 1} = (0.1 + 0.1)/(0.2 + 0.3) = 0.4
•From the joint four-variate distribution computed in the exercise below:
Prob {x3 = 1 | y = 0, x2 = 1} = Prob {x3 = 1, y = 0, x2 = 1} / Prob {y = 0, x2 = 1} =
= 0.11/0.255 = 0.4313725
•Prob {y = 0 | x1 = 0, x2 = 0, x3 = 0} = 1 − 0.8 = 0.2
•From the joint four-variate distribution computed in the exercise below:
Prob {x1 = 0 | y = 0} = Prob {x1 = 0, y = 0} / Prob {y = 0} = 0.19/0.57 = 0.3333
5. Consider the probability distribution of the previous exercise. Is y conditionally
independent of x1 given x2?
Solution:
According to Section 3.5.4, y is conditionally independent of x1 given x2 if, for all
values x2,

Prob {y = y | x1 = x1, x2 = x2} = Prob {y = y | x2 = x2}
Let us compute Prob {y = 1 | x1 = 1, x2 = x2} and Prob {y = 1 | x2 = x2} for x2 = 0.
From (3.1.21)

Prob {y = 1 | x2 = 0, x1 = 1} =
  = Σ_{x3} Prob {y = 1 | x2 = 0, x1 = 1, x3 = x3} Prob {x3 = x3 | x2 = 0, x1 = 1} =
  = Prob {y = 1 | x2 = 0, x1 = 1, x3 = 0} Prob {x3 = 0 | x2 = 0, x1 = 1} +
  + Prob {y = 1 | x2 = 0, x1 = 1, x3 = 1} Prob {x3 = 1 | x2 = 0, x1 = 1} =
  = 0.05 · 0.1/0.2 + 0.1 · 0.1/0.2 = 0.075

and

Prob {y = 1 | x2 = 0} =
  = Σ_{x1, x3} Prob {y = 1 | x2 = 0, x1 = x1, x3 = x3} Prob {x1 = x1, x3 = x3 | x2 = 0} =
  = Prob {y = 1 | x2 = 0, x1 = 0, x3 = 0} Prob {x1 = 0, x3 = 0 | x2 = 0} +
  + Prob {y = 1 | x2 = 0, x1 = 0, x3 = 1} Prob {x1 = 0, x3 = 1 | x2 = 0} +
  + Prob {y = 1 | x2 = 0, x1 = 1, x3 = 0} Prob {x1 = 1, x3 = 0 | x2 = 0} +
  + Prob {y = 1 | x2 = 0, x1 = 1, x3 = 1} Prob {x1 = 1, x3 = 1 | x2 = 0} =
  = 0.8 · 0.2/0.5 + 0.1 · 0.1/0.5 + 0.05 · 0.1/0.5 + 0.1 · 0.1/0.5 = 0.37

Since those two values are different, the two variables are not conditionally indepen-
dent.
An alternative is to first compute the joint distribution of the 4 variables and
then derive the conditional terms. Since

Prob {y, x1, x2, x3} = Prob {y | x1, x2, x3} Prob {x1, x2, x3}

the joint distribution is:
y x1 x2 x3   P(y, x1, x2, x3)
0 0 0 0 (1-0.8)*0.2=0.04
0 0 0 1 (1-0.1)*0.1=0.09
0 0 1 0 0.05
0 0 1 1 0.01
0 1 0 0 0.095
0 1 0 1 0.09
0 1 1 0 0.095
0 1 1 1 0.1
1 0 0 0 0.8*0.2=0.16
1 0 0 1 0.1*0.1=0.01
1 0 1 0 0.05
1 0 1 1 0.09
1 1 0 0 0.005
1 1 0 1 0.01
1 1 1 0 0.005
1 1 1 1 0.1
From the table above we compute the conditional terms as

Prob {y = 1 | x2 = 0} = Prob {y = 1, x2 = 0} / Prob {x2 = 0} =
  = (0.16 + 0.01 + 0.01 + 0.005) / (0.04 + 0.09 + 0.095 + 0.09 + 0.16 + 0.01 + 0.005 + 0.01) = 0.37

and

Prob {y = 1 | x2 = 0, x1 = 1} = Prob {y = 1, x1 = 1, x2 = 0} / Prob {x1 = 1, x2 = 0} =
  = (0.005 + 0.01) / (0.095 + 0.09 + 0.005 + 0.01) = 0.075

Since the results are (obviously) identical to the ones obtained with the first method,
the conclusion is the same, i.e. the variables are conditionally dependent.
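The joint-distribution method lends itself to a compact R implementation (a
sketch, not one of the book's scripts), which reproduces both conditional terms:

X <- expand.grid(x3 = 0:1, x2 = 0:1, x1 = 0:1)[, 3:1]   # rows 000, 001, ..., 111
px  <- c(0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2)        # P(x1, x2, x3)
py1 <- c(0.8, 0.1, 0.5, 0.9, 0.05, 0.1, 0.05, 0.5)      # P(y = 1 | x1, x2, x3)
J <- rbind(cbind(X, y = 0, p = (1 - py1) * px),         # joint P(y, x1, x2, x3)
           cbind(X, y = 1, p = py1 * px))
pcond <- function(num, den) sum(J$p[num]) / sum(J$p[den])
pcond(J$y == 1 & J$x2 == 0, J$x2 == 0)                          # 0.37
pcond(J$y == 1 & J$x2 == 0 & J$x1 == 1, J$x2 == 0 & J$x1 == 1)  # 0.075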
6. Let x, y, z be three binary random variables denoting the pathological mutation of
a given gene of the father, mother and child, respectively. The values 0 and 1 stand
for the absence and presence of the mutation, respectively. Suppose that
•the two parents have the same probability 0.5 of having a pathological mutation
in a given gene
•the variables x and y are independent
•the child may inherit the mutation according to this conditional probability
table
Prob {z = 1 | x = x, y = y}   x   y
0                             0   0
0.6                           0   1
0.4                           1   0
0.7                           1   1
1. What is the probability that the child has no mutation if both parents are not
affected?
2. What is the probability that the father had a mutated gene if the child has the
mutation and the mother is not affected?
3. What is the probability that the father has a mutated gene if the child has the
mutation and the mother is affected?
4. What is the probability that the child has the mutation if the father has none?
5. What is the probability that the father has a mutated gene if the child has the
mutation?
6. What is the probability that the father has a mutated gene if the child has no
mutation?
Solution:
Let us first derive

P(z = 1 | y = 0) =
  = P(z = 1 | y = 0, x = 1) P(x = 1 | y = 0) + P(z = 1 | y = 0, x = 0) P(x = 0 | y = 0) =
  = P(z = 1 | y = 0, x = 1) P(x = 1) + P(z = 1 | y = 0, x = 0) P(x = 0) =
  = 0.4 · 0.5 + 0 · 0.5 = 0.2

P(z = 1 | y = 1) =
  = P(z = 1 | y = 1, x = 1) P(x = 1 | y = 1) + P(z = 1 | y = 1, x = 0) P(x = 0 | y = 1) =
  = P(z = 1 | y = 1, x = 1) P(x = 1) + P(z = 1 | y = 1, x = 0) P(x = 0) =
  = 0.7 · 0.5 + 0.6 · 0.5 = 0.65

P(z = 1 | x = 1) =
  = P(z = 1 | x = 1, y = 0) P(y = 0 | x = 1) + P(z = 1 | x = 1, y = 1) P(y = 1 | x = 1) =
  = 0.4 · 0.5 + 0.7 · 0.5 = 0.55

It follows that

1. P(z = 0 | x = 0, y = 0) = 1

2. P(x = 1 | z = 1, y = 0) = P(z = 1 | x = 1, y = 0) P(x = 1 | y = 0) / P(z = 1 | y = 0) =
   = (0.4 · 0.5) / 0.2 = 1

3. P(x = 1 | z = 1, y = 1) = P(z = 1 | x = 1, y = 1) P(x = 1 | y = 1) / P(z = 1 | y = 1) =
   = (0.7 · 0.5) / 0.65 = 0.538

4. P(z = 1 | x = 0) = P(z = 1 | x = 0, y = 1) P(y = 1 | x = 0) + P(z = 1 | x = 0, y = 0) P(y = 0 | x = 0) =
   = 0.6 · 0.5 + 0 = 0.3

5. P(x = 1 | z = 1) = P(z = 1 | x = 1) P(x = 1) / P(z = 1) =
   = (0.55 · 0.5) / (0.55 · 0.5 + 0.3 · 0.5) = 0.647

6. P(x = 1 | z = 0) = P(z = 0 | x = 1) P(x = 1) / P(z = 0) =
   = (0.45 · 0.5) / (0.45 · 0.5 + 0.7 · 0.5) = 0.3913
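The answers above can be verified in R by enumerating the joint distribution
P(x, y, z) = P(z | x, y) P(x) P(y) (a sketch, not one of the book's scripts):

G <- expand.grid(z = 0:1, y = 0:1, x = 0:1)[, 3:1]      # all (x, y, z) triples
pz1 <- c("00" = 0, "01" = 0.6, "10" = 0.4, "11" = 0.7)  # P(z = 1 | x, y)
G$p <- 0.25 * ifelse(G$z == 1, pz1[paste0(G$x, G$y)],
                     1 - pz1[paste0(G$x, G$y)])         # P(x) = P(y) = 0.5
pcond <- function(num, den) sum(G$p[num]) / sum(G$p[den])
pcond(G$x == 1 & G$z == 1 & G$y == 0, G$z == 1 & G$y == 0)  # question 2: 1
pcond(G$x == 1 & G$z == 1, G$z == 1)                        # question 5: 0.647
pcond(G$x == 1 & G$z == 0, G$z == 0)                        # question 6: 0.3913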
Chapter 4
Graphical models
Graphical Models combine probability theory and graph theory [113, 136, 119] to
deal with two pervasive issues in applied mathematics and engineering: uncertainty
and complexity. In particular, they rely on the notion of conditional independence
(Section 3.1.6) to simplify the representation of complex high-variate probability
distributions.
4.1 Conditional independence and multivariate dis-
tributions
One of the hardest challenges for machine learning is to model large variate tasks,
i.e. tasks characterised by a large number of variables. Section 3.5.2 shows that an
independence assumption reduces the size of the parameter set needed to describe
a probability distribution with many variables. Unfortunately, the assumption of
independence is very strong and rarely met in real tasks. Nevertheless, it is realistic
to assume the existence of conditional independence (Section 3.5.4) relationships in
large variate settings. This assumption implies sparseness, a dependence
pattern where variables tend to interact with few others. If conditional indepen-
dence between some variables holds, then thanks to (3.5.56) we can reduce the size of the
parameter set required to describe the joint probability distribution.
Consider for instance the case of n = 4 binary discrete r.v.s. In the generic case,
we need 2⁴ − 1 = 15 parameters to encode their joint probability, i.e. a quantity expo-
nential in the number of variables. This exponential nature makes probabilistic
modelling unfeasible (i.e. too many parameters to elicit) and unmanageable (i.e.
too much required memory) in case of large n.
Let us now suppose that the 4 binary r.v.s are independent: in this case, since

P(z4, z3, z2, z1) = P(z4) P(z3) P(z2) P(z1)   (4.1.1)

only 4 parameters are necessary to describe the joint distribution. No exponential
explosion of the number of required parameters happens. However, this is a very
simplistic and idealised setting, which rarely occurs in interesting real problems.
Moreover, if all the variables were independent, there would be no need for supervised
learning and predictive modelling, since no variable would bring information (or reduce
uncertainty) about the others.
A more realistic assumption is to consider some variables as conditionally inde-
pendent of others. For instance, suppose that z4 is conditionally independent of z 1
and z2 given z3 (z4 ⊥⊥ (z1 ,z2 ) |z3 )
P( z4 |z3 , z2, z1 ) = P( z4 |z3 ) (4.1.2)
and z3 is conditionally independent of z1 given z2 (z3 ⊥⊥ z1 | z2):

P(z3 | z2, z1) = P(z3 | z2)   (4.1.3)

From the discrete version of (3.5.56) we can write

P(z4, z3, z2, z1) = P(z4 | z3, z2, z1) P(z3 | z2, z1) P(z2 | z1) P(z1)

From the conditional independence relations (4.1.2) and (4.1.3) we obtain the sim-
plified expression

P(z4, z3, z2, z1) = P(z4 | z3) P(z3 | z2) P(z2 | z1) P(z1)

Note that the conditional probability P(zj | zi) for two binary r.v.s can now be
encoded by a conditional table with two parameters, e.g. P(zj = 1 | zi = 1)
and P(zj = 1 | zi = 0). It follows that, thanks to such assumptions, we may
describe the joint probability with only 7 parameters. The useful compactness of
the representation is even more striking in the case of large n, continuous variables,
or discrete r.v.s with a large range of values.
The representational advantage of conditional independence relationships is evi-
dent in Bayesian Networks, a formalism characterised by a correspondence between
topological properties (e.g. connectivity in a directed graph) and probabilistic ones
(notably independence). This formalism allows a compact, flexible and modular (since
localised and thus natural for humans) representation of joint distributions.
4.2 Directed acyclic graphs
A directed graph G is a pair (V, E) where V is a finite non-empty set whose elements
are called nodes, and E is a set of ordered pairs of distinct elements of V. The
elements of E are called edges. A directed cycle is a path from a node to itself. A
directed graph is called a directed acyclic graph (DAG) if it contains no directed
cycles.
Given a DAG and two nodes z1 ∈ V and z2 ∈ V:
•z2 is called a parent of z1 if there is an edge from z2 to z1;
•z2 is called a descendant of z1, and z1 is called an ancestor of z2, if there is a
directed path from z1 to z2;
•z2 is called a non-descendant of z1 if it is not a descendant of z1.
Note that a node is not considered a descendant of itself.
4.3 Bayesian networks
DAGs are an effective way of representing multivariate distributions where the nodes
denote random variables, and the topology (notably the absence of edges) encodes
conditional independence assertions (e.g. elicited from a domain expert).
A main advantage of the approach is the notion of modularity (a complex system
is made by combining simpler parts), which makes visual interpretability possible.
A Bayesian Network (BN) is a pair (G, P) where G is a Directed Acyclic Graph
(DAG) (i.e. a graph with no loops from a variable back to itself) and P is a joint
probability distribution over Z, which is associated with G by the Markov condition.

Definition 3.1 (Markov condition). Given a DAG G and the associated
joint probability distribution P over Z, the Markov condition (MC) holds if every
variable is independent of its graphical non-descendants conditional on its parents.
Figure 4.1: Bayesian Network.
If the Markov condition is satisfied (it is also said that G represents P), the
following theorem holds.

Theorem 3.2. If (G, P) satisfies the Markov condition, then P is equal to the
product of the conditional distributions of all nodes given the values of their parents,
whenever these conditional distributions exist.

This means that if we order the set of r.v.s zi such that, if zj is a descendant of
zk, then zj follows zk in the ordering (k < j), we have the product form

P(z1, . . . , zn) = Π_{i=1}^{n} P(zi | Parents(zi))

where Parents(zi) is the set of parents of the node zi in G.
Example
An example of BN is shown in Figure 4.1. Note that the enumeration of the variable
indices satisfies the topological ordering mentioned before. Let us consider the node
z4: the nodes z2 and z3 are its parents, z1 is its ancestor, and z6 is its descendant.
The associated probability distribution may be factorised as follows:

P(Z) = P(z6 | z4) P(z5 | z3) P(z4 | z3, z2) P(z3 | z1) P(z2 | z1) P(z1)

From the DAG, we can derive a number of conditional independence statements on
the basis of the Markov Condition:

z6 ⊥⊥ (z2, z3, z1, z5) | z4   (4.3.4)
z4 ⊥⊥ (z1, z5) | (z2, z3)   (4.3.5)
z2 ⊥⊥ (z3, z5) | (z1)   (4.3.6)

Note, for instance, that z4 is not independent of z6, since z6 is its descendant. Can
you write more independence statements?
•
Figure 4.2: Alarm BN [166] (nodes: BURGLARY, EARTHQUAKE, ALARM,
JOHN CALLS, MARY CALLS).
Example
This is a well-known example used in [166] to show a practical application of
Bayesian Networks. Suppose you want to model a burglar alarm that is fairly re-
liable at detecting a burglary but also responds on occasion to minor earthquakes.
You also have two neighbours, John and Mary, who promised to call you at work
when they hear the alarm. John always calls when he hears the alarm but some-
times confuses the telephone ringing with the alarm and calls then, too. Mary likes
loud music and sometimes misses the alarm. Given the evidence of who has or has
not called, we would like to estimate the probability of a burglary. We can describe
the problem by using a BN (Figure 4.2) where all the variables are Boolean and
denoted by capital letters to better remember their meaning.
The joint probability can be factorised as follows:

P(J, M, A, B, E) = P(J | A) P(M | A) P(A | B, E) P(B) P(E)
Suppose that the unconditional probabilities of a burglary (B) and of an earthquake (E)
are quite low, e.g. P(B) = 0.001 and P(E) = 0.002. Let the conditional probability
tables be
B E   P(A | B, E)   P(¬A | B, E)
T T   0.95          0.05
T F   0.94          0.06
F T   0.29          0.71
F F   0.001         0.999

A   P(J | A)   P(¬J | A)
T   0.9        0.1
F   0.05       0.95

A   P(M | A)   P(¬M | A)
T   0.7        0.3
F   0.01       0.99
What is the probability Prob {B = T | J = T} (denoted Prob {B | J} below), i.e.
the probability that a burglar entered the house if John calls?

Prob {B | J} = Prob {J | B} Prob {B} / Prob {J} =
             = Prob {J | B} Prob {B} / (Prob {J | B} Prob {B} + Prob {J | ¬B} Prob {¬B})
We have

Prob {J | B} = Prob {J | A, B} Prob {A | B} + Prob {J | ¬A, B} Prob {¬A | B} =
             = Prob {J | A} Prob {A | B} + Prob {J | ¬A} Prob {¬A | B}

Since

Prob {A | B} = Prob {A | B, E} Prob {E} + Prob {A | B, ¬E} Prob {¬E} =
             = 0.95 · 0.002 + 0.94 · (1 − 0.002) = 0.94

Prob {A | ¬B} = Prob {A | ¬B, E} Prob {E} + Prob {A | ¬B, ¬E} Prob {¬E} =
              = 0.29 · 0.002 + 0.001 · (1 − 0.002) = 0.00158

Prob {¬A | ¬B} = 1 − Prob {A | ¬B} = 0.9984

it follows that

Prob {J | B} = 0.9 · 0.94 + 0.05 · (1 − 0.94) = 0.8490

Prob {J | ¬B} = Prob {J | A} Prob {A | ¬B} + Prob {J | ¬A} Prob {¬A | ¬B} =
              = 0.9 · 0.00158 + 0.05 · 0.9984 = 0.0513

and

Prob {B | J} = (0.8490 · 0.001) / (0.8490 · 0.001 + 0.0513 · (1 − 0.001)) = 0.016
You can retrieve the same results by running the script alarm.R, which re-
lies on first computing the entire joint probability distribution and then the ratio
Prob {B, J} / Prob {J}.
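A sketch of such a joint-enumeration computation (not the book's alarm.R) is:

G <- expand.grid(B = c(TRUE, FALSE), E = c(TRUE, FALSE), A = c(TRUE, FALSE),
                 J = c(TRUE, FALSE), M = c(TRUE, FALSE))
pA <- function(b, e) ifelse(b & e, 0.95, ifelse(b, 0.94, ifelse(e, 0.29, 0.001)))
G$p <- with(G, ifelse(B, 0.001, 0.999) * ifelse(E, 0.002, 0.998) *
               ifelse(A, pA(B, E), 1 - pA(B, E)) *
               ifelse(J, ifelse(A, 0.9, 0.05), ifelse(A, 0.1, 0.95)) *
               ifelse(M, ifelse(A, 0.7, 0.01), ifelse(A, 0.3, 0.99)))
sum(G$p[G$B & G$J]) / sum(G$p[G$J])   # Prob {B | J} = 0.016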
•
Example
An interesting BN topology is known as Naive Bayes (Figure 4.3). In this case all
variables zi, 0 < i ≤ n, are conditionally independent given the variable z0:

zi ⊥⊥ zj | z0,   ∀i > 0, j > 0

It follows that the associated joint probability can be written as

P(Z) = P(z0) Π_{i=1}^{n} P(zi | z0)

Note that in this case we need overall 2n + 1 parameters to encode the distribution,
i.e. two parameters for each conditional distribution and one parameter to encode
P(z0).
This probabilistic model is commonly used for probabilistic reasoning in medical
diagnosis, where z0 denotes the pathology class (the cause) and z1, . . . , zn represent
n symptoms (or effects) associated with the pathology. The assumption here is that
symptoms depend only on the underlying pathology. The Naive Bayes principle also
underlies a well-known classifier which will be presented in Section 10.2.3.1.
•
Definition 3.3 (Minimality condition). Consider a BN (G, P) satisfying the MC.
The BN satisfies the minimality condition iff, for every proper subgraph H of G, the
pair (H, P) does not satisfy the MC.
Figure 4.3: Naive Bayes topology.
4.3.1 Bayesian network and d-separation
The Markov Condition induces a set of conditional independence relations in a
Bayesian Network. However, it is not easy to determine which other conditional
independence relationships possibly hold.
In order to show the link between DAG topology and conditional independence,
we introduce the criterion of d-separation. Let Z−(i,j) be the set obtained by removing
zi and zj from Z.

Definition 3.4 (d-separation). In a DAG (Directed Acyclic Graph), two nodes zi
and zj are d-separated by the conditioning set S ⊆ Z−(i,j) (denoted by (zi ⊥ zj | S))
if every path from zi to zj is blocked by S.

Two nodes are d-connected if they are not d-separated.

Definition 3.5 (Blocked path). A path from zi to zj in a DAG is blocked by the
conditioning set S if
•at least one diverging or serially connected (i.e. non-collider) node of the path
is in S, OR
•at least one converging node (collider) is such that neither it nor any of its
descendants belongs to S.
Example
Consider the graph G shown in Figure 4.1. If we consider the path z2 → z4 ← z3,
the node z4 is a collider. If we consider the path z2 → z4 → z6, the node z4 is a
non-collider. It follows then that
•z6 is d-separated from z2 by the conditioning set S = {z4}:

(z6 ⊥ z2 | z4)   (4.3.7)

since the only path is blocked (the serially connected node z4 ∈ S is in the
path z2 → z4 → z6);
•z2 is not d-separated from z3 by the conditioning set S = {z4}:

(z3 ↔ z2 | z4)   (4.3.8)

since there is at least one path that is not blocked (the collider node z4 ∈ S is
in the path z2 → z4 ← z3).
•
A non-blocked path is also called active. The activity of a path depends on the
activity of its (collider and non-collider) nodes:
•Non-colliders: when the conditioning set is empty, they are active. When they
belong to the conditioning set, they become inactive.
•Colliders: when the conditioning set is empty, they are inactive. They become
active when they or some of their descendants are part of the conditioning
set.
It follows that a path is active (i.e. not blocked) when all its nodes are active.
R example
The R package bnlearn allows us to encode DAGs and to perform checks of d-
separation between sets of variables. This package is used in the script dsep.R to
encode and then visualise the DAG in Figure 4.1. The script also uses the function
dsep provided by the package bnlearn to check the existence of the d-separations
corresponding to the conditional independence statements (4.3.7) and (4.3.8).
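A minimal sketch of such a check (assuming the Figure 4.1 structure as factorised
in the previous example; this is not the book's dsep.R) is:

library(bnlearn)
g <- model2network("[z1][z2|z1][z3|z1][z4|z2:z3][z5|z3][z6|z4]")
dsep(g, "z6", "z2", "z4")   # TRUE: z4 blocks z2 -> z4 -> z6, cf. (4.3.7)
dsep(g, "z3", "z2", "z4")   # FALSE: conditioning on the collider z4
                            # activates z2 -> z4 <- z3, cf. (4.3.8)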
•
Note that, for the moment, d-separation is a purely graphical (and non-probabilis-
tic) notion related to the topology of the graph G. In what follows, we will discuss
how and when it may be informative about probabilistic properties.
4.3.2 D-separation and I-map
Definition 3.6 (I-map property). A Bayesian Network (G, P) satisfies the I-map
property if

∀zi, zj ∈ Z, ∀S ⊆ Z−(i,j): (zi ⊥ zj | S) ⇒ I(zi; zj | S) = 0

i.e. if the d-separation (zi ⊥ zj | S) implies the conditional independence zi ⊥⊥ zj | S
or, equivalently, the null conditional mutual information I(zi; zj | S) = 0.

We recall the relation (3.8.80) between conditional mutual information and
conditional independence.
The I-map property implies that the set of d-separations of G (denoted by I(G))
is contained in the set of conditional independencies of P (denoted by I(P)):

I(G) ⊆ I(P)

Note that a completely connected DAG always satisfies the property above, since
I(G) = ∅.
It can be shown that if (and only if) a probability P satisfies the Markov con-
dition for a given graph G, each d-separation in the graph implies a conditional
independence (MC ⇔ I-map).
In other terms, take any instance of a distribution P which factorises according
to the graph structure, and let I(P) be the list of all the conditional independence
statements that can be obtained from P: if z1 and z2 are d-separated by z3, then
the independence z1 ⊥⊥ z2 | z3 belongs to I(P).
4.3.2.1 D-separation and faithfulness
If MC holds, d-separation implies independence. Can we also say the reverse, i.e.
that d-connection (i.e. absence of d-separation) implies dependence? Do all distri-
butions P factorised according to a graph G possess the dependencies related to
the d-connections of the graph? Unless we make additional assumptions, the answer
is no. The required additional assumption is called faithfulness between the graph
and the distribution.
Definition 3.7 (Faithfulness). A Bayesian Network (G, P) satisfies the faithfulness
property if

∀zi, zj ∈ Z, ∀S ⊆ Z−(i,j): (zi ↔ zj | S) ⇒ I(zi; zj | S) ≠ 0

or equivalently if the conditional independence zi ⊥⊥ zj | S entails the d-separation:

∀zi, zj ∈ Z, ∀S ⊆ Z−(i,j): I(zi; zj | S) = 0 ⇒ (zi ⊥ zj | S)

Faithfulness means that independence between two variables zi and zj (in infor-
mation-theoretic terms, I(zi; zj | S) = 0) implies d-separation, or equivalently that
d-connection implies dependency.
When both the Markov condition and faithfulness hold, there is a bijection
between d-separation and conditional independence. The DAG is then said to be a
perfect map of the joint probability distribution.
If faithfulness holds, we have a probabilistic independence interpretation of
graphical d-separation. This means, for instance, that when the conditioning set
is empty, non-colliders transmit information (dependence) while colliders do not
(independence).
Example
Consider the BN in Figure 4.1 and suppose it is faithful. For an empty conditioning
set, we can derive a number of relations of dependence, such as

z2 ̸⊥⊥ z6,   z4 ̸⊥⊥ z5

Are there two independent variables?
•
Beware that many distributions have no perfect map in DAGs. The spectrum
of probabilistic dependencies is, in fact, so rich that it cannot be cast into any
representation scheme that uses a polynomial amount of storage ([Verma, 1987]).
So, how strong is the assumption of faithfulness? It is possible to show that
a DAG is a perfect map for almost all distributions that factorise over G (i.e. all
distributions except a set of measure zero) [119]. This means that assuming faith-
fulness is reasonable in most practical settings. There are, however, well-known
counterexamples to faithfulness, as in the case of the XOR (eXclusive OR) func-
tion¹: if x1 → y ← x2 and y is the output of a XOR function with inputs x1 and
x2, it follows that y ⊥⊥ x1 and y ⊥⊥ x2, though y is not d-separated from x1 and
x2. In other terms, under the faithfulness assumption there should be no edges
connecting x1, x2 and y; however, there are.
Example
Consider a symptom (e.g. headache) with two possible causes: a serious one (can-
cer) and a less serious one (virus). We can describe the problem by using a BN
(Figure 4.4) where all the variables are Boolean. First of all, if we assume a per-
fect map, from d-separation we obtain that the variables C and V are marginally
independent but conditionally dependent (i.e. conditioning on H, the path from
C to V is unblocked).
Suppose that the a priori probability of the serious cause (P(C) = 0.1) is much
lower than that of the less serious one (P(V) = 0.6) and that the conditional
probability table is
¹the XOR function returns TRUE if exactly one of the inputs is TRUE, and FALSE
otherwise.
Figure 4.4: Common effect configuration.
C V   P(H | C, V)   P(¬H | C, V)
T T   0.95          0.05
T F   0.8           0.2
F T   0.8           0.2
F F   0.1           0.9
From the script expl away.R we compute the conditional probability of cancer
in three different situations: headache only, headache and virus, and headache but
no virus:

P(C = T | H) = 0.1597846   (4.3.9)
P(C = T | H, V) = 0.1165644   (4.3.10)
P(C = T | H, ¬V) = 0.4705882   (4.3.11)
Let us remark that the conditional probability (4.3.9) is higher than the a priori
one (P(C) = 0.1) if we observe that the patient has a headache. If we know that the
patient is infected by a virus as well, the probability of cancer decreases to (4.3.10).
We say that the virus explains away the cancer possibility. On the contrary, if we
know that headache is present but the virus is absent, the cancer probability surges
again to (4.3.11).
This non-monotone behaviour is caused by what is called the explaining-away
effect: if we have two common causes of an observed effect, knowing that one
occurs (or not) reduces (or increases) the probability of the other. This is due to the
fact that, though virus and cancer are marginally independent (P(C | V) = P(C)),
they are dependent once we condition on headache (P(C | V, H) < P(C | H)).
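These probabilities can be reproduced by enumerating the joint distribution
P(C, V, H) = P(H | C, V) P(C) P(V) (a sketch under the priors stated above; not
the book's script):

G <- expand.grid(C = c(TRUE, FALSE), V = c(TRUE, FALSE), H = c(TRUE, FALSE))
pH <- function(c, v) ifelse(c & v, 0.95, ifelse(c | v, 0.8, 0.1))
G$p <- with(G, ifelse(C, 0.1, 0.9) * ifelse(V, 0.6, 0.4) *
               ifelse(H, pH(C, V), 1 - pH(C, V)))
sum(G$p[G$C & G$H]) / sum(G$p[G$H])                # P(C = T | H)      = 0.1598
sum(G$p[G$C & G$H & G$V]) / sum(G$p[G$H & G$V])    # P(C = T | H, V)   = 0.1166
sum(G$p[G$C & G$H & !G$V]) / sum(G$p[G$H & !G$V])  # P(C = T | H, ¬V)  = 0.4706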
•
4.3.3 Skeleton and I-equivalence
In the case of a perfect map, a BN is fully specified by its conditional independence
statements. A definition of equivalence follows:

Definition 3.8 (I-equivalence). Two graphs are I-equivalent if they have the same
associated set of independencies.

All distributions that can be factorised on a graph G can also be factorised on
an equivalent graph G′.
In order to check the notion of equivalence visually, we introduce the notion of
skeleton.
Definition 3.9 (Skeleton). The skeleton of a Bayesian Network graph G over Z is
the undirected graph that contains an edge for every edge in G.

A sufficient (but not necessary) condition for the equivalence of two graphs is that
they have the same skeleton and the same set of v-structures. A v-structure occurs in
a DAG when there is a node with two entering edges (e.g. Figure 4.4). Complete
graphs (e.g. completely connected triplets) are equivalent (no independence) but
may have different v-structures.
A v-structure with no direct edge between its parents is also called an immorality
or unshielded collider. A sufficient and necessary condition for the equivalence of
two graphs is that they have the same skeleton and the same set of immoralities.
Definition 3.10 (Markov equivalence). Two DAGs are said to be Markov equiv-
alent if and only if they have the same skeleton and the same v-structures.

Observational equivalence places a limit on the ability to infer directionality from
conditional probabilities alone. This means that there are equivalence classes
of graphs that cannot be distinguished using conditional independence tests. Two
graphs that are I-equivalent cannot be distinguished without resorting to alternative
strategies (e.g. manipulative experimentation or temporal information).
By considering conditional independence only, it is not possible to detect changes
in graphs (e.g. arc reversals) that do not change the skeleton and do not
introduce or destroy a v-structure. The simplest example is provided by the two graphs
z1 → z2 and z2 → z1. They are Markov equivalent (same skeleton and no v-
structure) and have an associated empty set of independencies.
At the same time, the unshielded collider z1 → z2 ← z3 is a singleton, i.e. the
only member of its equivalence class. This means that independence constraints
alone suffice to determine its structure without ambiguity.
4.3.4 Stable distributions
The use of the independence relationships made so far implies that the set of inde-
pendencies of the probability distribution associated with a graph depends only on
the structure of the graph and not on its parametrisation.
This restriction is also known as stability: in other terms, we consider distri-
butions whose independencies remain invariant to any change in the parameters.
The stability assumption presumes that unstable independencies (i.e. dependencies
that disappear with a parameter change) are unlikely to occur in the data, so all
the independencies are structural.
In general, it is important to be aware of the limited expressibility of BNs. BNs
cannot necessarily represent graphically all the independence properties of a given
distribution. Consider, for instance, the distribution associated with x1 → y1 ←
z → y2 ← x2. If we marginalise the distribution w.r.t. z (i.e. z is not observable),
there is no DAG containing only the vertices x1, x2, y1, y2 which represents the
independence relations of the original DAG without adding spurious independencies.
4.4 Markov networks
Markov networks (MN) are an undirected graphical representation of conditional
independencies. Let us consider a set Z = {z1, . . . , zn} of n random variables.

Definition 4.1. The conditional independence graph of Z is the undirected graph
G = (V, E) where V = {1, . . . , n} and (i, j) is NOT in the edge set E iff

zi ⊥⊥ zj | Z−{i,j}
This graph is also called a pairwise Markov graph. Note that for n variables,
there are 2^(n(n−1)/2) potential undirected graphs.
4.4.1 Separating vertices, separated subsets and independence
As in the directed case, it is possible to use topological notions (e.g. separation
of vertices in the network) to deduce probabilistic properties (e.g. conditional
(in)dependencies). Given an undirected graph G = (V, E):
•a subset of vertices separates two vertices i and j if every path joining the two
vertices contains at least one vertex from the separating subset;
•a subset of vertices separates two subsets Va and Vb in G if it separates every
pair of vertices i ∈ Va, j ∈ Vb.
The last property is also called the global Markov property. In general, it can
be shown that the set of distributions satisfying the pairwise Markov property
satisfies the global Markov property as well.
Given an undirected independence graph G = (V, E), it can be shown that:
•if V can be partitioned into two subsets Vb and Vc such that there is no path
between any vertex in Vb and any vertex in Vc, then

xi ⊥⊥ xj,   for all xi ∈ Vb and xj ∈ Vc

•if Va is any subset of vertices of G that separates i and j, then

xi ⊥⊥ xj | Xa
4.4.2 Directed and undirected representations
BNs and MNs are two closely related yet different representations, and it is im-
portant not to confuse them. So why consider undirected representations besides
directed Bayesian networks? What is their main difference? First of all, let us
recall Box's golden rule of modelling: no model representation is perfect or
exhaustive, all of them are wrong, but sometimes some of them are useful. Markov
networks visualise conditional independence properties in distributions without hav-
ing recourse to any notion of ordering. In that sense, they are more adequate when
the considered problem is not explicitly associated with a specific ordering of vari-
ables or is characterised by symmetric relations. At the same time, asymmetric
relationships (e.g. cause and effect, past and future) fit the BN formalism well.
As an example, let us consider two probabilistic distributions of n random vari-
ables. In the first case, the n variables represent a quantity measured during n
consecutive time instants. In the second case, the n variables measure the same
quantity over n different spatial locations. In both cases, properties of conditional
independence might help in representing and reasoning about the distribution. How-
ever, only the first case takes advantage of a Bayesian Network representation, which
encodes the explicit and asymmetric time ordering. An undirected representation,
where a notion of symmetric neighbourhood is present, is more suitable for the
spatial distribution task.
A second interesting issue is whether a MN is equivalent to a BN which has
been deprived of its edge directionality. The answer is not so simple. Consider a
DAG G and a faithful probability P. Let U be the undirected skeleton associated
with G and U′ be the undirected conditional independence graph associated with
P. Which relationship exists between U and U′? U and U′ are generally not the
same, but the relation U ⊆ U′ holds. As shown by Wermuth and Lauritzen, U and
U′ are the same iff G does not contain any unshielded collider.
4.5 Conclusions
The graphical modelling formalism enables a modular representation of large variate
problems thanks to the correspondence between topological properties and prob-
abilistic notions. Graphical models are thus effective tools that can be used to
represent and communicate the relations between a large number of variables and
to perform probabilistic reasoning.
In general, an effective modelling approach has to manage the trade-off between
complexity and fidelity to reality. Graphical modelling uses the notion of conditional
independence to address such an issue.
Note that the adoption of conditional independence assumptions to simplify
representations is pervasive in mathematical modelling and human reasoning: think,
for instance, of the notion of state in dynamical systems, which makes the future
behaviour independent of the past given the present. Simplifying by conditioning
is also a peculiar characteristic of human causal reasoning: once we find the cause
of a certain phenomenon, we can disregard all other variables as irrelevant.
We will see that in machine learning, graphical modelling is a powerful way
to explain why some variables are more important or relevant than others (Chap-
ter 12). At the same time, machine learning strategies may be used to infer compact,
graphical (and sometimes causal) representations from data (Chapter 13).
Chapter 5
Parametric estimation and
testing
Given the correct probabilistic model of a phenomenon, we may derive the properties
of observable data by logical deduction. The theory of statistics is designed to
reverse the deductive process (Chapter 2). It takes measured data and uses them to
propose a probabilistic model, to estimate its parameters and eventually to validate
it. This chapter will focus on the estimation methodology, intended as the inductive
process which leads from observed data to a probabilistic description of reality.
We will focus here on the parametric approach, which assumes that we know all
about the probabilistic model except the value of a finite number of parameters.
Parametric estimation algorithms build estimates from data and, more importantly,
statistical measures to assess their quality. There are two main approaches to
parametric estimation:

Classical or frequentist: it is based on the idea that sample data are the sole
quantifiable form of relevant information and that the parameters are fixed
but unknown. It is related to the frequency view of probability (Section 3.1.4).

Bayesian approach: the parameters are supposed to be random variables, having
a distribution prior to data observation and a distribution posterior to data
observation. This approach assumes that there exists something beyond data
(i.e. a human sense of uncertainty or a subjective degree of belief) and that
this belief can be described in probabilistic form.

It is well known, however, that in large-sample problems, frequentist and Bayesian
approaches tend to produce similar numerical results, and that in small-to-medium
settings, though the two outcomes may not coincide, their difference is usually
small. For these reasons and, mainly, for reasons of space, we will limit ourselves
here to the classical approach. It is important, however, not to underestimate the
important role of the Bayesian estimation philosophy, which has recently led to a
large amount of research in Bayesian data analysis and important applications in
machine learning [78].
5.1 Classical approach
The classical approach to parameter estimation dates back to the period 1920-35,
when J. Neyman and E.S. Pearson, stimulated by problems in biology and industry,
concentrated on the principles for testing hypotheses, and R.A. Fisher, interested in
agricultural issues, focused on estimation from data.
We will introduce estimation by considering a simple univariate setting. Let z
be a continuous r.v. and suppose that
1. we know the analytical form of the distribution family

Fz(z) = Fz(z, θ)

but the parameter vector θ ∈ Θ is unknown,
2. we have access to a set DN of N i.i.d. measurements of z, called sample data.
In the general case, a few parameters are not enough to describe a function like
the density function: in that sense, parametric densities are an obvious simplifica-
tion. An example of a parametric distribution function is the Normal distribution
(Section 3.4.2), where the parameter vector is θ = [µ, σ]. The goal of the esti-
mation procedure is to find a value ˆθ of the parameter θ so that the parameterised
distribution Fz(z, ˆθ) closely matches the distribution of the data.
The notation i.i.d. stands for identically and independently distributed. Identi-
cally distributed means that all the observations have been sampled from the same
distribution, that is

Prob {zi = z} = Prob {zj = z}   for all i, j = 1, . . . , N and z ∈ Z

Independently distributed means that the fact that we have observed a certain value
zi does not influence the probability of observing the value zj, that is

Prob {zj = z | zi = zi} = Prob {zj = z}
Example
Here are some examples of estimation problems:
1. Let DN = {20, 31, 14, 11, 19, . . . } be the times in minutes spent going home
during the last 2 weeks. What is the mean time to reach my house from ULB?
2. Consider the car traffic in the boulevard Jacques. Suppose that the measures
of the inter-arrival times are DN = {10, 11, 1, 21, 2, . . . } seconds. What does
this imply about the mean inter-arrival time?
3. Consider the students of the last year of Computer Science. What is the
variance of their grades?
4. Let z be the r.v. denoting tomorrow's temperature. How can I estimate its
mean value on the basis of past observations?
Parametric estimation is a mapping from the space of the sample data to the
space of parameters Θ. The two possible outcomes are:
1. some specific value of Θ: in this case, we have the so-called point estimation;
2. some particular region of Θ: in this case, we obtain an interval of confidence
on the value of the parameter.
5.1.1 Point estimation
Consider a random variable z with a parametric distribution Fz(z, θ), θ ∈ Θ. The
unknown parameter can be written as a function(al) of F:

θ = t(F)

This corresponds to the fact that θ is a characteristic of the population described
by Fz(·). For instance, the expected value parameter µ = t(F) = ∫ z dF(z) is a
functional of F.
Suppose now that we have observed a set of N i.i.d. values DN = {z1, z2, . . . , zN}.
A point estimate is an example of a statistic, where by statistic we generally mean
any function of the sample data DN. In other terms, a point estimate is a function

ˆθ = g(DN)   (5.1.1)

of the sample dataset DN, where g(·) stands for the estimation algorithm, that
is, the procedure which returns an estimation starting from a dataset DN. Note
that, from a machine learning perspective, it is more appropriate to consider g,
rather than a conventional mathematical function, as a generic algorithm taking
the sample dataset as an input and returning an estimation as output¹.
There are two main issues in estimation and, more generally, in data analysis,
statistics and machine learning: how to construct an estimator (i.e. which form
should g take) and how to assess the quality of the returned estimation ˆθ. In
Sections 5.3 and 5.8 we will discuss two strategies for defining an estimator: the
plug-in principle and maximum likelihood. In Section 5.5 we will present the
statistical measures most commonly adopted to assess an estimator's accuracy.
Before introducing the plug-in principle, we need, however, to present the notion
of empirical distribution.
5.2 Empirical distributions
Suppose we have observed an i.i.d. random sample of size N from a probability
distribution Fz(·):

Fz → {z1, z2, . . . , zN}

The empirical distribution ˆF is defined as the discrete distribution that assigns
probability 1/N to each value zi, i = 1, . . . , N. In other words, ˆF assigns to a
set A in the sample space of z its empirical probability

Prob {z ∈ A} ≈ #{zi ∈ A} / N

that is, the proportion of the observations in DN which occur in A.
It can be proved that the vector of observed frequencies in ˆF is a sufficient
statistic for the true distribution F(·), i.e. all the information about F(·) contained
in DN is also contained in ˆF(·).
Consider now the distribution function Fz(z) of a continuous r.v. z and a set of
N observations DN = {z1, . . . , zN}. Since

Fz(z) = Prob {z ≤ z}

we define N(z) as the number of observations in DN that do not exceed z. We
then obtain the empirical estimate of F(·)

ˆFz(z) = N(z)/N = #{zi ≤ z} / N   (5.2.2)
¹For instance, an awkward, yet acceptable, estimation algorithm could take the dataset, discard
all the examples except the third one and return it as the estimation.
Figure 5.1: Empirical distribution.
This function is a staircase function with discontinuities at the points zi (Figure
5.1).

Example
Suppose that our dataset is made of the following N = 14 observations:

DN = {20, 21, 22, 20, 23, 25, 26, 25, 20, 23, 24, 25, 26, 29}

The empirical distribution function ˆFz (which can be traced by running the
script cumdis.R) is plotted in Figure 5.1.
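The same staircase function can be obtained with base R's ecdf() (a sketch,
independent of the book's cumdis.R):

D <- c(20, 21, 22, 20, 23, 25, 26, 25, 20, 23, 24, 25, 26, 29)
Fhat <- ecdf(D)    # step function: Fhat(z) = #{z_i <= z} / N
Fhat(22.5)         # 5/14: proportion of observations not exceeding 22.5
plot(Fhat, main = "Empirical distribution function")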
5.3 Plug-in principle to define an estimator
Consider an r.v. z and a sample dataset DN drawn from the parametric distribution
Fz(z, θ). The main issue of estimation is how to define an estimate of θ. A possible
solution is given by the plug-in principle, a simple method of estimating
parameters from observations. The plug-in estimate of a parameter (or target) θ is
defined to be

ˆθ = t(ˆF(z))   (5.3.3)

obtained by replacing the distribution function with the empirical distribution in
the analytical expression of the parameter.
The following section will discuss the plug-in estimators of the first two moments
of a probability distribution.
5.3.1 Sample average
Consider an r.v. z ∼ Fz(·) such that

θ = E[z] = ∫ z dF(z)

with θ unknown. Suppose we have available the sample Fz → DN, made of N
observations. The plug-in point estimate of θ is given by the sample average

ˆθ = (1/N) Σ_{i=1}^{N} zi = ˆµ   (5.3.4)

Note that the sample average is not a parameter (i.e. it is not a function of the
probability distribution Fz) but a statistic (i.e. a function of the dataset DN).
5.3.2 Sample variance
Consider an r.v. z ∼ Fz(·) where the mean µ and the variance σ² are unknown.
Suppose we have available the sample Fz → DN. Once we have the sample average
ˆµ, the plug-in estimate of σ² is given by the sample variance

ˆσ² = (1/(N − 1)) Σ_{i=1}^{N} (zi − ˆµ)²   (5.3.5)

The presence of N − 1 instead of N in the denominator will be explained later.
Note also that the following relation holds for all zi:

(1/N) Σ_{i=1}^{N} (zi − ˆµ)² = ((1/N) Σ_{i=1}^{N} zi²) − ˆµ²

The expressions of the plug-in estimators of other interesting probabilistic pa-
rameters are given in Appendix D.
5.4 Sampling distribution
Given a dataset DN of N observations sampled from z, let us consider a point
estimate

ˆθ = g(DN)   (5.4.6)

Note that since DN is the outcome of N realisations of a r.v. z, the vector DN
can be considered as the realisation of a random vector DN.²
By applying the transformation g to the random variable DN, we obtain the
random variable

ˆθ = g(DN)   (5.4.7)

which is called the point estimator of θ. A key point is the following: while θ is
an (unknown) fixed value, the estimator ˆθ is a random variable. For instance, if
we aim to estimate θ = µ (the expected value of z), the parameter µ is an unknown
and fixed value, while the average ˆµ is a random variable (since it is a function of a
random dataset).

²This is not a mathematical detail but an essential aspect of the data-driven discovery process
under uncertainty. Every model learned from data, or more generally all knowledge acquired from
data, is built on random foundations and, as such, is a random quantity and has to be assessed
as such.
Figure 5.2: From the parametric parent distribution Fz(·, θ) (underlying the data
generation) to the sampling distribution of the estimator ˆθN. Each dataset has the
same size N.
The probability distribution of the r.v. ˆθ is called the sampling distribution,
while the distribution of the r.v. z (with parameter θ) is called the parent dis-
tribution. An example of the process leading from the parent to the sampling
distribution is plotted in Figure 5.2. Note that the sampling distribution, though
a theoretical quantity, is of great significance in estimation, since it quantifies the
estimator's accuracy in probabilistic terms or, in simpler words, the gap between
the estimation and the parameter θ.
5.4.1 Shiny dashboard
The dashboard estimation.R (Appendix G) provides an interactive visualisation
of the sampling distribution of the plug-in estimators of the parameters (mean
and variance) of a Normal parent distribution z. We invite the reader to modify
the values of N, µ and σ and to observe the impact on the sampling distribution.
Note that the sampling distribution is obtained by a Monte Carlo simulation of the
process illustrated in Figure 5.2. The simulation (Algorithm 1 and the related R code
in Table 5.1) consists in repeating an (adjustable) number of trials where, for each
trial, a sample dataset of size N is generated and the plug-in estimations are computed.
The dashboard shows the histograms of the estimations.
Algorithm 1 Monte Carlo simulation to generate a sampling distribution
1: S = {}
2: for r = 1 to R do
3:   Fz → DN = {z1, z2, . . . , zN}   // pseudo-random sample generation
4:   ˆθ = g(DN)   // estimation computation
5:   S = S ∪ {ˆθ}
6: end for
7: Plot histogram of S
8: Compute statistics of S (mean, variance)
9: Study distribution of S with respect to θ (e.g. estimate bias)
mu <- 0          # parameter
R <- 10000       # number of trials
N <- 20          # size of each dataset
S <- numeric(R)
for (r in 1:R) {
  D <- rnorm(N, mean = mu, sd = 10)  # pseudo-random sample generation
  S[r] <- mean(D)                    # compute estimate
}
hist(S)               # plot histogram of S
bias <- mean(S) - mu  # estimate bias

Table 5.1: R version of the Algorithm 1 pseudo-code to generate the sampling
distribution of ˆµ.
5.5 The assessment of an estimator
Once an estimator ˆθ has been defined (e.g. in algorithmic or mathematical form), it
is possible to assess its accuracy from its sampling distribution.

5.5.1 Bias and variance
The following measures rely on the sampling distribution³ to assess the estimator's
accuracy.

Definition 5.1 (Bias of an estimator). An estimator ˆθ of θ is said to be unbiased
if and only if

E_DN[ˆθ] = θ

Otherwise, it is said to be biased, with bias

Bias[ˆθ] = E_DN[ˆθ] − θ   (5.5.8)

Definition 5.2 (Variance of an estimator). The variance of an estimator ˆθ of θ is
the variance of its sampling distribution

Var[ˆθ] = E_DN[(ˆθ − E[ˆθ])²]

³please note that we refer to the ˆθ distribution and not to the z distribution
Definition 5.3 (Standard error). The square root of the variance

ˆσ = √Var[ˆθ]

is called the standard error of the estimator ˆθ.
An unbiased estimator is an estimator that, on average, has the right value. But averaged over what? It is important to retain that this average is over different realisations of the dataset $D_N$, as made explicit by the notation $E_{D_N}[\hat{\theta}]$, represented visually by Figure 5.2 and simulated by the Monte Carlo repetitions in Section 5.4.1.
Note that different unbiased estimators may exist for a parameter $\theta$. Also, a biased estimator with a known bias (i.e. one not depending on $\theta$) is equivalent to an unbiased estimator, since we can easily compensate for the bias. We will see in Section 5.5.3 that for some specific estimators it is possible to derive the bias analytically. Unfortunately, in general, the bias is not measurable, since this would require the knowledge of $\theta$, which is in fact the target of our estimation procedure: nevertheless, the notion of bias is an important theoretical quantity for reasoning about the accuracy of an estimation process.
Sometimes we are accurate (e.g. unbiased) in estimating $\theta$ though we are interested in $f(\theta)$. Given a generic transformation $f(\cdot)$, if $\hat{\theta}$ is unbiased for $\theta$ this does not imply that $f(\hat{\theta})$ is unbiased for $f(\theta)$ as well. This implies, for instance, that the standard error $\hat{\sigma}$ is not an unbiased estimator of the standard deviation $\sigma$, despite $\hat{\sigma}^2$ being an unbiased estimator of $\sigma^2$.
5.5.2 Estimation and the game of darts
An intuitive manner of visualising the notion of sampling distribution of an estimator, and the related concepts of bias and variance, is the analogy of the game of darts. The unknown parameter $\theta$ can be seen as the target, and the estimator $\hat{\theta}$ as a player. Figure 5.3 shows the target (black dot) together with the distribution of the shots of two different players: the C (cross) player and the R (round) player. In terms of our analogy, the cross player/estimator has small variance but large bias, while the round one has small bias and large variance. Which one is the best?
Now it is your turn to draw the shot distribution of a player with low bias and low variance, and of a player with large bias and large variance.
5.5.3 Bias and variance of $\hat{\mu}$
This section shows that for a generic r.v. $z$ and an i.i.d. dataset $D_N$, the sample average $\hat{\mu}$ is an unbiased estimator of the mean $E[z]$.
Consider a random variable $z \sim F_z(\cdot)$. Let $\mu$ and $\sigma^2$ be the mean and the variance of $F_z(\cdot)$, respectively. Suppose we have observed the i.i.d. sample $D_N \leftarrow F_z$. From (5.3.4) we obtain
$$E_{D_N}[\hat{\mu}] = E_{D_N}\left[\frac{1}{N}\sum_{i=1}^N z_i\right] = \frac{\sum_{i=1}^N E[z_i]}{N} = \frac{N\mu}{N} = \mu \qquad (5.5.9)$$
This means that the sample average estimator is unbiased, whatever the distribution $F_z(\cdot)$ is. And what about its variance? Since, according to the i.i.d. assumption, $\text{Cov}[z_i, z_j] = 0$ for $i \neq j$, from (3.10.85) we obtain that the variance of the sample average estimator is
$$\text{Var}[\hat{\mu}] = \text{Var}\left[\frac{1}{N}\sum_{i=1}^N z_i\right] = \frac{1}{N^2}\,\text{Var}\left[\sum_{i=1}^N z_i\right] = \frac{1}{N^2}\,N\sigma^2 = \frac{\sigma^2}{N}. \qquad (5.5.10)$$
In fact, $\hat{\mu}$ acts like the "round player" in the darts game (Figure 5.3), with some variance but no bias.

Figure 5.3: The dart analogy: the target is the unknown parameter, the round dots represent some realisations of the estimator R, while the crosses represent some realisations of the estimator C.
You can visualise the bias and variance of the sample average estimator by running the Shiny dashboard estimation.R introduced in Section 5.4.1.
5.5.4 Bias of the estimator $\hat{\sigma}^2$
Let us now study the bias of the estimator of the variance of $z$:
$$E_{D_N}[\hat{\sigma}^2] = E_{D_N}\left[\frac{1}{N-1}\sum_{i=1}^N (z_i - \hat{\mu})^2\right] \qquad (5.5.11)$$
$$= \frac{N}{N-1}\,E_{D_N}\left[\frac{1}{N}\sum_{i=1}^N (z_i - \hat{\mu})^2\right] \qquad (5.5.12)$$
$$= \frac{N}{N-1}\,E_{D_N}\left[\left(\frac{1}{N}\sum_{i=1}^N z_i^2\right) - \hat{\mu}^2\right] \qquad (5.5.13)$$
Since $E[z^2] = \mu^2 + \sigma^2$ and $\text{Cov}[z_i, z_j] = 0$, the first term inside the $E[\cdot]$ is
$$E_{D_N}\left[\frac{1}{N}\sum_{i=1}^N z_i^2\right] = \frac{1}{N}\sum_{i=1}^N E_{D_N}[z_i^2] = \frac{1}{N}\,N(\mu^2 + \sigma^2) = \mu^2 + \sigma^2$$
Since $E\left[\left(\sum_{i=1}^N z_i\right)^2\right] = N^2\mu^2 + N\sigma^2$, the second term is
$$E_{D_N}[\hat{\mu}^2] = \frac{1}{N^2}\,E_{D_N}\left[\left(\sum_{i=1}^N z_i\right)^2\right] = \frac{1}{N^2}(N^2\mu^2 + N\sigma^2) = \mu^2 + \sigma^2/N$$
It follows that
$$E_{D_N}[\hat{\sigma}^2] = \frac{N}{N-1}\left[(\mu^2 + \sigma^2) - (\mu^2 + \sigma^2/N)\right] = \frac{N}{N-1}\cdot\frac{N-1}{N}\,\sigma^2 = \sigma^2$$
This result justifies our definition (5.3.5): once the term $N-1$ is placed in the denominator, the sample variance estimator is unbiased.
Some points are worth considering:
• The results (5.5.9), (5.5.10) and (5.5.11) are independent of the family of the distribution $F(\cdot)$.
• According to (5.5.10), the variance of $\hat{\mu}$ is $1/N$ times the variance of $z$. This is a formal justification of why taking averages over a large number of observations is recommended: the larger $N$, the smaller $\text{Var}[\hat{\mu}]$, so a bigger $N$ for a given $\sigma^2$ implies a better estimate of $\mu$.
• According to the central limit theorem (Section C.7), under quite general conditions on the distribution $F_z$, the distribution of $\hat{\mu}$ will be approximately normal as $N$ gets large, which we can write as
$$\hat{\mu} \sim \mathcal{N}(\mu, \sigma^2/N) \quad \text{for } N \to \infty$$
• The standard error $\sqrt{\text{Var}[\hat{\mu}]} = \sigma/\sqrt{N}$ is a common measure of statistical accuracy. Roughly speaking, if the estimator is unbiased and the conditions of the central limit theorem apply, we expect $\hat{\mu}$ to be less than one standard error away from $\mu$ about 68% of the time, and less than two standard errors away from $\mu$ about 95% of the time (see Table 3.2).
Script
You can visualise the bias and variance of the sample variance estimator by running the R script sam_dis2.R or the Shiny dashboard estimation.R introduced in Section 5.4.1.
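For readers without access to those scripts, the following is a minimal sketch of what such a simulation might look like (the parameter values are illustrative assumptions, not necessarily those of sam_dis2.R):

set.seed(0)
sigma2 <- 4                  # true variance (illustrative value)
N <- 10; R <- 10000          # dataset size and number of trials
S.unb <- numeric(R); S.ml <- numeric(R)
for (r in 1:R) {
  D <- rnorm(N, mean = 0, sd = sqrt(sigma2))
  S.unb[r] <- var(D)                    # denominator N-1 (unbiased)
  S.ml[r] <- sum((D - mean(D))^2) / N   # denominator N (biased)
}
hist(S.unb)              # sampling distribution of the sample variance
mean(S.unb) - sigma2     # estimated bias: close to 0
mean(S.ml) - sigma2      # estimated bias: close to -sigma2/N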
5.5.5 A tongue-twister exercise
It sounds like a tongue-twister, but it is important that the reader take some time to reason about the substantial difference between two quantities:
1. the variance of an estimator, and
2. the estimator of the variance.
The first quantity is denoted by $\text{Var}[\hat{\theta}]$; it is a real number and measures the accuracy of an estimator. It was introduced in Section 5.5.
The second is denoted by $\hat{\sigma}^2$; it is a random quantity, since it is an estimator, and its properties (e.g. bias) were discussed in Section 5.5.4.
Now, if you understand the difference between the two quantities above, you can reason about $\text{Var}[\hat{\sigma}^2]$, which is nothing more than the variance of the estimator of the variance. Clear, isn't it? And what about the estimator of the variance of the estimator of the variance?
5.5.6 Bias/variance decomposition of MSE
Bias and variance are two independent criteria for assessing the quality of an estimator. As shown in Figure 5.3, we could have two estimators behaving in opposite ways: the first has large bias and low variance, while the second has large variance and small bias. How can we choose between them? We need a measure able to combine the two criteria into a single one. This is the role of the mean-square error (MSE) measure.
When $\hat{\theta}$ is a biased estimator of $\theta$, its accuracy is usually assessed by its MSE rather than simply by its variance. The MSE is defined by
$$\text{MSE} = E_{D_N}[(\theta - \hat{\theta})^2]$$
For a generic estimator it can be shown that
$$\text{MSE} = (E[\hat{\theta}] - \theta)^2 + \text{Var}[\hat{\theta}] = \left(\text{Bias}[\hat{\theta}]\right)^2 + \text{Var}[\hat{\theta}] \qquad (5.5.14)$$
i.e., the mean-square error is equal to the sum of the variance and the squared bias of the estimator. Here is the analytical derivation:
$$\text{MSE} = E_{D_N}[(\theta - \hat{\theta})^2] = E_{D_N}[(\theta - E[\hat{\theta}] + E[\hat{\theta}] - \hat{\theta})^2] \qquad (5.5.15)$$
$$= E_{D_N}[(\theta - E[\hat{\theta}])^2] + E_{D_N}[(E[\hat{\theta}] - \hat{\theta})^2] + E_{D_N}[2(\theta - E[\hat{\theta}])(E[\hat{\theta}] - \hat{\theta})] \qquad (5.5.16)$$
$$= E_{D_N}[(\theta - E[\hat{\theta}])^2] + E_{D_N}[(E[\hat{\theta}] - \hat{\theta})^2] + 2(\theta - E[\hat{\theta}])(E[\hat{\theta}] - E[\hat{\theta}]) \qquad (5.5.17)$$
$$= (E[\hat{\theta}] - \theta)^2 + \text{Var}[\hat{\theta}] \qquad (5.5.18)$$
where the cross term vanishes since $E[\hat{\theta}] - E[\hat{\theta}] = 0$.
This decomposition is typically called the bias-variance decomposition. Note that,
if an estimator is unbiased then its MSE is equal to its variance.
5.5.7 Consistency
Suppose that the sample data contain $N$ independent observations $z_1, \ldots, z_N$ of a univariate random variable. Let the estimator of $\theta$ based on $N$ observations be denoted $\hat{\theta}_N$. As $N$ becomes larger, we might reasonably expect that $\hat{\theta}_N$ improves as an estimator of $\theta$ (in other terms, that it gets closer to $\theta$). The notion of consistency formalises this concept.
Definition 5.4. The estimator $\hat{\theta}_N$ is said to be weakly consistent if $\hat{\theta}_N$ converges to $\theta$ in probability, that is
$$\forall \varepsilon > 0 \quad \lim_{N \to \infty} \text{Prob}\left\{|\hat{\theta}_N - \theta| \le \varepsilon\right\} = 1$$

Definition 5.5. The estimator $\hat{\theta}_N$ is said to be strongly consistent if $\hat{\theta}_N$ converges to $\theta$ with probability 1 (or almost surely):
$$\text{Prob}\left\{\lim_{N \to \infty} \hat{\theta}_N = \theta\right\} = 1$$
For a scalar $\theta$, the property of convergence guarantees that the sampling distribution of $\hat{\theta}_N$ becomes less dispersed as $N \to \infty$. In other terms, a consistent estimator is asymptotically unbiased. It can be shown that a sufficient condition for the weak consistency of an unbiased estimator $\hat{\theta}_N$ is that $\text{Var}[\hat{\theta}_N] \to 0$ as $N \to \infty$.
It is important to remark that unbiasedness (for finite-size samples) and consistency are largely unrelated properties.
Exercise
Consider an estimator of the mean that takes into consideration only the first 10 sample points, whatever the total number $N > 10$ of observations. Is such an estimator consistent?
•
5.5.8 Efficiency
Suppose we have two unbiased and consistent estimators. How do we choose between them?

Definition 5.6 (Relative efficiency). Let us consider two unbiased estimators $\hat{\theta}_1$ and $\hat{\theta}_2$. If
$$\text{Var}[\hat{\theta}_1] < \text{Var}[\hat{\theta}_2]$$
we say that $\hat{\theta}_1$ is more efficient than $\hat{\theta}_2$.

If the estimators are biased, the comparison is typically done on the basis of the mean square error.
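As an illustration (this simulation is ours, not one of the book's scripts), the following R sketch compares by Monte Carlo the variance of two unbiased estimators of a Normal mean, the sample average and a single observation:

set.seed(0)
N <- 20; R <- 10000
theta1 <- replicate(R, mean(rnorm(N)))  # sample average of N points
theta2 <- replicate(R, rnorm(1))        # a single observation
var(theta1)   # about 1/N = 0.05: theta1 is more efficient
var(theta2)   # about 1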
Exercise
Suppose $z_1, \ldots, z_N$ is a random sample of observations from a distribution with mean $\theta$ and variance $\sigma^2$. Study the unbiasedness and the consistency of the following three estimators of the mean $\mu$:
$$\hat{\theta}_1 = \hat{\mu} = \frac{\sum_{i=1}^N z_i}{N}, \qquad \hat{\theta}_2 = \frac{N\hat{\theta}_1}{N+1}, \qquad \hat{\theta}_3 = z_1$$
•
5.6 Hoeffding's inequality
A probabilistic measure of the discrepancy between the estimator $\hat{\mu}$ and the quantity $\mu = E[z]$ to be estimated is provided by Hoeffding's inequality.

Theorem 6.1. [103] Let $z_1, \ldots, z_N$ be independent bounded random variables such that $z_i$ falls in the interval $[a_i, b_i]$ with probability one. Let their sum be $S_N = \sum_{i=1}^N z_i$. Then for any $\varepsilon > 0$ we have
$$\text{Prob}\left\{|S_N - E[S_N]| > \varepsilon\right\} \le 2\exp\left(-2\varepsilon^2 \Big/ \sum_{i=1}^N (b_i - a_i)^2\right)$$

Corollary 6.2. If the variables $z_1, \ldots, z_N$ are independent and identically distributed in $[a, b]$, the following bound on the discrepancy between the sample mean $\hat{\mu} = \frac{\sum_{i=1}^N z_i}{N}$ and the expected value $E[z]$ holds:
$$\text{Prob}\left\{|\hat{\mu} - E[z]| > \varepsilon\right\} \le 2\exp\left(-2N\varepsilon^2/(b-a)^2\right)$$
Assume that $\delta$ is a confidence parameter, i.e. we want to be $100(1-\delta)\%$ confident that the estimate $\hat{\mu}$ is within accuracy $\varepsilon$ of the true expectation. Setting the right-hand side of the corollary equal to $\delta$, it is possible to derive the expression
$$\varepsilon(N) = \sqrt{\frac{(b-a)^2 \log(2/\delta)}{2N}}$$
which measures, with confidence $1-\delta$, how close the sample mean $\hat{\mu}$, estimated on the basis of $N$ points, is to the expectation $E[z]$. We can also determine the number of observations $N$ necessary to obtain an accuracy $\varepsilon$ and a confidence $\delta$ by using the relation
$$N > \frac{(b-a)^2 \log(2/\delta)}{2\varepsilon^2}$$
Hoeffding's bound is a general bound that relies only on the assumption that the sample points are drawn independently. Bayesian bounds are another example of statistical bounds; they give tighter results under the assumption that the examples are drawn from a normal distribution.
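The two formulas above are straightforward to code; here is a small sketch (the helper names are ours, not the book's):

# accuracy achievable with N points at confidence 1 - delta
hoeffding.eps <- function(N, delta, a = 0, b = 1)
  sqrt((b - a)^2 * log(2 / delta) / (2 * N))
# number of points needed for accuracy eps at confidence 1 - delta
hoeffding.N <- function(eps, delta, a = 0, b = 1)
  ceiling((b - a)^2 * log(2 / delta) / (2 * eps^2))
hoeffding.eps(N = 1000, delta = 0.05)  # about 0.043
hoeffding.N(eps = 0.01, delta = 0.05)  # 18445 observations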
5.7 Sampling distributions for Gaussian r.v.s
The results in Section 5.5 are independent of the type of distribution function $F_z$. Additional results are available in the specific case of a normal random variable.
Let $z_1, \ldots, z_N$ be i.i.d. realisations of $z \sim \mathcal{N}(\mu, \sigma^2)$ and let us consider the following sample statistics:
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^N z_i, \qquad \widehat{\text{SS}} = \sum_{i=1}^N (z_i - \hat{\mu})^2, \qquad \hat{\sigma}^2 = \frac{\widehat{\text{SS}}}{N-1}$$
It can be shown that the following relations hold:
• $\hat{\mu} \sim \mathcal{N}(\mu, \sigma^2/N)$ and $N(\hat{\mu} - \mu)^2 \sim \sigma^2 \chi^2_1$, where the $\chi^2$ distribution is presented in Appendix C.2.2.
• $z_i - \mu \sim \mathcal{N}(0, \sigma^2)$, so $\sum_{i=1}^N (z_i - \mu)^2 \sim \sigma^2 \chi^2_N$.
• $\sum_{i=1}^N (z_i - \mu)^2 = \widehat{\text{SS}} + N(\hat{\mu} - \mu)^2$.
• $\widehat{\text{SS}} \sim \sigma^2 \chi^2_{N-1}$ or, equivalently, $(N-1)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{N-1}$. See the R script sam_dis2.R and the sketch below.
• $\sqrt{N}(\hat{\mu} - \mu)/\hat{\sigma} \sim \mathcal{T}_{N-1}$, where $\mathcal{T}$ stands for the Student distribution (Section C.2.3).
• If $E[|z - \mu|^4] = \mu_4$ then $\text{Var}[\hat{\sigma}^2] = \frac{1}{N}\left(\mu_4 - \frac{N-3}{N-1}\sigma^4\right)$.
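A quick simulation (ours, with illustrative parameter values; the book's sam_dis2.R may differ) can confirm the fourth relation above:

set.seed(0)
N <- 10; sigma <- 2; R <- 10000
stat <- replicate(R, (N - 1) * var(rnorm(N, mean = 5, sd = sigma)) / sigma^2)
hist(stat, freq = FALSE, breaks = 50)     # empirical distribution of the statistic
curve(dchisq(x, df = N - 1), add = TRUE)  # chi-squared density with N-1 d.o.f.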
5.8 The principle of maximum likelihood
Maximum likelihood is a major strategy used in statistics to design an estimator, i.e. the algorithm $g$ in (5.4.7). Its rationale is to transform a problem of estimation into a problem of optimisation. Let us consider
1. a density distribution $p_z(z, \theta)$ which depends on a parameter $\theta \in \Theta$,
2. a dataset $D_N = \{z_1, z_2, \ldots, z_N\}$ drawn i.i.d. from this distribution.
According to (3.5.54), the joint probability density of the i.i.d. dataset is the product
$$p_{D_N}(D_N, \theta) = \prod_{i=1}^N p_z(z_i, \theta) = L_N(\theta) \qquad (5.8.19)$$
where, for a fixed $D_N$, $L_N(\cdot)$ is a function of $\theta$ and is called the empirical likelihood of $\theta$ given $D_N$.
The principle of maximum likelihood was first used by Lambert around 1760 and by D. Bernoulli about 13 years later. It was detailed by Fisher in 1920. The idea is simple: given an unknown parameter $\theta$ and sample data $D_N$, the maximum likelihood estimate $\hat{\theta}$ is the value for which the empirical likelihood $L_N(\theta)$ attains its maximum:
$$\hat{\theta}_{ml} = \arg\max_{\theta \in \Theta} L_N(\theta)$$
The estimator $\hat{\theta}_{ml}$ is called the maximum likelihood estimator (m.l.e.). In practice, it is usual to consider the log-likelihood $l_N(\theta)$ instead of $L_N(\theta)$. Since $\log(\cdot)$ is a monotone function, we have
$$\hat{\theta}_{ml} = \arg\max_{\theta \in \Theta} L_N(\theta) = \arg\max_{\theta \in \Theta} \log(L_N(\theta)) = \arg\max_{\theta \in \Theta} l_N(\theta) \qquad (5.8.20)$$
The likelihood function quantifies the relative abilities of the various parameter values to explain the observed data. The principle of maximum likelihood is that the value of the parameter under which the observed data would have had the highest probability of arising must be, intuitively, our best estimate of $\theta$. In other terms, the likelihood can be considered a measure of how plausible the parameter values are in light of the data. Note, however, that the likelihood function is NOT a probability function: for instance, in general, it does not integrate to 1 (with respect to $\theta$). In terms of conditional probability, $L_N(\theta)$ represents the probability of the observed dataset given $\theta$, and not the probability of $\theta$ (which is not a r.v. in the frequentist approach) given $D_N$.
Example
Consider a binary variable (e.g. a coin toss) which takes the value 1 (e.g. "Tail") $z = 15$ times in $N = 40$ trials. Suppose that the probabilistic model underlying the data is Binomial (Section C.1.2) with an unknown probability $\theta = p$. We want to estimate the unknown parameter $\theta = p \in [0, 1]$ on the basis of the empirical evidence from the $N$ trials. The likelihood $L(p)$ is a function of (only) the unknown parameter $p$. By applying the maximum likelihood technique we have
$$\hat{\theta}_{ml} = \hat{p} = \arg\max_p L_N(p) = \arg\max_p \binom{N}{z} p^z (1-p)^{N-z} = \arg\max_p \binom{40}{15} p^{15}(1-p)^{25}$$
Figure 5.4 plots $L(p)$ versus $p \in [0, 1]$ (R script ml_bin.R). The most likely value of $p$ is the value where $L(\cdot)$ attains its maximum. According to Figure 5.4, this value is $\hat{p} = z/N$. The log-likelihood for this model is
$$l_N(p) = \log L_N(p) = \log\binom{N}{z} + z\log(p) + (N-z)\log(1-p) = \log\binom{40}{15} + 15\log p + 25\log(1-p)$$
The reader can analytically find the maximum of this function by differentiating $l(p)$ with respect to $p$.
•
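A minimal sketch in the spirit of ml_bin.R (whose exact content may differ) evaluates $L(p)$ on a grid and locates its maximum:

z <- 15; N <- 40
p <- seq(0.001, 0.999, by = 0.001)
L <- dbinom(z, size = N, prob = p)  # choose(N,z) * p^z * (1-p)^(N-z)
plot(p, L, type = "l", xlab = "p", ylab = "likelihood")
p[which.max(L)]                     # close to z/N = 0.375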
Figure 5.4: Likelihood function
5.8.1 Maximum likelihood computation
In many situations the log-likelihood $l_N(\theta)$ is particularly well behaved, being continuous with a single maximum away from the extremes of the range of variation of $\theta$. Then $\hat{\theta}_{ml}$ is obtained simply as the solution of
$$\frac{\partial l_N(\theta)}{\partial \theta} = 0$$
subject to
$$\left.\frac{\partial^2 l_N(\theta)}{\partial \theta^2}\right|_{\hat{\theta}_{ml}} < 0$$
to ensure that the identified stationary point is a maximum.
5.8.2 Maximum likelihood in the Gaussian case
Let $D_N$ be a random sample from the r.v. $z \sim \mathcal{N}(\mu, \sigma^2)$. It is possible to derive analytically the expression of the maximum likelihood estimators of the mean and variance of $z$. According to (5.8.19), the likelihood of the $N$ observations is
$$L_N(\mu, \sigma^2) = \prod_{i=1}^N p_z(z_i, \mu, \sigma^2) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(z_i - \mu)^2}{2\sigma^2}\right)$$
and the log-likelihood is
$$l_N(\mu, \sigma^2) = \log L_N(\mu, \sigma^2) = \sum_{i=1}^N \log p_z(z_i, \mu, \sigma^2) = -\frac{\sum_{i=1}^N (z_i - \mu)^2}{2\sigma^2} + N\log\frac{1}{\sqrt{2\pi\sigma^2}}$$
Note that, for a given $\sigma$, maximising the log-likelihood is equivalent to minimising the sum of squared differences between the $z_i$ and the mean. Taking the derivatives with respect to $\mu$ and $\sigma^2$ and setting them equal to zero, we obtain
$$\hat{\mu}_{ml} = \frac{\sum_{i=1}^N z_i}{N} = \hat{\mu} \qquad (5.8.21)$$
$$\hat{\sigma}^2_{ml} = \frac{\sum_{i=1}^N (z_i - \hat{\mu}_{ml})^2}{N} \neq \hat{\sigma}^2 \qquad (5.8.22)$$
Note that the m.l. estimator (5.8.21) of the mean coincides with the sample average (5.3.4), but that the m.l. estimator (5.8.22) of the variance differs from the sample variance (5.3.5) in its denominator.
In the multivariate Normal case, where $z$ is a vector with $[n, 1]$ mean $\mu$ and $[n, n]$ covariance matrix $\Sigma$, the maximum likelihood estimators are
$$\hat{\mu}_{ml} = \frac{\sum_{i=1}^N z_i}{N} \qquad (5.8.23)$$
$$\hat{\Sigma}_{ml} = \frac{\sum_{i=1}^N (z_i - \hat{\mu}_{ml})(z_i - \hat{\mu}_{ml})^T}{N} \qquad (5.8.24)$$
where $z_i$ and $\hat{\mu}_{ml}$ are $[n, 1]$ vectors.
Exercise
• Let $z \sim \mathcal{U}(0, M)$ follow a uniform distribution and $F_z \to D_N = \{z_1, \ldots, z_N\}$. Find the maximum likelihood estimator of $M$.
• Let $z$ have a Poisson distribution, i.e.
$$p_z(z, \lambda) = \frac{e^{-\lambda}\lambda^z}{z!}$$
If $F_z(z, \lambda) \to D_N = \{z_1, \ldots, z_N\}$, find the m.l.e. of $\lambda$.
•
For generic distributions $F_z$, computational difficulties may arise: for example, in some cases no explicit solution exists for $\partial l_N(\theta)/\partial\theta = 0$, and iterative numerical methods must be used. The computational cost becomes heavier if we consider a vector of parameters instead of a scalar $\theta$, or when $l_N$ has several relative maxima.
Another complex situation occurs when $l_N(\theta)$ is discontinuous, has a discontinuous first derivative, or has a maximum at an extremal point.
R script
Suppose we know the analytical form of a one-dimensional function $f(x): I \to \mathbb{R}$ but not the analytical expression of its extreme points. In this case, numerical optimisation methods can be applied. Some continuous optimisation routines are implemented in the R statistical tool.
Consider, for example, the function $f(x) = (x - 1/3)^2$ and $I = [0, 1]$. The point $x$ where $f$ takes its minimum value can be approximated numerically by this set of R commands:

f <- function(x, a) (x - a)^2
xmin <- optimize(f, c(0, 1), tol = 0.0001, a = 1/3)
xmin
These routines may be applied to solve the problem of maximum likelihood estimation, which is nothing more than a particular case of optimisation problem. Let $D_N$ be a random sample drawn from $z \sim \mathcal{N}(\mu, \sigma^2)$. The negative log-likelihood function of the $N$ observations can be written in R as

eml <- function(m, D, var) {
  # negative log-likelihood of the mean m for the Gaussian sample D
  # with known variance var
  N <- length(D)
  Lik <- 1
  for (i in 1:N)
    Lik <- Lik * dnorm(D[i], m, sqrt(var))
  -log(Lik)
}
and the numerical minimisation of $-l_N(\mu, s^2)$ for a given $\sigma = s$ over the interval $I = [-10, 10]$ can be written in R as

xmin <- optimize(eml, c(-10, 10), D = DN, var = s)

In order to run the above code and compute the m.l. solution numerically, we invite the reader to run the R script emp_ml.R.
•
5.8.3 Cramer-Rao lower bound
Assume that $\theta$ is a scalar parameter, that the first two derivatives of $L_N(\theta)$ with respect to $\theta$ exist for all $\theta$, and that certain operations of integration and differentiation may be interchanged. Let $\hat{\theta}$ be an unbiased estimator of $\theta$ and $l_N(\theta) = \log_e[L_N(\theta)]$. Suppose that the regularity condition
$$E\left[\frac{\partial l_N(\theta)}{\partial \theta}\right] = 0 \qquad (5.8.25)$$
holds, where the quantity $\partial l_N(\theta)/\partial \theta$ is called the score. The Cramer-Rao bound is a lower bound on the variance of the estimator $\hat{\theta}$; it states that
$$\text{Var}[\hat{\theta}] \ge \frac{1}{E\left[\left(\frac{\partial l_N(\theta)}{\partial \theta}\right)^2\right]} = \frac{-1}{E\left[\frac{\partial^2 l_N(\theta)}{\partial \theta^2}\right]} = \frac{1}{I_N}$$
where the denominator term $I_N$ is known as the Fisher information. For i.i.d. observations, $E[\partial^2 l_N/\partial\theta^2] = N\,E[\partial^2 l/\partial\theta^2]$, where $l(\theta) = \log p_z(z, \theta)$ is the single-observation log-likelihood. Note that $\frac{\partial^2 l_N(\theta)}{\partial \theta^2}$ is the second derivative of $l_N(\cdot)$ and, as such, defines the curvature of the log-likelihood function. At the maximum $\hat{\theta}$, the second derivative takes a negative value. Also, the larger its absolute value, the larger the curvature around the function peak, and hence the lower the uncertainty about the m.l. estimation [145].
An estimator having a variance as low as $1/I_N$ is called a Minimum Variance Bound (MVB) estimator.
Example
Consider a r.v. $z \sim \mathcal{N}(\mu, \sigma^2)$ where $\sigma^2$ is known and the unknown parameter is $\theta = \mu$. Let us consider the bound on the variance of the estimator (5.8.21). Since
$$\frac{\partial \log p(z, \theta)}{\partial \theta} = \frac{z - \theta}{\sigma^2}, \qquad \frac{\partial^2 \log p(z, \theta)}{\partial \theta^2} = -\frac{1}{\sigma^2}$$
the Fisher information is $I_N = N/\sigma^2$. It follows that
$$\text{Var}[\hat{\theta}] \ge \frac{1}{N/\sigma^2} = \frac{\sigma^2}{N}$$
It then follows from (5.5.10) that the m.l. estimator (5.8.21) of the mean $\mu$ is minimum variance.
•
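A short numerical check (ours, with illustrative parameter values): the variance of the sample average indeed attains the bound $\sigma^2/N$.

set.seed(0)
N <- 10; sigma <- 2; R <- 10000
var(replicate(R, mean(rnorm(N, mean = 1, sd = sigma))))  # about sigma^2/N = 0.4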
5.8.4 Properties of m.l. estimators
Under the (strong) assumption that the probabilistic model structure is known, the maximum likelihood technique features the following properties:
• $\hat{\theta}_{ml}$ is asymptotically unbiased, but usually biased in small-size samples (e.g. $\hat{\sigma}^2_{ml}$ in (5.8.22));
• $\hat{\theta}_{ml}$ is consistent;
• if $\hat{\theta}_{ml}$ is the m.l.e. of $\theta$ and $\gamma(\cdot)$ is a monotone function, then $\gamma(\hat{\theta}_{ml})$ is the m.l.e. of $\gamma(\theta)$;
• if $\gamma(\cdot)$ is a non-monotonic function, then even if $\hat{\theta}_{ml}$ is an unbiased estimator of $\theta$, the m.l.e. $\gamma(\hat{\theta}_{ml})$ of $\gamma(\theta)$ is usually biased;
• the variance of $\hat{\theta}_{ml}$ is often difficult to determine. For large-size samples we can use as approximation
$$\left(-E\left[\frac{\partial^2 l_N}{\partial \theta^2}\right]\right)^{-1} \quad \text{or} \quad \left(-\left.\frac{\partial^2 l_N}{\partial \theta^2}\right|_{\hat{\theta}_{ml}}\right)^{-1};$$
• $\hat{\theta}_{ml}$ is asymptotically normally distributed, that is
$$\hat{\theta}_{ml} \sim \mathcal{N}(\theta, [I_N(\theta)]^{-1}), \qquad N \to \infty$$
5.9 Interval estimation
Unlike point estimation, which is based on a one-to-one mapping from the space of data to the space of parameters, interval estimation maps $D_N$ to an interval of $\Theta$. A point estimator is a function which, given a dataset $D_N$ generated from $F_z(z, \theta)$, returns an estimate of $\theta$. An interval estimator is a transformation which, given a dataset $D_N$, returns an interval estimate $[\underline{\theta}, \bar{\theta}]$ of $\theta$. While an estimator is a random variable, an interval estimator is a random interval. Let $\underline{\theta}$ and $\bar{\theta}$ be the random lower and upper bounds, respectively. While a fixed interval either contains a certain value or it does not, a random interval has a certain probability of containing a value. Suppose that
$$\text{Prob}\left\{\underline{\theta} \le \theta \le \bar{\theta}\right\} = 1 - \alpha, \qquad \alpha \in [0, 1] \qquad (5.9.26)$$
Then the random interval $[\underline{\theta}, \bar{\theta}]$ is called a $100(1-\alpha)\%$ confidence interval of $\theta$.
If (5.9.26) holds, we expect that, by repeating the sampling of $D_N$ and the construction of the confidence interval many times, our confidence interval will contain the true $\theta$ at least $100(1-\alpha)\%$ of the time. Notice, however, that since $\theta$ is a fixed unknown value, for each realisation $D_N$ the interval $[\underline{\theta}, \bar{\theta}]$ either contains the true $\theta$ or it does not. Therefore, from a frequentist perspective, it is erroneous to think that $1-\alpha$ is the probability of $\theta$ belonging to the interval $[\underline{\theta}, \bar{\theta}]$ computed for a given $D_N$. In fact, $1-\alpha$ is not the probability of the event $\theta \in [\underline{\theta}, \bar{\theta}]$ (since $\theta$ is fixed) but the probability that the interval estimation procedure returns a (random) interval $[\underline{\theta}, \bar{\theta}]$ containing $\theta$.
While a point estimator is characterised by bias and variance (Section 5.5), an interval estimator is characterised by its endpoints $\underline{\theta}$ and $\bar{\theta}$ (or its width) and by its confidence $\alpha$. In Figure 5.3 we used an analogy between point estimation and the game of darts to illustrate the bias/variance notions. In the case of interval estimation, the best analogy is provided by the game of horseshoes4 (Figure 5.5).
4 https://en.wikipedia.org/wiki/Horseshoes
Figure 5.5: The horseshoes game as an analogy of interval estimation.
A horseshoe player is like an interval estimator, and each toss of a horseshoe corresponds to an interval estimation. The horseshoe width corresponds to the interval size, and the probability of encircling the stake corresponds to the confidence $\alpha$.
5.9.1 Confidence interval of $\mu$
Consider a random sample $D_N$ of a r.v. $z \sim \mathcal{N}(\mu, \sigma^2)$ where $\sigma^2$ is known. Suppose we want to estimate $\mu$ with the estimator $\hat{\mu}$. From Section 5.7 we have that $\hat{\mu} \sim \mathcal{N}(\mu, \sigma^2/N)$. From (3.4.47) it follows that
$$\frac{\hat{\mu} - \mu}{\sigma/\sqrt{N}} \sim \mathcal{N}(0, 1)$$
and consequently, according to Definition 4.7,
$$\text{Prob}\left\{-z_{\alpha/2} \le \frac{\hat{\mu} - \mu}{\sigma/\sqrt{N}} \le z_{\alpha/2}\right\} = 1 - \alpha \qquad (5.9.27)$$
$$\text{Prob}\left\{\hat{\mu} - z_{\alpha/2}\frac{\sigma}{\sqrt{N}} \le \mu \le \hat{\mu} + z_{\alpha/2}\frac{\sigma}{\sqrt{N}}\right\} = 1 - \alpha \qquad (5.9.28)$$
where $z_\alpha$ is the upper critical point of the standard Gaussian distribution. It follows that $\underline{\theta} = \hat{\mu} - z_\alpha \sigma/\sqrt{N}$ is a lower $1-\alpha$ confidence bound for $\mu$, while $\bar{\theta} = \hat{\mu} + z_\alpha \sigma/\sqrt{N}$ is an upper $1-\alpha$ confidence bound for $\mu$. By varying $\alpha$ we can vary the width and the confidence of the interval.
Example
Let $z \sim \mathcal{N}(\mu, 0.01)$ and $D_N = \{10, 11, 12, 13, 14, 15\}$. We want to estimate the confidence interval of $\mu$ with level $\alpha = 0.1$. Since $N = 6$, $\hat{\mu} = 12.5$, $\sigma = \sqrt{0.01} = 0.1$ and
$$\varepsilon = z_{\alpha/2}\,\sigma/\sqrt{N} = 1.645 \cdot 0.1/\sqrt{6} = 0.0672$$
the 90% confidence interval for the given $D_N$ is
$$\{\mu : |\hat{\mu} - \mu| \le \varepsilon\} = \{12.5 - 0.0672 \le \mu \le 12.5 + 0.0672\}$$
•
Figure 5.6: Fraction of times that the confidence interval contains the parameter $\mu$ vs. the number of repetitions, for $\alpha = 0.1$.
R script
The R script confidence.R tests formula (5.9.27) by simulation. The user sets $\mu$, $\sigma$, $N$, $\alpha$ and a number of iterations Niter. The script generates $D_N \sim \mathcal{N}(\mu, \sigma^2)$ Niter times and computes $\hat{\mu}$. The script returns the percentage of times that
$$\hat{\mu} - z_{\alpha/2}\frac{\sigma}{\sqrt{N}} < \mu < \hat{\mu} + z_{\alpha/2}\frac{\sigma}{\sqrt{N}}$$
This percentage is plotted against the number of iterations in Figure 5.6 (R script confidence.R). We can easily check that this percentage converges to $100(1-\alpha)\%$ as Niter $\to \infty$.
•
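The following is a minimal sketch of the simulation performed by confidence.R (the actual script may differ in its details):

set.seed(0)
mu <- 0; sigma <- 1; N <- 20; alpha <- 0.1; Niter <- 10000
z.half <- qnorm(1 - alpha / 2)   # upper alpha/2 critical point
covered <- replicate(Niter, {
  mu.hat <- mean(rnorm(N, mu, sigma))
  abs(mu.hat - mu) <= z.half * sigma / sqrt(N)
})
mean(covered)   # fraction of covering intervals: close to 1 - alpha = 0.9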
Consider now the confidence interval of $\mu$ when the variance $\sigma^2$ is not known. Let $\hat{\mu}$ and $\hat{\sigma}^2$ be the estimators of $\mu$ and $\sigma^2$ computed on the basis of the i.i.d. dataset $D_N$. From Section 5.7 it follows that
$$\frac{\hat{\mu} - \mu}{\sqrt{\hat{\sigma}^2/N}} \sim \mathcal{T}_{N-1}$$
Analogously to (5.9.28) we have
$$\text{Prob}\left\{\hat{\mu} - t_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{N}} \le \mu \le \hat{\mu} + t_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{N}}\right\} = 1 - \alpha \qquad (5.9.29)$$
where $t_\alpha$ is the upper critical point of the Student distribution.
Example
Let $z \sim \mathcal{N}(\mu, \sigma^2)$, with $\sigma^2$ unknown, and $D_N = \{10, 11, 12, 13, 14, 15\}$. We want to estimate the confidence region of $\mu$ with level $\alpha = 0.1$. We have $\hat{\mu} = 12.5$ and $\hat{\sigma}^2 = 3.5$. According to (5.9.29),
$$\varepsilon = t_{\{\alpha/2, N-1\}}\,\hat{\sigma}/\sqrt{N} = 2.015 \cdot 1.87/\sqrt{6} = 1.53$$
The $(1-\alpha)$ confidence interval of $\mu$ is
$$\hat{\mu} - \varepsilon < \mu < \hat{\mu} + \varepsilon$$
•
Example
We want to estimate $\theta$, the proportion of people who support the politics of Mr. Berlusconi in a very large population. We want to determine how many interviews are necessary to obtain a confidence interval of 6% width (i.e. half-width 0.03) with a significance of 5%. We interview $N$ persons and estimate $\theta$ as
$$\hat{\theta} = \frac{x_1 + \cdots + x_N}{N} = \frac{S}{N}$$
where $x_i = 1$ if the $i$th person supports Berlusconi and $x_i = 0$ otherwise. Note that $S$ is a binomial variable. We have
$$E[\hat{\theta}] = \theta, \qquad \text{Var}[\hat{\theta}] = \text{Var}[S/N] = \frac{N\theta(1-\theta)}{N^2} = \frac{\theta(1-\theta)}{N} \le \frac{1}{4N}$$
If we approximate the distribution of $\hat{\theta}$ by $\mathcal{N}\left(\theta, \frac{\theta(1-\theta)}{N}\right)$, it follows that $\frac{\hat{\theta} - \theta}{\sqrt{\theta(1-\theta)/N}} \sim \mathcal{N}(0, 1)$. The following relation holds:
$$\text{Prob}\left\{\hat{\theta} - 0.03 \le \theta \le \hat{\theta} + 0.03\right\} = \text{Prob}\left\{\frac{-0.03}{\sqrt{\theta(1-\theta)/N}} \le \frac{\hat{\theta} - \theta}{\sqrt{\theta(1-\theta)/N}} \le \frac{0.03}{\sqrt{\theta(1-\theta)/N}}\right\}$$
$$= \Phi\left(\frac{0.03}{\sqrt{\theta(1-\theta)/N}}\right) - \Phi\left(\frac{-0.03}{\sqrt{\theta(1-\theta)/N}}\right) \ge \Phi(0.03\sqrt{4N}) - \Phi(-0.03\sqrt{4N})$$
In order for this probability to be at least 0.95, we need $0.03\sqrt{4N} \ge 1.96$ or, equivalently, $N \ge 1068$.
•
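The final computation can be checked with a one-line R command (a check of ours, not the book's code):

ceiling((qnorm(0.975) / (2 * 0.03))^2)  # 1068 interviews needed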
5.10 Combination of two estimators
Consider two unbiased estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ of the same parameter $\theta$,
$$E[\hat{\theta}_1] = \theta, \qquad E[\hat{\theta}_2] = \theta$$
having equal and non-zero variance,
$$\text{Var}[\hat{\theta}_1] = \text{Var}[\hat{\theta}_2] = v$$
and being uncorrelated, i.e. $\text{Cov}[\hat{\theta}_1, \hat{\theta}_2] = 0$. Let $\hat{\theta}_{cm}$ be the combined estimator
$$\hat{\theta}_{cm} = \frac{\hat{\theta}_1 + \hat{\theta}_2}{2}$$
This estimator has the nice property of being unbiased,
$$E[\hat{\theta}_{cm}] = \frac{E[\hat{\theta}_1] + E[\hat{\theta}_2]}{2} = \theta \qquad (5.10.30)$$
and of having a smaller variance than the original estimators:
$$\text{Var}[\hat{\theta}_{cm}] = \frac{1}{4}\text{Var}[\hat{\theta}_1 + \hat{\theta}_2] = \frac{\text{Var}[\hat{\theta}_1] + \text{Var}[\hat{\theta}_2]}{4} = \frac{v}{2} \qquad (5.10.31)$$
This simple computation shows that the average of two uncorrelated, unbiased estimators with non-zero variance is a combined estimator with reduced variance.
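A tiny Monte Carlo check (ours, under the illustrative assumption that each estimator is the sample average of its own independent Normal sample):

set.seed(0)
N <- 10; R <- 10000
theta1 <- replicate(R, mean(rnorm(N)))  # first estimator, variance v = 1/N
theta2 <- replicate(R, mean(rnorm(N)))  # second, independent and uncorrelated
theta.cm <- (theta1 + theta2) / 2       # combined estimator
c(var(theta1), var(theta.cm))           # the second value is about v/2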
5.10.1 Combination of m estimators
Here we report the general formula for the linear combination of a number $m$ of estimators [179, 181]. Assume we want to estimate the unknown parameter $\theta$ by combining a set of $m$ estimators $\{\hat{\theta}_j\}$, $j = 1, \ldots, m$. Let
$$E[\hat{\theta}_j] = \mu_j, \qquad \text{Var}[\hat{\theta}_j] = v_j, \qquad \text{Bias}[\hat{\theta}_j] = b_j$$
be the expected values, the variances and the biases of the $m$ estimators, respectively. We are interested in estimating $\theta$ by forming a linear combination
$$\hat{\theta}_{cm} = \sum_{j=1}^m w_j \hat{\theta}_j = w^T \hat{\theta} \qquad (5.10.32)$$
where $\hat{\theta} = [\hat{\theta}_1, \ldots, \hat{\theta}_m]^T$ is the vector of estimators and $w = [w_1, \ldots, w_m]^T$ is the weighting vector.
The mean-squared error of the combined system is
$$\text{MSE} = E[(\hat{\theta}_{cm} - \theta)^2] = E[(w^T\hat{\theta} - E[w^T\hat{\theta}])^2] + (E[w^T\hat{\theta}] - \theta)^2$$
$$= E[(w^T(\hat{\theta} - E[\hat{\theta}]))^2] + (w^T\mu - \theta)^2 = w^T\Omega w + (w^T\mu - \theta)^2$$
where $\Omega$ is the $[m \times m]$ covariance matrix whose $ij$th term is
$$\Omega_{ij} = E[(\hat{\theta}_i - \mu_i)(\hat{\theta}_j - \mu_j)]$$
and $\mu = (\mu_1, \ldots, \mu_m)^T$ is the vector of expected values. Note that the MSE has a variance term (dependent on the covariance of the single estimators) and a bias term (dependent on the bias of the single estimators).
5.10.1.1 Linear constrained combination
A commonly used constraint is
$$\sum_{j=1}^m w_j = 1, \qquad w_j \ge 0, \quad j = 1, \ldots, m \qquad (5.10.33)$$
This constraint ensures that the combined estimator is unbiased if the individual estimators are unbiased. Let us write $w$ as
$$w = (u^T g)^{-1} g$$
where $u = (1, \ldots, 1)^T$ is an $m$-dimensional vector of ones, $g = (g_1, \ldots, g_m)^T$ and $g_j > 0$, $\forall j = 1, \ldots, m$.
The constraint can be enforced in minimising the MSE by using the Lagrangian function
$$L = w^T\Omega w + (w^T\mu - \theta)^2 + \lambda(w^T u - 1)$$
with $\lambda$ the Lagrange multiplier. The optimum is achieved by setting
$$g^* = [\Omega + (\mu - \theta u)(\mu - \theta u)^T]^{-1} u$$
With unbiased estimators ($\mu = \theta u$) we obtain
$$g^* = \Omega^{-1} u$$
and with unbiased, uncorrelated estimators
$$g^*_j = \frac{1}{v_j}, \qquad j = 1, \ldots, m \qquad (5.10.34)$$
This means that the optimal term $g^*_j$ of each estimator is inversely proportional to its own variance.
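In the unbiased, uncorrelated case, (5.10.34) can be coded in a few lines (the variance values below are illustrative assumptions):

v <- c(1, 2, 4)   # variances of three unbiased, uncorrelated estimators
g <- 1 / v        # optimal terms g_j = 1/v_j
w <- g / sum(g)   # normalise so that the weights sum to one
w                 # 0.571, 0.286, 0.143
sum(w^2 * v)      # variance of the combination: about 0.571 < min(v)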
5.11 Testing hypothesis
Hypothesis testing is, together with estimation, a major area of statistical inference. A statistical hypothesis is an assertion or conjecture about the distribution of one or more random variables. A test of a statistical hypothesis is a rule or procedure for deciding whether to reject the assertion on the basis of the observed data. The basic idea is to formulate a statistical hypothesis and to check whether the data provide any evidence to reject it. Some examples of hypothesis tests follow:
• Consider the model of the traffic in the boulevard. Suppose that the measures of the inter-arrival times are $D_N = \{10, 11, 1, 21, 2, \ldots\}$ seconds. Can we say that the mean inter-arrival time $\theta$ is different from 10?
• We want to know the effect of a drug on rats' survival to cancer. We randomly divide some rats into two groups and administer the drug to only one of them. Is the survival rate of the two groups the same?
• Consider the grades of two different school sections. Section A had $\{15, 10, 12, 19, 5, 7\}$. Section B had $\{14, 11, 11, 12, 6, 7\}$. Can we say that Section A had better grades than Section B?
• Consider two protein-coding genes and their expression levels in a cell. Are the two genes differentially expressed?
A statistical test is a procedure that aims to answer such questions.
5.11.1 Types of hypothesis
We start by declaring the working (basic, null) hypothesis $H$ to be tested, in the form $\theta = \theta_0$ or $\theta \in \omega \subset \Theta$, where $\theta_0$ or $\omega$ are given.
The hypothesis can be
simple: it fully specifies the distribution of the r.v. $z$;
composite: it partially specifies the distribution of $z$.
For example, if $D_N$ is a random sample of size $N$ drawn from $\mathcal{N}(\mu, \sigma^2)$, the hypothesis $H: \mu = \mu_0, \sigma = \sigma_0$ (with $\mu_0$ and $\sigma_0$ known values) is simple, while the hypothesis $H: \mu = \mu_0$ is composite, since it leaves open the value of $\sigma$ in $(0, \infty)$.
5.11.2 Types of statistical test
Suppose we have sampled a dataset $D_N = \{z_1, \ldots, z_N\}$ from a distribution $F_z$ and we have declared a null hypothesis $H$ about $F$. The three most common types of statistical test are:
Pure significance test: the data $D_N$ are used to assess the inferential evidence against $H$.
Significance test: the inferential evidence against $H$ is used to judge whether $H$ is inappropriate. This test returns a decision rule for rejecting or not rejecting $H$.
Hypothesis test: the data $D_N$ are used to assess the hypothesis $H$ against a specific alternative hypothesis $\bar{H}$. This test returns a rule for rejecting $H$ in favour of $\bar{H}$.
The three tests will be discussed in the following sections.
5.11.3 Pure significance test
Consider a simple null hypothesis $H$. Let $t(D_N)$ be a statistic (i.e. a function of the dataset) such that the larger its value, the more it casts doubt on $H$. The quantity $t(D_N)$ is called the test statistic or discrepancy measure. Suppose that the distribution of $t(D_N)$ under $H$ is known. This is possible since the function $t(\cdot)$ is fixed by the user and the simple hypothesis $H$ entirely specifies the distribution of $z$ and, consequently, the distribution of $t(D_N)$. Let $t_N = t(D_N)$ be the observed value of $t$ calculated on the basis of the sample data $D_N$. Let us define the p-value as
$$p = \text{Prob}\{t(D_N) > t_N \mid H\} \qquad (5.11.35)$$
i.e. the probability of observing a statistic greater than $t_N$ if the hypothesis $H$ were true. Note that in the expression (5.11.35) the term $t(D_N)$ is a random variable having a known distribution, while $t_N$ is a value computed on the basis of the observed dataset.
If the quantity $p$ is small, then the sample data $D_N$ are highly inconsistent with $H$, and $p$ (significance probability or significance level) is the measure of such inconsistency. If $p$ is small, then either a rare event has occurred or perhaps $H$ is not true. In other terms, if $H$ were true, the quantity $p$ would be the proportion of situations where we would observe a degree of inconsistency at least to the extent represented by $t_N$. The smaller the p-value, the stronger the evidence against $H$5.
Note that $p$ depends on $D_N$, since different $D_N$ would yield different values of $t_N$ and consequently different values of $p \in [0, 1]$. Moreover, it can be shown that, if the null hypothesis is true, the p-value has a uniform $\mathcal{U}[0, 1]$ distribution. Also, in a frequentist perspective, we cannot say that $p$ is the probability that $H$ is true, but rather that $p$ is the probability of observing data at least as extreme as $D_N$ given that $H$ is true.
5.11.4 Tests of significance
The test of significance proposes the following decision rule: if $p$ is less than some stated value $\alpha$, we reject $H$. Once a critical level $\alpha$ is chosen and the dataset $D_N$ is observed, the rule rejects $H$ at level $\alpha$ if
$$\text{Prob}\{t(D_N) > t_\alpha \mid H\} = \alpha \qquad (5.11.36)$$
This is equivalent to choosing some critical value $t_\alpha$ and rejecting $H$ if $t_N > t_\alpha$. This implies the existence of two regions in the space of sample data:
critical region: the set of values of $D_N$
$$S_0 = \{D_N : t(D_N) > t_\alpha\}$$
such that if $D_N \in S_0$ we reject the null hypothesis $H$;
non-critical region: the set of values of $D_N$ such that there is no reason to reject $H$ on the basis of the level-$\alpha$ test.
The principle is that we will accept $H$ unless what we observed has too small a probability of happening when $H$ is true. The upper bound of this probability is $\alpha$, i.e. the significance level $\alpha$ is the highest p-value for which we reject $H$. Note that the p-value changes with the observed data (i.e. it is a random variable), while $\alpha$ is a level fixed by the user.
5 It is a common habit in life-science research to consider a p-value smaller than 0.05 (0.01) a (very) strong evidence against $H$.
Example
Let $D_N$ consist of $N$ independent observations of $x \sim \mathcal{N}(\mu, \sigma^2)$, with known variance $\sigma^2$. We want to test the hypothesis $H: \mu = \mu_0$, with $\mu_0$ known. Consider as test statistic the quantity $t(D_N) = |\hat{\mu} - \mu_0|$, where $\hat{\mu}$ is the sample average estimator. If $H$ is true, we know from Section 5.4 that $\hat{\mu} \sim \mathcal{N}(\mu_0, \sigma^2/N)$. Let us calculate the value $t(D_N) = |\hat{\mu} - \mu_0|$ and fix a significance level $\alpha = 10\%$. The decision rule requires the definition of the value $t_\alpha$ such that
$$\text{Prob}\{t(D_N) > t_\alpha \mid H\} = \text{Prob}\{|\hat{\mu} - \mu_0| > t_\alpha \mid H\} = \text{Prob}\{(\hat{\mu} - \mu_0 > t_\alpha) \cup (\hat{\mu} - \mu_0 < -t_\alpha) \mid H\} = 0.1$$
For a Normal variable $z \sim \mathcal{N}(\mu, \sigma^2)$ we have that
$$\text{Prob}\{|z - \mu| > 1.645\sigma\} = \text{Prob}\left\{\frac{|z - \mu|}{\sigma} > 1.645\right\} = 2 \cdot 0.05$$
It follows that, being $\hat{\mu} \sim \mathcal{N}(\mu_0, \sigma^2/N)$,
$$\text{Prob}\left\{|\hat{\mu} - \mu_0| > 1.645\,\sigma/\sqrt{N}\right\} = 0.05 + 0.05 = 0.1$$
and consequently
$$t_\alpha = 1.645\,\sigma/\sqrt{N} \qquad (5.11.37)$$
The critical region is
$$S_0 = \left\{D_N : |\hat{\mu} - \mu_0| > 1.645\,\sigma/\sqrt{N}\right\}$$
•
Example
Suppose that $\sigma = 0.1$ and that we want to test whether $\mu = \mu_0 = 10$ with significance level 10%. Let $N = 6$ and $D_N = \{10, 11, 12, 13, 14, 15\}$. From the dataset we compute
$$\hat{\mu} = \frac{10 + 11 + 12 + 13 + 14 + 15}{6} = 12.5$$
and
$$t(D_N) = |\hat{\mu} - \mu_0| = 2.5$$
Since, according to (5.11.37), $t_\alpha = 1.645 \cdot 0.1/\sqrt{6} = 0.0672$ and $t(D_N) > t_\alpha$, the observations $D_N$ are in the critical region. The conclusion is: the hypothesis $H: \mu = 10$ is rejected, and the probability that we are making an error by rejecting $H$ is smaller than 0.1.
•
5.11.5 Hypothesis testing
So far we have dealt with single hypothesis tests. Let us now consider two mutually exclusive hypotheses: $H$ and $\bar{H}$. Suppose we have a dataset $\{z_1, \ldots, z_N\} \sim F$ drawn from a distribution $F$. On the basis of this dataset, one hypothesis will be accepted and the other rejected. In this case, given the stochastic setting, two types of error are possible.
Type I error. This is the kind of error we make when we reject $H$ but $H$ is true. For a given critical level $t_\alpha$, the probability of making this error is
$$\text{Prob}\{t(D_N) > t_\alpha \mid H\} = \alpha \qquad (5.11.38)$$
Type II error. This is the kind of error we make when we accept $H$ but $H$ is false. In order to define this error, we are forced to declare an alternative hypothesis $\bar{H}$ as a formal definition of what is meant by $H$ being false. The probability of a Type II error is
$$\text{Prob}\{t(D_N) \le t_\alpha \mid \bar{H}\} \qquad (5.11.39)$$
that is, the probability that the test leads to the acceptance of $H$ when in fact $\bar{H}$ holds.
Note that
• when the alternative hypothesis is composite, there may be no unique Type II error;
• although $H$ and $\bar{H}$ are complementary events, the quantity (5.11.39) cannot be derived from (5.11.38) (see Equation (3.1.23)).
Example
In order to better illustrate these notions, let us consider the analogy with a murder trial, where the suspect is Mr. Bean. The null hypothesis $H$ is "Mr. Bean is innocent". The dataset is the amount of evidence collected by the police against Mr. Bean. The Type I error is the error we make if, Mr. Bean being innocent, we send him to the death penalty. The Type II error is the error we make if, Mr. Bean being guilty, we acquit him. Note that the two hypotheses have a different philosophical status (asymmetry): $H$ is a conservative hypothesis, not to be rejected unless the evidence against Mr. Bean's innocence is clear. This means that a Type I error is more serious than a Type II error (benefit of the doubt).
•
Example
Let us consider a professor who has to decide, on the basis of empirical evidence, whether a student copied during a class test. The null hypothesis $H$ is that the student is honest. The alternative hypothesis $\bar{H}$ is that the student cheated. Let the empirical evidence $t_N$ be the number of lines of the classwork that the student shares with at least one classmate.
The decision rule of the professor is the following: a student passes (i.e. the null hypothesis that she is honest is accepted) if there is not enough empirical evidence against her (e.g. if $t_N \le t_\alpha = 2$); otherwise she fails (i.e. the alternative hypothesis is chosen). Will the professor make any errors? Why? And what does this depend on?
•
5.11.6 The hypothesis testing procedure
In general terms, a hypothesis testing procedure can be decomposed into the following steps:
1. Declare the null and the alternative hypotheses.
2. Choose the numeric value $\alpha$ of the Type I error (i.e. the risk I want to run when I reject the null hypothesis).
3. Define a test statistic.
4. Determine the critical value $t_\alpha$ of the test statistic that leads to a rejection of $H$ according to the Type I error defined in Step 2.
5. Among the set of tests of level $\alpha$, choose the test that minimises the probability of a Type II error.
6. Obtain the data and determine whether the observed value of the test statistic leads to the acceptance or rejection of $H$.
Note that a number of tests with different Type II errors can guarantee the same Type I error. An appropriate choice of test as a function of the Type II error is therefore required and will be discussed in the following section.
5.11.7 Choice of test
The choice of test, and consequently the choice of the partition $\{S_0, S_1\}$, is based on two steps:
1. Define a significance level $\alpha$, that is the probability of a Type I error (the probability of incorrectly rejecting $H$):
$$\text{Prob}\{\text{reject } H \mid H\} = \text{Prob}\{D_N \in S_0 \mid H\} = \alpha$$
2. Among the set of tests $\{S_0, S_1\}$ of level $\alpha$, choose the test that minimises the probability of a Type II error,
$$\text{Prob}\{\text{accept } H \mid \bar{H}\} = \text{Prob}\{D_N \in S_1 \mid \bar{H}\}$$
that is, the probability of incorrectly accepting $H$. This is equivalent to maximising the power of the test,
$$\text{Prob}\{\text{reject } H \mid \bar{H}\} = \text{Prob}\{D_N \in S_0 \mid \bar{H}\} = 1 - \text{Prob}\{D_N \in S_1 \mid \bar{H}\}$$
which is the probability of correctly rejecting $H$. Note that, for a given significance level, the higher the power, the better!
Example
In order to reason about the Type II error, let us consider a r.v. $z \sim \mathcal{N}(\mu, \sigma^2)$, where $\sigma$ is known, and a set of $N$ i.i.d. observations. We want to test the null hypothesis $\mu = \mu_0 = 0$ with $\alpha = 0.1$. Consider three different tests and the associated critical regions $S_0$:
1. $|\hat{\mu} - \mu_0| > 1.645\,\sigma/\sqrt{N}$
2. $\hat{\mu} - \mu_0 > 1.282\,\sigma/\sqrt{N}$ (Figure 5.7)
3. $|\hat{\mu} - \mu_0| < 0.126\,\sigma/\sqrt{N}$ (Figure 5.8)
Assume that the area blackened in Figure 5.7 equals the area blackened in Figure 5.8. For all these tests $\text{Prob}\{D_N \in S_0 \mid H\} \le \alpha$, hence the significance level (i.e. the Type I error) is the same. However, if $\bar{H}: \mu_1 = 10$, the Type II errors of the three tests are significantly different. Which test is the best one, that is, the one which guarantees the lowest Type II error?
•
Figure 5.7: On the left: distribution of the test statistic $\hat{\mu}$ if $H: \mu_0 = 0$ is true. On the right: distribution of the test statistic $\hat{\mu}$ if $\bar{H}: \mu_1 = 10$ is true. The interval marked by $S_1$ denotes the set of observed $\hat{\mu}$ values for which $H$ is accepted (non-critical region). The interval marked by $S_0$ denotes the set of observed $\hat{\mu}$ values for which $H$ is rejected (critical region). The area of the black pattern region on the right equals $\text{Prob}\{D_N \in S_0 \mid H\}$, i.e. the probability of rejecting $H$ when $H$ is true (Type I error). The area of the grey shaded region on the left equals the probability of accepting $H$ when $H$ is false (Type II error).
Figure 5.8: On the left: distribution of the test statistic $\hat{\mu}$ if $H: \mu_0 = 0$ is true. On the right: distribution of the test statistic $\hat{\mu}$ if $\bar{H}: \mu_1 = 10$ is true. The two intervals marked by $S_1$ denote the set of observed $\hat{\mu}$ values for which $H$ is accepted (non-critical region). The interval marked by $S_0$ denotes the set of observed $\hat{\mu}$ values for which $H$ is rejected (critical region). The area of the pattern region equals $\text{Prob}\{D_N \in S_0 \mid H\}$, i.e. the probability of rejecting $H$ when $H$ is true (Type I error). Which area corresponds to the probability of the Type II error?
5.11.8 UMP level-$\alpha$ test
Given a significance level $\alpha$, we call uniformly most powerful (UMP) test the test
1. which satisfies
$$\text{Prob}\{\text{reject } H \mid H\} = \text{Prob}\{D_N \in S_0 \mid H\} = \alpha$$
2. for which
$$\text{Prob}\{\text{reject } H \mid \bar{H}\} = \text{Prob}\{D_N \in S_0 \mid \bar{H}\}$$
is maximised simultaneously for all $\theta \in \Theta_{\bar{H}}$.
How is it possible to find UMP tests? In a simple case, an answer is given by the Neyman-Pearson lemma.
5.11.9 Likelihood ratio test
Consider the simplest case $\Theta = \{\theta_0, \theta_1\}$, where $H: \theta = \theta_0$ and $\bar{H}: \theta = \theta_1$, and $\theta_0, \theta_1$ are two different values of the parameter of a r.v. $z$. Let us denote the two likelihoods by $L(\theta_0)$ and $L(\theta_1)$, respectively.
The idea of Neyman and Pearson was to base the acceptance/rejection of $H$ on the relative values of $L(\theta_0)$ and $L(\theta_1)$. In other terms, we reject $H$ if the likelihood ratio
$$\frac{L(\theta_1)}{L(\theta_0)}$$
is sufficiently large, i.e. only if the sample data $D_N$ are sufficiently more probable when $\theta = \theta_1$ than when $\theta = \theta_0$.

Lemma 2 (Neyman-Pearson lemma). Let $H: \theta = \theta_0$ and $\bar{H}: \theta = \theta_1$. If a partition $\{S_0, S_1\}$ of the sample space $\mathcal{D}$ is defined by
$$S_0 = \{D_N : L(\theta_1) > kL(\theta_0)\}, \qquad S_1 = \{D_N : L(\theta_1) < kL(\theta_0)\}$$
with $\int_{S_0} p(D_N, \theta_0)\,dD_N = \alpha$, then $\{S_0, S_1\}$ is the most powerful level-$\alpha$ test of $H$ against $\bar{H}$.

This lemma shows that, among all tests of level $\le \alpha$, the likelihood ratio test is the optimal procedure, i.e. it has the smallest probability of a Type II error. Although, for a generic distribution, the definition of an optimal test is very difficult, all the tests described in the following are optimal in the UMP sense.
5.12 Parametric tests
Suppose we want to test an assertion about a random variable with a known parametric distribution $F(\cdot, \theta)$. Besides the distinction between simple and composite tests presented in Section 5.11.1, there are two more ways of classifying hypothesis tests:
One-sample vs. two-sample: one-sample tests concern a hypothesis about the properties of a single r.v. $z \sim \mathcal{N}(\mu, \sigma^2)$, while two-sample tests concern the relationship between two r.v.s $z_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $z_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$.
Single-sided (one-tailed) vs. two-sided (two-tailed): in single-sided tests the region of rejection concerns only one tail of the distribution under the null hypothesis; $\bar{H}$ then indicates the predicted direction of the difference. In two-sided tests the region of rejection concerns both tails of the null distribution; $\bar{H}$ then does not indicate the predicted direction of the difference.
The most common parametric tests rely on the hypothesis of normality. A non-exhaustive list of conventional parametric tests is given in the following table:

Name      single/two sample   known        H              H̄
z-test    single              σ²           µ = µ0         µ ≠ µ0
z-test    two                 σ1² = σ2²    µ1 = µ2        µ1 ≠ µ2
t-test    single                           µ = µ0         µ ≠ µ0
t-test    two                              µ1 = µ2        µ1 ≠ µ2
χ²-test   single              µ            σ² = σ0²       σ² ≠ σ0²
χ²-test   single                           σ² = σ0²       σ² ≠ σ0²
F-test    two                              σ1² = σ2²      σ1² ≠ σ2²

The columns H and H̄ contain the parameter taken into consideration by the test.
All the parametric test procedures can be decomposed into five main steps:
1. Define the null hypothesis and the alternative one.
2. Fix the probability $\alpha$ of having a Type I error.
3. Choose a test statistic $t(D_N)$.
4. Define the critical value $t_\alpha$ that satisfies the Type I error constraint.
5. Collect the dataset $D_N$, compute $t(D_N)$ and decide whether the hypothesis is accepted or rejected.
Note that the first four steps are independent of the data and should be carried out before the collection of the dataset. A more detailed description of some of these tests is given in the following sections and in Appendix C.3.
5.12.1 z-test (single and one-sided)
Consider a random sample $D_N \leftarrow x \sim \mathcal{N}(\mu, \sigma^2)$ with $\mu$ unknown and $\sigma^2$ known. Let us see in detail how the five steps of the testing procedure are instantiated in this case.
STEP 1: consider the null hypothesis and the (composite and one-sided) alternative
$$H: \mu = \mu_0; \qquad \bar{H}: \mu > \mu_0$$
STEP 2: fix the value $\alpha$ of the Type I error.
STEP 3: if $H$ is true, then the distribution of $\hat{\mu}$ is $\mathcal{N}(\mu_0, \sigma^2/N)$. This means that the test statistic $t(D_N)$ is
$$t_N = t(D_N) = \frac{(\hat{\mu} - \mu_0)\sqrt{N}}{\sigma} \sim \mathcal{N}(0, 1)$$
STEP 4: determine the critical value $t_\alpha$. We reject the hypothesis $H$ if $t_N > t_\alpha = z_\alpha$, where $z_\alpha$ is such that $\text{Prob}\{\mathcal{N}(0, 1) > z_\alpha\} = \alpha$.
Example: for $\alpha = 0.05$ we would take $z_\alpha = 1.645$, since 5% of the standard normal distribution lies to the right of 1.645. Note that the value $z_\alpha$ for a given $\alpha$ can be obtained by the R command qnorm(alpha,lower.tail=FALSE).
STEP 5: once the dataset $D_N$ is measured, the value of the test statistic is
$$t_N = \frac{(\hat{\mu} - \mu_0)\sqrt{N}}{\sigma}$$
and the hypothesis is either accepted ($t_N \le z_\alpha$) or rejected.
Example z-test
Consider a r.v. $z \sim \mathcal{N}(\mu, 1)$. We want to test $H: \mu = 5$ against $\bar{H}: \mu > 5$ with significance level 0.05. Suppose that the dataset is $D_N = \{5.1, 5.5, 4.9, 5.3\}$. Then $\hat{\mu} = 5.2$ and $t_N = (5.2 - 5) \cdot 2/1 = 0.4$. Since this is less than 1.645, we do not reject the null hypothesis.
•
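The same computation in R (a sketch of ours, not the book's code):

D <- c(5.1, 5.5, 4.9, 5.3)
tN <- (mean(D) - 5) * sqrt(length(D)) / 1  # test statistic, sigma = 1
tN > qnorm(0.05, lower.tail = FALSE)       # FALSE: H is not rejected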
5.12.2 t-test: single sample and two-sided
Consider a random sample from $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ unknown. Let
$$H: \mu = \mu_0; \qquad \bar{H}: \mu \neq \mu_0$$
and let
$$t(D_N) = t_N = \frac{\sqrt{N}(\hat{\mu} - \mu_0)}{\sqrt{\frac{1}{N-1}\sum_{i=1}^N (z_i - \hat{\mu})^2}} = \frac{\hat{\mu} - \mu_0}{\sqrt{\hat{\sigma}^2/N}}$$
be a statistic computed using the dataset $D_N$.
If the hypothesis $H$ holds, it follows from Sections C.2.3 and 5.7 that $t(D_N) \sim \mathcal{T}_{N-1}$ is a r.v. with a Student distribution with $N-1$ degrees of freedom. The size-$\alpha$ t-test consists in rejecting $H$ if
$$|t_N| > k = t_{\alpha/2, N-1}$$
where $t_{\alpha/2, N-1}$ is the upper $\alpha/2$ point of a T-distribution on $N-1$ degrees of freedom, i.e.
$$\text{Prob}\{t_{N-1} > t_{\alpha/2, N-1}\} = \alpha/2, \qquad \text{Prob}\{|t_{N-1}| > t_{\alpha/2, N-1}\} = \alpha$$
where $t_{N-1} \sim \mathcal{T}_{N-1}$. In other terms, $H$ is rejected when $|t_N|$ is too large.
Note that the value $t_{\alpha/2, N-1}$ for given $N$ and $\alpha$ can be obtained by the R command qt(alpha/2,N-1,lower.tail=FALSE).
Example [65]
Suppose we want an answer to the following question: does jogging lead to a reduction in pulse rate? Let us engage eight non-jogging volunteers in a one-month jogging programme and take their pulses before and after the programme:

pulse rate before:  74  86  98  102  78  84  79  70
pulse rate after:   70  85  90  110  71  80  69  74
decrease:            4   1   8   -8   7   4  10  -4

Let us assume that the decreases are randomly sampled from $\mathcal{N}(\mu, \sigma^2)$, where $\sigma^2$ is unknown. We want to test $H: \mu = \mu_0 = 0$ against $\bar{H}: \mu \neq 0$ with significance $\alpha = 0.05$. We have $N = 8$, $\hat{\mu} = 2.75$, $t_N = 1.263$ and $t_{\alpha/2, N-1} = 2.365$. Since $|t_N| \le t_{\alpha/2, N-1}$, the data are not sufficient to reject the hypothesis $H$. In other terms, the experiment does not provide enough evidence that jogging leads to a reduction in pulse rate.
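The same analysis can be reproduced with R's built-in t.test function (a check of ours, not the book's code):

decrease <- c(4, 1, 8, -8, 7, 4, 10, -4)
t.test(decrease, mu = 0)  # t = 1.263, df = 7: H is not rejected at alpha = 0.05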
Figure 5.9: On the left: distribution of the test statistic (number of identical lines) if $H$ is true, i.e. the student is honest. Typically, honest students have very few lines in common with others, though it can happen by chance that this number exceeds 2. On the right: distribution of the test statistic (number of identical lines) if $\bar{H}$ is true, i.e. the student is dishonest. Typically, dishonest students have several lines in common with others, though some of them are cunning enough to conceal it.
•
So far we have assumed that the distribution of the test statistic is known under the null hypothesis. In this case it is possible to fix the Type I error a priori. But what if we know nothing about this distribution? Is it possible to assess a posteriori the quality (in terms of Type I or II errors) of a given test (e.g. one using a certain threshold)?
5.13 A posteriori assessment of a test
Let us consider the professor example (page 122) and the hypothesis test strategy which leads to the failing of a student when $t_N > t_\alpha = 2$. In this case the distribution of the $t_N$ statistic for an honest (or a dishonest) student has no known parametric form (Figure 5.9). Moreover, the professor has no information about such distributions and, consequently, no way to measure or control the Type I error rate (i.e. the grey area in Figure 5.9). Nevertheless, it is possible to estimate the Type I and Type II error rates a posteriori if we have access to the decisions of the professor and the real nature of each student (honest or dishonest).
Suppose that $N$ students took part in the exam and that $N_N$ did not copy while $N_P$ did. According to the professor's decision rule, $\hat{N}_N$ were considered honest and passed the exam, while $\hat{N}_P$ were considered dishonest and failed. Because of the overlapping of the distributions in Figure 5.9, it happens that $FP > 0$ honest students (the ones in the grey area) failed and $FN > 0$ dishonest students (the ones in the blue area) passed. Note that the honest students who failed did not copy, but by chance had more than $t_\alpha = 2$ lines in common with a classmate. At the same time, there are dishonest students who succeeded by copying but who were clever enough to avoid more than 2 identical lines.
The resulting situation is summarised in Table 5.2 and Table 5.3, where we associate the null hypothesis $H$ with the minus sign (not guilty, honest) and the hypothesis $\bar{H}$ with the plus sign.
                          Passed            Failed
H: honest student (−)     TN                FP                N_N = TN + FP
H̄: guilty student (+)     FN                TP                N_P = FN + TP
                          N̂_N = TN + FN     N̂_P = FP + TP     N

Table 5.2: Reality vs. decision: given $N$ students ($N_N$ honest and $N_P$ dishonest ones), the table reports the breakdown of the $N$ professor decisions ($\hat{N}_N$ passes and $\hat{N}_P$ failures).
                               H accepted    H rejected
H: null hypothesis (−)         1 − α         α
H̄: alternative hypothesis (+)  β             1 − β

Table 5.3: Reality vs. decision: the table reports the probabilities of correct and wrong decisions in a hypothesis test. In particular, $\alpha$ denotes the Type I error and $1 - \beta$ the test power.
In Table 5.2, $FP$ denotes the number of False Positives, i.e. the number of times that the professor considered the student guilty (+) while in reality she was innocent (−). The ratio $FP/N_N$ is an estimate of the Type I error (the probability of rejecting the null hypothesis when it is true), denoted by $\alpha$ in Table 5.3. The term $FN$ represents the number of False Negatives, i.e. the number of times that the professor considered a student honest (−) although he copied (+). The ratio $FN/N_P$ is an estimate of the Type II error (the probability of accepting the null hypothesis when it is false), denoted by $\beta$ in Table 5.3.
Note that the Type I and II errors are related. For instance, the professor could decide that he does not want to unfairly fail even a single student, by setting $t_\alpha$ to infinity. In this case all students, honest and dishonest alike, would pass: we would have a null Type I error ($\alpha = 0$) at the cost of the highest Type II error ($\beta = 1$, since every dishonest student would pass).
5.14 Conclusion
The reader wishing to know more about machine learning could be disappointed: she has been reading more than one hundred pages and still has the sensation of not having learned much about machine learning. All she has read seems very far from intelligent agents, neural networks and fancy applications... Nevertheless, she has already come across the most important notions of machine learning: conditional probability, estimation and the bias/variance trade-off. Is that all? From an abstract perspective, yes. All the fancy algorithms that will be presented afterwards (or that the reader is used to hearing about) are nothing more than (often without the designer's knowledge) estimators of conditional probability and, as such, subject to a bias/variance trade-off. Such algorithms are accurate and useful only if they manage this trade-off well.
But we can go a step further and see the bias/variance trade-off not only as a statistical concept but as a metaphor of human attitudes towards models and data, beliefs and experience, ideology and observations, preconceptions and events6. Humans define models (not only in science but also in politics, economics, religion) to represent the regularity of nature. Now, reality often escapes or diverges from such regularity. Faced with the gap between the Eden of regularity and the natural Hell of observations, humans waver between two extremes: i) negate or discredit reality and reduce all divergences to some sort of noise (measurement error), or ii)
6 https://tinyurl.com/y25l4xyp
adapt and change their beliefs to incorporate discordant data and measures into their model (or preconceptions).
The first attitude is exposed to bias (or dogmatism, or worse, conspiracy thinking); the second to variance (or instability). A biased human learner behaves like an estimator that is insensitive to data: her strength derives from intrinsic robustness and coherence, her weakness from the (in)sane attitude of disregarding data and flagrant evidence. On the other side, a highly variant human learner adapts rapidly and swiftly to data and observations, but can easily be criticised for excessive instability, in simple words for going where the wind blows.
When the evidence does not confirm your expectations (or what your parents, teachers or media told you), what is the best attitude to take? Is there an optimal attitude? Which side are you on?
5.15 Exercises
1. Derive analytically the bias of the sample average estimator in a non i.i.d. setting.
2. Derive analytically the variance of the sample average estimator in an i.i.d. setting.
3. Consider a regression problem where
$$y = \sin(x) + w$$
where $x$ is uniformly distributed on the interval $[0, 2\pi]$ and $w \sim \mathcal{N}(1, 1)$ is a Normal variable with both mean and variance equal to 1. Let us consider a predictor $h(x)$ that is distributed like $w$. Compute the bias and the variance of the predictor at the following coordinates: $x = 0$, $x = \pi$, $x = \pi/2$.
Solution:
• $x = 0$: Bias = 0, Var = 1
• $x = \pi$: Bias = 0, Var = 1
• $x = \pi/2$: Bias = 1, Var = 1
4. Let us consider a dataset $D_N = \{z_1, \ldots, z_{20}\}$ of 20 observations generated according to a uniform distribution over the interval $[-1, 1]$. Suppose we want to estimate the expected value of the distribution. Compute the bias and the variance of the following estimators:
• $\hat{\theta}_1 = \frac{\sum_{i=1}^{10} z_i}{10}$
• $\hat{\theta}_2 = \hat{\mu} = \frac{\sum_{i=1}^{20} z_i}{20}$
• $\hat{\theta}_3 = -1$
• $\hat{\theta}_4 = 1$
• $\hat{\theta}_5 = z_2$
Suppose we want to estimate the variance of the distribution. Compute the bias of the following estimators:
• $\hat{\sigma}_1^2 = \frac{\sum (z_i - \hat{\mu})^2}{19}$
• $\hat{\sigma}_2^2 = \frac{\sum (z_i - \hat{\mu})^2}{20}$
• $\hat{\sigma}_3^2 = 1/3$
Solution: Note that θ = 0 and σ²_z = 1/3.
θ̂_1: B_1 = 0, V_1 = σ²_z/10 ≈ 0.033. Justification: E[θ̂_1] = θ and Var[θ̂_1] = σ²_z/10.
θ̂_2: B_2 = 0, V_2 = σ²_z/20 ≈ 0.017. Justification: E[θ̂_2] = θ and Var[θ̂_2] = σ²_z/20.
θ̂_3: B_3 = −1, V_3 = 0. Justification: E[θ̂_3] = −1 and Var[θ̂_3] = 0 since the estimator is constant.
θ̂_4: B_4 = 1, V_4 = 0. Justification: E[θ̂_4] = 1 and Var[θ̂_4] = 0 since the estimator is constant.
θ̂_5: B_5 = 0, V_5 = σ²_z ≈ 0.33. Justification: E[θ̂_5] = θ and Var[θ̂_5] = σ²_z.
σ̂²_1: B = 0. Justification: the sample variance is unbiased, hence E[σ̂²_1] = σ²_z.
σ̂²_2: B = −σ²_z/20 = −1/60 ≈ −0.0167. Justification: note first that σ̂²_2 = (19/20) · Σ_i (z_i − μ̂)²/19, so that
E[σ̂²_2] = (19/20) E[Σ_i (z_i − μ̂)²/19] = (19/20) σ²_z
and then
E[σ̂²_2] − σ²_z = (19/20) σ²_z − σ²_z = −σ²_z/20.
σ̂²_3: B = 0. Justification: E[1/3] = 1/3 = σ²_z.
5. Let us consider the following observations of the random variable z:
D_N = {0.1, −1, 0.3, 1.4}
Write the analytical form of the likelihood function of the mean μ for a Gaussian distribution with variance σ² = 1. The student should:
1. Trace the log-likelihood function on graph paper.
2. Determine graphically the maximum likelihood estimator.
3. Discuss the result.
Solution: Since N = 4 and σ = 1,
L(μ) = Π_{i=1}^N p(z_i, μ) = Π_{i=1}^N (1/√(2π)) exp(−(z_i − μ)²/2)
[Plot of the log-likelihood as a function of μ over the interval [−2, 2].]
Note that μ̂_ml coincides with the sample average μ̂ = 0.2 of D_N.
6. Suppose you want to estimate the expectation μ of the uniform r.v. z ∼ U[−2, 3] by using a dataset of size N = 10. By using R and its random generator, first plot the sampling distribution, then estimate the bias and the variance of the following estimators:
1. θ̂ = Σ_{i=1}^N z_i / N
2. θ̂ = min_{i=1,...,N} z_i
3. θ̂ = max_{i=1,...,N} z_i
4. θ̂ = z_1
5. θ̂ = z_N
6. θ̂ = Σ_{i=1}^N |z_i| / N
7. θ̂ = median_i z_i
8. θ̂ = max_{i=1,...,N} w_i, where w ∼ N(0, 1)
9. θ̂ = 1
Before each random generation set the seed to zero.
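A minimal sketch of a possible Monte Carlo solution for the first two estimators (the remaining points follow the same pattern; the number of trials S is an arbitrary choice):
## Monte Carlo approximation of the sampling distribution, bias and
## variance of two of the estimators above (z ~ U[-2,3], so mu = 0.5)
set.seed(0)
N <- 10
S <- 10000                      ## number of Monte Carlo trials
mu <- 0.5                       ## true expectation of U[-2,3]
theta.avg <- numeric(S)
theta.min <- numeric(S)
for (s in 1:S) {
  DN <- runif(N, -2, 3)
  theta.avg[s] <- mean(DN)      ## estimator 1: sample average
  theta.min[s] <- min(DN)       ## estimator 2: minimum
}
hist(theta.avg, main = "Sampling distribution of the sample average")
cat("average: bias=", mean(theta.avg) - mu, "var=", var(theta.avg), "\n")
cat("minimum: bias=", mean(theta.min) - mu, "var=", var(theta.min), "\n")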
7. The student should first create a dataset of N = 1000 observations according to the dependency
y = g(β_0 + β_1 x) + w
where x ∼ U[−1, 1], β_0 = 1, β_1 = −1, w ∼ N(μ = 0, σ² = 0.1) and g(x) = e^x/(1 + e^x).
Then, by using the same dataset, he should:
• estimate by maximum likelihood the parameters β_0 and β_1,
• plot the contour of the likelihood function, showing in the same graph the values of the parameters and their estimations.
Hint: use a grid search to perform the maximisation.
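A possible sketch (not the book's official solution): since the noise is Gaussian with known variance, maximising the likelihood over a grid of (β_0, β_1) values amounts to summing Gaussian log-densities of the residuals; the grid bounds and step are arbitrary choices.
set.seed(0)
N <- 1000
g <- function(x) exp(x)/(1 + exp(x))
x <- runif(N, -1, 1)
beta0 <- 1; beta1 <- -1
y <- g(beta0 + beta1*x) + rnorm(N, 0, sqrt(0.1))
B0 <- seq(0, 2, by = 0.05)        ## grid for beta0
B1 <- seq(-2, 0, by = 0.05)       ## grid for beta1
logL <- matrix(NA, length(B0), length(B1))
for (i in seq_along(B0))
  for (j in seq_along(B1))
    logL[i, j] <- sum(dnorm(y, g(B0[i] + B1[j]*x), sqrt(0.1), log = TRUE))
best <- which(logL == max(logL), arr.ind = TRUE)
contour(B0, B1, logL, xlab = "beta0", ylab = "beta1")
points(beta0, beta1, pch = 3)                ## true parameter values
points(B0[best[1]], B1[best[2]], pch = 19)   ## maximum likelihood estimates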
8. The student should first create a dataset of N = 1000 observations according to the dependency
Prob{y = 1|x} = g(β_0 + β_1 x)
where x ∼ U[0, 1], β_0 = 1, β_1 = −1, g(x) = e^x/(1 + e^x) and y ∈ {0, 1}.
Then, by using the same dataset, she should:
• estimate by maximum likelihood the parameters β_0 and β_1,
• plot the contour of the likelihood function, showing in the same graph the values of the parameters and their estimations.
Hint: use a grid search to perform the maximisation.
9. Let z ∼ N(1, 1), D_N a training set of N i.i.d. observations z_i and μ̂_N the related sample average estimator.
1. Compute analytically E_{z,D_N}[(z − μ̂_N)²]. Hint: consider that z = θ + w, where θ = E[z] and w ∼ N(0, 1).
2. Compute analytically E_{z,D_N}[(z − μ̂_N)].
3. Validate by Monte Carlo simulation the two theoretical results above.
Solution: Since E[μ̂] = μ, Var[μ̂] = σ²_w/N and w is independent of D_N:
E_{z,D_N}[(z − μ̂_N)²] = E_{z,D_N}[(θ + w − μ̂_N)²]
= E_{z,D_N}[w² + 2w(θ − μ̂_N) + (θ − μ̂_N)²]
= E_z[w²] + E_{D_N}[(θ − μ̂_N)²] = σ²_w + σ²_w/N = 1 + 1/N
R code to perform the Monte Carlo validation:
rm(list=ls())
N <- 5
S <- 10000
sdw <- 1                      ## noise standard deviation
E <- NULL
for (s in 1:S){
  DN <- rnorm(N, 1, sdw)      ## training set of N observations
  muhat <- mean(DN)           ## sample average estimator
  z <- rnorm(1, 1, sdw)       ## independent test observation
  e <- z - muhat
  E <- c(E, e^2)
}
cat("th=", sdw^2 + sdw^2/N, "MC estimation=", mean(E), "\n")
10. Let us suppose that the only measurement of a Gaussian random variable z ∼ N(μ, 1) is the interval [−3.5, 1.5]. Estimate μ by maximum likelihood and show the likelihood function L(μ). Hint: use the R function pnorm.
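A minimal sketch, assuming the likelihood of the censored measurement is the probability that a N(μ, 1) variable falls in [−3.5, 1.5] (the grid for μ is an arbitrary choice):
mu <- seq(-8, 6, by = 0.01)
L <- pnorm(1.5, mu, 1) - pnorm(-3.5, mu, 1)   ## Prob{-3.5 <= z <= 1.5 | mu}
plot(mu, L, type = "l", xlab = "mu", ylab = "L(mu)")
mu[which.max(L)]    ## equals -1, the interval midpoint, by symmetry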
11. Let us suppose that 12 of the 31 days of August in Brussels are rainy. Estimate the probability of a rainy day by maximum likelihood, using the Binomial distribution (Section C.1.2).
Chapter 6
Nonparametric approaches
to estimation and testing
6.1 Nonparametric methods
In the previous chapter, we considered estimation problems where the form of the probability distribution is known, apart from the value of some parameters (e.g. the mean and/or the variance). Such estimation methods are called parametric. The meaningfulness of a parametric test depends entirely on the validity of the assumptions made about the analytical form of the distribution. However, in real configurations, it is not uncommon for the experimenter to question parametric assumptions.
Consider a random sample D_N ← z collected through some experimental observation and for which no hint about the underlying probability distribution F_z(·) is available. Suppose we want to estimate a parameter of interest θ of the distribution of z by using the plug-in estimate θ̂ = t(F̂) (Section 5.3). What can we say about the accuracy of the estimator θ̂? As shown in Section 5.5.3, for some specific parameters (e.g. mean and variance) the accuracy can be estimated independently of the parametric distribution. In most cases, however, the assessment of the estimator is not possible unless we know the underlying distribution. What to do, hence, if the distribution is not available? A solution is provided by the so-called nonparametric or distribution-free methods, which work independently of any specific assumption about the probability distribution.
The adoption of these methods enjoyed considerable success in the last decades thanks to the evolution and parallelisation of computational processing power. In fact, most techniques for nonparametric estimation and testing are based on resampling procedures, which require a large number of repeated (and almost similar) computations on the data.
This chapter will deal with two resampling strategies for estimation and two resampling strategies for hypothesis testing, respectively.
Jackknife: this approach to nonparametric estimation relies on repeated computations of the statistic of interest on all the subsets of the data where one or more of the original examples are removed. It will be presented in Section 6.3.
Bootstrap: this approach to nonparametric estimation aims to estimate the sampling distribution of an estimator by sampling (with replacement) from the original data. It will be introduced in Section 6.4.
Randomisation: this is a testing procedure based on resampling without replacement. It consists in taking the original data and scrambling either the order or the association of the original data. It will be discussed in Section 6.5.
Permutation: this is a resampling two-sample hypothesis-testing procedure based on repeated permutations of the dataset. It will be presented in Section 6.6.
6.2 Estimation of arbitrary statistics
Consider a set D_N of N data points sampled from a scalar r.v. z. Let E[z] = μ be the parameter to be estimated. In Section 5.3.1 we derived the bias and the variance of the estimator μ̂:
μ̂ = (1/N) Σ_{i=1}^N z_i,  Bias[μ̂] = 0,  Var[μ̂] = σ²/N
Consider now another quantity of interest, for example the median or a mode of the distribution. While it is easy to design a plug-in estimate of these quantities, their accuracy is difficult to compute. In other terms, given an arbitrary estimator θ̂, the analytical form of the variance Var[θ̂] and of the bias Bias[θ̂] is typically not available.
Example
According to the plug-in principle (Section 5.3) we can design other estimators besides the sampled mean and variance, such as:
• the estimation of the skewness (3.3.36) of z: see Equation (D.0.2);
• the estimation of the correlation (3.6.67) between x and y: see Equation (D.0.3).
What about the accuracy (e.g. bias, variance) of such estimators?
•
Example
Let us consider an example of estimation taken from an experimental medical study [65]. The goal of the study is to show bioequivalence between an old and a new version of a patch designed to infuse a certain hormone into the blood. Eight subjects take part in the study. Each subject has his hormone levels measured after wearing three different patches: a placebo, an "old" patch and a "new" patch. It is established by the Food and Drug Administration (FDA) that the new patch will be approved for sale only if it is bioequivalent to the old one according to the following criterion:
θ = |E(new) − E(old)| / (E(old) − E(placebo)) ≤ 0.2  (6.2.1)
Let us consider the following plug-in estimator (Section 5.3) of (6.2.1):
θ̂ = |μ̂_new − μ̂_old| / (μ̂_old − μ̂_placebo)
Suppose we have collected the following data (details in [65]):
subj   plac    old     new     z = old − plac   y = new − old
1      9243    17649   16449   8406             −1200
2      9671    12013   14614   2342             2601
3      11792   19979   17274   8187             −2705
...    ...     ...     ...     ...              ...
8      18806   29044   26325   10238            −2719
mean                           6342             −452.3
The estimate is
θ̂ = t(F̂) = |μ̂_new − μ̂_old| / (μ̂_old − μ̂_placebo) = |μ̂_y| / μ̂_z = 452.3/6342 = 0.07
Can we say on the basis of this value that the new patch satisfies the FDA
criterion in (6.2.1)? What about the accuracy, bias or variance of the estimator?
The techniques introduced in the following sections may provide an answer to these
questions.
•
6.3 Jackknife
The jackknife (or leave-one-out) resampling technique aims at providing a computational procedure to estimate the variance and the bias of a generic estimator θ̂. The technique was first proposed by Quenouille in 1949 and is based on removing examples from the available dataset and recalculating the estimator. It is a general-purpose tool that is easy to implement and able to solve a number of estimation problems.
6.3.1 Jackknife estimation
In order to show the theoretical foundation of the jackknife, we first apply this technique to the estimator μ̂ of the mean. Let D_N = {z_1, ..., z_N} be the available dataset. Let us remove the i-th example from D_N and calculate the leave-one-out (l-o-o) mean estimate from the N − 1 remaining examples:
μ̂_(i) = (1/(N − 1)) Σ_{j≠i} z_j = (N μ̂ − z_i)/(N − 1)
Observe from the above that the following relation holds:
z_i = N μ̂ − (N − 1) μ̂_(i)  (6.3.2)
that is, we can calculate the i-th example z_i, i = 1, ..., N, if we know both μ̂ and μ̂_(i). Suppose now we wish to estimate some parameter θ by using as estimator some complex statistic of the N data points:
θ̂ = g(D_N) = g(z_1, z_2, ..., z_N)
The jackknife procedure consists in first computing
θ̂_(i) = g(z_1, z_2, ..., z_{i−1}, z_{i+1}, ..., z_N),  i = 1, ..., N
which is called the i-th jackknife replication of θ̂. Then, by analogy with the relation (6.3.2) holding for the mean estimator, we define the i-th pseudo-value by
η_(i) = N θ̂ − (N − 1) θ̂_(i).  (6.3.3)
These pseudo-values assume the same role as the z_i in calculating the sample average (5.3.4). Hence the jackknife estimate of θ is given by
θ̂_jk = (1/N) Σ_{i=1}^N η_(i) = (1/N) Σ_{i=1}^N [N θ̂ − (N − 1) θ̂_(i)] = N θ̂ − (N − 1) θ̂_(·)  (6.3.4)
where
θ̂_(·) = Σ_{i=1}^N θ̂_(i) / N.
The rationale of the jackknife technique is to use the quantity (6.3.4) in order to estimate the bias of the estimator. Since, according to (5.5.8), θ = E[θ̂] − Bias[θ̂], the jackknife approach consists in replacing θ by θ̂_jk and E[θ̂] by θ̂, thus obtaining
θ̂_jk = θ̂ − Bias_jk[θ̂].
It follows that the jackknife estimate of the bias of θ̂ is
Bias_jk[θ̂] = θ̂ − θ̂_jk = θ̂ − N θ̂ + (N − 1) θ̂_(·) = (N − 1)(θ̂_(·) − θ̂).
Note that in the particular case of a mean estimator (i.e. θ̂ = μ̂), we obtain, as expected, Bias_jk[μ̂] = 0.
A jackknife estimate of the variance of θ̂ can be obtained from the sample variance of the pseudo-values. We define the jackknife estimate of the variance of θ̂ as
Var_jk[θ̂] = Var[θ̂_jk]  (6.3.5)
Under the hypothesis of i.i.d. η_(i),
Var[θ̂_jk] = Var[Σ_{i=1}^N η_(i) / N] = Var[η_(i)] / N
From (6.3.3) we have
Σ_{i=1}^N η_(i) / N = N θ̂ − ((N − 1)/N) Σ_{i=1}^N θ̂_(i)
Since
η_(i) = N θ̂ − (N − 1) θ̂_(i)  ⇔  η_(i) − Σ_{i=1}^N η_(i)/N = −(N − 1) (θ̂_(i) − Σ_{i=1}^N θ̂_(i)/N)
from (6.3.5) and (6.3.4) we obtain
Var_jk[θ̂] = Σ_{i=1}^N (η_(i) − θ̂_jk)² / (N(N − 1)) = ((N − 1)/N) Σ_{i=1}^N (θ̂_(i) − θ̂_(·))²
Note that in the case of the estimator of the mean (i.e. θ̂ = μ̂), since η_(i) = z_i and θ̂_jk = μ̂, we find again the result (5.5.10):
Var_jk[θ̂] = Σ_{i=1}^N (z_i − μ̂)² / (N(N − 1)) = σ̂²/N = Var[μ̂]  (6.3.6)
The major motivation for jackknife estimates is that they reduce bias. Also, it can be shown that under suitable conditions on the type of estimator θ̂, the quantity (6.3.6) converges in probability to Var[θ̂]. However, the jackknife can fail if the statistic θ̂ is not smooth (a statistic is smooth if small changes in the data cause only small changes in its value). An example of a non-smooth statistic for which the jackknife works badly is the median.
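As an illustration, here is a minimal sketch (not taken from the book's scripts) implementing the formulas above for an arbitrary statistic, here the plug-in variance of a Gaussian sample:
set.seed(0)
N <- 30
DN <- rnorm(N)
g <- function(d) mean((d - mean(d))^2)           ## plug-in (biased) variance
theta.hat <- g(DN)
theta.loo <- sapply(1:N, function(i) g(DN[-i]))  ## jackknife replications
theta.dot <- mean(theta.loo)
bias.jk <- (N - 1)*(theta.dot - theta.hat)       ## jackknife bias estimate
var.jk <- ((N - 1)/N)*sum((theta.loo - theta.dot)^2)  ## jackknife variance
cat("estimate=", theta.hat, "bias=", bias.jk, "variance=", var.jk, "\n")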
6.4 Bootstrap
The bootstrap method was proposed by Efron [62] as a computer-intensive technique to estimate the accuracy of a generic estimator θ̂. Bootstrap relies on a data-based simulation method for statistical inference. The term bootstrap derives from the phrase to pull oneself up by one's bootstraps, based on the fictional Adventures of Baron Munchausen: the Baron had fallen to the bottom of a deep lake and, just when it looked like all was lost, he thought to pick himself up by his own bootstraps. In general terms, to pull yourself up by your bootstraps means to succeed in something very difficult without any outside help¹.
The idea of statistical bootstrap is very simple, namely that, in the absence of any other information, the sample itself offers the best guide to the sampling distribution. The method is completely automatic, requires no theoretical calculation, and is available no matter how mathematically complicated the estimator (5.4.6) is. By resampling with replacement from D_N we can build a set of B datasets D_(b), b = 1, ..., B. From the empirical distribution of the statistics g(D_(b)) we can construct confidence intervals and tests for significance.
6.4.1 Bootstrap sampling
Consider a dataset D_N. A bootstrap dataset D_(b), b = 1, ..., B, is created by randomly selecting N points from the original set D_N with replacement (Figure 6.1). Since D_N itself contains N points, there is nearly always duplication of individual points in a bootstrap dataset. Each point has an equal probability 1/N of being chosen on each draw. Hence, the probability that a point is chosen exactly k times is given by the binomial distribution (Section C.1.2):
Prob{k} = [N!/(k!(N − k)!)] (1/N)^k ((N − 1)/N)^(N−k),  0 ≤ k ≤ N
Given a set of N distinct values, there is a total of C(2N − 1, N) distinct bootstrap datasets. This number is quite large already for N > 10. For example, if N = 3 and D_N = {a, b, c}, we have 10 different bootstrap sets: {a,b,c}, {a,a,b}, {a,a,c}, {b,b,a}, {b,b,c}, {c,c,a}, {c,c,b}, {a,a,a}, {b,b,b}, {c,c,c}.
Under balanced bootstrap sampling, the B bootstrap sets are generated in such a way that each original data point is present exactly B times in the entire collection of bootstrap samples.
6.4.2 Bootstrap estimate of the variance
Given the estimator (5.4.6), for each bootstrap dataset D_(b), b = 1, ..., B, we can define a bootstrap replication
θ̂_(b) = g(D_(b)),  b = 1, ..., B
that is, the value of the statistic for the specific bootstrap sample. The bootstrap approach computes the variance of the estimator θ̂ through the variance of the set θ̂_(b), b = 1, ..., B, given by
Var_bs[θ̂] = Σ_{b=1}^B (θ̂_(b) − θ̂_(·))² / (B − 1),  where θ̂_(·) = Σ_{b=1}^B θ̂_(b) / B  (6.4.7)
¹This term does not have the same meaning (though the derivation is similar) as in computer operating systems, where bootstrap stands for starting a computer from a hardwired set of core instructions.
Figure 6.1: Bootstrap replications of a dataset and bootstrap statistic computation
It can be shown that if θ̂ = μ̂, then for B → ∞ the bootstrap estimate Var_bs[θ̂] converges to the variance Var[μ̂].
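A minimal sketch of (6.4.7), assuming Gaussian data so that the result can be compared with the known variance σ²/N of the sample average:
set.seed(0)
N <- 50; B <- 200
DN <- rnorm(N)
theta.b <- sapply(1:B, function(b) mean(sample(DN, N, replace = TRUE)))
var.bs <- sum((theta.b - mean(theta.b))^2)/(B - 1)   ## Equation (6.4.7)
cat("bootstrap variance=", var.bs, "theoretical sigma^2/N=", 1/N, "\n")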
6.4.3 Bootstrap estimate of bias
Let θ̂ be a plug-in estimator (Equation (5.3.3)) based on the sample D_N and
θ̂_(·) = Σ_{b=1}^B θ̂_(b) / B  (6.4.8)
Since Bias[θ̂] = E[θ̂] − θ, the bootstrap estimate of the bias of the plug-in estimator θ̂ is obtained by replacing E[θ̂] with θ̂_(·) and θ with θ̂:
Bias_bs[θ̂] = θ̂_(·) − θ̂  (6.4.9)
Then, since
θ = E[θ̂] − Bias[θ̂]
the bootstrap bias-corrected estimate is
θ̂_bs = θ̂ − Bias_bs[θ̂] = θ̂ − (θ̂_(·) − θ̂) = 2θ̂ − θ̂_(·)  (6.4.10)
Note that if we want to estimate the bias of a generic non-plug-in estimator g(D_N), the θ̂ term on the right-hand side of (6.4.9) should anyway refer to the plug-in estimator t(F̂) (Equation (5.3.3)).
R script
Run the R file patch.R for the estimation of bias and variance in the case of the
patch data example.
•
6.4.4 Bootstrap confidence interval
Standard bootstrap confidence limits are based on the assumption that the estimator θ̂ is normally distributed with mean θ and variance σ². Taking the bootstrap estimate of the variance, an approximate 100(1 − α)% confidence interval is given by
θ̂ ± z_{α/2} √(Var_bs[θ̂]) = θ̂ ± z_{α/2} √(Σ_{b=1}^B (θ̂_(b) − θ̂_(·))² / (B − 1))  (6.4.11)
An improved interval is given by using the bootstrap correction for bias:
2θ̂ − θ̂_(·) ± z_{α/2} √(Σ_{b=1}^B (θ̂_(b) − θ̂_(·))² / (B − 1))  (6.4.12)
Another bootstrap approach for constructing a 100(1 − α)% confidence interval is to use the upper and lower α/2 values of the bootstrap distribution. This approach is referred to as the bootstrap percentile confidence interval. If θ̂_{L,α/2} denotes the value such that only a fraction α/2 of all bootstrap estimates is inferior to it, and likewise θ̂_{H,α/2} is the value exceeded by only α/2 of all bootstrap estimates, then the confidence interval is given by
[θ̂_{L,α/2}, θ̂_{H,α/2}]  (6.4.13)
where the two extremes are also called Efron's percentile confidence limits.
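A minimal sketch of the percentile interval (6.4.13), here for the mean of skewed data (B is deliberately large since confidence intervals require many replications):
set.seed(0)
N <- 50; B <- 5000; alpha <- 0.05
DN <- rexp(N)                 ## skewed data with expectation 1
theta.b <- sapply(1:B, function(b) mean(sample(DN, N, replace = TRUE)))
quantile(theta.b, c(alpha/2, 1 - alpha/2))   ## Efron's percentile limits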
6.4.5 The bootstrap principle
Given an unknown parameter θ of a distribution F_z and an estimator θ̂, the goal of any estimation procedure is to derive or approximate the distribution of θ̂ − θ. For example, the calculation of the variance of θ̂ requires the knowledge of F_z and the computation of E_{D_N}[(θ̂ − E[θ̂])²]. Now, in practical contexts, F_z is unknown, and the calculus of E_{D_N}[(θ̂ − E[θ̂])²] is not possible in an analytical way. The rationale of the bootstrap approach is (i) to replace F_z by its empirical counterpart (5.2.2) and (ii) to compute E_{D_N}[(θ̂ − E[θ̂])²] by a Monte Carlo simulation approach (Section 3.9) where several samples of size N are generated by resampling D_N.
The outcome of a bootstrap technique is a Monte Carlo approximation of the distribution of θ̂_(b) − θ̂. In other terms, the variability of θ̂_(b) (based on the empirical distribution) around θ̂ is expected to be similar to (or to mimic) the variability of θ̂ (based on the true distribution) around θ.
The bootstrap principle relies on the two following properties: (i) as N gets larger and larger, the empirical distribution F̂_z(·) converges (almost surely) to F_z(·) (Glivenko-Cantelli theorem (C.9.17)) and (ii) as B gets larger, the quantity (6.4.7) converges (in probability) to the variance of the estimator θ̂ based on the empirical distribution (as stated in (C.8.14)). In other terms,
Var_bs[θ̂]  →(B → ∞)  Ê_{D_N}[(θ̂ − E[θ̂])²]  →(N → ∞)  E_{D_N}[(θ̂ − E[θ̂])²]  (6.4.14)
where Ê_{D_N}[(θ̂ − E[θ̂])²] stands for the plug-in estimate of the variance of θ̂ based on the empirical distribution.
In practice, for a small finite N, bootstrap estimation inevitably returns some error. This error is a combination of a statistical error and a simulation error. The statistical error component is due to the difference between the underlying distribution F_z(·) and the empirical distribution F̂_z(·). The magnitude of this error depends on the choice of the estimator θ̂(D_N) and decreases by increasing the number N of observations.
The simulation error component is due to the use of empirical (Monte Carlo) properties of θ̂(D_N) rather than exact properties. The simulation error decreases by increasing the number B of bootstrap replications.
Unlike the jackknife method, in the bootstrap the number of replications B can be adjusted to the computer resources. In practice, two rules of thumb are typically used:
1. Even a small number of bootstrap replications, e.g. B = 25, is usually informative. B = 50 is often enough to give a good estimate of Var[θ̂].
2. Very seldom are more than B = 200 replications needed for estimating Var[θ̂]. Much bigger values of B are required for bootstrap confidence intervals.
Note that the use of rough statistics θ̂ (e.g. unsmooth or unstable ones) can make the resampling approach behave wildly. Examples of non-smooth statistics are sample quantiles and the median.
In general terms, for i.i.d. observations, the following conditions are required for the convergence of the bootstrap estimate:
1. the convergence of F̂ to F (satisfied by the Glivenko-Cantelli theorem) for N → ∞;
2. an estimator such that the estimate θ̂ is the corresponding functional of the empirical distribution,
θ = t(F) → θ̂ = t(F̂).
This is satisfied for sample means, standard deviations, variances, medians and other sample quantiles;
3. a smoothness condition on the functional. This is not true for extreme order statistics such as the minimum and the maximum values.
But what happens when the dataset D_N is not i.i.d. sampled from a distribution F? In such non-conventional configurations, the most basic version of the bootstrap might fail. Examples are incomplete data (survival data, missing data), dependent data (e.g. the variance of a correlated time series) and dirty data (outliers). In these cases, specific adaptations of the bootstrap procedure are required. For reasons of space, we will not discuss them here. However, for a more exhaustive discussion of the limits of the bootstrap, we invite the reader to refer to [123].
6.5 Randomisation tests
Randomisation tests were introduced by R.A. Fisher in 1935. The goal of a randomisation test is to help discover some regularity (e.g. a non-random property or pattern) in a complicated dataset. A classic example is to take a pack of poker playing cards and check whether they were well shuffled by our poker opponent. According to the hypothesis-testing terminology, randomisation tests make the null hypothesis of randomness and test this hypothesis against the data. In order to test the randomness hypothesis, several random transformations of the data are generated.
Suppose we are interested in some property which is related to the order of the data. Let D_N = {x_1, ..., x_N} be the original dataset and t(D_N) some statistic which is a function of the order of the data in D_N. We want to test whether the value of t(D_N) is due only to randomness.
• An empirical distribution is generated by scrambling (or shuffling) the N elements at random R times. For example, the j-th scrambled dataset, j = 1, ..., R, could be D_N^(j) = {x_23, x_4, x_343, ...}.
• For each scrambled set j we compute a statistic t^(j). The resulting distribution is called the resampling distribution.
• Suppose that the value of t(D_N) is exceeded by only k of the R values of the resampling distribution.
• The probability of observing t(D_N) under the null hypothesis (i.e. randomness) is then only p_t = k/R. The null hypothesis can be accepted/rejected on the basis of p_t.
The quantity p_t plays the role of a nonparametric p-value (Section 5.11.3) and it can be used, like its parametric counterpart, both to assess the evidence against the null hypothesis and to perform a decision test (e.g. refuse to play if we think the cards were not sufficiently shuffled).
A bioinformatics example
Suppose we have a DNA sequence and we think that the number of repeated sequences (e.g. AGTAGTAGT) in the sample is greater than expected by chance. Let t = 17 be the observed number of repetitions. How to test this hypothesis? Let us formulate the null hypothesis that the base order is random. We can construct an empirical distribution under the null hypothesis by taking the original sample and randomly scrambling the bases R = 1000 times. This creates samples with the same base frequencies as the original one but where the order of the bases is assigned at random. Suppose that only 5 of the 1000 randomised samples have a number of repetitions higher than or equal to 17. The p-value (i.e. the probability of seeing t = 17 under the null hypothesis) returned by the randomisation test then amounts to 0.005. You can run the randomisation test by using the R script file randomiz.R.
•
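A minimal sketch of this kind of test (the sequence here is generated at random, so a large p-value is to be expected; the pattern AGT and the sequence length are arbitrary choices, and randomiz.R remains the reference script):
set.seed(0)
dna <- sample(c("A","C","G","T"), 300, replace = TRUE)
count.rep <- function(s, pat = c("A","G","T")) {   ## occurrences of AGT
  n <- length(pat)
  sum(sapply(1:(length(s) - n + 1),
             function(i) all(s[i:(i + n - 1)] == pat)))
}
t.obs <- count.rep(dna)
R <- 1000
t.null <- sapply(1:R, function(r) count.rep(sample(dna)))  ## scrambled sets
p.t <- sum(t.null >= t.obs)/R      ## nonparametric p-value
cat("t=", t.obs, "p-value=", p.t, "\n")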
6.5.1 Randomisation and bootstrap
Both bootstrap and randomisation rely on resampling, but what are their peculiarities? A randomised sample is generated by scrambling the existing data (sampling without replacement), while a bootstrap sample is generated by sampling with replacement from the original sample. Also, randomisation tests are appropriate when the order or the association between parts of the data is assumed to convey important information; they test the null hypothesis that the order or the association is random. On the other side, bootstrap sampling aims to characterise the statistical distribution of some statistic t(D_N) where the order makes no difference in the statistic (e.g. the mean). Randomisation would be useless in that case, since t(D_N^(1)) = t(D_N^(2)) if D_N^(1) and D_N^(2) are obtained by resampling D_N without replacement.
6.6 Permutation test
The permutation test is used to perform a nonparametric two-sample test. Consider a random sample {z_1, ..., z_M} drawn from an unknown distribution z ∼ F_z(·) and a random sample {y_1, ..., y_N} from an unknown distribution y ∼ F_y(·). For example, in a bioinformatics task the two datasets could be the expression measures of a gene under M normal and N pathological conditions. Let the null hypothesis be that the two distributions are the same, regardless of the analytical forms of the distributions.
Consider an (order-independent) test statistic for the observed data and call it t(D_N, D_M). The rationale of the permutation test is to locate the statistic t(D_N, D_M) with respect to the distribution which could be obtained if the null hypothesis were true. In order to build the null hypothesis distribution, all the possible R = C(M + N, M) partitionings of the N + M observations into two subsets of size N and M are considered. If the null hypothesis were true, all the partitionings would be equally likely. Then, for each i-th partitioning (i = 1, ..., R), the permutation test computes the statistic t^(i). Eventually, the value t(D_N, D_M) is compared with the set of values t^(i). If the value t(D_N, D_M) falls in the α/2 tails of the t^(i) distribution, the null hypothesis is rejected with type I error α.
The permutation procedure will involve substantial computation unless M and N are small. When the number of permutations is too large, a random sample of a large number R of permutations can be taken.
Note that, when observations are drawn according to a normal distribution, it can be shown that the permutation test gives results close to those obtained using the t-test.
Example
Let us consider D_5 = [74, 86, 98, 102, 89] (five observations) and D_3 = [10, 25, 80] (three observations). We run a permutation test (R = C(8, 3) = 56 partitionings) to test the hypothesis that the two sets belong to the same distribution (R script s_perm.R).
Let t(D_5, D_3) = μ̂(D_5) − μ̂(D_3) = 51.46. Figure 6.2 shows the position of the observed statistic with respect to the null sampling distribution.
•
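A minimal sketch of the test on the data above (s_perm.R remains the book's reference script); all the partitionings are enumerated with combn:
D1 <- c(74, 86, 98, 102, 89)
D2 <- c(10, 25, 80)
z <- c(D1, D2)
t.obs <- mean(D1) - mean(D2)          ## observed statistic
idx <- combn(length(z), length(D2))   ## all 56 partitionings
t.null <- apply(idx, 2, function(i) mean(z[-i]) - mean(z[i]))
mean(abs(t.null) >= abs(t.obs))       ## two-sided p-value
hist(t.null); abline(v = t.obs, col = "red")   ## cf. Figure 6.2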
Figure 6.2: Null distribution returned by the permutation test and position (vertical
red line) of the observed statistic
6.7 Considerations on nonparametric tests
Nonparametric tests are a worthy alternative to parametric approaches when no assumptions about the probability distribution may be made (e.g. in bioinformatics). It is risky, however, to consider them a panacea, and a critical attitude towards them is to be preferred. In short, here are some of the major advantages and disadvantages of the use of a nonparametric approach.
Advantages:
• If the sample size is very small, there may be no alternative to using a nonparametric test, unless the nature of the population distribution is known exactly.
• Nonparametric tests make fewer assumptions about the data.
• Nonparametric tests are available to analyse data which are inherently in ranks (e.g. taste of food), classificatory or categorical.
• Nonparametric tests are typically more intuitive and easier to implement.
Disadvantages:
• They involve high computational costs.
• The wide availability of statistical software makes the misuse of statistical measures possible.
• A nonparametric test is less powerful than a parametric one when the assumptions of the parametric test are met.
• Assumptions are associated with most nonparametric statistical tests too, namely that the observations are independent.
6.8 Exercises
1. Suppose you want to estimate the skewness γ of the uniform r.v. z ∼ U[−2, 3] by using a dataset of size N = 10. By using R and its random generator, first plot the sampling distribution, then estimate the bias and the variance of the following estimators:
1. γ̂ = Σ_i (z_i − μ̂)³ / (N σ̂³)
2. γ̂ = Σ_i |z_i − μ̂|³ / (N σ̂³)
3. γ̂ = 1
Before each random generation set the seed to zero. Hint: the skewness of a continuous uniform variable is equal to 0.
2. Suppose you want to estimate the skewness γ of the uniform r.v. z ∼ U[−2, 3] by using a dataset of size N = 10. By using R and its random generator, first generate a dataset D_N with N = 10. By using the jackknife, plot the sampling distribution, then estimate the bias and the variance of the following estimators:
1. γ̂ = Σ_i (z_i − μ̂)³ / (N σ̂³)
2. γ̂ = Σ_i |z_i − μ̂|³ / (N σ̂³)
3. γ̂ = 1
Compare the results with those of the exercise before. Before each random generation set the seed to zero.
3. Suppose you want to estimate the skewness γ of the uniform r.v. z ∼ U[−2, 3] by using a dataset of size N = 10. By using R and its random generator, first generate a dataset D_N with N = 10. By using the bootstrap method, plot the sampling distribution, then estimate the bias and the variance of the following estimators:
1. γ̂ = Σ_i (z_i − μ̂)³ / (N σ̂³)
2. γ̂ = Σ_i |z_i − μ̂|³ / (N σ̂³)
3. γ̂ = 1
Compare the results with those of the two exercises before. Before each random generation set the seed to zero.
4. Let us consider a r.v. z such that E[z] = μ and Var[z] = σ². Suppose we want to estimate from an i.i.d. dataset D_N the parameter θ = μ² = (E[z])². Let us consider three estimators:
θ̂_1 = (Σ_{i=1}^N z_i / N)²
θ̂_2 = Σ_{i=1}^N z_i² / N
θ̂_3 = (Σ_{i=1}^N z_i)² / N
• Are they unbiased?
• Compute analytically the bias of the three estimators. Hint: use (3.3.30).
• By using R, verify the result above by Monte Carlo simulation using different values of N.
• By using R, estimate the bias of the three estimators by bootstrap (see the sketch below).
Solution: See the file Exercise1.pdf in the directory gbcode/exercises of the companion R package (Appendix F).
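A hedged illustration of the last point: a minimal sketch of the bootstrap bias estimate (6.4.9) for θ̂_1, under the arbitrary assumption z ∼ N(0, 1), so that θ = μ² = 0 and the analytical bias is σ²/N:
set.seed(0)
N <- 50; B <- 200
DN <- rnorm(N)
theta.hat <- mean(DN)^2
theta.b <- sapply(1:B, function(b) mean(sample(DN, N, replace = TRUE))^2)
cat("bootstrap bias=", mean(theta.b) - theta.hat,
    "analytical bias sigma^2/N=", 1/N, "\n")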
Chapter 7
A statistical framework of
supervised learning
7.1 Introduction
A supervised learning problem can be described in statistical terms by the following elements:
1. A vector of n random input variables x ∈ X ⊂ R^n, whose values are independently and identically distributed according to an unknown probability distribution F_x(·).
2. A target operator which transforms the input values into outputs y ∈ Y according to an unknown conditional probability distribution F_y(y|x = x).
3. A collection D_N of N input/output data points ⟨x_i, y_i⟩, i = 1, ..., N, called the training set and drawn according to the joint input/output distribution F_{x,y}(x, y).
4. A learning machine or learning algorithm which, on the basis of the training set D_N, returns an estimation (or prediction) of the target for an input x. The input/output function estimated by the learning machine is called hypothesis or model.
Note that in this definition we encounter most of the notions presented in the previous chapters: probability distribution, conditional distribution, estimation.
Examples
Several practical problems can be seen as instances of a supervised learning problem:
• Predict whether a patient, hospitalised due to a heart attack, will have a second heart attack, on the basis of demographic, diet and clinical measurements.
• Predict the price of a stock in 6 months from now, on the basis of company performance measures and economic data.
• Identify the risk factors for breast cancer, based on clinical, demographic and genetic variables.
• Classify the category of a text email (spam or not) on the basis of its text content.
• Characterise the mechanical properties of a steel plate on the basis of its physical and chemical composition.
Figure 7.1: The supervised learning setting. The target operator returns an output
for each input according to a fixed but unknown probabilistic law. The hypothesis
predicts the value of the target output when entered with the same input.
In the case of the spam categorisation problem, the input vector may be a vector of size n, where n is the number of the most used English words and the i-th component of x represents the frequency of the i-th word in the email text. The output y is a binary class which takes two values: {SPAM, NO.SPAM}. The training set is a set of emails previously labelled by the user as SPAM or NO.SPAM. The goal of the learning machine is to create a classification function which, once a vector x of word frequencies is presented, should be able to correctly classify the nature of the email.
•
A learning machine is nothing more than a particular instance of an estimator (5.4.7) whose goal is to estimate the parameters of the joint distribution F_{x,y}(x, y) (or sometimes of the conditional distribution F_y(y|x = x)) on the basis of a training set D_N, i.e. a set of i.i.d. realisations of the pair x and y. The goal of a learning machine is to return a hypothesis with low prediction error, i.e. a hypothesis which computes an accurate estimate of the output of the target when the same test value is input to the target and the predictor (Fig. 7.1). The prediction error is also usually called generalisation error, since it measures the capacity of the learned hypothesis to generalise to previously unseen test samples. A learning algorithm generalises well if it returns an accurate prediction for i.i.d. test data, i.e. input/output pairs which are independent of the training set yet are generated by the same joint distribution F_{x,y}(x, y). We insist on the importance of the two "i"s in the i.i.d. assumption: test data are supposed i) to be generated by the same distribution underlying the training set but ii) to be independent of the training set.
We will only consider hypotheses of the form h(·, α), where α ∈ Λ* is a vector of model parameters¹ or weights. Therefore, henceforth, we will denote a hypothesis h(·, α) by the corresponding vector α ∈ Λ*. As we will see later, examples of hypotheses are linear models h(x, α) = x^T α (Section 9.1), where α represents the coefficients of the model, or feed-forward neural networks (Section 10.1.1), where α is the set of values taken by the weights of the neural architecture.
¹It is important to remark that by model parameter we refer here to a tunable/trainable weight of the hypothesis function and not to the target of the estimation procedure as in Section 5.1.1.
Let α_N be the hypothesis returned by the learning machine on the basis of the training set, and let G_N denote its generalisation error. The goal of the learning machine is then to seek the hypothesis α_N which minimises the value G_N.
In these terms, the learning problem could appear as a simple problem of optimisation, consisting of searching for the hypothesis α which yields the lowest generalisation error. Unfortunately, reality is not that simple, since the learning machine cannot measure G_N directly but can only return an estimate of this quantity, denoted by Ĝ_N. Moreover, what makes the problem still more complex is that the same finite training set is employed both to select α_N and to estimate G_N, thus inducing a strong correlation between these two quantities.
The common supervised learning practice to minimise the quantity G_N consists in:
1. decomposing the set of hypotheses Λ* into a nested sequence of hypothesis classes (or model structures) Λ_1 ⊂ Λ_2 ⊂ ··· ⊂ Λ_S of increasing capacity (or expressiveness) s, with Λ* = ∪_{s=1}^S Λ_s;
2. implementing a search procedure at two nested levels [125] (Fig. 7.2). The inner level, also known as parametric identification, considers a single class of hypotheses Λ_s and uses a method or algorithm to select a hypothesis h(·, α_N^s) from this class. The algorithm typically implements a procedure of multivariate optimisation in the space of model parameters of the class Λ_s, which can be solved by (conventional) optimisation techniques. Examples of parametric identification procedures which will be presented in subsequent chapters are linear least-squares for linear models and back-propagated gradient descent for feed-forward neural networks [165]. The outer level, also called structural identification, ranges over nested classes of hypotheses Λ_s, s = 1, ..., S, and executes for each of them the parametric routine returning the vector α_N^s. The outcome of the parametric identification is used to assess the class Λ_s through a validation procedure which returns the estimate Ĝ_N^s on the basis of the finite training set. It is common to use nonparametric techniques to assess the quality of a predictor, like the bootstrap (Section 6.4) or cross-validation [176] based on the jackknife strategy (Section 6.3);
3. selecting the best hypothesis in the set {α_N^s}, s = 1, ..., S, according to the assessments {Ĝ_N^s} produced by the validation step. This final step, which returns the model to be used for prediction, is usually referred to as the model selection procedure. Instances of model selection include the problem of choosing the degree of a polynomial model or the problem of determining the best number of hidden nodes in a neural network [25].
The outline of the chapter is as follows. Section 7.2 introduces the supervised learning problem in statistical terms. We will show that classification (Section 7.3) and regression (Section 7.4) can easily be cast in this framework. Section 7.5 introduces the statistical assessment of a learning machine, while Section 7.6 reports some results from the work of Prof. Vapnik on statistical learning and, in particular, the formalisation of the notion of capacity of a learning machine. Section 7.7 discusses the notion of generalisation error and its bias/variance decomposition. Section 7.9 introduces the supervised learning procedure and its decomposition into structural and parametric identification. Model validation and, in particular, cross-validation, a technique for estimating the generalisation error on the basis of a finite number of data, are introduced in Section 7.10.
Figure 7.2: The learning problem and its decomposition into parametric and structural identification. The larger the class of hypotheses Λ_s, the larger its expressive power in terms of functional relationships.
7.2 Estimating dependencies
This section details the main actors of the supervised learning problem:
• A data generator of random input vectors x ∈ X ⊂ R^n, independently and identically distributed (i.i.d.) according to some unknown (but fixed) probability distribution F_x(x). The variable x is called the independent variable. It is helpful to distinguish between cases in which the experimenter has complete control over the values of x and those in which she does not. When the nature of the inputs is completely random, we consider x as a realisation of the random variable x having probability law F_x(·). When the experimenter's control is complete, we can regard F_x(·) as describing the relative frequencies with which different values of x are set.
• A target operator, which transforms the input x into the output value y ∈ Y according to some unknown (but fixed) conditional distribution
F_y(y|x = x)  (7.2.1)
(this includes the simplest case where the target implements some deterministic function y = f(x)). The conditional distribution (7.2.1) formalises the stochastic dependency between inputs and output.
• A training set D_N = {⟨x_1, y_1⟩, ⟨x_2, y_2⟩, ..., ⟨x_N, y_N⟩} made of N pairs (or training examples) ⟨x_i, y_i⟩ ∈ Z = X × Y, independent and identically distributed (i.i.d.) according to the joint distribution
F_z(z) = F_{x,y}(⟨x, y⟩)  (7.2.2)
Note that, as in Section 5.4, the observed training set D_N ∈ Z^N = (X × Y)^N is considered here as the realisation of a random variable D_N.
• A learning machine having the following components:
1. A class of hypothesis functions h(·, α) with α ∈ Λ. We consider only the case where the functions h(·, α) ∈ Y are single-valued mappings.
2. A loss function L(·, ·) associated with a particular y and a particular h(x), whose value L(y, h(x)) measures the discrepancy between the output y and the prediction h(x). For a given hypothesis h(·, α), the functional risk is the average of the loss over the X × Y domain:
R(α) = E_{x,y}[L] = ∫_{X,Y} L(y, h(x, α)) dF_{x,y}(x, y) = ∫_{X,Y} L(y, h(x, α)) p(x, y) dx dy  (7.2.3)
Note that L is random, since x and y are random test points (i.i.d. drawn from the same distribution (7.2.2) as the training set), while the hypothesis h(·, α) is given. This is the expected loss if we test the hypothesis h(·, α) over an infinite amount of i.i.d. input/output pairs generated by (7.2.2). For the class Λ of hypotheses we define
α_0 = arg min_{α∈Λ} R(α)  (7.2.4)
as the hypothesis in the class Λ which has the lowest functional risk. Here, we assume for simplicity that there exists a minimum value of R(α) achievable by a function in the class Λ. We define R(α_0) as the functional risk of the class Λ of hypotheses.
3. If, instead of a single class of hypotheses, we consider the set Λ* containing all possible single-valued mappings h : X → Y, we may define the quantity
α* = arg min_{α∈Λ*} R(α)  (7.2.5)
and
R* = R(α*)  (7.2.6)
as the absolute minimum rate of functional risk. Note that this quantity is ideal, since it requires the complete knowledge of the distribution underlying the data. In a classification setting, the optimal model with parameters α* is called the Bayes classifier and R(α*) the Bayes error (Section 7.3.1). In a regression setting (Section 7.4) where y = f(x) + w and the loss function is quadratic, h(·, α*) = f(·) and R(α*) amounts to the variance of w.
4. An algorithm L of parametric identification which takes as input the training set D_N and returns as output one hypothesis function h(·, α_N) with α_N ∈ Λ. Here, we will consider only the case of deterministic and symmetric algorithms. This means, respectively, that they always return the same h(·, α_N) for the same dataset D_N and that they are insensitive to the ordering of the examples in D_N.
The parametric identification of the hypothesis is done according to the ERM (Empirical Risk Minimisation) inductive principle [186], where
α_N = α(D_N) = arg min_{α∈Λ} R_emp(α)  (7.2.7)
minimises the empirical risk (also known as training error or apparent error)
R_emp(α) = (1/N) Σ_{i=1}^N L(y_i, h(x_i, α))  (7.2.8)
constructed on the basis of the dataset D_N.
This formulation of a supervised learning problem is quite general, given that it
includes two basic statistical problems:
1. the problem of classification (also known as pattern recognition),
2. the problem of regression estimation.
These two problems and their link with supervised learning will be discussed in the
following sections.
7.3 Dependency and classification
Classification is one of the most common problems in statistics. It consists in exploring the association between a categorical dependent variable and independent variables which can take either continuous or discrete values. The problem of classification is formulated as follows: consider an input/output stochastic dependency which can be described by a joint distribution F_{x,y}(·), such that once an input vector x is given, y ∈ Y = {c_1, ..., c_K} takes a value among K different classes. In the example of spam email classification, K = 2 and c_1 = SPAM, c_2 = NO.SPAM. We assume that the dependency is described by a conditional discrete probability distribution Prob{y = c_k|x = x} that satisfies
Σ_{k=1}^K Prob{y = c_k|x} = 1
This means that observations are noisy and follow a probability distribution. In other terms, given an input x, y does not always take the same value. Pretending to have a zero-error classification in this setting is then completely unrealistic.
Example
Consider a stochastic dependency where x represents a month of the year and y is a categorical variable representing the weather situation in Brussels. Suppose that y may take only the two values {RAIN, NO.RAIN}. The setting is stochastic since you might have rainy August days and some rare sunny December days. Suppose that the conditional probability distribution of y is represented in Figure 7.3. This figure plots Prob{y = RAIN|x = month} and Prob{y = NO.RAIN|x = month} for each month. Note that for each month the probability constraint is respected:
Prob{y = RAIN|x = month} + Prob{y = NO.RAIN|x = month} = 1
•
A classifier is a particular instance of an estimator which, for a given x, is expected to return an estimate ŷ = ĉ = h(x, α) taking a value in {c_1, ..., c_K}. Once a cost function is defined, the problem of classification can be expressed in terms of the formalism introduced in the previous section. An example of cost function is the indicator function (taking only two values, zero and one)
L(c, ĉ) = 0 if c = ĉ,  1 if c ≠ ĉ  (7.3.9)
also called the 0/1 loss. However, we can imagine situations where some misclassifications are worse than others. In this case, it is better to introduce a loss matrix L_(K×K), where the element L(jk) = L(c_j, c_k) denotes the cost of the misclassification
Figure 7.3: Conditional distribution Prob{y|x}, where x is the current month and y is the random weather state. For example, the column corresponding to x = Dec and y = RAIN returns the conditional probability of RAIN in December.
when the predicted class is ĉ(x) = c_j and the correct class is c_k. This matrix must be null on the diagonal and non-negative everywhere else. In practical cases, the definition of a loss matrix could be quite challenging, since it should take into account and combine several criteria, some easy to quantify (e.g. financial costs) and some much less so (e.g. ethical considerations)². Note that in the case of the 0/1 loss function (Equation 7.3.9), all the elements outside the diagonal are equal to one.
The goal of the classification procedure for a given x is to find the predictor ĉ(x) = h(x, α) that minimises the quantity
Σ_{k=1}^K L(ĉ(x), c_k) Prob{y = c_k|x}  (7.3.10)
which is an average of the ĉ(x) row of the loss matrix weighted by the conditional probabilities of observing y = c_k. Note that the average of the above quantity over the X domain,
∫_X Σ_{k=1}^K L(ĉ(x), c_k) Prob{y = c_k|x} dF_x = ∫_{X,Y} L(y, h(x, α)) dF_{x,y} = R(α)  (7.3.11)
corresponds to the functional risk (7.2.3).
The problem of classification can then be seen as a particular instance of the more general supervised learning problem described in Section 7.2.
²By default, any automatic classifier (and the associated decision maker) implicitly or explicitly embeds a loss function weighting often highly heterogeneous criteria. For instance, the Tesla automatic braking system (implicitly or explicitly) assigns a cost to false positives (e.g. a bag wrongly identified as a pedestrian) and to false negatives (e.g. a pedestrian mistaken for a bag).
7.3.1 The Bayes classifier
It can be shown that the optimal classifier h(·, α_0), where α_0 is defined as in (7.2.4), is the one that returns for all x
c*(x) = h(x, α_0) = arg min_{c_j ∈ {c_1,...,c_K}} Σ_{k=1}^K L(j,k) Prob{y = c_k|x}  (7.3.12)
The optimal classifier is also known as the Bayes classifier. In the case of a 0/1 loss function, the optimal classifier returns
c*(x) = arg min_{c_j ∈ {c_1,...,c_K}} Σ_{k=1,...,K, k≠j} Prob{y = c_k|x}  (7.3.13)
= arg min_{c_j ∈ {c_1,...,c_K}} (1 − Prob{y = c_j|x})  (7.3.14)
= arg min_{c_j ∈ {c_1,...,c_K}} Prob{y ≠ c_j|x} = arg max_{c_j ∈ {c_1,...,c_K}} Prob{y = c_j|x}  (7.3.15)
The Bayes decision rule selects the j, j = 1, ..., K, that maximises the posterior probability Prob{y = c_j|x}.
Example
Consider a classification task where X = {1, 2, 3, 4, 5}, Y = {c_1, c_2, c_3} and the loss matrix and the conditional probability values are given in the following figures. Let us focus on the optimal classification for x = 2. According to (7.3.12), the Bayes classification rule for x = 2 returns
c*(2) = arg min_{j=1,2,3} {L_11 Prob{y = c_1|x = 2} + L_12 Prob{y = c_2|x = 2} + L_13 Prob{y = c_3|x = 2},
L_21 Prob{y = c_1|x = 2} + L_22 Prob{y = c_2|x = 2} + L_23 Prob{y = c_3|x = 2},
L_31 Prob{y = c_1|x = 2} + L_32 Prob{y = c_2|x = 2} + L_33 Prob{y = c_3|x = 2}}
= arg min_{j=1,2,3} {0 · 0.2 + 1 · 0.8 + 5 · 0.0, 20 · 0.2 + 0 · 0.8 + 10 · 0.0, 2 · 0.2 + 1 · 0.8 + 0 · 0.0}
= arg min_{j=1,2,3} {0.8, 4, 1.2} = 1
What would have been the Bayes classification in the 0/1 case?
•
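A minimal sketch of rule (7.3.12) for x = 2, using the loss matrix and the conditional probabilities that appear in the computation above:
L <- matrix(c( 0,  1,  5,
              20,  0, 10,
               2,  1,  0), nrow = 3, byrow = TRUE)  ## L[j,k] = L(c_j, c_k)
p <- c(0.2, 0.8, 0.0)     ## Prob{y = c_k | x = 2}
risk <- L %*% p           ## conditional risk of predicting each class c_j
which.min(risk)           ## Bayes class: c_1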
Figure 7.4: Class-conditional distributions: the green class is distributed as a mixture of two gaussians, while the red class is distributed as a gaussian.
7.3.2 Inverse conditional distribution
An important quantity, often used in classification algorithms, is the inverse conditional distribution. According to the Bayes theorem (3.1.20) we have that
Prob{y = c_k|x = x} = Prob{x = x|y = c_k} Prob{y = c_k} / Σ_{k=1}^K Prob{x = x|y = c_k} Prob{y = c_k}  (7.3.16)
and that
Prob{x = x|y = c_k} = Prob{y = c_k|x = x} Prob{x = x} / Σ_x Prob{y = c_k|x = x} Prob{x = x}.  (7.3.17)
The above relations mean that, by knowing the a-posteriori conditional distribution Prob{y = c_k|x = x} and the a-priori distribution Prob{x = x}, we can derive the inverse conditional distribution Prob{x = x|y = c_k}. This distribution is replaced by a density if x is continuous and is also known as the class-conditional density. It characterises the values of the inputs x for a given class c_k.
Shiny dashboard
The Shiny dashboard classif2.R illustrates a binary classification task where x ∈ R² and the two classes are green and red. The class-conditional distributions (7.3.17) of the green and red classes are a mixture of two gaussians (Section 3.7.2) and a unimodal gaussian, respectively (Figure 7.4). Figure 7.5 illustrates the associated conditional distribution (7.3.16) when the two classes have an equal a-priori probability (Prob{y = red} = Prob{y = green}). Figure 7.6 shows the scattering of a set of N = 500 points sampled according to the class-conditional distributions in Figure 7.4.
•
Figure 7.5: The a-posteriori conditional distribution associated with the class-conditional distributions (equal a-priori probability) in Figure 7.4.
Figure 7.6: Dataset sampled according to the class-conditional distributions (equal
a-priori probability) in Figure 7.4.
Figure 7.7: Inverse conditional distribution of the distribution in Figure 7.3
Example
Suppose we want to know during which months it is most probable to have rain. This boils down to deriving the distribution of x for y = RAIN. Figure 7.7 plots the inverse conditional distributions Prob{x = month|y = RAIN} and Prob{x = month|y = NO.RAIN} according to (7.3.17), when we assume that the a-priori distribution is uniform (i.e. Prob{x = x} = 1/12 for all x). Note that
Σ_month Prob{x = month|y = NO.RAIN} = Σ_month Prob{x = month|y = RAIN} = 1
•
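A minimal sketch of the inversion (7.3.17), with a made-up posterior vector Prob{y = RAIN|x = month} (the actual values of Figure 7.3 are not reproduced here) and a uniform prior over the months:
post <- c(0.55, 0.50, 0.45, 0.40, 0.35, 0.30,   ## hypothetical monthly values
          0.25, 0.30, 0.40, 0.50, 0.55, 0.60)
prior <- rep(1/12, 12)
inv <- post*prior/sum(post*prior)   ## Prob{x = month | y = RAIN}
sum(inv)                            ## sums to one, as noted above
barplot(inv, names.arg = month.abb)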
7.4 Dependency and regression
Consider the stochastic relationship between two continuous random variables x ∈ R^n and y ∈ R described by
F_{x,y}(x, y)  (7.4.18)
This means that to each vector x sampled according to F_x(x) there corresponds a scalar y sampled from F_y(y|x = x). Assume that a set of N input/output observations is available. The estimation of the stochastic dependency on the basis of the empirical dataset requires the estimation of the conditional distribution F_y(y|x). This is known to be a difficult problem, but for prediction purposes it is, most of the time, sufficient to estimate the conditional expectation
f(x) = E_y[y|x] = ∫_Y y dF_y(y|x)  (7.4.19)
also known as the regression function.
The regression function is also related to the functional risk
R(α) = ∫ L(y, h(x, α)) dF_{x,y}(x, y) = ∫ (y − h(x, α))² dF_{x,y}(x, y)  (7.4.20)
for the quadratic loss L(y, h) = (y − h)². From (3.6.63) it can be shown that the minimum (7.2.4) is attained by the regression function h(·, α_0) = f(·) if the function f belongs to the set {h(x, α), α ∈ Λ}.
Once the regression function f is defined, the input/output stochastic dependency (7.4.18) is commonly represented in the regression-plus-noise form
y = f(x) + w = E_y[y|x] + w  (7.4.21)
where w denotes the noise term and satisfies E[w] = 0 and E[w²] = σ²_w. The role of the noise is to make explicit that some variability of the target cannot be explained by the regression function f. Notice that the assumption of an additive noise w independent of x is common in the statistical literature and is not overly restrictive. In fact, many other conceivable signal/noise models can be transformed into this form.
The problem of estimating the regression function (7.4.19) is then a particular instance of the supervised learning problem described in Section 7.2, where the learning machine is assessed by a quadratic cost function. Examples of learning algorithms for regression will be discussed in Section 9.1 and Section 10.1.
7.5 Assessment of a learning machine
A learning machine works well if it exhibits good generalisation, i.e. if it is able to make good predictions for unseen input values, which are not part of the training set but are generated by the same input/output distribution (7.2.2) underlying the training set. This ability is commonly assessed by the amount of bad predictions, measured by the generalisation error. The generalisation error of a learning machine can be evaluated at two levels:
Hypothesis: Let α_N be the hypothesis returned by a learning algorithm for a training set D_N according to the ERM principle (Eq. (7.2.7)). The functional risk R(α_N) in (7.2.3) represents the generalisation error of the hypothesis α_N. This quantity is also known as the conditional error rate [98], since it is conditional on a given training set D_N.
Algorithm: Let us define the average of the loss L for a given input x over the ensemble of training sets of size N as
g_N(x) = E_{D_N,y}[L|x = x] = ∫_{Z^N,Y} L(y, h(x, α_N)) dF_y(y|x) dF_z^N(D_N)  (7.5.22)
where F_z^N(D_N) is the distribution of the i.i.d. dataset D_N. In this expression, L is a function of the random variables D_N (through h) and y, while the test input x is fixed. In the case of a quadratic loss function, this quantity corresponds to the mean squared error (MSE) defined in Section 5.5.6. By averaging the quantity (7.5.22) over the X domain we have
G_N = ∫_X g_N(x) dF_x(x) = E_{D_N} E_{x,y}[L(y, h(x, α_N))]  (7.5.23)
that is, the generalisation error of the algorithm L (also known as the expected error rate [66] or expected test error [98]).
Figure 7.8: Functional risk vs. MISE
From (7.2.3) and (7.5.23) we obtain that
G_N = E_{D_N}[R(α_N)]
where R(α_N) is random because of its dependence on D_N (Figure 7.8).
In the case of a quadratic loss function, the quantity
MISE = E_{D_N} E_{x,y}[(y − h(x, α_N))²]  (7.5.24)
takes the name of mean integrated squared error (MISE).
The two criteria correspond to two different ways of assessing the learning machine:
the first is a measure to assess the specific hypothesis (7.2.7) chosen by ERM, the
second assesses the average performance of the algorithm over training sets with N
observations. According to the hypothesis-based approach the goal of learning is to
find, on the basis of observations, the hypothesis that minimises the functional risk.
According to the algorithmic-based approach the goal is to find, on the basis of
observations, the algorithm which minimises the generalisation error. The two cri-
teria will be detailed in Sections 7.6 and 7.7, respectively. Note that both quantities require the knowledge of F_{x,y}, which is unfortunately unknown in real situations. A key issue in machine learning is then to take advantage of observable quantities, i.e. quantities that may be computed on the basis of the observed dataset, to estimate or approximate the measures discussed above. An important quantity in this sense is the empirical risk (7.2.8), which has however to be considered carefully in order to avoid overly optimistic evaluations of the learning machine accuracy.
7.5.1 An illustrative example
The notation introduced in Sections 7.2 and 7.5 is rigorous, but it may appear hostile to the practitioner. In order to make the statistical concepts more accessible, we present a simple example to illustrate them.

Figure 7.9: Training set (dots) obtained by sampling uniformly in the interval [−2, 2] an input/output distribution with regression function f(x) = x³ and unit variance.

We consider a supervised learning regression problem where:
• The input is a scalar random variable x ∈ ℝ with a uniform probability distribution over the interval [−2, 2].
• The target is distributed according to a conditional Gaussian distribution
\[ p_y(y \mid x = x) = \mathcal{N}(x^3, 1) \tag{7.5.25} \]
where the conditional expected value E[y|x] is the regression function f(x) = x³ and the noise w has unit variance.
• The training set D_N = {⟨x_i, y_i⟩}, i = 1, ..., N, consists of N = 100 i.i.d. pairs (Figure 7.9) generated according to the distribution (7.5.25). Note that this training set can be easily generated with the following R commands:

## script regr.R
N <- 100
X <- runif(N, -2, 2)   # N i.i.d. inputs, uniform over [-2, 2]
Y <- X^3 + rnorm(N)    # outputs: f(x) = x^3 plus unit-variance Gaussian noise
plot(X, Y)
•The learning machine is characterised by the following three components:
1. A class of hypothesis functions h (x, α ) = αx consisting of all the linear
models passing through the origin. The class Λ is then the set of real
numbers.
Figure 7.10: The empirical risk for the training set D_N vs. the model parameter value (x-axis). The minimum of the empirical risk is attained at α = 2.3272.
2. A quadratic loss L(y, h(x)) = (y − h(x))².
3. An algorithm of parametric identification based on the least-squares technique, which will be detailed later in Section 9.1.2. The empirical risk is the quantity
\[ R_{\text{emp}}(\alpha) = \frac{1}{100} \sum_{i=1}^{100} (y_i - \alpha x_i)^2 \tag{7.5.26} \]
The empirical risk is a function of α and the training set. For the given training set D_N, the empirical risk as a function of α is plotted in Fig. 7.10.
For the dataset D_N in Figure 7.9, it is possible to obtain α_N by minimising the empirical risk (7.5.26):
\[ \alpha_N = \arg\min_{\alpha \in \Lambda} R_{\text{emp}}(\alpha) = \arg\min_{\alpha \in \Lambda} \frac{1}{100} \sum_{i=1}^{100} (y_i - \alpha x_i)^2 = 2.3272 \tag{7.5.27} \]
The selected hypothesis is plotted in the input/output domain in Fig. 7.11.
If the joint distribution (e.g. its conditional expectation and variance) were known, it would also be possible to compute the risk functional (7.2.3) as
\[ R(\alpha) = \frac{1}{4} \int_{-2}^{2} (x^3 - \alpha x)^2 \, dx + 1 = \frac{4\alpha^2}{3} - \frac{32}{5}\alpha + \frac{71}{7} \tag{7.5.28} \]
where the derivation of the equality is sketched in Appendix C.13. For the given joint distribution, the quantity R(α) is plotted as a function of α in Fig. 7.12. The function attains its global minimum at α⁰ = 2.4, as can be derived from the analytical expression in (7.5.28).
The computation of the quantity (7.5.22) requires however an average over all
the possible realisations of the random variable αN for datasets of N = 100 points.
Figure 7.13 shows 6 different realisations of the training set for the same conditional
distribution (7.5.25) and the corresponding 6 values of αN . Note that those six
values may be considered as 6 different realisations of the sampling distribution
(Section 5.4) of αN .
It is important to remark that both the quantities (7.2.3) and (7.5.22) may be
computed only if we know a priori the data joint distribution. Unfortunately, in
real cases this knowledge is not accessible and the goal of learning theory is to study
the problem of estimating these quantities from a finite set of data.
Figure 7.11: Training set (dots) and the linear hypothesis function h(·, α_N) (straight line). The quantity α_N, which represents the slope of the straight line, is the value of the model parameter α which minimises the empirical risk.
Figure 7.12: The functional risk (7.5.28) vs. the value of the model parameter α (x-axis). The minimum of the functional risk is attained at α⁰ = 2.4.
Figure 7.13: Six different realisations of a training set with N = 100 points (dots)
and the relative hypotheses (solid straight lines) chosen according to the ERM
principle (7.5.27).
Monte Carlo computation of generalisation error
The script functRisk.R computes by Monte Carlo the functional risk (7.5.28) for different values of α and returns the value α⁰ = 2.4 which minimises it. Note that the functional risk is computed by generating a very large number of i.i.d. test examples.
The script gener.R computes by Monte Carlo the generalisation error (7.5.23). Unlike the previous script, which considers only the predictive value of different hypotheses (with different α), this script assesses the average accuracy of the empirical risk minimisation strategy (7.5.27) for a finite number N = 100 of examples.
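A minimal sketch of such a Monte Carlo computation is given below. It is an illustration under the assumptions of the example (uniform input, cubic regression function, unit noise variance), not a transcription of the functRisk.R and gener.R scripts:

## Sketch (not the original scripts): Monte Carlo estimates of R(alpha) and G_N
set.seed(0)
Nts <- 1e5                                   # large i.i.d. test set
Xts <- runif(Nts, -2, 2); Yts <- Xts^3 + rnorm(Nts)
alphas <- seq(1, 4, by = 0.01)
R <- sapply(alphas, function(a) mean((Yts - a * Xts)^2))
alphas[which.min(R)]                         # close to alpha0 = 2.4
## G_N: average risk of the ERM hypotheses over S training sets of size N = 100
S <- 1000; N <- 100
RN <- replicate(S, {
  X <- runif(N, -2, 2); Y <- X^3 + rnorm(N)
  aN <- sum(X * Y) / sum(X^2)                # least-squares slope through the origin
  mean((Yts - aN * Xts)^2)                   # functional risk of the learned hypothesis
})
mean(RN)                                     # Monte Carlo estimate of G_N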
•
7.6 Functional and empirical risk
This section reports some results from the pioneering work of Prof. Vladimir Vapnik [188, 186, 187] on statistical learning. He defines the learning problem as the problem of finding the hypothesis which minimises the functional risk (7.2.3) on the basis of a finite set of observed data and without any specific assumption about the data distribution. For details and mathematical derivations, we refer the reader to his books [186, 187]. Here we limit ourselves to reporting some of his most significant results. We start by rewriting the functional risk (7.2.3) as
\[ R(\alpha) = \int L(y, h(x, \alpha)) \, dF_{x,y}(x, y) = \int Q(z, \alpha) \, dF_z(z), \qquad \alpha \in \Lambda \tag{7.6.29} \]
where z = ⟨x, y⟩, Q(z, α) = L(y, h(x, α)), and the probability measure F_z(·) is unknown but an i.i.d. sample z₁, ..., z_N is given. Analogously, the empirical risk may be rewritten as
\[ R_{\text{emp}}(\alpha_N) = \frac{1}{N} \sum_{i=1}^{N} Q(z_i, \alpha_N) \]
Let us define with Λ* the set of all possible single-valued mappings f : X → Y and consider the quantity
\[ \alpha^* = \arg\min_{\alpha \in \Lambda^*} R(\alpha) \]
where R(α*) is the absolute minimum rate of functional risk (7.2.6). We can write the equality
\[ R(\alpha_N) - R(\alpha^*) = (R(\alpha_N) - R(\alpha^0)) + (R(\alpha^0) - R(\alpha^*)) = \text{Err}_{\text{estim}}(\alpha_N) + \text{Err}_{\text{approx}}(\alpha_N) \]
where α⁰ is the hypothesis with lowest risk in Λ (Equation (7.2.4)).
The first right-hand term is the estimation error while the second is the ap-
proximation error (Figure 7.14). The estimation error represents the discrepancy
between the generalisation error of the best hypothesis in the class (R ( α0 )) and the
one learned from DN (R ( αN )). The approximation error is non null when the best
hypothesis in the class Λ (h ( ·, α0 )) is different from h (·, α∗ ).
The trade-off between approximation and estimation error is controlled by the size of Λ: when the size of Λ is large, R(α⁰) is close to R(α*) but the estimation error could be large. Conversely, if the size of Λ is small, the estimation error is limited but the approximation error could be non-negligible.
Figure 7.14: Decomposition of the functional risk into estimation and approximation error.
7.6.1 Consistency of the ERM principle
Functional and empirical risk are two key quantities in statistical learning (Fig-
ure 7.15). The functional risk represents the generalisation accuracy of the hypoth-
esis once tested with new data while R emp ( ·) measures the accuracy of the fitting
to the training set. A main issue is that R_emp(·) could be a very poor estimator of the functional risk, e.g. when the class of hypotheses is too rich with respect to the size of the observed sample.
According to Vapnik it is important to characterise the relation between those
two quantities, i.e. to define the (necessary and sufficient) conditions for the em-
pirical risk R emp (αN ) to converge for N → ∞ to the best functional risk R (α0 )
in the class Λ. This is known as the problem of consistency of the Empirical Risk
Minimisation (ERM) principle.
In formal terms, the ERM principle is consistent for the set of functions Q(z, α), α ∈ Λ, and for the probability distribution P_z(z) if the following two sequences converge in probability to the same limit:
\[ R(\alpha_N) \xrightarrow[N \to \infty]{P} R(\alpha^0), \qquad R_{\text{emp}}(\alpha_N) \xrightarrow[N \to \infty]{P} R(\alpha^0) \]
The following lemma shows that both convergences may be studied by considering the quantity sup_{α∈Λ} |R_emp(α) − R(α)|.

Lemma 3 (Devroye 1988).
\[ R(\alpha_N) - \inf_{\alpha \in \Lambda} R(\alpha) = R(\alpha_N) - R(\alpha^0) \le 2 \sup_{\alpha \in \Lambda} |R_{\text{emp}}(\alpha) - R(\alpha)| \]
\[ |R_{\text{emp}}(\alpha_N) - R(\alpha_N)| \le \sup_{\alpha \in \Lambda} |R_{\text{emp}}(\alpha) - R(\alpha)| \]
Setting an upper bound on sup_{α∈Λ} |R_emp(α) − R(α)|, we obtain an upper bound for three quantities:
Figure 7.15: Functional and empirical risk.
1. the estimation error R(α_N) − R(α⁰), which quantifies the sub-optimality of the model chosen by the ERM principle within the class α ∈ Λ;
2. |R_emp(α_N) − R(α_N)|, that is, the error committed when the empirical risk is used to estimate the functional risk of the selected model;
3. |R_emp(α_N) − R(α⁰)|, that is, the error made when the empirical risk is used to estimate the functional risk of the best model in the class Λ.
It can be shown that bounding sup_{α∈Λ} |R_emp(α) − R(α)| is not only a sufficient but also a necessary condition for the consistency of the ERM principle.
7.6.2 Key theorem of learning
Theorem 6.1 (Vapnik, Chervonenkis, 1991). Let Q(z, α), α ∈ Λ, be a set of functions that satisfy the condition
\[ a \le \int Q(z, \alpha) \, dP(z) \le b \]
A necessary and sufficient condition for the ERM principle to be consistent is that the empirical risk R_emp(α) converges uniformly to the actual risk R(α) over the set Q(z, α), α ∈ Λ, that is
\[ \lim_{N \to \infty} \text{Prob}\Big\{ \sup_{\alpha \in \Lambda} \big(R(\alpha) - R_{\text{emp}}(\alpha)\big) > \varepsilon \Big\} = 0 \qquad \forall \varepsilon > 0 \]
This theorem rephrases the problem of ERM consistency as a problem of uniform
convergence, which ensures that the empirical risk is a good approximation of the
functional risk over all functions of Λ (i.e. including the worst-case).
The uniform convergence is trivial and guaranteed by the Law of Large Numbers
if the set of functions Q (z, α ) contains a single element: in fact, this is nothing more
than the convergence of the average to expectation for increasing N. For a real-
valued bounded function a ≤ Q(z, α) ≤ b, by Hoeffding's inequalities (Section 5.6) we have
\[ \text{Prob}\Big\{ \int Q(z, \alpha) \, dP(z) - \frac{1}{N} \sum_{i=1}^{N} Q(z_i, \alpha) > \varepsilon \Big\} < \exp\Big( -\frac{2\varepsilon^2 N}{(b-a)^2} \Big) \]
Then the probability of a deviation between empirical and functional risk converges to zero for N → ∞. It is easy to generalise to the case where the set Q(z, α) has a finite number K of elements:
\[ \text{Prob}\Big\{ \sup_{1 \le k \le K} \Big( \int Q(z, \alpha_k) \, dP(z) - \frac{1}{N} \sum_{i=1}^{N} Q(z_i, \alpha_k) \Big) > \varepsilon \Big\} < K \exp\Big( -\frac{2\varepsilon^2 N}{(b-a)^2} \Big) = \exp\Big( \Big( \frac{\ln K}{N} - \frac{2\varepsilon^2}{(b-a)^2} \Big) N \Big) \]
In order to obtain uniform convergence for any ε, the expression
\[ \lim_{N \to \infty} \frac{\ln K}{N} = 0 \tag{7.6.30} \]
has to be satisfied. A problem arises when the set of functions is infinite, as in machine learning, where the most common classes of hypotheses are uncountable. In this case, we need to generalise the classical law of large numbers to functional spaces. Consider the sequence of random variables
\[ \xi_N = \sup_{\alpha \in \Lambda} \big(R(\alpha) - R_{\text{emp}}(\alpha)\big) = \sup_{\alpha \in \Lambda} \Big( \int Q(z, \alpha) \, dF(z) - \frac{1}{N} \sum_{i=1}^{N} Q(z_i, \alpha) \Big) \]
where the set of functions Q (z, α ), α∈ Λ, has an infinite number of elements. Unlike
the finite case, the sequence ξN does not necessarily converge to zero. The problem
of learning is then strongly related to the problem of defining which properties of
the class of functions Q (z, α ), α∈ Λ, guarantee the convergence in probability of the
sequence ξ_N to zero. In the following section, we show some theoretical results from Vapnik about the relation between ERM consistency and the topological properties (notably the diversity) of the class of hypotheses.
7.6.2.1 Entropy of a set of functions
In what follows, we limit ourselves to the binary classification setting, though similar results can be shown for regression. In this setting, the functions Q(z, α), α ∈ Λ, are indicator functions since they may take only the values 0 or 1. In order to characterise the diversity of the set of functions Q(z, α), α ∈ Λ, on the dataset D_N, let N^Λ(D_N) be the number of possible separations of D_N using the functions Q(z, α), α ∈ Λ. Note that N^Λ(D_N) is a random variable since D_N is a random variable.
An example of this concept is presented in Figure 7.16³, where N = 3 and the functions h(·) implement linear separators of the 2D (n = 2) input space. This class of functions is able to perform all possible (i.e. 2^N = 8) separations of the dataset. It is also said that the class Λ of functions shatters the dataset of size N = 3. In other words, a set of N points is said to be shattered by a class of hypotheses Λ if, no matter how we assign a binary label to each point, there exists a hypothesis in Λ that separates them. Note that a set of N = 4 points is not shattered by a class of linear separators.
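The counting of separations can also be checked numerically. The sketch below (an illustration, not part of the book's companion code) estimates N^Λ(D_N) for the class of linear separators by sampling many random separating lines and collecting the distinct labelings they induce:

## Sketch: estimate the number of linear separations N^Lambda(D_N) of N random 2D points
set.seed(0)
count.separations <- function(N, trials = 2e4) {
  X <- matrix(runif(2 * N), N, 2)            # N random points in the unit square
  labelings <- replicate(trials, {
    w <- rnorm(2); b <- rnorm(1)             # random linear separator w'x + b = 0
    paste((X %*% w + b > 0) * 1, collapse = "")
  })
  length(unique(labelings))                  # number of distinct dichotomies observed
}
count.separations(3)   # typically 8 = 2^3: three points are shattered
count.separations(4)   # < 16: four points cannot be shattered by linear separators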
The quantity
\[ H^\Lambda(N) = E[\ln N^\Lambda(D_N)] \]
is called the entropy of the set of functions on the given data and measures the diversity of the class of hypotheses for a given number of observations.
The following theorem from Vapnik shows that this quantity is related to the
consistency of the ERM principle.
³ Taken from https://datascience.stackexchange.com/questions/16140/how-to-calculate-vc-dimension/16146
Figure 7.16: Number of linear separations of a dataset of N = 3 points.
Theorem 6.2. A necessary and sufficient condition for the two-sided uniform convergence of the functional risk to the empirical risk is that
\[ \lim_{N \to \infty} \frac{H^\Lambda(N)}{N} = 0 \tag{7.6.31} \]
In other words, the ratio of the entropy to the number of observations should
decrease to zero with increasing number of observations. Note that this condition
depends on the underlying probability distribution Fz (· ) and that the entropy plays,
for uncountable classes, the role played by the number of functions in the finite case
(compare (7.6.30) with (7.6.31)).
7.6.2.2 Distribution independent consistency
Vapnik was also able to extend the distribution-dependent result of the previous section to a distribution-free setting.

Theorem 6.3. A necessary and sufficient condition for consistency of ERM for any probability measure is
\[ \lim_{N \to \infty} \frac{G^\Lambda(N)}{N} = 0 \]
where
\[ G^\Lambda(N) = \ln \max_{D_N} N^\Lambda(D_N) \]
is the growth function.
Vapnik proved that, in the pattern recognition case,
\[ \text{Prob}\Big\{ \sup_{\alpha \in \Lambda} \big(R(\alpha) - R_{\text{emp}}(\alpha)\big) > \varepsilon \Big\} \le 4 \exp\Big( \Big( \frac{G^\Lambda(2N)}{N} - \varepsilon^2 \Big) N \Big) \tag{7.6.32} \]
This means that, provided that G^Λ(N) does not grow linearly in N, it is actually possible to bound the (unknown) functional risk R(α_N) on the basis of the (observable) empirical risk R_emp(α_N).
If we set the probability in (7.6.32) to δ > 0 and we solve for ε, then the following inequality holds with probability 1 − δ:
\[ R(\alpha_N) \le R_{\text{emp}}(\alpha_N) + \frac{\sqrt{E}}{2} \tag{7.6.33} \]
where the right-hand side is called the guaranteed risk and
\[ E = 4 \, \frac{G^\Lambda(2N) - \ln(\delta/4)}{N}. \]
Several other bounds have been derived for different classes of hypotheses in [186].
7.6.3 The VC dimension
Vapnik and Chervonenkis showed that either the relation G^Λ(N) = N ln 2 holds true for all N, or there exists some maximal N for which this relation is satisfied. In this case, this maximal N is called the VC (Vapnik and Chervonenkis) dimension and is denoted by D. By construction, the VC dimension is the maximal number of points which can be shattered by functions in Λ.
Theorem 6.4. Any growth function either satisfies the equality
\[ G^\Lambda(N) = N \ln 2 \]
or is bounded by the inequality
\[ G^\Lambda(N) \le D \Big( \ln \frac{N}{D} + 1 \Big) \]
where D is an integer such that, when N = D,
\[ G^\Lambda(D) = D \ln 2, \qquad G^\Lambda(D+1) < (D+1) \ln 2 \]
The VC dimension of a set of indicator functions Q(z, α) is infinite if the growth function is linear. It is finite and equal to D if the growth function is bounded by a logarithmic function with coefficient D.
The VC dimension quantifies the richness or capacity of a set of functions. If, for any N, the hypothesis functions h(·, α), α ∈ Λ, can shatter N points (i.e. separate them in all 2^N possible ways), then G^Λ(N) = N ln 2. In this case, the class of functions has an infinite capacity and there is no ERM convergence (the empirical risk is always zero whatever the functional risk): no learning from data is possible⁴.
The finiteness of the VC dimension is a necessary and sufficient condition for distribution-independent consistency of ERM learning machines. The VC dimension of the set of linear functions with n + 1 model parameters is equal to D = n + 1. Note that, though for this specific class the VC dimension equals the number of free parameters, this is not necessarily true for other families of functions. For instance, it can be shown that the VC dimension of the highly wiggly set of functions
\[ h(x, \alpha) = \sin(\alpha x), \qquad \alpha \in \mathbb{R} \]
⁴ Note that, in Popper's terminology (Section 2.6), this corresponds to a non-scientific situation where no dataset may falsify the hypothesis, or equivalently it is always possible to find a hypothesis justifying what we observe. Since the class of hypotheses is too rich, no falsification (and then no generalisation or scientific discovery) is possible.
is infinite though it has a single parameter. At the same time, one can have sets of functions with an infinite number of parameters yet a finite VC dimension. Generally speaking, the VC dimension of a set of functions can be either larger or smaller than the number of parameters. It is the VC dimension of the set of functions (rather than the number of parameters) that is responsible for the generalisation ability of learning machines.
Once D is defined, the relation between the empirical and functional risk of a class of functions with finite VC dimension is made explicit by the bound (7.6.33), where the second summand is
\[ E(D, N, \delta) = 4 \, \frac{D \big( \ln \frac{2N}{D} + 1 \big) - \ln(\delta/4)}{N} \]
The reliability of the empirical risk as approximation of the functional risk depends
on the ratio N/D . If N/D is large (i.e. sample size much larger than the VC
dimension), the E term is small and the empirical risk is a good approximation
of the functional risk. In other terms, minimising the empirical risk guarantees a
small value of the (expected) risk. On the contrary, if N/D is small (i.e. number of
samples comparable to the VC dimension), a small empirical risk Remp (αN ) does
not guarantee a small value of the actual risk. In other terms, a small empirical risk could be an optimistic (hence biased) estimator of the associated functional risk. In those configurations, to minimise the actual risk R(α_N), it is recommended to address both terms of the confidence interval (e.g. by considering alternative classes of hypotheses).
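To make the role of the N/D ratio concrete, the following small sketch (an illustration, not from the book's companion code) computes the confidence term E(D, N, δ) for increasing sample sizes:

## Sketch: confidence term E(D, N, delta) entering the guaranteed risk (7.6.33)
E.term <- function(D, N, delta = 0.05) {
  4 * (D * (log(2 * N / D) + 1) - log(delta / 4)) / N
}
## The guaranteed-risk term sqrt(E)/2 shrinks as N/D grows
sapply(c(50, 500, 5000, 50000), function(N) sqrt(E.term(D = 10, N)) / 2)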
7.7 Generalisation error
In the previous section we presented how Vapnik [185, 186, 187] formalised the
learning task as the minimisation of functional risk R ( αN ) in a situation where the
joint distribution is unknown. This section focuses on the algorithm-based criterion
GN (Equation (7.5.24)) as a measure of the generalisation error of the learning
machine.
In particular we will study how the generalisation error can be decomposed in
the regression formulation and in the classification formulation.
7.7.1 The decomposition of the generalisation error in regression
Let us now focus on the g_N measure (Equation (7.5.22)) of the generalisation error in the case of regression. In the case of a quadratic loss
\[ L(y(x), h(x, \alpha)) = (y(x) - h(x, \alpha))^2 \tag{7.7.34} \]
the quantity g_N is often referred to as the mean squared error (MSE) and its marginal (7.5.24) as the mean integrated squared error (MISE). If the regression dependency is described in the regression plus noise form (7.4.21), the conditional target density can be written as
\[ p_y(y - f(x) \mid x) = p_y(y - E_y[y|x] \mid x) = p_w(w) \tag{7.7.35} \]
where w is a noise random variable with zero mean and variance σ²_w.
This supervised learning problem can be seen as a particular instance of the estimation problem discussed in Chapter 5, where, for a given x, the unknown parameter θ to be estimated is the quantity f(x) and the estimator based on the training set is θ̂ = h(x, α_N). The MSE quantity defined in (5.5.14) coincides, apart from an additional term, with the term (7.5.22) since
\begin{align}
g_N(x) &= E_{D_N,y}[L \mid x] \tag{7.7.36}\\
&= E_{D_N,y}\big[(y - h(x, \alpha_N))^2\big] \tag{7.7.37}\\
&= E_{D_N,y}\big[(y - E_y[y|x] + E_y[y|x] - h(x, \alpha_N))^2\big] \tag{7.7.38}\\
&= E_{D_N,y}\big[(y - E_y[y|x])^2 + 2w\,(E_y[y|x] - h(x, \alpha_N)) + (E_y[y|x] - h(x, \alpha_N))^2\big] \tag{7.7.39, 7.7.40}\\
&= E_y\big[(y - E_y[y|x])^2\big] + E_{D_N}\big[(h(x, \alpha_N) - E_y[y|x])^2\big] \tag{7.7.41}\\
&= E_y[w^2] + E_{D_N}\big[(h(x, \alpha_N) - E_y[y|x])^2\big] \tag{7.7.42}\\
&= \sigma_w^2 + E_{D_N}\big[(f(x) - h(x, \alpha_N))^2\big] = \sigma_w^2 + E_{D_N}\big[(\theta - \hat{\theta})^2\big] \tag{7.7.43}\\
&= \sigma_w^2 + \text{MSE} \tag{7.7.44}
\end{align}
Note that y = f(x) + w = E_y[y|x] + w, that f is fixed but unknown, and that the noise term w is independent of D_N and satisfies E[w] = 0 and E[w²] = σ²_w.
We can then apply the bias/variance decomposition (5.5.14) to the regression problem where θ = f(x) and θ̂ = h(x, α_N):
\begin{align}
g_N(x) = E_{D_N,y}[L(x, y)] &= \sigma_w^2 + E_{D_N}\big[(h(x, \alpha_N) - E_y[y|x])^2\big] \nonumber\\
&= \underbrace{\sigma_w^2}_{\text{noise variance}} + \underbrace{\big(E_{D_N}[h(x, \alpha_N)] - E_y[y|x]\big)^2}_{\text{squared bias}} + \underbrace{E_{D_N}\big[(h(x, \alpha_N) - E_{D_N}[h(x, \alpha_N)])^2\big]}_{\text{model variance}} \nonumber\\
&= \sigma_w^2 + B^2(x) + V(x) \tag{7.7.45}
\end{align}
In a regression task, the bias B(x) measures the difference in x between the average of the outputs of the hypothesis functions over the set of possible D_N and the regression function value f(x) = E_y[y|x]. The variance V(x) reflects the variability of the guessed h(x, α_N) as one varies over training sets of fixed dimension N. This quantity measures how sensitive the algorithm is to changes in the data set,
regardless of the target. Thus, following Eq. (7.5.24), by averaging (7.7.45) over X we obtain
\[ \text{MISE} = G_N = \sigma_w^2 + \int_{\mathcal{X}} B^2(x) \, dF_x + \int_{\mathcal{X}} V(x) \, dF_x \tag{7.7.46} \]
where the three terms are
1. the intrinsic noise term reflecting the target alone,
2. the integrated squared bias reflecting the target's relation with the learning
algorithm and
3. the integrated variance term reflecting the learning algorithm alone.
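When the data-generating process is known, as in the example of Section 7.5.1, the three terms can be estimated by simulation. The sketch below (an illustration, not the book's companion code) estimates the squared bias and the variance of the least-squares hypothesis h(x, α_N) = α_N x at a fixed query point:

## Sketch: Monte Carlo bias/variance decomposition at a fixed point x0
## for the example of Section 7.5.1 (f(x) = x^3, noise variance sigma_w^2 = 1)
set.seed(0)
S <- 10000; N <- 100; x0 <- 1.5
aN <- replicate(S, {                 # sampling distribution of alphaN
  X <- runif(N, -2, 2); Y <- X^3 + rnorm(N)
  sum(X * Y) / sum(X^2)              # least-squares slope through the origin
})
B2 <- (mean(aN * x0) - x0^3)^2       # squared bias at x0
V  <- var(aN * x0)                   # model variance at x0
c(squared.bias = B2, variance = V, gN = 1 + B2 + V)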
As the aim of a learning machine is to minimise the quantity GN and the com-
putation of (7.7.46) requires the knowledge of the joint input/output distribution,
this decomposition could appear as a useless theoretical exercise. In practical set-
tings, the designer of a learning machine does not have access to the term GN but
can only estimate it on the basis of the training set. Nevertheless, the bias/variance
decomposition is relevant in practical learning too since it provides a useful hint
about how to control the error GN . In particular, the bias term measures the lack
of representational power of the class of hypotheses. This means that to reduce the
Figure 7.17: Bias/variance/noise tradeoff in regression: a qualitative representation of the relationship between the hypothesis' bias and variance and the capacity of the class of functions. The MISE generalisation error is the sum of the three terms (squared bias, hypothesis variance and noise variance) as shown in (7.7.46). Note that the variance of the noise is supposed to be target-independent and hence constant.
bias term of the generalisation error we should consider classes of hypotheses with a large capacity s, or in other words, hypotheses which can approximate a large number of input/output mappings. On the other hand, the variance term warns us against an excessive capacity (or complexity) s of the approximator. This means that a class of too powerful hypotheses runs the risk of being excessively sensitive to the noise affecting the training set; therefore, our class Λ_s could contain the target, but it could be practically impossible to identify it on the basis of the available dataset.
In other terms, it is commonly said that a hypothesis with large bias but low variance underfits the data, while a hypothesis with low bias but large variance overfits the data. In both cases, the hypothesis gives a poor representation of the target and a reasonable trade-off needs to be found.
A graphical illustration of the bias/variance/noise tradeoff (7.7.46) is given in Figure 7.17. The left side of the figure corresponds to an underfitting configuration, where the model has too low a capacity (i.e. high bias) to capture the nonlinearity of the regression function. The right side of the figure corresponds to an overfitting configuration, where the model capacity is too large (i.e. high variance), leading to high instability and poor generalisation. Note that Figure 7.17 requires a formal definition of the notion of capacity and that it is only a qualitative visualisation of the theoretical link between the hypothesis' properties and the capacity of the class of functions. Nevertheless, it provides useful hints about the impact of the learning procedure on the final generalisation accuracy. The task of the model designer is to search for the optimal trade-off between the variance and the bias terms (ideally the capacity s* in Figure 7.17), on the basis of the available training set. Section 7.9 will discuss how this search proceeds in practice in a real setting.
Two naive predictors
Consider a regression task y = f(x) + w, where Var[w] = σ²_w, and two naive predictors:
1. h⁽¹⁾(x) = 0
2. h⁽²⁾(x) = (1/N) Σ_{i=1}^N y_i
What about their generalisation errors at x = x̄? By using (7.7.45) we obtain
\[ g_N^{(1)}(\bar{x}) = \sigma_w^2 + f(\bar{x})^2, \qquad g_N^{(2)}(\bar{x}) = \sigma_w^2 + (f(\bar{x}) - E[y])^2 + \text{Var}[y]/N \]
The script naive.R executes a Monte Carlo validation of the formulas above.
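A minimal Monte Carlo check along these lines might look as follows (a sketch under the assumptions of the example of Section 7.5.1, not the original naive.R script):

## Sketch: Monte Carlo check of the two naive predictors at xbar (f(x) = x^3, sigma_w = 1)
set.seed(0)
f <- function(x) x^3
N <- 100; S <- 10000; xbar <- 1
y.test <- f(xbar) + rnorm(S)                             # test outputs at x = xbar
h2 <- replicate(S, mean(f(runif(N, -2, 2)) + rnorm(N)))  # h2 learned on S training sets
mean((y.test - 0)^2)    # ~ sigma_w^2 + f(xbar)^2 = 2
mean((y.test - h2)^2)   # ~ sigma_w^2 + (f(xbar) - E[y])^2 + Var[y]/N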
•
7.7.2 The decomposition of the generalisation error in classification
Let us consider a classification task with K output classes and a loss function L. For a given input x, we denote by ŷ the class predicted by the classifier h(x, α_N) trained with a dataset D_N. We derive the analytical expression of g_N(x), usually referred to as the mean misclassification error (MME).
\begin{align}
\text{MME}(x) &= E_{y,D_N}[L(y, h(x, \alpha_N)) \mid x] = E_{y,D_N}[L(y, \hat{y})] \tag{7.7.47}\\
&= E_{y,D_N}\Big[ \sum_{k,j=1}^{K} L(j,k) \, 1(\hat{y} = c_j \mid x) \, 1(y = c_k \mid x) \Big] \tag{7.7.48}\\
&= \sum_{k,j=1}^{K} L(j,k) \, E_{D_N}[1(\hat{y} = c_j \mid x)] \, E_y[1(y = c_k \mid x)] \tag{7.7.49}\\
&= \sum_{k,j=1}^{K} L(j,k) \, \text{Prob}\{\hat{y} = c_j \mid x\} \, \text{Prob}\{y = c_k \mid x\} \tag{7.7.50}
\end{align}
where 1(·) is the indicator function which returns zero when the argument is false and one otherwise. Note that the distribution of ŷ depends on the training set D_N, while the distribution of y is the distribution of a test set (independent of D_N).
For a zero-one loss function, since y and ŷ are independent, the MME expression simplifies to
\begin{align}
\text{MME}(x) &= \sum_{k,j=1}^{K} 1(c_j \ne c_k) \, \text{Prob}\{\hat{y} = c_j \mid x\} \, \text{Prob}\{y = c_k \mid x\} \nonumber\\
&= 1 - \sum_{k,j=1}^{K} 1(c_j = c_k) \, \text{Prob}\{\hat{y} = c_j \mid x\} \, \text{Prob}\{y = c_k \mid x\} \nonumber\\
&= 1 - \sum_{k=1}^{K} \text{Prob}\{\hat{y} = c_k \mid x\} \, \text{Prob}\{y = c_k \mid x\} = \text{Prob}\{y \ne \hat{y}\} \tag{7.7.51}
\end{align}
A decomposition of a related quantity was proposed in [198]. Let us consider the squared sum:
\[ \frac{1}{2} \sum_{j=1}^{K} \big(\text{Prob}\{y = c_j\} - \text{Prob}\{\hat{y} = c_j\}\big)^2 = \frac{1}{2} \sum_{j=1}^{K} \text{Prob}\{y = c_j\}^2 + \frac{1}{2} \sum_{j=1}^{K} \text{Prob}\{\hat{y} = c_j\}^2 - \sum_{j=1}^{K} \text{Prob}\{y = c_j\} \, \text{Prob}\{\hat{y} = c_j\} \]
By adding one to both members and by using (7.7.47), we obtain a decomposition analogous to the one in (7.7.45):
\begin{align}
g_N(x) = \text{MME}(x) &= \underbrace{\frac{1}{2}\Big(1 - \sum_{j=1}^{K} \text{Prob}\{y = c_j \mid x\}^2\Big)}_{\text{"noise"}} \nonumber\\
&+ \underbrace{\frac{1}{2} \sum_{j=1}^{K} \big(\text{Prob}\{y = c_j \mid x\} - \text{Prob}\{\hat{y} = c_j \mid x\}\big)^2}_{\text{"squared bias"}} \nonumber\\
&+ \underbrace{\frac{1}{2}\Big(1 - \sum_{j=1}^{K} \text{Prob}\{\hat{y} = c_j \mid x\}^2\Big)}_{\text{"variance"}} \tag{7.7.52}
\end{align}
The noise term measures the degree of uncertainty of y and consequently the degree of stochasticity of the dependence. It equals zero if and only if there exists a class c such that Prob{y = c | x} = 1. Note that this quantity depends neither on the learning algorithm nor on the training set.
The variance term measures how variant the classifier prediction ŷ = h(x, α_N) is. This quantity is zero if the predicted class is always the same regardless of the training set.
The squared bias term measures the squared difference between the y and the ŷ probability distributions on the domain Y.
7.8 The hypothesis-based vs the algorithm-based
approach
In the previous sections we introduced two different manners of assessing the accu-
racy of a learning machine. The reader could logically raise the following question:
which approach is the most adequate in practice?
Instead of providing a direct answer to such a question, we prefer to conduct a short comparison of the assumptions and limitations related to the two approaches.
The hypothesis-based approach formulates learning as the problem of finding
the hypothesis which minimises the functional risk. Vapnik reformulates this prob-
lem into the problem of consistency of a learning process based on ERM. The main
result is that it is possible to define a probabilistic distribution-free bound on the
functional risk which depends on the empirical risk and the VC dimension of the class of hypotheses. Though this achievement is impressive from a theoretical and scientific perspective (it was published in a Russian book in the 60s), its adoption in practical settings is not always easy for several reasons: results derive from asymptotic considerations, though learning by definition deals with finite samples; the computation of the VC dimension is explicit only for specific classes of hypothesis functions; and the bound, derived from worst-case analysis, is not always tight enough for practical purposes.
The algorithm-based approach relies on the possibility of emulating the stochas-
tic process underlying the dataset by means of resampling procedures like cross-
validation or bootstrap. Note that this approach is explicitly criticised by Vapnik
and others who consider it inappropriate to reason in terms of data generation once
a single dataset is available. According to [58], "averaging over the data would be unnatural, because in a given application, one has to live with the data at hand. It would be marginally useful to know the number GN as this number would indicate the quality of an average data sequence, not your data sequence". Nevertheless, though it is hard to formally guarantee the accuracy of a resampling strategy, its general-purpose nature, simplicity and ease of implementation have been, over the years, key ingredients of its success.
Whatever the degree of realism of the hypotheses made by the two approaches, it is worth making a pragmatic and historical consideration. Though the Vapnik results represent a major scientific success and underlie the design of powerful
learning machines (notably SVM), in a wider perspective it is fair to say that cross-
validation is the most common and successful workhorse of practical learning appli-
cations. This means that, though most data scientists have been eager to formalise
the consistency of their algorithms in terms of Vapnik bounds, in practice they had
recourse to intensive cross-validation tricks to make it work in the real world. Now,
more than 60 years after the first computational version of learning processes, we
have enough evidence to say that cross-validation is a major element of the machine
learning success story. This is the reason why in the following sections we will focus
on an algorithm-based approach aiming to assess (and minimise) the generalisation
error by means of a resampling strategy.
7.9 The supervised learning procedure
The goal of supervised learning is to return the hypothesis with the lowest gen-
eralisation error. Since we assume that data samples are generated in a random
way, there is no hypothesis which gives a null generalisation error. Therefore, the
generalisation error GN of the hypothesis returned by a learning machine has to
be compared to the minimal generalisation error that can be attained by the best
single-valued mapping. Let us define by Λ* the set of all possible single-valued mappings h : X → Y and consider the hypothesis
\[ \alpha^* = \arg\min_{\alpha \in \Lambda^*} R(\alpha) \tag{7.9.53} \]
where R(α) has been defined in (7.2.3).
Thus, R(α*) represents the absolute minimum rate of error obtainable by a single-valued approximator of the unknown target. To keep the notation simple, we put G* = R(α*). For instance, in our illustrative example in Section 7.5.1, α* denotes the parameters of the cubic function and G* amounts to the unit variance of the Gaussian noise.
In theoretical terms, a relevant issue is to demonstrate that the generalisation
error GN of the model with parameters αN learned from the dataset DN converges
to the minimum G* for N going to infinity. Unfortunately, in real learning settings, two problems must be dealt with. The first is that the error G_N cannot be computed directly but has to be estimated from data. The second is that a single class Λ might not be large enough to contain the hypothesis α*.
A common practice to handle these problems is to decompose the learning pro-
cedure in the following sequence of steps:
1. A nested sequence of classes of hypotheses
\[ \Lambda_1 \subseteq \cdots \subseteq \Lambda_s \subseteq \cdots \subseteq \Lambda_S \tag{7.9.54} \]
Figure 7.18: Bias/variance/noise tradeoff and model selection: since the generalisation error (e.g. MISE) is not accessible in practical settings, model selection is performed on the basis of an estimation (dotted line) which may induce an error (and a variability) in the selection (7.9.55) of the best capacity.
is defined so that Λ* = ∪_{s=1}^S Λ_s, where s denotes the capacity of the class. This guarantees that the set of hypotheses taken into consideration will necessarily contain the best hypothesis α*.
A priori information, as well as considerations related to the bias/variance dilemma, can help in the design of this sequence.
2. For each class in the sequence, a hypothesis h(·, α^s_N), s = 1, ..., S, is selected by minimising the empirical risk (7.2.8). This step is defined as the parametric identification step of the learning procedure.
3. For each class in the sequence, a validation procedure returns Ĝ^s_N, which estimates the generalisation error G^s_N of the hypothesis α^s_N. This step is called the validation step of the learning procedure.
4. The hypothesis h(·, α^s̄_N) ∈ Λ_s̄ with
\[ \bar{s} = \arg\min_{s} \hat{G}^s_N \tag{7.9.55} \]
is returned as the final outcome. This final step is called the model selection step.
In order to accomplish the learning procedure, and specifically the selection in (7.9.55),
we need an estimation of the generalisation error (Section 7.10). However, since the
estimator of the generalisation error may be affected by an error (as any estima-
tor), this may induce an error and a variability in the model selection step (7.9.55)
(Figure 7.18).
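As an illustration of steps 1 to 4, the sketch below (hypothetical code, using polynomial classes of increasing degree as the nested sequence and a held-out split as a crude validation procedure) runs the whole loop on the data of Section 7.5.1:

## Sketch: parametric identification + validation + model selection
## Nested classes: polynomials of degree s = 1, ..., S
set.seed(0)
N <- 100; X <- runif(N, -2, 2); Y <- X^3 + rnorm(N)
i.tr <- 1:70                                    # crude holdout split
D.tr <- data.frame(X = X[i.tr],  Y = Y[i.tr])
D.ts <- data.frame(X = X[-i.tr], Y = Y[-i.tr])
G.hat <- sapply(1:6, function(s) {
  h <- lm(Y ~ poly(X, s), data = D.tr)          # parametric identification in class s
  mean((D.ts$Y - predict(h, D.ts))^2)           # validation: estimated generalisation error
})
which.min(G.hat)                                # model selection: chosen capacity s.bar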
7.10 Validation techniques
This section discusses validation methods to estimate the generalisation error GN
from a finite set of N observations.
The empirical risk (also called apparent error) R_emp(α_N) introduced in (7.2.7) could be the most intuitive estimator of G_N. However, it is generally known that the empirical risk is a biased (and optimistic) estimate of G_N and that R_emp(α_N) tends to be smaller than G_N, because the same data have been used both to construct and to evaluate h(·, α_N). A demonstration of the biasedness of the empirical risk for a quadratic loss function in a regression setting is available in Appendix C.14. In Section 9.1.16 we will analytically derive the biasedness of the empirical risk in the case of linear regression models.
The study of error estimates other than the apparent error is of significant
importance if we wish to obtain results applicable to practical learning scenarios.
There are two main ways to obtain better, i.e. unbiased, estimates of GN : the first
requires some knowledge on the distribution underlying the data set, the second
makes no assumptions on the data. As we will see later, an example of the first
approach is the FPE criterion (presented in Section 9.1.16.2) while examples of the
second approach are the resampling procedures.
7.10.1 The resampling methods
Cross-validation [176] is a well-known method in sampling statistics to circumvent the limits of the apparent error estimate. The basic idea of cross-validation is that one builds a model from one part of the data and then uses that model to predict the rest of the data. The dataset D_N is split l times into a training and a test subset, the first containing N_tr examples, the second containing N_ts = N − N_tr examples. Each time, the N_tr examples are used by the parametric identification algorithm L to select a hypothesis α^i_{N_tr}, i = 1, ..., l, from Λ, and the remaining N_ts examples are used to estimate the error of h(·, α^i_{N_tr}) (Fig. 7.19):
\[ \hat{R}_{\text{ts}}(\alpha^i_{N_{\text{tr}}}) = \frac{1}{N_{\text{ts}}} \sum_{j=1}^{N_{\text{ts}}} L\big(y_j, h(x_j, \alpha^i_{N_{\text{tr}}})\big) \tag{7.10.56} \]
The resulting average of the l errors R̂_ts(α^i_{N_tr}), i = 1, ..., l, is the cross-validation estimate
\[ \hat{G}_{\text{cv}} = \frac{1}{l} \sum_{i=1}^{l} \hat{R}_{\text{ts}}(\alpha^i_{N_{\text{tr}}}) \tag{7.10.57} \]
A common form of cross-validation is the "leave-one-out" (l-o-o). Let D^{(i)} be the training set with z_i removed, and h(x, α^{(i)}_N) be the corresponding prediction rule. The l-o-o cross-validated error estimate is
\[ \hat{G}_{\text{loo}} = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, h(x_i, \alpha^{(i)}_N)\big) \tag{7.10.58} \]
In this case, l equals the number of training points and N_ts = 1.
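For the linear-through-the-origin model of Section 7.5.1, a leave-one-out estimate could be sketched as follows (an illustration, not from the companion package):

## Sketch: leave-one-out estimate of the generalisation error of h(x, alpha) = alpha * x
set.seed(0)
N <- 100; X <- runif(N, -2, 2); Y <- X^3 + rnorm(N)
e.loo <- sapply(1:N, function(i) {
  a <- sum(X[-i] * Y[-i]) / sum(X[-i]^2)   # least-squares fit on D^(i) (z_i removed)
  (Y[i] - a * X[i])^2                      # squared loss on the held-out point
})
mean(e.loo)                                # the l-o-o estimate (7.10.58)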
Bootstrap (Section 6.4) is also used to return a nonparametric estimate of G_N, by repeatedly sampling the training cases with replacement. Since the empirical risk is a biased, optimistic estimate of the generalisation error and bootstrap is an effective method to assess bias (Section 6.4.3), it follows that bootstrap plays a role in a validation strategy.
A bootstrap sample D^{(b)}, b = 1, ..., B, is a "fake" dataset {z_{1b}, z_{2b}, ..., z_{Nb}} randomly selected from the training set {z₁, z₂, ..., z_N} with replacement.
Figure 7.19: Partition of the training dataset in the i-th fold of cross-validation. The quantity N^i_tr is the number of training points, while N^i_ts is the number of test points.
Efron and Tibshirani [65] proposed to use bootstrap to correct the bias (or optimism) of the empirical risk by adopting a strategy similar to Section 6.4.3. Equation (6.4.9) estimates the bias of an estimator by computing the gap between the average bootstrap estimate (6.4.8) and the sample estimate. In the case of generalisation, the sample estimate θ̂ is the empirical risk and the bootstrap estimate θ̂^{(·)} may be computed as follows:
\[ \hat{G}^{(\cdot)} = \frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{N} P_{ib} \, L\big(y_i, h(x_i, \alpha^{(b)})\big) \tag{7.10.59} \]
where P_{ib} indicates the proportion of the bootstrap sample D^{(b)}, b = 1, ..., B, containing the i-th training point z_i,
\[ P_{ib} = \frac{\#\{1 \le j \le N : z_{jb} = z_i\}}{N} \tag{7.10.60} \]
and α^{(b)} is the output of the parametric identification performed on the set D^{(b)}.
The difference between the empirical risk and (7.10.59),
\[ \text{Bias}_{\text{bs}} = \frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{N} \Big( P_{ib} - \frac{1}{N} \Big) L\big(y_i, h(x_i, \alpha^{(b)})\big) \tag{7.10.61} \]
is the bias correction term to be subtracted from the empirical risk to obtain a bootstrap bias-corrected estimate (6.4.10) of the generalisation error.
An alternative consists in using the holdout principle in combination with the
bootstrap one [65]. Since each bootstrap set is a resampling of the original training
set, it may happen that some of the original examples (called out-of-bag ) do not
belong to it: we can then use them to have an independent holdout set to be used
7.11. CONCLUDING REMARKS 179
for generalisation assessment. The bootstrap estimation of the generalisation error (also known as E0) is then
\[ \hat{G}_{\text{bs}} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|B^{(i)}|} \sum_{b \in B^{(i)}} L\big(y_i, h(x_i, \alpha^{(b)})\big) \tag{7.10.62} \]
where B^{(i)} is the set of bootstrap samples which do not contain the i-th point and |B^{(i)}| is its size. The terms where |B^{(i)}| = 0 are discarded.
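A possible out-of-bag computation of (7.10.62), again for the model of Section 7.5.1 (a sketch, not the book's code), is:

## Sketch: out-of-bag (E0) bootstrap estimate of the generalisation error
set.seed(0)
N <- 100; X <- runif(N, -2, 2); Y <- X^3 + rnorm(N)
B <- 200
E <- matrix(NA, N, B)                       # E[i, b]: loss of model b on point i, if out-of-bag
for (b in 1:B) {
  ib <- sample(N, replace = TRUE)           # bootstrap sample D^(b)
  a  <- sum(X[ib] * Y[ib]) / sum(X[ib]^2)   # parametric identification on D^(b)
  oob <- setdiff(1:N, ib)                   # out-of-bag points
  E[oob, b] <- (Y[oob] - a * X[oob])^2
}
mean(rowMeans(E, na.rm = TRUE), na.rm = TRUE)  # G.bs: points never out-of-bag are discarded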
7.11 Concluding remarks
The goal of a learning procedure is to return a hypothesis which is able to predict
accurately the outcome of an input/output probabilistic mapping on the basis of
past observations. In order to achieve this goal, the learning procedure has to deal
with three major difficulties.
Minimisation of the empirical risk: in the general case, finding the global minimum of the empirical risk as in (7.2.7) demands the resolution of a multivariate and nonlinear optimisation problem for which no analytical solution may exist. Some heuristics to address this issue are discussed in Section 8.6.
Finite number of data: in real problems, a single random realisation of the statistical process, made of a finite number of input/output pairs, is accessible to the learning machine. This means that the hypothesis generated by a learning algorithm is a random variable as well. In theory, it would be required to have access to the underlying process and to generate the training set several times in order to have a reliable assessment of the learning algorithm. In practice, the use of repeated realisations is not viable in a real learning problem.
The validation procedure copes with this problem by trying to assess a random variable on the basis of a single realisation. In particular, we focused on cross-validation, a resampling method which works by simulating the stochastic process underlying the data.
No a priori knowledge: we consider a setting where no knowledge about the
process underlying the data is available. This lack of a priori knowledge puts
no constraints on the complexity of the class of hypotheses to consider, with
the consequent risk of using an inadequate type of approximator. The model
selection deals with this issue by considering classes of hypotheses of increasing
complexity and selecting the one which behaves the best according to the
validation criteria. This strategy ensures the covering of the whole spectrum of approximators, ranging from low bias/high variance to high bias/low variance models, making it easier to select a good trade-off on the basis of the available data.
So far, the learning problem has been introduced and discussed for a generic class
of hypotheses, and we did not distinguish on purpose between different learning
machines. The following chapter will show the parametric and the structural iden-
tification procedure as well as the validation phase for some specific learning ap-
proaches.
7.12 Exercises
1. Consider an input/output regression task where n = 1, E[y|x] = sin(πx/2), p(y|x) = N(sin(πx/2), σ²) with σ = 0.1, and x ∼ U(−2, 2). Let N be the size of the training set and consider a quadratic loss function.
Let the class of hypotheses be
\[ h_M(x) = \alpha_0 + \sum_{m=1}^{M} \alpha_m x^m, \qquad \alpha_j \in [-2, 2], \; j = 0, \ldots, M. \]
For N = 20, generate S = 50 replicates of the training set. For each replicate, estimate the value of the parameters that minimise the empirical risk, then compute the empirical risk and the functional risk.
1. Plot the evolution of the distribution of the empirical risk for M = 0, 1, 2.
2. Plot the evolution of the distribution of the functional risk for M = 0, 1, 2.
Hints: to minimise the empirical risk, perform a grid search in the space of parameter values, i.e. by sweeping all the possible values of the parameters in the set [−1, −0.9, −0.8, ..., 0.8, 0.9, 1]. To compute the functional risk, generate a set of N_ts = 10000 i.i.d. input/output testing examples.
Solution: See the file Exercise6.pdf in the directory gbcode/exercises of the companion R package (Appendix F).
Chapter 8
The machine learning procedure
8.1 Introduction
Raw data is rarely of direct benefit. Its true value resides in the amount of informa-
tion that a model designer can extract from it. Modelling from data is often viewed
as an art form, mixing the insight of the expert with the information contained in the
observations. Typically, a modelling process is not a sequential process but is better
represented as a sort of loop with a lot of feedback and a lot of interactions with
the designer. Different steps are repeated several times aiming to reach, through
continuous refinements, a good model description of the phenomenon underlying
the data.
This chapter reviews the practical steps constituting the process of constructing
models for accurate prediction from data. Note that the overview is presented with
the aim of not distinguishing between the different families of approximators and
of showing which procedures are common to the majority of modelling approaches.
The following chapters will be instead devoted to the discussion of the peculiarities
of specific learning approaches.
We partition the data modelling process into two phases: a preliminary phase
which leads from the raw data to a structured training set, and a learning phase,
which leads from the training set to the final model. The preliminary phase is
made of a problem formulation step (Section 8.2) where the designer selects the
phenomenon of interest and defines the relevant input/output features, an experi-
mental design step (Section 8.3) where input/output data are collected, and a data
preprocessing step (Section 8.4) where preliminary conversion and filtering of data
is performed.
Once the numeric dataset has been formatted, the learning procedure begins.
In qualitative terms, this procedure can be described as follows. First, the designer
defines a set of models (e.g. polynomial models, neural networks) characterised by a capacity (or complexity) index (or hyper-parameter) (e.g. degree of the polynomial, number of neurons, VC dimension) which controls the approximation power of the model. According to the capacity index, the set of models is consequently decomposed into a nested sequence of classes of models (e.g. classes of polynomials with increasing degree). Hence, a structural identification procedure loops over the
with increasing degree). Hence, a structural identification procedure loops over the
set of classes, first by identifying a parametric model for each class (parametric
identification) and then by assessing the prediction error of the identified model on
the basis of the finite set of points (validation ). Finally, a model selection procedure
selects the final model to be used for future predictions. A common alternative to
model selection is a model combination, where a combination (e.g. averaging) of
the most promising models is used to return a meta-model, presumably with better
accuracy properties.
The problem of parametric identification is typically a problem of multivariate
optimisation [2]. Section 8.6 introduces the most common optimisation algorithms
for linear and nonlinear configurations. Structural identification is discussed in
Section 8.8, which focuses on the existing methods for model generation, model
validation and model selection. The last section concludes and summarises the whole modelling process with the support of a diagram.
8.2 Problem formulation
The problem formulation is the preliminary and perhaps the most critical step of a learning procedure. The model designer chooses a particular application domain
(e.g. finance), a phenomenon to be studied (e.g. the credit risk of a customer)
and hypothesises the existence of an unknown dependency (e.g. between the finan-
cial situation of the customer and the default risk) which is to be estimated from
experimental data. First, the modeller specifies a set of constructs , i.e. abstract
concepts or high-level topics which are potentially relevant for the study (e.g. the
profile and the financial situation of a client). Second, a set of variables (e.g. the
client age and her salary) is defined by grounding the constructs into a measurable
form. Eventually, an operationalisation, i.e. the definition of how to measure those
variables (e.g. by accessing a bank database), is proposed.
In this step, domain-specific knowledge and experience are the most crucial
requirements to come up with a meaningful problem formulation. Note that the
time spent in this phase is usually highly rewarding and can save vast amounts
of modelling time. There is often no substitute for physical intuition and human
analytical skills.
8.3 Experimental design
The most precious thing in data-driven modelling is the data itself. No matter how
powerful a learning method is, the resulting model would be ineffective if the data
are not informative enough. Hence, it is necessary to devote a great deal of attention
to the process of observing and collecting the data. In input/output modelling it is
essential that the training set be a representative sample of the phenomenon and
cover the input space adequately. To this aim, it is relevant to consider the relative
importance of the various areas of the input space. Some regions are more relevant
than others, as in the case of a dynamical system whose state has to be regulated
about some specified operating point.
The discipline of creating an optimal sampling of the input space is called exper-
imental design [72]. The study of experimental design is concerned with locating
training input data in the space of input variables so that the performance of the
modelling process is maximised. However, in some cases, the designer cannot ma-
nipulate the process of collecting data, and the modelling process has to deal with
what is available. This configuration, which is common to many real problems, is
called the observational setting [42]. Though this setting seems the most adequate for a learning approach ("just learn from what you observe"), it is worth remembering that, most of the time, behind an observational setting there is the strong implicit assumption that the observations are i.i.d. samples of a stationary (i.e. invariant) stochastic process. Now, in most realistic cases, this assumption is not valid (at least not for a long time), and considerations of nonstationarity and drift should be integrated into the learning process. Other problems are related to the poor causal value of inferences made in an observational setting, e.g. in situations of sampling bias or non-observable variables. Nevertheless, given the introductory nature of this book, in what follows we will limit ourselves to the simplest observational and stationary setting.
8.4 Data pre-processing
Once data have been recorded, it is common practice to pre-process them. The hope
is that such treatment might make learning easier and improve the final accuracy.
Pre-processing includes a large set of actions on the observed data, and some of
them are worth being discussed:
Numerical encoding. Some interesting data for learning might not be in a nu-
meric format (e.g. text, image). Since, in what follows, we will assume that
all data are numeric, a preliminary conversion or encoding step is needed.
Given that most encoding procedures are domain-specific, we will not further
discuss them here.
Missing data treatment. In real applications, it often happens that some input
values are missing. If the quantity of data is sufficiently large, the simplest
solution is to discard the examples having missing features. When the amount
of data is too restricted or there are too many partial examples, it becomes
important to adopt some specific technique to deal with the problem. Various
heuristics [93], as well as methods based on the Expectation Maximisation
(EM) algorithm [79], have been proposed in the literature. Note that any
missing data treatment strategy makes assumptions about the process that
caused some observations to be missing (e.g. missing at random or not) [124]:
it is recommended to be aware of such assumptions before applying them (see
also Section 13.7.4 on selection bias).
Categorical variables. It may be convenient to transform categorical variables, specifically in situations where they may take a very large number of values (e.g. names of retailers in a business intelligence application). Two common ways to deal with them are: i) replace them with dummy variables encoding the different values in binary terms (e.g. K bits for K categories); ii) replace each category with numerical values informative about the conditional distribution of the target given such category: for instance, in regression (binary classification) we could replace a category x = "black" with an estimation of E[y | x = "black"] (Prob{y = 1 | x = "black"}). A sketch of both encodings is given after this list.
Feature selection. The feature selection problem consists in selecting a relevant
subset of input variables in order to maximise the performance of the learning
machine. This approach is useful if there are inputs that carry only little useful
information or are strongly correlated. In these situations a dimensionality
reduction improves the performance reducing the variance of the estimator at
the cost of a slight increase in the bias.
Several techniques exist for feature selection, such as conventional methods
in linear statistical analysis [59], principal component analysis [139] and the
general wrapper approach [117]. For more details, we refer the reader to
Chapter 12.
Outliers removal. Outliers are unusual data values that are not consistent with most of the observations. Commonly, outliers are due to wrong measurement procedures, storage errors and coding malfunctioning. There are two common strategies to deal with outliers: the first is performed at the preprocessing stage [71] and consists in their detection and consequent removal; the second is to delay their treatment to the model identification level by adopting robust methodologies [105] that are by design insensitive to outliers.
Other common preprocessing operations are pre-filtering to remove noise effects, anti-aliasing to deal with sampled signals, variable scaling to standardise all variables to zero mean and unit variance¹, compensation of nonlinearities, or the integration of some domain-specific information to reduce the distorting effects of measurable disturbances.
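The two categorical encodings mentioned above might look as follows in R (a hedged sketch with hypothetical variable names):

## Sketch: dummy encoding and target encoding of a categorical variable
set.seed(0)
D <- data.frame(colour = sample(c("black", "red", "white"), 20, replace = TRUE),
                y = rnorm(20))
## i) dummy variables (one binary column per category, intercept dropped)
dummies <- model.matrix(~ colour - 1, data = D)
## ii) replace each category with an estimate of E[y | category]
## (in practice this estimate should be computed on training folds only)
target.enc <- ave(D$y, D$colour, FUN = mean)
head(cbind(D, dummies, target.enc))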
8.5 The dataset
The outcome of the pre-processing phase is a dataset in a tabular numeric form
where each row represents a particular observation (also called instance, example
or data point), and each column a descriptive variable (also called feature, attribute
or covariate). We denote the dataset by
\[ D_N = \{z_1, z_2, \ldots, z_N\} \]
where N is the number of examples, n is the number of features, the i-th example is an input/output pair z_i = ⟨x_i, y_i⟩, i = 1, ..., N, x_i is an [n × 1] input vector and y_i is a scalar output.
Note that hereafter, for the sake of simplicity, we will restrict ourselves to a
regression setting. We will assume the input/output data to be i.i.d. generated by
the following stochastic dependency:
\[ y = f(x) + w, \tag{8.5.1} \]
where E[w] = 0 and σ²_w is the noise variance.
The noise term w is supposed to lump together all the unmeasured contributions to the variability of y, like, for instance, missing or non-observable variables. There are two main assumptions underlying Equation (8.5.1). The first is that the noise is independent of the input and has a constant variance. This assumption, also called homoskedasticity in econometrics, is typically made in machine learning because of the primary focus on the dependency f and the lack of effective methodologies to
assess it a priori in nonlinear and high dimensional settings. The reader should
be aware, however, that heteroskedastic configurations may have a strong impact
on the final model accuracy and that some a priori output variable transformation
and/or a posteriori assessment is always recommended (e.g. study of the residual
distribution after fitting). The second assumption is that noise enters additively
to the output. Sometimes the measurements of the inputs to the system may also
be noise corrupted; in system identification literature, this is what is called the
error-in-variable configuration [9]. As far as this problem is concerned, we adopt
the pragmatic approach proposed by Ljung [125], which assumes that the measured
input values are the actual inputs and that their deviations from the correct values
propagate through fand lump into the noise w.
In the following, we will refer to the set of vectors $x_i$ and $y_i$ through the following matrices:
1. the input matrix $X$ of dimension $[N \times n]$ whose $i$th row is the vector $x_i^T$,
2. the output vector $Y$ of dimension $[N \times 1]$ whose $i$th component is the scalar $y_i$.
8.6 Parametric identification
Assume that a class of hypotheses $h(\cdot, \alpha)$ with $\alpha \in \Lambda$ has been fixed. The problem of parametric identification from a finite set of data consists in seeking the hypothesis whose vector of parameters $\alpha_N \in \Lambda$ minimises the loss function
$$R_{emp}(\alpha) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, h(x_i, \alpha)) \qquad (8.6.2)$$
This phase of the learning procedure requires the resolution of the optimisation task (7.2.7). In this section, we will review some of the most common algorithms that address the problem of parametric identification in linear and nonlinear cases. To make the notation more readable, henceforth we will define the error function
$$J(\alpha) = R_{emp}(\alpha)$$
and we will formulate the optimisation problem (7.2.7) as
$$\alpha_N = \arg\min_{\alpha \in \Lambda} J(\alpha). \qquad (8.6.3)$$
Also, we will use the term model as a synonym of hypothesis.
8.6.1 Error functions
The choice of an optimisation algorithm is strictly dependent on the form of the error function $J(\alpha)$. The function $J(\alpha)$ is directly determined by two factors:
1. the form of the model $h(\cdot, \alpha)$ with $\alpha \in \Lambda$,
2. the loss function $L(y, h(x, \alpha))$ for a generic $x$.
As far as the loss function is concerned, there are many possible choices depending on the type of data analysis problem. In regression problems, the goal is to model the conditional distribution of the output variable conditioned on the input variable (see Section 7.4), whose mean is the value minimising the mean squared error (Equation 3.3.29). This motivates the use of a quadratic loss
$$L(y, h(x, \alpha)) = (y - h(x, \alpha))^2 \qquad (8.6.4)$$
which gives to $J(\alpha)$ the form of a sum-of-squares.
For classification problems, the goal is to model the posterior probabilities of class membership, again conditioned on the input variables. Although the sum-of-squares $J$ can also be used for classification, there are more appropriate error functions to be considered [60]. The most used is cross-entropy, which derives from the adoption of the maximum-likelihood principle (Section 5.8) for supervised classification. Consider a classification problem where the output variable $y$ takes values in the set $\{c_1, \dots, c_K\}$ and $\text{Prob}\{y = c_j \mid x\}$, $j = 1, \dots, K$, is the conditional probability. Given a training dataset and a set of parametric models $\hat{P}_j(x, \alpha)$, $j = 1, \dots, K$, of the conditional distribution, the classification problem boils down to the minimisation of the quantity
$$J(\alpha) = -\sum_{i=1}^{N} \log \hat{P}_{y_i}(x_i, \alpha) \qquad (8.6.5)$$
Note that the models $\hat{P}_j(x, \alpha)$, $j = 1, \dots, K$, must satisfy two important constraints:
1. $\hat{P}_j(x, \alpha) > 0$,
2. $\sum_{j=1}^{K} \hat{P}_j(x, \alpha) = 1$.
In the case of a 0/1 binary classification problem, the cross-entropy is written as
$$J(\alpha) = -\sum_{i=1}^{N} \left[ y_i \log \hat{P}_1(x_i, \alpha) + (1 - y_i) \log(1 - \hat{P}_1(x_i, \alpha)) \right] \qquad (8.6.6)$$
where $\hat{P}_1(x_i, \alpha)$ is the estimate of the conditional probability of the class $y = 1$.
Since this chapter will focus mainly on regression problems, we will limit ourselves to the case of a quadratic loss function.
8.6.2 Parameter estimation
8.6.2.1 The linear least-squares method
The parametric identification of a linear model
$$h(x, \alpha) = \alpha^T x$$
is obtained by minimising the quadratic function $J(\alpha)$ with the well-known linear least-squares method. In Chapter 9 we will see the linear least-squares minimisation in detail. Here we just report that, in the case of non-singularity of the matrix $(X^T X)$, $J(\alpha)$ has a single global minimum in
$$\alpha_N = (X^T X)^{-1} X^T Y \qquad (8.6.7)$$
8.6.2.2 Iterative search methods
In general cases, when either the model is not linear or the cost function is not quadratic, $J(\alpha)$ can be a highly nonlinear function of the parameters $\alpha$, and there may exist many minima, all of which satisfy
$$\nabla J = 0 \qquad (8.6.8)$$
where $\nabla$ denotes the gradient of $J$ in parameter space. We will define as stationary points all the points which satisfy condition (8.6.8). They include local maxima, saddle points and minima. The minimum for which the value of the error function is the smallest is called the global minimum, while other minima are called local minima. As a consequence of the non-linearity of the error function $J(\alpha)$, it is in general not possible to find closed-form solutions for the minima. For more details on multivariate optimisation we refer the reader to [2].
We will consider iterative algorithms, which involve a search through the parameter space consisting of a succession of steps of the form
$$\alpha^{(\tau+1)} = \alpha^{(\tau)} + \Delta\alpha^{(\tau)} \qquad (8.6.9)$$
where $\tau$ labels the iteration step. Iterative algorithms differ in the choice of the increment $\Delta\alpha^{(\tau)}$.
In the following, we will present some gradient-based and non-gradient-based iterative algorithms. Note that each algorithm has a preferred domain of application and that it is not possible, or at least not fair, to recommend a single universal optimisation algorithm. We consider it much more interesting to highlight the relative advantages and limitations of the different approaches.
8.6.2.3 Gradient-based methods
In some cases, the analytic form of the error function makes it possible to evaluate the gradient of the cost function $J$ with respect to the parameters $\alpha$, increasing the rate of convergence of the iterative algorithm. Some examples of gradient-based methods are reported in the following sections. These methods require the derivatives of the cost function with respect to the model parameters. Such computation is not always easy for complex nonlinear mappings, but it has recently been facilitated by the appearance of automatic differentiation functionalities [17], like the ones made available by libraries such as TensorFlow or PyTorch.
8.6.2.4 Gradient descent
This is the simplest of the gradient-based optimisation algorithms, also known as steepest descent. The algorithm starts with some initial guess $\alpha^{(0)}$ for the parameter vector (often chosen at random). Then, it iteratively updates the parameter vector such that, at the $\tau$th step, the estimate is updated by moving a short distance in the direction of the negative gradient evaluated in $\alpha^{(\tau)}$:
$$\Delta\alpha^{(\tau)} = -\mu \nabla J(\alpha^{(\tau)}) \qquad (8.6.10)$$
where $\mu$ is called the learning rate. The updates are repeatedly executed until convergence, i.e. when further improvements are considered too small to be useful.
The gradient descent method is known to be a very inefficient procedure. One drawback is the need for a suitable value of the learning rate $\mu$. In fact, a decrease of the cost function is guaranteed by (8.6.10) only for learning rates of infinitesimal size: if its value is sufficiently small, the value of $J(\alpha^{(\tau)})$ is expected to decrease at each successive step, eventually leading to a parameter vector at which the condition (8.6.8) is satisfied. Too small learning rates may considerably delay the convergence, while too large rates might result in numerical overflows.
Further, at most points in the parameter space, the local gradient does not point directly towards the minimum: gradient descent then needs many small steps to reach a stationary point.
Example of gradient-based univariate optimisation
Let us consider the univariate function
$$J(\alpha) = \alpha^2 - 2\alpha + 3$$
visualised in Figure 8.1. By running the script optim.R you can visualise the gradient search of the minimum of the function. Note that the gradient is obtained by computing analytically the derivative
$$J'(\alpha) = 2\alpha - 2$$
We invite the reader to assess the impact of the learning rate $\mu$ on the convergence of the minimisation process.
The function
$$J(\alpha) = \alpha^4/4 - \alpha^3/3 - \alpha^2 + 2$$
with two local minima is shown in Figure 8.2 and minimised in the script optim2.R. We invite the reader to assess the impact of the initial value $\alpha^{(0)}$ of the solution on the result of the minimisation process.
•
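A minimal sketch (not the optim.R script itself) of the gradient descent iteration (8.6.10) applied to the first function above; the learning rate and the starting point are arbitrary illustrative choices:

# gradient descent on J(alpha) = alpha^2 - 2*alpha + 3
J     <- function(alpha) alpha^2 - 2*alpha + 3
gradJ <- function(alpha) 2*alpha - 2    # analytical derivative
mu    <- 0.1                            # learning rate (to be varied by the reader)
alpha <- 5                              # arbitrary initial guess alpha^(0)
for (tau in 1:50)
  alpha <- alpha - mu * gradJ(alpha)    # update (8.6.10)
alpha       # close to the global minimum alpha = 1
J(alpha)

•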
Figure 8.1: Gradient-based minimisation of the function $J(\alpha) = \alpha^2 - 2\alpha + 3$ with a single global minimum.
Figure 8.2: Gradient-based minimisation of the function $J(\alpha) = \alpha^4/4 - \alpha^3/3 - \alpha^2 + 2$.
Figure 8.3: Contour plot: gradient-based minimisation of the bivariate function $J(\alpha) = \alpha_1^2 + \alpha_2^2 - 2\alpha_1 - 2\alpha_2 + 6$ with a single global minimum.
Example of gradient-based bivariate optimisation
Let us consider the bivariate function
$$J(\alpha) = \alpha_1^2 + \alpha_2^2 - 2\alpha_1 - 2\alpha_2 + 6$$
whose contour plot is visualised in Figure 8.3. By running the script optim2D.R you can visualise the gradient search of the minimum of the function. Note that the gradient is obtained by computing analytically the gradient vector
$$\nabla J(\alpha) = [2\alpha_1 - 2, \; 2\alpha_2 - 2]^T$$
We invite the reader to assess the impact of the learning rate $\mu$ on the convergence of the minimisation process.
The contour plot of the function
$$J(\alpha) = \frac{\alpha_1^4 + \alpha_2^4}{4} - \frac{\alpha_1^3 + \alpha_2^3}{3} - \alpha_1^2 - \alpha_2^2 + 4$$
with three local minima is shown in Figure 8.4 and minimised in the script optim2D2.R. We invite the reader to assess the impact of the initial value $\alpha^{(0)}$ of the solution on the result of the minimisation process.

Figure 8.4: Contour plot: gradient-based minimisation of the function $J(\alpha) = \frac{\alpha_1^4 + \alpha_2^4}{4} - \frac{\alpha_1^3 + \alpha_2^3}{3} - \alpha_1^2 - \alpha_2^2 + 4$.
•
As an alternative to the simple gradient descent, there are many iterative methods in the literature, such as the momentum-based method [155], the enhanced gradient descent method [191] and the conjugate gradient techniques [157], which make implicit use of second-order derivatives of the error function.
In the following section, we present instead a class of algorithms that make explicit use of second-order information.
8.6.2.5 The Newton method
Newton's method is a well-known example in the optimisation literature. It is an iterative algorithm which uses at the $\tau$th step a local quadratic approximation in the neighbourhood of $\alpha^{(\tau)}$
$$\hat{J}(\alpha) = J(\alpha^{(\tau)}) + (\alpha - \alpha^{(\tau)})^T \nabla J(\alpha^{(\tau)}) + \frac{1}{2} (\alpha - \alpha^{(\tau)})^T H(\alpha^{(\tau)}) (\alpha - \alpha^{(\tau)}) \qquad (8.6.11)$$
where $H(\alpha^{(\tau)})$ is the Hessian matrix of $J(\alpha)$ computed in $\alpha^{(\tau)}$ and
$$H(\alpha) = \begin{bmatrix} \frac{\partial^2 J}{\partial \alpha_1^2} & \frac{\partial^2 J}{\partial \alpha_1 \partial \alpha_2} & \dots & \frac{\partial^2 J}{\partial \alpha_1 \partial \alpha_p} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 J}{\partial \alpha_p \partial \alpha_1} & \frac{\partial^2 J}{\partial \alpha_p \partial \alpha_2} & \dots & \frac{\partial^2 J}{\partial \alpha_p^2} \end{bmatrix}$$
is a $[p \times p]$ square matrix of second-order partial derivatives if $\alpha \in \mathbb{R}^p$.
The minimum of (8.6.11) satisfies
$$\alpha_{\min} = \alpha^{(\tau)} - H^{-1}(\alpha^{(\tau)}) \nabla J(\alpha^{(\tau)}) \qquad (8.6.12)$$
where the vector $H^{-1}(\alpha^{(\tau)}) \nabla J(\alpha^{(\tau)})$ is denoted as the Newton direction or the Newton step and forms the basis for the iterative strategy
$$\alpha^{(\tau+1)} = \alpha^{(\tau)} - H^{-1}(\alpha^{(\tau)}) \nabla J(\alpha^{(\tau)}) \qquad (8.6.13)$$
There are several difficulties with such an approach, mainly related to the prohibitive computational demand. Alternative approaches, known as quasi-Newton or variable metric methods, are based on (8.6.12) but, instead of calculating the Hessian directly and then evaluating its inverse, they build up an approximation to the inverse Hessian. The two most commonly used update formulae are the Davidon-Fletcher-Powell (DFP) and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) procedures [128].
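In R, a quasi-Newton search is directly available through the built-in optim function. A minimal sketch on the univariate function of the earlier example (the function and the starting point are our own illustrative choices):

J <- function(alpha) alpha^2 - 2*alpha + 3
# BFGS quasi-Newton search from the arbitrary initial guess alpha^(0) = 5;
# the gradient is supplied analytically through the gr argument
res <- optim(par = 5, fn = J, gr = function(alpha) 2*alpha - 2, method = "BFGS")
res$par   # approximately 1, the global minimum

•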
8.6.2.6 The Levenberg-Marquardt algorithm
This algorithm is designed specifically for minimising a sum-of-squares error function
$$J(\alpha) = \frac{1}{2} \sum_{i=1}^{N} L(y_i, h(x_i, \alpha)) = \frac{1}{2} \sum_{i=1}^{N} e_i^2 = \frac{1}{2} \|e\|^2 \qquad (8.6.14)$$
where $e_i$ is the error for the $i$th training case, $e$ is the $[N \times 1]$ vector of errors and $\|\cdot\|$ is the 2-norm. Let us consider an iterative step in the parameter space
$$\alpha^{(\tau)} \to \alpha^{(\tau+1)} \qquad (8.6.15)$$
If the step (8.6.15) is sufficiently small, the error vector $e$ can be expanded in a first-order Taylor series form:
$$e(\alpha^{(\tau+1)}) = e(\alpha^{(\tau)}) + E(\alpha^{(\tau+1)} - \alpha^{(\tau)}) \qquad (8.6.16)$$
where the generic element of the matrix $E$ has the form
$$E_{ij} = \frac{\partial e_i}{\partial \alpha_j} \qquad (8.6.17)$$
and $\alpha_j$ is the $j$th element of the vector $\alpha$. The error function can then be approximated by
$$J(\alpha^{(\tau+1)}) = \frac{1}{2} \left\| e(\alpha^{(\tau)}) + E(\alpha^{(\tau+1)} - \alpha^{(\tau)}) \right\|^2 \qquad (8.6.18)$$
If we minimise with respect to $\alpha^{(\tau+1)}$ we obtain:
$$\alpha^{(\tau+1)} = \alpha^{(\tau)} - (E^T E)^{-1} E^T e(\alpha^{(\tau)}) \qquad (8.6.19)$$
where $(E^T E)^{-1} E^T$ is the pseudo-inverse of the matrix $E$. For the sum-of-squares error function (8.6.14), the elements of the Hessian take the form
$$H_{jk} = \frac{\partial^2 J}{\partial \alpha_j \partial \alpha_k} = \sum_{i=1}^{N} \left( \frac{\partial e_i}{\partial \alpha_j} \frac{\partial e_i}{\partial \alpha_k} + e_i \frac{\partial^2 e_i}{\partial \alpha_j \partial \alpha_k} \right) \qquad (8.6.20)$$
Neglecting the second term, the Hessian can be written in the form
$$H \simeq E^T E \qquad (8.6.21)$$
This relation is exact in the case of linear models, while in the case of nonlinearities it represents an approximation that holds exactly only at the global minimum of the function [25]. The update formula (8.6.19) could be used as the step of an iterative algorithm. However, the problem with such an approach could be a too large step size returned by (8.6.19), making the linear approximation no longer valid.
The idea of the Levenberg-Marquardt algorithm is to use the iterative step while at the same time trying to keep the step size small so as to guarantee the validity of the linear approximation. This is achieved by modifying the error function in the form
$$J_{lm} = \frac{1}{2} \left\| e(\alpha^{(\tau)}) + E(\alpha^{(\tau+1)} - \alpha^{(\tau)}) \right\|^2 + \lambda \left\| \alpha^{(\tau+1)} - \alpha^{(\tau)} \right\|^2 \qquad (8.6.22)$$
where $\lambda$ is a parameter that governs the step size. The minimisation of the error function (8.6.22) ensures, at the same time, the minimisation of the sum-of-squares cost and a small step size. Minimising (8.6.22) with respect to $\alpha^{(\tau+1)}$ we obtain
$$\alpha^{(\tau+1)} = \alpha^{(\tau)} - (E^T E + \lambda I)^{-1} E^T e(\alpha^{(\tau)}) \qquad (8.6.23)$$
where $I$ is the identity matrix. For very small values of $\lambda$ we have the Newton formula, while for large values of $\lambda$ we recover the standard gradient descent.
A common approach for setting $\lambda$ is to begin with some arbitrary low value (e.g. $\lambda = 0.1$) and at each step (8.6.23) check the change in $J$. If $J$ decreases, the new parameter is retained, $\lambda$ is decreased (e.g. by a factor of 10), and the process repeated. Otherwise, if $J$ increases after the step (8.6.23), the old parameter is restored, $\lambda$ is increased, and a new step performed. The procedure is iterated until a decrease in $J$ is obtained [25].
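A minimal sketch of the Levenberg-Marquardt iteration (8.6.23); the exponential model, the simulated data and the initial values are our own arbitrary choices:

# Levenberg-Marquardt for h(x, a) = a1 * exp(a2 * x) on simulated data
set.seed(0)
x <- seq(0, 2, by = 0.1)
y <- 2 * exp(0.8 * x) + rnorm(length(x), sd = 0.1)
e  <- function(a) y - a[1] * exp(a[2] * x)             # error vector e(alpha)
Jf <- function(a) 0.5 * sum(e(a)^2)                    # J = 0.5 ||e||^2
# E[i, j] = d e_i / d alpha_j, computed analytically
Egrad <- function(a) cbind(-exp(a[2] * x), -a[1] * x * exp(a[2] * x))
a <- c(1, 1); lambda <- 0.1
for (iter in 1:100) {
  E <- Egrad(a)
  step <- solve(t(E) %*% E + lambda * diag(2), t(E) %*% e(a))  # step (8.6.23)
  anew <- a - as.vector(step)
  if (Jf(anew) < Jf(a)) { a <- anew; lambda <- lambda / 10 }   # success: keep, decrease lambda
  else                    lambda <- lambda * 10                # failure: restore, increase lambda
}
a   # close to the true parameters (2, 0.8)

•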
8.6.3 Online gradient-based algorithms
The algorithms above are called batch since they compute the gradient of the quantity (8.6.2) over the entire training set. In the case of very large datasets or sequential settings, this procedure is not recommended since it requires the storage of the entire dataset. For this reason, online modifications of the batch algorithms have been proposed in the literature. The idea consists of replacing the gradient computed on the entire training set with a gradient computed on the basis of a single data point
$$\alpha^{(\tau+1)} = \alpha^{(\tau)} - \mu^{(\tau)} \nabla_{\alpha} J^{(\tau)}(\alpha^{(\tau)}) \qquad (8.6.24)$$
where
$$J^{(\tau)}(\alpha^{(\tau)}) = (y_\tau - h(x_\tau, \alpha^{(\tau)}))^2$$
and $z_\tau = \langle x_\tau, y_\tau \rangle$ is the input/output observation at the $\tau$th instant.
The underlying assumption is that the training error obtained by replacing the average with a single term will not perturb the average behaviour of the algorithm. Note also that the dynamics of $\mu^{(\tau)}$ plays an important role in the convergence.
This algorithm can be easily used in an adaptive online setting where no training set needs to be stored, and observations are processed immediately to improve performance. A linear version of the iteration (8.6.24) is the Recursive Least Squares regression algorithm presented in Section 9.1.20. Note also that the earliest machine learning algorithms were based on sequential gradient-based minimisation. Well-known examples are Adaline and LMS [195].
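A minimal sketch of the online update (8.6.24) for a univariate linear model, in the spirit of LMS (the data and the learning-rate schedule are arbitrary illustrative choices):

# online (stochastic) gradient for the linear model h(x, a) = a * x
set.seed(0)
N <- 500
x <- rnorm(N); y <- 3 * x + rnorm(N, sd = 0.5)
a <- 0
for (tau in 1:N) {
  mu  <- 0.1 / (1 + tau / 100)        # decreasing learning rate mu^(tau)
  err <- y[tau] - a * x[tau]
  a   <- a + 2 * mu * err * x[tau]    # gradient of (y - a x)^2 w.r.t. a is -2 err x
}
a   # close to the true coefficient 3

•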
8.6.4 Alternatives to gradient-based methods
Virtually no gradient-based method is guaranteed to find the global optimum of
a complex nonlinear error function. Additionally, all descent methods are deter-
ministic in the sense that they inevitably lead to convergence to the nearest local
minimum. As a consequence, the way a deterministic method is initialised is decisive
for the final result.
Further, in many practical situations, the gradient-based computation is time-consuming or extremely difficult due to the complexity of the objective function. For these reasons, many derivative-free and stochastic alternatives to gradient-based methods have been explored in the literature. We will limit ourselves to citing the most common solutions:
Random search methods. They are iterative methods that are primarily used
for continuous optimisation problems. Random search methods explore the
parameter space of the error function sequentially in a random fashion to find
the global minimum. Their strength lies mainly in their simplicity, which
makes these methods easily understood and conveniently customised for spe-
cific applications. Moreover, it has been demonstrated that they converge
to the global optimum with probability one on a compact set. However, the
theoretical result of convergence to the minimum is not really important here
since the optimisation process could take a prohibitively long time.
Genetic algorithms. They are derivative-free stochastic optimisation methods
based loosely on the concepts of natural selection and evolutionary processes
[82]. Important properties are the strong parallelism and the possibility of being applied to both continuous and discrete optimisation problems. Typically,
Genetic Algorithms (GA) encode each parameter solution into a binary bit
string (chromosome) and associate each solution with a fitness value. GAs
usually keep a set of solutions (population ) which is evolved repeatedly toward
a better overall fitness value. In each generation, the GA constructs a new
population using genetic operators such as crossover or mutation; members
with higher fitness values are more likely to survive and to participate in fu-
ture operations. After a number of generations, the population is expected to
contain members with better fitness values and to converge, under particular
conditions, to the optimum.
Simulated annealing. It is another derivative-free method suitable for continuous and discrete optimisation problems. In Simulated Annealing (SA), the value of the cost function $J(\alpha)$ to be minimised is put in analogy to the energy in a thermodynamic system at a certain temperature $T$ [115]. At high temperatures $T^{(\tau)}$, the SA technique allows function evaluations at points far away from $\alpha^{(\tau)}$, and it is likely to accept a new parameter value with a higher function value. The decision whether to accept or reject a new parameter value $\alpha^{(\tau+1)}$ is based on the value of an acceptance function, generally shaped as the Boltzmann probability distribution. At low temperatures, SA evaluates the objective function at more local points, and the likelihood of accepting a new point with a higher cost is much lower. An annealing schedule regulates how rapidly the temperature $T$ goes from high values to low values as a function of time or iteration counts.
R script
The script grad.R compares four parameter identification algorithms in the case
of a univariate linear model: least-squares, random search, gradient-based and
Levenberg-Marquardt.
•
8.7 Regularisation
Parameter identification relies on Empirical Risk Minimisation to return an estimator in supervised learning problems. In Section 7.7, we stressed that the accuracy of such an estimator depends on the bias/variance trade-off, which is typically controlled by capacity-related hyper-parameters. There is, however, another important strategy, called regularisation, to control the bias/variance trade-off by constraining the ERM problem. The rationale consists in restricting the set of possible solutions by transforming the unconstrained problem (8.6.3) into a constrained one. An example of constrained minimisation is
$$\alpha_N = \arg\min_{\alpha \in \Lambda} J(\alpha) + \lambda \|\alpha\|^2 \qquad (8.7.25)$$
where $\lambda > 0$ is the regularisation parameter. By adding the squared norm term, solutions with large values of the components of $\alpha$ are penalised unless those components play a major role in the $J(\alpha)$ term. An alternative version is
$$\alpha_N = \arg\min_{\alpha \in \Lambda} J(\alpha) + \lambda S(h(\cdot, \alpha)) \qquad (8.7.26)$$
where the term $S$ penalises non-smooth, wiggling hypothesis functions. An example is
$$S(\alpha) = \int (h''(x, \alpha))^2 \, dx$$
where the integral of the second derivative is a measure of lack of smoothness.
Regularisation is a well-known strategy in numerical analysis and optimisation to avoid or limit the ill-conditioning of the solution. In estimation, regularisation is an additional way to control the variance of the estimator resulting from the optimisation procedure. This is particularly effective in learning problems with a number of observations comparable to, or even smaller than, the input dimension.
Note, however, that variance reduction occurs at the cost of both increased bias and additional complexity of the optimisation problem. For instance, according to the nature of the regularisation and the nonlinearity of the hypothesis function, we could lose some interesting properties like the closed form of the solution. In Chapter 12, we will show some examples of regularisation to address the curse of dimensionality problem.
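As an illustration of (8.7.25) with a quadratic $J$, the constrained problem admits the closed-form ridge solution $\alpha_N = (X^T X + \lambda I)^{-1} X^T Y$. A minimal sketch on simulated data (the sizes and $\lambda$ are arbitrary choices):

# ridge regularisation: alpha = (X^T X + lambda I)^{-1} X^T Y
set.seed(0)
N <- 30; n <- 20                       # few observations w.r.t. the input dimension
X <- matrix(rnorm(N * n), N, n)
Y <- X %*% c(rep(1, 5), rep(0, n - 5)) + rnorm(N)
lambda <- 1
alpha.ridge <- solve(t(X) %*% X + lambda * diag(n), t(X) %*% Y)
alpha.ls    <- solve(t(X) %*% X, t(X) %*% Y)    # unregularised least squares
c(sum(alpha.ridge^2), sum(alpha.ls^2))          # ridge shrinks the parameter norm

•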
8.8 Structural identification
Once a class of models $\Lambda$ is given, the identification procedure described above returns a model $h(\cdot, \alpha_N)$ defined by the set of parameters $\alpha_N \in \Lambda$.
The choice of an appropriate class or model structure [125] is, however, the most
crucial aspect for a successful modelling application. The procedure of selecting the
best model structure is called structural identification. The structure of a model
is made of a series of features that influence the generalisation power of the model
itself. Among others, there are:
•The type of the model. We can distinguish, for example, between nonlinear and linear models, between physically based and black-box representations, and between continuous-time and discrete-time systems.
•The size of the model. This is related to features like the number of inputs
(regressors), the number of parameters, the degree of the polynomials in a
polynomial model, the number of neurons in a neural network, the number of
nodes in a classification tree, etc.
In general terms, structural identification requires (i) a procedure for proposing
a series of alternative model structures, (ii) a method for assessing each of these
alternatives and (iii) a technique for choosing among the available candidates.
We denote the first issue as model generation. Some techniques for obtaining
different candidates to the final model structure are presented in Section 8.8.1.
The second issue concerns the important problem of model validation, which will be extensively dealt with in Section 8.8.2.
Once models have been generated and validated, the last step is the model selection, which will be discussed in Section 8.8.3.
It is important to remark that a selected model structure should never be ac-
cepted as a final and true description of the phenomenon. Rather, it should be
regarded as a good enough description, given the available dataset.
8.8.1 Model generation
The goal of the model generation procedure is to generate a set of candidate model
structures among which the best one is to be selected. The more effective this procedure is, the easier it will be to select a powerful structure at the end of the whole identification. Traditionally, there have been a number of popular ways to search through a large collection of model structures. Maron and Moore [131] distinguish between two main methods of model generation:
Brute force. This is the exhaustive approach. Every possible model structure is
generated in order to be evaluated.
Consider, for instance, the problem of selecting the best structure in a 3-
layer Feed Forward Neural Network architecture. The brute force approach
consists of enumerating all the possible configurations in terms of the number
of neurons.
The exhaustive algorithm is generally unacceptably slow for complex architectures with a large number of structural parameters. Its only advantage is that this method is guaranteed to return the best learner according to a specified assessment measure.
Search methods. These methods treat the collection of models as a continuous
and differentiable surface. They start at some point on the surface and search
for the model structure that corresponds to the minimum of the generalisation
error until some stop condition is met. This procedure is much faster than
brute force since it does not need to explore all the space. It only needs to
validate those models that are on the search path. Gradient-based techniques
and/or non-gradient-based methods can be used for the search in the model space. Besides the well-known problem related to local minima in the gradient-based case, a more serious issue derives from the structure of the model selection procedure. At every step of the search algorithm, we need to find
a collection of models that are near or related to the current model. Both
gradient-based and non-gradient-based techniques require some metric in the
search space. This implies a notion of model distance, difficult to define in a
general model selection problem. Examples of search methods in model gen-
eration are the growing and pruning techniques in Neural Networks structural
identification [18].
8.8.2 Validation
The output of the model generation procedure is a set of model structures $\Lambda_s$, $s = 1, \dots, S$. Once the parametric identification is performed on each of these model structures, we have a set of models $h(\cdot, \alpha_N^s)$ identified according to the Empirical Risk Minimisation principle.
Now, the prediction quality of each of the model structures $\Lambda_s$, $s = 1, \dots, S$, has to be assessed on the basis of the available data. In principle, the assessment procedure, known as model validation, could measure the goodness of a structure in many different ways: how the model relates to our a priori knowledge, how easy the model is to use, to implement or to interpret. In this book, as stated in the introduction, we will focus only on criteria of accuracy, neglecting any other criterion of quality.
In the following, we will present the most common techniques to assess a model
on the basis of a finite set of observations.
8.8.2.1 Testing
An obvious way to assess the quality of a learned model is by using a testing sequence
$$D_{ts} = (\langle x_{N+1}, y_{N+1} \rangle, \dots, \langle x_{N+N_{ts}}, y_{N+N_{ts}} \rangle) \qquad (8.8.27)$$
that is, a sequence of i.i.d. pairs, independent of $D_N$ and distributed according to the probability distribution $P(x, y)$ defined in (7.2.2). The testing estimator is defined by the sample mean
$$\hat{R}_{ts}(\alpha_N^s) = \frac{1}{N_{ts}} \sum_{j=N+1}^{N+N_{ts}} (y_j - h(x_j, \alpha_N^s))^2 \qquad (8.8.28)$$
This estimator is clearly unbiased in the sense that
$$E_{D_{ts}}[\hat{R}_{ts}(\alpha_N^s)] = R(\alpha_N^s) \qquad (8.8.29)$$
When the number of available examples is sufficiently high, the testing technique
is an effective validation technique at a low computational cost. A serious problem
concerning the practical applicability of this estimate is that it requires a large,
independent testing sequence. In practice, unfortunately, an additional set of in-
put/output observations is rarely available.
8.8.2.2 Holdout
The holdout method, sometimes called test sample estimation, partitions the data $D_N$ into two mutually exclusive subsets, the training set $D_{tr}$ and the holdout or test set $D_{ts}$. It is common to designate 2/3 of the data as training set and the remaining 1/3 as test set. However, when the training set has a reduced number of cases, the method can present a series of shortcomings, mainly due to the strong dependence of the prediction accuracy on the repartition of the data between training and validation set. Assuming that the error $R(\alpha_{N_{tr}})$ decreases as more cases are inserted in the training set, the holdout method is a pessimistic estimator since only a reduced amount of data is used for training. The larger the number of points used for the test set, the higher the bias of the estimate $\alpha_{N_{tr}}$; at the same time, fewer test points imply a larger confidence interval of the estimate of the generalisation error.
8.8.2.3 Cross-validation in practice
In Chapter 7 we focused on the theoretical properties of cross-validation and boot-
strap. Here we will see some more practical details on these validation proce-
dures, commonly grouped under the name of computer-intensive statistical methods
(CISM) [101].
Consider a learning problem with a training set of size N.
In l -fold cross-validation the available points are randomly divided into l mutu-
ally exclusive test partitions of approximately equal size. The examples not found
in each test partition are independently used for selecting the hypothesis, which
will be tested on the partition itself (Fig. 7.19). The average error over all the l
partitions is the cross-validated error rate.
A special case is the leave-one-out (l-o-o). For a given algorithm and a dataset $D_N$, a hypothesis is generated using $N - 1$ observations and tested on the single remaining one. In leave-one-out, cross-validation is repeated $l = N$ times, each data point is used as a test case, and each time nearly all the examples are used to design a hypothesis. The error estimate is the average over the $N$ repetitions.
In a general nonlinear case, leave-one-out is computationally quite expensive.
This is not true for a linear model where the PRESS l-o-o statistic is computed as
a by-product of the least-squares regression (Section 9.1.17).
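A minimal sketch of l-fold cross-validation for a linear model (the data-generating process and l = 10 are arbitrary illustrative choices):

# 10-fold cross-validation of the linear model y ~ x
set.seed(0)
N <- 100
D <- data.frame(x = rnorm(N)); D$y <- 1 + 2 * D$x + rnorm(N)
l <- 10
fold <- sample(rep(1:l, length.out = N))   # random partition into l folds
cv.err <- numeric(l)
for (k in 1:l) {
  mdl <- lm(y ~ x, data = D[fold != k, ])              # identification on l-1 folds
  cv.err[k] <- mean((D$y[fold == k] -
                     predict(mdl, D[fold == k, ]))^2)  # test on the k-th fold
}
mean(cv.err)   # cross-validated error

•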
8.8.2.4 Bootstrap in practice
Bootstrap is a resampling technique that samples the training set with replacement to return a nonparametric estimate of the desired statistic.
There are many bootstrap estimators, but two are the most commonly used in model validation: the E0 and the E632 bootstrap.
The E0 bootstrap estimator, denoted by $\hat{G}_b$ in (7.10.62), samples with replacement from the original training set $B$ bootstrap training sets, each consisting of $N$ cases. The cases not found in the training group form the test groups. The average error rate on the $B$ test groups is the E0 estimate of the generalisation error.
The rationale for the E632 technique is given by Efron [63]. He argues that, while the resubstitution error $R_{emp}$ is the error rate for patterns that are "zero" distance from the training set, patterns contributing to the E0 estimate can be considered as too far out from the training set. Since the asymptotic probability that a pattern will not be included in a bootstrap sample is
$$(1 - 1/N)^N \approx e^{-1} \approx 0.368$$
the weighted average of $R_{emp}$ and E0 should involve patterns at the "right" distance from the training set in estimating the error rate:
$$\hat{G}_{E632} = 0.368 \, R_{emp} + 0.632 \, \hat{G}_b \qquad (8.8.30)$$
where the quantity $\hat{G}_b$ is defined in (7.10.62). The choice of $B$ is not critical as long as it exceeds 100. Efron [63] suggests, however, that $B$ need not be greater than 200.
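A minimal sketch of the E0 and E632 estimates for a linear model (the data and B = 200, following the suggestion above, are our own choices):

# E0 and E632 bootstrap estimates of the generalisation error of lm(y ~ x)
set.seed(0)
N <- 50
D <- data.frame(x = rnorm(N)); D$y <- 1 + 2 * D$x + rnorm(N)
B <- 200
err <- numeric(B)
for (b in 1:B) {
  idx <- sample(1:N, replace = TRUE)     # bootstrap training set
  out <- setdiff(1:N, idx)               # left-out cases form the test group
  mdl <- lm(y ~ x, data = D[idx, ])
  err[b] <- mean((D$y[out] - predict(mdl, D[out, ]))^2)
}
E0   <- mean(err)                                 # E0 estimate
Remp <- mean(residuals(lm(y ~ x, data = D))^2)    # resubstitution error
E632 <- 0.368 * Remp + 0.632 * E0                 # E632 estimate (8.8.30)
c(E0 = E0, E632 = E632)

•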
There are a lot of experimental results on the comparison between cross-validation and bootstrap methods for assessing models [107], [116]. In general terms, only some guidelines can be given to the practitioner [194]:
•For training set sizes greater than 100, use cross-validation; either 10-fold cross-validation or leave-one-out is acceptable.
•For training set sizes less than 100, use leave-one-out.
•For very small training sets ($N < 50$), in addition to the leave-one-out estimator, the $\hat{G}_{E632}$ and the $\hat{G}_{boot}$ estimates may be useful measures.
8.8.2.5 Complexity based criteria
In conventional statistics, various criteria have been developed, often in the context
of linear models, for assessing the generalisation performance of the learned hypoth-
esis without the use of further validation data. Such criteria aim to understand the
relationship between the generalisation performance and the training error. Generally, they take the form of a prediction error, which consists of the sum of two terms
$$\hat{G}_{PE} = R_{emp} + \text{complexity term} \qquad (8.8.31)$$
where the complexity (or capacity) term represents a penalty that grows as the number of free parameters in the model grows.
This expression quantifies the qualitative consideration that simple models re-
turn high empirical risk with a reduced complexity term while complex models have
a low empirical risk thanks to the high number of parameters. The minimum for
the criterion (8.8.31) represents a trade-off between performance on the training set
and complexity. Note that the bound (7.6.33) derived from the Vapnik learning
theory agrees with the relation (8.8.31) if we take the functional risk as a measure
of generalisation.
Let us consider a quadratic loss function and the quantity
$$\widehat{\text{MISE}}_{emp} = R_{emp}(\alpha_N) = \min_{\alpha} \frac{\sum_{i=1}^{N} (y_i - h(x_i, \alpha))^2}{N}$$
If the input/output relation is linear and $n$ is the number of input variables, well-known examples of complexity based criteria are:
1. the Final Prediction Error (FPE) (see Section 9.1.16.2 and [6])
$$\text{FPE} = \widehat{\text{MISE}}_{emp} \, \frac{1 + p/N}{1 - p/N} \qquad (8.8.32)$$
with $p = n + 1$,
2. the Predicted Squared Error (PSE) (see Section 9.1.16.2)
$$\text{PSE} = \widehat{\text{MISE}}_{emp} + 2 \hat{\sigma}^2_w \, \frac{p}{N} \qquad (8.8.33)$$
where $\hat{\sigma}^2_w$ is an estimate of the noise variance. This quantity is also known as Mallows' $C_p$ statistic [130],
3. the Generalised Cross-Validation (GCV) [48]
$$\text{GCV} = \widehat{\text{MISE}}_{emp} \, \frac{1}{(1 - p/N)^2} \qquad (8.8.34)$$
A comparative analysis of these different measures is reported in [14].
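A minimal sketch computing the three criteria (the numerical values of the empirical risk, the noise variance estimate, p and N are invented for illustration):

# FPE, PSE and GCV for a given empirical risk, p parameters and N observations
MISEemp <- 0.25; p <- 4; N <- 100; sigma2w <- 0.3   # hypothetical values
FPE <- MISEemp * (1 + p / N) / (1 - p / N)          # (8.8.32)
PSE <- MISEemp + 2 * sigma2w * p / N                # (8.8.33)
GCV <- MISEemp / (1 - p / N)^2                      # (8.8.34)
c(FPE = FPE, PSE = PSE, GCV = GCV)

•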
These estimates are computed assuming a linear model underlying the data. Moody [133] introduced the Generalised Prediction Error (GPE) as an estimate of the prediction risk for generic biased nonlinear models. The algebraic form is:
$$\text{GPE} = \widehat{\text{MISE}}_{emp} + \frac{2}{N} \, \text{tr}(\hat{V} \hat{R}) \qquad (8.8.35)$$
where $\text{tr}(\cdot)$ denotes the trace, $\hat{V}$ is a nonlinear generalisation of the estimated noise covariance matrix of the target and $\hat{R}$ is the estimated generalised influence matrix. GPE can be expressed in an equivalent form as:
$$\text{GPE} = \widehat{\text{MISE}}_{emp} + 2 \hat{\sigma}^2_{eff} \, \frac{\hat{p}_{eff}}{N} \qquad (8.8.36)$$
where $\hat{p}_{eff} = \text{tr}(\hat{R})$ is the estimated effective number of model parameters and $\hat{\sigma}^2_{eff} = \text{tr}(\hat{V}\hat{R}) / \text{tr}(\hat{R})$ is the estimated effective noise variance in the data. For nonlinear models, $\hat{p}_{eff}$ is generally not equal to the number of parameters (e.g. the number of weights in a neural network). When the noise in the target variables is assumed to be independent with uniform variance and the squared error loss function is used, (8.8.36) simplifies to:
$$\text{GPE} = \widehat{\text{MISE}}_{emp} + 2 \hat{\sigma}^2_w \, \frac{\hat{p}_{eff}}{N} \qquad (8.8.37)$$
In the neural network literature, another well-known form of complexity-based criterion is the weight decay technique
$$U(\lambda, \alpha, D_N) = \sum_{i=1}^{N} (y_i - h(x_i, \alpha))^2 + \lambda g(\alpha) \qquad (8.8.38)$$
where the second term penalises either small, medium or large weights of the neurons depending on the form of $g(\cdot)$. Two common examples of weight decay functions are the ridge regression form $g(\alpha) = \alpha^2$, which penalises large weights, and the Rumelhart form $g(\alpha) = \alpha^2 / (\alpha_0^2 + \alpha^2)$, which penalises weights of intermediate values near $\alpha_0$.
Several roughness penalties, like
$$\int [h''(x)]^2 \, dx$$
have been proposed too. Their aim is to penalise hypothesis functions that vary too rapidly by controlling large values of the second derivative of $h$.
Another important method for model validation is the minimum description length principle proposed by Rissanen [161]. This method proposes to choose the model which induces the shortest description of the available data. Rissanen and Barron [14] have each shown a qualitative similarity between this principle and the complexity-based approaches. For further details refer to the cited works.
8.8.2.6 A comparison of validation methods
Computer intensive statistical methods are relatively new and must be measured against more established statistical methods, such as the complexity-based criteria. In the following, we summarise some practical arguments in favour of one or the other method. The benefits of a CISM method are:
•All the assumptions of prior knowledge on the process underlying the data
are discarded.
•The validation technique replaces theoretical analysis by computation.
•Results are generally much easier to grasp for non-theorists.
•No assumption on the statistical properties of noise is required.
•They return an estimate of the model precision and an interval of confidence.
Arguments in favour of complexity criteria are:
•The whole dataset can be used for estimating the prediction performance and
no partitioning is required.
•Results valid for linear models remain valid to the extent that the nonlin-
ear model can be approximated by some first-order Taylor expansion in the
parameters.
Some results in the literature show the relation between resampling and complexity-based methods. For example, an asymptotic relation between a kind of cross-validation and Akaike's measure was derived by Stone [177], under the assumptions that the real model $\alpha^*$ is contained in the class of hypotheses $\Lambda$ and that there is a unique minimum for the log-likelihood.
Here we will make the assumption that no a priori information about the correct structure or the quasi-linearity of the process is available. This leads us to consider computer-intensive methods as the preferred way to validate the learning algorithms.
8.8.3 Model selection criteria
Model selection concerns the final choice of the model structure in the set that
has been proposed by model generation and assessed by model validation. In real
problems, this choice is typically a subjective issue and is often the result of a
compromise between different factors, like the quantitative measures, the personal
experience of the designer and the effort required to implement a particular model
in practice.
Here we will reduce the subjectivity factors to zero, focusing only on a quan-
titative criterion of choice. This means that the structure selection procedure will
be based only on the indices returned by the methods of Section 8.8.2. We distin-
guish between two possible quantitative approaches: the winner-takes-all and the
combination of estimators approach.
8.8.3.1 The winner-takes-all approach
This approach chooses the model structure that minimises the generalisation error
according to one of the criteria described in Section 8.8.2.
Consider a set of candidate model structures $\Lambda_s$, $s = 1, \dots, S$, and an associated measure $\hat{G}(\Lambda_s)$ which quantifies the generalisation error.
The winner-takes-all method simply picks the structure
$$\bar{s} = \arg\min_{s = 1, \dots, S} \hat{G}(\Lambda_s) \qquad (8.8.39)$$
that minimises the generalisation error. The model returned as the final outcome of the learning process is then $h(\cdot, \alpha_N^{\bar{s}})$.
From a practitioner's perspective, it may be useful to make explicit the entire winner-takes-all procedure in terms of pseudo-code. Below is a compact pseudo-code detailing the structural, parametric, validation and selection steps in the case of a leave-one-out assessment.
1. for $s = 1, \dots, S$ (structural loop):
   • for $j = 1, \dots, N$:
     (a) inner parametric identification (for l-o-o):
         $\alpha_{N-1}^s = \arg\min_{\alpha \in \Lambda_s} \sum_{i=1,\dots,N, \, i \ne j} (y_i - h(x_i, \alpha))^2$
     (b) $e_j = y_j - h(x_j, \alpha_{N-1}^s)$
   • $\widehat{\text{MISE}}_{LOO}(s) = \frac{1}{N} \sum_{j=1}^{N} e_j^2$
2. Model selection: $\bar{s} = \arg\min_{s = 1, \dots, S} \widehat{\text{MISE}}_{LOO}(s)$
3. Final parametric identification: $\alpha_N^{\bar{s}} = \arg\min_{\alpha \in \Lambda_{\bar{s}}} \sum_{i=1}^{N} (y_i - h(x_i, \alpha))^2$
4. The output prediction model is $h(\cdot, \alpha_N^{\bar{s}})$
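A minimal runnable R counterpart of the pseudo-code above, where the candidate structures are polynomial models of degree s = 1, ..., S (the data-generating process and S = 5 are our own illustrative choices):

# winner-takes-all among polynomial degrees by leave-one-out
set.seed(0)
N <- 50
x <- runif(N, -2, 2); y <- sin(2 * x) + rnorm(N, sd = 0.3)
S <- 5
mise.loo <- numeric(S)
for (s in 1:S) {                 # structural loop
  e <- numeric(N)
  for (j in 1:N) {               # leave-one-out loop
    mdl <- lm(y ~ poly(x, s), data = data.frame(x = x[-j], y = y[-j]))
    e[j] <- y[j] - predict(mdl, data.frame(x = x[j]))
  }
  mise.loo[s] <- mean(e^2)
}
s.bar <- which.min(mise.loo)     # model selection
final <- lm(y ~ poly(x, s.bar))  # final parametric identification on all the data
s.bar

•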
8.8.3.2 The combination of estimators approach
The winner-takes-all approach is intuitively the approach which should work the
best. However, recent results in machine learning [153] show that the performance
of the final model can be improved not by choosing the model structure which is ex-
pected to predict the best but by creating a model whose output is the combination
of the output of models having different structures.
The reason for this non-intuitive result is that, in reality, any chosen hypothesis $h(\cdot, \alpha_N)$ is only an estimate of the real target (Figure 7.18) and, like any estimate, is affected by a bias and a variance term.
Section 5.10 presented some results on the combination of estimators. The extension of these results to supervised learning is the idea which underlies the first results in combination [16] and which later led to more enhanced forms of averaging different models.
Consider $m$ different models $h(\cdot, \alpha_j)$ and assume they are unbiased and uncorrelated. By (5.10.32), (5.10.33) and (5.10.34) we have that the combined model is
$$h(\cdot) = \frac{\sum_{j=1}^{m} \frac{1}{\hat{v}_j} h(\cdot, \alpha_j)}{\sum_{j=1}^{m} \frac{1}{\hat{v}_j}} \qquad (8.8.40)$$
where $\hat{v}_j$ is an estimate of the variance of $h(\cdot, \alpha_j)$. This is an example of the generalised ensemble method (GEM) [153].
More advanced applications of the combination principle to supervised learning
will be discussed in Chapter 11.
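A minimal sketch of the combination (8.8.40), assuming three hypothetical predictions of the same target and estimates of their variances (all numbers invented):

# generalised ensemble: variance-weighted average of m = 3 model outputs
pred <- c(1.9, 2.3, 2.0)           # outputs h(., alpha_j) for one query point
vhat <- c(0.5, 2.0, 1.0)           # estimated variances of the three models
w <- (1 / vhat) / sum(1 / vhat)    # weights proportional to the inverse variance
sum(w * pred)                      # combined prediction (8.8.40)

•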
8.9 Partition of dataset in training, validation and test
The main challenge of machine learning consists of using a finite-size dataset for i) learning several predictors, ii) assessing them, iii) selecting the most promising one and finally iv) returning it together with a reliable estimate of its generalisation error.
Section 7.10 discussed the need to avoid correlation between training and validation examples. While the training set is used for parametric identification, a non-overlapping portion of the dataset (validation set) should be used to estimate the generalisation error of model candidates.
The use of validation (or cross-validation) does not prevent, however, a risk of overfitting inherent to the winner-takes-all model selection. If we take the minimum generalisation error $\hat{G}(\Lambda_{\bar{s}})$ in (8.8.39) as the generalisation error of the winning model, we obtain again an optimistic estimate. This is known as selection bias, i.e. the bias that occurs when we make a selection in a stochastic setting, and is due to the fact that the expectation of a minimum is lower than the minimum of the expectations (Appendix C.11).
A nested cross-validation strategy [40] is recommended to avoid such bias. If we have enough observations (i.e. large $N$), the strategy consists in randomly partitioning (e.g. 50%, 25%, 25%) the labelled dataset into three parts: a training set, a validation set and a test set. The test portion is supposed to be used for the unbiased assessment of the generalisation error of the model $\bar{s}$ in (8.8.39). It is important to use only this set to assess the generalisation accuracy of the chosen model. For this reason, the test set should be carefully made inaccessible to the learning process (and ideally forgotten) and considered only at the very end of the data analysis. Any other use of the test set during the analysis (e.g. before the final assessment) would "contaminate" the procedure and make it irreversibly biased.
Selection bias
A Monte Carlo illustration of the selection bias effect in a univariate regression task is proposed in the R script selectionbias.R. The script estimates the generalisation errors of a constant model ($h_1$), a linear model ($h_2$) and a third model which is nothing more than the winner-takes-all of the two in terms of leave-one-out validation. It appears that the winner-takes-all model is not better than the best of the two: in other terms, its generalisation error is larger than the minimum of the generalisation errors of $h_1$ and $h_2$.
•
8.10 Evaluation of a regression model
Let us consider a test set of size $N_{ts}$ where $Y_{ts} = \{y_1, \dots, y_{N_{ts}}\}$ is the target and $\hat{Y} = \{\hat{y}_1, \dots, \hat{y}_{N_{ts}}\}$ is the prediction returned by the learner. The canonical way to assess a regression model by using a testing set is to measure the mean-squared-error (8.8.28) (MSE):
$$\text{MSE} = \frac{\sum_{i=1}^{N_{ts}} (y_i - \hat{y}_i)^2}{N_{ts}}$$
Let us suppose that the test of a learning algorithm returns a mean-squared-error of 0.4. Is that good or bad? Is that impressive and/or convincing? How may we have a rapid and intuitive measure of the quality of a regression model?
A recommended way is to compare the learner to a baseline, e.g. the simplest (or naive) predictor we could design. This is the rationale of the Normalised Mean-Squared-Error measure, which normalises the accuracy of the learner with respect to the accuracy of the average predictor, i.e. the simplest predictor we could learn from data. Then
$$\text{NMSE} = \frac{\sum_{i=1}^{N_{ts}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N_{ts}} (y_i - \bar{y})^2} \qquad (8.10.41)$$
where
$$\bar{y} = \frac{\sum_{i=1}^{N_{ts}} y_i}{N_{ts}} \qquad (8.10.42)$$
is the prediction returned by an average predictor. NMSE is then the ratio between the MSE of the learner and the MSE of the baseline naive predictor (8.10.42).
As for the MSE, the lower the NMSE the better. At the same time, we should target an NMSE (significantly) lower than one if we wish to claim that the complex learning procedure is effective. NMSE values close to (yet smaller than) one are either indicators of a bad learning design or, more probably, of a high noise-to-signal ratio (e.g. large $\sigma_w^2$ in (8.5.1)) which makes any learning effort ineffective.
Our recommendation is always to measure the NMSE of a regression model before making too enthusiastic claims about the success of the learning procedure. A very small MSE could be irrelevant if it is not significantly smaller than what we could obtain with a simple naive predictor.
Another common way to assess the MSE of a predictor is to normalise it with respect to the MSE of the same learning algorithm, yet trained on a randomly shuffled version of the training set. For instance, it is enough to shuffle the training target to cancel any dependency between inputs and outputs. Again, in this case, we expect the NMSE to be much lower than one. Otherwise, any claim that our prediction is better than a random one would be unfounded.
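A minimal sketch of the NMSE computation (the target and prediction vectors are invented for illustration):

# NMSE: MSE of the learner divided by the MSE of the average predictor
NMSE <- function(y, yhat) sum((y - yhat)^2) / sum((y - mean(y))^2)
y    <- c(1.0, 2.0, 3.0, 4.0)   # hypothetical test targets
yhat <- c(1.1, 1.8, 3.2, 3.9)   # hypothetical predictions
NMSE(y, yhat)                   # well below one: the learner beats the baseline

•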
8.11 Evaluation of a binary classifier
The most popular measure of performance is error rate or misclassification rate,
i.e. the proportion of test examples misclassified by the rule. However, misclassifi-
cation error is not necessarily the most appropriate criterion in real settings since
it implicitly assumes that the costs of different types of misclassification are equal.
When there are only a few or a moderate number of classes, the confusion matrix is
the most complete way of summarising the classifier performance. In the following,
we will focus on evaluating a binary classifier.
Suppose we use the classifier to make $N$ test classifications and that, among the values to be predicted, there are $N_P$ examples of class 1 and $N_N$ examples of class 0. The confusion matrix is

                          Negative (0)   Positive (1)
 Classified as negative        TN             FN         $\hat{N}_N$
 Classified as positive        FP             TP         $\hat{N}_P$
                              $N_N$          $N_P$          $N$

where FP is the number of False Positives and FN is the number of False Negatives. The confusion matrix contains all the relevant information to assess the generalisation capability of a binary classifier. From its values it is possible to derive a number of commonly used error rates or measures of accuracy. For instance, the misclassification error rate is
$$\text{ER} = \frac{FP + FN}{N} \qquad (8.11.43)$$
8.11.1 Balanced Error Rate
In a setting where the two classes are not balanced, the misclassification error rate (8.11.43) can lead to a too optimistic interpretation of the rate of success. For instance, if $N_P = 90$ and $N_N = 10$, a naive classifier always returning the positive class would have a misclassification ER $= 0.1$ since FN $= 0$ and FP $= 10$. This low value of misclassification gives a false sense of accuracy since humans tend to associate a 50% error with random classifiers. This is true in balanced settings, while in an unbalanced setting (like the one above) the same generalisation performance may be obtained with a trivial classifier making no use of the input information.
In these cases, it is preferable to adopt the balanced error rate, which is the average of the errors on each class:
$$\text{BER} = \frac{1}{2}\left(\frac{FP}{TN + FP} + \frac{FN}{FN + TP}\right)$$
Note that in the example above BER $= 0.5$, normalising the misclassification error rate to a value correctly interpretable by humans.
8.11.2 Specificity and sensitivity
In many research works on classification, it is common practice to assess the classifier in terms of sensitivity and specificity.
Sensitivity is a synonym of True Positive Rate (TPR):
$$\text{SE} = \text{TPR} = \frac{TP}{TP + FN} = \frac{TP}{N_P} = \frac{N_P - FN}{N_P} = 1 - \frac{FN}{N_P}, \qquad 0 \le \text{SE} \le 1 \qquad (8.11.44)$$
It is a quantity to be maximised, and it increases by reducing the number of false negatives. Note that it is also often called the recall in information retrieval.
Specificity stands for the True Negative Rate (TNR):
$$\text{SP} = \text{TNR} = \frac{TN}{FP + TN} = \frac{TN}{N_N} = \frac{N_N - FP}{N_N} = 1 - \frac{FP}{N_N}, \qquad 0 \le \text{SP} \le 1$$
It is a quantity to be maximised, and it increases by reducing the number of false positives.
In other terms, sensitivity is the proportion of positive examples classified as positive, while specificity is the proportion of negative examples classified as negative.
There exists a trade-off between these two quantities. This is the reason why both quantities have to be calculated to have a thorough assessment of the classifier accuracy. In fact, it is trivial to maximise one of these quantities to the detriment of the other.
For instance, for a naive classifier which always returns 0 we have $\hat{N}_P = 0$, $\hat{N}_N = N$, FP $= 0$, TN $= N_N$. This means that a naive classifier may attain maximal specificity (SP $= 1$) but at the cost of minimal sensitivity (SE $= 0$).
Analogously, in the case of a naive classifier which always returns 1, we have $\hat{N}_P = N$, $\hat{N}_N = 0$, FN $= 0$, TP $= N_P$, i.e. maximal sensitivity (SE $= 1$) but null specificity (SP $= 0$).
8.11.3 Additional assessment quantities
Other commonly used quantities which can be derived from the confusion matrix are:
•False Positive Rate:
$$\text{FPR} = 1 - \text{SP} = 1 - \frac{TN}{FP + TN} = \frac{FP}{FP + TN} = \frac{FP}{N_N}, \qquad 0 \le \text{FPR} \le 1$$
It decreases by reducing the number of false positives.
•False Negative Rate:
$$\text{FNR} = 1 - \text{SE} = 1 - \frac{TP}{TP + FN} = \frac{FN}{TP + FN} = \frac{FN}{N_P}, \qquad 0 \le \text{FNR} \le 1$$
It decreases by reducing the number of false negatives.
•Positive Predictive Value: the ratio (to be maximised)
$$\text{PPV} = \frac{TP}{TP + FP} = \frac{TP}{\hat{N}_P}, \qquad 0 \le \text{PPV} \le 1 \qquad (8.11.45)$$
This quantity is also called precision in information retrieval.
•Negative Predictive Value: the ratio (to be maximised)
$$\text{NPV} = \frac{TN}{TN + FN} = \frac{TN}{\hat{N}_N}, \qquad 0 \le \text{NPV} \le 1$$
•False Discovery Rate: the ratio (to be minimised)
$$\text{FDR} = \frac{FP}{TP + FP} = \frac{FP}{\hat{N}_P} = 1 - \text{PPV}, \qquad 0 \le \text{FDR} \le 1$$
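A minimal sketch deriving the quantities above from the four entries of a hypothetical confusion matrix (the counts are invented):

# assessment measures derived from a hypothetical confusion matrix
TP <- 90; FN <- 10; FP <- 30; TN <- 870
NP <- TP + FN; NN <- TN + FP
SE  <- TP / NP                                   # sensitivity (TPR, recall)
SP  <- TN / NN                                   # specificity (TNR)
BER <- 0.5 * (FP / (TN + FP) + FN / (FN + TP))   # balanced error rate
PPV <- TP / (TP + FP)                            # precision
FDR <- 1 - PPV                                   # false discovery rate
c(SE = SE, SP = SP, BER = BER, PPV = PPV, FDR = FDR)

•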
8.11.4 Receiver Operating Characteristic curve
All the assessment measures discussed so far make the assumption that the classifier returns a class for each test point. However, since most binary classifiers compute an estimate of the conditional probability, a class may be returned as outcome only once a threshold on the conditional probability is set. In other terms, the confusion matrix, as well as its derived measures, depends on a specific threshold. The choice of a threshold is related to the Type I and Type II errors (Section 5.13) that we are ready to accept in a stochastic setting.
In order to avoid conditioning our assessment on a specific threshold, it is interesting to assess the overall accuracy for all possible thresholds. This is possible by plotting curves, like the Receiver Operating Characteristic (ROC), which plots the true positive rate (i.e. sensitivity or power) against the false positive rate (1 − specificity) for different classification thresholds.
In other terms, the ROC visualises the probability of detection vs. the probability of false alarm. Different points on the curve correspond to different thresholds used in the classifier.
The ideal ROC curve would follow the two axes. In practice, real-life classification rules produce ROC curves which lie between these two extremes. It can be shown that a classifier with a ROC curve following the bisector line would be useless: for each threshold, we would have $TP/N_P = FP/N_N$, i.e. the same proportion of true positives and false positives. In other terms, this classifier would not separate the classes at all.
A common way to summarise a ROC curve is to compute the area under the curve (AUC). By measuring the AUC of different classifiers, we have a compact way to compare classifiers without setting a specific threshold.
8.11.5 Precision-recall curves
Another commonly used curve to visualise the accuracy of a binary classifier is the precision-recall (PR) curve. This curve shows the relation between precision (8.11.45) (the probability that an example is positive given that it has been classified as positive) and recall (8.11.44) (the probability that an example is classified as positive given that it is positive).
Since precision depends on the a priori probability of the positive class, in largely unbalanced problems (e.g. few positive examples, like in fraud detection), the PR curve is more informative than the AUC.
R script: visual assessment of a binary classifier
The R script roc.R illustrates the assessment of a binary classifier for a task where $x \in \mathbb{R}$, $p(x \mid y = +) \sim \mathcal{N}(1, 1)$ and $p(x \mid y = -) \sim \mathcal{N}(-1, 1)$. Suppose that the classifier categorises the examples as positive if $t > Th$ and negative if $t < Th$, where $Th \in \mathbb{R}$ is a threshold. Note that if $Th = -\infty$, all the examples are classed as positive: $TN = FN = 0$, which implies $SE = TP/N_P = 1$ and $FPR = FP/(FP + TN) = 1$. Conversely, if $Th = \infty$, all the examples are classed as negative: $TP = FP = 0$, which implies $SE = 0$ and $FPR = 0$.
By sweeping over all possible values of $Th$ we obtain the ROC and the PR curves in Figure 8.5. Each point on the ROC curve, associated with a specific threshold, has an abscissa $FPR = FP/N_N$ and an ordinate $TPR = TP/N_P$. Each point on the PR curve, associated with a specific threshold, has an abscissa $TPR = TP/N_P$ and an ordinate $PR = TP/(TP + FP)$.
•
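A minimal sketch of the threshold sweep in the same Gaussian setting (this is not the roc.R script; the sample sizes are arbitrary):

# empirical ROC curve by sweeping the threshold Th
set.seed(0)
xp <- rnorm(1000, mean = 1)    # p(x | y = +) ~ N(1, 1)
xn <- rnorm(1000, mean = -1)   # p(x | y = -) ~ N(-1, 1)
Th  <- sort(c(xp, xn))         # candidate thresholds
TPR <- sapply(Th, function(t) mean(xp > t))   # sensitivity at each threshold
FPR <- sapply(Th, function(t) mean(xn > t))   # 1 - specificity at each threshold
plot(FPR, TPR, type = "l", main = "Empirical ROC curve")
# AUC by trapezoidal integration (FPR decreases as Th increases)
sum(-diff(FPR) * (TPR[-1] + TPR[-length(TPR)]) / 2)   # well above 0.5

•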
Fraud detection example
Let us consider a fraud detection problem [52] with $N_P = 100$ frauds out of $N = 2 \cdot 10^6$ transactions. Since one of the two classes (in this case the fraud) is extremely rare, the binary classification setting is called unbalanced [51]. Unbalanced classification settings are very common in real-world tasks (e.g. churn detection, spam detection, predictive maintenance).
Suppose we want to compare two algorithms: the first returns 100 alerts, 90 of which are frauds. Its confusion matrix is

                            Genuine (0)   Fraudulent (1)
 Classified as genuine       1,999,890          10          1,999,900
 Classified as fraudulent        10              90             100
                             1,999,900          100        $2 \cdot 10^6$

The second algorithm returns a much larger number of alerts (1000), 90 of which are actual frauds. Its confusion matrix is then

                            Genuine (0)   Fraudulent (1)
 Classified as genuine       1,998,990          10          1,999,000
 Classified as fraudulent       910             90             1000
                             1,999,900          100        $2 \cdot 10^6$

Which of the two algorithms is the best? In terms of TPR and FPR we have
1. TPR $= TP/N_P = 90/100 = 0.9$, FPR $= FP/N_N = 10/1{,}999{,}900 = 0.00000500025$
2. TPR $= 90/100 = 0.9$, FPR $= 910/1{,}999{,}900 = 0.00045502275$
The FPR difference between the two algorithms is then negligible. Nevertheless, though the recalls of the two algorithms are almost identical, the first algorithm is definitely better in terms of false positives (much higher precision):
1. A1: PR $= TP/(TP + FP) = 90/100 = 0.9$, recall $= 0.9$
2. A2: PR $= 90/1000 = 0.09$, recall $= 0.9$
The example shows that, in strongly unbalanced settings, the performance of a classification algorithm may be highly sensitive to the adopted cost function.
•
8.12 Multi-class problems
So far we have limited ourselves to binary classification tasks. However, real-world classification tasks (e.g. in bioinformatics or image recognition) are often multi-class. Some classification strategies (detailed in the following chapters) may be easily adapted to the multi-class setting, like the Naive Bayes (Section 10.2.3.1) or the KNN classifier (Section 10.2.1.1).
Suppose, however, that we have a binary classification strategy that we want to use in a multi-class context. There are three main strategies to extend binary classifiers to handle multi-class tasks $y \in \{c_1, \dots, c_k\}$.
1. One-versus-the rest (or one-versus-all, OVA): it builds for each class ck a
binary classifier that separates this class from the rest. To predict the class
of a query point q, the outputs of the kclassifiers are considered. If there
is a unique class label which is consistent with the kpredictions, the query
point is labeled with such a class. Otherwise, one of the kclasses is selected
randomly.
2. Pairwise (or one-versus-one, OVO): it trains a classifier for each pair of classes, requiring in total the independent learning of k(k − 1)/2 binary classifiers. To predict the class of a query point, the outputs of the k(k − 1)/2 classifiers are computed and a majority vote is taken. If there is a class which receives the largest number of votes, the query point is assigned to that class. Otherwise each tie is broken randomly.
3. Coding: it first encodes each class by a binary vector of size d, then it trains a classifier for each component of the vector. The aggregation of the outputs of the d classifiers returns an output word, i.e. a binary vector of size d. Given a query q, the output word is compared against the codeword of each class, and the class having the smallest Hamming distance (the number of disagreements) to the output word is returned.
Suppose that we have a task with k = 8 output classes. According to the coding strategy, ⌈log₂ 8⌉ = 3 binary classifiers can be used to handle this problem.
       ĉ1   ĉ2   ĉ3
  c1    0    0    0
  c2    0    0    1
  c3    0    1    0
  c4    0    1    1
  c5    1    0    0
  c6    1    0    1
  c7    1    1    0
  c8    1    1    1
The table columns denote the classifiers while the rows contain the coding of the associated class. For instance, the ĉ3 classifier will i) encode the training points labeled with the classes {c2, c4, c6, c8} as ones, ii) encode all the remaining examples as zeros and iii) learn the corresponding binary classifier.
Note that, though each strategy requires the learning of more than a single classifier, the number of trained classifiers is not the same. Given k > 2 classes, the number of classifiers trained by each method is listed below; a sketch of the one-versus-the-rest strategy follows the list.
• One-versus-the-rest: k binary classifiers
• Pairwise: k(k − 1)/2 binary classifiers
• Coding: ⌈log₂ k⌉ binary classifiers, where ⌈·⌉ denotes the ceiling operator.
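As an illustration, here is a minimal sketch (our own, not one of the book's scripts) of the one-versus-the-rest strategy, using logistic regression as the underlying binary learner; for simplicity, ties are broken by which.max instead of randomly, and Xtr and Xts are matrices with the same number of columns.

ova <- function(Xtr, ytr, Xts) {
  classes <- sort(unique(ytr))
  ## one binary classifier per class: class ck versus the rest
  scores <- sapply(classes, function(ck) {
    D <- data.frame(Xtr, y = as.numeric(ytr == ck))
    model <- glm(y ~ ., data = D, family = binomial)
    predict(model, newdata = data.frame(Xts), type = "response")
  })
  ## label each query point with the class of the highest-scoring classifier
  classes[apply(scores, 1, which.max)]
}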
8.13 Concluding remarks
The chapter presented the most important steps to learn a model on the basis of a finite set of input/output data. Though the entire procedure was globally depicted as a waterfall process (Figure 8.6), it should be kept in mind that a learning process, like any modelling effort, is better represented by a spiral model characterised by feedbacks, iterations and adjustments. An example is the identification step, composed of two nested loops, the inner one returning the parameters of a fixed structure, and the outer one searching for the best configuration.
The chapter focused on the core of the learning process, which begins once the data are in a tabular format. Nevertheless, it is worth remembering that the upstream steps, sketched in Sections 8.2-8.4, play a very important role as well. However, since those steps are often domain- and task-dependent, we considered them beyond the scope of this book.
In the following chapters we will leave the general perspective and delve into the specifics of the best-known learning algorithms.
8.14 Exercises
1. Consider an input/output regression task where n = 1, E[y|x] = sin(x) and p(y|x) ∼ N(sin(x), 1). Let N = 100 be the size of the training set and consider a quadratic loss function.
Let the class of hypothesis be h3(x) = α0 + Σ_{m=1}^{3} αm x^m.
1. Estimate the parameter by least-squares.
2. Estimate the parameter by gradient-based search and plot the evolution of the
training error as a function of the number of iterations. Show in the same
figure the least-squares error.
Figure 8.6: From the phenomenon to the predictive model: overview of the steps
constituting the modelling process.
3. Plot the evolution of the gradient-based parameter estimations as a function of
the number of iterations. Show in the same figure the least-squares parameters.
4. Discuss the impact of the gradient-based learning rate on the training error
minimisation.
5. Estimate the parameter by Levenberg-Marquardt search and plot the evolution
of the training error as a function of the number of iterations. Show in the
same figure the least-squares error.
6. Plot the evolution of the Levenberg-Marquardt parameter estimations as a
function of the number of iterations. Show in the same figure the least-squares
parameters.
2. Consider an input/output regression task where n = 1, E[y|x] = 3x + 2 and p(y|x) ∼ N(3x + 2, 1). Let N = 100 be the size of the training set and consider a quadratic loss function. Consider an iterative gradient-descent procedure to minimise the empirical error.
1. Show in a contour plot the evolution β̂(τ) of the estimated parameter vector
for at least 3 different learning rates.
2. Compute the least-squares solution and show the convergence of the iterated
procedure to the least-squares solution.
3. Let us consider the dependency where the conditional distribution of y is
y = sin(2π x1 x2 x3) + w
where w ∼ N(0, σ²), x ∈ R³ has a 3D normal distribution with an identity covariance matrix, N = 100 and σ = 0.25.
Consider the following families of learners:
1. the constant model returning always zero,
2. the constant model h(x) = β0,
3. the linear model h(x) = x^T β,
4. the K nearest neighbour for K = 1, 3, 5, 7, where the distance is Euclidean.
Implement for each learner above a function with the following signature (the body below is a runnable placeholder returning a constant zero prediction, to be replaced by each learner):

learner <- function(Xtr, Ytr, Xts) {
  ## Xtr [N,n]   : input training set
  ## Ytr [N,1]   : output training set
  ## Xts [Nts,n] : input test set
  Yhat <- rep(0, nrow(Xts))   ## placeholder prediction
  return(Yhat)
}
which returns a vector [Nts, 1] of predictions for the given input test set.
By using Monte Carlo simulation (S = 100 runs) and a fixed-input test set of size Nts = 1000:
•compute the average squared bias of all the learners,
•compute the average variance of all the learners,
•check the relation between squared bias, variance, noise variance and MSE
•define what is the best learner in terms of MSE,
•discuss the results.
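A possible scaffold for this Monte Carlo study (our own sketch; here only the trivial constant-zero learner is plugged in, as a check of the bias/variance relation requested above):

set.seed(0)
n <- 3; N <- 100; Nts <- 1000; S <- 100; sigma <- 0.25
Xts <- matrix(rnorm(Nts * n), Nts, n)          ## fixed-input test set
f <- function(X) sin(2 * pi * X[, 1] * X[, 2] * X[, 3])
zero.learner <- function(Xtr, Ytr, Xts) rep(0, nrow(Xts))
P <- matrix(NA, S, Nts)                        ## predictions over the S runs
for (s in 1:S) {
  Xtr <- matrix(rnorm(N * n), N, n)
  Ytr <- f(Xtr) + rnorm(N, sd = sigma)
  P[s, ] <- zero.learner(Xtr, Ytr, Xts)
}
bias2 <- mean((colMeans(P) - f(Xts))^2)        ## average squared bias
variance <- mean(apply(P, 2, var))             ## average variance
c(bias2, variance, bias2 + variance + sigma^2) ## last term estimates the MSE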
4. The student should prove the following equality concerning the quantities defined in Section 8.11:
FPR = (p/(1 − p)) · ((1 − PPV)/PPV) · (1 − FNR)
where p = Prob{y = +}.
Hint: use the Bayes theorem.
Chapter 9
Linear approaches
The previous chapters distinguished between two types of supervised learning tasks
according to the type of output:
Regression when we predict quantitative outputs, e.g. real or integer numbers. Predicting the weight of an animal on the basis of its age and height is an example of a regression problem.
Classification (or pattern recognition) where we predict qualitative or categorical outputs which assume values in a finite set of classes (e.g. black, white and red) with no explicit ordering. Qualitative variables are also referred to as factors. Predicting the class of an email on the basis of English word frequencies is an example of a classification task.
This chapter will consider learning approaches to classification and regression
where the hypothesis functions are linear combinations of the input variables.
9.1 Linear regression
Linear regression is a very old technique in statistics and traces back to the work
of Gauss.
9.1.1 The univariate linear model
The simplest regression model is the univariate linear regression model where the
input is supposed to be a scalar variable and the stochastic dependency between
input and output is described by
y = β0 + β1 x + w    (9.1.1)
where x ∈ R is the regressor (or independent) variable, y is the measured response (or dependent) variable, β0 is the intercept, β1 is the slope and w is called noise or model error. We will assume that E[w] = 0 and that its variance σ²_w is independent of the value of x. The assumption of constant variance is often referred to as homoscedasticity. From (9.1.1) we obtain
Prob{y = y|x} = Prob{w = y − β0 − β1 x},   E[y|x] = f(x) = β0 + β1 x
The function f(x) = E[y|x], also known as the regression function, is a linear function in the parameters β0 and β1 (Figure 9.1).

Figure 9.1: Conditional distribution and regression function for a stochastic linear dependency.

In the following we will consider as a linear model any input/output relationship which is linear in the parameters but
not necessarily in the dependent variables. This means that: i) any value of the response variable y is described by a linear combination of a series of parameters (regression slopes, intercept) and ii) no parameter appears as an exponent or is multiplied or divided by another parameter. According to this definition of linear model, then
• y = β0 + β1 x is a linear model;
• y = β0 + β1 x² is again a linear model: simply by making the transformation X = x², the dependency can be put in the linear form (9.1.1);
• y = β0 x^{β1} can be studied as a linear model between Y = log(y) and X = log(x), since log(y) = log(β0) + β1 log(x), i.e. Y = β'0 + β1 X with β'0 = log(β0);
• the relationship y = β0 + β1 β2^x is not linear since there is no way to linearise it.
9.1.2 Least-squares estimation
Suppose that N pairs of observations (xi, yi), i = 1, . . . , N are available. Let us assume that the data are generated by the following stochastic dependency
yi = β0 + β1 xi + wi,   i = 1, . . . , N    (9.1.2)
where
1. the wi ∈ R are i.i.d. realisations of the r.v. w having mean zero and constant variance σ²_w (homoscedasticity),
2. the xi ∈ R are non random and observed with negligible error.
The unknown parameters (also known as regression coefficients) β0 and β1 can be estimated by the least-squares method. The method of least squares is designed to provide
1. the estimations β̂0 and β̂1 of β0 and β1, respectively,
2. the fitted values of the response y
ŷi = β̂0 + β̂1 xi,   i = 1, . . . , N
so that the residual sum of squares (which is N times the empirical risk)
SSE_emp = N · M̂ISE_emp = Σ_{i=1}^{N} (yi − ŷi)² = Σ_{i=1}^{N} (yi − β̂0 − β̂1 xi)²
is minimised. In other terms
{β̂0, β̂1} = arg min_{b0,b1} Σ_{i=1}^{N} (yi − b0 − b1 xi)²
It can be shown that the least-squares solution is
β̂1 = Sxy/Sxx,   β̂0 = ȳ − β̂1 x̄
if Sxx ≠ 0, where
x̄ = (Σ_{i=1}^{N} xi)/N,   ȳ = (Σ_{i=1}^{N} yi)/N
and
Sxy = Σ_{i=1}^{N} (xi − x̄) yi
Sxx = Σ_{i=1}^{N} (xi − x̄)² = Σ_{i=1}^{N} (xi² − 2 xi x̄ + x̄²) = Σ_{i=1}^{N} (xi² − xi x̄ − xi x̄ + x̄²)
    = Σ_{i=1}^{N} [(xi − x̄) xi] + Σ_{i=1}^{N} [x̄ (x̄ − xi)] = Σ_{i=1}^{N} (xi − x̄) xi
where the second sum vanishes since Σ_{i=1}^{N} (x̄ − xi) = 0.
It is worth noting that if x̄ = 0 and ȳ = 0 then β̂0 = 0 and
Sxy = ⟨X, Y⟩,   Sxx = ⟨X, X⟩    (9.1.3)
where X and Y are the [N, 1] vectors of x and y observations, respectively, and the inner product ⟨·, ·⟩ of two vectors is defined in Appendix B.2.
It is also possible to write down the relation between the least-squares estimation β̂1 and the sample correlation coefficient (D.0.3):
ρ̂² = β̂1 Sxy/Syy    (9.1.4)
R script
The script lin_uni.R computes and plots the least-squares solution for N = 100 observations generated according to the dependency (9.1.2) where β0 = 2 and β1 = −2.
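A minimal sketch of such a computation (the uniform input distribution is our own assumption, not necessarily the one used in the script) could be:

set.seed(0)
N <- 100; beta0 <- 2; beta1 <- -2
x <- runif(N, -1, 1)
y <- beta0 + beta1 * x + rnorm(N)
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * y)
beta1.hat <- Sxy / Sxx
beta0.hat <- mean(y) - beta1.hat * mean(x)
plot(x, y); abline(beta0.hat, beta1.hat)   ## data and fitted line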
•
If the dependency underlying the data is linear then the estimators are unbiased. We show this property for β̂1:
E_DN[β̂1] = E_DN[Sxy/Sxx] = Σ_{i=1}^{N} (xi − x̄) E[yi] / Sxx = (1/Sxx) Σ_{i=1}^{N} (xi − x̄)(β0 + β1 xi)
        = (1/Sxx) [Σ_{i=1}^{N} (xi − x̄) β0 + Σ_{i=1}^{N} (xi − x̄) β1 xi] = β1 Sxx/Sxx = β1
Note that the analytical derivation relies on the relation Σ_{i=1}^{N} (xi − x̄) = 0 and the fact that x is not a random variable. It can also be shown [139] that
Var[β̂1] = σ²_w/Sxx    (9.1.5)
E[β̂0] = β0    (9.1.6)
Var[β̂0] = σ²_w (1/N + x̄²/Sxx)    (9.1.7)
Another important result in linear regression is that the quantity
σ̂²_w = Σ_{i=1}^{N} (yi − ŷi)² / (N − 2)    (9.1.8)
is an unbiased estimator of σ²_w under the (strong) assumption that the observations have been generated according to (9.1.1). The denominator is often referred to as the residual degrees of freedom, also denoted by df. The degrees of freedom can be seen as the number N of observations reduced by the number of parameters estimated (slope and intercept). The estimate of the variance σ²_w can be used in Equations (9.1.7) and (9.1.5) to derive an estimation of the variance of the intercept and slope, respectively.
9.1.3 Maximum likelihood estimation
The properties of least-squares estimators rely on the only assumption that the
wi = yi − β0 − β1 xi    (9.1.9)
are i.i.d. realisations with mean zero and constant variance σ²_w. Therefore, no assumption is made concerning the probability distribution of w (e.g. Gaussian or uniform). On the contrary, if we want to use the maximum likelihood approach (Section 5.8), we have to define the distribution of w. Suppose that w ∼ N(0, σ²_w). By using (9.1.9), the likelihood function can be written as
L_N(β0, β1) = Π_{i=1}^{N} p_w(wi) = (1/((2π)^{N/2} σ_w^N)) exp(−Σ_{i=1}^{N} (yi − β0 − β1 xi)² / (2σ²_w))    (9.1.10)
It can be shown that the estimates of β0 and β1 obtained by maximising LN (·)
under the normal assumption are identical to the ones obtained by least squares
estimation.
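This equivalence is easy to verify numerically; the sketch below (our own, assuming a known unit noise variance) maximises the Gaussian log-likelihood with a general-purpose optimiser and compares the result with lm():

set.seed(0)
N <- 100; x <- runif(N); y <- 1 + 2 * x + rnorm(N)
negloglik <- function(b)            ## minus the Gaussian log-likelihood
  -sum(dnorm(y, mean = b[1] + b[2] * x, sd = 1, log = TRUE))
optim(c(0, 0), negloglik)$par       ## maximum likelihood estimate
coef(lm(y ~ x))                     ## least-squares estimate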
9.1.4 Partitioning the variability
An interesting way of assessing the quality of a linear model is to evaluate which
part of the output variability the model is able to explain. We can use the following
relation
Σ_{i=1}^{N} (yi − ȳ)² = Σ_{i=1}^{N} (ŷi − ȳ)² + Σ_{i=1}^{N} (yi − ŷi)²
i.e.
SSTot = SSMod + SSRes
where SSTot (which is also N times the sample variance of y) represents the total variability of the response, SSMod is the variability explained by the model and SSRes is the variability left unexplained. This partition helps to determine whether the variation explained by the regression model is real or is no more than chance variation. It will be used in the following section to perform hypothesis tests on the quantities estimated by the regression model.
9.1.5 Test of hypotheses on the regression model
Suppose that we want to answer the question whether the regressor variable x truly influences the distribution F_y(·) of the response y or, in other words, whether they are linearly dependent. We can formulate the problem as a hypothesis testing problem on the slope β1 where
H: β1 = 0,   H̄: β1 ≠ 0
If H is true this means that the regressor variable does not influence the response (at least not through a linear relationship). Rejection of H in favor of H̄ leads to the conclusion that x significantly influences the response in a linear fashion. It can be shown that, under the assumption that w is normally distributed, if the null hypothesis H (null correlation) is true then
SSMod / (SSRes/(N − 2)) ∼ F_{1,N−2}.
Large values of the F statistic (Section C.2.4) provide evidence in favor of H̄ (i.e. a linear trend exists). The test is a two-sided test. In order to perform a single-sided test, typically T-statistics are used.
9.1.5.1 The t-test
We want to test whether the value of the slope is equal to a predefined value β̄:
H: β1 = β̄,   H̄: β1 ≠ β̄
Under the assumption of normal distribution of w, the following relation holds
β̂1 ∼ N(β1, σ²_w/Sxx)    (9.1.11)
It follows that
(β̂1 − β1) √Sxx / σ̂ ∼ T_{N−2}
where σ̂² is the estimation of the variance σ²_w. This is a typical t-test applied to the regression case. Note that this statistic can also be used to test a one-sided hypothesis, e.g.
H: β1 = β̄,   H̄: β1 > β̄
9.1.6 Interval of confidence
Under the assumption of normal distribution, according to (9.1.11)
Prob{−t_{α/2,N−2} < (β̂1 − β1) √Sxx / σ̂ < t_{α/2,N−2}} = 1 − α
where t_{α/2,N−2} is the upper α/2 critical point of the T-distribution with N − 2 degrees of freedom. Equivalently, we can say that, with probability 1 − α, the real parameter β1 is covered by the interval described by
β̂1 ± t_{α/2,N−2} √(σ̂²/Sxx)    (9.1.12)
Note that the interval (9.1.12) may be used to test the hypothesis of input irrelevance. If the value 0 is outside the interval above, we can reject the input irrelevance hypothesis with 100(1 − α)% confidence.
Similarly, from (9.1.7) we obtain that the 100(1 − α)% confidence interval of β0 is
β̂0 ± t_{α/2,N−2} σ̂ √(1/N + x̄²/Sxx)
9.1.7 Variance of the response
Let
ŷ = β̂0 + β̂1 x
be the estimator of the regression function value in x. If the linear dependence (9.1.1) holds, we have for an arbitrary x = x0
E[ŷ|x0] = E[β̂0] + E[β̂1] x0 = β0 + β1 x0 = E[y|x0]
This means that the prediction ŷ is an unbiased estimator of the value of the regression function in x0. Under the assumption of normal distribution of w, the variance of ŷ in x0 is
Var[ŷ|x0] = σ²_w (1/N + (x0 − x̄)²/Sxx)
where x̄ = (Σ_{i=1}^{N} xi)/N. This quantity measures how the prediction ŷ would vary if repeated data collections from (9.1.1) and least-squares estimations were conducted.
R script
Let us consider a dataset DN = {xi, yi}_{i=1,...,N} where
yi = β0 + β1 xi + wi
where β0 and β1 are known and w ∼ N(0, σ²_w) with σ²_w known. The R script bv.R may be used to:
• study experimentally the bias and variance of the estimators β̂0, β̂1 and σ̂ when data are generated according to the linear dependency (9.1.2) with β0 = −1, β1 = 1 and σ_w = 4;
• compare the experimental values with the theoretical results;
• study experimentally the bias and the variance of the response prediction;
• compare the experimental results with the theoretical ones.
•
R script
Consider the medical dataset available in the R script medical.R. This script may be used to: i) estimate the intercept and slope of the linear model fitting the dataset, ii) plot the fitted linear model, iii) estimate the variance of the estimator of the slope, iv) test the hypothesis β1 = 0, and v) compute the confidence interval of β1 and compare the results with the output of the R command lm().
•
9.1.8 Coefficient of determination
The coefficient of determination, also known as R²,
R² = SSMod/SSTot = Σ_{i=1}^{N} (ŷi − ȳ)² / Σ_{i=1}^{N} (yi − ȳ)² = 1 − SSRes/SSTot
is often used as a measure of the fit of the regression line.
This quantity, which satisfies the inequality 0 ≤ R² ≤ 1, represents the proportion of variation in the response data that is explained by the model. The coefficient of determination is easy to interpret and can be understood by most experimenters regardless of their training in statistics. However, it is a dangerous criterion for the comparison of candidate models because any additional model term (e.g. a quadratic term) will decrease SSRes and thus increase R². In other terms, R² can be made artificially high by overfitting (Section 7.7), since it is not merely the quality of fit which influences R².
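A small illustration of this monotonic behaviour (our own sketch): although the true dependency below is of degree one, R² keeps growing as higher-degree polynomial terms are added.

set.seed(0)
N <- 30; x <- runif(N, -1, 1)
y <- 1 + x + rnorm(N)       ## true model is linear
sapply(1:6, function(m) summary(lm(y ~ poly(x, m)))$r.squared)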
9.1.9 Multiple linear dependence
Consider a linear relation between an independent vector x ∈ X ⊂ R^n and a dependent random variable y ∈ Y ⊂ R
y = β0 + β1 x·1 + β2 x·2 + · · · + βn x·n + w    (9.1.13)
where w represents a random variable with mean zero and constant variance σ²_w. Note that it is possible to establish a link between the partial regression coefficients βi and partial correlation terms (Section 3.8.3), showing that βi is related to the conditional information of xi about y once all the other terms are fixed (ceteris paribus effect) [139].
In matrix notation¹ the equation (9.1.13) can be written as
y = x^T β + w    (9.1.14)
where x stands for the [p × 1] vector x = [1, x·1, x·2, . . . , x·n]^T and p = n + 1 is the total number of model parameters.
9.1.10 The multiple linear regression model
Consider N observations DN = {⟨xi, yi⟩ : i = 1, . . . , N} generated according to the stochastic dependence (9.1.14) where xi = [1, xi1, . . . , xin]^T. We suppose that the following multiple linear relation holds
Y = Xβ + W
¹We use the notation x·j to denote the jth variable of the non-random vector x, while xi = [1, xi1, xi2, . . . , xin]^T denotes the ith observation of the vector x. This extension of notation is necessary when the input is not considered a random vector. In the generic case xj will be used to denote the jth variable.
where Y is the [N × 1] response vector, X is the [N × p] data matrix, whose jth column contains the readings of the jth regressor, β is the [p × 1] vector of parameters and W the [N × 1] noise vector:
Y = [y1, y2, . . . , yN]^T,   β = [β0, β1, . . . , βn]^T,   W = [w1, w2, . . . , wN]^T

X = | 1  x11  x12  . . .  x1n |
    | 1  x21  x22  . . .  x2n |
    | .   .    .   . . .   .  |
    | 1  xN1  xN2  . . .  xNn |

i.e. the matrix with rows x1^T, x2^T, . . . , xN^T. Here the wi are assumed uncorrelated, with mean zero and constant variance σ²_w (homogeneous variance). Then Var[W] = σ²_w I_N.
9.1.11 The least-squares solution
We seek the least-squares estimator β̂ such that
β̂ = arg min_b Σ_{i=1}^{N} (yi − xi^T b)² = arg min_b (Y − Xb)^T (Y − Xb)    (9.1.15)
where
SSE_emp = N · M̂ISE_emp = (Y − Xb)^T (Y − Xb) = e^T e    (9.1.16)
is the residual sum of squares (which is N times the empirical risk (7.2.8) with quadratic loss) and
e = Y − Xb
is the [N × 1] vector of residuals. The quantity SSE_emp is a quadratic function of the p parameters. In order to minimise
(Y − Xβ̂)^T (Y − Xβ̂) = β̂^T X^T X β̂ − β̂^T X^T Y − Y^T X β̂ + Y^T Y
the vector β̂ must satisfy
∂/∂β̂ [(Y − Xβ̂)^T (Y − Xβ̂)] = −2 X^T (Y − Xβ̂) = 0    (9.1.17)
Assuming X is of full column rank, the second derivative
∂²/(∂β̂ ∂β̂^T) [(Y − Xβ̂)^T (Y − Xβ̂)] = 2 X^T X
is positive definite and SSE_emp attains its minimum at the solution of the least-squares normal equations
(X^T X) β̂ = X^T Y
As a result
β̂ = (X^T X)^{-1} X^T Y = X† Y    (9.1.18)
where X^T X is a symmetric [p × p] matrix (also known as Gram matrix) and X† = (X^T X)^{-1} X^T is called the pseudo-inverse of X since X† X = I_p. Note that the computation of β̂ represents the parametric identification step of the supervised learning procedure (Section 7.9) when the class of hypotheses is linear.
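For illustration, a minimal sketch (ours) computing (9.1.18) explicitly and checking it against lm(); for numerical work, solving the normal equations with solve() is preferable to forming the inverse explicitly.

set.seed(0)
N <- 50; n <- 2
X <- cbind(1, matrix(rnorm(N * n), N, n))   ## [N,p] data matrix with intercept
Y <- X %*% c(1, 2, -1) + rnorm(N)
beta.hat <- solve(t(X) %*% X, t(X) %*% Y)   ## least-squares normal equations
cbind(beta.hat, coef(lm(Y ~ X - 1)))        ## the two solutions coincide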
9.1.12 Least-squares and non full-rank configurations
A full-rank X is required to ensure that the matrix X^T X is invertible in (9.1.18). However, for numerical reasons it is recommended that X^T X is not only invertible but also well-conditioned, or equivalently not ill-conditioned [2]. An ill-conditioned matrix is an almost singular matrix: its inverse may contain very large entries and sometimes numeric overflows. This means that small changes in the data may cause large and unstable changes in the solution β̂. Such sensitivity of the solution to the dataset should evoke in the attentive reader the notion of estimator variance (Section 5.5). In fact, in the following sections we will show that the variance of least-squares estimators is related to the inverse of X^T X (e.g. Equation (9.1.19)).
But what to do in practice if X is not full-rank (or rank-deficient) or ill-conditioned? A first numerical fix consists in computing the generalised QR decomposition (Appendix B.4)
X = QR
where Q is an orthogonal [N, p0] matrix and R is a [p0, p] upper-triangular matrix of full row rank with p0 < p. Since RR^T is invertible, the pseudo-inverse in (9.1.18) can be written as X† = R^T (RR^T)^{-1} Q^T (details in Section 2.8.1 of [2]). A second solution consists in regularising the optimisation, i.e. constraining the optimisation problem (9.1.15) by adding a term which penalises solutions β̂ with a too large norm. This leads to the ridge regression formulation which will be discussed in Section 12.5.1.1. In more general terms, since non-invertible or ill-conditioned configurations are often due to highly correlated (multicollinear) or redundant inputs, the use of feature selection strategies (Chapter 12) before the parametric identification step may be beneficial.
9.1.13 Properties of least-squares estimators
Under the condition that the linear stochastic dependence (9.1.14) holds, it can be shown [139] that:
• If E[w] = 0 then the random vector β̂ is an unbiased estimator of β.
• The residual mean square estimator
σ̂²_w = (Y − Xβ̂)^T (Y − Xβ̂) / (N − p)
is an unbiased estimator of the error variance σ²_w.
• If the wi are uncorrelated and have common variance, the variance-covariance matrix of β̂ is given by
Var[β̂] = σ²_w (X^T X)^{-1}    (9.1.19)
It can also be shown (Gauss-Markov theorem) that the least-squares estimation β̂ is the "best linear unbiased estimator" (BLUE), i.e. it has the lowest variance among all linear unbiased estimators.
From the results above it is possible to derive the confidence intervals of model
parameters, similarly to the univariate case discussed in Section 9.1.6.
R script
A list of the most important least-squares summary statistics is returned by the
summary of the R command lm. See for instance the script ls.R.
summary(lm(Y~X))
Call:
lm(formula = Y ~ X)
Residuals:
Min 1Q Median 3Q Max
-0.40141 -0.14760 -0.02202 0.03001 0.43490
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.09781    0.11748   9.345 6.26e-09 ***
X            0.02196    0.01045   2.101   0.0479 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2167 on 21 degrees of freedom
Multiple R-Squared: 0.1737, Adjusted R-squared: 0.1343
F-statistic: 4.414 on 1 and 21 DF, p-value: 0.0479
•
9.1.14 Variance of the prediction
Since the estimator β̂ is unbiased, this is also the case for the prediction ŷ = x0^T β̂ for a generic input value x = x0. Its variance is
Var[ŷ|x0] = σ²_w x0^T (X^T X)^{-1} x0    (9.1.20)
Assuming that w is normally distributed, the 100(1 − α)% confidence bound for the regression value in x0 is given by
ŷ(x0) ± t_{α/2,N−p} σ̂_w √(x0^T (X^T X)^{-1} x0)
where t_{α/2,N−p} is the upper α/2 percent point of the t-distribution with N − p degrees of freedom, and the quantity σ̂_w √(x0^T (X^T X)^{-1} x0), obtained from (9.1.20), is the standard error of prediction for multiple regression.
R script
The R script bv_mult.R validates by Monte Carlo simulation the properties of least-squares estimation mentioned in Sections 9.1.11 and 9.1.14.
In order to assess the generality of the results, we invite the reader to run the script for different input sizes n, different numbers of observations N and different values of the parameter β.
•
9.1.15 The HAT matrix
The Hat matrix is defined as
H = X (X^T X)^{-1} X^T    (9.1.21)
It is a symmetric, idempotent [N × N] matrix that transforms the output values Y of the training set into the regression predictions Ŷ:
Ŷ = Xβ̂ = X (X^T X)^{-1} X^T Y = HY
Using the above relation, the vector of residuals can be written as
e = Y − Xβ̂ = Y − X (X^T X)^{-1} X^T Y = [I − H] Y
and the residual sum of squares as
e^T e = Y^T [I − H]² Y = Y^T [I − H] Y = Y^T P Y    (9.1.22)
where P is a [N × N] matrix, called the projection matrix.
If X has full rank, by commutativity of the trace operator it follows that
tr(H) = tr(X (X^T X)^{-1} X^T) = tr(X^T X (X^T X)^{-1}) = tr(I_p) = p    (9.1.23)
If we perform a QR decomposition of X (Appendix B.4) then we obtain
H = X (X^T X)^{-1} X^T = QR (R^T Q^T QR)^{-1} R^T Q^T = QR R^{-1} (R^T)^{-1} R^T Q^T = QQ^T    (9.1.24)
Note that, in this case, the input matrix X is replaced by the matrix Q which contains an orthogonalised transformation of the original inputs.
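The properties above (symmetry, idempotence, tr(H) = p) can be checked numerically; a small sketch (ours) on a random full-rank X:

set.seed(0)
N <- 20; p <- 4
X <- cbind(1, matrix(rnorm(N * (p - 1)), N, p - 1))
H <- X %*% solve(t(X) %*% X) %*% t(X)
max(abs(H - t(H)))      ## symmetry: ~ 0
max(abs(H - H %*% H))   ## idempotence: ~ 0
sum(diag(H))            ## trace: equals p = 4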
9.1.16 Generalisation error of the linear model
Given a training dataset DN = {⟨xi, yi⟩ : i = 1, . . . , N} and a query point x, it is possible to return a linear prediction
ŷ = h(x, α) = x^T β̂
where β̂ is returned by the least-squares estimation (9.1.18). From an estimation perspective, β̂ is a realisation of the random estimator for the specific dataset DN. But which precision can we expect from ŷ = x^T β̂ if we average the prediction error over all the finite-size datasets DN that can be generated by the linear dependency (9.1.13)?
A quantitative measure of the quality of the linear predictor on the whole domain X is the Mean Integrated Squared Error (MISE) defined in (7.5.24). But how can we estimate this quantity in the linear case? Also, is the empirical risk M̂ISE_emp in (9.1.16) a reliable estimate of MISE?
9.1.16.1 The expected empirical error
This section derives analytically that the empirical risk M̂ISE_emp (defined in (9.1.16)) is a biased estimator of the MISE generalisation error.
Let us first compute the expectation of the residual sum of squares, which is equal to N times the empirical risk. According to (9.1.22) and Theorem 4.2 the expectation can be written as
E_DN[SSE_emp] = E_DN[e^T e] = E_DN[Y^T P Y] = σ²_w tr(P) + E[Y^T] P E[Y]
Since tr(ABC) = tr(CAB),
tr(P) = tr(I − H) = N − tr(X (X^T X)^{-1} X^T) = N − tr(X^T X (X^T X)^{-1}) = N − tr(I_p) = N − p
and we have
E_DN[e^T e] = (N − p) σ²_w + (Xβ)^T P (Xβ)    (9.1.25)
           = (N − p) σ²_w + β^T X^T (I − X (X^T X)^{-1} X^T) Xβ    (9.1.26)
           = (N − p) σ²_w    (9.1.27)
It follows that
E_DN[M̂ISE_emp] = E_DN[SSE_emp/N] = E_DN[e^T e / N] = (1 − p/N) σ²_w    (9.1.28)
is the expectation of the error made by a linear model trained on DN to predict the value of the output in the same dataset DN.
In order to obtain the MISE term we derive analytically the expected sum of squared errors of a linear model trained on DN and used to predict, for the same training inputs X, a set of outputs Y_ts distributed according to the same linear law (9.1.13) but independent of the training output Y:
E_DN,Yts[(Y_ts − Xβ̂)^T (Y_ts − Xβ̂)]
= E_DN,Yts[(Y_ts − Xβ + Xβ − Xβ̂)^T (Y_ts − Xβ + Xβ − Xβ̂)]
= E_DN,Yts[(W_ts + Xβ − Xβ̂)^T (W_ts + Xβ − Xβ̂)]
= N σ²_w + E_DN[(Xβ − Xβ̂)^T (Xβ − Xβ̂)]
Since
Xβ − Xβ̂ = Xβ − X (X^T X)^{-1} X^T Y = Xβ − X (X^T X)^{-1} X^T (Xβ + W) = −X (X^T X)^{-1} X^T W
we obtain
N σ²_w + E_DN[(Xβ − Xβ̂)^T (Xβ − Xβ̂)]
= N σ²_w + E_DN[(W^T X (X^T X)^{-1} X^T)(X (X^T X)^{-1} X^T W)]
= N σ²_w + E_DN[W^T X (X^T X)^{-1} X^T W]
= N σ²_w + σ²_w tr(X (X^T X)^{-1} X^T)
= N σ²_w + σ²_w tr(X^T X (X^T X)^{-1})
= N σ²_w + σ²_w tr(I_p) = σ²_w (N + p)
By dividing the above quantity by N we obtain
MISE = (1 + p/N) σ²_w    (9.1.29)
From (9.1.28) and (9.1.29) it follows that the empirical error M̂ISE_emp is a biased estimate of MISE:
E_DN[M̂ISE_emp] = E_DN[e^T e / N] = σ²_w (1 − p/N) ≠ MISE = σ²_w (1 + p/N)    (9.1.30)
As a consequence, if we replace M̂ISE_emp with
e^T e / N + 2 σ²_w p/N    (9.1.31)
we correct the bias and we obtain an unbiased estimator of the MISE generalisation error. Nevertheless, this estimator requires an estimate of the noise variance.
R script
The R script ee.R performs a Monte Carlo validation of (9.1.30).
•
Example
Let {y1, . . . , yN} ← F_y be the training set. Consider the simplest linear predictor of the output variable: the average μ̂_y (i.e. p = 1). This means that
ŷi = (Σ_{i=1}^{N} yi)/N = μ̂_y,   i = 1, . . . , N
We want to show that, even for this simple estimator, the empirical error is a biased estimator of the quality of this predictor. Let μ be the mean of the r.v. y and let us write y as
y = μ + w
where E[w] = 0 and Var[w] = σ². Let {z1, . . . , zN} ← F_y be a test set coming from the same distribution underlying DN. Let us compute the expected empirical error and the mean integrated squared error.
Since E[μ̂_y] = μ and Var[μ̂_y] = σ²/N,
N · MISE = E_DN,Yts[Σ_{i=1}^{N} (zi − μ̂_y)²] = E_DN,w[Σ_{i=1}^{N} (μ + wi − μ̂_y)²]
        = N σ² + Σ_{i=1}^{N} E_DN[(μ − μ̂_y)²]
        = N σ² + N (σ²/N) = (N + 1) σ²
Instead, since σ̂²_y = (Σ_{i=1}^{N} (yi − μ̂_y)²)/(N − 1) and E[σ̂²_y] = σ²,
E_DN[Σ_{i=1}^{N} (yi − μ̂_y)²] = E_DN[(N − 1) σ̂²_y] = (N − 1) σ² ≠ N · MISE
It follows that, even for a simple estimator like the estimator of the mean, the empirical error is a biased estimate of the accuracy (see the R file ee_mean.R).
•
9.1.16.2 The PSE and the FPE
In the previous section we derived that M̂ISE_emp is a biased estimate of MISE and that the addition of the correction term 2σ²_w p/N makes it unbiased.
Suppose we have an estimate σ̂²_w of σ²_w. By replacing it into the expression (9.1.31) we obtain the so-called Predicted Square Error (PSE) criterion
PSE = M̂ISE_emp + 2 σ̂²_w p/N    (9.1.32)
In particular, if we take as estimate of σ²_w the quantity
σ̂²_w = SSE_emp/(N − p) = (N/(N − p)) M̂ISE_emp
we obtain the so-called Final Prediction Error (FPE)
FPE = ((1 + p/N)/(1 − p/N)) M̂ISE_emp    (9.1.33)
Figure 9.2: Estimation of the generalisation error of h_m, m = 2, . . . , 7 returned by the empirical error.
The PSE and the FPE criteria allow us to replace the empirical risk with a more accurate estimate of the generalisation error of a linear model. Although their expression is easy to compute (a small helper computing both is sketched below), it is worth reminding that their derivation relies on the assumption that the stochastic input/output dependence has the linear form (9.1.14).
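For instance, a small helper (ours, not one of the book's scripts) computing both criteria from the residual vector e of a linear model with p parameters:

pse.fpe <- function(e, p) {
  N <- length(e)
  emp <- mean(e^2)             ## empirical risk
  s2 <- sum(e^2) / (N - p)     ## unbiased estimate of the noise variance
  c(PSE = emp + 2 * s2 * p / N,
    FPE = (1 + p / N) / (1 - p / N) * emp)
}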
R script
Let us consider an input/output dependence
y = f(x) + w = 1 + x + x² + x³ + w    (9.1.34)
where w ∼ N(0, 1) and x ∼ U(−1, 1). Suppose that a dataset DN of N = 100 input/output observations is drawn from the joint distribution of ⟨x, y⟩. The R script fpe.R assesses the prediction accuracy of 7 different models having the form
h_m(x) = β̂0 + Σ_{j=1}^{m} β̂j x^j    (9.1.35)
by using the empirical risk and the FPE measure. These results are compared with the generalisation error measured by
MISE_m = (1/N) Σ_{i=1}^{N} (h_m(xi) − f(xi))²    (9.1.36)
The empirical risk and the FPE values for m = 2, . . . , 7 are plotted in Figures 9.2 and 9.3, respectively. The values MISE_m are plotted in Figure 9.4. It is evident, as confirmed by Figure 9.4, that the best model should be h3(x) since it has the same analytical structure as f(x). However, the empirical risk is not able to detect this and returns as the best model the one with the highest complexity (m = 7). This is not the case for FPE which, by properly correcting the M̂ISE_emp value, is able to select the optimal model.
•
Figure 9.3: Estimation of the generalisation error of h_m, m = 2, . . . , 7 returned by the FPE.
Figure 9.4: Computation of the generalisation error of h_m, m = 2, . . . , 7 by (9.1.36).
Figure 9.5: Leave-one-out for linear models. The leave-one-out error can be computed in two equivalent ways: the slowest way (on the right), which repeats N times the training and the test procedure, and the fastest way (on the left), which performs only once the parametric identification and the computation of the PRESS statistic.
9.1.17 The PRESS statistic
Section 7.10.1 introduced cross-validation to provide a reliable estimate of the generalisation error GN. The disadvantage of this approach is that it requires the training process to be repeated l times, implying a large computational effort. However, in the linear case the PRESS (Prediction Sum of Squares) statistic [7] returns the leave-one-out cross-validation error at a reduced computational cost (Fig. 9.5). PRESS relies on a simple formula which returns the leave-one-out (l-o-o) error as a by-product of the parametric identification of β̂ in Eq. (10.1.41). Consider a training set DN in which for N times
1. we set aside the ith observation ⟨xi, yi⟩ (i = 1, . . . , N) from the training set,
2. we use the remaining N − 1 observations to estimate the linear regression coefficients β̂^{−i},
3. we use β̂^{−i} to predict the target in xi.
The leave-one-out residual is
e_i^loo = yi − ŷ_i^{−i} = yi − xi^T β̂^{−i},   i = 1, . . . , N    (9.1.37)
The PRESS statistic is an efficient way to compute the l-o-o residuals on the basis of the single regression performed on the whole training set. This allows a fast cross-validation without repeating N times the leave-one-out procedure. The PRESS procedure can be described as follows:
1. use the whole training set to estimate the linear regression coefficients β̂. This procedure is performed only once and returns as a by-product the Hat matrix (see Section 9.1.15)
H = X (X^T X)^{-1} X^T    (9.1.38)
2. compute the residual vector e, whose ith term is ei = yi − xi^T β̂,
3. use the PRESS statistic to compute e_i^loo as
e_i^loo = ei / (1 − H_ii)    (9.1.39)
where H_ii is the ith diagonal term of the matrix H.
Note that (9.1.39) is not an approximation of (9.1.37) but simply a faster way of computing the leave-one-out residual e_i^loo.
Let us now derive the formula of the PRESS statistic. Matrix manipulations show that
X^T X − xi xi^T = X_{−i}^T X_{−i}    (9.1.40)
where X_{−i}^T X_{−i} is the X^T X matrix obtained by putting the ith row aside. Using the relation (B.9.13) we have
(X_{−i}^T X_{−i})^{-1} = (X^T X − xi xi^T)^{-1} = (X^T X)^{-1} + ((X^T X)^{-1} xi xi^T (X^T X)^{-1})/(1 − H_ii)    (9.1.41)
and
β̂^{−i} = (X_{−i}^T X_{−i})^{-1} X_{−i}^T y_{−i} = [(X^T X)^{-1} + ((X^T X)^{-1} xi xi^T (X^T X)^{-1})/(1 − H_ii)] X_{−i}^T y_{−i}    (9.1.42)
where y_{−i} is the target vector with the ith example set aside.
From (9.1.37) and (9.1.42) we have
e_i^loo = yi − xi^T β̂^{−i}
= yi − xi^T [(X^T X)^{-1} + ((X^T X)^{-1} xi xi^T (X^T X)^{-1})/(1 − H_ii)] X_{−i}^T y_{−i}
= yi − ((1 − H_ii) xi^T (X^T X)^{-1} X_{−i}^T y_{−i} + H_ii xi^T (X^T X)^{-1} X_{−i}^T y_{−i})/(1 − H_ii)
= ((1 − H_ii) yi − xi^T (X^T X)^{-1} X_{−i}^T y_{−i})/(1 − H_ii)
= ((1 − H_ii) yi − xi^T (X^T X)^{-1} (X^T y − xi yi))/(1 − H_ii)
= ((1 − H_ii) yi − ŷi + H_ii yi)/(1 − H_ii)
= (yi − ŷi)/(1 − H_ii) = ei/(1 − H_ii)    (9.1.43)
where we used X_{−i}^T y_{−i} + xi yi = X^T y and xi^T (X^T X)^{-1} X^T y = ŷi. Thus, the leave-one-out estimate of the mean integrated squared error is
Ĝ_loo = (1/N) Σ_{i=1}^{N} ((yi − ŷi)/(1 − H_ii))²    (9.1.44)
Since from (9.1.23) the sum of the diagonal terms of the H matrix is p, the average value of H_ii is p/N. It follows that the PRESS may be approximated by
Ĝ_loo ≈ (1/N) Σ_{i=1}^{N} ((yi − ŷi)/(1 − p/N))²
which leads us to the GCV formula (8.8.34).
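The identity (9.1.39) may be verified numerically; the sketch below (ours) compares the PRESS residuals with an explicit leave-one-out loop.

set.seed(0)
N <- 30; x <- rnorm(N); y <- 1 + 2 * x + rnorm(N)
X <- cbind(1, x)
H <- X %*% solve(t(X) %*% X) %*% t(X)   ## hat matrix
e <- y - H %*% y                        ## training residuals
e.press <- e / (1 - diag(H))            ## PRESS l-o-o residuals
e.loo <- sapply(1:N, function(i) {      ## explicit l-o-o residuals
  fit <- lm(y ~ x, subset = -i)
  y[i] - predict(fit, newdata = data.frame(x = x[i]))
})
max(abs(e.press - e.loo))               ## ~ 0: the two ways coincide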
9.1.18 Dual linear formulation
Consider a linear regression problem with [N, n] input matrix X and [N, 1] output vector y. The conventional least-squares solution is the [p, 1] parameter vector (9.1.18) where p = n + 1. This formulation is common in conventional statistical settings where the number of observations is supposed to be much larger than the number of variables.
However, machine learning may be confronted with high-dimensional settings where the ratio of observations to features is low: this would imply a very large value of p and risks of ill-conditioning of the numerical solution. In this case it is interesting to consider a dual formulation of the least-squares problem based on (B.9.14). In this formulation
β̂ = (X^T X)^{-1} X^T y = X^T X (X^T X)^{-2} X^T y = X^T α = Σ_{i=1}^{N} αi xi
where α = X (X^T X)^{-2} X^T y is a [N, 1] vector and xi is the [n, 1] vector which represents the ith observation. It follows that if N << p, the dual formulation has fewer parameters than the conventional one, with advantages in terms of storage requirements and numerical conditioning.
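A quick numerical check (our sketch; here N > n, so the dual form brings no saving and is used only to verify the equivalence):

set.seed(0)
N <- 50; n <- 3
X <- matrix(rnorm(N * n), N, n)
y <- X %*% c(1, -1, 2) + rnorm(N)
A <- solve(crossprod(X))                 ## (X'X)^{-1}
alpha <- X %*% A %*% A %*% t(X) %*% y    ## [N,1] dual parameters
beta.dual <- t(X) %*% alpha              ## beta = X' alpha
beta.primal <- solve(crossprod(X), t(X) %*% y)
max(abs(beta.dual - beta.primal))        ## ~ 0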
9.1.19 The weighted least-squares
The assumption of homogeneous variance of the noise w made in Eq. (9.1.14) is often violated in practical situations. Suppose we relax the assumption that Var(w) = σ²_w I_N, with I_N the identity matrix, and assume instead that there is a positive definite matrix V for which Var(w) = V. We may wish to consider
V = diag[σ²_1, σ²_2, . . . , σ²_N]    (9.1.45)
in which case we are assuming uncorrelated errors with error variances that vary from observation to observation. As a result it would seem reasonable that the estimator of β should take this into account by weighting the observations in some way that allows for the differences in the precision of the results. The function being minimised is then no longer (9.1.16) but depends on V and is given by
(y − Xβ̂)^T V^{-1} (y − Xβ̂)    (9.1.46)
The estimate of β is then
β̂ = (X^T V^{-1} X)^{-1} X^T V^{-1} y    (9.1.47)
The corresponding estimator is called the generalised least-squares estimator and has the following properties: i) it is unbiased, that is E[β̂] = β, ii) under the assumption w ∼ N(0, V) it is the minimum variance estimator among all the unbiased estimators.
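A sketch (ours) of the diagonal-V case, compared with the weights argument of lm() (whose weights are the inverse error variances):

set.seed(0)
N <- 50; x <- runif(N)
s2 <- runif(N, 0.1, 2)                   ## per-observation error variances
y <- 1 + 2 * x + rnorm(N, sd = sqrt(s2))
X <- cbind(1, x); Vinv <- diag(1 / s2)
beta.gls <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)
cbind(beta.gls, coef(lm(y ~ x, weights = 1 / s2)))   ## identical estimates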
9.1.20 Recursive least-squares
In many analytics tasks, data records are not statically available but have to be processed and analysed continuously rather than in batches. Examples are the data streams generated by sensors (notably IoT), financial, business intelligence or adaptive control applications. In those cases it is useful not to restart the model estimation from scratch but simply to update the model on the basis of the newly collected data. One appealing feature of least-squares estimates is that they can be updated at a lower cost than their batch counterpart.
Let us rewrite the least-squares estimator (9.1.18) for a training set of N observations as
β̂(N) = (X_(N)^T X_(N))^{-1} X_(N)^T Y_(N)
where the subscript (N) denotes the number of observations used for the estimation. Suppose that a new data point ⟨x_{N+1}, y_{N+1}⟩ becomes available.
Instead of recomputing the estimate β̂(N+1) by using all the N + 1 available data, we want to derive β̂(N+1) as an update of β̂(N). This problem is solved by the so-called recursive least-squares (RLS) estimation [26].
If a single new example ⟨x_{N+1}, y_{N+1}⟩, with x_{N+1} a [1, p] vector, is added to the training set, the X matrix acquires a new row and β̂(N+1) can be written as
β̂(N+1) = ([X_(N); x_{N+1}]^T [X_(N); x_{N+1}])^{-1} [X_(N); x_{N+1}]^T [Y_(N); y_{N+1}]
where [X_(N); x_{N+1}] denotes the matrix X_(N) augmented with the new row x_{N+1}, and [Y_(N); y_{N+1}] the vector Y_(N) augmented with y_{N+1}. By defining the [p, p] matrix
S_(N) = X_(N)^T X_(N)
we have
S_(N+1) = X_(N+1)^T X_(N+1) = X_(N)^T X_(N) + x_{N+1}^T x_{N+1} = S_(N) + x_{N+1}^T x_{N+1}    (9.1.48)
Since
[X_(N); x_{N+1}]^T [Y_(N); y_{N+1}] = X_(N)^T Y_(N) + x_{N+1}^T y_{N+1}
and
S_(N) β̂(N) = (X_(N)^T X_(N)) [(X_(N)^T X_(N))^{-1} X_(N)^T Y_(N)] = X_(N)^T Y_(N)
we obtain
S_(N+1) β̂(N+1) = S_(N) β̂(N) + x_{N+1}^T y_{N+1}
             = (S_(N+1) − x_{N+1}^T x_{N+1}) β̂(N) + x_{N+1}^T y_{N+1}
             = S_(N+1) β̂(N) − x_{N+1}^T x_{N+1} β̂(N) + x_{N+1}^T y_{N+1}
or equivalently
β̂(N+1) = β̂(N) + S_(N+1)^{-1} x_{N+1}^T (y_{N+1} − x_{N+1} β̂(N))    (9.1.49)
9.1.20.1 1st Recursive formulation
From (9.1.48) and (9.1.49) we obtain the following recursive formulation:
S_(N+1) = S_(N) + x_{N+1}^T x_{N+1}
γ_(N+1) = S_(N+1)^{-1} x_{N+1}^T
e = y_{N+1} − x_{N+1} β̂(N)
β̂(N+1) = β̂(N) + γ_(N+1) e
where the term β̂(N+1) is expressed as a function of the old estimate β̂(N) and the new observation ⟨x_{N+1}, y_{N+1}⟩. This formulation requires the inversion of the [p × p] matrix S_(N+1). This operation is computationally expensive but, fortunately, using a matrix inversion theorem, an incremental formula for S^{-1} can be found.
9.1.20.2 2nd Recursive formulation
Once defined
V_(N) = S_(N)^{-1} = (X_(N)^T X_(N))^{-1}
we have (S_(N+1))^{-1} = (S_(N) + x_{N+1}^T x_{N+1})^{-1} and
V_(N+1) = V_(N) − V_(N) x_{N+1}^T (1 + x_{N+1} V_(N) x_{N+1}^T)^{-1} x_{N+1} V_(N)    (9.1.50)
       = V_(N) − (V_(N) x_{N+1}^T x_{N+1} V_(N)) / (1 + x_{N+1} V_(N) x_{N+1}^T)    (9.1.51)
From (9.1.50) and (9.1.49) we obtain a second recursive formulation:
V_(N+1) = V_(N) − (V_(N) x_{N+1}^T x_{N+1} V_(N)) / (1 + x_{N+1} V_(N) x_{N+1}^T)
γ_(N+1) = V_(N+1) x_{N+1}^T
e = y_{N+1} − x_{N+1} β̂(N)
β̂(N+1) = β̂(N) + γ_(N+1) e
(9.1.52)
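A compact implementation (our own sketch) of the second recursive formulation; with a vague initialisation (large a, see the next section) the final recursive estimate is numerically indistinguishable from the batch least-squares solution.

rls.update <- function(beta, V, xnew, ynew) {   ## xnew is a [1,p] row vector
  V <- V - (V %*% t(xnew) %*% xnew %*% V) /
       as.numeric(1 + xnew %*% V %*% t(xnew))
  gamma <- V %*% t(xnew)
  e <- ynew - as.numeric(xnew %*% beta)
  list(beta = beta + gamma * e, V = V)
}
set.seed(0)
N <- 200; X <- cbind(1, rnorm(N)); y <- X %*% c(1, -2) + rnorm(N)
beta <- matrix(0, 2, 1); V <- diag(2) * 1e6     ## vague initialisation
for (i in 1:N) {
  upd <- rls.update(beta, V, X[i, , drop = FALSE], y[i])
  beta <- upd$beta; V <- upd$V
}
cbind(beta, solve(crossprod(X), t(X) %*% y))    ## recursive vs batch solution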
9.1.20.3 RLS initialisation
Both recursive formulations presented above require the initialisation values β̂(0) and V_(0). One way to avoid choosing these initial values is to collect the first N data points, to compute β̂(N) and V_(N) directly from
V_(N) = (X_(N)^T X_(N))^{-1},   β̂(N) = V_(N) X_(N)^T Y_(N)
and to start iterating from the (N + 1)th point. Otherwise, in the case of a generic initialisation β̂(0) and V_(0) we have the following relations:
V_(N) = (V_(0)^{-1} + X_(N)^T X_(N))^{-1}
β̂(N) = V_(N) (X_(N)^T Y_(N) + V_(0)^{-1} β̂(0))
A common choice is to put
V_(0) = aI,   a > 0
Since V_(0) represents the variance of the estimator, choosing a very large a is equivalent to considering the initial estimation of β as very uncertain. By setting a equal to a large number, the RLS algorithm will diverge very rapidly from the initialisation β̂(0). Therefore, we can force the RLS variance and parameters to be arbitrarily close to the ordinary least-squares values, regardless of β̂(0).
In any case, in the absence of further information, the initial value β̂(0) is usually set equal to a zero vector.
9.1.20.4 RLS with forgetting factor
In some adaptive configurations it can be useful not to give equal importance to all the historical data but to assign higher weights to the most recent data (and then to forget the oldest ones). This may happen when the phenomenon underlying the data is non-stationary or when we want to approximate a nonlinear dependence by using a linear model which is local in time. Both these situations are common in adaptive control problems.
Figure 9.6: RLS fitting of a nonlinear function where the arrival order of the data is from left to right.
RLS techniques can deal with these situations through a modification of the formulation (9.1.52), obtained by adding a forgetting factor μ < 1:
V_(N+1) = (1/μ) [V_(N) − (V_(N) x_{N+1}^T x_{N+1} V_(N)) / (1 + x_{N+1} V_(N) x_{N+1}^T)]
γ_(N+1) = V_(N+1) x_{N+1}^T
e = y_{N+1} − x_{N+1} β̂(N)
β̂(N+1) = β̂(N) + γ_(N+1) e
Note that: (i) the smaller μ, the stronger the forgetting; (ii) for μ = 1 we recover the conventional RLS formulation.
R script
The R script lin_rls.R implements the RLS fitting of a nonlinear univariate function. The simulation shows how the fitting evolves as the data xi, yi, i = 1, . . . , N are collected. Note that the values xi, i = 1, . . . , N are increasingly ordered. This means that x is not random and that the oldest collected values are the ones with the lowest xi.
The final fitting for a forgetting factor μ = 0.9 is shown in Figure 9.6. Note that the linear fitting concerns only the rightmost points since the values on the left, which are also the oldest ones, are forgotten.
•
9.2 Linear approaches to classification
The methods presented so far deal with linear regression tasks. Those methods may be easily extended to classification once we consider that in a binary 0/1 classification case the conditional expectation coincides with the conditional probability:
E[y|x] = 1 · Prob{y = 1|x} + 0 · Prob{y = 0|x} = Prob{y = 1|x}    (9.2.53)
In other words, by encoding the two classes with 0/1 values and estimating the conditional expectation with regression techniques, we estimate as well the conditional probability. Such a value may be used to return the most probable class associated to a query point x.
This section will present some additional strategies to learn linear boundaries between classes. The first strategy relies on modelling the class conditional densities and deriving from them the equation of the boundary region. The other strategies aim to learn directly the equations of separating hyperplanes.
9.2.1 Linear discriminant analysis
Let x ∈ R^n denote a real-valued random input vector and y a categorical random output variable that takes values in the set {c1, . . . , cK} such that
Σ_{k=1}^{K} Prob{y = ck|x} = 1
A classifier can be represented in terms of a set of K discriminant functions g_k(x), k = 1, . . . , K, such that the classifier applies the following decision rule [61]: assign a feature vector x to a class ŷ(x) = ck if
k = arg max_j g_j(x)    (9.2.54)
Section 7.3 showed that in the case of a zero-one loss function (Equation (7.3.13)), the optimal classifier corresponds to a maximum a posteriori discriminant function g_k(x) = Prob{y = ck|x}. This means that if we are able to define the K functions g_k(·), k = 1, . . . , K, and we apply the classification rule (9.2.54) to an input x, we obtain a classifier which is equivalent to the Bayes one.
The discriminant functions divide the feature space into K decision regions D_k, where a decision region D_k is a region of the input space X where the discriminant classifier returns the class ck for each x ∈ D_k. The regions are separated by decision boundaries, i.e. surfaces in the domain of x where ties occur among the largest discriminant functions.
Example
Consider a binary classification problem where y can take values in {c1, c2} and x ∈ R². Let g1(x) = 3x1 + x2 + 2 and g2(x) = 2x1 + 2 be the two discriminant functions associated to the classes c1 and c2, respectively. The classifier will return the class c1 if
3x1 + x2 + 2 > 2x1 + 2 ⇔ x1 > −x2
The decision regions D1 and D2 are depicted in Figure 9.7.
•
We can multiply all the discriminant functions by the same positive constant or shift them by the same additive constant without influencing the decision [61]. More generally, if we replace every g_k(z) by f(g_k(z)), where f(·) is a monotonically increasing function, the resulting classification is unchanged.
For example, in the case of a zero/one loss function, any of the following choices gives an identical classification result and returns a Bayes classifier:
g_k(x) = Prob{y = ck|x} = p(x|y = ck) P(y = ck) / Σ_{k=1}^{K} p(x|y = ck) P(y = ck)    (9.2.55)
g_k(x) = p(x|y = ck) P(y = ck)    (9.2.56)
g_k(x) = ln p(x|y = ck) + ln P(y = ck)    (9.2.57)

Figure 9.7: Decision boundary and decision regions for the binary discriminant functions g1(x) = 3x1 + x2 + 2 and g2(x) = 2x1 + 2.
9.2.1.1 Discriminant functions in the Gaussian case
Let us consider a binary classification task where the inverse conditional densities are multivariate normal (Section 3.7), i.e. p(x = x|y = ck) ∼ N(μk, Σk) where x ∈ R^n, μk is a [n, 1] vector and Σk is a [n, n] covariance matrix. Since
p(x = x|y = ck) = (1/((√2π)^n √det(Σk))) exp(−½ (x − μk)^T Σk^{-1} (x − μk))
from (9.2.57) we obtain
g_k(x) = ln p(x|y = ck) + ln P(y = ck)    (9.2.58)
      = −½ (x − μk)^T Σk^{-1} (x − μk) − (n/2) ln 2π − ½ ln det(Σk) + ln P(y = ck)    (9.2.59)
If we make no assumptions about Σk, the discriminant function is quadratic. Now let us consider a simpler case where all the distributions have the same diagonal covariance matrix Σk = σ²I, where I is the [n, n] identity matrix. It follows that
det(Σk) = σ^{2n},   Σk^{-1} = (1/σ²) I
are independent of k and can be ignored by the decision rule (9.2.54). From (9.2.58) we obtain the simpler discriminant function
g_k(x) = −‖x − μk‖²/(2σ²) + ln P(y = ck)
      = −(x − μk)^T (x − μk)/(2σ²) + ln P(y = ck)
      = −(1/(2σ²)) [x^T x − 2 μk^T x + μk^T μk] + ln P(y = ck)
However, since the quadratic term x^T x is the same for all k, this is equivalent to a linear discriminant function
g_k(x) = w_k^T x + w_{k0}    (9.2.60)
where w_k is a [n, 1] vector
w_k = (1/σ²) μk    (9.2.61)
and
w_{k0} = −(1/(2σ²)) μk^T μk + ln P(y = ck)    (9.2.62)
In the two-class problem, the decision boundary (i.e. the set of points where g1(x) = g2(x)) can be obtained by solving the identity
w1^T x + w10 = w2^T x + w20 ⇔ (w1 − w2)^T x − (w20 − w10) = 0
We obtain a hyperplane having equation
w^T (x − x0) = 0    (9.2.63)
where
w = (μ1 − μ2)/σ²
and
x0 = ½ (μ1 + μ2) − (σ²/‖μ1 − μ2‖²) ln(Prob{y = c1}/Prob{y = c2}) (μ1 − μ2)
This can be verified by the fact that w^T x0 = w20 − w10. The equation (9.2.63) defines a hyperplane through the point x0 and orthogonal to the vector w.
9.2.1.2 Uniform prior case
If the prior probabilities P(y = ck) are identical for the K classes, then the term ln P(y = ck) is a constant that can be ignored. In this case, it can be shown that the optimal decision rule is a minimum-distance classifier [61]. This means that in order to classify an input x, it measures the Euclidean distance ‖x − μk‖ from x to each of the K mean vectors, and assigns x to the category of the nearest mean. It can be shown that in the more generic case Σk = Σ, the discriminant rule is based on minimising the Mahalanobis distance:
ĉ(x) = arg min_k (x − μk)^T Σ^{-1} (x − μk)    (9.2.64)
R script
The R script discri.R considers a binary classification task (c1 = red, c2 = green) where x ∈ R² and the inverse conditional distributions of the two classes are N(μ1, σ²I) and N(μ2, σ²I), respectively. Suppose that the two a priori probabilities are identical, that σ = 1, μ1 = [−1, −2]^T and μ2 = [2, 5]^T. The positions of 100 points randomly drawn from N(μ1, σ²I), of 100 points drawn from N(μ2, σ²I), together with the optimal decision boundary computed by (9.2.63), are plotted in Figure 9.8.
The R script discri2.R shows instead the limitations of the LDA approach when the assumption of Gaussian unimodal class-conditional distributions is not respected. Suppose that the two a priori probabilities are identical, but that the class-conditional distribution of the green class is a mixture of two Gaussians. The positions of 1000 points randomly drawn from the two class-conditional distributions, together with the LDA decision boundary computed by (9.2.63), are plotted in Figure 9.9.
•
Figure 9.8: Binary classification problem: distribution of inputs and linear decision boundary.
Figure 9.9: Binary classification problem where one class distribution is bimodal:
distribution of inputs and linear decision boundary. Since the classification task is
not linearly separable the LDA classifier performs poorly.
Figure 9.10: Several hyperplanes separating the two classes (blue and red).
9.2.1.3 LDA parameter identification
In a real setting, we do not have access to the quantities μk, Σ and Prob{y = ck} needed to compute the boundary (9.2.63). Before applying the discrimination rule above, we need to estimate those quantities from the dataset DN:
P̂rob{y = ck} = Nk/N    (9.2.65)
μ̂k = (Σ_{i: yi = ck} xi)/Nk    (9.2.66)
Σ̂ = (Σ_{k=1}^{K} Σ_{i: yi = ck} (xi − μ̂k)(xi − μ̂k)^T)/(N − K)    (9.2.67)
where Nk is the number of observations labeled with the class ck; (9.2.67) is also known as the pooled covariance [99].
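A sketch (ours) of these plug-in estimates for a binary task with bivariate inputs:

set.seed(0)
N <- 200
y <- sample(1:2, N, replace = TRUE)        ## class labels
mu <- list(c(-1, -2), c(2, 5))
X <- t(sapply(y, function(k) mu[[k]] + rnorm(2)))
prob.hat <- table(y) / N                   ## priors (9.2.65)
mu.hat <- lapply(1:2, function(k) colMeans(X[y == k, ]))  ## means (9.2.66)
Sigma.hat <- Reduce(`+`, lapply(1:2, function(k) {        ## pooled cov (9.2.67)
  Xc <- scale(X[y == k, ], center = mu.hat[[k]], scale = FALSE)
  t(Xc) %*% Xc
})) / (N - 2)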
9.2.2 Perceptrons
Consider a binary classification task (Figure 9.10) where the two classes are denoted by +1 and −1. The previous section presented a technique to separate input data by a linear boundary by making assumptions on the class conditional densities and their covariances. In a generic setting, however, the problem is ill-posed and there are infinitely many possible separating hyperplanes (Figure 9.10), characterised by the equation
β0 + x^T β = 0    (9.2.68)
If x ∈ R², this equation represents a line. In the generic case (x ∈ R^n) some properties hold for all hyperplanes:
• Since for any two points x1 and x2 lying on the hyperplane we have
(x1 − x2)^T β = 0,
the vector normal to the hyperplane (Figure 9.11) is given by
β* = β/‖β‖
• The signed distance of a point x to the hyperplane (Figure 9.11) is called the geometric margin and is given by
β*^T (x − x0) = (x^T β − x0^T β)/‖β‖ = (1/‖β‖)(x^T β + β0)
where x0 is any point lying on the hyperplane, so that x0^T β = −β0.

Figure 9.11: Bi-dimensional space (n = 2): vector β* normal to the hyperplane and distance of a point from a hyperplane.
A perceptron is a classifier that uses the sign of the linear combination h(x, β̂) = β̂0 + β̂^T x to perform classification [98]. The class returned by a perceptron for a given input xq is
+1 if β̂0 + xq^T β̂ = β̂0 + Σ_{j=1}^{n} xqj β̂j > 0
−1 if β̂0 + xq^T β̂ = β̂0 + Σ_{j=1}^{n} xqj β̂j < 0
In other terms, the decision rule is given by
h(x) = sgn(β̂0 + x^T β̂)    (9.2.69)
For all well-classified points in the training set the following relation holds
γi = yi (xi^T β̂ + β̂0) > 0
where the quantity γi is called the functional margin of the pair ⟨xi, yi⟩ with respect to the hyperplane (9.2.68). Misclassifications in the training set occur when
yi = 1 but β̂0 + β̂^T xi < 0,   or   yi = −1 but β̂0 + β̂^T xi > 0
that is, when yi (β̂0 + β̂^T xi) < 0.
The parametric identification step of a perceptron learning procedure aims at finding the values {β̂, β̂0} that minimise the quantity
SSE_emp(β̂, β̂0) = −Σ_{i∈M} yi (xi^T β̂ + β̂0)
where M is the subset of misclassified points in the training set. Note that this quantity is non-negative and proportional to the distance of the misclassified points to the hyperplane. Since the gradients are
∂SSE_emp(β̂, β̂0)/∂β̂ = −Σ_{i∈M} yi xi,   ∂SSE_emp(β̂, β̂0)/∂β̂0 = −Σ_{i∈M} yi
a batch gradient-descent minimisation procedure (Section 8.6.2.3) or the online version (Section 8.6.3) can be adopted; a sketch of the online version follows. This procedure is guaranteed to converge provided there exists a hyperplane that correctly classifies the data: this configuration is called linearly separable.
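A minimal sketch (ours) of the online update for a linearly separable 2D dataset; the loop repeats the gradient step on one misclassified point at a time until none is left.

set.seed(0)
N <- 50
X <- rbind(matrix(rnorm(N, mean = 2), N / 2, 2),    ## class +1 cloud
           matrix(rnorm(N, mean = -2), N / 2, 2))   ## class -1 cloud
y <- c(rep(1, N / 2), rep(-1, N / 2))
beta <- c(0, 0); beta0 <- 0; eta <- 0.1
repeat {
  M <- which(y * (X %*% beta + beta0) <= 0)   ## misclassified points
  if (length(M) == 0) break
  i <- M[1]
  beta <- beta + eta * y[i] * X[i, ]          ## gradient step on one point
  beta0 <- beta0 + eta * y[i]
}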
Although the perceptron set the foundations for much of the following research in machine learning, a number of problems with this algorithm have to be mentioned [98]:
• When the data are separable, there are many possible solutions, and which one is found depends on the initialisation of the gradient method.
• When the data are not separable, the algorithm will not converge.
• Even for a separable problem, the convergence of the gradient minimisation can be very slow.
R script
The script hyperplane.R visualises the evolution of the separating hyperplane during the perceptron learning procedure. We invite the reader to run the script for different numbers of points and different data distributions (e.g. by changing the mean and the variance of the 2D Gaussians).
•
A possible solution to the separating hyperplane problem has been proposed by the
SVM technique.
9.2.3 Support vector machines
This technique relies on an optimisation approach to compute the separating hyperplane.
Let us define as the geometric margin of a hyperplane with respect to a training dataset the minimum of the geometric margins of the training points. Also, the margin of a training set is the maximum geometric margin over all hyperplanes. The hyperplane attaining such a maximum is known as a maximal margin hyperplane.
The SVM approach [186] computes the maximal margin hyperplane for a training set. In other words, the SVM optimal separating hyperplane is the one which separates the two classes by maximising the distance to the closest point from both classes. This approach provides a unique solution to the separating hyperplane problem and was shown to lead to good classification performance on real data. The search for the optimal hyperplane is modelled as the optimisation problem
max_{β,β0} C    (9.2.70)
subject to (1/‖β‖) yi (xi^T β + β0) ≥ C for i = 1, . . . , N    (9.2.71)
where the constraint ensures that all the points are at least a distance C from the decision boundary defined by β and β0. The SVM parametric identification step seeks the largest C that satisfies the constraints and the associated parameters.
Since the hyperplane (9.2.68) is equivalent to the original hyperplane where the parameters \beta_0 and \beta have been multiplied by a constant, we can set \|\beta\| = 1/C. The maximisation problem can be reformulated in a minimisation form

\min_{\beta, \beta_0} \frac{1}{2} \|\beta\|^2    (9.2.72)
subject to y_i (x_i^T \beta + \beta_0) \ge 1 for i = 1, \dots, N    (9.2.73)

where the constraints impose a margin around the linear decision of thickness 1/\|\beta\|. This problem is a convex optimisation problem (Appendix ) where the primal Lagrangian is

L_P(\beta, \beta_0) = \frac{1}{2} \|\beta\|^2 - \sum_{i=1}^N \alpha_i [y_i (x_i^T \beta + \beta_0) - 1]    (9.2.74)

and \alpha_i \ge 0 are the Lagrangian multipliers.
Setting the derivatives wrt \beta and \beta_0 to zero we obtain:

\beta = \sum_{i=1}^N \alpha_i y_i x_i, \qquad 0 = \sum_{i=1}^N \alpha_i y_i    (9.2.75)

Substituting these in the primal form (9.2.74) we obtain

L_D = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{k=1}^N \alpha_i \alpha_k y_i y_k x_i^T x_k    (9.2.76)

subject to \alpha_i \ge 0.
The dual optimisation problem is now

\max_\alpha \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{k=1}^N \alpha_i \alpha_k y_i y_k x_i^T x_k = \max_\alpha \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{k=1}^N \alpha_i \alpha_k y_i y_k \langle x_i, x_k \rangle    (9.2.77)

subject to 0 = \sum_{i=1}^N \alpha_i y_i,    (9.2.78)
\alpha_i \ge 0, \quad i = 1, \dots, N    (9.2.79)

where \langle x_i, x_k \rangle is the inner product of x_i and x_k.
Note that the problem formulation requires the computation of all the inner products \langle x_i, x_k \rangle, i = 1, \dots, N, k = 1, \dots, N. This boils down to the computation of the Gram matrix

G = X X^T    (9.2.80)
It can be shown that the optimal solution must satisfy the Karush-Kuhn-Tucker (KKT) condition

\alpha_i [y_i (x_i^T \beta + \beta_0) - 1] = 0, \quad \forall i

The above condition means that we are in either of these two situations:
1. y_i (x_i^T \beta + \beta_0) = 1, i.e. the point is on the boundary of the margin; then \alpha_i > 0.
2. y_i (x_i^T \beta + \beta_0) > 1, i.e. the point is not on the boundary of the margin; then \alpha_i = 0.
The training points having an index i such that \alpha_i > 0 are called the support vectors.
Given the solution \alpha and the \beta obtained from (9.2.75), the term \beta_0 is obtained by

\beta_0 = -\frac{1}{2} [\beta^T x^*(1) + \beta^T x^*(-1)]

where we denote by x^*(1) some (any) support vector belonging to the first class and by x^*(-1) a support vector belonging to the second class. Now, the decision function can be written as

h(x, \beta, \beta_0) = \text{sign}[x^T \beta + \beta_0]

or equivalently

h(x, \beta, \beta_0) = \text{sign} \left[ \sum_{\text{support vectors}} y_i \alpha_i \langle x_i, x \rangle + \beta_0 \right]    (9.2.81)

This is an attractive property of support vector machines: the classifier can be expressed as a function of a limited number of points of the training set, the so-called support vectors, which are on the boundaries. This means that in SVM all the points far from the class boundary do not play a major role, unlike in the linear discriminant rule, where the mean and the variance of the class distributions determine the separating hyperplane (see Equation (9.2.63)). It can be shown, also, that in the separable case

C = \frac{1}{\|\beta\|} = \frac{1}{\sqrt{\sum_{i=1}^N \alpha_i}}    (9.2.82)
R script
The R script svm.R considers a binary classification problem. It generates sets of separable data and builds a separating hyperplane by solving the problem (9.2.74). The training points belonging to the two classes (in red and blue), the separating hyperplane, the boundary of the margin and the support vectors (in black) are plotted for each training set (see Figure 9.12).
•
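As an illustration of the dual formulation, here is a minimal sketch solving (9.2.77)-(9.2.79) with the general-purpose quadratic-programming package quadprog; this is an assumption of this sketch, and svm.R may proceed differently. A small ridge term is added to the quadratic matrix since solve.QP requires strict positive definiteness.

library(quadprog)
set.seed(0)
N <- 40
X <- rbind(matrix(rnorm(N, mean = 2), N/2, 2),
           matrix(rnorm(N, mean = -2), N/2, 2))
y <- c(rep(1, N/2), rep(-1, N/2))
G <- X %*% t(X)                          # Gram matrix (9.2.80)
D <- (y %*% t(y)) * G + 1e-8 * diag(N)   # ridge for numerical definiteness
A <- cbind(y, diag(N))                   # first column: equality constraint (9.2.78)
sol <- solve.QP(Dmat = D, dvec = rep(1, N), Amat = A,
                bvec = rep(0, N + 1), meq = 1)
alpha <- sol$solution
sv <- which(alpha > 1e-5)                # support vectors
beta <- colSums(alpha[sv] * y[sv] * X[sv, , drop = FALSE])   # from (9.2.75)
i1 <- sv[y[sv] == 1][1]; i2 <- sv[y[sv] == -1][1]
beta0 <- -0.5 * (sum(X[i1, ] * beta) + sum(X[i2, ] * beta))  # beta_0 as in the text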
A modification of the formulation (9.2.70) occurs when we suppose that the classes are nonlinearly separable. In this case the dual problem (9.2.77) is unbounded. The idea is still to maximise the margin but by allowing some points to be misclassified. For each example \langle x_i, y_i \rangle we define the slack variable \xi_i and we relax the constraints (9.2.71) into

\frac{1}{\|\beta\|} y_i (x_i^T \beta + \beta_0) \ge C (1 - \xi_i) for i = 1, \dots, N    (9.2.83)
\xi_i \ge 0    (9.2.84)
\sum_{i=1}^N \xi_i \le \gamma    (9.2.85)

The value \xi_i represents the proportional amount by which the quantity y_i (x_i^T \beta + \beta_0) can be lower than C, and the norm \|\xi\| measures how much the training set fails to have a margin C. Note that since misclassifications occur when \xi_i > 1, the upper bound \gamma of \sum_{i=1}^N \xi_i represents the maximum number of allowed misclassifications in the training set.
Figure 9.12: Maximal margin hyperplane for a binary classification task with the support vectors in black.

It can be shown [98] that the maximisation (9.2.70) with the above constraints can be put in the equivalent quadratic form
\max_\alpha \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{k=1}^N \alpha_i \alpha_k y_i y_k x_i^T x_k    (9.2.86)

subject to 0 = \sum_{i=1}^N \alpha_i y_i,    (9.2.87)
0 \le \alpha_i \le \gamma, \quad i = 1, \dots, N    (9.2.88)

The decision function takes again the form (9.2.81), where \beta_0 is chosen so that y_i h(x_i) = 1 for any i such that 0 < \alpha_i < \gamma. The geometric margin takes the value

C = \left( \sum_{i=1}^N \sum_{k=1}^N \alpha_i \alpha_k y_i y_k x_i^T x_k \right)^{-1/2}    (9.2.89)
Note that the points for which the corresponding slack variables satisfy \xi_i > 0 are also the points for which \alpha_i = \gamma.
R script
The R script svm.R solves a non-separable problem by setting the boolean variable separable to FALSE. Figure 9.13 plots the training points belonging to the two classes (in red and blue), the separating hyperplane, the boundary of the margin, the support vectors (in black), the points of the red class for which the slack variable is positive (in yellow) and the points of the blue class for which the slack variable is positive (in green).
•
Once the value of \gamma is fixed, the parametric identification in the SVM approach boils down to a quadratic optimisation problem for which a large number of methods and numerical software packages exist. The value \gamma plays the role of a capacity hyper-parameter which bounds the total proportional amount by which classifications fall on the wrong side of the margin. In practice, the choice of this parameter requires a structural identification loop where the parameter \gamma is varied through a wide range of values and assessed through a validation strategy.

Figure 9.13: Maximal margin hyperplane for a non-separable binary classification task for different values of C: support vectors are in black, the slack points of the red class are in yellow and the slack points of the blue class are in green.
9.3 Conclusion
In this chapter we considered input/output regression problems where the relationship between input and output is linear, and classification problems where the optimal decision boundaries are linear.
The advantages of linear models are numerous:
• the least-squares \hat\beta estimate can be expressed in an analytical form and can be easily calculated through matrix computation;
• statistical properties of the estimator can be easily defined;
• recursive formulations for sequential updating are available.
Unfortunately, in real problems, it is extremely unlikely that the input and output variables are linked by a linear relation. Moreover, the form of the relationship is often unknown, and only a limited amount of observations is available. For this reason, machine learning proposed a number of nonlinear approaches to address nonlinear tasks.
9.4 Exercises
1. Consider an input/output regression task where n = 1, E[y|x] = \sin(x) and p(y|x) \sim N(\sin(x), 1). Let N = 100 be the size of the training set and consider a quadratic loss function. Let the class of hypotheses be h_M(x) = \alpha_0 + \sum_{m=1}^M \alpha_m x^m.
   1. Estimate the parameters by least-squares.
   2. Compute the error by leave-one-out and by using the PRESS statistic.
   3. Plot the empirical error as a function of the degree M for M = 0, 1, \dots, 7.
   4. Plot the leave-one-out error as a function of the degree M for M = 0, 1, \dots, 7.
2. Consider a univariate linear regression problem. Write an R script which, using Monte Carlo simulation, validates the formula (9.1.7) for at least three regression tasks differing in terms of
   • parameters \beta_0, \beta_1,
   • variance \sigma^2,
   • number N of observations.
3. Consider a univariate linear regression problem. Write an R script which, using Monte Carlo simulation, validates the formula (9.1.8) for at least three regression tasks differing in terms of
   • parameters \beta_0, \beta_1,
   • variance \sigma^2,
   • number N of observations.
4. Consider a univariate linear regression problem. Write an R script which, using Monte Carlo simulation, shows that the least-squares estimates of \beta_0 and \beta_1 minimise the quantity (9.1.10) for at least three regression tasks differing in terms of
   • parameters \beta_0, \beta_1,
   • variance \sigma^2,
   • number N of observations.
5. Consider a regression task with input x and output y. Suppose we observe the following training set:

   X      Y
   0.1    1
   0      0.5
   -0.3   1.2
   0.2    1
   0.4    0.5
   0.1    0
   -1     1.1

   1. Fit a linear model to the dataset.
   2. Trace the data and the linear regression function on graph paper.
   3. Are the two variables positively or negatively correlated?

Hint:

A = \begin{pmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{pmatrix} \Rightarrow A^{-1} = \frac{1}{a_{11} a_{22} - a_{12}^2} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{12} & a_{11} \end{pmatrix}
Solution:
1. Once we set

X = \begin{pmatrix} 1 & 0.1 \\ 1 & 0 \\ 1 & -0.3 \\ 1 & 0.2 \\ 1 & 0.4 \\ 1 & 0.1 \\ 1 & -1 \end{pmatrix}

we have

X^T X = \begin{pmatrix} 7.0 & -0.50 \\ -0.5 & 1.31 \end{pmatrix}

and

\beta = (X^T X)^{-1} X^T Y = \begin{pmatrix} 0.725 \\ -0.456 \end{pmatrix}
2. [Figure: scatter plot of the seven data points with the fitted regression line, x on the horizontal axis and y on the vertical axis.]
3. Since \hat\beta_1 < 0, the two variables are negatively correlated.
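The hand computation can be checked in a couple of lines of base R:

x <- c(0.1, 0, -0.3, 0.2, 0.4, 0.1, -1)
y <- c(1, 0.5, 1.2, 1, 0.5, 0, 1.1)
coef(lm(y ~ x))            # intercept ~ 0.725, slope ~ -0.456
plot(x, y); abline(lm(y ~ x))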
6. Let us consider the dependency where the conditional distribution of y is

y = 1 - x + x^2 - x^3 + w

where w \sim N(0, \sigma^2) and \sigma = 0.5. Suppose that x \in \mathbb{R} takes the values seq(-1, 1, length.out = N) (with N = 50).
Consider the family of regression models

h^{(m)}(x) = \beta_0 + \sum_{j=1}^m \beta_j x^j

where p denotes the number of weights of the polynomial model h^{(m)} of degree m.
Let \widehat{MISE}^{(m)}_{emp} denote the least-squares empirical risk and MISE the mean integrated empirical risk. By using Monte Carlo simulation and for m = 0, \dots, 6:
• plot E[\widehat{MISE}^{(m)}_{emp}] as a function of p,
• plot MISE^{(m)} as a function of p,
• plot the difference E[\widehat{MISE}^{(m)}_{emp}] - MISE^{(m)} as a function of p and compare it with the theoretical result seen during the class.
For a single observed dataset:
• plot \widehat{MISE}^{(m)}_{emp} as a function of the number of model parameters p,
• plot PSE as a function of p,
• discuss the relation between arg min_m \widehat{MISE}^{(m)}_{emp} and arg min_m PSE(m).
Solution: See the file Exercise2.pdf in the directory gbcode/exercises of the companion R package gbcode (Appendix F).
Chapter 10
Nonlinear approaches
This chapter will present several algorithms proposed in the machine learning literature to deal with nonlinear regression and nonlinear classification tasks. Over the years, statisticians and machine learning researchers have proposed a number of nonlinear approaches with the aim of finding approximators able to combine high generalisation with effective learning procedures. The presentation of these techniques could be organised according to several criteria and principles. In this chapter, we will focus on the distinction between global and divide-and-conquer approaches.
A family of models traditionally used in supervised learning is the family of
global models which describes the relationship between the input and the output
values as a single analytical function over the whole input domain (Fig. 10.1). In
general, this makes sense when it is reasonable to believe that a physical-like law
describes the data over the whole set of operating conditions. Examples of well-
known global parametric models in the literature are the linear models discussed in
the previous chapter, generalised linear models and neural networks which will be
presented in Section 10.1.1.
A nice property of global modelling is that, even for huge datasets, the storage of
a parametric model requires a small amount of memory. Moreover, the evaluation
of the model requires a short program that can be executed in a reduced amount
of time. These features have undoubtedly contributed to the success of the global
approach in years when most computing systems imposed severe limitations on
users.
However, for a generic global model, the parametric identification (Section 7.2)
consists of a nonlinear optimisation problem (see Equation 7.2.7) which is not an-
alytically tractable due to the numerous local minima and for which only a sub-
optimal solution can be found through a slow iterative procedure. Similarly, the
problem of selecting the best model structure in a generic nonlinear case cannot be
handled in analytical form and requires time-consuming validation procedures.
For these reasons, alternatives to global modelling techniques, such as the divide-and-conquer approach, gained popularity in the modelling community. The divide-and-conquer principle consists in attacking a complex problem by dividing it into simpler problems whose solutions can be combined to yield a solution to the original problem. This principle presents two main advantages. The first is that simpler
problems can be solved with simpler estimation techniques: in statistical language,
this means to adopt linear techniques, well studied and developed over the years.
The second is that the learning method can better adjust to the properties of the
available dataset. Training data are rarely distributed uniformly in the input space.
Whenever the distribution of patterns in the input space is uneven, a proper local
adjustment of the learning algorithm can significantly improve the overall perfor-
mance.
Figure 10.1: A global model (solid line) which fits the training set (dotted points)
for a learning problem with one input variable (x-axis) and one output variable
(y-axis).
Figure 10.2: Function estimation (model induction + model evaluation) vs. value
estimation (direct prediction from data).
We will focus on two main instances of the divide-and-conquer principle: the
modular approach, which originated in the field of system identification, and the
local modelling approach, which was first proposed in the nonparametric statistical
literature.
Modular architectures are input/output approximators composed of a number
of modules which cover different regions of the input space. This is the idea of
operating regimes which propose a partitioning of the operating range of the system
as a more effective way to solve modelling problems (Section 10.1.3).
Although these architectures are a modular combination of local models, their
learning procedure is still performed on the basis of the whole dataset. Hence,
learning in modular architectures remains a functional estimation problem, with the
advantage that the parametric identification can be made simpler by the adoption
of local linear modules. However, in terms of structural identification, the problem
is still nonlinear and requires the same procedures used for generic global models.
A second example of divide-and-conquer methods are local modelling techniques (Section 10.1.11), which turn the problem of function estimation into a problem of value estimation. The goal is not to model the whole statistical phenomenon but to return the best output for a given test input, hereafter called the query. The
motivation is simple: why should the problem of estimating the values of an un-
known function at given points of interest be solved in two stages? Global modelling
techniques first estimate the function (induction) and second estimate the values
of the function using the estimated function (deduction). In this two-stage scheme
one actually tries to solve a relatively simple problem (estimating the values of a
function at given points of interest) by first solving, as an intermediate problem, a
much more difficult one (estimating the function).
Local modelling techniques take an alternative approach, defined as transduction by Vapnik [186] (Fig. 10.2). They focus on approximating the function only in the neighbourhood of the point to be predicted. This approach requires keeping the dataset in memory for each prediction, instead of discarding it as in the global modelling case. At the same time, local modelling requires only simple approximators, e.g. constant and/or linear, to model the dataset in a neighbourhood of the query point. An example of local linear modelling in the case of a single-input single-output mapping is presented in Fig. 10.3.

Figure 10.3: Local modelling of the input/output relationship between the input variable x and the output variable y, on the basis of a finite set of observations (dots). The value of the variable y for x = q is returned by a linear model (solid line) which fits the training points in a neighbourhood of the query point (bigger dots).
Many names have been used in the past to label variations of the local modelling
approach: memory-based reasoning [174], case-based reasoning [121], local weighted
regression [44], nearest neighbor [47], just-in-time [49], lazy learning [5], exemplar-
based, instance based [4],... These approaches are also called nonparametric in the
literature [96, 170], since they relax the assumptions on the form of a regression
function, and let the data search for a suitable function that describes well the
available data.
In the following, we will present in detail some machine learning techniques for
nonlinear regression and classification.
10.1 Nonlinear regression
A general way of representing the unknown input/output relation in a regression setting is the regression plus noise form (7.4.21), where f(\cdot) is a deterministic function and the term w represents the noise or random error. It is typically assumed that w is independent of x and E[w] = 0. Suppose that we collect a training set \{\langle x_i, y_i \rangle : i = 1, \dots, N\} with x_i = [x_{i1}, \dots, x_{in}]^T, generated according to the model (7.4.21). The goal of a learning procedure is to find a model h(x) which is able to give a good approximation of the unknown function f(x).

Example
Consider an input/output mapping represented by the Dopler function

f(x) = 20 \sqrt{x (1 - x)} \sin \left( 2 \pi \frac{1.05}{x + 0.05} \right)    (10.1.1)

distorted by additive Gaussian noise w with unit variance.
Figure 10.4: Training set obtained by sampling uniformly in the input domain of a
Dopler function distorted with Gaussian noise.
The training set is made of N = 2000 points obtained by sampling the input domain X = [0.12, 1] through a random uniform distribution (Fig. 10.4). This stochastic dependency and the related training dataset (see R script dopler.R) will be used to assess the performance of the techniques we are going to present.
•
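A minimal sketch of this data-generating process is given below; the companion script dopler.R may differ in its details.

dopler <- function(x) 20 * sqrt(x * (1 - x)) * sin(2 * pi * 1.05 / (x + 0.05))
N <- 2000
x <- runif(N, min = 0.12, max = 1)   # uniform sampling of the input domain
y <- dopler(x) + rnorm(N)            # additive Gaussian noise with unit variance
plot(x, y, cex = 0.3)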
10.1.1 Artificial neural networks
Artificial neural networks (ANN) (aka neural nets) are parallel, distributed information processing computational models which draw their inspiration from neurons in the brain. However, one of the most important trends in recent neural computing has been to move away from a biologically inspired interpretation of neural networks to a more rigorous and statistically founded interpretation based on results deriving from statistical pattern recognition theory.
The main class of neural network used in supervised learning for classification and regression is the feed-forward network, aka multi-layer perceptron (MLP). Feed-forward ANNs (FNNs) have been applied to a wide range of prediction tasks in such diverse fields as speech recognition, financial prediction, image compression and adaptive industrial control.
10.1.1.1 Feed-forward architecture
Feed-forward NNs have a layered architecture, with each layer comprising one or more simple processing units called artificial neurons or nodes (Figure 10.5). Each node is connected to one or more other nodes by real-valued weights (in the following we will refer to them as parameters) but not to nodes in the same layer. All FNNs have an input layer and an output layer. FNNs are generally implemented with an additional node, called the bias unit(1), in all layers except the output layer. This node plays the role of the intercept term \beta_0 in linear models.

(1) Note that this has nothing to do with the estimator bias concept. In neural network literature, bias is used to denote the intercept term.

Figure 10.5: Two-layer feed-forward NN.
For simplicity, henceforth, we will consider only FNNs with one single output. Let
• n be the number of inputs,
• L the number of layers,
• H^{(l)} the number of hidden units of the lth layer (l = 1, \dots, L) of the FNN,
• w^{(l)}_{kv} the weight of the link connecting the kth node in the (l-1)th layer and the vth node in the lth layer,
• z^{(l)}_v, v = 1, \dots, H^{(l)}, the output of the vth hidden node of the lth layer,
• z^{(l)}_0 the bias for the lth layer, l = 1, \dots, L,
• H^{(0)} = n and z^{(0)}_v = x_v, v = 0, \dots, n.
For l \ge 1 the output of the vth, v = 1, \dots, H^{(l)}, hidden unit of the lth layer is obtained by first forming a weighted linear combination of the H^{(l-1)} outputs of the lower level

a^{(l)}_v = \sum_{k=1}^{H^{(l-1)}} w^{(l)}_{kv} z^{(l-1)}_k + w^{(l)}_{0v} z^{(l-1)}_0, \quad v = 1, \dots, H^{(l)}

and then by transforming the sum using an activation function to give

z^{(l)}_v = g^{(l)}(a^{(l)}_v), \quad v = 1, \dots, H^{(l)}

The activation function g^{(l)}(\cdot) is typically a nonlinear transformation like the logistic or sigmoid function

g^{(l)}(z) = \frac{1}{1 + e^{-z}}    (10.1.2)
For L = 2 (i.e. single hidden layer or two-layer feed-forward NN), the input/output relation is given by

\hat{y} = h(x, \alpha_N) = g^{(2)}(a^{(2)}_1) = g^{(2)} \left( \sum_{k=1}^H w^{(2)}_{k1} z_k + w^{(2)}_{01} z_0 \right)

where

z_k = g^{(1)} \left( \sum_{j=1}^n w^{(1)}_{jk} x_j + w^{(1)}_{0k} x_0 \right), \quad k = 1, \dots, H

Note that if g^{(1)}(\cdot) and g^{(2)}(\cdot) are linear mappings, this functional form becomes linear.
Once the number of inputs and the form of the function g(\cdot) are given, two quantities remain to be chosen: the values of the weights w^{(l)}, l = 1, 2, and the number of hidden nodes H. Note that the set of weights of an FNN represents the set of parameters \alpha_N introduced in Section 7.1 when the hypothesis function h(\cdot) is modelled by an FNN. The calibration procedure of the weights on the basis of a training dataset represents the parametric identification procedure in neural networks. This procedure is normally carried out by a back-propagation algorithm which will be discussed in the following section.
The number H of hidden nodes represents the complexity s in Equation (7.9.54). By increasing the value H, we increase the class of input/output functions that can be represented by the FNN. In other terms, the choice of the number of hidden nodes affects the representation power of the FNN approximator and constitutes the structural identification procedure in FNNs (Section 10.1.1.3).
10.1.1.2 Back-propagation
Back-propagation is an algorithm which, once the number of hidden nodes H is given, estimates the weights \alpha_N = \{w^{(l)}, l = 1, 2\} on the basis of the training set D_N. It is a gradient-based algorithm which aims to minimise the non-convex cost function

SSE_{emp}(\alpha_N) = \sum_{i=1}^N (y_i - \hat{y}_i)^2 = \sum_{i=1}^N (y_i - h(x_i, \alpha_N))^2

where \alpha_N = \{w^{(l)}, l = 1, 2\} is the set of weights.
The back-propagation algorithm exploits the network structure and the differentiable nature of the activation functions in order to compute the gradient recursively. The simplest (and least effective) back-prop algorithm is an iterative gradient descent which is based on the iterative formula

\alpha_N(k+1) = \alpha_N(k) - \eta \frac{\partial SSE_{emp}(\alpha_N(k))}{\partial \alpha_N(k)}    (10.1.3)

where \alpha_N(k) is the weight vector at the kth iteration and \eta is the learning rate which indicates the relative size of the change in weights.
The weights are initialised with random values and are changed in a direction that will reduce the error. Some convergence criterion is used to terminate the algorithm. This method is known to be inefficient, since many steps are needed to reach a stationary point, and no monotone decrease of SSE_{emp} is guaranteed. More effective versions of the algorithm are based on the Levenberg-Marquardt algorithm (Section 8.6.2.6). Note that this algorithm presents all the typical drawbacks of the gradient-based procedures discussed in Section 8.6.4, like slow convergence, convergence to local minima and sensitivity to the weights initialisation.
Figure 10.6: Single-input single-output neural network with one hidden layer, two
hidden nodes and no bias units.
In order to better illustrate how the derivatives are computed in (10.1.3), let us consider a simple single-input (i.e. n = 1) single-output neural network with one hidden layer, two hidden nodes and no bias units (Figure 10.6). Since

a_1(x) = w^{(2)}_{11} z_1 + w^{(2)}_{21} z_2

the FNN predictor takes the form

\hat{y}(x) = h(x, \alpha_N) = g(a_1(x)) = g(w^{(2)}_{11} z_1 + w^{(2)}_{21} z_2) = g(w^{(2)}_{11} g(w^{(1)}_{11} x) + w^{(2)}_{21} g(w^{(1)}_{12} x))

where \alpha_N = [w^{(1)}_{11}, w^{(1)}_{12}, w^{(2)}_{11}, w^{(2)}_{21}]. The backprop algorithm needs the derivatives of SSE_{emp} wrt each weight w \in \alpha_N. Since for each w \in \alpha_N

\frac{\partial SSE_{emp}}{\partial w} = -2 \sum_{i=1}^N (y_i - \hat{y}(x_i)) \frac{\partial \hat{y}(x_i)}{\partial w}

and the terms (y_i - \hat{y}(x_i)) are easy to compute, we focus on \partial \hat{y} / \partial w.
As far as the weights \{w^{(2)}_{11}, w^{(2)}_{21}\} of the hidden/output layer are concerned, we have

\frac{\partial \hat{y}(x)}{\partial w^{(2)}_{v1}} = \frac{\partial g}{\partial a^{(2)}_1} \frac{\partial a^{(2)}_1}{\partial w^{(2)}_{v1}} = g'(a^{(2)}_1(x)) z_v(x), \quad v = 1, 2    (10.1.4)

where

g'(z) = \frac{e^{-z}}{(1 + e^{-z})^2}

As far as the weights \{w^{(1)}_{11}, w^{(1)}_{12}\} of the input/hidden layer are concerned,

\frac{\partial \hat{y}(x)}{\partial w^{(1)}_{1v}} = \frac{\partial g}{\partial a^{(2)}_1} \frac{\partial a^{(2)}_1}{\partial z_v} \frac{\partial z_v}{\partial a^{(1)}_v} \frac{\partial a^{(1)}_v}{\partial w^{(1)}_{1v}} = g'(a^{(2)}_1(x)) w^{(2)}_{v1} g'(a^{(1)}_v(x)) x    (10.1.5)
Figure 10.7: Neural network fitting with s = 2 hidden nodes. The red continuous line represents the neural network estimation of the Dopler function.
where the term g'(a^{(2)}_1(x)) has already been obtained during the computation of (10.1.4). The computation of the derivatives with respect to the weights of the lower layers relies on terms which have been used in the computation of the derivatives with respect to the weights of the upper layers. In other terms, there is a sort of back-propagation of numerical terms from the upper layer to the lower layers, which justifies the name of the procedure.
R example
TensorFlow [1] is an open-source library developed by Google(2) which had great success in recent years as a flexible environment for building and training neural network architectures. In particular, this library provides automatic differentiation functionalities to speed up the backpropagation implementation.
The script tf nn.R uses the R tensorflow package (a wrapper over the Python library) to compute the derivatives (10.1.5) and (10.1.4) for the network in Figure 10.6 and checks that the TensorFlow result coincides with the one derived analytically.
•
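A plain base-R check of the same derivatives, using finite differences and assuming the network of Figure 10.6 (one input, two hidden nodes, no bias units, sigmoid activations), might look as follows.

g  <- function(z) 1 / (1 + exp(-z))
gp <- function(z) exp(-z) / (1 + exp(-z))^2   # g'(z)
## weight vector w = c(w1_11, w1_12, w2_11, w2_21)
yhat <- function(w, x) g(w[3] * g(w[1] * x) + w[4] * g(w[2] * x))
grad <- function(w, x) {
  a1 <- w[1] * x; a2 <- w[2] * x          # first-layer activations
  z1 <- g(a1);    z2 <- g(a2)             # hidden outputs
  a  <- w[3] * z1 + w[4] * z2             # output-layer activation
  c(gp(a) * w[3] * gp(a1) * x,            # (10.1.5), v = 1
    gp(a) * w[4] * gp(a2) * x,            # (10.1.5), v = 2
    gp(a) * z1,                           # (10.1.4), v = 1
    gp(a) * z2)                           # (10.1.4), v = 2
}
w <- rnorm(4); x0 <- 0.3; eps <- 1e-6
num <- sapply(1:4, function(j) {
  wp <- w; wp[j] <- wp[j] + eps
  (yhat(wp, x0) - yhat(w, x0)) / eps      # forward finite difference
})
max(abs(num - grad(w, x0)))               # should be of the order of eps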
R example
The FNN learning algorithm for a single-hidden-layer architecture is implemented by the R library nnet. The script nnet.R shows the prediction accuracy for different numbers of hidden nodes (Figure 10.7 and Figure 10.8).
•
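A minimal sketch of such a fit, assuming the Dopler training sample (x, y) generated earlier:

library(nnet)
Xm <- matrix(x, ncol = 1)
model <- nnet(Xm, y, size = 7, linout = TRUE, maxit = 500)  # linout: regression
xs <- matrix(seq(0.12, 1, length.out = 500), ncol = 1)
plot(x, y, cex = 0.3)
lines(xs, predict(model, xs), col = "red", lwd = 2)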
(2) https://www.tensorflow.org
Figure 10.8: Neural network fitting with s = 7 hidden nodes. The continuous red
line represents the neural network estimation of the Dopler function.
10.1.1.3 Approximation properties
Let us consider a two-layer FNN with sigmoidal hidden units. This has proven to be an important class of networks for practical applications. It can be shown that such networks can approximate arbitrarily well any functional (one-one or many-one) continuous mapping from one finite-dimensional space to another, provided the number H of hidden units is sufficiently large. Note that although this result is remarkable, it is of no practical use: no indication is given about the number of hidden nodes to choose for a finite number of observations and a generic nonlinear mapping.
In practice, the choice of the number of hidden nodes requires a structural identi-
fication procedure (Section 8.8) which assesses and compares several different FNN
architectures before choosing the ones expected to be the closest to the optimum.
Cross-validation techniques or regularisation strategies based on complexity-based
criteria (Section 8.8.2.5) are commonly used for this purpose.
Example
This example presents the risk of overfitting when the structural identification of
a neural network is carried out on the basis of the empirical risk and not on less
biased estimates of the generalisation error.
Consider a dataset D_N = \{x_i, y_i\}, i = 1, \dots, N, where N = 50 and

x \sim N \left( [0, 0, 0], \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \right)

is a 3-dimensional vector. Suppose that y is linked to x by the input/output relationship

y = x_1^2 + 4 \log(|x_2|) + 5 x_3
where x_i is the ith component of the vector x. Consider as nonlinear model a single-hidden-layer neural network (implemented by the R package nnet) with s = 15 hidden neurons. We want to estimate the prediction accuracy on a new i.i.d. dataset of N_{ts} = 50 examples. Let us train the neural network on the whole training set by using the R script cv.R. The empirical prediction MISE error is

\widehat{MISE}_{emp} = \frac{1}{N} \sum_{i=1}^N (y_i - h(x_i, \alpha_N))^2 = 1.6 \cdot 10^{-6}

where \alpha_N is obtained by the parametric identification step. However, if we test h(\cdot, \alpha_N) on the test set we obtain

\widehat{MISE}_{ts} = \frac{1}{N_{ts}} \sum_{i=1}^{N_{ts}} (y_i - h(x_i, \alpha_N))^2 = 22.41
This neural network is seriously overfitting the dataset: the empirical error is a very bad estimate of the MISE.
We now perform a K-fold cross-validation in order to have a better estimate of MISE, with K = 10. The K = 10 cross-validated estimate of MISE is

\widehat{MISE}_{CV} = 24.84

This figure is a much more reliable estimate of the prediction accuracy. The leave-one-out estimate (K = N = 50) is

\widehat{MISE}_{loo} = 19.47

It follows that the cross-validated estimate could be used to select a more appropriate number of hidden neurons.
•
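A minimal sketch of this cross-validation experiment is given below; the companion script cv.R may differ in its details.

library(nnet)
set.seed(0)
N <- 50
X <- matrix(rnorm(3 * N), N, 3)               # 3-dimensional Gaussian inputs
y <- X[, 1]^2 + 4 * log(abs(X[, 2])) + 5 * X[, 3]
K <- 10
folds <- sample(rep(1:K, length.out = N))     # random fold assignment
err <- numeric(K)
for (k in 1:K) {
  tr <- folds != k
  h <- nnet(X[tr, ], y[tr], size = 15, linout = TRUE,
            maxit = 1000, trace = FALSE)
  err[k] <- mean((y[!tr] - predict(h, X[!tr, , drop = FALSE]))^2)
}
mean(err)                                     # cross-validated estimate of MISE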
10.1.2 From shallow to deep learning architectures
Until 2006, FNNs with more than two layers were rarely used in the literature because of poor training and large generalisation errors. The common belief was that the solutions returned by deep neural networks were worse than the ones obtained with shallower networks. This was mainly attributed to two aspects: i) gradient-based training of deep supervised FNNs gets stuck in local minima or plateaus, and ii) the higher the number of layers in a neural network, the smaller the impact of the back-propagation on the first layers.
However, an incredible resurgence of the domain occurred from 2006 on, when some teams (notably the Bengio team in Montreal, the Hinton team in Toronto and the Le Cun team at Facebook)(3) were able to show that some adaptations of the FNN algorithm could remedy the above-mentioned problems and lead to major accuracy improvements with respect to other learning machines.

(3) Yoshua Bengio, Geoffrey Hinton, and Yann LeCun were awarded the 2018 Turing Award, known as the Nobel Prize of computing.

In particular, deep architectures (containing up to hundreds of layers) showed a number of advantages:
• some highly nonlinear functions can be represented much more compactly with deep architectures than with shallow ones,
• the XOR parity function for n-bit inputs(4) can be coded by a feed-forward neural network with O(log n) hidden layers and O(n) neurons, while a feed-forward neural network with only one hidden layer needs an exponential number of the same neurons to perform the same task,
• DL allows automatic generation and extraction of new features in large-dimensional tasks with spatial dependency and location invariance,
• DL allows easy management of datasets where inputs and outputs are stored in tensors (multidimensional matrices),
• DL relies on the learning of successive layers of increasingly meaningful representations of input data (layered representation learning) and is a powerful automatic alternative to time-consuming human-crafted feature engineering,
• portions of DL pre-trained networks may be reused for similar, yet different, tasks (transfer learning) or calibrated in online learning pipelines (continuous learning),
• iterative gradient optimisation (Section 8.6.3) is a very effective manner of ingesting huge amounts of data in large networks,
• new activation functions and weight-initialisation schemes (e.g. layer-wise pretraining) improve the training process.
Also, new network architectures were proposed, like auto-encoders or convolutional networks. An auto-encoder is a multi-input multi-output neural network that maps its input to itself. It has a hidden layer that describes a code used to represent the input and is composed of two parts: an encoder function and a decoder that produces a reconstruction. Auto-encoders can be used for dimensionality reduction or compression (if the number of hidden nodes is smaller than the number of inputs).
Convolutional networks are biologically inspired architectures imitating the processing of cortical cells. They are ideal for taking into consideration local and spatial correlation and consist of a combination of convolution, pooling and normalisation steps applied to inputs taking the generic form of tensors. The convolution phase applies a number of filters with shared weights to the same image. It ensures translation invariance since the weights depend on spatial separation and not on absolute positions. Pooling is a way to take large images and shrink them down while preserving the most important information in them. This step allows the creation of new features as the combination of previous-level features. The normalisation ensures that every negative value is set to zero.
Those works had such a major impact on theoretical and applied research that nowadays deep learning is a de facto synonym of the entire machine learning domain and, more generally, of AI. This comeback has been supported by a number of headlines in the news, like the success of a deep learning solution in the ImageNet Large-Scale Visual Recognition Competition (2012) (bringing down the state-of-the-art error rate from 26.1% to 15.3%) or the DL program AlphaGo, developed by the company DeepMind, beating the no. 1 human Go player. Other impressive applications of deep learning are near-human-level speech recognition, near-human-level handwriting transcription, autonomous cars (e.g. traffic sign recognition, pedestrian detection), image segmentation (e.g. face detection), analysis of particle accelerator data in physics, prediction of mutation effects in bioinformatics and machine translation (LSTM models of sequence-to-sequence relationships).

(4) This is a canonical challenge in classification where the target function is a boolean function whose value is one if and only if the n-dimensional input vector has an odd number of ones.
The breakthroughs of DL in the AI community have been acknowledged by the attribution of the 2018 ACM Turing Award to Bengio, Hinton and Le Cun.
The domain is so large and rich that the most honest recommendation of the author is to refer the reader, for more details, to seminal books [85] and articles [122] authored by the pioneers in this domain. Nevertheless, we would like to make a number of pedagogical considerations about the role of deep learning with respect to other learning machines:
• DL models are not faithful models of the brain.
• The astonishing success of DL makes of it a privileged approach in recent years, but it should definitely not be considered a machine learning panacea.
• DL, like all machine learning techniques, relies on a number of hyper-parameters which affect its capacity, its bias/variance trade-off and the expected generalisation power. The setting of those parameters has a major impact on the generalisation power. An important factor in the recent success of deep network learning is the effective integration of computational strategies already adopted in other learning approaches, like regularisation, averaging and resampling.
• The success of DL, though fulgurant, is often restricted to some specific perceptual tasks: e.g. convolutional networks have been explicitly designed to process data that come in the form of multiple arrays (1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images).
• There is no evidence that representation learning is by default a better strategy than feature engineering: it is surely less biased but very probably more prone to variance.
• DL is particularly successful in tasks where it is possible to collect (and label) huge amounts of examples; nevertheless, there are still a number of challenging tasks where the number of examples is typically low or scarce (e.g. bioinformatics or time series forecasting).
• The success of DL has been amplified by the advent of fast parallel graphics processing units (GPUs), tensor processing units (TPUs) and related libraries (e.g. TensorFlow, Keras, PyTorch) that are convenient to program and allow researchers to train networks 10 or 20 times faster.
• Any assumption of a priori superiority of DL over other techniques for a given learning task is more often due to hype considerations than to a scientific attitude, which should instead rely on the validation of a number of alternative strategies and the pondering of different criteria (accuracy, computational cost, energy consumption, interpretability).
Exercise
The script keras.regr.R compares a Keras [43] implementation of a DNN and a Random Forest (Section 11.4) in a very simple nonlinear regression task where a single input out of n is informative about the target. The default setting of the DNN is very disappointing in terms of NMSE accuracy (8.10.41) with respect to the Random Forest. We invite the reader to spend some time performing DNN model selection (e.g. by changing the architecture, tuning the number of layers and/or the number of nodes per layer) or increasing the number of training points to bring the DNN accuracy closer to the RF one. Is that easy? Is that fast? What is your opinion?
•
10.1.3 From global modelling to divide-and-conquer
Neural networks are a typical example of global modelling. Global models have essentially two main properties. First, they make the assumption that the relationship between the input and the output values can be described by an analytical function over the whole input domain. Second, they solve the problem of learning as a problem of function estimation: given a set of data, they extract the hypothesis which is expected to best approximate the whole data distribution (Chapter 7).
The divide-and-conquer paradigm originates from the idea of relaxing the global
modelling assumptions. It attacks a complex problem by dividing it into simpler
problems whose solutions can be combined to yield a solution to the original prob-
lem. This principle presents two main advantages. The first is that simpler problems
can be solved with simpler estimation techniques; in statistics, this means to adopt
linear techniques, well studied and developed over the years. The second is that the
learning method can better adjust to the properties of the available dataset.
The divide-and-conquer idea evolved in two different paradigms: the modular
architectures and the local modelling approach.
Modular techniques replace a global model with a modular architecture where
the modules cover different parts of the input space. This is the idea of operating
regimes which assume a partitioning of the operating range of the system in order
to solve modelling and control problems [111]. The following sections will introduce
some examples of modular techniques.
10.1.4 Classification and Regression Trees
The use of tree-based classification and regression dates back to the work of Morgan
and Sonquist in 1963. Since then, methods of tree induction from samples have
been an active topic in the machine learning and the statistics community. In
machine learning the most representative methods of decision-tree induction are
the ID3 [158] and the C4 [159] algorithms. Similar techniques were introduced
in statistics by Breiman et al. [37], whose methodology is often referred to as the
CART (Classification and Regression Trees) algorithm.
A decision tree (see Fig. 10.9) partitions the input space into mutually exclusive regions, each of which is assigned a procedure to characterise its data points (see Fig. 10.10).
The nodes of a decision tree can be classified in internal nodes and terminal
nodes. An internal node is a decision-making unit that evaluates a decision function
to determine which child node to visit next. A terminal node or leaf has no child
nodes and is associated with one of the partitions of the input space. Note that
each terminal node has a unique path that leads from the root to itself.
In classification trees each terminal node contains a label that indicates the
class for the associated input region. In regression trees the terminal node con-
tains a model that specifies the input/output mapping for the corresponding input
partition.
Hereafter we will focus only on the regression case. Let m be the number of leaves and h_j(\cdot, \alpha_j) the input/output model associated with the jth leaf. Once a prediction in a query point q is required, the output evaluation proceeds as follows. First, the query is presented to the root node of the decision tree; according to the associated decision function, the tree will branch to one of the root's children. The procedure is iterated recursively until a leaf is reached, and an input/output model is selected. The returned output will be the value h_j(q, \alpha_j).
Consider for example the regression tree in Fig. 10.9, and a query point q = (x_q, y_q) such that x_q < x_1 and y_q > y_1. The predicted output will be y_q = h_2(q, \alpha_2), where \alpha_2 is the vector of parameters of the model localised in region R_2.

Figure 10.9: A binary decision tree.

Figure 10.10: Input space partitioning induced on the input space by the binary tree in Fig. 10.9.
When the terminal nodes contain only constant models, the input/output map-
ping results in a combination of several constant-height planes put together with
crisp boundaries. In the case of linear terms, the resulting approximator is instead
a piecewise linear model.
10.1.4.1 Learning in Regression Trees
10.1.4.2 Parameter identification
A regression tree partitions the input space into mutually exclusive regions. In terms of parametric identification, this requires a two-step procedure. First, the training dataset is partitioned into m disjoint sets D_{N_j}; second, a local model h_j(\cdot, \alpha_j) is fitted to each subset D_{N_j}. The nature of the local model determines the kind of procedure (linear or nonlinear) to be adopted for the parameter identification (see Section 8.6).
R implementation
A regression tree with constant local models is implemented by the R library tree. The script tree.R shows the prediction accuracy for different minimum numbers of observations per leaf (Figure 10.11 and Figure 10.12).
•
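A minimal sketch of such a fit, assuming the tree package and the Dopler sample (x, y) generated earlier; minsize controls the minimum number of observations per leaf.

library(tree)
D <- data.frame(x = x, y = y)
fit <- tree(y ~ x, data = D,
            control = tree.control(nobs = nrow(D), minsize = 30))
xs <- data.frame(x = seq(0.12, 1, length.out = 500))
plot(D$x, D$y, cex = 0.3)
lines(xs$x, predict(fit, xs), col = "red", lwd = 2)  # piecewise constant fit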
10.1.4.3 Structural identification
This section presents a summary of the CART procedure [37] for structural identifi-
cation in binary regression trees. In this case the structural identification procedure
addresses the problem of choosing the optimal partitioning of the input space.
To construct an appropriate decision tree, CART first grows the tree on the
basis of the training set, and then prunes the tree back based on a minimum cost-
complexity principle. This is an example of the exploratory approach to model
generation described in Section 8.8.1.
Let us see in detail the two steps of the procedure:
Tree growing. CART makes a succession of splits that partition the training data
into disjoint subsets. Starting from the root node that contains the whole
dataset, an exhaustive search is performed to find the split that best reduces
a certain cost function.
Let us consider a certain node t and let D(t) be the corresponding subset of the original D_N. Consider the empirical error of the local model fitting the N(t) data contained in the node t:

R_{emp}(t) = \min_{\alpha_t} \sum_{i=1}^{N(t)} L(y_i, h_t(x_i, \alpha_t))    (10.1.6)

For any possible split s of node t into the two children t_r and t_l, we define the quantity

\Delta E(s, t) = R_{emp}(t) - (R_{emp}(t_l) + R_{emp}(t_r)) \quad \text{with } N(t_r) + N(t_l) = N(t)    (10.1.7)
Figure 10.11: Regression tree fitting with a minimum number of points per leaf
equal to s = 7.
Figure 10.12: Regression tree fitting with a minimum number of points per leaf
equal to s = 30.
that represents the change in the empirical error due to a further partition of the dataset. The best split is the one that maximises the decrease \Delta E:

s^* = \arg \max_s \Delta E(s, t)    (10.1.8)

Once the best split is attained, the dataset is partitioned into the two disjoint subsets of length N(t_r) and N(t_l), respectively. The same method is recursively applied to all the leaves. The procedure terminates either when the error measure associated with a node falls below a certain tolerance level, or when the error reduction \Delta E resulting from further splitting does not exceed a threshold value.
The tree that the growing procedure yields is typically too large and presents a
serious risk of overfitting the dataset (Section 7.7). For that reason, a pruning
procedure is often adopted.
Tree pruning. Consider a fully expanded tree T_max characterised by L terminal nodes. Let us introduce a complexity-based measure of the tree performance:

R_\lambda(T) = R_{emp}(T) + \lambda |T|    (10.1.9)

where \lambda is a parameter that accounts for the tree's complexity and |T| is the number of terminal nodes of the tree T. For a fixed \lambda we define as T(\lambda) the tree structure which minimises the quantity (10.1.9).
The parameter \lambda is gradually increased in order to generate a sequence of tree configurations with decreasing complexity

T_L = T_max \supset T_{L-1} \supset \dots \supset T_2 \supset T_1    (10.1.10)

where T_i has i terminal nodes. In practice, this requires a sequence of shrinking steps where for each step we select the value of \lambda leading from a tree to a tree of inferior complexity. When we have a tree T, the next inferior tree is found by computing for each admissible subtree T_t \subset T the value \lambda_t which makes of it the minimiser of (10.1.9). For a generic subtree T_t this value must satisfy

R_{\lambda_t}(T_t) \le R_{\lambda_t}(T)    (10.1.11)

that is

R_{emp}(T_t) + \lambda_t |T_t| \le R_{emp}(T) + \lambda_t |T|

which means

\lambda_t \ge \frac{R_{emp}(T_t) - R_{emp}(T)}{|T| - |T_t|}    (10.1.12)

Hence, \lambda_t = (R_{emp}(T_t) - R_{emp}(T)) / (|T| - |T_t|) makes of T_t the minimising tree. Therefore we choose among all the admissible subtrees T_t the one with the smallest right-hand term in Eq. (10.1.12). This implies a minimal increase in \lambda toward the next minimising tree.
At the end of the shrinking process we have a sequence of candidate trees that have to be properly assessed to perform the structural selection. As far as validation is concerned, either a procedure of cross-validation or of independent testing can be used. The final structure is then obtained through one of the selection procedures described in Section 8.8.3.
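As an illustration, the grow-then-prune procedure is directly available in the R package rpart (an implementation of CART, used here as a stand-in for the book's own scripts); its complexity parameter cp plays a role analogous to \lambda in (10.1.9). This sketch assumes the Dopler data frame D defined earlier.

library(rpart)
fit <- rpart(y ~ x, data = D,
             control = rpart.control(cp = 0, minsplit = 5))  # fully grown tree
printcp(fit)               # nested sequence of subtrees, as in (10.1.10)
best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best)   # subtree minimising the cross-validated error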
Regression trees are a very easy-to-interpret representation of a nonlinear input/output mapping. However, these methods are characterised by rough discontinuities at the decision boundaries, which might bring undesired effects to the overall generalisation. Dividing the data by partitioning the input space typically yields a small estimator bias but at the cost of increased variance. This is particularly problematic in high-dimensional spaces where data become sparse. One response to the problem is the adoption of simple local models (e.g. constant or linear). These simple functions minimise the variance at the cost of an increased bias.
Another trick is to make use of soft splits, allowing data to lie simultaneously in multiple regions. This is the approach taken by BFN.
10.1.5 Basis Function Networks
Basis Function Networks (BFN) are a family of modular architectures which are described by a linear basis expansion, i.e. the weighted linear combination

y = \sum_{j=1}^m \rho_j(x) h_j    (10.1.13)

where the weights are returned by the activations of m local nonlinear basis functions \rho_j and where the term h_j is the output of a generic module of the architecture.
The basis or activation function \rho_j is a function

\rho_j : X \to [0, 1]    (10.1.14)

usually designed so that its value monotonically decreases towards zero as the input point moves away from its centre c_j.
The basis function idea arose almost at the same time in different fields and
led to similar approaches, often denoted with different names. Examples are the
Radial Basis Function in machine learning, the Local Model Networks in system
identification and the Neuro-Fuzzy Inference Systems in fuzzy logic. These three
architectures are described in the following sections.
10.1.6 Radial Basis Functions
A well-known example of basis functions are the Radial Basis Functions (RBF) [156]. Each basis function in RBF takes the form of a kernel

\rho_j = K(x, c_j, B_j)    (10.1.15)

where c_j is the centre of the kernel and B_j is the bandwidth. An example of a kernel function is illustrated in Fig. 10.13. Other examples of kernel functions are available in Appendix E. Once we define with \eta_j the set \{c_j, B_j\} of parameters of the basis function, we have

\rho_j = \rho_j(\cdot, \eta_j)    (10.1.16)

If the basis functions \rho_j have localised receptive fields and a limited degree of overlap with their neighbours, the weights h_j in Eq. (10.1.13) can be interpreted as locally piecewise constant models, whose validity for a given input is indicated by the corresponding activation function.
10.1.7 Local Model Networks
Local Model Networks (LMN) were first introduced by Johansen and Foss [111]. They are a generalised form of Basis Function Network in the sense that the constant weights h_j associated with the basis functions are replaced by local models h_j(\cdot, \alpha_j). The typical form of an LMN is then

y = \sum_{j=1}^m \rho_j(x, \eta_j) h_j(x, \alpha_j)    (10.1.17)

where the \rho_j are constrained to satisfy

\sum_{j=1}^m \rho_j(x, \eta_j) = 1 \quad \forall x \in X    (10.1.18)

This means that the basis functions form a partition of unity [137]. This ensures that every point in the input space has equal weight, so that any variation in the output over the input space is due only to the models h_j.

Figure 10.13: A Gaussian kernel function in a two-dimensional input space.
The smooth combination provided by the LMN formalism enables the representation of complex nonlinear mappings on the basis of simpler modules. See the example in Fig. 10.14, which shows the combination in a two-dimensional input space of three local linear models whose validity regions are represented by Gaussian basis functions.
In general, the local models h_j(\cdot, \alpha) in Eq. (10.1.17) can be of any form: linear, nonlinear, physical models or black-box parametric models.
Note that, in the case of local linear models

h_j(x, \alpha_j) = \sum_{i=1}^n a_{ji} x_i + b_j    (10.1.19)

where the vector of parameters of the local model is \alpha_j = [a_{j1}, \dots, a_{jn}, b_j] and x_i is the ith term of the vector x, an LMN architecture returns one further piece of information about the input/output phenomenon: the local linear approximation h_{lin} of the input/output mapping about a generic point x

h_{lin}(x) = \sum_{j=1}^m \rho_j(x, \eta_j) \left( \sum_{i=1}^n a_{ji} x_i + b_j \right)    (10.1.20)
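To make (10.1.17)-(10.1.20) concrete, here is a minimal single-input sketch with m = 3 normalised Gaussian basis functions and local linear models; all centres, bandwidths and local parameters are illustrative assumptions.

centres <- c(0.2, 0.5, 0.8); bw <- 0.15
a <- c(2, -1, 0.5); b <- c(0, 1, 0.2)    # local models h_j(x) = a_j x + b_j
lmn <- function(x) {
  rho <- exp(-(x - centres)^2 / (2 * bw^2))  # Gaussian activations
  rho <- rho / sum(rho)                      # enforce the partition of unity (10.1.18)
  sum(rho * (a * x + b))                     # weighted combination (10.1.17)
}
xs <- seq(0, 1, length.out = 200)
plot(xs, sapply(xs, lmn), type = "l", xlab = "x", ylab = "y")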
Figure 10.14: A Local Model Network with m = 3 local models: the nonlinear input/output approximator in (c) is obtained by combining the three local linear models in (a) according to the three basis functions in (b).

10.1.8 Neuro-Fuzzy Inference Systems
Fuzzy modelling consists of describing relationships between variables by means of if-then rules, such as

If x is high then y is low
where the linguistic terms, such as high and low, are described by fuzzy sets [200]. The first part of each rule is called the antecedent, while the second part is called the consequent. Depending on the particular form of the consequent proposition, different types of rule-based fuzzy models can be distinguished [11].
Here we will focus on the fuzzy architecture for nonlinear modelling introduced by Takagi and Sugeno [178]. A Takagi-Sugeno (TS) fuzzy inference system is a set of m fuzzy if-then rules having the form:

If x_1 is A_{11} and x_2 is A_{21} ... and x_n is A_{n1} then y = h_1(x_1, x_2, \dots, x_n, \alpha_1)
...
If x_1 is A_{1m} and x_2 is A_{2m} ... and x_n is A_{nm} then y = h_m(x_1, x_2, \dots, x_n, \alpha_m)    (10.1.21)

The antecedent is defined as a fuzzy AND proposition where A_{kj} is a fuzzy set on the kth premise variable defined by the membership function \mu_{kj} : \mathbb{R} \to [0, 1]. The consequent is a function h_j(\cdot, \alpha_j), j = 1, \dots, m, of the input vector [x_1, x_2, \dots, x_n]. By means of the fuzzy sets A_{kj}, the input domain is softly partitioned into m regions where the mapping is locally approximated by the models h_j(\cdot, \alpha_j).
If the TS inference system uses the weighted mean criterion to combine the local representations, the model output for a generic query x is computed as

y = \frac{\sum_{j=1}^m \mu_j(x) h_j(x, \alpha_j)}{\sum_{j=1}^m \mu_j(x)}    (10.1.22)
where \mu_j is the degree of fulfilment of the jth rule, commonly obtained by

\mu_j(x) = \prod_{k=1}^n \mu_{kj}(x_k)

This formulation makes of a TS fuzzy system a particular example of LMN where

\rho_j(x) = \frac{\mu_j(x)}{\sum_{j=1}^m \mu_j(x)}    (10.1.23)

is the basis function and h_j(\cdot, \alpha_j) is the local model of the LMN architecture.
In a conventional fuzzy approach, the membership functions and the consequent models are fixed by the model designer according to a priori knowledge. In many cases, this knowledge is not available, but a set of input/output data has been observed. Once we put the components of the fuzzy system (memberships and consequent models) in a parametric form, the TS inference system becomes a parametric model which can be tuned by a learning procedure. In this case, the fuzzy system turns into a Neuro-Fuzzy approximator [108]. For a thorough introduction to Neuro-Fuzzy architectures see [109] and the references therein. Further work on this subject was presented by the author in [21, 22, 29, 20].
10.1.9 Learning in Basis Function Networks
Given the strong similarities between the three instances of BFN discussed above, our discussion on the BFN learning procedure does not distinguish between these approaches.
The learning process in BFN is divided into structural (see Section 8.8) and parametric identification (see Section 8.6). The structural identification aims to find the optimal number and shape of the basis functions \rho_j(\cdot). Once the structure of the network is defined, the parametric identification searches for the optimal set of parameters \eta_j of the basis functions (e.g. centre and width in the Gaussian case) and the optimal set of parameters \alpha_j of the local models (e.g. linear coefficients in the case of local linear models).
Hence, there are two classes of parameters to be identified: the parameters of the basis functions and the parameters of the local models.
10.1.9.1 Parametric identification: basis functions
The relationship between the model output and the parameters \eta_j of the basis functions is typically nonlinear; hence, methods for nonlinear optimisation are commonly employed. A typical approach consists in decomposing the identification procedure into two steps: first, an initialisation step, which computes the initial location and width of the basis functions, then a nonlinear optimisation procedure which uses the outcome \eta_j^{(0)} of the previous step as initial value.
Since the methods for nonlinear optimisation have already been discussed in Section 8.6.2.2, here we will focus on the different initialisation techniques for Basis Function Networks.
One method for placing the centres of the basis functions is to locate them at the
interstices of some coarse lattice defined over the input space [39]. If we assume the
lattice to be uniform with d divisions along each dimension, and the dimensionality
of the input space to be n, a uniform lattice requires dn basis functions. This
exponential growth makes the use of such a uniform lattice impractical for high
dimensional space.
Moody and Darken [134] suggested a K-means clustering procedure in the input
space to position the basis functions. The K-means method, described in detail in
Appendix A.2, takes as input the training set and returns m groups of input vectors,
each parameterised by a centre c_j and a width σ_j. This method generally requires
a much smaller number of basis functions than the uniform partition; nevertheless,
the basis locations cover only the part of the input space actually occupied by data.
The assumption underlying this method is that similar inputs should produce
similar outputs and that such similar input pairs are bundled together into
clusters in the training set. This assumption is reasonable but not necessarily
true in real problems. Therefore, the adoption of K-means clustering for supervised
learning is essentially a heuristic, and it is not uncommon to find datasets to
which it cannot be applied satisfactorily.
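R example
A minimal sketch of the Moody-Darken initialisation using the kmeans function of base R; the toy data and the width heuristic (average intra-cluster distance from the centre) are assumptions, one of several common choices.

set.seed(0)
X <- matrix(rnorm(200 * 2), ncol = 2)   # toy input data, n = 2
m <- 5                                  # number of basis functions
km <- kmeans(X, centers = m)
centres <- km$centers                   # c_j: initial basis locations
widths <- sapply(1:m, function(j) {     # sigma_j: a width per cluster
  pts <- X[km$cluster == j, , drop = FALSE]
  mean(sqrt(rowSums(sweep(pts, 2, centres[j, ])^2)))
})
•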
An alternative to K-means clustering for initialisation has been proposed in the
Neuro-Fuzzy literature [11, 12]. The initialisation of the architecture is provided by
a hyperellipsoidal fuzzy clustering procedure. This procedure clusters the data in
the input/output domain, obtaining a set of hyperellipsoids which are a preliminary
rough representation of the mapping. The parameters of the ellipsoids (eigenval-
ues) are used to initialise the parameters αj of the consequent models, while the
projection of their barycenters on the input domain determines the initial positions
of the membership functions (see Fig. 10.15).
10.1.9.2 Parametric identification: local models
A common approach to the optimisation of the parameters αj of local models is the
least-squares method (see Eq. (8.6.2) and (8.6.4)).
If the local models are nonlinear, some nonlinear optimisation technique is re-
quired (Section 8.6.2.2). Such a procedure is typically computationally expensive
and does not guarantee the convergence to the global minimum.
However, in the case of local linear models (Eq. 10.1.19), the parametric identi-
fication can take advantage of linear techniques. Assume that the local models are
linear, i.e.
h_j(x, α_j) = h_j(x, β_j) = x^T β_j    (10.1.24)
Figure 10.15: The hyperellipsoidal clustering initialisation procedure for a single-
input single-output mapping. The training points (dots) are grouped in three el-
lipsoidal clusters after a procedure of fuzzy clustering in the input/output domain.
The projection of the resulting clusters in the input domain (x-axis) determines the
centre and the width of the triangular membership functions.
There are two possible variants for the parameter identification [137, 138]:
Local optimisation. The parameters of each local model are estimated indepen-
dently.
A weighted least squares optimisation criterion can be defined for each local
model, where the weighting factor is the current activation of the correspond-
ing basis function. The parameters of each model hj ( ·, βj ), j = 1, . . . , m , are
then estimated using a set of locally weighted estimation criteria
J_j(\beta_j) = \frac{1}{N} (y - X\beta_j)^T Q_j (y - X\beta_j)    (10.1.25)

where Q_j is an [N × N] diagonal weighting matrix whose diagonal elements are
the weights ρ_j(x_1, η_j), ..., ρ_j(x_N, η_j). The weight ρ_j(x_i, η_j) represents the
relevance of the i-th example of the training set in the definition of the j-th
local model.
The locally weighted least-squares estimate β̂_j of the local model parameter
vector β_j is

\hat{\beta}_j = (X^T Q_j X)^{-1} X^T Q_j y    (10.1.26)
Global optimisation. The parameters of the local models are all estimated at the
same time. If the local models are assumed to be linear in the parameters, the
optimisation is a simple least-squares problem. We get the following regression
model:
y = \sum_{j=1}^m \rho_j(x, \eta_j) x^T \beta_j = \Phi \Theta    (10.1.27)

where Φ is an [N × (n+1)m] matrix

\Phi = \begin{bmatrix} \Phi_1 \\ \vdots \\ \Phi_N \end{bmatrix}    (10.1.28)

with

\Phi_i = [\rho_1(x_i, \eta_1) x_i^T, \ldots, \rho_m(x_i, \eta_m) x_i^T]    (10.1.29)

and Θ is a [(n+1)m × 1] vector

\Theta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_m \end{bmatrix}    (10.1.30)

The least-squares estimate Θ̂ returns all the parameters of the local models at once.
Note that the two approaches differ both in terms of predictive accuracy and
final interpretation of the local models. While the first approach aims to obtain
local linear models hj somewhat representative of the local behaviour of the target
in the region described by ρj [138], the second approach disregards any qualita-
tive interpretation by pursuing only a global prediction accuracy of the modular
architecture.
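R example
A minimal sketch contrasting the two identification variants on toy data: three Gaussian basis functions, normalised activations, and local linear models. The data-generating function and the basis placement are assumptions for illustration only.

set.seed(0)
N <- 100; x <- runif(N, -2, 2); y <- sin(x) + rnorm(N, sd = 0.1)
X <- cbind(1, x)                          # [N x (n+1)] with constant term
R <- sapply(c(-1, 0, 1), function(cj) exp(-(x - cj)^2))
R <- R / rowSums(R)                       # normalised activations rho_j(x_i)
## local optimisation: one weighted least squares per module (10.1.26)
beta_local <- lapply(1:3, function(j) {
  Qj <- diag(R[, j])
  solve(t(X) %*% Qj %*% X, t(X) %*% Qj %*% y)
})
## global optimisation: stack the weighted regressors (10.1.27)-(10.1.30)
Phi <- do.call(cbind, lapply(1:3, function(j) R[, j] * X))
Theta <- solve(t(Phi) %*% Phi, t(Phi) %*% y)   # all beta_j at once
•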
10.1.9.3 Structural identification
The structure of a BFN is characterised by many factors: the shape of the basis
functions, the number m of modules and the structure of the local models. Here,
for simplicity, we will consider a structural identification procedure which deals
exclusively with the number m of local models.
The structural identification procedure consists in adapting the number of mod-
ules to the complexity of the process underlying the data. According to the process
described in Section 8.8, different BFN architectures with different numbers of
models are first generated, then validated and finally selected.
Analogously to Neural Networks and Regression Trees, there are two possible
approaches to the generation of BFN architectures:
Forward: the number of local models increases from a minimum m_min to a
maximum value m_max.
Backward: we start with a large number of models and we proceed gradually by
merging basis functions. The initial number must be set sufficiently high such
that the nonlinearity can be captured accurately enough.
Once a set of BFN architectures has been generated, first a validation measure
is used to assess the generalisation error of the different architectures and then a
selection of the best structure is performed. An example of structural identification
of Neuro-Fuzzy Inference Systems based on cross-validation is presented in [21].
Note that BFN structural identification, unlike the parametric procedure de-
scribed in Section 10.1.9.2, is a non-convex problem and cannot take advantage of
any linear validation technique. This is due to the fact that a BFN architecture,
even if composed of local linear modules, behaves globally as a nonlinear approxi-
mator. The resulting learning procedure is then characterised by an iterative loop
over different model structures as illustrated in the flow chart of Fig. 10.16.
10.1.10 From modular techniques to local modelling
Modular techniques are powerful engines but still leave some problems unsolved.
While these architectures have efficient parametric identification algorithms, they
are inefficient in terms of structural optimisation. If the parametric identification
takes advantage of the adoption of local linear models, the validation of the global
architecture remains a nonlinear problem which can be addressed only by compu-
tationally expensive procedures.
Figure 10.16: Flow-chart of the BFN learning procedure. The learning procedure
is made of two nested loops: the inner one (made of a linear and nonlinear step) is
the parametric identification loop which minimises the empirical error J, the outer
one searches for the model structure which minimises the validation criterion.
The learning problem for modular architectures is still a problem of function
estimation formulated in terms of minimisation of the empirical risk over the whole
training set. The modular configuration makes the minimisation simpler but in
theoretical terms the problem appears to be at the same level of complexity as in a
generic nonlinear estimator.
Once also the constraint of global optimisation is relaxed, the divide-and-conquer
idea leads to the local modelling approach.
10.1.11 Local modelling
Local modelling is a popular nonparametric technique, which combines excellent
theoretical properties with a simple and flexible learning procedure.
This section will focus on the application of local modelling to the regression
problem. The idea of local regression as a natural extension of parametric fitting
arose independently at different points in time and in different countries in the
19th century. The early literature on smoothing by local fitting focused on one
independent variable with equally spaced values. For a historical review of early
work on local regression see [46].
The modern view of smoothing by local regression has origins in the 1950's and
1960's in the kernel methods introduced in the density estimation setting. As far
as regression is concerned, the first modern works on local regression were proposed
by Nadaraya [140] and Watson [193].
10.1.11.1 Nadaraya-Watson estimators
Let K (x, q, B ) be a nonnegative kernel function that embodies the concept of vicin-
ity. This function depends on the query point q, where the prediction of the target
value is required, and on a parameter B∈ (0,∞ ), called bandwidth , which represents
the radius of the neighbourhood. The function K satisfies two conditions:

0 \leq K(x, q, B) \leq 1    (10.1.31)
K(q, q, B) = 1    (10.1.32)
For example, in the simplest one-dimensional case (dimension n = 1 of the input
space) both the rectangular vicinity function (also called uniform kernel) (Fig.
10.17)

K(x, q, B) = \begin{cases} 1 & \text{if } \|x - q\| < \frac{B}{2} \\ 0 & \text{otherwise} \end{cases}    (10.1.33)

and the soft threshold vicinity function (Fig. 10.18)

K(x, q, B) = \exp\left(-\frac{(x - q)^2}{B^2}\right)    (10.1.34)
satisfy these requirements. Other examples of kernel functions are reported in
Appendix E.
The Nadaraya-Watson kernel regression estimator is given by

h(q) = \frac{\sum_{i=1}^N K(x_i, q, B) y_i}{\sum_{i=1}^N K(x_i, q, B)}    (10.1.35)
where N is the size of the training set. The idea of kernel estimation is simple.
Consider the case of a rectangular kernel in one dimension (n = 1). In this case,
the estimator (10.1.35) is a simple moving average with equal weights: the estimate
Figure 10.17: Hard-threshold kernel function.
Figure 10.18: Soft-threshold kernel function.
at point q is the average of the observations y_i corresponding to the x_i's belonging
to the window [q − B, q + B].
If B → ∞ then the estimator tends to the global average

h = \frac{\sum_{i=1}^N y_i}{N}

and thus, for mappings f(·) which are far from constant, the bias becomes large.
If B is smaller than the pairwise distance between the sample points x_i, then the
estimator reproduces the observations, h(x_i) = y_i. In this extreme case, the bias
tends to zero at the cost of a high variance. In general terms, by increasing B we
increase the bias of the estimator, while by reducing B we obtain a larger variance.
The optimal choice for B corresponds to an equal balance between bias and variance
(Section 7.7).
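R example
A minimal sketch of the Nadaraya-Watson estimator (10.1.35) with the soft-threshold kernel (10.1.34); the data-generating function and the bandwidth values are illustrative assumptions.

nw <- function(q, x, y, B) {
  K <- exp(-(x - q)^2 / B^2)   # kernel weights of the training points
  sum(K * y) / sum(K)          # weighted average (10.1.35)
}
set.seed(0)
x <- runif(200); y <- 0.9 + x^2 + rnorm(200, sd = 0.1)
nw(0.5, x, y, B = 0.05)   # small bandwidth: low bias, high variance
nw(0.5, x, y, B = 10)     # huge bandwidth: tends to the global average
•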
From a function approximation point of view, the Nadaraya-Watson estimator
is a least-squares constant approximator. Suppose we want to approximate locally
the unknown function f(·) by a constant θ. The locally weighted least-squares
estimate is

\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^N w_i (y_i - \theta)^2 = \frac{\sum_{i=1}^N w_i y_i}{\sum_{i=1}^N w_i}    (10.1.36)

It follows that the kernel estimator is an example of a locally weighted constant
approximator with w_i = K(x_i, q, B).
The Nadaraya-Watson estimator suffers from a series of shortcomings: it has a
large bias, particularly in regions where the derivative of the regression function
f(x) or of the density π(x) is large. Further, it does not adapt easily to a
nonuniform π(x).
An example is given in Fig. 10.19, where the Nadaraya-Watson estimator is used
to predict the value of the function f(x) = 0.9 + x^2 at q = 0.5. Since most of the
observations (crosses) are on the left of q, the estimate is biased downwards.

Figure 10.19: Effect of an asymmetric data distribution on the Nadaraya-Watson
estimator: the plot reports in the input/output domain the function f = 0.9 + x^2
to be estimated, the available points (crosses), the values of the kernel function
(stars), the value to be predicted at q = 0.5 (dotted horizontal line) and the value
predicted by the NW estimator (solid horizontal line).
A more severe problem is the large bias which occurs when estimating at a
boundary region. In Fig. 10.20 we wish to estimate the value of f(x) = 0.9 + x^2 at
q = 0.5. Here the regression function has a positive slope and hence the Nadaraya-
Watson estimate has substantial positive bias.
10.1.11.2 Higher order local regression
Once the weakness of the local constant approximation was recognised, a more
general local regression appeared in the late 1970's [44, 175, 114]. Work on local
regression continued throughout the 1980's and 1990's, focusing on the application
of smoothing to multidimensional problems [45].
Local regression is an attractive method both from the theoretical and the prac-
tical point of view. It adapts easily to various kinds of input distributions π (e.g.
random, fixed, highly clustered or nearly uniform). See in Fig. 10.21 the local
regression estimate at q = 0.5 for the function f(x) = 0.9 + x^2 and an asymmetric
data configuration.
Moreover, there are almost no boundary effects: the bias at the boundary stays
at the same order as in the interior, without use of specific boundary kernels (com-
pare Fig. 10.20 and Fig. 10.22).
10.1.11.3 Parametric identification in local regression
Given two variables x ∈ X ⊂ R^n and y ∈ Y ⊂ R, let us consider the mapping
f: R^n → R, known only through a set of N examples {⟨x_i, y_i⟩}_{i=1}^N obtained
as follows:

y_i = f(x_i) + w_i,    (10.1.37)
where, for all i, w_i is a random variable such that E_w[w_i] = 0 and E_w[w_i w_j] = 0
for j ≠ i, and such that E_w[w_i^m] = µ_m(x_i) for all m ≥ 2, where µ_m(·) is the
unknown m-th moment (Eq. (3.3.35)) of the distribution of w_i, defined as a function
of x_i. In particular, for m = 2 the last property implies that no assumption of
globally constant variance (homoscedasticity) is made.

Figure 10.20: Effect of a boundary on the Nadaraya-Watson estimator: the plot
reports in the input/output domain the function f = 0.9 + x^2 to be estimated, the
available points (crosses), the values of the kernel function (stars), the value to be
predicted at q = 0.5 (dotted horizontal line) and the value predicted by the NW
estimator (solid horizontal line).

Figure 10.21: Local linear regression in an asymmetric data configuration: the plot
reports in the input/output domain the function f = 0.9 + x^2 to be estimated, the
available points (crosses), the values of the effective kernel (stars), the local linear
fit, the value to be predicted at q = 0.5 (dotted horizontal line) and the value
predicted by the local regression (solid horizontal line).

Figure 10.22: Local linear regression in a boundary configuration: the plot reports in
the input/output domain the function f = 0.9 + x^2 to be estimated, the available
points (crosses), the values of the effective kernel (stars), the local linear fit, the
value to be predicted at q = 0.5 (dotted horizontal line) and the value predicted by
the local regression (solid horizontal line).
The problem of local regression can be stated as the problem of estimating the
value that the regression function f(x) = E_y[y | x] takes at a specific query point q,
using information pertaining only to a neighbourhood of q.
By using Taylor's expansion truncated at order p, a generic smooth regression
function f(·) can be approximated by

f(x) \approx \sum_{j=0}^p \frac{f^{(j)}(q)}{j!} (x - q)^j    (10.1.38)
for x in a neighbourhood of q. Given a query point q, and under the hypothesis of
local homoscedasticity of w_i, the parameter vector β̂ of a local linear approximation
of f(·) in a neighbourhood of q can be obtained by solving the locally weighted
regression (LWR) problem

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^N \left\{ (y_i - x_i^T \beta)^2 K(x_i, q, B) \right\},    (10.1.39)

where K(·) is a kernel function, B is the bandwidth, and a constant value 1 has been
appended to each vector x_i in order to consider a constant term in the regression.
In matrix notation, the weighted least-squares problem (10.1.39) can be written as

\hat{\beta} = \arg\min_{\beta} (y - X\beta)^T W (y - X\beta)    (10.1.40)
where X denotes the [N × (n+1)] input matrix whose i-th row is x_i^T, y is an
[N × 1] vector whose i-th element is y_i, and W is an [N × N] diagonal matrix whose
i-th diagonal element is w_{ii} = \sqrt{K(x_i, q, B)}. From least-squares theory, the
solution of the above weighted least-squares problem is given by the [(n+1) × 1]
vector

\hat{\beta} = (X^T W^T W X)^{-1} X^T W^T W y = (Z^T Z)^{-1} Z^T v = P Z^T v,    (10.1.41)

where Z = WX, v = Wy, and the matrix X^T W^T W X = Z^T Z is assumed to be
non-singular so that its inverse P = (Z^T Z)^{-1} is defined.
Once the local linear polynomial approximation is obtained, a prediction of f(q)
is finally given by

\hat{y}_q = q^T \hat{\beta}.    (10.1.42)
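R example
A minimal sketch of locally weighted linear regression (10.1.39)-(10.1.42) for a scalar input, using the soft-threshold kernel; the dataset is an illustrative assumption.

lwr_predict <- function(q, x, y, B) {
  X <- cbind(1, x)                       # constant term appended to each x_i
  w <- sqrt(exp(-(x - q)^2 / B^2))       # w_ii = sqrt(K(x_i, q, B))
  Z <- w * X; v <- w * y                 # Z = WX, v = Wy
  beta <- solve(t(Z) %*% Z, t(Z) %*% v)  # (10.1.41)
  drop(c(1, q) %*% beta)                 # prediction y_q = q^T beta (10.1.42)
}
set.seed(0)
x <- runif(200); y <- 0.9 + x^2 + rnorm(200, sd = 0.1)
lwr_predict(0.5, x, y, B = 0.2)          # close to 0.9 + 0.25 = 1.15
•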
10.1.11.4 Structural identification in local regression
While the parametric identification in a local regression problem is quite simple
and reduces to a weighted least-squares problem, there are several choices to be
made in terms of model structure. The most relevant structural parameters in
local regression are:
• the kernel function K,
• the order of the local polynomial,
• the bandwidth parameter,
• the distance function,
• the localised global structural parameters.
In the following sections, we will present in detail the importance of these structural
parameters and finally we will discuss the existing methods for tuning and selecting
them.
10.1.11.5 The kernel function
Under the assumption that the data to be analysed are generated by a continu-
ous mapping f ( ·), we want to consider positive kernel functions K ( ·,·, B ) that are
peaked at x =q and that decay smoothly to 0 as the distance between x and q
increases. Examples of different kernel functions are reported in Appendix E.
Some considerations can be made on how relevant the kernel shape is for the final
accuracy of the prediction. First, it is evident that a smooth weight function results
in a smoother estimate. On the other hand, with hard-threshold kernels (10.1.33),
as q changes, available observations abruptly switch in and out of the smoothing
window. Second, it is relevant to have kernel functions with nonzero values on a
compact bounded support rather than simply approaching zero for |x −q | → ∞ .
This allows faster implementations, since points further from the query than the
bandwidth can be ignored with no error.
10.1.11.6 The local polynomial order
The choice of the local polynomial degree is a bias/variance trade-off. Generally
speaking, a higher degree produces a less biased but more variable estimate than
a lower degree one.
Some asymptotic results in the literature assert that good practice in local polyno-
mial regression is to adopt a polynomial order which differs by an odd degree from
the order of the terms to be estimated [70]. In practice, this means that if the goal
of local polynomial regression is to estimate the value of the function at the query
point (degree zero in the Taylor expansion (10.1.38)), it is advisable to use orders
of odd degree; otherwise, if the purpose is to estimate derivatives at the query
point, it is better to fit with even degrees. However, others suggest not ruling out
any type of degree in practical applications [46].

Figure 10.23: Too narrow a bandwidth ⇒ overfitting ⇒ large prediction error e.
In the previous sections, we already introduced some considerations on degree-zero
fitting. This very rarely appears to be the best choice in terms of prediction, even
if it presents a strong advantage in computational terms. By using a polynomial
degree greater than zero we can typically increase the bandwidth by a large amount
without introducing an intolerable bias. Despite the increased number of
parameters, the final result is smoother thanks to the increased neighbourhood size.
An integer value is generally assumed to be the only possible choice for the local
degree. However, the accuracy of the prediction turns out to be highly sensitive to
discrete changes of the degree.
A possible alternative is polynomial mixing, proposed in global parametric fitting
by Mallows [129] and in local regression by Cleveland and Loader [46]. Polynomial
mixings are polynomials of fractional degree p = m + c, where m is an integer and
0 < c < 1. The mixed fit is a weighted average of the local polynomial fits of degree
m and m+1, with weight 1 − c for the former and weight c for the latter:

f_p(\cdot) = (1 - c) f_m(\cdot) + c f_{m+1}(\cdot)    (10.1.43)

We can choose a single mixing degree for all x, or we can use an adaptive method
by letting p vary with x.
10.1.11.7 The bandwidth
A natural question is how wide the local neighbourhood should be so that the local
approximation (10.1.38) holds. This is equivalent to asking how large the bandwidth
parameter should be in (10.1.33). If we take a small bandwidth B, we are able to
cope with a possible nonlinearity of the mapping; in other terms, we keep the
modelling bias small. However, since the number of data points falling in this
local neighbourhood is also small, we cannot average out the noise, and the variance
of the prediction will consequently be large (Fig. 10.23).
On the other hand, if the bandwidth is too large, we may smooth the data
excessively, thus introducing a large modelling bias (Fig. 10.24). In the limit case of
an infinite bandwidth, for example, a local linear model turns into a global linear
fit which, by definition, cannot take into account any type of nonlinearity.
Figure 10.24: Too large a bandwidth ⇒ underfitting ⇒ large prediction error e.
A vast amount of literature has been devoted to the bandwidth selection prob-
lem. Various techniques for selecting smoothing parameters have been proposed
during the last decades in different setups, mainly in kernel density estimation [112]
and kernel regression.
There are two main strategies for bandwidth selection:
Constant bandwidth selection. The bandwidth B is independent of the train-
ing set DN and the query point q.
Variable bandwidth selection. The bandwidth is a function B(D_N) of the
dataset D_N. For a variable bandwidth, a further distinction should be made
between the local and the global approach.
1. A local variable bandwidth B(D_N, q) is not only a function of the training
data D_N but also changes with the query point q. An example is the
nearest-neighbour bandwidth selection, where the bandwidth is set to
the distance between the query point and the k-th nearest point [175].
2. A global variable bandwidth is a function B(D_N) of the dataset but
is the same for all the queries. A further distinction should be made
between the point-based case, where the bandwidth B(x_i) is a function
of the training point x_i, and the uniform case, where B is constant.
A constant bandwidth is easy to interpret and can be sufficient if the unknown
curve is not too wiggly, i.e. has a high smoothness. Such a bandwidth, however,
fails to do a good job when the unknown curve has a rather complicated structure.
To capture the complexity of such a curve a variable bandwidth is needed. A
variable bandwidth allows for different degrees of smoothing, resulting in a possible
reduction of the bias at peaked regions and of the variance at flat regions. Further,
a variable local bandwidth can adapt to the data distribution, to different levels
of noise and to changes in the smoothness of the function. Fan and Gijbels [68]
argue for point-based rather than query-based local bandwidth selection, mainly
for computational efficiency reasons.
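R example
A minimal sketch of a local variable (nearest-neighbour) bandwidth for a scalar input: B(D_N, q) is the distance from q to its k-th nearest training point, so it adapts to the local data density. The input density is an illustrative assumption.

knn_bandwidth <- function(q, x, k) sort(abs(x - q))[k]
set.seed(0)
x <- c(runif(50, 0, 0.5), runif(10, 0.5, 1))   # nonuniform input density
knn_bandwidth(0.25, x, k = 10)   # dense region: small bandwidth
knn_bandwidth(0.90, x, k = 10)   # sparse region: large bandwidth
•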
10.1.11.8 The distance function
The performance of any local method depends critically on the choice of the distance
function d: R^n × R^n → R. In the following, we define some distance functions for
ordered inputs:
Unweighted Euclidean distance

d(x, q) = \sqrt{\sum_{j=1}^n (x_j - q_j)^2} = \sqrt{(x - q)^T (x - q)}    (10.1.44)

Weighted Euclidean distance

d(x, q) = \sqrt{(x - q)^T M^T M (x - q)}    (10.1.45)

The unweighted distance is the particular case of the weighted one for M
diagonal with m_{jj} = 1.

Unweighted L_p norm (Minkowski metric)

d(x, q) = \left( \sum_{j=1}^n |x_j - q_j|^p \right)^{1/p}    (10.1.46)

Weighted L_p norm It is computed through the unweighted norm d(Mx, Mq).
It is important to remark that when an entire column of M is zero, all points along
the corresponding direction get the same relevance in the distance computation.
Also, notice that once the bandwidth is selected, some terms in the matrix M can
be redundant parameters of the local learning procedure. The redundancy can be
eliminated by requiring the determinant of M to be one or by fixing some elements
of M.
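R example
A minimal sketch of the three distance functions above; the vectors and the weighting matrix M are illustrative assumptions.

d_euclid <- function(x, q) sqrt(sum((x - q)^2))                 # (10.1.44)
d_weuclid <- function(x, q, M) sqrt(sum((M %*% (x - q))^2))     # (10.1.45)
d_minkowski <- function(x, q, p) sum(abs(x - q)^p)^(1 / p)      # (10.1.46)
x <- c(1, 2); q <- c(0, 0)
d_euclid(x, q)
d_weuclid(x, q, M = diag(c(1, 0.1)))   # downweights the second coordinate
d_minkowski(x, q, p = 1)               # p = 1: Manhattan distance
•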
Atkeson et al. [10] distinguish between three ways of using distance functions:
Global distance function. The same distance is used at all parts of the input
space.
Query-based distance function. The distance measure is a function of the cur-
rent query point. Examples are in [174, 97, 76].
Point-based local distance functions. Each training point has an associated
distance metric [174]. This is typical of classification problems where each
class has an associated distance metric [3, 4].
10.1.11.9 The selection of local parameters
As seen in the previous sections, there are several parameters that affect the accu-
racy of the local prediction. Generally, they cannot be selected and/or optimised
in isolation as the accuracy depends on the whole set of structural choices. At the
same time, they do not all play the same role in the determination of the final
estimation. It is a common belief in the local learning literature that the bandwidth
and the distance function are the most important parameters. The shape of the
weighting function, instead, plays a secondary role.
In the following, we will mainly focus on the methods existing for bandwidth
selection. They can be classified into:
Rule of thumb methods. They provide a crude bandwidth selection which in
some situations may prove sufficient. Examples of rules of thumb are provided
in [69] and [95].
Data-driven estimation. It is a selection procedure which estimates the gener-
alisation error directly from data. Unlike the previous approach, this method
does not rely on the asymptotic expression, but it estimates the values directly
from the finite data set. To this group belong methods like cross-validation,
Mallows' C_p, Akaike's AIC and other extensions of methods used in classical
parametric modelling.
There are several ways in which data-driven methods can be used for structural
identification. Atkeson et al. [10] distinguish between
Global tuning. The structural parameters are tuned by optimising a data driven
assessment criterion on the whole data set. An example is the General Memory
Based Learning (GMBL) described in [135].
Query-based local tuning. The structural parameters are tuned by optimising
a data driven assessment criterion query-by-query. An example is the lazy
learning algorithm proposed by the author and colleagues in [24, 31, 30].
Point-based local tuning. A different set of structural parameters is associated
with each point of the training set.
R implementation
A local linear algorithm for regression is implemented by the R library lazy [23].
The script lazy.R shows the prediction accuracy on the Doppler dataset for
different numbers of neighbours (Figure 10.25 and Figure 10.26).
•
10.1.11.10 Bias/variance decomposition of the local constant model
An interesting aspect of local models is that it is easy to derive an analytical ex-
pression of the bias/variance decomposition.
In the case of a constant local model, the prediction at q is

h(q, \alpha_N) = \frac{1}{k} \sum_{i=1}^k y_{[i]}

computed by averaging the values of y for the k closest neighbours x_{[i]},
i = 1, ..., k, of q.

Figure 10.25: Locally linear fitting with a rectangular kernel and a bandwidth made
of 10 neighbours.

Figure 10.26: Locally linear fitting with a rectangular kernel and a bandwidth made
of 228 neighbours.
The bias/variance decomposition takes the form discussed in Equation (5.5.15),
that is

\text{MSE}(q) = \sigma_w^2 + \left( \frac{1}{k} \sum_{i=1}^k \left( f(x_{[i]}) - f(q) \right) \right)^2 + \frac{\sigma_w^2}{k}    (10.1.47)

where σ_w^2 is the variance of the noise and σ_w^2/k is the variance of a sample
average estimator based on k points (Equation (5.5.10)). Note the behaviour of the
MSE terms as a function of k. By increasing k (i.e. taking a larger neighbourhood),
the first term is invariant, the bias term is likely to increase (since farther points
are potentially uncorrelated with q) and the variance term decreases.
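R example
A minimal sketch illustrating the reducible part of (10.1.47) at a single query point: as k grows, the squared bias term increases while the variance term sigma_w^2 / k shrinks. The target function, noise level and k values are illustrative assumptions.

set.seed(0)
f <- function(x) 0.9 + x^2
x <- runif(200); q <- 0.5; sdw <- 0.1
sapply(c(2, 10, 50, 150), function(k) {
  idx <- order(abs(x - q))[1:k]        # the k nearest neighbours of q
  bias2 <- mean(f(x[idx]) - f(q))^2    # squared bias term of (10.1.47)
  bias2 + sdw^2 / k                    # bias^2 + variance (sigma_w^2 omitted)
})
•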
10.2 Nonlinear classification
In Section 9.2.1, we have shown that optimal classification is possible only if the
quantities Prob{y = c_k | x}, k = 1, ..., K, are known. What happens if this is not
the case? Three strategies are generally used.
10.2.1 Direct estimation via regression techniques
If the classification problem has K = 2 classes and if we denote them by y = 0 and
y = 1, then

E[y | x] = 1 · Prob{y = 1 | x} + 0 · Prob{y = 0 | x} = Prob{y = 1 | x}

A binary classification problem can therefore be put in the form of a regression
problem where the output takes values in {0, 1}. This means that, in principle, all
the regression techniques presented so far could be used to solve a classification
task. In practice, most of those techniques do not take into account that the
outcome of a classification task should satisfy the probabilistic constraints, e.g.
0 ≤ Prob{y = 1 | x} ≤ 1. This means that only some regression algorithms (e.g.
local constant models) are commonly used for binary classification as well.
10.2.1.1 The nearest-neighbour classifier
The nearest-neighbour algorithm is an example of a local modelling algorithm
(Section 10.1.11) for classification.
Let us consider a binary {0, 1} classification task where a training set is available
and the classification is required for an input vector q (query point). The
classification procedure of a k-NN classifier can be summarised in the following
steps:
1. Compute the distance between the query q and the training examples accord-
ing to a predefined metric.
2. Rank the observed inputs on the basis of their distance to the query.
3. Select the subset {x_{[1]}, ..., x_{[k]}} of the k ≥ 1 nearest neighbours. Each of
these neighbours x_{[i]} has an associated class y_{[i]}.
4. Compute the estimate of the conditional probability of class 1 by constant
fitting

\hat{p}_1(q) = \frac{\sum_{i=1}^k y_{[i]}}{k}    (10.2.48)
or by linear fitting

\hat{p}_1(q) = \hat{a} q + \hat{b}

where the parameters â and b̂ are locally fitted by least-squares regression.
5. Return the prediction, either by majority vote or according to the conditional
probability.

Figure 10.27: kNN prediction (blue line) of the conditional probability (green line)
for k = 1, 11, 21, 31, 41, 51, 61, 71, 101, 201, 301 and 391. Dotted lines represent
the class-conditional densities.
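R example
A minimal sketch of the five steps above for a binary {0, 1} task with Euclidean distance, constant fitting (10.2.48) and majority vote; the toy data are an assumption.

knn_classify <- function(q, X, y, k) {
  d <- sqrt(rowSums((X - matrix(q, nrow(X), ncol(X), byrow = TRUE))^2))  # step 1
  nn <- order(d)[1:k]          # steps 2-3: the k nearest neighbours
  p1 <- mean(y[nn])            # step 4: estimate of Prob{y = 1 | q}
  as.integer(p1 > 0.5)         # step 5: majority vote
}
set.seed(0)
X <- matrix(rnorm(400), ncol = 2)
y <- as.integer(X[, 1] + X[, 2] > 0)
knn_classify(c(1, 1), X, y, k = 11)
•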
It is evident that the hyperparameter k plays a key role in the trade-off between
bias and variance. Figure 10.27 illustrates the trade-off in an n = 1 and N = 400
binary classification task where the two class-conditional distributions are Normal
with means −1 and 3, respectively. Note that by increasing k the prediction
profile becomes smoother and smoother.
Figure 10.28 shows the trade-off in an n = 2 classification task. Note that, though
the separating region becomes closer to the optimal for large k, an extrapolation
bias occurs in regions far from the observed examples.
It is interesting to see that the kNN classifier can be justified in terms of the
Bayes theorem. Suppose that the dataset has the form D_N = {(x_1, y_1), ..., (x_N, y_N)},
where y ∈ {c_1, ..., c_K}, and that q ∈ R^n is the query point where we want to
compute the a-posteriori probability. Suppose that the dataset contains N_j points
labelled with the class c_j, i.e.

\sum_{j=1}^K N_j = N

Let us consider a region R around the input x having volume V. If the volume
is small enough, we may consider the density constant over the entire region⁵. It
follows that the probability of observing a point within this volume is

P = \int_R p(z) \, dz \approx p(x) V \;\Rightarrow\; p(x) \approx \frac{P}{V}

Given a training dataset of size N, if we observe N_R examples in a region R, we
can approximate P with \hat{P} = N_R / N and consequently obtain

\hat{p}(x) = \frac{N_R}{N V}    (10.2.49)

⁵For a discussion about the validity of this assumption in large-dimensional settings, refer to
Section 12.1.

Figure 10.28: kNN class predictions in the n = 2 input space for k = 1, 5 and 40.
Dots represent the training points. The continuous black line is the optimal
separating hyperplane.
Consider now the query point q and a neighbouring volume V containing k
points, of which k_j ≤ k are labelled with the class c_j. From (10.2.49) we obtain
the k-NN density estimate (A.1.11) of the class-conditional probability

\hat{p}(q | c_j) = \frac{k_j}{N_j V}

and of the unconditional density

\hat{p}(q) = \frac{k}{N V}

Since the class priors can be estimated by \widehat{\text{Prob}}\{y = c_j\} = N_j/N,
from (7.3.16) it follows that

\widehat{\text{Prob}}\{y = c_j \mid q\} = \frac{k_j}{k}, \qquad j = 1, \ldots, K

This implies that in a binary {0, 1} case the computation (10.2.48) estimates the
conditional probability of class 1.
10.2.2 Direct estimation via cross-entropy
The approach consists in modelling the conditional distribution Prob{y = c_j | x},
j = 1, ..., K, with a set of models \hat{P}_j(x, α), j = 1, ..., K, satisfying the
constraints

\hat{P}_j(x, \alpha) > 0, \qquad \sum_{j=1}^K \hat{P}_j(x, \alpha) = 1.
Parametric estimation boils down to the minimisation of the cross-entropy cost
function (8.6.5). Typical approaches are logistic regression and neural networks.
In logistic regression for a two-class task we have

\hat{P}_1(x, \alpha) = \frac{e^{x^T \alpha}}{1 + e^{x^T \alpha}} = \frac{1}{1 + e^{-x^T \alpha}}, \qquad \hat{P}_2(x, \alpha) = \frac{1}{1 + e^{x^T \alpha}}    (10.2.50)
where x and α are [p, 1] vectors. This implies

\log \frac{\hat{P}_1(x, \alpha)}{\hat{P}_2(x, \alpha)} = x^T \alpha

where the transformation log(p/(1−p)) is called the logit transformation and the
function (10.1.2) is the logistic function. In a nonlinear classifier (e.g. a neural
network),

\log \frac{\hat{P}_1(x, \alpha)}{\hat{P}_2(x, \alpha)} = h(x, \alpha)

where h(x, α) is the output of the learner.
In the binary case (c_1 = 1, c_2 = 0) the cross-entropy function (to be minimised)
becomes

J(\alpha) = -\sum_{i=1}^N \log \hat{P}_{y_i}(x_i, \alpha)
= -\sum_{i=1}^N \left[ y_i \log \hat{P}_1(x_i, \alpha) + (1 - y_i) \log(1 - \hat{P}_1(x_i, \alpha)) \right]
= \sum_{i=1}^N \left[ -y_i h(x_i, \alpha) + \log\left(1 + e^{h(x_i, \alpha)}\right) \right]    (10.2.51)
In the logistic regression case (linear h) the cost function is minimised by iteratively
reweighted least squares. For a generic h, a gradient-based iterative approach is
required.
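R example
A minimal sketch of cross-entropy minimisation with a linear h using base R's glm, which fits logistic regression by iteratively reweighted least squares; the simulated data are an assumption.

set.seed(0)
N <- 200
x1 <- rnorm(N); x2 <- rnorm(N)
p <- 1 / (1 + exp(-(0.5 + 2 * x1 - x2)))    # true P(y = 1 | x), logistic in x
y <- rbinom(N, 1, p)
fit <- glm(y ~ x1 + x2, family = binomial)  # minimises (10.2.51), linear h
predict(fit, newdata = data.frame(x1 = 0, x2 = 0), type = "response")
•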
Another formulation of the binary case is (c_1 = 1, c_2 = −1) with

\hat{P}(y | x) = \frac{1}{1 + e^{-y h(x, \alpha)}}

which satisfies \hat{P}(y = 1 | x) = 1 - \hat{P}(y = -1 | x). In this case the
classification rule is the sign function sign[h(x)] and the cost function to minimise is

J(\alpha) = \sum_{i=1}^N \log\left(1 + e^{-y_i h(x_i, \alpha)}\right)    (10.2.52)

also known as the log-loss, a monotonically decreasing function of the terms
y_i h(x_i, α) = y_i h_i, called the margins. Minimising (10.2.52) is then equivalent
to minimising the number of training points for which y_i and the prediction
h(x_i, α) have different signs. The decreasing nature of the function e^{-y h(x, α)}
is such that negative margins are penalised much more than positive ones. Note that
this is not a property of the (regression) least-squares criterion (y − h(x, α))^2,
for which in some cases a positive margin may be penalised more than a negative
one. This is a reason why regression techniques are not recommended for
classification tasks.
10.2.3 Density estimation via the Bayes theorem
Since

\text{Prob}\{y = c_k | x\} = \frac{p(x | y = c_k) \, \text{Prob}\{y = c_k\}}{p(x)}

an estimate of Prob{x | y = c_k} allows an estimate of Prob{y = c_k | x}. Several
techniques exist in the literature to estimate Prob{x | y = c_k}.
We will present two of them in the following sections. The first makes the as-
sumption of conditional independence to ease the estimation. The second relies
on the construction of optimal separating hyperplanes to create convex regions
containing sets of x points sharing the same class label.
10.2.3.1 Naive Bayes classifier
The Naive Bayes (NB) classifier has shown in some domains a performance compa-
rable to that of neural networks and decision tree learning.
Consider a classification problem with n inputs and a random output variable y
that takes values in the set {c_1, ..., c_K}. The Bayes optimal classifier should
return

c^*(x) = \arg\max_{j=1,\ldots,K} \text{Prob}\{y = c_j | x\}

We can use the Bayes theorem to rewrite this expression as

c^*(x) = \arg\max_{j=1,\ldots,K} \frac{\text{Prob}\{x | y = c_j\} \text{Prob}\{y = c_j\}}{\text{Prob}\{x\}} = \arg\max_{j=1,\ldots,K} \text{Prob}\{x | y = c_j\} \text{Prob}\{y = c_j\}
How can these two terms be estimated on the basis of a finite set of data? It is easy
to estimate each of the a priori probabilities Prob{y = c_j} simply by counting the
frequency with which each target class occurs in the training set. The estimation
of Prob{x | y = c_j} is much harder. The NB classifier is based on the simplifying
assumption that the input values are conditionally independent given the target
value (see Section 3.5.4):

\text{Prob}\{x | y = c_j\} = \text{Prob}\{x_1, \ldots, x_n | y = c_j\} = \prod_{h=1}^n \text{Prob}\{x_h | y = c_j\}
The NB classification is then

c_{NB}(x) = \arg\max_{j=1,\ldots,K} \text{Prob}\{y = c_j\} \prod_{h=1}^n \text{Prob}\{x_h | y = c_j\}

If the inputs x_h are discrete variables, the estimation of Prob{x_h | y = c_j} boils
down to counting the frequencies of the occurrences of the different values of x_h
for a given class c_j.
Example
Obs G1 G2 G3 G
1 P.LOW P.HIGH N.HIGH P.HIGH
2 N.LOW P.HIGH P.HIGH N.HIGH
3 P.LOW P.LOW N.LOW P.LOW
4 P.HIGH P.HIGH N.HIGH P.HIGH
5 N.LOW P.HIGH N.LOW P.LOW
6 N.HIGH N.LOW P.LOW N.LOW
7 P.LOW N.LOW N.HIGH P.LOW
8 P.LOW N.HIGH N.LOW P.LOW
9 P.HIGH P.LOW P.LOW N.LOW
10 P.HIGH P.LOW P.LOW P.LOW
Let us compute the NB classification for the query {G1 = N.LOW, G2 = N.HIGH,
G3 = N.LOW}. Since

Prob{y = P.HIGH} = 2/10, Prob{y = P.LOW} = 5/10
Prob{y = N.HIGH} = 1/10, Prob{y = N.LOW} = 2/10
Prob{G1 = N.LOW | y = P.HIGH} = 0/2, Prob{G1 = N.LOW | y = P.LOW} = 1/5
Prob{G1 = N.LOW | y = N.HIGH} = 1/1, Prob{G1 = N.LOW | y = N.LOW} = 0/2
Prob{G2 = N.HIGH | y = P.HIGH} = 0/2, Prob{G2 = N.HIGH | y = P.LOW} = 1/5
Prob{G2 = N.HIGH | y = N.HIGH} = 0/1, Prob{G2 = N.HIGH | y = N.LOW} = 0/2
Prob{G3 = N.LOW | y = P.HIGH} = 0/2, Prob{G3 = N.LOW | y = P.LOW} = 3/5
Prob{G3 = N.LOW | y = N.HIGH} = 0/1, Prob{G3 = N.LOW | y = N.LOW} = 0/2

it follows that

c_{NB}(x) = \arg\max_{\{P.H, P.L, N.H, N.L\}} \{2/10 \cdot 0 \cdot 0 \cdot 0,\; 5/10 \cdot 1/5 \cdot 1/5 \cdot 3/5,\; 1/10 \cdot 1 \cdot 0 \cdot 0,\; 2/10 \cdot 0 \cdot 0 \cdot 0\} = \text{P.LOW}
•
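R example
A minimal sketch of NB estimation by frequency counting (no smoothing), reproducing the worked example above; the data are the ten observations of the table.

G1 <- c("P.LOW","N.LOW","P.LOW","P.HIGH","N.LOW","N.HIGH","P.LOW","P.LOW","P.HIGH","P.HIGH")
G2 <- c("P.HIGH","P.HIGH","P.LOW","P.HIGH","P.HIGH","N.LOW","N.LOW","N.HIGH","P.LOW","P.LOW")
G3 <- c("N.HIGH","P.HIGH","N.LOW","N.HIGH","N.LOW","P.LOW","N.HIGH","N.LOW","P.LOW","P.LOW")
G  <- c("P.HIGH","N.HIGH","P.LOW","P.HIGH","P.LOW","N.LOW","P.LOW","P.LOW","N.LOW","P.LOW")
q <- c("N.LOW", "N.HIGH", "N.LOW")           # the query of the example
score <- sapply(unique(G), function(cl) {    # prior times conditional frequencies
  mean(G == cl) * mean(G1[G == cl] == q[1]) *
    mean(G2[G == cl] == q[2]) * mean(G3[G == cl] == q[3])
})
names(which.max(score))                      # returns "P.LOW"
•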
The NB classifier relies on the naive (i.e. simplistic) assumption that the inputs
are independent given the target class. But why is this assumption made, and when
may it be considered realistic? There are essentially two reasons underlying the
NB approach, one of a statistical nature and the other of a causal nature. From a
statistical perspective, the conditional independence assumption largely reduces the
capacity of the classifier by reducing the number of parameters (Section 4.1). This
is a variance reduction argument which makes the algorithm effective in large-
dimensional classification tasks. However, there are classification tasks which by
their own nature are more compliant with the NB assumptions than others. Those
are tasks where the features used to predict the class are descriptors of the
phenomenon represented by the class. Think for instance of the task where a doctor
predicts whether a patient has the flu by means of symptomatic information (does
the patient cough? is there fever?). All those measures are correlated, but they
become independent once we know the latent state. From a causal perspective
(Chapter 13), NB makes the assumption that the considered input features are
effects of a common variable (the target class) (Figure 13.9, left). Another example
where this assumption holds is fraud detection [52, 50], where the observed features
(e.g. place and amount of a transaction) are consequences of the fraudulent action
and hence informative about it.
10.2.3.2 SVM for nonlinear classification
The extension of the Support Vector (SV) approach to nonlinear classification re-
lies on the transformation of the input variables and the possibility of effectively
adapting the SVM procedure to a transformed input space.
The idea of transforming the input space by using basis functions is an intuitive
manner of extending linear techniques to a nonlinear setting.
Consider for example an input/output regression problem where x ∈ X ⊂ R^n.
Let us define m new transformed variables z_j = z_j(x), j = 1, ..., m, where z_j(·)
is a pre-defined nonlinear transformation (e.g. z_j(x) = log x_1 + log x_2). This is
equivalent to mapping the input space X into a new space, also known as the
feature space, Z = {z = z(x) | x ∈ X}. Note that, if m < n, this transformation
boils down to a dimensionality reduction and is an example of feature selection
(Chapter 12).
Let us now fit a linear model y = \sum_{j=1}^m \beta_j z_j to the training data in
the new input space z ∈ R^m. By doing this, we carry out a nonlinear fitting of the
data simply by using a conventional linear technique.
This transformation can be adopted with any learning procedure. However, it is
particularly worthwhile in an SV framework. Before discussing why, we introduce
the notion of a dot-product kernel.
Definition 2.1 (Dot-product kernel). A dot-product kernel is a function K such
that for all x_i, x_j ∈ X

K(x_i, x_j) = \langle z(x_i), z(x_j) \rangle    (10.2.53)

where \langle z_1, z_2 \rangle = z_1^T z_2 stands for the inner product and z(·) is
the mapping from the original space to the feature space Z.
Let us suppose now that we want to perform a binary classification by SVM
in a transformed space z ∈ Z. For the sake of simplicity, we will consider a
separable case. The parametric identification step requires the solution of a
quadratic programming problem in the space Z:

\max_{\alpha} \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{k=1}^N \alpha_i \alpha_k y_i y_k z_i^T z_k    (10.2.54)
= \max_{\alpha} \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{k=1}^N \alpha_i \alpha_k y_i y_k \langle z_i, z_k \rangle    (10.2.55)
= \max_{\alpha} \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{k=1}^N \alpha_i \alpha_k y_i y_k K(x_i, x_k)    (10.2.56)

subject to

\sum_{i=1}^N \alpha_i y_i = 0,    (10.2.57)
\alpha_i \geq 0, \quad i = 1, \ldots, N    (10.2.58)
What is interesting is that the resolution of this problem differs from the linear
one (Equation (9.2.70)) only by the replacement of the quantities ⟨x_i, x_k⟩ with
⟨z_i, z_k⟩ = K(x_i, x_k).
This means that, whatever the feature transformation z(x) and whatever the
dimensionality m, the SVM computation requires only the availability of the Gram
matrix in the feature space, also referred to as the kernel matrix K. Moreover, once
we know how to derive the kernel matrix, we do not even need to know the
underlying feature transformation function z(x).
The use of a kernel function is an attractive computational short-cut. In practice,
the approach consists in defining a kernel function directly, hence implicitly defining
the feature space.
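R example
A minimal sketch of the kernel shortcut: the dual problem (10.2.56) only needs the Gram matrix K(x_i, x_k), here computed for a Gaussian kernel without ever materialising z(x); the data and the kernel width are assumptions.

gram <- function(X, sigma = 1) {
  D2 <- as.matrix(dist(X))^2        # pairwise squared Euclidean distances
  exp(-D2 / (2 * sigma^2))          # Gaussian kernel matrix
}
set.seed(0)
X <- matrix(rnorm(10 * 2), ncol = 2)
K <- gram(X)                        # 10 x 10 kernel (Gram) matrix
## a valid dot-product kernel yields a positive semi-definite Gram matrix
all(eigen(K, symmetric = TRUE, only.values = TRUE)$values > -1e-10)
•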
10.3 Is there a best learner?
A vast amount of literature in machine learning has served the purpose of showing
the superiority of some learning methods over others. To support this claim,
qualitative considerations and tons of experimental simulations have been submitted to
the scientific community. Every machine learning researcher dreams of inventing the
most accurate algorithm, without realising that the attainment of such an objective
would necessarily mean the end of machine learning... But is there an algorithm to
be universally preferred over others in terms of prediction accuracy?
If there were a universally best learning machine, research on machine learning
would be unnecessary: we would use it all the time. (Un)fortunately, the theoretical
results on this subject are not encouraging [58]. For any number N of observations,
there exists an input/output distribution for which the estimate of the
generalisation error is arbitrarily poor. At the same time, for any learning machine
L1 there exist a data distribution and another learning machine L2 such that, for
all N, L2 is better than L1.
It can be shown that there is no learning algorithm which is inherently superior
to any other, or even to random guessing. The accuracy depends on the match
between the (unknown) target distribution and the (implicit or explicit) inductive
bias of the learner.
This (surprising?) result has been formalised by the No Free Lunch (NFL)
theorems by D. Wolpert [197]. In his seminal work, Wolpert characterises in prob-
abilistic terms the relation between target function, dataset and hypothesis. The
main difference with respect to other research on generalisation is that he does not
consider the generating process as constant (e.g. f fixed as in the bias/variance
decomposition (7.7.36)), but supposes the existence of a probability distribution
p(f) over the target functions f and that a learning algorithm implements a
probability distribution p(h) over the hypotheses h. For instance, p(h = h | D_N)
denotes the probability that a learning algorithm will return the hypothesis h given
the training set D_N⁶. Based on this formalism, he encodes the following
assumptions in a probabilistic language:
• the target distribution is completely outside the researcher's control [197],
• the learning algorithm designer has no knowledge about f when guessing a
hypothesis function.
This means that, over the input space region where we observed no training
examples (the off-training region), the hypothesis h is conditionally independent
(Section 3.5.4) of f given the training set:

p(h | f, D_N) = p(h | D_N)

which in turn is equivalent to

p(f | h, D_N) = p(f | D_N)
He then derives the generalisation error of a learning algorithm L, conditioned
on a training set D_N and computed on input values which do not belong to the
training set (i.e. the off-training-set region), as

\sum_{x \notin D_N} E_{f,h}[L(f, h) | D_N](x) = \sum_{x \notin D_N} \int_{f,h} L(h(x), f(x)) \, p(f, h | D_N) \, df \, dh = \sum_{x \notin D_N} \int_{f,h} L(h(x), f(x)) \, p(f | D_N) \, p(h | D_N) \, df \, dh
(10.3.59)
It follows that the generalisation error⁷ depends on the alignment (or match)
between the hypothesis h returned by L and the target f, which is represented by
the inner product p(f | D_N) p(h | D_N). Since the target is unknown, this match
may only be assessed a posteriori: a priori there is no reason to consider one
learning algorithm better than another. For any learning algorithm which is well
aligned with the distribution p(f | D_N) in the off-training set, it is possible to find
another distribution for which the match is much worse. Equation (10.3.59) is one
of the several NFL results stating that there is no problem-independent reason to
favour one learning algorithm L over another (not even random guessing) if
1. we are interested only in generalisation accuracy,
2. we make no a priori assumption on the target distribution,
3. we restrict ourselves to the accuracy over the off-training-set region.
⁶Note that throughout this book we have only considered deterministic learning algorithms,
for which p(h | D_N) is a Dirac function.
⁷Note the differences between the definitions (7.7.36) and (10.3.59) of generalisation error.
In (7.7.36) f is fixed and D_N is random; in (10.3.59) f is random and D_N is fixed.
x1 x2 x3 | y | ŷ1 ŷ2 ŷ3 ŷ4
0  0  0  | 1 | 1  1  1  1
0  0  1  | 0 | 0  0  0  0
0  1  0  | 1 | 1  1  1  1
0  1  1  | 1 | 1  1  1  1
1  0  0  | 0 | 0  0  0  0
1  0  1  | 0 | 0  0  0  0
1  1  0  | ? | 0  0  1  1
1  1  1  | ? | 0  1  0  1
Table 10.1: Off-training set prediction binary example
The presumed overall superiority of a learning algorithm (no matter the number of
publications or the H-factor of the author) is apparent and depends on the specific
task and underlying data generating process. The NFL theorems are then a modern
probabilistic version of Hume's skeptical argument (Section 2.4): there is no
logical evidence that the future will behave like the past. Any prediction or
modelling effort demands (explicitly or implicitly) an assumption about the data
generating process, and the resulting accuracy is strictly related to the validity of
such an assumption. Note that such assumptions also underlie learning procedures
that seem to be general-purpose and data-driven, like holdout or cross-validation:
for instance, a holdout strategy makes the assumption that the relation between
the training portion of the dataset and the validation one is informative about the
relation between the observed dataset and future query points (in off-training
regions).
An NFL example
Let us consider a classification example from [61] where we have three binary
inputs and one binary target which is a deterministic function of the inputs. Let us
suppose that the value of the target is known for 6 input configurations and that
we want to predict the value for the 2 remaining ones (Table 10.1). Let us consider
4 classifiers which have identical behaviour on the training set yet differ in
terms of their predictions for the off-training region. Which one is the best in
the off-training-set region? May we discriminate between them on the basis of the
training behaviour, if we make no additional assumption about the input/output
relationship in the off-training-set region? The No Free Lunch answer is no. If
we assume no a priori information about the conditional distribution, we have 4
equiprobable off-training behaviours. On average, the four predictors have the same
accuracy. Note that each predictor could have a justification in terms of nearest
neighbours (Table 10.2). For instance, the first classifier (in black) relies on the
inductive bias that the first and the third features are the most informative about
the target: if we consider in the dataset the nearest neighbours of the off-training
inputs, we obtain two zeros as predictions. This is not the case for the fourth
(red), which implicitly makes a different hypothesis (i.e. that the target depends
on x2 only) and returns two ones accordingly.
•
10.4 Conclusions
A large part of machine learning research in the last forty years has been devoted to
the quest for the Holy Grail of generalisation. This chapter presented a number of
learning algorithms and their rationale. Most of those algorithms made the history
of machine learning and were undeniably responsible for the success of the
discipline. When they were introduced, and every time they were used afterwards,
they were shown to be competitive and often to outperform other algorithms. So,
how can all this be compatible with the No Free Lunch result? First of all, the NFL
result does not deny that some algorithms may generalise well under some
circumstances. It simply states that there is no single algorithm consistently
outperforming all the others. Also, NFL results assume that off-training-set
accuracy is the most pertinent measure for assessing algorithms. Last but not
least, the success of statistics (and ML) is probably indicative that the prediction
tasks we are commonly confronted with do not come from such a wide and uniform
distribution, but that some of them are more probable than others.

Table 10.2: Off-training set prediction binary example: nearest-neighbour interpre-
tation of the four classifiers. The colours show which training points have been used
according to a nearest-neighbour strategy to return the off-training set predictions.
Nevertheless, the NFL results may appear frustrating to a young researcher
aiming to pursue a career in machine learning (and incidentally finding the Holy
Grail): this is not necessarily so if we define, in a less utopian yet more scientific
way, the mission of a data scientist. The mission of a data scientist should not
be the promotion of a specific algorithm (or family of algorithms) but acting as
a scientist through the analysis of data. This means (s)he should use his/her know-
how NOT to return information about the merits of his/her preferred algorithm BUT
about the nature of the data generating distribution (s)he is dealing with. The
outcome of a ML research activity (e.g. a publication) should be additional insight
about the observed reality (or Nature) and not a contingent statement about the
temporary superiority of an algorithm. Newton's aim was to use differential calculus
to model and explain dynamics in Nature and not to promote a fancy differential
equation tool⁸. Consider also that every ML algorithm (even the least fashionable
and the least performing) might return some information (e.g. about the degree of
noise or nonlinearity) about the phenomenon we are observing. For instance, a
poorly accurate linear model tells us a lot about the lack of validity of the
(embedded) linearity assumption in the observed phenomenon. In that sense, wrong
models might play a relevant role as well, since they might return important
information about the phenomenon under observation, notably (non)linearity,
(non)stationarity, degree of stochasticity, relevance of features and nature of noise.
⁸...or patent it!
10.5 Exercises
1. Suppose you want to learn a classifier for detecting spam in emails. Let the binary
variables x1 , x2 and x3 represent the occurrence of the words "Viagra", "Lottery"
and "Won", respectively, in a email.
Let the dataset of 20 emails be summarised as follows
Document   x1 (Viagra)   x2 (Lottery)   x3 (Won)   y (Class)
E1 0 0 0 NOSPAM
E2 0 1 1 SPAM
E3 0 0 1 NOSPAM
E4 0 1 1 SPAM
E5 1 0 0 SPAM
E6 1 1 1 SPAM
E7 0 0 1 NOSPAM
E8 0 1 1 SPAM
E9 0 0 0 NOSPAM
E10 0 1 1 SPAM
E11 1 0 0 NOSPAM
E12 0 1 1 SPAM
E13 0 0 0 NOSPAM
E14 0 1 1 SPAM
E15 0 0 1 NOSPAM
E16 0 1 1 SPAM
E17 1 0 0 SPAM
E18 1 1 1 SPAM
E19 0 0 1 NOSPAM
E20 0 1 1 SPAM
where
•0 stands for the case-insensitive absence of the word in the email.
•1 stands for the case-insensitive presence of the word in the email.
Let y = 1 denote a spam email and y= 0 a no-spam email.
2. Estimate on the basis of the data of exercise 1:
• Prob{x1 = 1, x2 = 1}
• Prob{y = 0 | x2 = 1, x3 = 1}
• Prob{x1 = 0 | x2 = 1}
• Prob{x3 = 1 | y = 0, x2 = 0}
• Prob{y = 0 | x1 = 0, x2 = 0, x3 = 0}
• Prob{x1 = 0 | y = 0}
• Prob{y = 0}
Solution:
• Prob{x1 = 1, x2 = 1} = 0.1
• Prob{y = 0 | x2 = 1, x3 = 1} = 0
• Prob{x1 = 0 | x2 = 1} = 0.8
• Prob{x3 = 1 | y = 0, x2 = 0} = 0.5
• Prob{y = 0 | x1 = 0, x2 = 0, x3 = 0} = 1
• Prob{x1 = 0 | y = 0} = 0.875
• Prob{y = 0} = 0.4
3. Answer the following questions (Yes or No) on the basis of the data of exercise 1:
•Are x1 and x2 independent?
•Are x1 and y independent?
•Are the events x1 = 1 and x2 = 1 mutually exclusive?
Solution:
•Are x1 and x2 independent? NO
•Are x1 and y independent? NO
•Are the events x1 = 1 and x2 = 1 mutually exclusive? NO
4. Consider the following three emails
•M1: "Lowest Viagra, Cialis, Levitra price".
•M2: "From Google Promo (GOOGLEPROMOASIA) Congratulation! Your
mobile won 1 MILLION USD in the GOOGLE PROMO"
•M3: "This is to inform you on the release of the EL-GORDO SWEEPSTAKE
LOTTERY PROGRAM. Your name is attached to ticket number 025-11-464-
992-750 with serial number 2113-05 drew the lucky numbers 13-15 which con-
sequently won the lottery in the 3rd category."
Use a Naive Bayes Classifier to compute for email M1 on the basis of the data of
exercise 1:
• the input x
• Prob{y = SPAM | x} Prob{x}
• Prob{y = NOSPAM | x} Prob{x}
• the email class.
Solution:
• the input x = [1, 0, 0]
• Prob{y = SPAM | x} Prob{x} = 1/180 = 0.0055
• Prob{y = NOSPAM | x} Prob{x} = 1/40 = 0.025
• the email class is NOSPAM.
5. Use a Naive Bayes Classifier to compute for email M2 on the basis of the data of
exercise 1:
• the input x
• Prob{y = SPAM | x} Prob{x}
• Prob{y = NOSPAM | x} Prob{x}
• the email class.
Solution:
• the input x = [0, 0, 1]
• Prob{y = SPAM | x} Prob{x} = 1/18 = 0.055
• Prob{y = NOSPAM | x} Prob{x} = 7/40 = 0.175
• the email class is NOSPAM.
6. Use a Naive Bayes Classifier to compute for email M3 on the basis of the data of
exercise 1:
• the input x
• Prob{y = SPAM | x} Prob{x}
• Prob{y = NOSPAM | x} Prob{x}
• the email class.
Solution:
• the input x = [0, 1, 1]
• Prob{y = SPAM | x} Prob{x} = 5/18 = 0.27
• Prob{y = NOSPAM | x} Prob{x} = 0
• the email class is SPAM.
7. Consider a classification task with two binary inputs and one binary target y ∈ {−1, +1} where the conditional distribution is

x1 x2 P(y = 1 | x1, x2)
0  0  0.8
0  1  0.1
1  0  0.5
1  1  1

Suppose that all the input configurations have the same probability.
Let the classifier be the rule: IF x2 = 0 THEN ŷ = −1 ELSE ŷ = 1.
Consider a test set of size N = 10000.
For this classifier compute:
• the confusion matrix,
• the precision,
• the specificity (true negative rate),
• the sensitivity (true positive rate).
Solution:
• the confusion matrix:

          ŷ = −1      ŷ = 1
y = −1    TN = 1750   FP = 2250
y = 1     FN = 3250   TP = 2750

• the precision TP/(TP+FP) = 2750/5000 = 0.55
• the specificity (true negative rate) TN/(TN+FP) = 1750/4000 = 0.4375
• the sensitivity (true positive rate) TP/(TP+FN) = 2750/6000 ≈ 0.458
8. Consider a classification task with two binary inputs and one binary target y ∈ {−1, +1} where the conditional distribution is

x1 x2 P(y = 1 | x1, x2)
0  0  0.8
0  1  0.1
1  0  0.5
1  1  1

Suppose that all the input configurations have the same probability.
Let the classifier be the rule: IF x2 = 0 THEN ŷ = −1 ELSE ŷ = 1.
Consider a test set of size N = 10000.
For this classifier compute:
• the confusion matrix,
• the precision,
• the specificity (true negative rate),
• the sensitivity (true positive rate).
Solution:
• the confusion matrix:

          ŷ = −1      ŷ = 1
y = −1    TN = 1750   FP = 2250
y = 1     FN = 3250   TP = 2750

• the precision TP/(TP+FP) = 2750/5000 = 0.55
• the specificity (true negative rate) TN/(TN+FP) = 1750/4000 = 0.4375
• the sensitivity (true positive rate) TP/(TP+FN) = 2750/6000 ≈ 0.458
9. Consider a regression task with input x and output y and the following training set

X     Y
0     0.5
-0.3  1.2
0.2   1
0.4   0.5
0.1   0
-1    1.1

Consider the three following models:
• constant
• 1NN, Nearest Neighbour with K = 1
• 3NN, Nearest Neighbour with K = 3
For each of the three models compute:
• the vector of training errors e_i = y_i − ŷ_i
• the vector of leave-one-out errors e_i^{−i} = y_i − ŷ_i^{−i}
• the mean-squared training error,
• the mean-squared leave-one-out error.
Solution: Constant model:
• training errors e_i = y_i − ŷ_i = [−0.2167, 0.4833, 0.2833, −0.2167, −0.7167, 0.3833]
• leave-one-out errors e_i^{−i} = y_i − ŷ_i^{−i} = [−0.26, 0.58, 0.34, −0.26, −0.86, 0.46]
• mean-squared training error = 0.178
• mean-squared leave-one-out error = 0.2564
1NN model:
• training errors e_i = y_i − ŷ_i = [0, 0, 0, 0, 0, 0]
• leave-one-out errors: since the point x = 0.1 has two nearest neighbours at the same distance (x = 0 and x = 0.2), two vectors are possible: e_i^{−i} = [0.5, 0.7, 1, −0.5, −0.5, −0.1] or [0.5, 0.7, 1, −0.5, −1, −0.1]
• mean-squared training error = 0
• mean-squared leave-one-out error = 0.375 or 0.5, according to how the tie is broken
3NN model:
• training errors e_i = y_i − ŷ_i = [0, 0.6333, 0.5, 0, −0.5, 0.1667]
• leave-one-out errors e_i^{−i} = [−0.2333, 0.7, 0.6667, 0, −0.6667, 0.5333]
• mean-squared training error = 0.1548
• mean-squared leave-one-out error = 0.2862
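A minimal R sketch (not the book's script) reproducing these leave-one-out computations; loo.knn is a hypothetical helper covering the 1NN and 3NN cases:

X <- c(0, -0.3, 0.2, 0.4, 0.1, -1)
Y <- c(0.5, 1.2, 1, 0.5, 0, 1.1)
N <- length(Y)

e.const <- sapply(1:N, function(i) Y[i] - mean(Y[-i]))  # LOO errors of the constant model
loo.knn <- function(k)
  sapply(1:N, function(i) {
    d <- abs(X[-i] - X[i])                    # distances to the remaining points
    Y[i] - mean(Y[-i][order(d)[1:k]])         # error w.r.t. the mean of the k nearest
  })

mean(e.const^2)     # 0.2564
mean(loo.knn(1)^2)  # 0.375 (order() breaks the tie at x = 0.1 in favour of x = 0)
mean(loo.knn(3)^2)  # 0.2862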
10. Consider a classification task with three binary inputs and one binary target where the conditional distribution is

x1 x2 x3 P(y = 1 | x1, x2, x3)
0  0  0  0.8
0  0  1  0.9
0  1  0  0.5
0  1  1  1
1  0  0  0.8
1  0  1  0.1
1  1  0  0.1
1  1  1  0

Suppose that all the input configurations have the same probability.
Let the classifier be the rule: IF x1 = 0 OR x2 = 0 THEN ŷ = 1 ELSE ŷ = 0.
Suppose we have a test set of size N = 10000.
Considering the class 1 as the positive class, for this classifier compute:
• the confusion matrix,
• the precision,
• the specificity (true negative rate) and
• the sensitivity (true positive rate).
Solution: Each of the eight input configurations occurs on average 1250 times in the test set.
• the confusion matrix:

x1 x2 x3   #(y = 1)  #(ŷ = 1)   TP    FP    TN    FN
0  0  0    1000      1250       1000  250   0     0
0  0  1    1125      1250       1125  125   0     0
0  1  0    625       1250       625   625   0     0
0  1  1    1250      1250       1250  0     0     0
1  0  0    1000      1250       1000  250   0     0
1  0  1    125       1250       125   1125  0     0
1  1  0    125       0          0     0     1125  125
1  1  1    0         0          0     0     1250  0

         ŷ = 1       ŷ = 0
y = 1    TP = 5125   FN = 125
y = 0    FP = 2375   TN = 2375

• the precision = 5125/(5125+2375) ≈ 0.68
• the specificity (true negative rate) = 2375/(2375+2375) = 0.5
• the sensitivity (true positive rate) = 5125/(5125+125) ≈ 0.976
11. Let us consider the following classification dataset where y is the binary target.

x1  x2   y
-4  7.0  1
-3  -2.0 1
-2  5.0  0
-1  2.5  1
1   1.0  0
2   4.0  1
3   6.0  0
4   3.0  1
5   -1.0 0
6   8.0  0

• Consider the 1st classifier: IF x1 > h THEN ŷ = 1 ELSE ŷ = 0. Trace its ROC curve (considering 1 as the positive class).
• Consider the 2nd classifier: IF x2 > k THEN ŷ = 0 ELSE ŷ = 1. Trace its ROC curve (considering 1 as the positive class).
• Which classifier is the best one (1st/2nd)?
Solution:
• 1st classifier: [Figure: ROC curve of the 1st classifier, TPR vs. FPR]
• 2nd classifier: [Figure: ROC curve of the 2nd classifier, TPR vs. FPR]
• Which classifier is the best one (1st/2nd)? The 2nd.
12. Consider a binary classification task and the training set

x1   x2   y
1    1    -1
2    0.5  -1
1.5  2.5  -1
3    1.5  1
2.5  3    1
4    2.5  1

Consider a linear perceptron initialised with the boundary line x2 = 2, which classifies as positive the points over the line. The student should:
1. Perform one step of gradient descent with stepsize 0.1 and compute the updated coefficients of the perceptron line with equation β0 + β1 x1 + β2 x2 = 0.
2. Trace the initial boundary line (in black), the updated boundary line (in red) and the training points.
Solution:
In the initial perceptron β0 = −2, β1 = 0 and β2 = 1. The misclassified points are the third and the fourth (opposite label). Since

∂R/∂β = − Σ_{miscl} y_i x_i = [1.5, 2.5]ᵀ − [3, 1.5]ᵀ = [−1.5, 1]ᵀ

and

∂R/∂β0 = − Σ_{miscl} y_i = 0,

after one iteration β0 remains the same while

[β1, β2]ᵀ ← [β1, β2]ᵀ − 0.1 · [−1.5, 1]ᵀ = [0, 1]ᵀ + [0.15, −0.1]ᵀ = [0.15, 0.9]ᵀ

The updated coefficients of the perceptron line are then
• β0 = −2
• β1 = 0.15
• β2 = 0.9
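The single gradient step can be checked numerically; this is a minimal R sketch assuming the perceptron cost R(β) = −Σ_{miscl} y_i (β0 + βᵀx_i):

X <- matrix(c(1, 1, 2, 0.5, 1.5, 2.5, 3, 1.5, 2.5, 3, 4, 2.5),
            ncol = 2, byrow = TRUE)
y <- c(-1, -1, -1, 1, 1, 1)
beta0 <- -2; beta <- c(0, 1)               # initial boundary x2 = 2
eta <- 0.1

yhat  <- sign(beta0 + X %*% beta)
miscl <- which(yhat != y)                  # points 3 and 4
grad  <- -colSums(y[miscl] * X[miscl, ])   # dR/dbeta = -(sum of y_i x_i over misclassified)
grad0 <- -sum(y[miscl])                    # dR/dbeta0 = 0
beta  <- beta - eta * grad                 # -> (0.15, 0.9)
beta0 <- beta0 - eta * grad0               # unchanged (-2)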
13. Consider the data set in exercise 9 and fit to it a Radial Basis Function with 2 basis functions having as parameters µ(1) = −0.5 and µ(2) = 0.5. The equation of the basis function is

ρ(x, µ) = exp(−(x − µ)²)

The student should
1. write in matrix notation the linear system to be solved for obtaining the weights of the radial basis function
2. compute the weights of the radial basis function
Hint:

A = [a11 a12; a12 a22]  ⇒  A⁻¹ = 1/(a11 a22 − a12²) [a22 −a12; −a12 a11]

Solution:
1. matrix notation: w = (XᵀX)⁻¹ XᵀY where (rows sorted by increasing input value)

X =
0.779 0.105
0.961 0.527
0.779 0.779
0.698 0.852
0.613 0.914
0.445 0.990

2. weights of the radial basis function: w = [1.25, −0.27]
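A minimal R sketch (not the book's script) of the least-squares computation of the weights:

X  <- c(0, -0.3, 0.2, 0.4, 0.1, -1)                 # inputs of exercise 9
Y  <- c(0.5, 1.2, 1, 0.5, 0, 1.1)                   # corresponding targets
mu <- c(-0.5, 0.5)
H  <- outer(X, mu, function(x, m) exp(-(x - m)^2))  # [6, 2] matrix of basis activations
w  <- solve(t(H) %*% H, t(H) %*% Y)                 # least-squares weights: ~ (1.25, -0.27)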
14. Let us consider a classification task with 3 binary inputs and one binary output. Suppose we collected the following training set

x1 x2 x3 y
0  1  0  1
0  0  1  0
0  1  0  0
0  1  1  0
1  0  0  0
1  0  1  1
1  1  0  0
0  1  1  0
1  0  1  0
1  0  0  0
1  1  0  0
0  1  1  0

1. Estimate the following quantities by using the frequency as estimator of probability:
• Prob{y = 1}
• Prob{y = 1 | x1 = 0}
• Prob{y = 1 | x1 = 0, x2 = 0, x3 = 0}
2. Compute the classification returned by the Naive Bayes classifier for the input x1 = 0, x2 = 0, x3 = 0.
3. Suppose we test a classifier for this task and obtain a misclassification error equal to 20%. Is it more accurate than a zero classifier, i.e. a classifier returning always zero?
Solution: Let us note that N = 12.
1.
• P̂rob{y = 1} = 2/12 = 1/6
• P̂rob{y = 1 | x1 = 0} = 1/6
• P̂rob{y = 1 | x1 = 0, x2 = 0, x3 = 0} cannot be estimated using the frequency since there is no observation where x1 = 0, x2 = 0, x3 = 0.
2. Since

P̂rob{y = 1 | x1 = 0, x2 = 0, x3 = 0} ∝ P̂rob{x1 = 0 | y = 1} P̂rob{x2 = 0 | y = 1} P̂rob{x3 = 0 | y = 1} P̂rob{y = 1} = 0.5 · 0.5 · 0.5 · (1/6) ≈ 0.02

and

P̂rob{y = 0 | x1 = 0, x2 = 0, x3 = 0} ∝ P̂rob{x1 = 0 | y = 0} P̂rob{x2 = 0 | y = 0} P̂rob{x3 = 0 | y = 0} P̂rob{y = 0} = (5/10) · (4/10) · (5/10) · (5/6) ≈ 0.08

the NB classification is 0.
3. A zero classifier would always return the class with the highest a priori probability, that is the class 0. Its misclassification error would then be 1/6. Since 1/5 > 1/6, the classifier is less accurate than the zero classifier.
15. Let us consider a classification task with 3 binary inputs and one binary output. Suppose we collected the following training set

x1 x2 x3 y
0  1  0  1
0  0  1  0
0  1  0  0
0  1  1  0
1  0  0  0
1  0  1  1
1  1  0  0
0  1  1  0
1  0  1  0
1  0  0  0
1  1  0  0
0  1  1  0

1. Estimate the following quantities by using the frequency as estimator of probability:
• Prob{y = 1}
• Prob{y = 1 | x1 = 0}
• Prob{y = 1 | x1 = 0, x2 = 0, x3 = 0}
2. Compute the classification returned by the Naive Bayes classifier for the input x1 = 0, x2 = 0, x3 = 0.
3. Suppose we test a classifier for this task and obtain a misclassification error equal to 20%. Is it working better than a zero classifier, i.e. a classifier ignoring the value of the inputs?
Solution: Let us note that N = 12.
1.
• P̂rob{y = 1} = 2/12 = 1/6
• P̂rob{y = 1 | x1 = 0} = 1/6
• P̂rob{y = 1 | x1 = 0, x2 = 0, x3 = 0} cannot be estimated using the frequency since there is no observation where x1 = 0, x2 = 0, x3 = 0.
2. Since

P̂rob{y = 1 | x1 = 0, x2 = 0, x3 = 0} ∝ P̂rob{x1 = 0 | y = 1} P̂rob{x2 = 0 | y = 1} P̂rob{x3 = 0 | y = 1} P̂rob{y = 1} = 0.5 · 0.5 · 0.5 · (1/6) ≈ 0.02

and

P̂rob{y = 0 | x1 = 0, x2 = 0, x3 = 0} ∝ P̂rob{x1 = 0 | y = 0} P̂rob{x2 = 0 | y = 0} P̂rob{x3 = 0 | y = 0} P̂rob{y = 0} = (5/10) · (4/10) · (5/10) · (5/6) ≈ 0.08

the NB classification is 0.
3. A zero classifier would always return the class with the highest a priori probability, that is the class 0. Its misclassification error would then be 1/6. Since 1/5 > 1/6, the classifier is less accurate (i.e. not working better) than the zero classifier.
16. Consider a regression task with input x and output y. Suppose we observe the following training set

X     Y
0.1   1
0     0.5
-0.3  1.2
0.2   1
0.4   0.5
0.1   0
-1    1.1

and that the prediction model is constant. Compute an estimation of its mean integrated squared error by leave-one-out.
Solution: Since the leave-one-out error of the constant model is

e_i^{−i} = y_i − (Σ_{j≠i} y_j)/(N − 1),

we can compute the vector of errors in leave-one-out

e_1^{−1} = 1 − 0.717 = 0.283
e_2^{−2} = 0.5 − 0.8 = −0.3
e_3^{−3} = 1.2 − 0.683 = 0.517
e_4^{−4} = 1 − 0.717 = 0.283
e_5^{−5} = 0.5 − 0.8 = −0.3
e_6^{−6} = 0 − 0.883 = −0.883
e_7^{−7} = 1.1 − 0.7 = 0.4

and then derive the MISE estimation

MISE_loo = (1/N) Σ_{i=1}^N (e_i^{−i})² ≈ 0.22
17. Consider a regression task with input x and output y. Suppose we observe the following training set

X     Y
0.1   1
0     0.5
-0.3  1.2
0.3   1
0.4   0.5
0.1   0
-1    1.1

and that the prediction model is a KNN (nearest neighbour) with K = 1 and Euclidean distance metric. Compute an estimation of its mean squared error by leave-one-out.
Solution:
The leave-one-out error is

e_i^{−i} = y_i − y*_i

where y*_i is the target value associated with x*_i, the nearest neighbour of x_i among the remaining points. Once we rank the training set according to the input value

X     Y
-1    1.1
-0.3  1.2
0     0.5
0.1   1
0.1   0
0.3   1
0.4   0.5

we can compute the vector of errors in leave-one-out

e_1^{−1} = 1.1 − 1.2 = −0.1
e_2^{−2} = 1.2 − 0.5 = 0.7
e_3^{−3} = 0.5 − 1 = −0.5
e_4^{−4} = 1 − 0 = 1
e_5^{−5} = 0 − 1 = −1
e_6^{−6} = 1 − 0.5 = 0.5
e_7^{−7} = 0.5 − 1 = −0.5

and then derive the MISE estimation

MISE_loo = (1/N) Σ_{i=1}^N (e_i^{−i})² ≈ 0.464
18. Consider a regression task with input x and output y. Suppose we observe the following training set

X      Y
0.5    1
1      1
-1     1
-0.25  1
0      0.5
0.1    0
0.25   0.5

Trace the estimation of the regression function returned by a KNN (nearest neighbour) with K = 3 on the interval [−2, 1].
Solution: The resulting graph is piecewise constant and each piece has an ordinate equal to the mean of three points. Once the points are ordered according to the abscissa

    X      Y
x1  -1     1
x2  -0.25  1
x3  0      0.5
x4  0.1    0
x5  0.25   0.5
x6  0.5    1
x7  1      1

these are the five sets of 3 points:

x1, x2, x3 ⇒ ŷ = 2.5/3    (10.5.60)
x2, x3, x4 ⇒ ŷ = 0.5      (10.5.61)
x3, x4, x5 ⇒ ŷ = 1/3      (10.5.62)
x4, x5, x6 ⇒ ŷ = 0.5      (10.5.63)
x5, x6, x7 ⇒ ŷ = 2.5/3    (10.5.64)

The transition from x_i, x_{i+1}, x_{i+2} to x_{i+1}, x_{i+2}, x_{i+3}, i = 1, ..., 4, occurs at the point x = q where q − x_i = x_{i+3} − q, i.e. q = (x_i + x_{i+3})/2.

[Figure: piecewise-constant 3NN estimate of the regression function on the interval [−2, 1]]
19. Consider a supervised learning problem, a training set of size N = 50 and a neural network predictor with a single hidden layer. Suppose that we are able to compute the generalisation error for different numbers H of hidden nodes and we discover that the lowest generalisation error occurs for H = 3. Suppose now that the size of the training set increases (N = 500). For which value of H would you expect the lowest generalisation error? Equal, larger or smaller than 3? Justify your answer by reasoning on the bias/variance trade-off in graphical terms (Figure 10.29).
Solution:
According to (7.7.46) the MISE generalisation error may be decomposed as the sum of the squared bias, the model variance and the noise variance.
In Figure 10.29 we depict the first setting in black and the second one (i.e. increased training set size) in red.
The relationship between the squared bias and the capacity of the model (number H) is represented by the dashed line and the relationship between the variance and the capacity is represented by the continuous thin line. The MISE (taking its minimum in H = 3) is represented by the black thick line. Note that in the figure we do not consider the noise variance since we are comparing two models for the same regression task and the noise variance is in this case an irrelevant additive term.
If the training set size increases, we can expect a variance reduction. This means that the minimum of the MISE term will move to the right. We should then expect that the optimal number of hidden nodes is H > 3.
Note that additional observations have no impact on the squared bias while they contribute to reducing the variance (red thin line). From the red thick line denoting the MISE of the second setting, it appears that arg min_H MSE(H) moved to the right.
[Figure 10.29: MSE vs. complexity, showing the bias/variance decomposition of the generalisation error for the two training set sizes (N = 50 in black, N = 500 in red)]

20. Consider a feedforward neural network with two inputs, no hidden layer and a logistic activation function. Suppose we want to use backpropagation to compute the weights w1 and w2 and that a training dataset is collected. The student should
1. Write the equation of the mapping between x1, x2 and y.
2. Write the two iterative backpropagation equations to compute w1 and w2.
Solution:
1. ŷ = g(z) = g(w1 x1 + w2 x2) where g(z) = 1/(1 + e^{−z}) and g′(z) = e^{−z}/(1 + e^{−z})².
2. The training error is

E = (1/N) Σ_{i=1}^N (y_i − ŷ_i)²

For j = 1, 2

∂E/∂w_j = −(2/N) Σ_{i=1}^N (y_i − ŷ_i) ∂ŷ_i/∂w_j

where ∂ŷ_i/∂w_j = g′(z_i) x_{ij} and z_i = w1 x_{1i} + w2 x_{2i}.
The two backpropagation equations are then

w_j(k+1) = w_j(k) + η (2/N) Σ_{i=1}^N (y_i − ŷ_i) g′(z_i) x_{ij},   j = 1, 2
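A minimal R sketch of these two update equations; the inputs and targets below are hypothetical, invented only for illustration:

g      <- function(z) 1 / (1 + exp(-z))           # logistic activation
gprime <- function(z) exp(-z) / (1 + exp(-z))^2   # its derivative

set.seed(0)
N <- 50
X <- matrix(rnorm(2 * N), N, 2)                   # hypothetical inputs
y <- round(g(0.5 * X[, 1] - X[, 2]))              # hypothetical binary targets
w <- c(0, 0); eta <- 0.1

for (k in 1:100) {                                # gradient descent iterations
  z    <- X %*% w
  yhat <- g(z)
  for (j in 1:2)   # w_j(k+1) = w_j(k) + eta*(2/N)*sum((y - yhat)*g'(z)*x_ij)
    w[j] <- w[j] + eta * 2 / N * sum((y - yhat) * gprime(z) * X[, j])
}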
21. Consider a binary classification problem and the following estimations of the conditional probability P̂rob{y = 1 | x} vs. the real value of the target. Trace the corresponding ROC curve.

P̂rob{y = 1 | x}  CLASS
0.6    1
0.5   -1
0.99   1
0.49  -1
0.1   -1
0.26  -1
0.33   1
0.15  -1
0.05  -1
Solution: Let us first order the dataset in terms of ascending score

P̂rob{y = 1 | x}  CLASS
0.05  -1
0.10  -1
0.15  -1
0.26  -1
0.33   1
0.49  -1
0.50  -1
0.60   1
0.99   1

We let the threshold range over all the values of the score. For each value of the threshold we classify as positive the items having a score bigger than the threshold and as negative the items having a score lower than or equal to the threshold.
For instance, for Thr = 0.26 this is the returned classification:

P̂rob{y = 1 | x}  ŷ   CLASS
0.05  -1  -1
0.10  -1  -1
0.15  -1  -1
0.26  -1  -1
0.33   1   1
0.49   1  -1
0.50   1  -1
0.60   1   1
0.99   1   1

Then we measure the quantities TP, FP, TN and FN and FPR = FP/(TN + FP), TPR = TP/(TP + FN):

Threshold TP FP TN FN FPR TPR
0.05  3  5  1  0  5/6  1
0.10  3  4  2  0  2/3  1
0.15  3  3  3  0  1/2  1
0.26  3  2  4  0  1/3  1
0.33  2  2  4  1  1/3  2/3
0.49  2  1  5  1  1/6  2/3
0.50  2  0  6  1  0    2/3
0.60  1  0  6  2  0    1/3
0.99  0  0  6  3  0    0
[Figure: ROC curve, sensitivity (TPR) vs. FPR]
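A minimal R sketch (not the book's script) reproducing the FPR/TPR table above:

score <- c(0.6, 0.5, 0.99, 0.49, 0.1, 0.26, 0.33, 0.15, 0.05)
class <- c(1, -1, 1, -1, -1, -1, 1, -1, -1)

for (thr in sort(score)) {
  yhat <- ifelse(score > thr, 1, -1)   # positive iff score strictly above threshold
  TP <- sum(yhat == 1 & class == 1);   FP <- sum(yhat == 1 & class == -1)
  TN <- sum(yhat == -1 & class == -1); FN <- sum(yhat == -1 & class == 1)
  cat(thr, ": FPR =", FP / (FP + TN), " TPR =", TP / (TP + FN), "\n")
}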
22. Let us consider a classification task with 3 binary inputs and one binary output. Suppose we collected the following training set

x1 x2 x3 y
1  1  0  1
0  0  1  0
0  1  0  0
1  1  1  1
0  0  0  0
0  1  0  0
0  1  1  0
0  0  1  0
0  0  0  0
0  1  0  0
1  1  1  1

1. Estimate the following quantities by using the frequency as estimator of probability:
• Prob{y = 1}
• Prob{y = 1 | x1 = 0}
• Prob{y = 1 | x1 = 0, x2 = 0, x3 = 0}
2. Consider a Naive Bayes classifier and compute its classifications if the same dataset is used also for testing.
3. Trace the ROC curve associated to the Naive Bayes classifier if the same dataset is used also for testing. (Hint: make the assumption that the denominator of the Bayes formula is 1 for all test points.)
Solution:
1.
• Prob{y = 1} = 3/11
• Prob{y = 1 | x1 = 0} = 0
• Prob{y = 1 | x1 = 0, x2 = 0, x3 = 0} = 0
2. Note that the values of x1 are identical to those of y. Then P̂rob{x1 = A | y = ¬A} = 0. It follows that, if we use a Naive Bayes classifier and the test dataset is equal to the training set, all the predictions will coincide with the values of x1. The training error is then zero.
3. Since all the predictions are correct, the ROC curve is equal to TPR = 1 for all FPR values.
23. Let us consider a binary classification task where the input x ∈ R² is bivariate and the categorical output variable y may take two values: 0 (associated to red) and 1 (associated to green). Suppose that the a-priori probability is p(y = 1) = 0.2 and that the inverse (or class-conditional) distributions are the bivariate Gaussian distributions p(x | y = 0) = N(µ0, Σ0) and p(x | y = 1) = N(µ1, Σ1) where
• µ0 = [0, 0]ᵀ
• µ1 = [1, 1]ᵀ
and both Σ0 and Σ1 are diagonal identity matrices. The student should
1. by using the R function rmvnorm, sample a dataset of N = 1000 input/output observations according to the conditional distribution described above,
2. visualise in a 2D graph the dataset by using the appropriate colours,
3. fit a logistic classifier to the dataset (see details below),
4. plot the evolution of the cost function J(α) during the gradient-based minimisation,
5. plot in the 2D graph the decision boundary.
Logistic regression estimates

P̂(y = 1 | x) = exp(xᵀ α_N) / (1 + exp(xᵀ α_N)) = 1 / (1 + exp(−xᵀ α_N)),   P̂(y = 0 | x) = 1 / (1 + exp(xᵀ α_N))

where

α_N = arg min_α J(α)

and

J(α) = Σ_{i=1}^N [ −y_i x_iᵀ α + log(1 + exp(x_iᵀ α)) ]

Note that α is the vector [α0, α1, α2]ᵀ and that x_i = [1, x_{i1}, x_{i2}]ᵀ, i = 1, ..., N. The value of α_N has to be computed by gradient-based minimisation of the cost function J(α) by performing I = 200 iterations of the update rule

α(τ) = α(τ−1) − η dJ(α(τ−1))/dα,   τ = 1, ..., I

where α(0) = [0, 0, 0]ᵀ and η = 0.001.
Solution:
See the file Exercise5.pdf in the directory gbcode/exercises of the companion R package (Appendix F).
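For readers without access to the companion package, this is a minimal sketch of the requested procedure under the stated setting (rmvnorm is provided by the mvtnorm package; the gradient follows from the definition of J(α) above):

library(mvtnorm)
set.seed(0)
N <- 1000
y <- rbinom(N, 1, 0.2)                                          # a-priori p(y=1) = 0.2
X <- t(sapply(y, function(yi) rmvnorm(1, mean = rep(yi, 2))))   # class-conditional Gaussians

Xd <- cbind(1, X)                       # x_i = [1, x_i1, x_i2]
alpha <- c(0, 0, 0); eta <- 0.001; I <- 200
J <- numeric(I)
for (tau in 1:I) {
  grad  <- t(Xd) %*% (1 / (1 + exp(-Xd %*% alpha)) - y)         # dJ/dalpha
  alpha <- alpha - eta * grad
  J[tau] <- sum(-y * (Xd %*% alpha) + log(1 + exp(Xd %*% alpha)))
}
plot(J, type = "l", xlab = "iteration", ylab = "J")             # cost evolution
plot(X[, 1], X[, 2], col = ifelse(y == 1, "green", "red"))      # dataset
abline(-alpha[1] / alpha[3], -alpha[2] / alpha[3])              # boundary alpha' x = 0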
24. Consider a binary classification task where the input x ∈ R² is bivariate and the categorical output variable y may take two values: 0 (associated to red) and 1 (associated to green). Suppose that the a-priori probability is p(y = 1) = 0.2 and that the inverse (or class-conditional) distributions are
• green/cross class: a mixture of three Gaussians

p(x | y = 1) = Σ_{i=1}^3 w_i N(µ_{1i}, Σ)

where µ11 = [1, 1]ᵀ, µ12 = [−1, −1]ᵀ, µ13 = [3, −3]ᵀ, and w1 = 0.2, w2 = 0.3 (hence w3 = 0.5),
• red/circle class: a bivariate Gaussian p(x | y = 0) = N(µ0, Σ) where µ0 = [0, 0]ᵀ.
The matrix Σ is a diagonal identity matrix.
The student should
• by using the R function rmvnorm, sample a dataset of N = 1000 input/output observations according to the conditional distributions described above,
• visualise in a 2D graph the dataset by using the appropriate colours/marks,
• plot the ROC curves of the following classifiers:
1. linear regression coding the two classes by 0 and 1,
2. Linear Discriminant Analysis where σ² = 1,
3. Naive Bayes where the univariate conditional distributions are Gaussian,
4. k Nearest Neighbour with k = 3, 5, 10.
The classifiers should be trained and tested on the same training set.
• Choose the best classifier on the basis of the ROC curves above.
No R package should be used to implement the classifiers.
Solution:
See the file Exercise6.pdf in the directory gbcode/exercises of the companion R package (Appendix F).
Chapter 11
Model averaging approaches
All the techniques presented so far require a model selection procedure where different model structures are assessed and compared in order to attain the best representation of the data. In model selection, the winner-takes-all approach is intuitively the approach that should work best. However, recent results in machine learning show that the final accuracy can be improved not by choosing the model structure which is expected to predict the best but by creating a model that combines the outputs of models with different structures. The reason is that every hypothesis h(·, α_N) is only an estimate of the real target and, like any estimate, is affected by a bias and a variance term. The theoretical results of Section 5.10 show that a variance reduction can be obtained by combining uncorrelated estimators. This simple idea underlies some of the most effective techniques recently proposed in machine learning. This chapter will sketch some of them.
11.1 Stacked regression
Suppose we have m distinct predictors h_j(·, α_N), j = 1, ..., m obtained from a given training set D_N. For example, a predictor could be a linear model fit on some subset of the variables, a second one a neural network and a third one a regression tree. The idea of averaging models is to design an average estimator

Σ_{j=1}^m β_j h_j(·, α_N)

by linear combination, which is expected to be more accurate than each of the estimators taken individually.
A simple way to estimate the weights β̂_j is to perform a least-squares regression of the output y on the m inputs h_j(·, α_N). The training set for this regression is then made by D_N = {h_i, y_i}:
y = [y1, y2, ..., yN]ᵀ,   H =
h1(x1, αN)  h2(x1, αN)  ...  hm(x1, αN)
h1(x2, αN)  h2(x2, αN)  ...  hm(x2, αN)
...
h1(xN, αN)  h2(xN, αN)  ...  hm(xN, αN)

where the ith row h_i, i = 1, ..., N, is a vector of m terms.
Once the least-squares solution β̂ is computed, the combined estimator is

h_cm(x) = Σ_{j=1}^m β̂_j h_j(x, αN)
Despite its simplicity, the least-squares approach might produce poor results since it does not take into account the correlation existing among the h_j, induced by the fact that all of them are estimated on the same training set D_N.
Wolpert [196] presented an interesting idea, called stacked generalisation, for combining estimators without suffering from the correlation problem. This proposal was translated into statistical language by Breiman, who introduced the stacked regression principle [36].
The idea consists in estimating the m parameters β̂_j by solving the following optimisation task

β̂ = arg min_β Σ_{i=1}^N ( y_i − Σ_{j=1}^m β_j h_j^{(−i)}(x_i) )²

where h_j^{(−i)}(x_i) is the leave-one-out estimate (8.8.2.3) of the jth model.
In other terms, the parameters are obtained by performing a least-squares regression of the output y on the m inputs h_j(·, α_N^{(−i)}). The training set for this regression is then made by D_N = {h_i^{−}, y_i}, i = 1, ..., N:
y = [y1, y2, ..., yN]ᵀ,   H =
h1(x1, αN^(−1))  h2(x1, αN^(−1))  ...  hm(x1, αN^(−1))
h1(x2, αN^(−2))  h2(x2, αN^(−2))  ...  hm(x2, αN^(−2))
...
h1(xN, αN^(−N))  h2(xN, αN^(−N))  ...  hm(xN, αN^(−N))
where h_j(x_i, α_N^(−i)) is the predicted outcome in x_i of the jth model trained on D_N with the ith observation (x_i, y_i) set aside.
By using the cross-validated predictions h_j(x_i, α_N^(−i)), stacked regression avoids giving unfairly high weight to models with higher complexity. It was shown by Breiman that the performance of the stacked regressor improves when the coefficients β̂ are constrained to be non-negative. There is a close connection between stacking and winner-takes-all model selection. If we restrict the minimisation to weight vectors that have one unit weight and the rest zero, we obtain the model choice returned by the winner-takes-all approach based on the leave-one-out. Rather than choosing a single model, stacking combines the m models with estimated optimal weights. This will often lead to better prediction, but less interpretability than the choice of only one of the m models.
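A minimal R sketch of stacked regression on hypothetical data, with m = 2 linear models as constituent predictors; following Breiman, the plain least-squares fit on the last line should preferably be replaced by a non-negative fit (e.g. with the nnls package):

set.seed(0)
N <- 100
x <- runif(N, -2, 2)
y <- sin(2 * x) + rnorm(N, sd = 0.2)
D <- data.frame(x = x, y = y)

H <- matrix(0, N, 2)   # leave-one-out predictions of the m = 2 models
for (i in 1:N) {
  H[i, 1] <- predict(lm(y ~ x, D[-i, ]), D[i, ])           # model 1: linear
  H[i, 2] <- predict(lm(y ~ poly(x, 5), D[-i, ]), D[i, ])  # model 2: degree-5 polynomial
}
beta <- coef(lm(y ~ H - 1))   # stacking weights estimated on the LOO predictions
beta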
11.2 Bagging
A learning algorithm is informally called unstable if small changes in the training data lead to significantly different models and relatively large changes in accuracy. Unstable learners can have low bias but typically have high variance. Unstable methods can have their accuracy improved by perturbing and combining, i.e. by generating multiple versions of the predictor (perturbing the training set or the learning method) and combining them. Breiman calls these techniques P&C methods.
The bagging technique is a P&C technique which aims to improve accuracy for unstable learners by averaging over such discontinuities. The philosophy of bagging is to improve the accuracy by reducing the variance: since the generalisation error of a predictor h(·, α_N) depends on its bias and variance, we obtain an error reduction if we remove the variance term by replacing h(·, α_N) with E_{D_N}[h(·, α_N)]. In practice, since the knowledge of the sampling distribution of the predictor is not available, a non-parametric estimation is required.

Figure 11.1: Histogram of the misclassification rates of the resampled trees: the vertical line represents the misclassification rate of the bagging predictor.
Consider a dataset D_N and a learning procedure to build a hypothesis α_N from D_N. The idea of bagging or bootstrap aggregating is to imitate the stochastic process underlying the realisation of D_N. A set of B repeated bootstrap samples D_N^(b), b = 1, ..., B, are taken from D_N. A model α_N^(b) is built for each D_N^(b). A final predictor is built by aggregating the B models α_N^(b). In the regression case, the bagging predictor is

h_bag(x) = (1/B) Σ_{b=1}^B h(x, α_N^(b))

In the classification case a majority vote is used.
R script
The R script bagging.R shows the efficacy of bagging as a remedy against overfitting.
Consider a dataset D_N = {x_i, y_i}, i = 1, ..., N of N = 100 i.i.d. normally distributed inputs x ∼ N([0, 0, 0], I). Suppose that y is linked to x by the input/output relation

y = x1² + 4 log(|x2|) + 5 x3 + ε

where ε ∼ N(0, 0.25) represents the noise. Let us train a single-hidden-layer neural network with s = 25 hidden neurons on the training set (Section 10.1.1). The prediction accuracy on the test set (N_ts = 100) is MISE_ts = 70.86. Let us apply a bagging combination with B = 50. The prediction accuracy on the test set of the bagging predictor is MISE_ts = 6.7. This shows that the bagging combination reduces the overfitting of the single neural network. Figure 11.1 shows the histogram of the MISE_ts accuracy of each bootstrap repetition: the bagging predictor is much better than the average of the resampled networks.
•
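A minimal sketch of the procedure just described (the book's full version is in bagging.R); here the nnet package plays the role of the single-hidden-layer network:

library(nnet)
set.seed(0)
gen <- function(N) {   # synthetic data generator following the relation above
  X <- matrix(rnorm(3 * N), N, 3)
  data.frame(X, y = X[, 1]^2 + 4 * log(abs(X[, 2])) + 5 * X[, 3] + rnorm(N, sd = 0.5))
}
Dtr <- gen(100); Dts <- gen(100)

B <- 50
P <- matrix(0, nrow(Dts), B)
for (b in 1:B) {
  Db <- Dtr[sample(nrow(Dtr), replace = TRUE), ]   # bootstrap sample D_N^(b)
  hb <- nnet(y ~ ., Db, size = 25, linout = TRUE, trace = FALSE, maxit = 200)
  P[, b] <- predict(hb, Dts)
}
mean((Dts$y - rowMeans(P))^2)   # test MISE of the bagging predictor (average of B nets)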
Tests on real and simulated datasets showed that bagging can give a substantial
gain of accuracy. The vital element is the instability of the prediction method.
If perturbing the learning set can cause significant changes in the predictor con-
structed, then bagging can improve accuracy. On the other hand, it can slightly
degrade the performance of stable procedures. There is a cross-over point between
instability and stability at which bagging stops improving.
Bagging demands the repetition of B estimations of h(·, α_N^(b)) but avoids the use of expensive validation techniques (e.g. cross-validation). An open question, as in bootstrap, is to decide how many bootstrap replicates to carry out. In his experiments, Breiman suggests that B ≈ 50 is a reasonable figure.
Bagging is an ideal procedure for parallel computing. Each estimation of h(·, α_N^(b)), b = 1, ..., B, can proceed independently of the others. At the same time, bagging
is a relatively easy way to improve an existing method. It simply needs adding
1. a loop that selects the bootstrap sample and sends it to the learning machine
and
2. a back-end to perform the aggregation.
Note however that if the original learning machine has an interpretable structure
(e.g. classification tree), this is lost for the sake of increased accuracy.
11.3 Boosting
Boosting is one of the most powerful learning ideas introduced in the last decades. Boosting is a general method which attempts to boost the accuracy of any given learning algorithm. It was originally designed for classification problems, but it can profitably be extended to regression as well. Boosting [75, 168] encompasses a family of methods. The focus of boosting methods is to produce a series of weak learners and combine them into a powerful ensemble. A weak learner is a learner that has an accuracy only slightly better than chance.
The training set used for each member of the series is chosen based on the
performance of the earlier classifier(s) in the series. Examples that are incorrectly
predicted by previous classifiers in the series are chosen more often than examples
that were correctly predicted.
Thus Boosting attempts to produce new classifiers that are better able to predict
examples for which the current ensemble's performance is poor. Unlike Bagging,
the resampling of the training set is dependent on the performance of the earlier
classifiers. The two most important types of boosting algorithms are the Ada Boost (Adaptive Boosting) algorithm (Freund and Schapire, 1997) and the Arcing algorithm (Breiman, 1996).
11.3.1 The Ada Boost algorithm
Consider a binary classification problem where the output takes values in {−1, 1}. Let D_N be the training set. A classifier is a predictor h(·) which, given an input x, produces a prediction taking one of the values {−1, 1}. A weak classifier is one whose misclassification error rate is only slightly better than random guessing.
The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of classifiers h_j(·), j = 1, ..., m. The predictions of the m weak classifiers are then combined through a weighted majority vote to produce the final prediction

h_boo = sign( Σ_{j=1}^m α_j h_j(x, α_N) )
The weights α_j of the different classifiers are computed by the algorithm. The idea is to give stronger influence to the more accurate classifiers in the sequence. At each step, the boosting algorithm samples N times from a distribution w on the training set which puts a weight w_i on each example (x_i, y_i), i = 1, ..., N, of D_N. Initially, the weights are all set to w_i = 1/N so that the first step simply trains the classifier in the standard manner. For each successive iteration j = 1, ..., m the probability weights are individually modified, and the classification algorithm is re-applied to the resampled training set.
At the generic jth step, the observations that were misclassified by the classifier h_{j−1}(·) trained at the previous step have their weights w_i increased, whereas the weights are decreased for those that were classified correctly. The rationale of the approach is that, as the iterations proceed, observations that are hard to classify receive ever-increasing influence and the classifier is forced to concentrate on them.
Note the presence in the algorithm of two types of weights: the weights α_j, j = 1, ..., m, that measure the importance of the classifiers and the weights w_i, i = 1, ..., N, that measure the importance of the observations.
Weak learners are added until some desired low training error has been achieved.
This is the algorithm in detail:
1. Initialise the observation weights w_i = 1/N, i = 1, ..., N.
2. For j = 1 to m:
(a) Fit a classifier h_j(·) to the training data obtained by resampling D_N using weights w_i.
(b) Compute the misclassification error on the training set

MME_emp^(j) = Σ_{i=1}^N w_i I(y_i ≠ h_j(x_i)) / Σ_{i=1}^N w_i

(c) Compute

α_j = log((1 − MME_emp^(j)) / MME_emp^(j))

Note that α_j > 0 if MME_emp^(j) < 1/2 (otherwise we stop or we restart) and that α_j gets larger as MME_emp^(j) gets smaller.
(d) For i = 1, ..., N set

w_i ← w_i exp(−α_j) if correctly classified,   w_i ← w_i exp(α_j) if incorrectly classified.

(e) Normalise the weights to ensure that w_i represents a true distribution.
3. Output the weighted majority vote

h_boo = sign( Σ_{j=1}^m α_j h_j(x, α_N) )
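A minimal R sketch of the loop above on hypothetical data, using an rpart decision stump as weak classifier and assuming the weighted error stays below 1/2:

library(rpart)
set.seed(0)
N <- 200
X <- matrix(rnorm(2 * N), N, 2)
y <- ifelse(X[, 1]^2 + X[, 2]^2 > 2, 1, -1)   # hypothetical target
D <- data.frame(X, y = factor(y))

m <- 15
w <- rep(1 / N, N)   # observation weights
alpha <- numeric(m); h <- list()
for (j in 1:m) {
  idx    <- sample(N, replace = TRUE, prob = w)   # resample D_N with weights w
  h[[j]] <- rpart(y ~ ., D[idx, ], control = rpart.control(maxdepth = 1))
  yhat   <- as.numeric(as.character(predict(h[[j]], D, type = "class")))
  err    <- sum(w * (yhat != y)) / sum(w)         # weighted misclassification (b)
  alpha[j] <- log((1 - err) / err)                # classifier weight (c)
  w <- w * exp(ifelse(yhat != y, alpha[j], -alpha[j]))   # update (d)
  w <- w / sum(w)                                 # normalisation (e)
}
scores <- rowSums(sapply(1:m, function(j)
  alpha[j] * as.numeric(as.character(predict(h[[j]], D, type = "class")))))
mean(sign(scores) != y)   # training error of the weighted majority vote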
R script
The R script boosting.R tests the performance of the Ada Boost algorithm in a
classification task. Consider the medical dataset Pima obtained by a statistical
survey on women of Pima Indian heritage. This dataset reports the presence of
diabetes in Pima Indian women together with other clinical measures (blood pressure, insulin, age, ...). The classification task is to predict the presence of diabetes as a function of the clinical measures. We consider a training set of N = 40 and a test set of 160 points. The classifier is a simple classification tree which returns a misclassification rate MME_ts = 0.36. We use a boosting procedure with m = 15 to improve the performance of the weak classifier. The misclassification rate of the boosted classifier is MME_ts = 0.3.
•
Boosting has its roots in a theoretical framework for studying machine learning called the PAC learning model. Freund and Schapire proved that the empirical error of the final hypothesis h_boo is at most

Π_{j=1}^m [ 2 √( MME_emp^(j) (1 − MME_emp^(j)) ) ]

They also showed how to bound the generalisation error.
11.3.2 The arcing algorithm
This algorithm was proposed by Breiman as a modification of the original Ada Boost algorithm. It is based on the idea that the success of boosting is related to the adaptive resampling property, where increasing weight is placed on those examples more frequently misclassified. ARCing stands for Adaptive Resampling and Combining. The complex updating equations of Ada Boost are replaced by much simpler formulations. The final classifier is obtained by unweighted voting. This is the ARCing algorithm in detail:
1. Initialise the observation weights w_i = 1/N, i = 1, ..., N.
2. For j = 1 to m:
(a) Fit a classifier h_j to the training data obtained by resampling D_N using weights w_i.
(b) Let e_i be the number of misclassifications of the ith example by the j classifiers h_1, ..., h_j.
(c) The updated weights are defined by

w_i = (1 + e_i⁴) / Σ_{k=1}^N (1 + e_k⁴)

3. The output is obtained by unweighted voting of the m classifiers h_j.
R script
The R file arcing.R tests the performance of the ARCing algorithm in a classification task. Consider the medical dataset Breast Cancer obtained by Dr. William H. Wolberg (physician) at the University of Wisconsin Hospital in the USA. This dataset reports the class of the cancer (malignant or benign) and other properties (clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, ...). The classification task is to predict the class of breast cancer on the basis of the clinical measures. We consider a training set of size N = 400 and a test set of size 299. The classifier is a simple classification tree which returns a misclassification rate MME_ts = 0.063. We use an arcing procedure with m = 15. It gives a misclassification rate MME_ts = 0.010.
•
Boosting is a recent and promising technique which is simple and easy to program. Moreover, it has few parameters (e.g. the maximum number of classifiers) to tune. Boosting methods advocate a shift in the attitude of the learning-system designer: instead of trying to design a learning algorithm which is accurate over the entire space, she can instead focus on finding weak algorithms that only need to be better than random.
Furthermore, a nice property of Ada Boost is its ability to identify outliers.
11.3.3 Bagging and boosting
This section makes a short comparison of bagging and boosting techniques. First of all, in terms of the bias/variance trade-off, it is important to stress that the rationale of bagging is to reduce the variance of low-bias (and then high-variance) learners trained on i.d. (identically distributed) data, while boosting aims to sequentially reduce the bias of weak learners trained on non-i.d. data.
Like bagging, boosting avoids the cost of heavy validation procedures and, like bagging, boosting trades accuracy for interpretability. As for bagging, the main effect of boosting is to reduce variance, and it works effectively for high-variance classifiers. However, unlike bagging, boosting cannot be implemented in parallel, since it is based on a sequential procedure.
In terms of experimental accuracy, several research works (e.g. Breiman's work) show that boosting seems to outperform bagging. Also, a number of recent theoretical results show that boosting is fundamentally different from bagging [98].
Some caveats are nonetheless worth mentioning: the actual performance of boosting on a particular problem depends on the data and the nature of the weak learner. Boosting can also fail to perform well given insufficient data, overly complex weak hypotheses, or hypotheses that are definitely too weak.
11.4 Random Forests
Ensemble learning is efficient when it combines low-bias and independent estimators, like non-pruned decision trees.
Random Forests (RF) is an ensemble learning technique proposed by Breiman [38] which combines bagging and random feature selection by using a large number of non-pruned decision trees. The rationale of RF is to reduce the variance by decorrelating as much as possible the single trees. This is achieved in the tree-growing process through a random selection of the input variables. In a nutshell, the algorithm consists in:
1. generating by bootstrap a set of B training sets,
2. fitting to each of them a decision tree h_b(·, α_b), b = 1, ..., B, where the set of variables considered for each split (Section 10.1.4.3) is a random subset of size n0 of the original one (feature bagging),
3. storing at each split, for the corresponding split variable, the improvement of the cost function,
4. returning as the final prediction the average of the B predictions

h_rf(x) = (1/B) Σ_{b=1}^B h_b(x, α_b)

in a regression task, and the majority vote in a classification task,
5. returning for each variable an importance measure.
Suppose that the B trees in the forest are almost unbiased, have a comparable variance Var[h_b] = σ² and a mutual correlation ρ. The RF regression predictor h_rf is then almost unbiased and from (3.10.88) its variance is

Var[h_rf] = (1 − ρ)σ²/B + ρσ²

It appears then that, by increasing the forest size B and making the trees as uncorrelated as possible, a Random Forest strategy reduces the resulting variance.
A rule of thumb consists of setting the size of the random subset to n0 = √n. The main hyperparameters of RF are the hyperparameters of the single trees (e.g. depth, maximum number of leaves), the number B of trees and the size n0 of the random feature set. Note that by reducing n0 we make the trees more decorrelated, yet we increase the bias of each single tree (and then of the RF) by constraining its number of features. In particular, a too small number n0 may be detrimental to accuracy in configurations with a very large n and a small number of informative features.
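A minimal usage sketch with the randomForest package cited in the next section, on synthetic regression data:

library(randomForest)
set.seed(0)
N <- 500; n <- 16
X <- matrix(rnorm(N * n), N, n)
y <- X[, 1] - 2 * X[, 2] * X[, 3] + rnorm(N, sd = 0.5)   # only 3 informative features

rf <- randomForest(X, y, ntree = 500, mtry = floor(sqrt(n)), importance = TRUE)
rf               # reports the out-of-bag estimate of the error
importance(rf)   # variable importance measures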
11.4.1 Why are Random Forests successful?
Random Forests are often considered among the best "off-the-shelf" learning algorithms since they do not require complex tuning to perform reasonably well on challenging tasks. There are many reasons for their success [73]: (i) they use an out-of-bag (Section 7.10.1) strategy to effectively manage the bias/variance trade-off and to assess the importance of input variables, (ii) since they are based on trees, they easily cope with mixtures of numeric and categorical predictor variables, (iii) they are resilient to input outliers and invariant under monotone input transformations, (iv) they embed a feature ranking mechanism based on an importance measure related to the average cost function decrease during splitting, (v) they are fast to construct and can be made massively parallel and (vi) there exist a number of very effective implementations (e.g. the R package randomForest) and enhanced versions (notably gradient boosting trees).
11.5 Gradient boosting trees
Gradient boosting (GB) trees are an enhanced version of averaging algorithms which rely on combining m trees according to a forward stage-wise additive strategy [98]. The strategy consists of adding one component (e.g. a tree) at a time: after m iterations, the resulting model is the sum of the m individual trees

h_m(x) = Σ_{j=1}^m T(x, α_j)

Given j − 1 trees, the jth tree is learned so as to compensate the error between the target and the current ensemble prediction h_{j−1}(x). This means that

α_j = arg min_α Σ_{i=1}^N L(y_i, h_{j−1}(x_i) + T(x_i, α))    (11.5.1)

where α_j contains the jth tree parameters, e.g. the set of disjoint regions and the local model holding in each region. Note that in the forward stage-wise philosophy, no adjustment of the previously added trees is considered.
It can be shown that, for a regression task with a squared error loss function L, the solution α_j corresponds to the regression tree that best predicts the residuals

r_i = y_i − h_{j−1}(x_i),   i = 1, ..., N

Gradient-based versions exist for other differentiable loss criteria and for classification tasks. Weighted versions of (11.5.1) also exist,

(α_j, w_j) = arg min_{α, w} Σ_{i=1}^N L(y_i, h_{j−1}(x_i) + w T(x_i, α))

where the contribution w_j of each new tree is properly tuned.
A stochastic version of gradient boosting has been proposed in [77], where at each iteration only a subsample of the training set is used to train the new tree. Though gradient-boosting algorithms are considered among the most promising in complex learning tasks, it is worth remembering that their accuracy depends, like that of all learning algorithms, on a number of hyperparameters, notably the size of the constituent trees, the number m of iterations, the contribution w_j of each tree, the loss function and the subsample size.
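A minimal R sketch of the forward stage-wise strategy with squared loss on hypothetical data: each new rpart tree is fitted to the residuals of the current ensemble, with a shrinkage coefficient playing the role of the weight w_j:

library(rpart)
set.seed(0)
N <- 200
x <- runif(N, -3, 3)
y <- sin(2 * x) + rnorm(N, sd = 0.3)
D <- data.frame(x = x, y = y)

m <- 50; nu <- 0.1               # number of trees and shrinkage coefficient
Fhat <- rep(mean(y), N)          # initial ensemble prediction h_0
trees <- list()
for (j in 1:m) {
  r <- y - Fhat                  # residuals y_i - h_{j-1}(x_i)
  trees[[j]] <- rpart(r ~ x, data.frame(x = x, r = r),
                      control = rpart.control(maxdepth = 2))
  Fhat <- Fhat + nu * predict(trees[[j]], D)   # add the new tree's contribution
}
mean((y - Fhat)^2)               # training MSE of the boosted ensemble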
11.6 Conclusion
The averaging of ensembles of estimators relies on the counterintuitive principle that combining predictors is (most of the time) more convenient than selecting (what seems to be) the best one. This principle is (probably together with the idea of regularisation) one of the most ingenious and effective ideas proposed by researchers in Machine Learning1. Most state-of-the-art learning strategies owe a considerable part of their success to the integration of the combination principle. This principle is so powerful that some authors nowadays suggest not to include combination in the assessment of learning strategies (e.g. in new publications), given the risk that the only visible beneficial effect is the one due to the combination.
The fact that this idea might appear counterintuitive sheds light on the stochastic nature of the learning problem and the importance of taking a stochastic perspective to really grasp the problem of learning and generalising from a finite set of observations.
11.7 Exercises
1. Verify by Monte Carlo simulation the relations (5.10.31) and (5.10.30) concerning the combination of two unbiased estimators.
Hint: define an estimation task (e.g. estimate the expected value of a random variable) and choose two unbiased estimators.
1...and a note of distinction should here definitely be attributed to the seminal work of researchers like Jerome H. Friedman and Leo Breiman.
Chapter 12
Feature selection
In many challenging learning tasks, the number of inputs (or features) may be
extremely high: this is the case of bioinformatics [167] where the number of variables
(typically markers of biological activity at different functional levels) may go up to
hundreds of thousands. The race to high-throughput measurement techniques in
many domains allows us to easily foresee that this number could grow by several
orders of magnitude.
Using such a large number of features in learning may negatively affect generalisation performance, especially in the presence of irrelevant or redundant features. Nevertheless, traditional supervised learning techniques have been designed for tasks where the ratio between the input dimension and the training size is small, and most inputs (or features) are informative. As a consequence, their accuracy may rapidly degrade when used in tasks with few observations and a huge number of inputs.
At the same time, it is common to make the assumption that data are sparse
or possess an intrinsic low dimensional structure. This means that most input
dimensions are correlated, only a few of them contain information or equivalently
that most dimensions are irrelevant for the learning task.
For this reason, learning pipelines increasingly include a feature selection phase aiming to select a small subset of informative (or relevant) features to capture most of the signal and avoid variance and instability issues during learning. In that sense, feature selection can be seen as an instance of the model selection problem where the alternative models do not differ in terms of functional representation but in terms of the used subset of inputs.
Example
This example illustrates the impact of the number of features on the model variance in a learning task with a comparable number of features and observations. Let us consider a linear regression dependency

y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + w

where Var[w] = 0.5, β0 = 0.5, β1 = −0.5, β2 = 0.5, β3 = −0.5, β4 = 0.5. Suppose we collect a dataset of N = 20 input/output observations where the input set (n = 8) contains, together with the four variables x1, ..., x4, a set of 4 irrelevant variables x5, ..., x8.
Let us consider a set of linear regression models with an increasing number of features, ranging from zero (constant model) to 8.
The script bv linfs.R illustrates the impact of the number of features on the average bias (estimated by Monte Carlo) and the average variance (both analytical and Monte Carlo estimated) of the predictors.

Figure 12.1: Bias/variance trade-off for different numbers of features. Bias and variance are averaged over the set of N inputs.

Figure 12.1 shows that the larger the number of features, the higher the prediction variance. Note that the analytical form of the variance of a linear model prediction is presented in Section 9.1.14. The bias has the opposite trend, reaching zero once the 4 inputs x1, ..., x4 are included in the regression model. Overall, the more variables are considered, the more the bias is reduced, at the cost of an increased variance. If a variable has no predictive value (e.g. it belongs to the set x5, ..., x8), considering it merely increases the variance with no benefit in terms of bias reduction. In general, if the addition of a variable has a small impact on the bias, then the increase in prediction variance may exceed the benefit from the bias reduction [132]. The role of a feature selection technique should be to detect those variables and remove them from the input set.
•
The benefits of feature selection have been thoroughly discussed in the literature [89, 91]:
•facilitating data visualisation and data understanding,
•reducing the measurement and storage requirements,
•reducing training and utilisation times of the final model,
•defying the curse of dimensionality to improve prediction performance.
At the same time, feature selection implies additional time for learning since it
introduces an additional layer to the search in the model hypothesis space.
12.1 Curse of dimensionality
Feature selection addresses what is known in several scientific domains as the curse of dimensionality. This term, coined by R. E. Bellman, refers to all the computational problems related to large-dimensional modelling tasks.
The main issue in supervised learning is that the sparsity of data increases exponentially with the dimension n. This can be illustrated by several arguments. Let us consider an n-dimensional space and a unit volume around a query point x_q ∈ Rⁿ (Figure 12.2) [98]. Let V < 1 be the volume of a neighbourhood hypercube of edge d. It follows that dⁿ = V and d = V^{1/n}. Figure 12.3 illustrates the link between the neighbourhood volume V and the edge size d for different values of n. It
Figure 12.2: Locality and dimensionality of the input space for n = 1, 2, 3 (with d = 1/2 and V = 1/2, 1/4, 1/8 respectively): unit volume (in black) around a query point (circle) containing a neighbourhood (in red) of volume V and edge d.
Figure 12.3: Neighbourhood volume vs. edge size for different values of n.
appears that for a given neighbourhood volume V, the edge length increases with n, while for a given edge length d, the neighbourhood volume decreases with n. For instance, if V = 0.5 we have d = 0.71, 0.87, 0.99 for n = 2, 5, 50; if V = 0.1 we have d = 0.32, 0.63, 0.95 for n = 2, 5, 50. This means that for n = 50 we need an edge length which is 95% of the unit length if we want to barely cover 10% of the total volume.
Let us now assess the impact of dimensionality on the accuracy of a local learning algorithm (e.g. k nearest neighbour) by considering the relation between the training set size N, the input dimension n and the number of neighbours k. If the N points are uniformly distributed in the unit volume around the query point, the number of neighbours k in the neighbourhood V amounts to roughly k = NV. Given the values of N and k (and consequently the local volume V), the edge d of the neighbourhood increases with the dimension n and converges rapidly to one (Figure 12.4). This implies that if we use a kNN (nearest neighbour) learner for two supervised learning tasks with the same N but different n, the degree of locality of the learner (represented by the length of d) decreases as n grows. Analogously, if N and 0 < d < 1 are fixed, the number k = N dⁿ of neighbours in V decreases with increasing n. In other terms, as n increases the amount of local data goes to zero (Figure 12.5), or equivalently, all datasets are sparse for large n.
Let us now consider the case where k > 0 and 0 < d < 1 (degree of locality) are fixed and N may be adjusted (e.g. by observing more points). Since

N = k/dⁿ

we need to grow the size of the training set N exponentially to guarantee a constant k for increasing n. Suppose that k = 10, d = 0.1 and N = 100 for n = 1. If we want to preserve the same number k of neighbours for increasing n, then N has to grow according to the following law:

N = k/dⁿ = 10/(1/10)ⁿ = 10^{n+1}

For instance, we need to observe N = 10⁶ observations for n = 5 if we want the same degree of locality we had for n = 1. This implies that, given two supervised learning tasks (one with n = 1 and the other with n ≫ 1), the second should be trained with a number N of a much higher order of magnitude (Figure 12.6) to guarantee the same degree of locality as the n = 1 configuration.
Another interesting result about the impact of dimensionality on data distribution is the following: given N observations uniformly distributed in an n-dimensional unit ball centred at the origin, the median of the distance from the origin to the closest data point is (1 − (1/2)^{1/N})^{1/n} (Figure 12.7).
Figure 12.4: Neighbourhood edge size vs. dimension n (for fixed N and k)
Figure 12.5: Number of neighbours k vs. dimension n (for fixed N and d)
Figure 12.6: Number of training examples required to preserve the same kind of locality obtained for n = 1 with k = 10 and d = 0.1
Figure 12.7: Median distance to the nearest neighbour as a function of n for very large N (N = 10⁶ points).
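The quantities discussed above can be reproduced with a few lines of R:

d <- function(V, n) V^(1/n)   # edge of a hypercube of volume V in R^n
d(0.5, c(2, 5, 50))           # 0.71 0.87 0.99
d(0.1, c(2, 5, 50))           # 0.32 0.63 0.95

10 / (0.1)^(1:6)              # N = k/d^n needed for k = 10, d = 0.1: 10^(n+1)

med <- function(n, N) (1 - 0.5^(1/N))^(1/n)   # median distance to the nearest of N points
med(1:10, 1e6)                # grows quickly with n even for N = 10^6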
All those considerations should sound like a warning for those willing to extend local learning approaches to large-dimensional settings, where the familiar notions of distance and closeness lose their meaning and relevance. Large dimensionality induces high sparseness, with a negative impact on predictive accuracy, as shown by the bias/variance decomposition in (10.1.47). For a fixed N and increasing n, the algorithm is more and more exposed to one of two low-generalisation configurations: i) a too small k, i.e. too few points are close to the query point (with a negative impact in terms of variance), or ii) a too large d, implying that the nearest neighbours are not sufficiently close to the query point (with a negative impact on bias).
Though, from a bias/variance perspective, the curse of dimensionality is particularly harmful for local learning strategies, the other learning strategies should not be considered immune either. A too large n/N ratio implies an overparametrisation of the learned hypothesis and a consequent increase of the variance term in the generalisation error, which is hardly compensated by the related bias reduction. For this reason, the adoption of a feature selection step is more and more common in modern machine learning pipelines.
12.2 Approaches to feature selection
There are three main approaches to feature selection:
•Filter methods: they are preprocessing methods. They attempt to assess
the merits of features from the data, ignoring the effects of the selected feature
subset on the learning algorithm's performance. Examples are methods that
select variables by ranking them through compression techniques (like PCA
or clustering) or computing correlation with the output.
•Wrapper methods: these methods assess subsets of variables according to
their usefulness to a given predictor. The method searches a good subset using
the learning algorithm itself as part of the evaluation function. The problem
boils down to a problem of stochastic state-space search. Examples are the
stepwise methods proposed in linear regression analysis (notably the leaps
subset selection algorithm available in R [15]).
•Embedded methods: they perform variable selection as part of the learn-
ing procedure and are usually specific to given learning machines. Examples
are classification trees, random forests, and methods based on regularisation techniques (e.g. lasso).
Note that, in practice, hybrid strategies combining the three approaches above are often considered as well. For instance, in the case of a huge-dimensional task (e.g. n > 1000K as in epigenetics), it would make sense to first reduce the number of features to a more reasonable size (e.g. some thousands or hundreds of features) by filtering and then use some search approach within this smaller space.
12.3 Filter methods
Filter methods are commonly used in very large dimensional tasks (e.g. n > 2000)
for the following reasons: they easily scale to very high-dimensional datasets, they
are quick because computationally simple, and they are independent of the classi-
fication algorithm. Also, since feature selection needs to be performed only once,
they can be integrated into validation pipelines comparing several classifiers.
However, they are not perfect. Filter methods, by definition, ignore any interac-
tion with the classifier and are often univariate or low-variate. The relevance of each
Figure 12.8: The first two principal components for an n = 2 dimensional Gaussian distribution.
feature is assessed separately, thereby ignoring feature dependencies. This may be
detrimental in case of complex multivariate dependencies.
12.3.1 Principal component analysis
Principal component analysis (PCA) is one of the oldest and most popular pre-
processing methods to perform dimensionality reduction. It returns a set of linear
combinations of the original features so as to retain most of their variance and their
information. Those combinations may be used as compressed (or latent) versions of the original features and allow learning to be performed in a lower-dimensional space.

The method consists of projecting the data from the original orthogonal space $X$ onto a lower-dimensional space $Z$, in an unsupervised manner, maximising the variance and minimising the loss due to the projection. The new space is orthogonal (like the original one) and its axes, called principal components, are specific linear combinations of the original ones.
The first principal component (i.e. the axis $z_1$ in Figure 12.8) is the axis along which the projected data have the greatest variation. Its direction $a^* = [a^*_1, \dots, a^*_n] \in \mathbb{R}^n$ is obtained by maximising the variance of
$$z = a_1 x_{\cdot 1} + \cdots + a_n x_{\cdot n} = a^T x,$$
a linear combination of the original features. It can be shown that $a^*$ is also the eigenvector of the covariance matrix $\text{Var}[x]$ associated with the largest eigenvalue [56].
The procedure for finding the other principal components is based on the same
principle of variance maximisation. The second principal component (i.e. the axis
z2 in Figure 12.8) is the axis, orthogonal to the first, along which the projected data
have the largest variation, and so forth.
12.3. FILTER METHODS 327
12.3.1.1 PCA: the algorithm
Consider the training input matrix $X$ of size $[N, n]$. PCA consists of the following steps:

1. the matrix $X$ is normalised and transformed into a matrix $\tilde{X}$ such that each column $\tilde{X}[, j]$, $j = 1, \dots, n$, has zero mean and unit variance¹;

2. the Singular Value Decomposition (SVD) [83] (Appendix B.5.10) of $\tilde{X}$ is computed:
$$\tilde{X} = U D V^T$$
where $U$ is an $[N, N]$ matrix with orthonormal columns, $D$ is an $[N, n]$ rectangular diagonal matrix with diagonal singular values $d_1 \ge d_2 \ge \cdots \ge d_n \ge 0$, $d_j = \sqrt{\lambda_j}$ with $\lambda_j$ the $j$th eigenvalue of $X^T X$, and $V$ is an $[n, n]$ matrix whose orthonormal columns are the eigenvectors of $X^T X$;

3. the matrix $\tilde{X}$ is replaced by the linear transformation
$$Z = \tilde{X} V = U D \qquad (12.3.1)$$
whose columns (also called eigen-features) are linear combinations of the original features and whose variances are sorted in decreasing order;

4. a truncated version of $Z$, made of the first $h < n$ columns (associated with the $h$ largest singular values), is returned.
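As an illustration, the four steps above can be condensed into a few lines of R; the following sketch uses synthetic Gaussian data (the sizes, the seed and the variable names are illustrative choices, not taken from the book's scripts):

set.seed(0)
N <- 100; n <- 5; h <- 2
X <- matrix(rnorm(N * n), N, n) %*% matrix(runif(n * n), n, n)  # correlated inputs

Xtilde <- scale(X)               # step 1: zero-mean, unit-variance columns
s <- svd(Xtilde)                 # step 2: Xtilde = U D V^T
Z <- Xtilde %*% s$v              # step 3: eigen-features (equivalently U D)
Zh <- Z[, 1:h]                   # step 4: keep the h largest-variance components

lambda <- s$d^2                  # eigenvalues of t(Xtilde) %*% Xtilde
cumsum(lambda) / sum(lambda)     # proportion of variance retained by the first components

Xhat <- Zh %*% t(s$v[, 1:h])     # linear decoding back to the original space
mean(rowSums((Xtilde - Xhat)^2)) # average reconstruction error
sum(lambda[(h + 1):n]) / N       # matches the theoretical value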
But how do we select a convenient number $h$ of eigen-features? In the literature, three main strategies are considered:

1. fix a threshold $\alpha$ on the proportion of variance to be explained by the principal components, e.g. choose $h$ such that
$$\frac{\lambda_1 + \cdots + \lambda_h}{\sum_{j=1}^{n} \lambda_j} \ge \alpha$$
where $\lambda_j$ is the $j$th largest eigenvalue and $\sum_{j=1}^{h} \lambda_j$ is the amount of variance retained by the first $h$ components;

2. plot the decreasing values of $\lambda_j$ as a function of $j$ (scree plot) and choose the value of $h$ corresponding to a knee in the curve;

3. select the value of $h$ as if it were a hyperparameter, e.g. by cross-validation.
The outcome of PCA is a rotated, compressed and lower-dimensional version of the original input set $\{x_1, \dots, x_n\}$, made of $h < n$ orthogonal features $\{z_1, \dots, z_h\}$ sorted by decreasing variance. In that sense, PCA can be considered as a linear auto-encoder where the encoding step is performed by (12.3.1) and the reconstruction of the coded data in the original space is obtained by $\tilde{X} = Z V^T$. It can also be shown [56] that PCA implements an optimal linear auto-encoder, since it minimises the average reconstruction error
$$\sum_{i=1}^{N} \| x_i - V^T V x_i \|^2 \qquad (12.3.2)$$
which amounts, for $h$ components, to $\sum_{j=h+1}^{n} \lambda_j / N$.

¹An R dataframe may be easily normalised by using the R command scale.
Figure 12.9: A separable $n = 2$ dimensional binary classification task reduced to a non-separable one by PCA dimensionality reduction.
PCA works in a completely unsupervised manner, since the entire algorithm is independent of the target $y$. Though such an unsupervised nature reduces the risk of overfitting, in some cases it may cause a deterioration of the generalisation accuracy, since there is no reason why the principal components should be associated with $y$. For instance, in the classification example of Figure 12.9, the choice of the first PCA component would reduce the accuracy of the classifier instead of increasing it. In order to account both for input variation and for correlation with the target, supervised versions of PCA exist, like principal component regression or partial least squares.
Another limitation of PCA is that it does not return a subset but a weighted combination of the original features (eigen-features). In some cases, e.g. in bioinformatics gene selection, PCA is then not recommended, since it may hinder the interpretability of the resulting model.
R script
The scripts pca.R and pca3D.R illustrate the PCA decomposition in the $n = 2$ and $n = 3$ cases for Gaussian-distributed data and compute the reconstruction error (12.3.2).
The script pca uns.R illustrates the limits of PCA due to its unsupervised nature. Consider a binary classification task with $n = 2$ and a separating boundary between the two classes directed along the first principal component. In this case, dimensionality reduction is detrimental to the final accuracy, since it transforms the separable $n = 2$ problem into a non-separable $n = 1$ problem (Figure 12.9).
•
PCA is an example of linear dimensionality reduction. In the machine learning
literature, however, there are several examples of nonlinear versions of PCA: among
the most important we mention the kernel-based version of PCA (KPCA) and
(deep) neural auto-encoders (Section 10.1.2).
12.3.2 Clustering
Clustering, also known as unsupervised learning, is presented in Appendix A. Here
we will discuss how it plays a role in dimensionality reduction by determining groups
of features or observations with similar patterns (e.g. patterns of gene expressions
in microarray data).
The use of a clustering method for feature selection requires the definition of a
distance function between variables and the definition of a distance between clusters.
The two most common methods are:

• Nearest-neighbour clustering: the number of clusters is set by the user, and each variable is assigned to a cluster at the end of an iterative procedure. Examples are Self Organizing Maps (SOM) and K-means.
• Agglomerative clustering: a bottom-up method where each variable initially forms its own cluster and clusters are sequentially merged. An example is hierarchical clustering (R command hclust), which starts by considering all the variables as belonging to separate clusters. Next, it joins pairs of similar features into the same cluster and then proceeds hierarchically by merging the closest pairs of clusters. The algorithm requires a measure of dissimilarity between sets of features and a linkage criterion that quantifies the set dissimilarity as a function of the pairwise distances between set elements. The visual output of hierarchical clustering is a dendrogram, a tree diagram used to illustrate the arrangement of the clusters (see the sketch after this list). Figure 12.10 illustrates the dendrogram returned by a clustering of features in a bioinformatics task. Note that the dendrogram returns different clusters of features (and a different number of clusters) at different heights. The choice of the optimal height cut is typically made by means of a cross-validation strategy [126].
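The sketch below, on hypothetical data with built-in redundancy, clusters the features (not the observations) with hclust; the choice of 1 - |cor| as the dissimilarity and of the average linkage is an illustrative assumption:

set.seed(0)
N <- 60; n <- 20
X <- matrix(rnorm(N * n), N, n)
X[, 11:20] <- X[, 1:10] + 0.1 * matrix(rnorm(N * 10), N, 10)  # near-duplicate features

D <- as.dist(1 - abs(cor(X)))        # dissimilarity between pairs of features
hc <- hclust(D, method = "average")  # linkage criterion between sets of features
plot(hc)                             # dendrogram of the feature set
groups <- cutree(hc, k = 10)         # cut: one representative per cluster may be kept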
Clustering and PCA are both unsupervised dimensionality reduction techniques,
which are commonly used in several domains (notably bioinformatics). However, the
main advantage of clustering resides in the higher interpretability of the outcome.
Unlike the PCA linear weighting, the grouping of the original features is much more
informative and may return useful insights to the domain expert (e.g. about the
interaction of a group of genes in a pathology [92]).
12.3.3 Ranking methods
Unlike PCA and clustering, ranking methods are supervised filters, since they take into account the relation between the inputs and the target $y$ to proceed with the selection. Ranking methods consist of three steps: i) they first assess the importance (or relevance) of each variable for the output by using a univariate measure, ii) they rank the variables in decreasing order of relevance and iii) they select the top $k$ variables.
Relevance measures commonly used in assessing a feature are:

• the Pearson linear correlation (the larger, the more relevant);

• in case of binary classification tasks, the p-value of hypothesis tests like the t-test or the Wilcoxon test (the lower, the more relevant);

• the mutual information (Section 3.8) (the larger, the more relevant).
Ranking methods are fast (complexity $O(n)$), and their output is intuitive and easy to understand. At the same time, they disregard redundancies and higher-order interactions between variables. Two typical situations where ranking does not perform well are complementary and highly redundant configurations. In the complementary case, two input features are individually poorly informative about the target, yet very informative when taken together (see the XOR configuration later). Because of their low univariate relevance, ranking methods will rank them low and consequently discard them. In the redundant case, two variables are both highly relevant to the target but very similar (or identical): both will be ranked very high and selected, despite their evident redundancy.

Figure 12.10: Dendrogram.
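A ranking filter reduces to a few lines of R; this minimal sketch (synthetic data, absolute Pearson correlation as the relevance measure) assumes only the first two features are truly relevant:

set.seed(0)
N <- 50; n <- 1000; k <- 10
X <- matrix(rnorm(N * n), N, n)
y <- X[, 1] - X[, 2] + rnorm(N, sd = 0.5)     # only two relevant features

relevance <- abs(cor(X, y))                   # univariate scores, complexity O(n)
ranking <- order(relevance, decreasing = TRUE)
selected <- ranking[1:k]                      # top-k features; redundancy is ignored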
Feature selection in a gene expression dataset
A well-known high-dimensional classification task is gene expression classification
in bioinformatics, where the variables correspond to genomic features (e.g. gene
probes), the observations to patients and the targets to biological phenotypes (e.g. cancer grade). Because of the growing capabilities of sequencing technology, the number of genomic features is typically much larger than the size of patient cohorts.

In the script featsel.R we analyse the microarray dataset from [84]. This dataset contains the genome expressions of $n = 7129$ genes for $N = 72$ patients, and $V = 11$ related phenotype variables. The expression matrix $X$ and the phenotype vector $Y$ are contained in the dataset data(golub). The script studies the dependency between the gene expressions and the binary phenotype ALL.AML indicating the leukaemia type: lymphoblastic leukaemia (ALL) or acute myeloid leukaemia (AML). Relevant features are selected by correlation ranking, and the misclassification errors are computed for different sizes of the feature set.
•
12.4 Wrapping methods
Wrapper methods combine a search in the space of possible feature subsets with
an assessment phase relying on a learner and a validation (often cross-validation)
technique. Unlike filter methods, wrappers take into consideration the interaction between features, and do so in a supervised manner. Unfortunately, this implies a much higher computational cost, especially in the case of expensive training phases. Also, the dependence of the final result on the choice of learner could be considered a nuisance factor confounding the impact of the feature set on the final accuracy². In other terms, the issue is: was the feature set returned by the wrapper because it was good in general or only for that specific learner (e.g. a neural network)?
The wrapper search can be seen as a search in a space $W = \{0, 1\}^n$ where a generic vector $w \in W$ is such that
$$w[j] = \begin{cases} 0 & \text{if the input } j \text{ does NOT belong to the set of features} \\ 1 & \text{if the input } j \text{ belongs to the set of features} \end{cases}$$
Wrappers look for the optimal vector $w^* \in \{0, 1\}^n$ such that
$$w^* = \arg\min_{w \in W} \widehat{\text{MISE}}_w \qquad (12.4.3)$$
where $\widehat{\text{MISE}}_w$ is the estimation of the generalisation error of the model based on the set of variables encoded by $w$. Since in real settings the actual generalisation error is not directly observable, the computation of $\widehat{\text{MISE}}_w$ requires the definition of a learner and of a validation strategy.
Note that the number of vectors in $W$ is equal to $2^n$, that it doubles with each new feature, and that for moderately large $n$ (e.g. $n > 20$) an exhaustive search is no longer affordable. For this reason, wrappers typically rely on heuristic search strategies.
12.4.1 Wrapping search strategies
Three greedy strategies are commonly used to avoid the exponential complexity $O(2^n)$ of the exhaustive approach:

• Forward selection: the procedure starts with no variables and progressively incorporates features. The first selected input is the one that returns the lowest generalisation error. The second selected input is the one that, together with the first, yields the lowest error, and so on, until no further improvement is made or the required number of features is attained. An example of forward selection is implemented in the R script fs wrap.R (a minimal sketch is also given after this list).
• Backward selection: it works in the opposite direction of the forward approach by progressively removing features from the original feature set. The procedure starts by learning a model using all the $n$ variables and, therefore, requires at least $N > n$. Then the impact of dropping one feature at a time from the current subset is assessed. The feature actually removed is the one whose absence causes the lowest increase (or highest decrease) of the generalisation error. The procedure iterates until the desired number of features is attained.
• Stepwise selection: it combines the previous two techniques by testing, for each set of variables, first the removal of features belonging to the set, then the addition of variables not in the set.
²This is the reason why a blocking-factor approach, controlling the variability due to the learning algorithm and improving the robustness of the solution, has been proposed in [28].
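A minimal sketch of forward selection follows; it assumes a linear model assessed by leave-one-out (the book's script fs wrap.R may differ in learner and validation choices, and all data here are synthetic):

loo.mse <- function(X, y) {                  # leave-one-out error of a linear model
  e <- sapply(1:nrow(X), function(i) {
    train <- data.frame(X[-i, , drop = FALSE])
    test <- data.frame(X)[i, , drop = FALSE]
    m <- lm(y[-i] ~ ., data = train)
    y[i] - predict(m, newdata = test)
  })
  mean(e^2)
}

forward.select <- function(X, y, d) {
  selected <- integer(0)
  for (step in 1:d) {                        # add one feature per step
    candidates <- setdiff(1:ncol(X), selected)
    errs <- sapply(candidates, function(j)
      loo.mse(X[, c(selected, j), drop = FALSE], y))
    selected <- c(selected, candidates[which.min(errs)])
  }
  selected
}

set.seed(0)
X <- matrix(rnorm(40 * 8), 40, 8)
y <- X[, 3] + 2 * X[, 5] + rnorm(40, sd = 0.3)
forward.select(X, y, 2)                      # typically returns features 5 and 3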
It can be shown that the forward and backward strategies have $O(n^2)$ time complexity in the case of $n$ steps: since the $i$th step ($i = 0, \dots, n - 1$) requires $n - i$ assessments to select (or remove) the $(i+1)$th feature, the computational cost of $n$ steps amounts to $\sum_{i=0}^{n-1} (n - i) = \frac{n(n+1)}{2}$.

Nevertheless, since even this complexity is not affordable in the case of very large $n$, it is common practice to first reduce the number of features by using a fast filter method (e.g. ranking) and then apply a wrapper strategy to the remaining features. Another trick consists of limiting the maximum size of the feature set, thereby reducing the computational cost.
12.4.2 The Cover and van Campenhout theorem
The rationale of forward and backward greedy heuristics is that an optimal set of size $k$ should contain the optimal set of size $k - 1$. Though this seems intuitive, in the general case there is no reason why this relation should hold. A formal result in that sense is provided by the Cover and van Campenhout theorem [58], which contains a negative result about the ability of wrapper search techniques to find the optimal subset by local procedures.
Let us consider a learning problem and denote by $R^*(w)$ the lowest functional risk (7.2.6) for the subset of variables $w$. Cover and van Campenhout proved that the only generally valid (i.e. holding for all data distributions) monotonic relation linking feature-set size and generalisation is:
$$w_2 \subset w_1 \Rightarrow R^*(w_1) \le R^*(w_2) \qquad (12.4.4)$$
i.e. by adding variables we reduce the minimal risk³.
Given $n$ features, any ordering of the $2^n$ subsets that is consistent with the above constraint is indeed possible: for any possible ordering, there exists a distribution of the data compatible with it. For instance, if the optimal set of three variables is
$$w_1 = \{x_{\cdot 1}, x_{\cdot 3}, x_{\cdot 13}\}$$
there is no guarantee that the best set of four variables is a superset of $w_1$ (as is assumed in forward selection). According to the theorem, there exists a distribution for which the best set of four features could well be
$$\{x_{\cdot 2}, x_{\cdot 6}, x_{\cdot 16}, x_{\cdot 23}\}$$
since this is not in contradiction with the constraint (12.4.4). In other words, the Cover and van Campenhout theorem states that there are data distributions for which forward/backward strategies could be arbitrarily bad.
12.5 Embedded methods
Embedded methods are typically less computationally intensive than wrapper methods but are specific to a learning machine. Well-known examples are classification trees, Random Forests (Section 11.4), Naive Bayes (Section 10.2.3.1), shrinkage methods and kernels.
³Note that this relation refers to the optimal model that could be learned with the input subset $w$, and that the notion of lowest functional risk takes into consideration neither the model family nor the finite-sample setting. In other terms, this inequality refers only to the bias and not to the variance component of the generalisation error. So, though in theory $R^*(w_1) \le R^*(w_2)$, in practice it could happen that $G_N(w_1) \ge G_N(w_2)$, where $G_N$ is the generalisation error of the model learned with $N$ observations.
12.5.1 Shrinkage methods
Shrinkage is a technique to improve a least-squares estimator by regularisation
and consists of reducing the model variance by adding constraints on the value
of coefficients. In what follows, we present two shrinkage approaches that penalise
the least-squares solutions having a large number of coefficients with values different
from zero. The rationale is that only those variables, whose impact on the empirical
risk is considerable, deserve a coefficient different from zero and should appear in
the fitted model. Shrinkage is an implicit (and more continuous) embedded manner
of doing feature selection since only a subset of variables contributes to the final
predictor.
12.5.1.1 Ridge regression
Ridge regression is an example of a shrinkage method applied to least-squares regression:
$$\hat{\beta}_r = \arg\min_b \left\{ \sum_{i=1}^{N} (y_i - x_i^T b)^2 + \lambda \sum_{j=1}^{p} b_j^2 \right\} = \arg\min_b \left\{ (Y - Xb)^T (Y - Xb) + \lambda b^T b \right\}$$
where $\lambda > 0$ is a complexity parameter that controls the amount of shrinkage: the larger the value of $\lambda$, the greater the amount of shrinkage. Note that if $\lambda = 0$ the approach boils down to conventional unconstrained least squares.
An equivalent formulation of the ridge problem is
$$\hat{\beta}_r = \arg\min_b \sum_{i=1}^{N} (y_i - x_i^T b)^2, \quad \text{subject to} \quad \sum_{j=1}^{p} b_j^2 \le L$$
where there is a one-to-one correspondence between the parameters $\lambda$ and $L$ [98].
It can be shown that the ridge regression solution is
$$\hat{\beta}_r = (X^T X + \lambda I_p)^{-1} X^T Y \qquad (12.5.5)$$
where $I_p$ is the $[p, p]$ identity matrix ($p = n + 1$), and it is typically recommended that the columns of $X$ be normalised (zero mean and unit variance) [132]. In algebraic terms, a positive $\lambda$ ensures that the matrix to be inverted is symmetric and strictly positive definite.

If $n \gg N$, it is recommended to take advantage of the SVD decomposition (B.5.11) to avoid the inversion of a too-large matrix [100]. If we set $X = U D V^T$, then from (12.5.5) and (B.9.15) we obtain
$$\hat{\beta}_r = (V D U^T U D V^T + \lambda I_p)^{-1} V D U^T Y = V (R^T R + \lambda I_N)^{-1} R^T Y$$
where $R = U D$ is an $[N, N]$ matrix and $I_N$ is the $[N, N]$ identity matrix.
In general, ridge regression is beneficial in numerical, statistical and interpretability terms. From a numerical perspective, it is able to deal with rank-deficient matrices $X$ and reduces the ill-conditioning of the matrix $X^T X$. From a statistical perspective, it reduces the variance of the least-squares solution $\hat{\beta}_r$ (Section 9.1.14) at the cost of a slight bias increase. Given the predominance of the variance term in high-dimensional tasks, ridge regression enables a reduction of the generalisation error. Last but not least, by shrinking the absolute value of many coefficients towards zero, it helps in the identification of a small (hence interpretable) number of relevant input features.
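The closed form (12.5.5) is one line of R. The sketch below (synthetic data; the intercept is omitted for simplicity, since the columns are centred) shows the coefficients shrinking as $\lambda$ grows:

set.seed(0)
N <- 30; n <- 5
X <- scale(matrix(rnorm(N * n), N, n))    # normalised columns, as recommended
y <- X %*% c(1, -1, 0, 0, 0) + rnorm(N)

ridge <- function(X, y, lambda)           # (X^T X + lambda I)^(-1) X^T Y
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))

cbind(ridge(X, y, 0), ridge(X, y, 10))    # lambda = 0 (plain least squares) vs shrinkage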
12.5.1.2 Lasso
Another well-known shrinkage method is the lasso, which estimates the linear parameters by
$$\hat{\beta}_l = \arg\min_b \sum_{i=1}^{N} (y_i - x_i^T b)^2, \qquad (12.5.6)$$
$$\text{subject to} \quad \sum_{j=1}^{p} |b_j| \le L \qquad (12.5.7)$$
If, on the one hand, the 1-norm penalty of the lasso approach allows a stronger constraint on the coefficients, on the other hand it makes the solution nonlinear in $Y$ and demands the adoption of a quadratic programming algorithm (details in Appendix B.8).

To formulate the problem (12.5.6) in the form (B.8.12) with linear constraints, we may write each term $b_j$ as the difference of two non-negative numbers:
$$b_j = b_j^+ - b_j^-, \qquad b_j^+ = \frac{|b_j| + b_j}{2}, \quad b_j^- = \frac{|b_j| - b_j}{2}$$
The function to optimise becomes
$$J(b) = b^T X^T X b - 2 Y^T X b = (b^+ - b^-)^T X^T X (b^+ - b^-) - 2 Y^T X (b^+ - b^-) =$$
$$= \begin{bmatrix} b^+ \\ b^- \end{bmatrix}^T \begin{bmatrix} X^T X & -X^T X \\ -X^T X & X^T X \end{bmatrix} \begin{bmatrix} b^+ \\ b^- \end{bmatrix} + \begin{bmatrix} -2 Y^T X & 2 Y^T X \end{bmatrix} \begin{bmatrix} b^+ \\ b^- \end{bmatrix} \qquad (12.5.8)$$
with the constraints
$$\begin{bmatrix} 1 & 1 & \dots & 1 \\ -1 & 0 & \dots & 0 \\ 0 & -1 & \dots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & -1 \end{bmatrix} \begin{bmatrix} b^+ \\ b^- \end{bmatrix} \le \begin{bmatrix} L \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$
where the left-hand matrix has size $[2p + 1, 2p]$. The first row of the inequality is (12.5.7), since $\sum_{j=1}^{p} (b_j^+ + b_j^-) = \sum_{j=1}^{p} |b_j| \le L$.
Note that if $L > \sum_{j=1}^{p} |\hat{\beta}_j|$, where $\hat{\beta}$ is the unconstrained least-squares solution, the lasso returns the common least-squares solution. The penalty factor $L$ is typically set by having recourse to cross-validation strategies.
Though the difference between ridge regression and the lasso might seem negligible, the use of a 1-norm penalty instead of a 2-norm has a substantial impact on the number of final coefficients that are set to zero. Figure 12.11 visualises this in a bivariate case: $\hat{\beta}$ denotes the least-squares solution that would be returned by both methods if $\lambda = 0$. Since $\lambda > 0$, the minimisation combines the empirical risk function (whose contour lines are the ellipsoids around $\hat{\beta}$) and the regularisation term (whose contour lines are around the origin). Note that the only difference between the two figures is the shape of the regularisation contour lines (related to the norm used). The minimisation solution is a bivariate vector that lies (depending on the value of $\lambda$) at the intersection of an empirical-risk contour line and a regularisation one. The figure shows that in the lasso case this intersection tends to be closer to the axis $\beta_1 = 0$, meaning that the first estimated coefficient is set to zero. Because of the circular shape of the regularisation contours, this is much less probable in the ridge regression case.
R script
The R script lasso.R implements the quadratic programming minimisation of (12.5.8) by using the R library quadprog. The script applies the lasso strategy to a regression task where the number of features $n$ is comparable to the number of observations $N$ and only a small number of features is relevant. The results show the impact of the constraint $L$ on the empirical risk and the evolution of the lasso solution towards one of the axes. In particular, the smaller $L$, the less importance is given to minimising $J$, the larger the empirical risk and the smaller the number of estimated parameters different from zero.

Figure 12.11: Ridge regression vs lasso [98].
•
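A compact quadratic-programming sketch of (12.5.8) is given below. It is not the book's lasso.R but an independent illustration: the data are synthetic, and a tiny jitter is added to the diagonal (an assumption needed because solve.QP requires a strictly positive definite quadratic term, while the block matrix in (12.5.8) is only semi-definite):

library(quadprog)

set.seed(0)
N <- 50; p <- 10
X <- scale(matrix(rnorm(N * p), N, p))
Y <- X %*% c(3, -2, rep(0, p - 2)) + rnorm(N)       # only two relevant features
L <- 3                                              # 1-norm budget (12.5.7)

XtX <- crossprod(X)
D <- 2 * rbind(cbind(XtX, -XtX), cbind(-XtX, XtX))  # quadratic term in (b+, b-)
diag(D) <- diag(D) + 1e-6                           # jitter for positive definiteness
d <- 2 * c(crossprod(X, Y), -crossprod(X, Y))       # linear term

# Constraints in the solve.QP convention t(Amat) %*% b >= bvec:
# -(sum of b+ and b-) >= -L, plus non-negativity of all 2p variables.
Amat <- cbind(rep(-1, 2 * p), diag(2 * p))
bvec <- c(-L, rep(0, 2 * p))

sol <- solve.QP(D, d, Amat, bvec)
beta.lasso <- sol$solution[1:p] - sol$solution[(p + 1):(2 * p)]
round(beta.lasso, 3)                                # most coefficients driven to zero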
The shrinkage approach has been very successful in recent years, and several variants of the methods mentioned above exist in the literature: some adopt different penalty norms, some combine different norms (e.g. Elastic-net) and some combine shrinkage with greedy search (e.g. Least Angle Regression).
12.5.2 Kernel methods
Many learning algorithms, such as the perceptron, the support vector machine (SVM) and PCA, process data in a linear manner through inner products (Section B.2). Those techniques are exposed to two main limitations: the linear nature of the model and the curse of dimensionality for large $n$.

Kernel methods [171] adapt those techniques by relying on the combination of two smart ideas: i) they address large-dimension-$n$ problems by solving a dual problem in a space of dimension $N$; ii) they generalise the notion of inner product by adopting a user-specified kernel function, i.e., a similarity function over pairs of data points.

Kernel functions operate in a high-dimensional, implicit feature space without computing the coordinates of the data in that space. This makes it possible to take advantage of nonlinear high-dimensional representations without actually having to work in the high-dimensional space.
Figure 12.12: Implicit transformation of the problem to a high-dimensional space.
12.5.3 Dual ridge regression
We introduced the dual formulation of the linear least-squares problem in Section 9.1.18. Consider now a ridge regression problem (Section 12.5.1.1) with parameter $\lambda \in \mathbb{R}^+$. The conventional least-squares solution is the $[n, 1]$ parameter vector
$$\hat{\beta} = (X' X + \lambda I_n)^{-1} X' y$$
where $I_n$ is the identity matrix of size $n$. Since from (B.9.15)
$$(X' X + \lambda I_n)^{-1} X' = X' (X X' + \lambda I_N)^{-1}$$
where $I_N$ is the identity matrix of size $N$, the dual formulation is
$$\hat{\beta} = X' (X X' + \lambda I_N)^{-1} y = X' \alpha$$
where
$$\alpha = (K + \lambda I_N)^{-1} y$$
is the $[N, 1]$ vector of dual variables and $K = X X'$ is the kernel (or Gram) $[N, N]$ matrix. Note that all the information required to compute $\alpha$ is contained in this matrix of inner products.

The prediction for a test $[N_{ts}, n]$ dataset $X_{ts}$ is
$$\hat{y}_{ts} = X_{ts} \hat{\beta} = X_{ts} X' \alpha = K_{ts} (K + \lambda I_N)^{-1} y$$
where $K_{ts}$ is a $[N_{ts}, N]$ matrix with $k_{j,i} = \langle x_j, x_i \rangle$, $j = 1, \dots, N_{ts}$, $i = 1, \dots, N$.

This derivation allows transforming an $n$-dimensional linear task into an $N$-dimensional one, which is of course very relevant if $n \gg N$. However, the model remains linear. What about nonlinear models?
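A quick numeric check of the primal/dual equivalence may be useful; the sketch below uses hypothetical data with $n \gg N$ (all sizes are arbitrary), and the dual route inverts an $[N, N]$ matrix instead of an $[n, n]$ one:

set.seed(0)
N <- 20; n <- 200; lambda <- 1
X <- matrix(rnorm(N * n), N, n)
y <- rnorm(N)

beta.primal <- solve(crossprod(X) + lambda * diag(n), crossprod(X, y))  # [n, n] inversion
K <- tcrossprod(X)                                # Gram matrix X X'
alpha <- solve(K + lambda * diag(N), y)           # [N, N] inversion
beta.dual <- crossprod(X, alpha)                  # beta = X' alpha

max(abs(beta.primal - beta.dual))                 # numerically zero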
12.5.4 Kernel function
Suppose we apply a nonlinear transformation $\Phi: x \in \mathbb{R}^n \to \Phi(x) \in \mathbb{R}^M$ to the inputs of the ridge regression problem discussed above (Figure 12.12). The prediction for an input $x$ would now be
$$\hat{y} = y' (K + \lambda I_N)^{-1} k$$
where
$$K_{i,j} = \langle \Phi(x_i), \Phi(x_j) \rangle, \qquad k_i = \langle \Phi(x_i), \Phi(x) \rangle$$
The rationale of kernel methods is that those inner products can be computed efficiently, without explicitly computing the mapping $\Phi$, thanks to a kernel function [171]. A kernel function is a function $\kappa$ that for all $x, z \in X$ satisfies
$$\kappa(x, z) = \langle \Phi(x), \Phi(z) \rangle$$
where $\Phi$ is a mapping from $X$ to a feature space $F$. For instance,
$$\kappa(x, z) = \langle x, z \rangle^2 = \langle \Phi(x), \Phi(z) \rangle$$
where
$$\Phi: x = (x_1, x_2) \to \Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2) \in F$$
Kernels decouple the specification of the algorithm from the specification of the feature space, since they provide a way to compute dot products in some feature space without even knowing what this space and the mapping $\Phi$ are.
For instance,
$$\kappa(x, z) = (1 + x^T z)^2$$
corresponds to a transformation to an $M = 6$ dimensional space
$$\Phi(x_1, x_2) = (1, x_1^2, x_2^2, \sqrt{2}\, x_1, \sqrt{2}\, x_2, \sqrt{2}\, x_1 x_2)$$
A Gaussian kernel $\kappa(x, z) = \exp(-\gamma \|x - z\|^2)$ corresponds to a transformation to an infinite-dimensional space.
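The identity between the polynomial kernel above and the explicit mapping $\Phi$ can be checked numerically in a couple of lines (the test vectors are arbitrary):

Phi <- function(x) c(1, x[1]^2, x[2]^2,
                     sqrt(2) * x[1], sqrt(2) * x[2], sqrt(2) * x[1] * x[2])
kappa <- function(x, z) (1 + sum(x * z))^2   # kernel evaluated in the input space

x <- c(0.3, -1.2); z <- c(2.0, 0.7)
kappa(x, z)                                  # both lines print the same value
sum(Phi(x) * Phi(z))                         # inner product in the 6-dimensional space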
Theoretically, a Gram matrix must be positive semi-definite (PSD). Empirically, for machine learning heuristics, a choice of function $\kappa$ that does not satisfy the PSD condition may still perform reasonably well if $\kappa$ at least approximates the intuitive idea of similarity.
The general idea of transposing a low-dimensional method to a nonlinear high-dimensional setting by using a dual formulation is generally referred to as the kernel trick: given any algorithm that can be expressed solely in terms of dot products, the kernel trick allows us to construct different nonlinear versions of it.
Kernel methods are, together with deep learning and random forests, among the most successful methods in the history of machine learning. We decided to present them in this section because of their powerful strategy for dealing with settings with high dimension and a low number of observations. Their strength can, however, turn into a weakness if we aim to scale the approach to very large $N$. At the same time, as for all the other methods presented in this book, their generalisation accuracy strictly depends on an adequate choice of the related hyperparameters. In the case of kernel methods, the most important hyperparameters are the regularisation term $\lambda$, the analytical form of the kernel function and its related parameters.
12.6 Similarity matrix and non numeric data
In the previous sections, we have considered feature selection techniques for conventional supervised tasks where data are numeric and represented in the conventional tabular form $D_N$. What about non-conventional tasks where the training set is not a data table but a set of items? Examples of items could be music tracks, texts, images, web sites or graphs. Often, in those cases, we are not able to (or confident enough to) encode each item as a numeric vector of size $n$. Nevertheless, we may be confident in defining a similarity score between pairs of items. For instance, we may use the musical genre to measure the similarity between tracks, or user access statistics to obtain the similarity between web sites.

As a result, we may encode the item set as a similarity matrix $S$ of size $[N, N]$, which becomes an alternative way of representing the dataset.
A symmetric factorisation of a symmetric $[N, N]$ matrix,
$$S \approx F F^T \qquad (12.6.9)$$
is an approximation of the similarity matrix, where $F$ is an $[N, K]$ matrix. The matrix $F$ may be used as an approximate $K$-dimensional numeric representation of the non-numeric item set.

Note that the positive semi-definiteness of $S$ is a necessary and sufficient condition for having an exact factorisation, i.e. an identity in (12.6.9). This is guaranteed in the numeric case where $S$ is the covariance matrix and the pairwise similarity is computed by a dot product. In the generic non-numeric case, techniques to repair the positive semi-definiteness of $S$ may be adopted. An alternative is the use of optimisation techniques to obtain $F$ as the solution of the minimisation task
$$F = \arg\min_U \| S - U U^T \|_F^2$$
Another limitation of the factorisation approach is that it is hardly scalable for very large $N$. For such cases, sampling-based solutions have been proposed in [2].
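A common closed-form sketch of the factorisation, used here as an assumption-laden stand-in for the optimisation above, truncates the eigendecomposition of $S$ and clips negative eigenvalues (one way of repairing a non-PSD similarity matrix):

set.seed(0)
N <- 8; K <- 3
A <- matrix(runif(N * N), N, N)
S <- (A + t(A)) / 2                          # symmetric, not necessarily PSD, similarity

e <- eigen(S, symmetric = TRUE)
d <- pmax(e$values[1:K], 0)                  # clip negative eigenvalues
Fmat <- e$vectors[, 1:K] %*% diag(sqrt(d))   # S is approximately Fmat %*% t(Fmat)

norm(S - tcrossprod(Fmat), "F")              # Frobenius norm of the residual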
12.7 Averaging and feature selection
The role of averaging methods in supervised learning has been discussed in the
previous chapter. Averaging may play a crucial role also in dealing with large
dimensionality. Instead of choosing one particular feature selection method, and
accepting its outcome as the final subset, different feature selection methods can
be combined using ensemble approaches. Since there is no single optimal feature selection technique, and since more than one subset of features may fit the data equally well, model combination approaches have been adapted to improve the robustness and stability of the final discriminative methods.
Ensemble techniques typically rely on averaging the outcome of multiple models
learned with different feature subsets. A well-known technique is the random sub-
space method [102], also known as feature bagging, which combines a set of learners
trained on random subsets of features.
12.8 Feature selection from an information-theoretic
perspective
So far, we have focused on algorithmic methods returning a subset of relevant features, without giving any formal definition of relevance. In this section, we formalise the notion of feature relevance by using concepts of information theory, like entropy, mutual information and conditional information, from Sections 3.8 and 3.8.1.
12.8.1 Relevance, redundancy and interaction
This section defines in information-theoretic terms what a relevant variable is in a supervised learning task where $X$ is a set of $n$ input variables and $y$ is the target. These definitions are obtained by interpreting in information-theoretic terms the definitions given in [117].

Definition 8.1 (Strong relevance). A variable $x_i \in X$ is strongly relevant to the target $y$ if
$$I(X_{-i}; y) < I(X; y)$$
where $X_{-i}$ is the set obtained by removing the variable $x_i$ from $X$.
In other words, a variable is strongly relevant if it carries some information about $y$ that no other variable carries. Strong relevance indicates that the feature is always necessary for an optimal subset.

Definition 8.2 (Weak relevance). A variable is weakly relevant to the target $y$ if it is not strongly relevant and
$$\exists S \subseteq X_{-i}: I(S; y) < I(\{x_i, S\}; y)$$
In other words, a variable is weakly relevant when there exists a certain context $S$ in which it carries information about the target. Weak relevance suggests that the feature is not always necessary but may become necessary under certain conditions. This definition makes clear that for some variables (typically the majority) relevance is not absolute but rather a context-based notion. In a large-variate setting, those features are the hardest to deal with, since their importance depends on the other selected ones.

Definition 8.3 (Irrelevance). A variable is irrelevant if it is neither strongly nor weakly relevant.

Irrelevance indicates that the feature is not necessary at all. This is definitely the easiest case in feature selection: irrelevant variables should simply be discarded.
Example

Consider a learning problem where $n = 4$, $x_2 = -x_3 + w_2$ and
$$y = \begin{cases} 1 + w, & x_1 + x_2 > 0 \\ 0, & \text{else} \end{cases}$$
where $w$ and $w_2$ are noise terms. Which variables are strongly relevant, weakly relevant and irrelevant?

•
Definition 8.4 (Markov blanket). Let us consider a set $X$ of $n$ r.v.s, a target variable $y$ and a subset $M_y \subset X$. The subset $M_y$ is said to be a Markov blanket of $y$, $y \notin M_y$, iff
$$I(y; X_{-(M_y)} \mid M_y) = 0$$
The following theorem can be shown [183, 150]:

Theorem 8.5 (Total conditioning). If the distribution has a perfect map in a DAG (Section 4.3.2.1), then
$$x \in M_y \Leftrightarrow I(x; y \mid X_{-(x,y)}) > 0$$
This theorem proves that, under the specific assumptions about the distribution discussed in Section 4.3.2.1, the Markov blanket of a target $y$ is composed of the set of all the strongly relevant variables in $X$.
Another useful notion for reasoning about the information of a subset of variables is the notion of interaction.

Definition 8.6 (Interaction). Given three r.v.s $x_1$, $x_2$ and $y$, we define the interaction between these three variables as
$$I(x_1; y) - I(x_1; y \mid x_2)$$
Figure 12.13: XOR classification task with two inputs and one binary class taking two values (stars and circles). The two variables $x_1$ and $x_2$ are complementary: alone they bring no information about $y$, but they bring maximal information when considered together.
The interaction term satisfies the following relation:
$$I(x_1; y) - I(x_1; y \mid x_2) = I(x_1; x_2) - I(x_1; x_2 \mid y) = I(x_2; y) - I(x_2; y \mid x_1)$$
In what follows, we show that it is possible to decompose the joint information of two variables into the sum of the two univariate terms and the interaction. From the chain rule (3.8.82),
$$I(x_2; y \mid x_1) + I(x_1; y) = I(x_1; y \mid x_2) + I(x_2; y)$$
we have
$$I(x_2; y \mid x_1) = I(x_2; y) - I(x_1; y) + I(x_1; y \mid x_2)$$
By adding $I(x_1; y)$ to both sides, from (3.8.82) it follows that the joint information of two variables about a target $y$ can be decomposed as follows:
$$I(\{x_1, x_2\}; y) = I(x_1; y) + I(x_2; y) - \underbrace{[I(x_1; y) - I(x_1; y \mid x_2)]}_{\text{interaction}} = I(x_1; y) + I(x_2; y) - \underbrace{[I(x_1; x_2) - I(x_1; x_2 \mid y)]}_{\text{interaction}} \qquad (12.8.10)$$
What emerges is that the joint information of two variables is not necessarily equal to, greater than or smaller than the sum of the two individual information terms. All depends on the interaction term: if the interaction term is negative, the two variables are complementary, i.e. they jointly bring more information than the sum of the univariate terms. This is typically the case of the XOR example illustrated in Figure 12.13 [89], where $I(x_1; y) = 0$ and $I(x_2; y) = 0$ but $I(\{x_1, x_2\}; y) > 0$ and maximal. When the two variables are redundant, the resulting joint information is lower than the sum $I(x_1; y) + I(x_2; y)$.

Since (12.8.10) holds also when $x_1$ and/or $x_2$ are sets of variables, this result sheds an interesting light on the non-monotonic nature of feature selection [199].
12.8.2 Information-theoretic filters
In terms of mutual information, the feature selection problem can be formulated as follows. Given an output target $y$ and a set of input variables $X = \{x_1, \dots, x_n\}$, the optimal subset of $d$ variables is the solution of the optimisation problem
$$X^* = \arg\max_{X_S \subset X, |X_S| = d} I(X_S; y) \qquad (12.8.11)$$
Thanks to the chain rule (3.8.82), this maximisation task can be tackled by adopting an incremental (e.g. forward) approach.

Let $X = \{x_i\}$, $i = 1, \dots, n$, be the whole set of variables and $X_S$ the set of $s$ variables selected after $s$ steps. The choice of the $(s+1)$th variable $x^{(s+1)} \in X - X_S$ can be made by solving
$$x^{(s+1)} = \arg\max_{x_k \in X - X_S} I(\{X_S, x_k\}; y) \qquad (12.8.12)$$
This is known as the maximal dependency problem and requires at each step a multivariate estimation of the mutual information term $I(\{X_S, x_k\}; y)$. Such estimation is often inaccurate in large-variate settings (i.e. large $n$ and large $s$) because of ill-conditioning and high-variance issues.

In the literature, several filter approaches have been proposed to solve the optimisation (12.8.12) by approximating the multivariate term $I(\{X_S, x_k\}; y)$ with low-variate approximations. These approximations are necessarily biased, yet much less prone to variance than their multivariate counterparts.
We mention here two of the most used information-theoretic filters:

• CMIM [74]: since according to the first (chain-rule) formulation
$$\arg\max_{x_k \in X - X_S} I(\{X_S, x_k\}; y) = \arg\max_{x_k \in X - X_S} I(x_k; y \mid X_S)$$
this filter adopts the low-variate approximation
$$I(x_k; y \mid X_S) \approx \min_{x_j \in X_S} I(x_k; y \mid x_j)$$

• mRMR (minimum Redundancy Maximal Relevance) [152]: the mRMR method approximates at the $(s+1)$th step $I(\{X_S, x_k\}; y)$ with
$$I(x_k; y) - \frac{1}{s} \sum_{x_i \in X_S} I(x_i; x_k)$$
where $s$ is the number of features in $X_S$. The method implements a forward selection which selects at the $(s+1)$th step
$$x^{(s+1)} = \arg\max_{x_k \in X - X_S} \left[ I(x_k; y) - \frac{1}{s} \sum_{x_i \in X_S} I(x_i; x_k) \right]$$
that is, a variable with both high relevance $I(x_k; y)$ and low average redundancy with the set $X_S$ (a minimal sketch is given after this list).
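The following is a minimal forward mRMR sketch. The Gaussian formula $I(a; b) = -\frac{1}{2}\log(1 - \rho^2(a, b))$ is used as a simple stand-in for the mutual-information estimator (an illustrative assumption, not the estimator of [152]), and the data are synthetic:

mrmr <- function(X, y, d) {
  mi <- function(a, b) -0.5 * log(1 - cor(a, b)^2)   # Gaussian MI proxy (assumption)
  relevance <- apply(X, 2, mi, b = y)
  S <- integer(0)
  for (s in seq_len(d)) {
    candidates <- setdiff(seq_len(ncol(X)), S)
    score <- sapply(candidates, function(k) {
      red <- if (length(S)) mean(sapply(S, function(i) mi(X[, i], X[, k]))) else 0
      relevance[k] - red                             # relevance minus average redundancy
    })
    S <- c(S, candidates[which.max(score)])
  }
  S
}

set.seed(0)
X <- matrix(rnorm(100 * 20), 100, 20)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.1)   # feature 2 is redundant with feature 1
y <- X[, 1] + X[, 5] + rnorm(100)
mrmr(X, y, 3)                             # tends to pick 1 and 5 before the redundant 2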
12.8.3 Information-theoretic notions and generalisation
Most of this book has dealt with the generalisation error as a means to assess, compare and select prediction models, and in this chapter we presented feature selection as an instance of model selection. Nonetheless, this last part of the chapter has mainly referred to information-theoretic notions for performing feature selection. It is then important to provide some elucidation on how information-theoretic notions relate to the generalisation error.

Given a set $X$ of input features and a target feature $y$, the quantity $I(X; y)$ is not directly observable and has to be estimated before use. Since
$$I(X; y) = H(y) - H(y \mid X),$$
maximising $I(X; y)$ in (12.8.11) is equivalent to minimising $H(y \mid X)$. The term $H(y \mid X)$ is the entropy (or uncertainty) of $y$ once the value of the input set $X$ is given. In the Normal case, this term is proportional to the conditional variance (Equation (3.5.59)). It follows that finding the set of inputs $X$ which minimises $H(y \mid X)$ boils down to finding the set of features that attains the lowest generalisation error (7.2.6).

In real-world settings, since the conditional entropy $H(y \mid X)$ is not observable, it may be approximated by the generalisation error, e.g. by the MISE in the regression case. The link between feature selection, generalisation error and information theory should now be clear: finding the set that maximises the mutual information in (12.8.11) boils down to finding the set that minimises the estimated generalisation error (12.4.3).
12.9 Assessment of feature selection
Most of the discussed techniques aim to find the best subset of features by per-
forming a large number of comparisons and selections. This additional search layer inevitably increases the space of possible models and the variance of the resulting one. Despite the use of validation procedures, low misclassification or prediction errors may be found merely by chance. As stated in [132], given a sufficiently exhaustive search, some apparent pattern can always be found, even if all predictors have come from a random number generator. This is because, as a consequence of the search process, the set of features depends on the data used to train the model, thereby introducing what is called selection bias⁴ [132].
A bad (and dangerous) practice is using the same set of observations to select the feature set and to assess the accuracy of the classifier. Even if cross-validation is used to assess the accuracy of the classifier, this will return an overoptimistic assessment of the generalisation error (Figure 12.14). Cross-validation has to be used to assess the entire learning process, which is composed of both a feature selection and a classification step. This means that, for each fold, both feature selection and classification have to be performed before testing on the observations set aside. Keeping feature selection out of cross-validation returns an assessment that is the more biased, the smaller the number of observations.
If cross-validation cannot be carried out (e.g. because the training set is too small), then the use of external validation sets is strongly recommended.

If no additional data are available, an alternative consists of comparing the generalisation accuracy returned by cross-validation on the original data with the one obtained by re-running the learning procedure on randomised datasets. This is inspired by the method of permutation testing described in Section 6.6. The procedure consists of repeating the feature selection and the cross-validation assessment several times by using a randomised dataset instead of the original one. For instance, a randomised dataset may be obtained by reshuffling the output vector, a permutation that artificially removes the dependency between the inputs and the output. After a number of repetitions with randomised datasets, we obtain
⁴For a formal justification of selection bias, see Appendix C.11. A causal perspective on selection bias is given in Section 13.7.4.
Figure 12.14: Selection bias associated with feature selection: the internal leave-one-out is an overoptimistic estimator of the test generalisation error.
the null distribution of the accuracy in case of no dependency between inputs and output. If the accuracy associated with the original data is not significantly better than the one obtained with randomised data, we are overfitting the data. For instance, let us consider a large-variate classification task where the cross-validated misclassification error after feature selection is 5%. If we repeat the same learning procedure 100 times with randomised datasets and obtain a significant number of times (e.g. 10) a misclassification error smaller than or equal to 5%, this is a sign of potential overfitting.
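The randomisation check can be sketched as follows; the pipeline (correlation ranking inside each fold, followed by a linear model) and all sizes are illustrative assumptions:

cv.after.fs <- function(X, y, k = 5, folds = 5) {   # feature selection INSIDE each fold
  idx <- sample(rep(1:folds, length.out = length(y)))
  mean(sapply(1:folds, function(f) {
    tr <- idx != f
    sel <- order(abs(cor(X[tr, ], y[tr])), decreasing = TRUE)[1:k]
    m <- lm(y[tr] ~ ., data = data.frame(X[tr, sel, drop = FALSE]))
    mean((y[!tr] - predict(m, data.frame(X[!tr, sel, drop = FALSE])))^2)
  }))
}

set.seed(0)
N <- 40; n <- 200
X <- matrix(rnorm(N * n), N, n)
y <- X[, 1] + rnorm(N)
orig <- cv.after.fs(X, y)
null <- replicate(100, cv.after.fs(X, sample(y)))   # null distribution (reshuffled outputs)
mean(null <= orig)    # fraction of randomised runs doing at least as well as the original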
Making a robust assessment of a feature selection outcome is of striking importance today, because we are more and more confronted with tasks characterised by a very large feature-to-sample ratio (e.g. in bioinformatics [8]), where a bad assessment procedure can give too optimistic (overfitted) results.
Example
The script fselbias.R illustrates the problem of selection bias in the case of an intensive search during feature selection. Consider a linear input/output regression dependency with $n = 54$ inputs (of which only 4 are relevant and the others irrelevant) and a dataset of size $N = 20$. Let us perform a forward search based on an internal leave-one-out. Figure 12.14 shows the evolution of the internal leave-one-out MSE and a more reliable estimation of the generalisation MSE based on an independent test set (5000 i.i.d. examples from the same input/output process). It appears that, as the feature-set size increases, the internal leave-one-out error returns a very optimistic estimation of the generalisation error. Therefore, the internal leave-one-out error is unable to detect that the optimal size of the input set (i.e. the number of strongly relevant variables) is equal to four.
•
12.10 Conclusion
Nowadays, feature selection is an essential component of a real-world learning
pipeline. This chapter discussed how the problem is typically addressed as a stochas-
tic optimisation task in a combinatorial state-space where the assessment of each
solution and the search strategy are key elements. Most heuristic approaches rely on a monotonicity assumption, stating that the best subset of size $k$ is always contained in the best subset of size greater than $k$. The theorem in Section 12.4.2 and the notions of interaction discussed in Section 12.8.1 show that this assumption is simplistic.
Variables that are almost non-informative alone may become extremely informative together, since the relevance of a feature is context-based. This is made formal by notions from graphical models (e.g. d-separation or the Markov blanket) which unveil the conditional nature of dependency (Chapter 4). Our opinion is that the best way of conceiving feature selection is not as black-box optimisation but as reasoning on the conditional structure of the distribution underlying the data. The final aim should be, as much as possible, to shed light on the context-based role of each feature. In recent years there have been many discussions about the interpretability of data-driven models, though it is not always made clear what the most valuable information for the human user is. We deem that in a large-variate task the most useful outcome should be an interpretable description of the features, returning for each of them a context-based degree of relevance. Accuracy is only a proxy of information: the real information is in the structure.
12.11 Exercises
1. Consider the dataset
x1  x2  x3  y
1 1 0 1
0 0 1 0
0 1 0 0
0 1 1 0
1 1 0 0
1 0 1 1
1 0 0 1
0 1 1 0
1 0 1 0
1 0 0 0
1 1 0 0
0 1 1 0
Rank the input features in decreasing order of relevance by using the correlation
$$\rho_{xy} = \frac{\hat{\sigma}_{xy}}{\hat{\sigma}_x \hat{\sigma}_y}$$
as the measure of relevance.
Solution: Since $\rho_{x_1 y} = 0.488$, $\rho_{x_2 y} = -0.293$ and $\rho_{x_3 y} = -0.192$, the ranking is $x_1, x_2, x_3$.
2. Consider a regression task with two inputs $x_1$, $x_2$ and output $y$. Suppose we observe the following training set:

   X1     X2     Y
  -0.2    0.1    1
   0.1    0      0.5
   1     -0.3    1.2
   0.1    0.2    1
  -0.4    0.4    0.5
   0.1    0.1    0
   1     -1      1.1

1. Fit a multivariate linear model with $\beta_0 = 0$ to the dataset.
2. Compute the mean squared training error.
3. Suppose you use a correlation-based ranking strategy for ranking the features. What would be the top-ranked variable?
Hint:
$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{bmatrix} \Rightarrow A^{-1} = \frac{1}{a_{11} a_{22} - a_{12}^2} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{12} & a_{11} \end{bmatrix}$$
Solution:

1. 
$$X^T X = \begin{bmatrix} 2.23 & -1.45 \\ -1.45 & 1.31 \end{bmatrix}, \quad (X^T X)^{-1} = \begin{bmatrix} 1.599 & 1.77 \\ 1.77 & 2.72 \end{bmatrix}, \quad X^T Y = \begin{bmatrix} 2.05 \\ -0.96 \end{bmatrix}$$
$$\beta = (X^T X)^{-1} X^T Y = \begin{bmatrix} 1.58 \\ 1.016 \end{bmatrix}$$

2. 
$$e = Y - X\beta = \begin{bmatrix} 1.21 \\ 0.34 \\ -0.08 \\ 0.64 \\ 0.73 \\ -0.26 \\ 0.54 \end{bmatrix}$$
It follows that the mean squared training error amounts to 0.41.

3. Since
$$\rho_{X_1 Y} = \frac{\sum_{i=1}^{N} (X_{i1} - \mu_1)(Y_i - \mu_Y)}{\sqrt{\sum_{i=1}^{N} (X_{i1} - \mu_1)^2 \sum_{i=1}^{N} (Y_i - \mu_Y)^2}} = 0.53$$
and
$$\rho_{X_2 Y} = \frac{\sum_{i=1}^{N} (X_{i2} - \mu_2)(Y_i - \mu_Y)}{\sqrt{\sum_{i=1}^{N} (X_{i2} - \mu_2)^2 \sum_{i=1}^{N} (Y_i - \mu_Y)^2}} = -0.48$$
where $\mu_1 = 0.24$, $\mu_2 = -0.07$ and $\mu_Y = 0.75$, $X_1$ is the top-ranked variable.
3. The .Rdata file bonus4.Rdata in the directory gbcode/exercises of the companion R package contains a regression dataset with $N = 200$ observations, $n = 50$ input features (in the matrix X) and one target variable (the vector Y).

Knowing that there are 3 strongly relevant and 2 weakly relevant variables, the student has to define and implement a strategy to find them. No existing feature selection code may be used; however, the student may use libraries implementing supervised learning algorithms. The student's code should

• return the positions of the 3 strongly relevant and of the 2 weakly relevant variables;

• discuss what strategy could have been used if the number of strongly and weakly relevant variables was not known in advance.

Solution: See the file Exercise4.pdf in the directory gbcode/exercises of the companion R package (Appendix F).
Chapter 13
From prediction to causal
knowledge
We live in an uncertain and large-dimensional world where it is getting easier and easier to observe and collect large amounts of data, but not necessarily to control their generation. This setting is called observational (Section 8.3) and refers to situations in which we cannot interfere or intervene in the process of generating and capturing the data. In this setting, machine learning is a powerful tool to detect dependencies (e.g. correlations) between variables (Chapter 12). Figure 13.1 shows two highly dependent variables: in such an observational setting we may easily predict one variable given the other, but what can we deduce about their causal relation, i.e. what will happen to one variable once we manipulate the other? (Un)fortunately not much, as shown by the labels of the variables [190]¹.
An accurate predictive model, built from observational data, might tell us very
little about what would happen in case of input manipulation. Most machine learn-
ing algorithms are not conceived to estimate causal effects but to predict observed
outcomes. Now, many of the most crucial questions in science are not merely pre-
dictive but causal. For example, what is the efficacy of a given drug in a given
population? What fraction of deaths from a given disease could have been avoided
by a given treatment or policy? What was the cause of death of a given individual
in a specific incident? What is the impact of education on criminality?
Answering those questions, and then taking the correct decision, requires being
able to make predictions under manipulations or potential experiments (and not
only observations). Is this possible if, for practical or ethical reasons, the experi-
ments or manipulations cannot be done? Have observational data-driven approaches any hope of untangling causality from association? This will be the topic of this chapter.

¹You can find several other funny examples of spurious correlation on the website http://tylervigen.com/spurious-correlations

Figure 13.1: A significant correlation between two causally unrelated variables.

Figure 13.2: Pearl's causal ladder [149].
13.1 About the notion of cause
The notion and the importance of causal reasoning have been largely debated by philosophers:
•Men are never satisfied until they know the why of a thing (Aristotle).
•I would rather discover one cause than gain the kingdom of Persia (Democri-
tus).
•A thing cannot occur without a cause that produces it (Laplace).
• Felix, qui potuit rerum cognoscere causas — fortunate is he who could come to know the causes of things (Virgil).
At the same time, for many philosophers and statisticians, causation is a suspi-
cious metaphysical concept that should be avoided when discussing science and/or
making statistics. This suspicion derives from the Hume empiricist tradition (Sec-
tion 2.4). Empiricism always tried to interpret science without having recourse to
unobservable and hidden entities. In that perspective, causation is a hidden and
unobservable connection between things.
The philosopher John Stuart Mill (1843) was one of the first to make explicit the properties of causal relationships. According to his definition, an event A can be defined to be a cause of an event B if:

1. A is repeatedly associated with B (concomitant variation);

2. A has to be present each time the effect B occurs (necessity);

3. B occurs regularly when A is introduced (sufficiency).
Although this definition is valuable from a historical perspective, it has a number
of limitations: it refers only to categorical aspects (no intensity), it assumes deter-
minism (no variability), and it relates to a univariate setting only (no context).
Also, pretending that logical induction can formalise causality is wishful thinking. Consider the following example from [147]: in logic, if "A → B" and "B → C" then "A → C". But "if the sprinkler is on then the ground is wet" and "if the ground is wet then it rained" do not imply that "if the sprinkler is on then it rained".
Now, interesting tasks are multivariate and stochastic. For this reason, we will explore a more advanced formalism of causality (proposed by J. Pearl² [146]) which relies on (and extends) probabilistic reasoning. We will show that such a formalism i) eliminates any aspect of vagueness from the definition of causality, ii) makes this notion falsifiable and iii) pinpoints the crucial role of causal modelling in data understanding and decision making.

The debate on the scientific role of causality is related to the debate about the aim of science: is it more about prediction or explanation? This book has largely covered the issue of prediction without having recourse to explanation or interpretability aspects. In the following section, we will show that uncovering mechanisms from data is a way of explaining something and of returning human-interpretable and valuable information. As visualised in J. Pearl's causal ladder (Figure 13.2) [149], causal information can be considered the ultimate (and probably the most precious) outcome of knowledge discovery from data.
13.2 Causality and dependencies
A dependence between an input x and an output y does not always imply the existence of a causal relationship between them. For instance, dependence may occur when a latent phenomenon (confounder) causes two effects. In this case, a statistically significant, yet non-causal, relation between the effects is established. For instance, in the example of Figure 13.1, the latent confounder is time: both (causally independent) variables change in the same manner with time and thus happen to be statistically correlated.

In general, it would be erroneous and fallacious to deduce causality from the existence of a statistical dependency alone: "correlation does not imply causation", as statisticians are used to saying. To better illustrate this notion, let us consider the following example.
The Caca-Cola study
The Caca-Cola marketing department aims to show that, unlike what is feared by most parents, the famous refreshing drink is so healthy that it improves its drinkers' sports performances. To support this idea, the department funds a statistical study on the relationship between the number of litres of Caca-Cola drunk per day and the time (in seconds) a drinker takes to run the 100 metres. Here is the dataset collected by the statisticians:
²He was awarded the 2011 Turing Award, known as the Nobel Prize of computing.
Figure 13.3: Sport performance improves with the amount of Caca-Cola drinking.
Is this a real causal effect?
Litres per day   Seconds
1.00 11.9
1.09 12.8
0.20 14.3
0.18 15.3
1.02 12.2
1.04 12.5
1.06 12.6
0.00 16.3
1.08 12.7
0.18 17.7
0.50 14.0
0.17 17.6
illustrated by Figure 13.3, which plots the performance (in seconds) as a function of the number of litres drunk per day.

The Caca-Cola marketing department is excited: Caca-Cola seems to have magnificent effects on sprint performance, as illustrated by the significant correlation between the number of litres and the running time in Figure 13.3. The CEO of the company triumphantly extrapolates, on the basis of a sophisticated machine learning tool, that any human being fed with more than 3 litres per day could easily beat the world record. In front of such enthusiasm, the League of Parents is skeptical and asks for a simple elucidation: did the Caca-Cola statisticians record the age of the sprinters? Under growing public-opinion pressure, Caca-Cola is forced to publish the complete dataset.
Figure 13.4: Performance deteriorates as Caca-Cola drinking increases for young
athletes.
Age   Litres per day   Seconds
17 1.00 11.9
19 1.09 12.8
49 0.20 14.3
59 0.18 15.3
21 1.02 12.2
19 1.04 12.5
17 1.06 12.6
62 0.00 16.3
21 1.08 12.7
61 0.18 17.7
30 0.50 14.0
65 0.17 17.6
At last, truth can triumph! The Caca-Cola statisticians had hidden the real cause of good (or bad) performance: the age of the athletes! Since youngsters tend to drink more Caca-Cola as well as having better sports performance, the first causal relationship between litres drunk and performance was fallacious. On the contrary, a more detailed analysis of a homogeneous group of young athletes shows that Caca-Cola tends to deteriorate the performances. Figure 13.4 plots the performance (in seconds) as a function of litres drunk per day exclusively for the subgroup of young (less than 30) athletes. Note that the Caca-Cola marketing department was not wrong in claiming the existence of a significant relationship between Caca-Cola and performance. On the contrary, they were definitely wrong when they claimed the existence of a cause-effect relationship between these two variables.
•
13.2.1 Simpson's paradox
The Caca-Cola example is a typical instance of Simpson's paradox: an association between a pair of variables can consistently be inverted in each subpopulation when the population is partitioned and, conversely, associations holding in each subpopulation can be inverted when the data are aggregated.

Note that there is nothing paradoxical in Simpson's paradox from the standpoint of arithmetic: it is simply due to the close connections between proportions, percentages and probabilities. It is in fact possible to find eight integers such that
$$\frac{a}{b} < \frac{A}{B} \quad \text{and} \quad \frac{c}{d} < \frac{C}{D} \quad \text{but} \quad \frac{a+c}{b+d} > \frac{A+C}{B+D}$$
For instance, $1/5 < 2/8$ and $6/8 < 4/5$, yet $7/13 > 6/13$.
Let us now consider the outcome of a medical experiment where a treatment T has been administered to patients and a binary outcome Y has been recorded. Here G = 0 stands for females, G = 1 for males, Y = 1 for recovery and T = 1 for treatment administration. Suppose that the distribution of the treatment among genders is given by the table below (cell entries are numbers of patients):

        T=0   T=1
G=0      5     8     P(T=1|G=0) = 8/13
G=1      8     5     P(T=1|G=1) = 5/13
P(G=1|T=0) = 8/13    P(G=1|T=1) = 5/13
It follows that the distribution of recovery conditional on gender and treatment is given by:

        T=0                       T=1
G=0     P(Y=1|T=0,G=0) = 1/5      P(Y=1|T=1,G=0) = 2/8      P(Y=1|G=0) = 3/13
G=1     P(Y=1|T=0,G=1) = 6/8      P(Y=1|T=1,G=1) = 4/5      P(Y=1|G=1) = 10/13
        P(Y=1|T=0) = 7/13         P(Y=1|T=1) = 6/13
We obtain then the following probabilistic inequalities:

P(Y=1|T=0, G=0) = 1/5 < P(Y=1|T=1, G=0) = 2/8
P(Y=1|T=0, G=1) = 6/8 < P(Y=1|T=1, G=1) = 4/5
P(Y=1|T=0) = 7/13 > P(Y=1|T=1) = 6/13
If we interpret the probabilistic relationships in causal terms, it appears that the treatment is effective both for the female subpopulation (P(Y=1|T=1, G=0) > P(Y=1|T=0, G=0)) and for the male subpopulation, yet this is not the case for the entire population (P(Y=1|T=0) > P(Y=1|T=1)).
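Since the cell counts are small integers (1 recovery out of 5, 2 out of 8, 6 out of 8 and 4 out of 5), the inversion can be checked in a couple of lines of R; the snippet below is just a numerical verification of the tables above, not an analysis method.

    # recoveries r and patients n per gender (rows) and treatment (columns)
    r <- matrix(c(1, 2, 6, 4), nrow = 2, byrow = TRUE,
                dimnames = list(G = 0:1, T = 0:1))
    n <- matrix(c(5, 8, 8, 5), nrow = 2, byrow = TRUE,
                dimnames = list(G = 0:1, T = 0:1))
    r / n                    # P(Y=1|T,G): the treated column wins in each row
    colSums(r) / colSums(n)  # P(Y=1|T): 7/13 > 6/13, the inversion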
Though all this is perfectly compliant with arithmetic (and statistics), it poses problems from a decision-making perspective: should we base a decision on the statistics from the aggregate population or from the partitioned subpopulations? Does it make sense to have a treatment which is effective both for men and for women but not for the union of them? In probabilistic terms, should we trust the marginal or the conditional distribution? Pearl's answer is that the solution cannot be found in probability but in causal reasoning: we need a piece of missing information to settle the question, and this information is in the causal graph. In other words, translating a probabilistic relationship into a causal relationship is the "original sin" in the Simpson paradox.
Figures 13.5 and 13.6 show two alternative causal graphs and, for the moment, it is sufficient for the reader to know that a direct arrow from one node to another represents a direct causal action. Knowing which causal graph corresponds to our medical setting will allow solving the paradox [148].
Looking at both figures, it appears evident that in our case the correct causal graph is the one in Figure 13.5: it is not conceivable that a treatment causes a person's gender (as in Figure 13.6), while it is possible that both a treatment and its efficacy depend on the gender. This means that in our case the treatment is beneficial and that the meaningful probabilistic relations are the gender-conditional ones, P(Y=1|T=1, G=g) > P(Y=1|T=0, G=g). We will show afterwards that, for a graph like Figure 13.5, the causal effect of the treatment T on Y is correctly estimated by conditioning on gender.
Figure 13.5: Causal graph where G is a direct cause of both T and Y.
Figure 13.6: Causal graph where T is a direct cause of G.
The paradox derives from the fact that both gender and treatment have an impact on recovery and that gender has an impact on treatment as well. Females recover much less (in probabilistic terms P(Y=1|G=0) < P(Y=1|G=1)) and take the drug more often (P(T=1|G=0) > P(T=1|G=1)). Using the non-conditioned relation would be erroneous and counterintuitive.
Now let us keep all the observed numbers fixed but change the semantics of the variables. Suppose that G stands for some beneficial mechanism, P(Y=1|G=1) > P(Y=1|G=0), whose functioning is affected by the treatment, P(G=1|T=1) < P(G=1|T=0). Which causal graph do you think is more adequate now? Would you condition on G or not to measure the causal effect of T on Y?
13.3 Causal vs associational knowledge
Human common-sense knowledge is about how things work in the world. Such knowledge is (i) effective, since it gives humans the ability to intervene in the world and change it, and (ii) causal, because it is about the mechanisms that lead from causes to effects. Here mechanism stands for an input/output relationship where the choice of the inputs determines the outputs, but the reverse does not hold.
The goal of many sciences is to formalise such common sense: to understand the mechanism by which variables came to take on the values they have, and to predict what the values of those variables would be if the naturally occurring mechanisms
were subject to manipulations. It follows that the questions that motivate most studies in the health, social and behavioural sciences are not associational but causal in nature.
Causal analysis aims to infer not only beliefs or probabilities under static conditions but also the dynamics of beliefs under changing conditions, for example, changes induced by treatments or external interventions. However, according to Pearl [146], associational studies have received more interest than causal studies for the following reasons:

• associational assumptions, even untested, are testable in principle, given sufficiently large samples and sufficiently fine measurements; causal assumptions, in contrast, cannot be verified even in principle unless one resorts to experimental control;

• associational assumptions can be expressed in the familiar language of probability calculus and have thus assumed an aura of scholarship and scientific respectability; causal assumptions, until recently, were deprived of such respectability and were suspected of informal, anecdotal or metaphysical thinking.
In order to address the lack of an adequate probabilistic formalism to represent
the notion of manipulation, Pearl introduced the do() operator, which allows us to
distinguish between the conventional observational notion of statistical dependency
(quantified in terms of conditional probability) and the interventional notion of
causal dependency.
Let us first recall that a (discrete) random variable y is said to be dependent (Section 3.5.2) on a variable x if the distribution of y is different from the marginal one when we observe the value x = x:

Prob{y | x = x} ≠ Prob{y}

The property of dependency is symmetric: if y is dependent on x, then x is dependent on y as well, i.e.

Prob{x | y = y} ≠ Prob{x}
The concept of causality describes a process where the control (not simply the observation) of one event changes the likelihood of the occurrence of another event.

Definition 3.1 (Cause). A variable x is a cause of a variable y if the distribution of y is different from the marginal one when we set the value x = x:

Prob{y | do(x = x)} ≠ Prob{y}

In other terms, x is a cause of y if we can change y by manipulating x but not the other way round. Unlike dependency, causality is asymmetric, i.e.

Prob{x | do(y = y)} = Prob{x}
It is important then to summarise the differences between conditioning on an intervention and conditioning on an observation.
Intervention is formalised by the do() operator and corresponds to setting (or better, manipulating) a variable to a specific value. This manipulation may change the probabilistic model and the nature of the dependencies between the variables (e.g. the stochastic dependency between a cause and an effect is lost once we manipulate the effect).
Conditioning is formalised by the conditional probability operator and corresponds to observing a specific value of a variable. By observing, we change nothing in the causal mechanism: we simply narrow our focus to a subset of observations. In plain words, the world does not change; only our perception does.
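The asymmetry between conditioning and intervening can be verified numerically. The following sketch (an invented linear example, not one of the book's scripts) simulates the mechanism x → y: observing the effect y changes our belief about the cause x, while manipulating y leaves the distribution of x untouched.

    set.seed(0)
    N <- 1e5
    x <- rnorm(N)
    y <- 2 * x + rnorm(N)         # causal mechanism: x -> y
    # conditioning: observing y close to 2 is informative about x
    mean(x[abs(y - 2) < 0.1])      # approx 0.8, far from the marginal mean 0
    # intervention: do(y = 2) overrides the mechanism; x is untouched
    y <- rep(2, N)                 # manipulated effect
    mean(x)                        # approx 0: Prob{x | do(y = 2)} = Prob{x}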
13.4 The two main problems in causality
The literature on causality addresses two main problems: (i) the estimation of causal
effects and (ii) the causal discovery from observational data. In the first problem,
the causal mechanism is known or guessed, but the goal is to assess causal effects
from an observational or experimental study. For instance, in a medical study, we
could be confronted with two groups of units (exposed/non-exposed), and we would
like to quantify the causal effect of the exposure (smoking versus not smoking). In
the second problem, we have no (or very limited) information about the causal
mechanism, and the goal is to reconstruct the causal mechanism from data. For
instance, in bioinformatics, this problem is encountered every time we want to infer
a transcriptional network from observed genomics data.
In what follows, we will show how the first problem has been addressed both by
the potential outcomes and the graphical modelling approaches. Then we will dis-
cuss the issue of causal discovery, notably thanks to structure learning in graphical
models.
13.5 Causality and potential outcomes
Causality is about estimating the effect of an action (or manipulation, treatment, intervention), applied to a unit (e.g. a person, a physical object, a firm), on a specific outcome. Although a unit has been (at a particular point in time) exposed to a particular action (treatment/regime), in principle it could have been exposed to an alternative action at the very same point in time. This concept has been formalised by Rubin [164] with the notion of potential outcome, denoting the random distribution of an outcome variable which can be observed only if the corresponding action is performed. With a binary treatment, for each realised action there is a potential outcome, associated with the alternative action, which could not be observed. The effect of an action is then related to both potential outcomes (what happened and what could have happened).
Let y(x) be read as the outcome y under treatment x. If x can take two values, the potential outcomes y^(1) and y^(0) are two outcome variables (distinct from y) which denote the distribution of the outcome if the (binary) action x had (or had not) been taken:

y^(0) ∼ Prob{y | do(x = 0)},   y^(1) ∼ Prob{y | do(x = 1)}

Note that the definition of those outcomes is irrespective of which action actually occurred. The link between y (the observation), y^(0) and y^(1) is

y = y^(1) if x = 1,   y = y^(0) if x = 0

Since in reality we can only observe either y^(1) or y^(0), the observed y_i satisfies the following relation:

y_i = x_i y_i^(1) + (1 − x_i) y_i^(0) = y_i^(0) + (y_i^(1) − y_i^(0)) x_i.   (13.5.1)
Potential outcomes and observations
Let us consider the notion of potential outcome in the context of the following
observed dataset:
unit   treatment x   observed y   y^(0)   y^(1)
1      1             y_1          ?       y_1
2      0             y_2          y_2     ?
...    ...           ...          ...     ...
i      1             y_i          ?       y_i
...    ...           ...          ...     ...
N      0             y_N          y_N     ?
For each unit in the population, the observed outcome is called the factual outcome y_F, while the other is called the counterfactual outcome y_CF. Note that the factual outcome is sometimes drawn from y^(0) and sometimes from y^(1), according to the performed action. This definition makes explicit that the observed random variable y is only a lumped version of the potential outcomes.
•
Beyond treatment and outcome, the third major actor of the causal inference problem is the set of covariates, denoted by z. For instance, in a medical setting, they could be measures related to the individual medical history or family background information. The key characteristic of these covariates is that they are a priori known to be unaffected by the treatment assignment. We will see later that covariates may help to improve the accuracy of the causal effect estimation by defining configurations or groups of interest which might play a role in the assignment mechanism.
13.5.1 Causal effect
The causal effect is defined as the comparison of potential outcomes, for the same unit, at the same moment in time. For a given unit (or individual) i and a binary treatment, the unit-level causal effect (or individualised treatment effect) is

y_i^(1) − y_i^(0).

The average causal effect over the population is

τ = E[y^(1)] − E[y^(0)].   (13.5.2)
The definition of causal effect depends on the potential outcomes but not on the actually observed outcome: nevertheless, in practice, if we wish to estimate it, we will have to use observed outcomes over multiple units and different times.
For the sake of estimation it is then crucial to know (or at least to make assumptions about) why some actions were taken and others not: this is known as the assignment mechanism and quantifies the statistical dependency between treatment, outcome and covariates.
Example

Let us consider a population of patients that is heterogeneous in terms of age. Let us suppose that the older the patient, the higher the death risk y. Let us give a treatment (e.g. a drug or hospitalisation) to the oldest patients only and compute the average risk of treated patients vs untreated ones. Since the a priori risk of old people was already high, after the treatment their average risk (though decreased thanks to the treatment) remains higher than the risk of the non-treated persons. In other terms, though over the entire population

E[y^(1)] < E[y^(0)]
Figure 13.7: Observed vs. causal effect (risk y plotted against age, annotating the observed effect, the causal effect and the bias). Left: y^(0) distribution (untreated population). Right: y^(1) distribution (treated population). Black dots denote the set of observed patients: half of them are young and untreated (left) and half of them are old and treated (right).
i.e. the individual causal effect is constant and negative (a risk reduction), we observe (Figure 13.7)

E[y | x = 1] > E[y | x = 0]

i.e. an apparently positive treatment effect (a risk increase).
It follows that by looking only at the observed data, we would draw the wrong conclusion that the treatment is harmful. Note that this is analogous to considering a surgeon with many patients deceased during operations less talented than a surgeon with few deceased patients: if we ignore the severity of the patients' state, the different death rates could be misleading, since the first surgeon could be the one dealing with all the difficult cases precisely because (s)he is more talented.
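A small simulation makes the numbers concrete (the linear risk model and all the constants below are invented for illustration): the treatment lowers every patient's risk by 0.2, yet the observed difference of means has the opposite sign because only the oldest patients are treated.

    set.seed(1)
    N <- 1e4
    age <- runif(N, 20, 90)
    y0 <- 0.01 * age + rnorm(N, sd = 0.05)  # potential outcome without treatment
    y1 <- y0 - 0.2                          # constant negative causal effect
    x <- as.numeric(age > 60)               # assignment: only the oldest treated
    y <- x * y1 + (1 - x) * y0              # observed (factual) outcome (13.5.1)
    mean(y1 - y0)                           # true average causal effect: -0.2
    mean(y[x == 1]) - mean(y[x == 0])       # observed effect: positive!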
•
13.5.2 Estimation of causal effect
We can never measure a causal effect (for instance (13.5.2)) directly, since the counterfactual outcome is missing by definition; such a missing-data problem is called the fundamental problem of causal inference.
In the absence of counterfactual data, can observed data be used to estimate causal treatment effects for given covariates? This is the aim of causal analysis, which targets the quantity

E[y^(1) | x = 1] − E[y^(0) | x = 1] = E[y^(1) − y^(0) | x = 1]

representing the average difference between the risk of the treated and what would have happened to them (counterfactual) had they not been treated.
It can be shown that

E[y | x=1] − E[y | x=0]   (observed effect)
   = E[y^(1) | x=1] − E[y^(0) | x=0]
   = ( E[y^(1) | x=1] − E[y^(0) | x=1] )   (average causal effect on the treated)
   + ( E[y^(0) | x=1] − E[y^(0) | x=0] )   (selection bias)    (13.5.3)
The selection bias term shows that the observed effect may be a biased estimate of the causal effect if the distribution of the potential outcomes is not independent of the assignment strategy (i.e. (y^(0), y^(1)) ⊥⊥ x does not hold). This is also known as a non-exchangeable configuration. We invite the reader to retrieve the terms of (13.5.3) in Figure 13.7.
In order to draw valid causal inferences, we need to make assumptions about the assignment mechanism, that is, the distribution

P(x = 1 | z = z, y^(0), y^(1))   (13.5.4)

describing the probability that a unit receives the treatment given the covariates and the potential outcomes of the problem.
13.5.3 Assignment mechanism assumptions

An assignment mechanism is:

• individualistic if the treatment of a unit does not depend on the other units. For instance, sequential assignments are not allowed in this case;

• probabilistic if the probability of assignment (13.5.4) is strictly between zero and one, i.e. each unit has a non-null probability of being assigned both treatments (levels 0 and 1);

• unconfounded if it is conditionally independent of the potential outcomes, i.e. P(x = 1 | z = z, y^(0), y^(1)) = P(x = 1 | z = z) = e(z).

An individualistic, probabilistic and unconfounded assignment is also referred to as a strongly ignorable treatment assignment. A randomised experiment is an assignment mechanism that has a known probabilistic functional form that is controlled by the researcher. In an observational study, the functional form of the assignment mechanism is unknown.
13.5.4 About unconfoundedness

The main obstacle to unbiased causal reasoning is confounding, which occurs when the assignment of the treatment is not independent of the potential outcomes, e.g. because of omitted or non-observable variables related to both the treatment and the outcome. Unconfoundedness (also known as ignorability) states that the potential outcomes are independent of the observed treatment conditional on the confounding covariates:

(y^(0), y^(1)) ⊥⊥ x | z.   (13.5.5)

In plain words, for individuals with the same z, knowing the treatment brings no information about y^(0) or y^(1) or, equivalently, the treatment does not depend on the causal type. For instance, for individuals of the same age, the treatment is not given preferentially to the patients who are supposed to react better, and knowing that a patient got the treatment does not provide any information about the odds of success.
Note that the lack of dependence between the treatment and the outcome does not imply that the assignment is ignorable:

y ⊥⊥ x  ⇏  (y^(0), y^(1)) ⊥⊥ x

At the same time, an ignorable assignment with a non-zero treatment effect does not imply independence between the treatment and the outcome:

(y^(0), y^(1)) ⊥⊥ x  ⇏  y ⊥⊥ x
The correct interpretation of unconfoundedness is that, for a given covariate value z, knowing which treatment has been given to a unit (i.e. observing x) does not provide any additional information on the distribution p(y | do(x)). This is not the case in the example of Figure 13.7, where by knowing the treatment we know the age and hence the death risk of the patient.
Conversely, we have a non-ignorable assignment mechanism if the treatment is assigned on the basis of unmeasured characteristics of the units. Naively, we could then expect that, to ensure the plausibility of such an assumption, we should condition on as much pre-treatment information as possible. However, one can never prove that the treatment assignment process in an observational study is ignorable, since it is always possible that the choice of treatment depends on relevant yet latent information.
13.5.5 Randomised designs
A randomised experiment is a probabilistic assignment mechanism whose form is known and controlled by the researcher. The aim of randomisation is to make the treatment independent of its other causes, thereby destroying possible confounding configurations, due either to observed or to unobserved variables. In the case of a binary treatment, a random assignment process guarantees that the resulting treatment and control groups are identical with respect to any characteristic (other than the treatment) influencing the outcome.
Though it is the "gold standard" procedure for causal reasoning, in practice there are many concerns about its adoption, ranging from ethical issues to the representativeness of the generated population with respect to the observed population.
The most common types of randomised designs are:

• Bernoulli trials: the treatment is assigned by tossing a fair coin for each unit. Though this is the most intuitive design, it is exposed to a high risk of unhelpful assignments.

• Completely randomised experiments: classical randomised experiments where the number of treated units is fixed and each unit has the same assignment probability, not necessarily equal to 0.5. In this design, the risk is that the treatment assignment neglects covariate effects.

• Stratified (or conditionally) randomised experiments: they partition the population of units into blocks (strata), e.g. according to covariates (for instance, age).

• Paired randomised experiments: two units (chosen on the basis of the covariates) per block; one is treated and the other is not.
13.5.5.1 Estimation of the treatment effect
Suppose we want to estimate the average treatment effect (13.5.2) in a completely randomised experiment where the treatment x is binary. In a population of N units, this is given by

τ = ( Σ_{i=1}^N (y_i^(1) − y_i^(0)) ) / N

Unfortunately, this quantity cannot be computed directly, since for all i one of the two terms is missing. Let us then consider an estimator

τ̂ = ( Σ_{i=1}^N (ŷ_{i1} − ŷ_{i0}) ) / N   (13.5.6)

Such an estimator can be implemented if the terms ŷ are observable (i.e. dependent on the observations y_i) and is unbiased if

E[τ̂] = τ   (13.5.7)

Since from (13.5.1) y_i = x_i y_i^(1) + (1 − x_i) y_i^(0), we put

ŷ_{i1} = w_i x_i y_i^(1),   ŷ_{i0} = w_i (1 − x_i) y_i^(0)   (13.5.8)
In order to guarantee the unbiasedness (13.5.7), the following constraints have to be satisfied:

E[ Σ_{i=1}^N ŷ_{i1} ] = Σ_{i=1}^N y_i^(1),   E[ Σ_{i=1}^N ŷ_{i0} ] = Σ_{i=1}^N y_i^(0)

Note that the first constraint, because of the unconfoundedness (13.5.5) and of (13.5.8), boils down to

E[ Σ_{i=1}^N ŷ_{i1} ] = E[ Σ_{i=1}^N w_i x_i y_i^(1) ] = Σ_{i=1}^N w_i E[x_i] y_i^(1) = Σ_{i=1}^N y_i^(1)

and analogously the second.
Since the treatment x is binary, E[x_i] = P(x_i = 1). To ensure (13.5.7) we must then weigh each treated example with a weight equal to 1/P(x_i = 1) and each untreated example with 1/P(x_i = 0). Since in a completely randomised experiment N_1 + N_0 = N, we have

P(x_i = 1) = N_1/N,   P(x_i = 0) = N_0/N

and from (13.5.6) and (13.5.8) we obtain

τ̂ = ( Σ_{i=1}^N (ŷ_{i1} − ŷ_{i0}) ) / N
  = ( Σ_{i=1}^N ( x_i y_{i1}/(N_1/N) − (1 − x_i) y_{i0}/(N_0/N) ) ) / N
  = ( Σ_{j=1}^{N_1} y_{j1} ) / N_1 − ( Σ_{j=1}^{N_0} y_{j0} ) / N_0
Note that this boils down to increasing the weight of the observed treated cases by a factor equal to N/N_1. In general, the lower the probability of an observation, the higher its weight should be in the estimator of the treatment effect.
This derivation shows that in a randomised case, it is possible to obtain an
unbiased estimation of the causal effect by a proper weighting of the observations
and without access to counterfactual information.
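A minimal R sketch of this weighting estimator on simulated potential outcomes (a completely randomised design with invented numbers; the true effect is 0.5):

    set.seed(2)
    N <- 1e4; N1 <- 4000                       # N1 treated units out of N
    y0 <- rnorm(N, mean = 1); y1 <- y0 + 0.5   # true effect tau = 0.5
    x <- sample(c(rep(1, N1), rep(0, N - N1))) # completely randomised assignment
    y <- x * y1 + (1 - x) * y0                 # observed outcomes, as in (13.5.1)
    # weights 1/P(x=1) = N/N1 for the treated, 1/P(x=0) = N/(N-N1) otherwise
    sum(x * y / (N1 / N) - (1 - x) * y / ((N - N1) / N)) / N
    mean(y[x == 1]) - mean(y[x == 0])          # same value: difference of means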
13.5.5.2 Stratified (or conditionally) randomised experiments
This class of randomised experiments is particularly important since it can be used as a template to deal with observational studies. In a stratified randomisation setting, the population is first partitioned into blocks (or strata) so that the units within each block are similar in terms of the covariates expected to be predictive of the potential outcomes. The simplest case is when we stratify the population according to gender (two blocks) to estimate the effect of a medical treatment.
Within each block, a completely randomised experiment is performed, and the relative size of the treatment groups is the same over the strata. Analogously to the completely randomised case, the estimate of the treatment effect is obtained as the weighted average of the treatment effect over each stratum, with a weight proportional to the population of the stratum.
13.5.6 Observational study
Unlike in a randomised experiment, in an observational study the functional form of the assignment mechanism is (at least partially) unknown. This means that we run the risk of systematic outcome differences between the treatment groups which are not due to the treatment exposure (Figure 13.7). The differences between the groups might be due to confounding variables (e.g. age, genetic susceptibility to cancer) rather than to the treatments themselves.
The rationale of the potential outcome approach is to make the observational setting conform as much as possible to the experimental one in order to make causal inference possible. Since an investigator cannot assign the treatment in nonexperimental studies, (s)he must rely on the remaining degree of freedom: the selection of subjects. It is then necessary to make the assumption that the observational study is a conditionally randomised experiment or, equivalently, that for a given set of covariates the unconfoundedness hypothesis (13.5.5) (i.e. assignment independent of the potential outcomes given the covariates) holds. Without such an assumption (and in the absence of alternatives), we would have no idea of how to use observational values for causal inference.
If unconfoundedness holds, within a subpopulation defined by the same covariates (e.g. females or males), the difference between the distributions of the observed outcomes (i.e. treated and untreated) is an unbiased estimator of the treatment effect, since both treated and control units are assumed to be random samples from this subpopulation. This makes the fact that we do not know the assignment mechanism a priori irrelevant.
Unconfoundedness is a formal version of the principle stating that you should compare "like with like". As Rubin said: "it would make little sense to compare disease rates in well-educated non-smokers and poorly educated smokers".
Note, however, that, unfortunately, such an assumption is hardly testable in practice: no information in the observed data can tell us whether it holds or not. At the same time, the validity of a causal estimation based on observational data relies on the hypothesis that we are conditioning on the right covariates, i.e. the ones that make treatment and potential outcomes independent. Once conditioned on those covariates, it is expected that the remaining degree of variability related to the treatment choice (though not random as in randomised studies) will be unrelated to the potential outcomes.
In practice, the choice of those variables is not easy, notably when the number of confounding variables is large or when most of them are continuous. The literature abounds in generic recommendations (e.g. variables that are affected by the treatment, like intermediate outcomes, should not be included), but no formal procedure is proposed.
13.5.7 Strategies for estimation in observational studies
If the unconfoundedness assumption is granted, the causal effect estimation may proceed according to four strategies [163]:

1. Model-based imputation: impute the missing potential outcomes by building and using a model to predict what would have happened to a unit had it been exposed to the alternative treatment.

2. Weighting.

3. Blocking.

4. Matching methods: impose the same distribution of the control and case populations with respect to some confounding factors.

The first method requires a model of the potential outcomes; the other three can be implemented before seeing any outcome data. We will briefly discuss here only matching, which refers to a variety of procedures that restrict and reorganise the original sample for statistical analysis. Matching is a way of discarding observations so that the remaining data show good balance and overlap. The simplest form is one-to-one matching, where the data points are divided into pairs: each pair contains both a treatment and a control unit, and the two units are as similar as possible on the pre-treatment variables. For instance, if the i-th unit is associated with z_i, we match it to a unit whose set of covariates z_j is as close as possible to z_i. The difference between the observed outcomes associated with those matched units is supposed to be an accurate estimate of the causal effect. In multivariate settings, the similarity is typically measured by metrics like the Mahalanobis distance (9.2.64). Note, however, that, though the notion of similarity is intuitive, its adoption in a large variate case is non-trivial, as discussed in Section 12.1.
In order to avoid the issue of specifying which variables to use for matching, a well-known alternative is the adoption of propensity score modelling [86]. This approach relies on classification techniques (e.g. logistic regression) to estimate from data the propensity score Prob{x = 1 | z}, that is, a compact way to summarise high-variate covariates in a single value. Note that the target of a propensity-score classification model is not the outcome y but the treatment value x. However, the aim of this approach is not to predict the treatment but to define a metric over the observations in the space of covariates: observations with similar estimated propensity scores have similar profiles in terms of the covariates z and should then be candidates for matching.
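As a hedged illustration (the data-generating process and all variable names below are invented), the propensity score can be estimated by logistic regression and each treated unit matched to the control with the nearest estimated score:

    set.seed(3)
    N <- 2000
    z1 <- rnorm(N); z2 <- rnorm(N)                    # covariates
    x <- rbinom(N, 1, plogis(z1 + 0.5 * z2))          # confounded assignment
    y <- 2 * z1 + z2 + 0.5 * x + rnorm(N)             # true treatment effect: 0.5
    e <- fitted(glm(x ~ z1 + z2, family = binomial))  # estimated propensity score
    # one-to-one nearest-neighbour matching on the score (with replacement)
    treated <- which(x == 1); control <- which(x == 0)
    match <- sapply(treated,
                    function(i) control[which.min(abs(e[control] - e[i]))])
    mean(y[treated] - y[match])                       # approx 0.5
    mean(y[x == 1]) - mean(y[x == 0])                 # naive estimate, biased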
13.6 From potential outcomes to graphical models
The potential outcomes approach played a major role in the history of causality in science. Its main merit was to define under which circumstances it is possible to transpose methods from the randomised setting to the observational one. However, the opacity of the notion of unconfoundedness in the observational setting is probably the most relevant Achilles' heel of this approach, since no procedural way exists to assess the validity of such a condition, in particular in large variate problems.
According to J. Pearl [146], most investigators have difficulty in understanding what ignorability means and tend to assume that it is automatically satisfied, or that it is the more likely to be satisfied the larger the number of covariates. This led to the naive (and somewhat dangerous) illusion that adding more covariates causes no harm. The lack of a formal algorithmic way to define the conditioning variables needed to assess causal effects is the main reason for introducing Pearl's work on causal graphical models.
We introduced in Chapter 4 the notion of graphical models, notably DAGs, stressing their efficient representation of large variate distributions and the bijective mapping between graphical properties (d-separation) and probabilistic properties (conditional independence). What is also relevant here is their role from a causal perspective. DAGs are a formalism that encodes and visualises the link between (conditional) statistical associations and causal mechanisms. As such, it allows the prediction of the impact of manipulations and interventions on the probability distribution. Last but not least, it provides visual support to recognise and avoid common mistakes in causal reasoning. Overall, DAGs are a very convenient formalism to represent the following properties of causal relationships:

1. Transitivity: if event A causes event B and event B causes event C, then it must also be true that A causes C.

2. Locality: if A causes C only through the effect of an intermediate B, then the causal influence is blocked once the event B is kept fixed.

3. Irreflexivity: an event cannot cause itself.

4. Asymmetry: if A is a cause of B (i.e. one sees changes of B when changing A), then B cannot be a cause of A (i.e. one cannot expect changes of A when changing B). Note that this does not exclude temporal feedback loops.
13.7 Causal Bayesian network
A Causal Bayesian Network (CBN) is a graphical model where the notion of edge has a specific causal meaning and is then semantically richer than in conventional Bayesian Networks, where it merely represents a probabilistic dependency.

Definition 7.1 (Causal BN). A BN is causal if, for each edge x_i → x_j, the variable x_i is one of the direct causes of x_j (Definition 3.1).

Unlike graphical models, which are used to encode an (in)dependence structure, causal graphical models support a stronger interpretation: the manipulation of x_i induces a change in the distribution of x_j, whatever the value of all the other variables. This implies that, thanks to the notion of d-separation (Section 4.3.1), it is possible to associate testable conditional-independence patterns to specific causal patterns. For instance, the pattern where x1 ⊥⊥ x2 holds but x1 ⊥⊥ x2 | y does not is associated with the common-effect causal pattern of Figure 13.8, where conditioning on a collider (or on one of its descendants) makes the two causes conditionally dependent. Other interesting patterns are encoded in the DAGs of Figure 13.9. On the left we have a variable y which is a common cause of x1 and x2: x1 and x2 are dependent, but x1 ⊥⊥ x2 | y. On the right we have the causal chain pattern x1 → y → x2: x1 and x2 are dependent, but x2 ⊥⊥ x1 | y.
Causal patterns: examples
We may visualise those dependency patterns and their sensitivity to conditioning by running the scripts collid.R and chain.R. For instance, the leftmost plot in Figure 13.10 represents the pattern of Figure 13.8: the variables x1 and x2 are independent (black dots), but they become dependent when we restrict the dataset by conditioning (red dots).
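The scripts are distributed with the book; for the reader who does not have them at hand, a minimal reconstruction of the collider case (my own sketch, the original collid.R may differ) is:

    set.seed(4)
    N <- 5000
    x1 <- rnorm(N); x2 <- rnorm(N)     # two independent causes
    y <- x1 + x2                       # common effect (collider)
    cor(x1, x2)                        # approx 0: marginal independence
    keep <- abs(y) < 0.1               # conditioning on y = 0
    cor(x1[keep], x2[keep])            # approx -1: induced dependence
    plot(x1, x2, pch = 20)
    points(x1[keep], x2[keep], col = "red", pch = 20)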
•
Figure 13.8: Common effect causal pattern.
Figure 13.9: Left: common cause pattern. Right: causal chain pattern.
[Figure panels, left to right: COLLIDER (y = x1 + x2), FORK (x1 = y + e1, x2 = y^3 + e2) and CHAIN (y = x1 + e1, x2 = 2y + e2); in each panel x2 is plotted against x1 and the red dots are obtained by conditioning on y = 0.]
Figure 13.10: Visualisation of three causal patterns in terms of dependency (black
dots) and conditional dependency (red dots).
Figure 13.11: Causal Bayesian Network.
13.7.1 Causal networks and Structural Causal Models
It may be shown that for every DAG characterised by a distribution P, there exists a set of equations, called a Structural Causal Model (SCM), that generates a distribution identical to P.

Definition 7.2 (Structural causal model). An SCM consists of n equations of the form

x_i = f_i(π_i, w_i),   i = 1, . . . , n   (13.7.9)

where π_i stands for the set of variables judged to be the immediate causes of x_i and w_i represents the noise due to omitted factors.

A structural model is thus a set of n functions f_i. If the noise terms w_i are jointly independent, the model is called Markovian. This implies that no noise term (unobserved variable) influences more than one observed variable. This assumption is also known as causal sufficiency and means that we observe all the relevant variables. Note that the symbol = in (13.7.9) should be read as an assignment symbol (e.g. the symbol := in programming languages) rather than as an algebraic equality.
The SCM corresponding to the CBN of Figure 13.11 is

x = w_x
y = w_y
z = f_z(x, y, w_z)
k = f_k(z, w_k)
v = f_v(y, w_v)
w = f_w(v, w_w)

We see that each edge in the DAG is associated with a function in the SCM.
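Reading = as an assignment makes the SCM directly executable. The sketch below samples from the pre-intervention distribution of the network of Figure 13.11; the functional forms and the Gaussian noise terms are arbitrary illustrative choices.

    set.seed(5)
    N <- 1e4
    x <- rnorm(N)             # x = w_x
    y <- rnorm(N)             # y = w_y
    z <- x + y + rnorm(N)     # z = f_z(x, y, w_z)
    k <- 2 * z + rnorm(N)     # k = f_k(z, w_k)
    v <- -y + rnorm(N)        # v = f_v(y, w_v)
    w <- v + rnorm(N)         # w = f_w(v, w_w)
    # do(z = 1): replace only the mechanism of z and re-run its descendants
    z <- rep(1, N)
    k <- 2 * z + rnorm(N)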
13.7.2 Pre and post-intervention distributions
The essence of causal reasoning with CBNs is that the observational (pre-intervention) and the post-intervention distributions are not necessarily the same, yet they are related, and their relation is made explicit by the causal graph. A causal graph is not only a model of the (in)dependences in an observational setting: it also indicates how those (in)dependences change as a consequence of experiments and interventions.
Figure 13.12: DAG surgery: on the left (right) the BN associated with the pre- (post-)intervention distribution, i.e. with P(y | x = x) (P(y | do(x = x))).
Consider for instance two different causal configurations: x1 → y → x2 and x1 ← y → x2. Suppose we are interested in the post-intervention distribution of y once we act on (i.e. manipulate) x1, while still observing x2. The post-intervention distribution in the first configuration is given by the conditional distribution p(y | x1, x2); the post-intervention distribution in the second configuration is the distribution p(y | x2), which no longer depends on x1. In this second configuration, intervening on x1 removed the probabilistic dependency between x1 and y (i.e. after the intervention the value of x1 no longer provides any information about y). At the same time, the observation of x2 is still informative about y.
If we consider instead the post-intervention distribution of x2 under an intervention on y, this is the same (and equal to p(x2 | y)) in both causal configurations.
It is important then, in a general setting, to understand how to move from pre-intervention (observational) data to a post-intervention setting. The major advantage of DAGs is that organising causal knowledge in a graphical manner allows predicting the effect of external interventions with a minimum of extra information. Interventions can be modelled by removing links or, equivalently, by adding conditional independence relationships. This makes explicit the difference between observing and doing: observing does not change the causal structure, while a do() action induces a "surgery" change in the topology of the DAG or, equivalently, modifies a set of functions in the SCM (Figure 13.12).
In particular, we define as atomic the intervention (or manipulation) where a variable x_i is forced to take the value x_i (denoted by do(x_i = x_i)). The atomic intervention pulls x_i out of the SCM functional mechanism x_i = f_i(...) and places it under the influence of a new mechanism that sets x_i = x_i, while keeping all the other mechanisms unperturbed. This amounts to removing the equation x_i = f_i(...) from the corresponding SCM and replacing it with the equation x_i = x_i. In plain words, the effect of a manipulation is to disconnect the manipulated variables from their natural causes.
13.7.3 Causal effect estimation and identification
The post-intervention distribution is required to estimate the causal (treatment) effect

P(y = y | do(x = x′)) − P(y = y | do(x = x′′))
Unfortunately, the post-intervention distribution is not observable, so it is essential to try to answer the following question: "Can the controlled (post-intervention) distribution be estimated from data governed by the pre-intervention distribution?" This is the problem of identification.
A fundamental theorem in causal analysis states that such identification is feasible whenever the model is Markovian, i.e. the graph is acyclic (containing no directed cycles) and all the error terms are jointly independent.

Theorem 7.3 (Adjustment for direct causes [146]). Let π_i be the set of direct causes of x_i and let y be any other variable disjoint from π_i ∪ {x_i}. The post-intervention distribution of y is

P(y = y | do(x_i)) = Σ_{π_i} P(y = y | x_i, π_i) P(π_i)   (13.7.10)

where P(y = y | x_i, π_i) and P(π_i) are pre-intervention distributions.
The essence of this theorem is that it transforms a causal statement (the left-hand term of (13.7.10)), i.e. a statement that cannot be directly estimated from observations, into a probabilistic statement (the right-hand term), i.e. a statement that can be estimated from observational data. The post-intervention distribution is obtained by conditioning P(y = y | x_i) further on the parents of x_i and then averaging the results, where the weights are given by the prior probability of π_i. In the case of multiple interventions, the following general rule holds:

Theorem 7.4 (Truncated factorisation). For any Markovian model, the distribution generated by an intervention do(X_S = X_0) on a set X_S of endogenous variables is given by the truncated factorisation

P(x_1, x_2, . . . , x_k | do(X_S = X_0)) = Π_{i ∉ S} P(x_i | π_i) |_{X_S = X_0}

where the π_i are the parents of x_i and the P(x_i | π_i) are pre-intervention distributions.

In other words, we have to remove from the factorisation all the factors associated with the intervened variables (the members of X_S).
Example
Let us consider the problem of computing the post-intervention distribution in two different (yet related) trivariate causal configurations. In Figure 13.13 we have a causal configuration (check the analogy with Figure 13.5) where z confounds the effect of x on the outcome y.
The pre-intervention distribution P (associated with the left side of Figure 13.13) is

P(y = y | x = x) = Σ_z P(y = y | x = x, z = z) P(z = z | x = x)   (13.7.11)

The post-intervention distribution P′ (associated with the right side of Figure 13.13), obtained by removing all the edges pointing towards x, is

P(y = y | do(x)) = P′(y = y | x = x)
   = Σ_z P′(y = y | x = x, z = z) P′(z = z | x = x)
   = Σ_z P′(y = y | x = x, z = z) P′(z = z)
   = Σ_z P(y = y | x = x, z = z) P(z = z) ≠ P(y = y | x = x)   (13.7.12)
Figure 13.13: Confounding configuration.
Figure 13.14: Intermediary configuration.
This is in agreement with Theorem 7.3 and shows the difference between the conditional distribution P(y = y | x = x) and the causal intervention P(y = y | do(x)).
Let us now consider the causal configuration (check the analogy with Figure 13.6) where z plays the role of intermediary between x and y (Figure 13.14). Since the pre- and post-intervention DAGs coincide, the pre- and post-intervention distributions do as well:

P(y = y | do(x)) = P(y = y | x = x)

This means that, in order to measure the causal effect of x on y in the configuration of Figure 13.14, conditioning is not required. Note that the difference between those two configurations (i) is only due to the different underlying causal structures and (ii) could not be detected by relying on conditional independence tests at the observational level (see the Simpson paradox in Section 13.2.1).
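The difference between (13.7.11) and (13.7.12) can be checked on simulated binary data (the parameters below are invented): the adjusted estimate recovers P(y = 1 | do(x = 1)), while the naive conditional estimate is biased by the backdoor path through z.

    set.seed(6)
    N <- 1e5
    z <- rbinom(N, 1, 0.5)                      # confounder
    x <- rbinom(N, 1, 0.2 + 0.6 * z)            # z -> x
    y <- rbinom(N, 1, 0.1 + 0.3 * x + 0.4 * z)  # x -> y <- z
    mean(y[x == 1])                             # P(y=1|x=1): approx 0.72, biased
    sum(sapply(0:1, function(zz)                # adjustment (13.7.10):
      mean(y[x == 1 & z == zz]) * mean(z == zz)))  # approx 0.6 = P(y=1|do(x=1))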
•
13.7.3.1 Backdoor criterion
We discussed in Section 13.6 that the unconfoundedness assumption, essential for estimating causal effects in an observational study, is non-testable in the potential outcomes approach. Theorem 7.3 shows that we indeed need a causal graph to perform causal reasoning. But what about the most general case, e.g. non-observable direct parents (Figure 13.15)? How do we select a set of variables (also called a sufficient set) such that, by conditioning on it, we can estimate the causal effect with no bias? The backdoor criterion [146] is a graphical criterion which allows finding such a sufficient set.

Figure 13.15: Unobserved parent configuration: z is a latent variable. [146]

Figure 13.16: What is the causal effect of x on y?

Definition 7.5. A set of variables Z is admissible (or "sufficient") for adjustment if two conditions hold:

1. No element of Z is a descendant of the treatment x.

2. The elements of Z block (Definition 3.5) all "backdoor" paths from x to y, namely all paths that end with an arrow pointing to x.
Given a sufficient set Z, the average causal effect of x (boolean) on y (boolean) is

Prob{y = 1 | do(x = 1)} − Prob{y = 1 | do(x = 0)}
   = Σ_Z [ Prob{y = 1 | x = 1, Z = Z} − Prob{y = 1 | x = 0, Z = Z} ] Prob{Z = Z}

In plain words, the sufficient set takes the role of the parents in (13.7.10). By conditioning on this set, the observational distribution may be used to estimate causal effects.
Exercise
Compute the backdoor sets for measuring the causal effects of x on y and of w on y in the causal structure of Figure 13.16.
•
The rationale of the backdoor criterion is that the total association between treatment and effect is a composition of the true causal effect and of the non-causal association generated by the backdoor paths. The backdoor paths in the diagram carry spurious associations from x to y, while the paths along the arrows from x to y carry causative associations. The causal effect of x on the outcome y is identifiable if all the spurious paths are blocked and no new spurious paths are created. Blocking the backdoor paths (by conditioning on Z) ensures that the observed association between x and y equals the causal effect of x on y. This criterion can be applied systematically to diagrams of any size and shape, avoiding the ambiguity related to unconfoundedness seen in the potential outcomes framework.
Figure 13.17: Application of rule R2.
13.7.3.2 Beyond sufficient set: do-calculus
Adjusting for sufficient covariates is only one of many methods that permit us to estimate causal effects in nonexperimental studies. Pearl (1995a) has presented examples in which such a set of variables does not exist (or is not observable) and where the causal effect can nevertheless be estimated consistently. This led to the frontdoor criterion (based on an intermediary variable)⁵. All those results were masterfully regrouped by J. Pearl into a set of rules which constitute the foundations of the calculus of intervention.
A causal effect of x on y is identifiable when the expression containing the do(x = x) operator can be transformed into an expression containing only conventional probabilistic statements. Let G_x̄ (G_x̲) be the graph obtained by deleting from G all the arrows pointing to (emerging from) x. The three rules are:

R1. P(y | do(x), z, w) = P(y | do(x), w) if (y ⊥⊥ z | w, x) in G_x̄

R2. P(y | do(x), do(z), w) = P(y | do(x), z, w) if (y ⊥⊥ z | w, x) in G_x̄z̲

R3. P(y | do(x), do(z), w) = P(y | do(x), w) if (y ⊥⊥ z | w, x) in G_x̄z̄(w), where z(w) is the subset of z nodes that are not ancestors of any w node in G_x̄
Rule R1 formalises the notion of surgery, while rule R2 formalises the notion of backdoor, since it may be used to transform an action into an observation. An example of the adoption of rule R2 is in Figure 13.17, where P(y | do(z), w) = P(y | z, w) since (y ⊥⊥ z | w) in G_z̲ (as shown by d-separation in the lower graph G_z̲). Rule R3 may be used to remove a do() operator, e.g.

P(y | do(z)) = P(y)

if (y ⊥⊥ z) in G_z̄, i.e. a cause y is not affected by the manipulation of its descendant z.

⁵ For more details on an ingenious interpretation of the smoking/cancer causal relation in terms of the front-door criterion, we refer to [146].
13.7.4 Selection bias
Figure 13.18: Selection bias: conditioning on a descendant of the outcome. [163]

Figure 13.19: Selection bias: spurious association due to selection. [163]

Another important insight of CBNs into the risks of associative data analysis is related to the notion of selection bias. Selection bias refers to any association created as
a result of the process by which individuals are selected into the analysis. In more general terms, it refers to the biases that arise from conditioning on a common effect of two variables, one of which is either the treatment or a cause of the treatment, while the other is either the outcome or a cause of the outcome. Selection bias in observational studies may be due to the design (e.g. enrolment criteria) or unintended. Typical cases of bias are due to censoring (informative censoring) or to the restriction of the study to volunteering persons (self-selection bias). Note also that the risk of selection bias is not limited to the observational setting: randomisation protects against confounding but not against selection bias when the selection occurs after the randomisation.
Below we present several examples of selection bias, taken from [163], which illustrate the risk of spurious associations due to implicit conditioning. In Figure 13.18 we see an example of selection bias due to conditioning on a descendant y* (e.g. obtained by thresholding) of the outcome y. The relation between x and u is unconfounded, but the choice of limiting the training set to outcome values smaller than a threshold may induce selection bias.
Figure 13.19 represents a causal graph where x stands for college education, i for the presence of impaired memory (a precursor of Alzheimer's disease) and y denotes the Alzheimer pathology. s is a variable denoting which examples were included in the study (0 stands for not selected). Suppose we pool two datasets: persons with college education (x = 1) and persons with impaired memory (i = 1). The variable s is then the logical OR of x and i. In the resulting dataset, the patients with x = 0 will necessarily have i = 1 and then y = 1. As a consequence of the related selection bias, a negative association between education and Alzheimer's will appear, in spite of their independence, made visible by d-separation.
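The spurious association of Figure 13.19 is easy to reproduce in a simulation (x and i are generated independently; for simplicity the pathology is assumed to follow the impairment deterministically):

    set.seed(7)
    N <- 1e4
    x <- rbinom(N, 1, 0.5)    # college education
    i <- rbinom(N, 1, 0.3)    # impaired memory, independent of x
    y <- i                    # Alzheimer pathology driven by i only
    cor(x, y)                 # approx 0 in the whole population
    s <- x == 1 | i == 1      # selection: educated OR impaired units
    cor(x[s], y[s])           # negative: association created by selection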
The examples above illustrate two major benefits of a causal representation of the mechanism underlying the data: DAGs visualise how causal relationships translate into associations and provide a formal tool to detect biased situations (confounding, selection bias). All this, however, requires the capability of drawing (or inferring) a diagram that adequately describes contextually plausible associations.
Another famous example of selection bias occurred in relation to the Challenger Space Shuttle disaster in 1986 [94]. The night before the launch, an emergency meeting took place to understand whether it was dangerous to proceed with the launch given the exceptionally low temperature. The committee concluded that no relationship existed between temperature and past accidents. Unfortunately, it emerged a posteriori that such a conclusion was due to the fact that only data related to past accidents (and hence biased) had been taken into consideration. Had the data related to launches with no damage been included in the analysis as well, the relation would have been easily detected and the tragedy probably avoided⁶.

⁶ https://bookdown.org/egarpor/PM-UC3M/glm-challenger.html
13.8 Counterfactual
A counterfactual question that now you, patient reader of this book, could legitimately ask yourself is: "How much happier would I have been, had I decided NOT to read such a book?" The reasoning needed to answer such a question should rely on

1. your current state, due to the choice of reading this book;

2. the causal model linking the decision and your happiness state (e.g. the one represented in Figure 13.20).

Counterfactual reasoning is probably one of the most important characteristics of human intelligence and refers to the human ability to imagine unseen scenarios and predict unobserved situations. This is the reason why Pearl [149] put counterfactuals at the uppermost level of the reasoning skills about causality (Figure 13.2). While do-operators P(y = y | do(x = 1)) model the average consequence of an intervention (do(x = 1)) on a population, counterfactuals P(y_{x=1} = 1 | e) model the impact of an unseen intervention (do(x = 1)) on the segment of the population defined by some evidence e. Counterfactual reasoning combines the observed evidence with the effect of an intervention in three steps (a numerical sketch follows the list):

1. Abduction: use the evidence e to infer information about the pre-intervention state (e.g. your motivation this morning or the quality of this book);

2. Intervention: modify the structural model to take the action into account;

3. Prediction: combine the pre-intervention distribution with the action to predict what would have happened.
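For a toy linear SCM the three steps can be carried out in a few lines. Below (an invented example, not the model of Figure 13.20), the mechanism is y = x + w: we observe the factual pair (x = 1, y = 3), abduce the background factor w and predict the counterfactual outcome had x been 0.

    # SCM: y = x + w, with w the unobserved background factor
    x_f <- 1; y_f <- 3    # factual evidence e
    w <- y_f - x_f        # 1. abduction: the evidence implies w = 2
    x_cf <- 0             # 2. intervention: do(x = 0) in the twin network
    y_cf <- x_cf + w      # 3. prediction: y would have been 2, not 3
    y_cf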
In order to link the factual and the counterfactual worlds, Pearl proposes the adoption of a twin network merging the distribution of the observational world with the distribution of the world manipulated according to the counterfactual action (in the case of our reader, the decision not to open this book). Such a twin network (Figure 13.21) is a handy way to use the DAG formalism to model and answer counterfactual questions.
Counterfactual example
Consider the talented doctor House, often urged to operate on desperate cases, and the success rate of his operations. Let x = 1 denote the fact of being operated on by doctor House, and y = 1 the death of the patient. Consider the following three differences:

1. P(y = 1 | x = 1) − P(y = 1 | x = 0)
[DAG nodes: Motivation, Quality of the reading, Happiness this morning, Decision, Happiness now.]
Figure 13.20: Causal model of the reader satisfaction.
[Twin network nodes: Motivation, Quality of the reading, Happiness before; factual branch: Decision, Happiness; counterfactual branch: Decision = 0, Happiness (d = 0).]
Figure 13.21: Twin counterfactual model of the reader satisfaction.
2. P(y = 1 | do(x = 1)) − P(y = 1 | do(x = 0)) = P(y^(1) = 1) − P(y^(0) = 1)

3. P(y^(1) = 1 | x = 0) − P(y^(0) = 1 | x = 0)

The first quantity is positive, not because doctor House is a bad doctor, but because he is used to dealing with desperate situations. The second difference is negative, since being operated on by Dr. House reduces the death risk. The third quantity is counterfactual and states what would have happened to a patient who was not operated on by Dr. House, had he indeed been operated on. This quantity should be more negative than the second one, since the observation x = 0 provides additional evidence that the case was not desperate.
•
Counterfactual reasoning in law
It is interesting to see how law directives (e.g. those against discrimination) are often written in counterfactual language: "The central question in any employment-discrimination case is whether the employer would have taken the same action had the employee been of a different race (age, sex, religion, national origin, etc.) and everything else had been the same." In counterfactual notation, this boils down to estimating the quantity

P(y_{x=1} = 1 | x = 0, y = 0)

where y denotes the hiring (y = 0 standing for a refusal) and x the perceived race. The quantity above returns the probability that the refused person (observation y = 0) of race x = 0 would have been hired if the perceived race had been different.
•
13.9 Causal structure identification
The estimation of causal effects requires the availability of a causal graph. Even when experimental interventions are possible, performing the large number of experiments that would be required to discover the causal relationships between tens or thousands of variables is not practical. Causal discovery aims to return plausible explanations of the observable associations in terms of DAGs. The rationale is that some causal relationships can be tested without doing experiments: some causal dependencies can be inferred from non-temporal statistical data if one makes certain simplifying assumptions about the underlying data-generating process.
• Causal Markov assumption (Definition 3.1): all the dependencies (or associations) between the variables are due to causal relations. Note that this is an oversimplification, since dependencies between variables can be generated in non-causal ways as well (see selection bias in Section 13.7.4).

• Faithfulness: all the independencies found in the distribution are due to d-separations in the graph or, equivalently, a d-connection implies a dependency (Section 4.3.2.1). In causal terms, this means that all causal pathways should induce a dependency, though this is not always the case (e.g. two causal pathways could cancel each other out).

• Stability: the set of independencies of the associated distribution depends only on the structure of the graph and not on its parametrisation. Unstable independencies (i.e. dependencies that disappear with a parameter change) are unlikely to occur in the data, so all the independencies are structural.
• Causal sufficiency: a set of variables X is causally sufficient if, for every pair of variables x1, x2 ∈ X, every common direct cause of x1 and x2 is also a member of X. If this is not the case, there are latent variables.
There are two main families of causal structure identification algorithms: score-based and constraint-based. Score-based algorithms search, within a set of candidate structures, for the one which optimises some score function. Commonly used score functions are the z-score of a hypothesis test (e.g. under the assumption of Gaussian linear dependencies), the maximum likelihood or an information-theoretic score (e.g. the BIC score). Such algorithms transform the problem of structure identification into a problem of optimisation. Though a number of state-of-the-art algorithms may be used to address such an optimisation, the large size of the search space makes the approach infeasible even for a moderate number of variables. Constraint-based algorithms are more commonly used and are discussed in the following sections.
13.9.1 Constraint-based approaches
We have extensively discussed in Chapter 4 the relation between topological properties and conditional independence in DAGs. The rationale of the constraint-based approach is to use conditional independence relations as constraints during the learning of the DAG structure from data. The idea is to derive from the data a number of testable implications (notably by estimating conditional independencies) and to use them to disambiguate causal configurations (e.g. directionality) as much as possible. The resulting algorithms iteratively look for a DAG compliant with the statistical constraints and consist of two main steps:

1. Skeleton discovery, compliant with the conditional independence patterns.

2. Orientation, based on v-structures and acyclicity constraints.

The final goal is to discover the class of Markov equivalent DAGs (Definition 3.10) which is consistent with the available dataset.
13.9.1.1 Normal conditional independence test
In order to speed up the conditional independence tests, constraint-based algorithms
often assume a Normal distribution. Consider a multivariate Normal vector $X$ such
that $x_i, x_j \in X$ and $X_S \subset X$, and let $s$ be the dimension of $X_S$. Then
$$\rho_{x_i x_j | X_S} = 0 \Leftrightarrow I(x_i, x_j | X_S) = 0$$
The sample partial correlation $\hat{\rho}_{x_i x_j | X_S}$ can be computed by regression, by inversion
of the covariance matrix or recursively (Section 3.8.3). To test the null hypothesis
$\rho_{x_i x_j | X_S} = 0$, Fisher's z-transform
$$Z = \frac{1}{2} \log \frac{1 + \hat{\rho}_{x_i x_j | X_S}}{1 - \hat{\rho}_{x_i x_j | X_S}}$$
is typically considered. The null hypothesis is rejected at significance level $\alpha$
(false positive rate) if
$$\sqrt{N - s - 3}\, |Z| > \Phi^{-1}(1 - \alpha/2)$$
where $\Phi(\cdot)$ is the cumulative distribution function of a standard Normal variable (Section 3.4.2).
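R script
The following minimal base R sketch (not part of the gbcode package; all names are illustrative) implements the test above: the partial correlation is obtained by regressing out the conditioning set and correlating the residuals, and Fisher's z-transform is compared with the Normal quantile.

set.seed(0)
N <- 200
xS <- rnorm(N)                        # conditioning set of dimension s = 1
xi <- xS + rnorm(N)
xj <- xS + rnorm(N)                   # xi and xj are independent given xS
ri <- resid(lm(xi ~ xS))              # regress out the conditioning set
rj <- resid(lm(xj ~ xS))
rho.hat <- cor(ri, rj)                # sample partial correlation
s <- 1
Z <- 0.5 * log((1 + rho.hat) / (1 - rho.hat))   # Fisher's z-transform
alpha <- 0.05
sqrt(N - s - 3) * abs(Z) > qnorm(1 - alpha / 2) # TRUE would reject independence
•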
Let H be the empty graph over X // forward strategy
for each pair (x_i, x_j) in X do
  Search X_S such that I(x_i, x_j | X_S) = 0 // conditional independence
  if no X_S exists then
    connect (x_i, x_j) in H
  else
    S_{i,j} = X_S
  end if
end for
Figure 13.22: IC algorithm: S_{i,j} stores the set of variables that make x_i and x_j
conditionally independent.
Let H be the complete undirected graph over X // backward strategy
for each pair (x_i, x_j) in H do
  for X_S ⊆ X \ {x_i, x_j} do
    if I(x_i, x_j | X_S) = 0 then
      Delete edge (i, j) from H
      S_{i,j} = X_S
      break
    end if
  end for
end for
Figure 13.23: SGS algorithm.
13.9.1.2 Skeleton discovery
The first example of a constraint-based algorithm is the IC algorithm (Figure 13.22),
which adopts a forward and exhaustive approach. It starts with the empty graph
and, for each pair $(x_i, x_j)$ of nodes, it adds a connecting edge if no conditioning set
makes them d-separated; otherwise it stores the separating set in the variable $S_{i,j}$.
The number of tests in the worst case is bounded by
$$\binom{n}{2} 2^{n-2}$$
where $n$ is the number of variables.
Given the high computational cost of this approach, some IC variants start with
the complete undirected graph (backward) or limit the conditioning size. An example
is the SGS algorithm (Figure 13.23), which requires, however, an exponential
search. In order to bound the computational complexity, the PC (Peter and Clark)
algorithm [172] (Figure 13.24) sets a maximum size $L$ on the conditioning set. For
a given $L$, the number of tests is bounded polynomially by
$$2 \binom{n}{2} \sum_{i=0}^{L-1} \binom{n-2}{i} \le 3 n^2 (n-2)^L$$
13.9.1.3 Dealing with immoralities in the skeleton
A potential immorality in the skeleton is a triplet of variables such that xi −xk − x j
but no edge exists between xi and xj (i.e I ( xi ;xj |xk )> 0 or equivalently xk / ∈Si,j )
It is then possible to extend the skeleton retrieved in the previous section by
marking the immoralities, orienting the edges of the associated collider and then
Let H be the complete undirected graph over X
for l = 0 to L do
  for each adjacent pair (x_i, x_j) in H do
    for X_S ⊆ (N(x_i) ∪ N(x_j)) \ {x_i, x_j} such that |X_S| = l do
      if I(x_i, x_j | X_S) = 0 then
        Delete edge (i, j) from H
        S_{i,j} = X_S
        break
      end if
    end for
  end for
end for
Figure 13.24: PC algorithm: N(x_i) denotes the set of neighbours of the node x_i and
L the maximum conditioning set size.
Let H be the skeleton
for each non-adjacent pair (x_i, x_j) with a common neighbour x_k in H do
  if x_k ∉ S_{i,j} then
    Add arrowheads pointing to x_k (v-structure)
  end if
end for
Orient as many undirected edges as possible by avoiding 1) new v-structures
and 2) directed cycles
Figure 13.25: Edge orientation strategy.
obtaining a Partially Directed Acyclic Graph (PDAG). This is the first part of the
edge orientation strategy in Figure 13.25. Once the immoralities are identified, we
have identified the equivalence class of the resulting PDAG.
The second part of the edge orientation strategy (Figure 13.25) relies on the
following rules:
• R1: Orient j − k into j → k whenever there is an arrow i → j and i, k are not
adjacent.
• R2: Orient i − j into i → j whenever i → k → j.
• R3: Orient i − j into i → j whenever i − k → j and i − l → j and k, l are not
adjacent.
• R4: Orient i − j into i → j whenever i − k → l and k → l → j and k, l are
not adjacent.
PC example
The diagrams in Figure 13.26B-E illustrate the steps of the PC algorithm reconstructing
the ground-truth DAG in Figure 13.26A [80]. Step B relies on the
independence $x \perp\!\!\!\perp y$. Step C is due to the relations $x \perp\!\!\!\perp w | z$ and $y \perp\!\!\!\perp w | z$. In D, the
algorithm creates a collider from the triplet $x - z - y$ since no edge $x - y$ exists.
Step E implements the R1 orientation rule.
•
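R script
As a hedged illustration (assuming the CRAN package pcalg is installed; the simulated DAG below mirrors the structure of Figure 13.26A), the PC algorithm with the Gaussian conditional independence test of Section 13.9.1.1 can be run as follows:

library(pcalg)                       # assumed available
set.seed(0)
N <- 1000
x <- rnorm(N); y <- rnorm(N)         # x and y independent
z <- x + y + rnorm(N)                # collider x -> z <- y
w <- z + rnorm(N)                    # z -> w
d <- data.frame(x, y, z, w)
suffStat <- list(C = cor(d), n = nrow(d))   # sufficient statistics for Fisher z
pc.fit <- pc(suffStat, indepTest = gaussCItest,
             alpha = 0.05, labels = colnames(d))
print(pc.fit)                        # estimated equivalence class (CPDAG)
•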
Figure 13.26: PC algorithm [80]. Diagram A denotes the true DAG. Diagrams B
to F represent the inferred structure at different PC steps. The first step corresponds
to the complete undirected graph. The last two steps are performed by edge
orientation.
13.9.1.4 Limitations
Constraint-based algorithms are well-known structure identification algorithms and
have been intensively used in many practical domains. Nevertheless, they suffer
from some limitations:
• I-equivalence classes: two graphs that are I-equivalent (Definition 3.8) cannot
be distinguished by constraint-based approaches without resorting to manipulative
experimentation or temporal information.
• Conditional independence: the use of conditional independence tests requires
an assumption about the form of the dependence (e.g. linear, as in Section 13.9.1.1,
or nonlinear). In the nonlinear case, conditional tests may be particularly expensive.
• Finite-sample error propagation: because of the sequential nature of constraint-based
algorithms, errors (false positives) propagate in a way that is hard to
monitor and control.
• Asymptotic correctness only: if the conditional independence decisions are
correct, the PC algorithm is guaranteed to converge to the true Markov equivalence
class in the large sample limit, assuming i.i.d. samples and the Markov,
Faithfulness and Sufficiency assumptions.
• Curse of dimensionality: the exponential complexity of the algorithm
(Figure 13.27) makes those algorithms inadequate in large dimensional settings
(e.g. bioinformatics).
13.10 Beyond conditional independence
The importance of causal reasoning in large dimensional settings (notably bioinformatics)
paved the way for developing alternative strategies for performing causal
inference from data. We will limit ourselves here to sketching two main approaches: the use
Figure 13.27: Constraint-based algorithm complexity: upper bound on the number
of conditional independence tests (log scale) as a function of the problem dimension $n$
and of the conditioning size $L$ (curves for $L = 5, 10, 15$).
of causal feature selection strategies and the adoption of data-driven techniques to
deal with indistinguishable situations.
13.10.1 Causality and feature selection
Feature selection and causal inference are related by the notion of Markov blanket
(MB) [182]. A Markov blanket (Definition 8.4) is the smallest set of strongly relevant
variables [118], i.e. variables containing information about the target which cannot
be obtained from any other variable (Definition 8.1). The MB of a target contains
its direct causes (parents), direct effects (children) and spouses (nodes that share a
child with the target). Feature selection techniques, being able to discriminate
between causes and effects, may then play a major role in causal modelling.
Tsamardinos et al. [182] proposed several algorithms to identify the Markov
blanket of a target variable. Pellet and Elisseeff [151] proposed an algorithm in
two steps: the first builds an approximate structure of the causal graph using a
feature selection algorithm, and the second improves it by local adjustments and
orientations.
Most existing causal feature selection algorithms [88] decompose feature selection
and causality and rely on conditional independence tests to orient arcs and detect
causal relationships. Nevertheless, not many approaches address settings with a large
feature-to-sample ratio. The author proposed in [33] a causal filter which integrates
a relevance component and a causal component into the cost function and addresses
the issues of large feature-to-sample ratio settings. The algorithm, called
mIMR, is a causal extension of the mRMR algorithm (Section 12.8.2) and was
successfully applied to bioinformatics applications [32].
13.10.2 Beyond observational equivalence
Approaches like Bayesian networks or mIMR rely on notions of (conditional) independence
and faithfulness to detect causal patterns in the data. They cannot
deal with indistinguishable (or equivalent) configurations (Section 4.3.3) like the
two-variable setting and the completely connected triplet configuration.
However, in what follows, we will see that indistinguishability does not prevent
Figure 13.28: An example of Markov blanket in a causal Bayesian network [90].
the existence of statistical algorithms able to reduce the uncertainty about the causal
pattern. Note that this is a recent result, in contrast with the common belief that
it is impossible to learn a causal graph with two variables (as stated for instance
by Wasserman in [192]). Indeed, this impossibility is limited to the Gaussian case
where the two variables are linked by a linear relation: in this case the joint Gaussian
distribution is fully determined by the mean and covariance matrix, and there is no
way to reconstruct directionality from the distribution. However, asymmetry in the
dependence structure is a fundamental property that distinguishes causation from
association. It follows that, under some specific constraints (e.g. non-Gaussian noise
or nonlinear relationships), some asymmetric features of the distribution (beyond
(in)dependence relations) might be informative about the causal structure.
13.10.2.1 Learning directionality in bivariate associations
A bivariate association between two variables x and y is necessarily due to one of
the following reasons:
1. a causal influence going from x to y,
2. a causal influence going from y to x,
3. a (possibly unobserved) common cause (confounder) of x and y (note that
time can also play the role of confounder),
4. a (possibly unobserved) common effect of x and y (inducing selection bias),
or any combination of them.
In recent years, several approaches addressed the two-variable setting (e.g.
ANM [104] and IGCI [54]) by using asymmetric statistical properties (e.g. due
to non-Gaussian noise or nonlinear mappings) to detect causal patterns. An additive
noise model (ANM) dependence between two variables x and y satisfies the
Figure 13.29: Illustration of asymmetric effects due to the existence of a causal
dependency $x \to y$ [104].
following relation:
$$y = f(x) + w, \qquad w \perp\!\!\!\perp x$$
where the noise term $w$ is independent of the input $x$. Consider a bivariate training
set related to two random variables x and y, and suppose that either $x \to y$ or $y \to x$
holds. The algorithm proposed in [104] is based on the idea that if $x \to y$ is additive,
the independence between cause and noise does not hold for $y \to x$. The algorithm
steps are:
1. Regress both y on x and x on y.
2. Compute the residuals $w_y$ and $w_x$ of the two regressions.
3. Compute two independence tests (e.g. HSIC): $w_y \perp\!\!\!\perp x$ and $w_x \perp\!\!\!\perp y$.
4. Compare the two independence test statistics $\hat{C}_{x \to y}$ and $\hat{C}_{y \to x}$ to determine
whether $x \to y$ is more probable than $y \to x$, or vice versa.
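R script
A minimal sketch of these four steps (illustrative choices: loess smoothers for the regressions and the HSIC test of the CRAN package dHSIC, assumed installed):

library(dHSIC)                        # assumed available for the HSIC test
set.seed(0)
N <- 300
x <- rnorm(N)
y <- x^3 + rnorm(N)                   # ground truth: x -> y with additive noise
wy <- resid(loess(y ~ x))             # steps 1-2: residuals in both directions
wx <- resid(loess(x ~ y))
Cxy <- dhsic.test(wy, x)$statistic    # step 3: HSIC statistics (small = independent)
Cyx <- dhsic.test(wx, y)$statistic
if (Cxy < Cyx) "x -> y preferred" else "y -> x preferred"   # step 4
•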
A further important step in this direction has been made by the ChaLearn cause-effect
pair challenge [87], where participants had to learn from data the answer to
questions like "Does altitude cause temperature or vice versa?"^7. Hundreds of pairs
of real variables with known causal relationships from several domains (chemistry,
climatology, ecology, economy, engineering, epidemiology, genomics, medicine) were
made available to competitors for training their models. Real data were intermixed
with controls (pairs of independent variables and pairs of variables that are dependent
but not causally related) and semi-artificial cause-effect pairs (real variables
mixed in various ways to produce a given outcome). The good accuracy obtained
by several competitors showed that learning strategies can address indistinguishable
configurations with success (or at least significantly better than random). This
competition opened the way to a recent research direction which poses causal inference
as the problem of learning to classify probability distributions [154]. In those
approaches, causal inference is typically based on two steps:
1. a featurisation of the observed dataset,
2. the training of a binary classifier to distinguish between causal directions.
^7 See the YouTube video "CauseEffectPairs" by Isabelle Guyon at https://www.youtube.com/watch?v=pgoZ5lmRRvE.
Existing approaches differ mainly in the featurisation step: for instance, [127] pro-
posed an approach based on kernel mean embeddings.
The author proposed instead a machine learning approach based on mutual information
in [34, 35], called D2C. Given two variables, the D2C approach infers, from
a number of asymmetric statistical features of the $n$-variate distribution, the probability
of the existence of a directed causal link. Causal inference is then addressed
as a supervised learning task where: i) the inputs are asymmetric features describing
the probabilistic dependency and ii) the output is a class denoting the existence of the
causal link.
Once sufficient training data are made available, conventional feature selection
algorithms and classifiers can be used to return a prediction. The rationale of those
approaches is that, though "correlation does not imply causation", it happens that
"causation creates dependence" (if faithfulness holds). In other terms, causality
leaves footprints (e.g. asymmetric descriptors) in the statistical distribution that
can be reused to reduce the uncertainty about the existence (or directionality) of a
causal relationship.
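R script
A schematic sketch of the two-step strategy (the descriptors below are illustrative asymmetric features, not the actual D2C descriptors):

set.seed(0)
featurise <- function(x, y) {        # step 1: asymmetric descriptors of the pair
  wy <- resid(lm(y ~ x)); wx <- resid(lm(x ~ y))
  c(skw.x = mean(scale(x)^3), skw.y = mean(scale(y)^3),
    dep.y = abs(cor(wy^2, x)), dep.x = abs(cor(wx^2, y)))
}
make.pair <- function() {            # simulated labelled cause-effect pair
  x <- rnorm(200); y <- x^2 + 0.5 * rnorm(200)
  if (runif(1) < 0.5) list(f = featurise(x, y), lab = 1)  # direction x -> y
  else list(f = featurise(y, x), lab = 0)                 # reversed pair
}
prs <- replicate(500, make.pair(), simplify = FALSE)
X <- do.call(rbind, lapply(prs, `[[`, "f"))
lab <- factor(sapply(prs, `[[`, "lab"))
model <- glm(lab ~ ., data = data.frame(X, lab), family = binomial)  # step 2
•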
13.11 Concluding remarks
A final chapter on causal inference in an introductory machine learning book could
appear as an unnecessary burden for an already exhausted reader. In fact, the
author deems that conventional machine learning books tend to create in the reader
a feeling of overconfidence about the power of predictive models. On the contrary,
in many crucial real-life applications, the real question at stake is causal (e.g. will
the lockdown policy have an impact on epidemics?) and not associative. Only an
understanding of causal relationships can support a reliable prediction of how a
system will behave once subject to intervention.
It is then important to stress the difference between the notion of probability
conditional on an observation ($\text{Prob}\{y | x = x\}$) and probability conditional on a
manipulation ($\text{Prob}\{y | \text{do}(x) = x\}$). Those two quantities are different, and confusing
them may have disastrous consequences (e.g. overconfidence) in terms of decision
making. Though machine learning provides many solutions to the algorithmic estimation
of $\text{Prob}\{y | x = x\}$, practitioners should be encouraged to ask whether
$\text{Prob}\{y | x = x\}$ is the quantity they are really interested in. For instance, though a
trader would be more than happy to improve the estimation of $\text{Prob}\{y | x = x\}$
(where x stands for today's stock price and y for tomorrow's), this quantity
would be of little use for an economist aiming to predict the impact of a Tobin-like
tax on the markets^8.
Notions like Simpson's paradox, confounding, selection bias, latent variables and
counterfactuals should sound like a "forewarned is forearmed" message to machine
learning practitioners (and over-optimistic deep-learning evangelists :-). Accurate
prediction (whatever the depth of your network) implies neither accurate understanding
nor good decision making. So, the old data mining adage "We are
drowning in data and starving for knowledge" should instead read "We are drowning
in associations and starving for causality".
^8 For a thorough analysis of the difference between predictive modelling and causal
modelling, we refer the reader to [173].
Chapter 14
Conclusions
We have come to the end, almost. We will take a few words to remind you that
machine learning is not perfect and to cover a few ethical considerations. Then
we will conclude with some take-home messages and final recommendations.
14.1 About ML limitations
From the dawn of the AI discipline, machine learning has been considered a key
component of autonomous intelligent agents. In recent years, though full-fledged
artificial intelligence does not seem within reach yet, machine learning has found great
success thanks to its data-driven and (apparently) assumption-free nature.
This book insisted on the fact that no modelling effort may be completely
assumption-free. Assumptions (though often implicit and hard-coded in the algorithms)
are everywhere in the learning process, from problem formulation to data
collection, model generation and assessment! When such assumptions happen to
match reality, the resulting method is successful; if this is not the case, the result
may be disappointing (see the NFL theorem).
Another misjudgment about machine learning is to consider it a reliable proxy
of human learning. Machine learning owes its success to a generic and effective way
of transforming a learning problem into a (stochastic) optimisation one. Machines
do not think or learn like us: if this is the key to their success, it also makes them
fragile. A large part of human rational decision making and understanding
cannot be reduced to the optimisation of a cost function.
This has been put in evidence by a recent trend taking a critical attitude
towards the machine learning approach (e.g. the i.i.d. assumption) and its limits.
For instance, research on adversarial learning aims to show that limited training
and validation sets may induce very optimistic expectations about the generalisation
of learners. Recent research showed that automatic learners which appear to be
accurate emulators of human knowledge (e.g. in terms of classification accuracy)
may be easily fooled once required to work in specific situations. A well-known
example (Figure 14.1) shows that deep learning classifiers, able to reach an almost
100% accuracy rate in recognising animal images, may return pitiful predictions
when confronted with properly tweaked inputs [67]. Though this seems anecdotal,
such vulnerabilities of learning machines could be very dangerous in safety-critical
settings (e.g. self-driving cars).
Another interesting research direction is the study of the robustness of automatically
learned models in settings which are not identical to the one used for training,
e.g. because of nonstationarity and concept drift. This is particularly critical in
health problems, where models returning high-quality predictions for a specific cohort
(e.g. in a given hospital) miserably fail when tested on different patients (e.g.
from another hospital). How to transfer learned models to close settings is then
another hot topic in recent learning research. In this context, a causal interpretation
of data generation could play an important role in reducing the risk of drift and
increasing model stability.
Figure 14.1: Bad generalisation in front of adversarial examples.
14.2 A bit of ethics
Last but not least, a word on ethics should not hurt in a book for computer scientists.
"Data-driven" does not necessarily mean "objective". Machine learning models
predict what they have been trained to predict, and their forecasts are only as good
as the data used for their training. In that sense, machine learning can reinforce human
prejudices if trained on biased datasets derived from human decisions. Feeding
learners with biased data can have dangerous consequences [141]. In 2016 the Twitter
chatbot Tay began uttering racist statements after a single day of interaction. The
predictive justice software COMPAS, which decides whether a suspect should be incarcerated
before trial or not, has been accused by an NGO of being racially biased.
In 2015, Google Photos identified two African American people as "gorillas".
Every ML practitioner should be aware that even models developed with the
best of intentions may exhibit discriminatory biases, perpetuate inequality, or perform
less well for historically disadvantaged groups^1. Recent efforts in modelling
and introducing fairness in machine learning (most of the time based on causal
considerations) are then more than welcome.
At the same time, the problem goes beyond the scientific and technical realm
and involves human responsibility. Automating a task is a responsible decision-making
act which implies encoding (implicitly or explicitly) ethical priorities in an
autonomous agent. A self-driving car that decides to brake (or not to brake) is
somewhat trading the cost of a human life against the cost of an over-conservative action.
The choice of entrusting to machines tasks that could have an impact on human
security or human sensibility should never exempt humans (from the programmer
to the decision maker) from the legal and moral responsibility for possible errors.
To conclude, the ethical dilemma of ML may be summarised by the contraposition
of the two citations at the beginning of this book: on the one hand, any
machine learning endeavour harbours the ambition (or the illusion) of catching the
essence of reality with numbers or quantitative tools. On the other hand, "not
everything that counts" (notably ethics) can be counted or easily translated into
numerical terms.
^1 See also the presentation https://tinyurl.com/y4ld3ohz
14.3 Take-home notions
Quoting Einstein: "the supreme goal of all theory is to make the irreducible basic
elements as simple and as few as possible without having to surrender the adequate
representation of a single datum of experience". This sentence probably captures
the primary take-home concept in machine learning: trade-off, notably the trade-off
between bias and variance, underfitting and overfitting, parametric and nonparametric,
false positives and false negatives, type I and type II errors, ... (please add).
Other key notions the author would like you to remember (or revise) are:
• information theory: it is a powerful language to talk about stochastic dependence,
• estimators: do not forget that they are random variables with their own sampling
distribution and that, even if very (very) good, they may sometimes be wrong^2,
• conditional probability: supervised learning is all about estimating it,
• conditional (in)dependence and its non-monotonicity: mastering the complexity
(and beauty) of high dimensionality goes that way.
14.4 Recommendations
We would like then to end this manuscript not by selling you a unique and superior
way of proceeding in front of data, but by proposing some golden rules for anyone
who would like to adventure into the world of statistical modelling and data analysis:
• However complex your learning algorithm is (adaptive, deep or preferred by
GAFA), do not forget that it is an estimator and that, as such, it makes
assumptions (often implicitly). Each approach has its own assumptions! Be
aware of them before using one.
• Simpler things first! According to Wasserman [192], using fancy tools like
neural nets ... without understanding basic statistics is like doing brain surgery
before knowing how to use a band-aid.
• Reality is probably almost always nonlinear, but a massive amount of (theoretical
and algorithmic) results exists only for linear methods.
• Expert knowledge MATTERS... But data too :-)
• It is better to be confident with a number of alternative techniques (preferably
linear and nonlinear) and to use them in parallel on the same task.
• Resampling and combining are at the forefront of data analysis techniques.
Do not forget to test them when you have a data analysis problem.
• Do not be religious about learning/modelling techniques. The best learning
algorithm does NOT exist.
• Statistical dependency does not imply causality, though it may shed some
light on it.
and the best motto for a machine learner:
Once you stop learning, you start dying (Albert Einstein).
^2 ... even the divine Roberto Baggio missed a penalty in the 1994 FIFA World Cup final against
Brazil :-(
Appendix A
Unsupervised learning
A.1 Probability density estimation
Probability density estimation is the problem of inferring a probability density function
$p_z$, given a finite number of data points $\{z_1, z_2, \dots, z_N\}$ drawn from that density function.
We distinguish three alternative approaches to density estimation:
Parametric. This approach assumes a parametric model of the unknown density.
The parameters are estimated by fitting the parametric function to the
observed dataset. This approach has been extensively discussed in Chapter 5.
Nonparametric. This approach does not assume any a priori form of the density model.
The form of the density is entirely determined by the data, and the number of
parameters grows with the size of the dataset.
Semi-parametric. In this approach the number of parameters is not fixed a priori but is
independent of the size of the dataset.
A.1.1 Nonparametric density estimation
The term nonparametric is used to describe probability density functions whose functional
form is not specified in advance but is dependent on the data [162, 144].
Let us consider a random variable $z$ with probability density $p_z(z)$ and a region $R$
defined on the $z$ space. The probability that a value $z$ drawn according to $p_z(z)$ falls
inside $R$ is
$$P_R = \text{Prob}\{z \in R\} = \int_R p_z(z)\, dz \qquad (A.1.1)$$
Let us define by $k$ the random variable representing the number of points which
fall within $R$ after we have drawn $N$ points from $p_z(z)$ independently.
From (C.1.1) we have that its probability distribution is
$$p_k(k) = \frac{N!}{k!(N-k)!} P_R^k (1 - P_R)^{N-k} \qquad (A.1.2)$$
Moreover, the random variable $k/N$ satisfies
$$E[k/N] = P_R \qquad (A.1.3)$$
and
$$\text{Var}[k/N] = E[(k/N - P_R)^2] = \frac{P_R (1 - P_R)}{N} \qquad (A.1.4)$$
Since, according to (A.1.4), the variance of $k/N$ converges to zero as $N \to \infty$, it is
reasonable to expect that the fraction $k/N$ returns a good estimate of the probability $P_R$:
$$P_R \cong \frac{k}{N} \qquad (A.1.5)$$
At the same time, if we assume that $p_z(z)$ is continuous and does not vary appreciably
over $R$, we can approximate $P_R$ with
$$P_R = \int_R p(z)\, dz \cong p(z) V \qquad (A.1.6)$$
where $V$ is the volume of $R$. From (A.1.5) and (A.1.6) it follows that, for values of $z$ inside $R$,
$$p(z) \cong \frac{k}{NV} \qquad (A.1.7)$$
In order for (A.1.5) to hold, a large $R$ is required; this implies a sharply peaked $p_z(z)$.
In order for (A.1.6) to hold, a small $R$ is required; this ensures a $p_z(z)$
approximately constant in $R$. These are two clashing requirements. We deduce that it is necessary to
find an optimal trade-off for $R$ in order to guarantee a reliable estimation of $p_z(z)$. This
issue is common to all nonparametric approaches to density estimation.
In particular, we will introduce two of them:
Kernel-based. This approach fixes $R$ and searches for the optimal number of points $k$.
k-Nearest Neighbor (k-NN). This approach fixes the value of $k$ and searches for the
optimal $R$.
The two approaches are discussed in detail in the following sections.
A.1.1.1 Kernel-based methods
Consider a random vector $z$ of dimension $[n \times 1]$ and suppose we take a hypercube region
$R$ with sides of length $B$ centred on the point $z$. The volume of $R$ is
$$V = B^n$$
Let us now define a kernel function (or Parzen window) $K(u)$ as
$$K(u) = \begin{cases} 1 & \text{if } |u_j| < 1/2, \quad j = 1, \dots, n \\ 0 & \text{otherwise} \end{cases} \qquad (A.1.8)$$
where $u_j$ is the $j$-th component of the vector $u$. It follows that the quantity
$$K\left(\frac{z - z_i}{B}\right)$$
is equal to unity if $z_i$ is inside the hypercube centred at $z$ with side $B$.
Therefore, given a set of $N$ points, the number of points falling inside $R$ is given by
$$k = \sum_{i=1}^N K\left(\frac{z - z_i}{B}\right) \qquad (A.1.9)$$
From (A.1.7) and (A.1.9) it is possible to define the kernel-based estimate of the probability
density for the kernel (A.1.8) as
$$\hat{p}(z) = \frac{\sum_{i=1}^N K\left(\frac{z - z_i}{B}\right)}{N B^n} \qquad (A.1.10)$$
Note that the estimate (A.1.10) is discontinuous over the $z$-space. In order to smooth
it, we may choose alternative kernel functions, such as the Gaussian kernel. The kernel-based
method is a traditional approach to density estimation. However, this approach has two
main shortcomings:
1. it returns a biased estimator [25];
2. it requires the memorisation of the whole set of observations; as a consequence, the
estimation is very slow when the number of observations is high.
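R script
A minimal univariate (n = 1) implementation of the estimator (A.1.10); the bandwidth B is an illustrative choice:

parzen <- function(z, data, B) {
  k <- sum(abs((z - data) / B) < 1/2)   # points in the hypercube of side B
  k / (length(data) * B)                # k / (N * B^n) with n = 1
}
set.seed(0)
DN <- rnorm(500)                        # N points from N(0, 1)
zz <- seq(-3, 3, by = 0.1)
phat <- sapply(zz, parzen, data = DN, B = 0.5)
plot(zz, phat, type = "s")              # note the discontinuities
lines(zz, dnorm(zz), lty = 2)           # true density for comparison
•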
A.1.1.2 k-Nearest Neighbors methods
Consider a hypersphere $R$ centred at a point $z$, and let us grow it until it contains
$k$ points. Using Eq. (A.1.7) we can derive the k-Nearest Neighbor (k-NN)
density estimate
$$\hat{p}_z(z) = \frac{k}{NV} \qquad (A.1.11)$$
where $k$ is a parameter fixed a priori, $N$ is the number of available observations
and $V$ is the volume of the hypersphere.
Like kernel-based methods, k-NN is a state-of-the-art technique in density estimation.
However, it features two main shortcomings:
1. the quantity (A.1.11) is not properly a probability density, since its integral over the whole
$\mathcal{Z}$ space is not equal to one but diverges;
2. as in the kernel method, it requires the storage of the whole dataset.
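R script
Analogously, a univariate sketch of the estimator (A.1.11), where the volume V is twice the distance to the k-th nearest neighbour (the value of k is an illustrative choice):

knn.dens <- function(z, data, k) {
  V <- 2 * sort(abs(data - z))[k]   # grow the interval until it contains k points
  k / (length(data) * V)
}
set.seed(0)
DN <- rnorm(500)
zz <- seq(-3, 3, by = 0.1)
phat <- sapply(zz, knn.dens, data = DN, k = 20)
plot(zz, phat, type = "l")
lines(zz, dnorm(zz), lty = 2)       # true density for comparison
•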
A.1.2 Semi-parametric density estimation
In semi-parametric techniques the size of the model does not grow with the size of the data
but with the complexity of the problem. As a consequence, the procedure for defining the
structure of the model is more complex than in the approaches previously seen.
A.1.2.1 Mixture models
The unknown density function is represented as a linear superposition of $m$ basis functions.
The distribution is called a mixture model and has the form
$$p_z(z) = \sum_{j=1}^m p(z | j)\, \pi(j) \qquad (A.1.12)$$
where $m$ is a parameter of the model and typically $m \ll N$. The coefficients $\pi(j)$ are
called mixing coefficients and satisfy the following constraints:
$$\sum_{j=1}^m \pi(j) = 1, \qquad 0 \le \pi(j) \le 1 \qquad (A.1.13)$$
The quantity $\pi(j)$ is typically interpreted as the prior probability that a data point is
generated by the $j$-th component of the mixture. According to Bayes' theorem, the corresponding
posterior probability is
$$p(j | z) = \frac{p(z | j)\, \pi(j)}{p(z)} = \frac{p(z | j)\, \pi(j)}{\sum_{j=1}^m p(z | j)\, \pi(j)} \qquad (A.1.14)$$
Given a data point $z$, the quantity $p(j | z)$ represents the probability that the component $j$
was responsible for generating $z$.
An important property of mixture models is that they can approximate any continuous
density with arbitrary accuracy, provided the model has a sufficient number of components
and provided the parameters of the model are tuned correctly.
Let us consider a Gaussian mixture model with components $p(z | j) \sim \mathcal{N}(\mu_j, \sigma_j^2)$ and
suppose that a set of $N$ observations is available. Once the number of basis functions is fixed,
the parameters to be estimated from data are the mixing coefficients $\pi(j)$ and the terms
$\mu_j$ and $\sigma_j$.
The maximum likelihood estimation of a mixture model is not simple,
due to the existence of local minima and singular solutions. Standard nonlinear optimisation
techniques can be employed, once the gradients of the log-likelihood with respect to the
parameters are given. However, there exist algorithms which avoid the complexity of a
nonlinear estimation procedure. One of them is the EM algorithm, which is introduced
in the following section.
A.1.2.2 The EM algorithm
The expectation-maximisation (EM) algorithm [57] is a simple and practical method for
estimating the mixture parameters while avoiding complex nonlinear optimisation algorithms.
The assumption of the EM algorithm is that the available dataset is incomplete. This
incompleteness can either be due to some missing measurements or arise because some imaginary
data are introduced to simplify the mathematical form of the likelihood function.
The second situation is assumed to hold in the case of mixture models. The goal of
the EM algorithm is then to maximise the likelihood of the parameters of a mixture model,
assuming that some data are missing from the available dataset.
The algorithm has an iterative form in which each iteration consists of two steps: an
expectation calculation (E step) and a maximisation (M step). It has been shown in the
literature that the EM iterates converge to a local maximum of the likelihood
of the incomplete data.
Assume that there exists a statistical model of our dataset $D_N$ parametrised
by a real vector $\theta$. Assume also that further data, denoted by $\Xi$, exist but are not observable.
The quantity $\Delta_N$ is used to denote the whole dataset, containing both the observed
and unobserved data, and is usually referred to as the complete data.
Let us denote by $l_{\text{comp}}(\theta)$ the log-likelihood of the parameter $\theta$ given the complete data.
This is a random variable because the values of $\Xi$ are not known. Hence, it is possible, for a
given value $\theta^{(\tau)}$ of the parameter vector, to compute the expected value of $l_{\text{comp}}(\theta^{(\tau)})$. This
gives a deterministic function of the current value of the parameter, denoted $Q(\theta^{(\tau)})$,
that can be considered an approximation to the real value of $l$, called the incomplete
likelihood. The maximisation step is expected to find the parameter value $\theta^{(\tau+1)}$ which
maximises $Q$. The EM procedure in detail is the following:
1. Make an initial estimate $\theta^{(0)}$ of the parameter vector.
2. The log-likelihood $l_{\text{comp}}(\theta^{(\tau)})$ of the parameters $\theta^{(\tau)}$ with respect to the complete
data $\Delta_N$ is calculated. This is a random function of the unknown dataset $\Xi$.
3. The E-step: the expectation $Q(\theta^{(\tau)})$ of $l_{\text{comp}}(\theta^{(\tau)})$ is calculated.
4. The M-step: a new estimate of the parameters is found by the maximisation
$$\theta^{(\tau+1)} = \arg\max_\theta Q(\theta) \qquad (A.1.15)$$
The theoretical justification comes from the following result proved in [57]: for a sequence
$\theta^{(\tau)}$ generated by the EM algorithm, it always holds for the incomplete likelihood that
$$l(\theta^{(\tau+1)}) \ge l(\theta^{(\tau)}) \qquad (A.1.16)$$
Hence the EM algorithm is guaranteed to converge to a local maximum of the incomplete
likelihood.
A.1.2.3 The EM algorithm for the mixture model
In the mixture model estimation problem, determining the parameters (i.e.
the mixing coefficients and the parameters of the density $p(z | j)$ in Eq. (A.1.12)) would be
straightforward if we knew which component $j$ was responsible for generating each data
point in the dataset. We therefore consider a hypothetical complete dataset in which each
data point is labelled by the component which generated it. Thus, for each point $z_i$ we
introduce $m$ indicator random variables $\zeta_{ij}$, $j = 1, \dots, m$, such that
$$\zeta_{ij} = \begin{cases} 1 & \text{if } z_i \text{ is generated by the } j\text{-th basis} \\ 0 & \text{otherwise} \end{cases} \qquad (A.1.17)$$
Let $\Delta_N$ be the extension of the dataset $D_N$, i.e. the complete dataset,
including the unobservable $\zeta_{ij}$. The probability distribution for each $(z_i, \zeta_i)$ is either zero
or $p(z_i | j)$. If we let $\zeta_i$ represent the set $\{\zeta_{i1}, \zeta_{i2}, \dots, \zeta_{im}\}$, then
$$p_{\zeta_i}(\zeta_i) = \pi(j_0) \quad \text{where } j_0 \text{ is such that } \zeta_{ij_0} = 1 \qquad (A.1.18)$$
so
$$p(z_i, \zeta_i) = p(\zeta_i)\, p(z_i | j_0) = \pi(j_0)\, p(z_i | j_0) = \prod_{j=1}^m [\pi(j)\, p(z_i | j)]^{\zeta_{ij}} \qquad (A.1.19)$$
Thus the complete log-likelihood is given by
$$l_{\text{comp}}(\theta) = \ln L_{\text{comp}}(\theta) = \ln \prod_{i=1}^N \prod_{j=1}^m [\pi(j)\, p(z_i | j)]^{\zeta_{ij}} \qquad (A.1.20)$$
$$= \sum_{i=1}^N \ln \prod_{j=1}^m [\pi(j)\, p(z_i | j)]^{\zeta_{ij}} \qquad (A.1.21)$$
$$= \sum_{i=1}^N \sum_{j=1}^m \zeta_{ij}\, \{\ln \pi(j) + \ln p(z_i | j)\} \qquad (A.1.22)$$
where the vector $\theta$ includes the mixing coefficients and the parameters of the density $p(z | j)$
in Eq. (A.1.12).
By introducing the terms $\zeta_{ij}$, the logarithm can be brought inside the summation. The
cost of this algebraic simplification is that we do not know the values of the $\zeta_{ij}$ for
the training data. At this point the EM algorithm can be used. For a value $\theta^{(\tau)}$ of the
parameters the E-step is carried out:
$$Q(\theta^{(\tau)}) = E[l_{\text{comp}}(\theta^{(\tau)})] = E\left[\sum_{i=1}^N \sum_{j=1}^m \zeta_{ij}\, \{\ln \pi(j) + \ln p(z_i | j)\}\right] \qquad (A.1.23)$$
$$= \sum_{i=1}^N \sum_{j=1}^m E[\zeta_{ij}]\, \{\ln \pi(j) + \ln p(z_i | j)\} \qquad (A.1.24)$$
Since
$$E[\zeta_{ij}] = P(\zeta_{ij} = 1 | z_i) = \frac{p(z_i | \zeta_{ij})\, P(\zeta_{ij})}{p(z_i)} = \frac{p(z_i | j)\, \pi(j)}{p(z_i)} = p(j | z_i) \qquad (A.1.25)$$
from Eq. (A.1.14) and (A.1.18) we have
$$Q(\theta^{(\tau)}) = \sum_{i=1}^N \sum_{j=1}^m p(j | z_i)\, \{\ln \pi(j) + \ln p(z_i | j)\} \qquad (A.1.26)$$
The M-step maximises $Q$ with respect to the whole set of parameters $\theta$, but it is known
that this can be done individually for each parameter if we consider a Gaussian mixture
model
$$p(z | j) = \frac{1}{(2\pi\sigma_j^2)^{n/2}} \exp\left(-\frac{(z - \mu_j)^2}{2\sigma_j^2}\right) \qquad (A.1.27)$$
In this case we have:
$$Q(\theta^{(\tau)}) = \sum_{i=1}^N \sum_{j=1}^m p(j | z_i)\, \{\ln \pi(j) + \ln p(z_i | j)\} \qquad (A.1.28)$$
$$= \sum_{i=1}^N \sum_{j=1}^m p(j | z_i) \left\{\ln \pi(j) - n \ln \sigma_j^{(\tau)} - \frac{(z_i - \mu_j^{(\tau)})^2}{2 (\sigma_j^{(\tau)})^2}\right\} + \text{constant} \qquad (A.1.29)$$
We can now perform the maximisation (A.1.15). For the parameters $\mu_j$ and $\sigma_j$ the
maximisation is straightforward:
$$\mu_j^{(\tau+1)} = \frac{\sum_{i=1}^N p(j | z_i)\, z_i}{\sum_{i=1}^N p(j | z_i)} \qquad (A.1.30)$$
$$\left(\sigma_j^{(\tau+1)}\right)^2 = \frac{1}{n} \frac{\sum_{i=1}^N p(j | z_i)\, (z_i - \mu_j^{(\tau+1)})^2}{\sum_{i=1}^N p(j | z_i)} \qquad (A.1.31)$$
For the mixing parameters the procedure is more complex [25] and returns:
$$\pi(j)^{(\tau+1)} = \frac{1}{N} \sum_{i=1}^N p(j | z_i) \qquad (A.1.32)$$
where $p(j | z_i)$ is computed as in (A.1.25).
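R script
The updates (A.1.30)-(A.1.32) translate directly into the following sketch for a univariate (n = 1) mixture with m = 2 components (initialisation and number of iterations are illustrative):

set.seed(0)
z <- c(rnorm(150, -2, 0.7), rnorm(250, 3, 1))  # sample from a known mixture
m <- 2; N <- length(z)
pi.j <- rep(1/m, m); mu <- c(-1, 1); sg <- c(1, 1)  # initial estimates
for (iter in 1:100) {
  ## E-step: posterior responsibilities p(j | z_i), Eq. (A.1.25)
  P <- sapply(1:m, function(j) pi.j[j] * dnorm(z, mu[j], sg[j]))
  P <- P / rowSums(P)
  ## M-step: Eqs. (A.1.30), (A.1.31) and (A.1.32)
  Nj <- colSums(P)
  mu <- colSums(P * z) / Nj
  sg <- sqrt(colSums(P * (z - rep(mu, each = N))^2) / Nj)
  pi.j <- Nj / N
}
round(rbind(pi.j, mu, sg), 2)   # compare with the generating parameters
•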
A.2 K-means clustering
The K-means algorithm partitions a collection of $N$ vectors $x_i$, $i = 1, \dots, N$, into $K$ groups
$G_k$, $k = 1, \dots, K$, and finds a cluster centre in each group such that a cost function based on a
dissimilarity (or distance) measure is minimised. When the Euclidean distance is chosen
as the dissimilarity measure between a vector $x$ in the $k$-th group and the corresponding
cluster centre $c_k$, the cost function can be defined as
$$J = \sum_{k=1}^K J_k = \sum_{k=1}^K \sum_{x \in G_k} d(x, c_k) \qquad (A.2.33)$$
where $J_k$ is the cost function within group $k$ and $d$ is a generic distance function
$$d(x, c_k) = (x - c_k)^T M (x - c_k) \qquad (A.2.34)$$
where $M$ is the distance matrix. The partitioned groups are typically defined by a membership
$[K \times N]$ matrix $U$, whose element $u_{ki}$ is 1 if the $i$-th data point $x_i$ belongs to
group $k$, and 0 otherwise. The matrix $U$ satisfies the following conditions:
$$\sum_{k=1}^K u_{ki} = 1 \quad \forall i = 1, \dots, N, \qquad \sum_{k=1}^K \sum_{i=1}^N u_{ki} = N \qquad (A.2.35)$$
Once the cluster centres $c_k$ are fixed, the terms $u_{ki}$ which minimise Eq. (A.2.33) are
$$u_{ki} = \begin{cases} 1 & \text{if } d(x_i, c_k) \le d(x_i, c_j) \text{ for each } j \ne k \\ 0 & \text{otherwise} \end{cases} \qquad (A.2.36)$$
This means that $x_i$ belongs to group $k$ if $c_k$ is the closest centre among all centres.
Once the terms $u_{ki}$ are fixed, the optimal centre $c_k$ that minimises Eq. (A.2.33) is the
mean of all vectors in the $k$-th group:
$$c_k = \frac{1}{|G_k|} \sum_{x \in G_k} x \qquad (A.2.37)$$
where $|G_k|$ is the size of $G_k$.
The K-means algorithm determines iteratively the cluster centres $c_k$ and the membership
matrix $U$ using the following procedure:
1. Initialise the cluster centres $c_k$, typically by randomly selecting $K$ points among all
data points.
2. Evaluate the membership matrix $U$ through Eq. (A.2.36).
3. Compute the cost (A.2.33). If it is below a certain tolerance value or if the improvement
over the previous iteration is not significant, stop and return the centres and the groups.
4. Update the cluster centres according to Eq. (A.2.37). Go to step 2.
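R script
A compact implementation of steps 1-4 with the Euclidean distance (M = I); the stopping rule is illustrative (base R also provides kmeans() for the same task):

set.seed(0)
X <- rbind(matrix(rnorm(100, 0), ncol = 2), matrix(rnorm(100, 4), ncol = 2))
K <- 2
C <- X[sample(nrow(X), K), , drop = FALSE]   # step 1: random initial centres
for (iter in 1:50) {
  D2 <- sapply(1:K, function(k)              # step 2: squared distances (A.2.36)
    rowSums((X - matrix(C[k, ], nrow(X), 2, byrow = TRUE))^2))
  g <- max.col(-D2)                          # closest centre for each point
  Cnew <- t(sapply(1:K, function(k)          # step 4: group means (A.2.37)
    colMeans(X[g == k, , drop = FALSE])))
  if (max(abs(Cnew - C)) < 1e-8) break       # step 3: stop on convergence
  C <- Cnew
}
table(g)                                     # sizes of the two groups
•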
Some final remarks should be made about the K-means algorithm. Like many other clustering
algorithms, this technique is iterative, and no guarantee of convergence to an optimal
solution exists. Moreover, the final performance is quite sensitive to the initial position
of the cluster centres and to the number $K$ of clusters, typically fixed a priori by the
designer.
Appendix B
Linear algebra notions
Linear algebra, the science of vector spaces, plays a major role in machine learning, where
data are represented in vectorial form. Though the readers are supposed to have attended
numerical analysis classes, here we recall some basic notions of linear algebra. For a
more extensive presentation of linear algebra and its links with machine learning, we refer
the reader to recent references like [2, 56].
B.1 Rank of a matrix
Let us consider a $[N, n]$ matrix $X$. Many definitions exist for the rank of a matrix: here we
limit ourselves to considering the rank of $X$ as the maximal number of linearly independent columns
of $X$. Since the rank of a $[N, n]$ matrix is at most $\min\{N, n\}$, a matrix is full-rank if its
rank is $\min\{N, n\}$. A matrix which is not full-rank is also called rank-deficient.
B.2 Inner product
In linear algebra, the dot product, also known as the scalar or inner product, is an operation
which takes two vectors over the real numbers $\mathbb{R}$ and returns a real-valued scalar quantity.
It is the standard inner product of the orthonormal Euclidean space. The dot product of
two $[n, 1]$ vectors $x = [x_1, x_2, \dots, x_n]^T$ and $y = [y_1, y_2, \dots, y_n]^T$ is defined as
$$\langle x, y \rangle = \sum_{j=1}^n x_j y_j = x^T y \qquad (B.2.1)$$
The dot product underlies the definition of the following quantities:
• the Euclidean norm of a vector $x$:
$$\|x\| = \sqrt{\langle x, x \rangle}, \qquad (B.2.2)$$
also known as the L2 norm,
• the Euclidean distance of two $[n, 1]$ vectors $x_1$ and $x_2$:
$$\|x_1 - x_2\| = \sqrt{\langle x_1 - x_2, x_1 - x_2 \rangle}, \qquad (B.2.3)$$
• the angle $\omega$ between two vectors $x_1$ and $x_2$, which satisfies the relation
$$-1 \le \frac{\langle x_1, x_2 \rangle}{\|x_1\| \|x_2\|} = \cos(\omega) \le 1, \qquad (B.2.4)$$
• the projection of a vector $x_1$ onto a direction $x_2$:
$$\pi_{x_2}(x_1) = \frac{\langle x_1, x_2 \rangle}{\|x_2\|^2}\, x_2 = \frac{x_2 x_2^T}{\|x_2\|^2}\, x_1 \qquad (B.2.5)$$
where the $[n, n]$ matrix $P = \frac{x_2 x_2^T}{\|x_2\|^2}$ is called the projection matrix.
In more qualitative terms, the notion of inner product allows the introduction of a
similarity score between vectors. In this sense, the least similar vectors are two orthogonal
vectors, i.e. two vectors $x$ and $y$ such that $\langle x, y \rangle = 0$ and $\omega = \pi/2$.
Note also that the following relation holds:
$$x x^T y = \langle x, y \rangle\, x$$
B.3 Diagonalisation
A $[N, N]$ matrix $X$ is diagonalisable if there exists an invertible matrix $P$ such that
$$X = P D P^{-1}. \qquad (B.3.6)$$
A symmetric matrix can always be diagonalised, and the diagonal entries of $D$ are its
eigenvalues.
B.4 QR decomposition
Let us consider a $[N, n]$ matrix $X$ with $N \ge n$ and $n$ linearly independent columns. By
Gram-Schmidt orthogonalisation [2] it is possible to write
$$X = QR \qquad (B.4.7)$$
where $Q$ is a $[N, n]$ matrix with $n$ orthonormal columns $q_j$ (i.e. $q_j^T q_j = 1$ and $q_j^T q_k = 0$ if
$j \ne k$) and $R$ is a $[n, n]$ upper-triangular matrix. Since $Q^T Q = I_n$, the pseudo-inverse of
$X$ can be written as
$$X^\dagger = (X^T X)^{-1} X^T = (R^T Q^T Q R)^{-1} R^T Q^T = R^{-1} Q^T \qquad (B.4.8)$$
If $X$ is rank-deficient (i.e. only $n' < n < N$ columns of $X$ are linearly independent), it
is possible to perform the generalised QR decomposition
$$X = QR$$
where $Q$ is $[N, n']$ and $R$ is a $[n', n]$ rectangular upper-triangular matrix with $n' < n$. Since
$R$ is of full row rank, the matrix $R R^T$ is invertible and the pseudo-inverse of $X$ can be
written as
$$X^\dagger = R^T (R R^T)^{-1} Q^T \qquad (B.4.9)$$
also known as the Moore-Penrose pseudo-inverse.
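R script
In base R the decomposition is returned by qr(); a quick numerical check of (B.4.8):

set.seed(0)
X <- matrix(rnorm(20), nrow = 5, ncol = 4)    # full-rank [N, n], N >= n
dec <- qr(X)
Q <- qr.Q(dec); R <- qr.R(dec)
max(abs(Q %*% R - X))                         # ~ 0: X = QR
Xdag <- solve(R) %*% t(Q)                     # R^{-1} Q^T
max(abs(Xdag - solve(t(X) %*% X) %*% t(X)))   # ~ 0: equals (X^T X)^{-1} X^T
•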
B.5 Singular Value Decomposition
Let us consider a $[N, n]$ matrix $X$ with $N \ge n$: such a matrix can always be factorised into
the product of three matrices
$$X = U D V^T \qquad (B.5.10)$$
where $U$ is a $[N, N]$ matrix with orthonormal columns (i.e. $U^T U = I_N$), $D$ is a $[N, n]$
diagonal matrix whose diagonal entries $d_{ii} \ge 0$ are called the singular values, and $V$ is a
$[n, n]$ matrix with orthonormal columns.
It can be shown that the $N$ columns of $U$ (also called the left singular vectors) are the
$N$ eigenvectors of the $[N, N]$ symmetric matrix $X X^T$ and the $n$ columns of $V$ (also called
the right singular vectors) are the $n$ eigenvectors of the $[n, n]$ symmetric matrix $X^T X$.
The non-zero singular values are the square roots of the non-zero eigenvalues of $X^T X$ and
of the non-zero eigenvalues of $X X^T$. This is made evident by the link between SVD and
diagonalisation of $X^T X$:
$$X^T X = (U D V^T)^T (U D V^T) = V D^T U^T U D V^T = V D^T D V^T$$
The SVD of a matrix $X$ of rank $r$ can also be written as
$$X = \sum_{j=1}^r d_{jj}\, u_j v_j^T$$
where $u_j$ is the $j$-th column of $U$ and $v_j$ is the $j$-th column of $V$.
If in the decomposition above we stop at order $r' < r$, we obtain a low-rank
approximation of $X$:
$$X' = \sum_{j=1}^{r'} d_{jj}\, u_j v_j^T.$$
Another common SVD decomposition is the economy (or reduced) SVD:
$$X = U D V^T \qquad (B.5.11)$$
where $k = \min\{N, n\}$, $U$ is $[N, k]$ with orthonormal columns, $D$ is a square $[k, k]$ matrix
and $V$ is a $[n, k]$ matrix with orthonormal columns.
SVD plays an important role in determining the ill-conditioning of a square matrix,
i.e. how close the matrix is to being singular. The condition number of a matrix is the
ratio of its largest singular value to its smallest singular value. The larger this number
(which is $\ge 1$), the larger the ill-conditioning of the matrix.
Note also that if $X$ is a symmetric matrix, the SVD decomposition returns the diagonalisation (B.3.6).
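R script
The corresponding base R check, using the economy SVD returned by svd():

set.seed(0)
X <- matrix(rnorm(30), nrow = 6, ncol = 5)
s <- svd(X)                                  # U [N, k], d, V [n, k], k = min(N, n)
max(abs(s$u %*% diag(s$d) %*% t(s$v) - X))   # ~ 0: X = U D V^T
max(s$d) / min(s$d)                          # condition number
X2 <- s$u[, 1:2] %*% diag(s$d[1:2]) %*% t(s$v[, 1:2])  # rank-2 approximation
•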
B.6 Chain rules of differential calculus
Let $J$ be the scalar function of $\alpha \in \mathbb{R}$:
$$J(\alpha) = f(g(h(\alpha)))$$
where $f, g, h: \mathbb{R} \to \mathbb{R}$ are scalar functions. Then the univariate chain rule is
$$\frac{dJ}{d\alpha} = \frac{dJ}{df} \frac{df}{dg} \frac{dg}{dh} \frac{dh}{d\alpha}$$
Let us consider the function $J: \mathbb{R} \to \mathbb{R}$
$$J = f(g_1(\alpha), g_2(\alpha), \dots, g_n(\alpha))$$
between $\alpha \in \mathbb{R}$ and the scalar $J$, where $g_j: \mathbb{R} \to \mathbb{R}$, $j = 1, \dots, n$. Then the multivariate
chain rule returns the scalar gradient
$$\frac{dJ}{d\alpha} = \sum_{j=1}^n \frac{\partial J}{\partial g_j} \frac{dg_j}{d\alpha}$$
Let $J: \mathbb{R}^n \to \mathbb{R}^m$ be the mapping between an input vector $\alpha \in \mathbb{R}^n$ and an output
vector of size $m$. The associated Jacobian matrix is the $[m, n]$ matrix
$$\nabla_\alpha J = \left[\frac{\partial J(\alpha)}{\partial \alpha_1}, \frac{\partial J(\alpha)}{\partial \alpha_2}, \dots, \frac{\partial J(\alpha)}{\partial \alpha_n}\right] = \begin{bmatrix} \frac{\partial J_1(\alpha)}{\partial \alpha_1} & \frac{\partial J_1(\alpha)}{\partial \alpha_2} & \dots & \frac{\partial J_1(\alpha)}{\partial \alpha_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial J_m(\alpha)}{\partial \alpha_1} & \frac{\partial J_m(\alpha)}{\partial \alpha_2} & \dots & \frac{\partial J_m(\alpha)}{\partial \alpha_n} \end{bmatrix}$$
In the most generic case, suppose that $J: \mathbb{R}^n \to \mathbb{R}^m$, $\alpha \in \mathbb{R}^n$, and
$$J = F_k(F_{k-1}(\dots F_1(\alpha)))$$
where $F_i: \mathbb{R}^{n_i} \to \mathbb{R}^{n_{i+1}}$, $n_1 = n$ and $n_{k+1} = m$. Then the vectored chain rule [2] is
$$\underbrace{\frac{\partial J}{\partial \alpha}}_{[m,n]} = \underbrace{\frac{\partial F_k}{\partial F_{k-1}}}_{[m, n_k]} \underbrace{\frac{\partial F_{k-1}}{\partial F_{k-2}}}_{[n_k, n_{k-1}]} \dots \underbrace{\frac{\partial F_1}{\partial \alpha}}_{[n_2, n]}$$
B.7 Quadratic norm
Consider the quadratic norm
$$J(x) = \|Ax + b\|^2$$
where $J: \mathbb{R}^n \to \mathbb{R}$, $A$ is a $[N, n]$ matrix, $x$ is a $[n, 1]$ vector and $b$ is a $[N, 1]$ vector. It can
be written in the matrix form
$$J(x) = x^T A^T A x + 2 b^T A x + b^T b$$
The first derivative of $J$ with respect to $x$ is the $[n, 1]$ vector
$$\frac{\partial J(x)}{\partial x} = 2 A^T (Ax + b)$$
and the second derivative is the $[n, n]$ matrix
$$\frac{\partial^2 J(x)}{\partial x \partial x^T} = 2 A^T A$$
B.8 Quadratic programming
Quadratic programming is the resolution procedure for continuous optimisation problems
with a quadratic objective function, for instance
$$b^* = \arg\min_b b^T D b$$
where $b$ is a $[n, 1]$ vector and $D$ is a $[n, n]$ matrix. For instance, if $n = 2$ and $D$ is the
identity matrix, $J(b) = b_1^2 + b_2^2$ has a single global minimum in $[0, 0]$. If the solution
is subject to no constraints, the problem is called unconstrained. If $D$ is a positive
(negative) semidefinite matrix, the function $J$ is convex (concave).
In machine learning the most common quadratic programming task is strictly convex,
since it derives from the least-squares formulation where $D$ is a positive definite matrix.
The general form of an unconstrained strictly convex quadratic objective function is
$$b^* = \arg\min_b \left(b^T D b - d^T b + k\right) = \arg\min_b \left(b^T D b - d^T b\right) \qquad (B.8.12)$$
where $D$ is positive definite, $d$ is a $[n, 1]$ vector and $k$ is a scalar (which has no impact on
the minimisation problem).
The constrained version has a set of linear inequality constraints of the form
$$A^T b \ge b_0$$
where $A$ is a $[n, c]$ matrix defining the $c$ constraints under which we want to minimise the
function $J$.
The R package quadprog provides the implementation solve.QP of a method to solve
a strictly convex constrained quadratic programming task.
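R script
A small usage sketch (assuming quadprog is installed); note that solve.QP minimises $\frac{1}{2} b^T D b - d^T b$, so the matrix $D$ of (B.8.12) must be doubled when passed as Dmat:

library(quadprog)                      # assumed available
D <- 2 * diag(2)                       # Dmat = 2 * D with D the identity
d <- c(1, 1)
A <- diag(2)                           # constraints: b1 >= 0.8 and b2 >= 0
b0 <- c(0.8, 0)
sol <- solve.QP(Dmat = D, dvec = d, Amat = A, bvec = b0)
sol$solution                           # constrained minimiser (0.8, 0.5)
•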
B.9 The matrix inversion formula
Let us consider the four matrices $F$, $G$, $H$ and $K$ and the matrix $F + GHK$. Assume that
the inverses of the matrices $F$, $H$ and $(F + GHK)$ exist. Then
$$(F + GHK)^{-1} = F^{-1} - F^{-1} G \left(H^{-1} + K F^{-1} G\right)^{-1} K F^{-1} \qquad (B.9.13)$$
Consider the case where $F$ is a $[n \times n]$ square nonsingular matrix, $G = z$ where $z$ is a
$[n \times 1]$ vector, $K = z^T$ and $H = 1$. Then the formula simplifies to
$$(F + z z^T)^{-1} = F^{-1} - \frac{F^{-1} z z^T F^{-1}}{1 + z^T F^{-1} z}$$
where the denominator of the right-hand term is a scalar.
If $X$ and $Z$ are two $[N, p]$ matrices, from (B.9.13) the push-through identity [2] can be derived:
$$X^T (I_N + Z X^T)^{-1} = (I_p + X^T Z)^{-1} X^T \qquad (B.9.14)$$
Then for any $[N, p]$ matrix $X$ and scalar $\lambda > 0$
$$X^T (\lambda I_N + X X^T)^{-1} = (\lambda I_p + X^T X)^{-1} X^T \qquad (B.9.15)$$
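R script
A quick numerical check of the rank-one case and of identity (B.9.15):

set.seed(0)
n <- 4
Fm <- diag(n) + crossprod(matrix(rnorm(n * n), n))   # nonsingular [n, n]
z <- rnorm(n)
Fi <- solve(Fm)
lhs <- solve(Fm + z %*% t(z))
rhs <- Fi - (Fi %*% z %*% t(z) %*% Fi) / (1 + sum(z * (Fi %*% z)))
max(abs(lhs - rhs))                                  # ~ 0
N <- 6; p <- 3; lambda <- 0.7
X <- matrix(rnorm(N * p), N, p)
max(abs(t(X) %*% solve(lambda * diag(N) + X %*% t(X)) -
        solve(lambda * diag(p) + t(X) %*% X) %*% t(X)))  # ~ 0: (B.9.15)
•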
Appendix C
Probability and statistical notions
C.1 Common univariate discrete probability functions
C.1.1 The Bernoulli trial
A Bernoulli trial is a random experiment with two possible outcomes, often called "success"
and "failure". The probability of success is denoted by $p$ and the probability of
failure by $(1-p)$. A Bernoulli random variable $z$ is a binary discrete r.v. associated with
the Bernoulli trial. It takes $z = 0$ with probability $(1-p)$ and $z = 1$ with probability $p$.
The probability function of $z$ can be written in the form
$$\text{Prob}\{z = z\} = P_z(z) = p^z (1-p)^{1-z}, \qquad z = 0, 1$$
Note that $E[z] = p$ and $\text{Var}[z] = p(1-p)$.
C.1.2 The Binomial probability function
A binomial random variable represents the number of successes $z$ in a fixed number $N$ of
independent Bernoulli trials with the same probability $p$ of success for each trial. A typical
example is the number $z$ of heads in $N$ tosses of a coin.
The probability function of $z \sim \text{Bin}(N, p)$ is given by
$$\text{Prob}\{z = z\} = P_z(z) = \binom{N}{z} p^z (1-p)^{N-z}, \qquad z = 0, 1, \dots, N \qquad (C.1.1)$$
The mean of the probability function is $\mu = Np$. Note that:
• the Bernoulli probability function is a special case ($N = 1$) of the binomial function,
• for small $p$, the probability of having at least 1 success in $N$ trials is proportional to
$N$, as long as $Np$ is small,
• if $z_1 \sim \text{Bin}(N_1, p)$ and $z_2 \sim \text{Bin}(N_2, p)$ are independent then $z_1 + z_2 \sim \text{Bin}(N_1 + N_2, p)$.
The binomial distribution returns then the probability of $z$ successes out of $N$ draws
with replacement. The probability of $z$ successes out of $N$ draws without replacement
from a population of size $P$ that contains $k$ items associated with success is returned by the
hypergeometric distribution:
$$\text{Prob}\{z = z\} = \frac{\binom{k}{z} \binom{P-k}{N-z}}{\binom{P}{N}}. \qquad (C.1.2)$$
C.2 Common univariate continuous distributions
C.2.1 Uniform distribution
A random variable $z$ is said to be uniformly distributed on the interval $(a, b)$ (written as
$z \sim \mathcal{U}(a, b)$) if its probability density function is given by
$$p(z) = \begin{cases} \frac{1}{b-a} & \text{if } a < z < b \\ 0 & \text{otherwise} \end{cases}$$
It can be shown that the skewness of a uniformly distributed continuous random variable
is equal to 0.
Exercise
Show that the variance of $\mathcal{U}(a, b)$ is equal to $\frac{1}{12}(b-a)^2$.
•
C.2.2 The chi-squared distribution
It describes the distribution of squared Normal r.v.s. An r.v. $z$ has a $\chi^2_N$ distribution if
$$z = x_1^2 + \dots + x_N^2$$
where $N \in \mathbb{N}$ and $x_1, x_2, \dots, x_N$ are i.i.d. standard Normal random variables $\mathcal{N}(0, 1)$. The
distribution is called a chi-squared distribution with $N$ degrees of freedom. Note also that:
• the probability distribution is a gamma distribution with parameters $(\frac{1}{2}N, \frac{1}{2})$,
• $E[z] = N$ and $\text{Var}[z] = 2N$.
The $\chi^2_N$ density and distribution functions for $N = 10$ are plotted in Figure C.1 (R script
chisq.R in the package gbcode).
C.2.3 Student's t-distribution
It describes the distribution of the ratio of Normal and chi-squared r.v.s. If $x \sim \mathcal{N}(0, 1)$ and
$y \sim \chi^2_N$ are independent, then the Student's t-distribution with $N$ degrees of freedom is
the distribution of the r.v.
$$z = \frac{x}{\sqrt{y/N}} \qquad (C.2.3)$$
We denote this with $z \sim \mathcal{T}_N$. Note that $E[z] = 0$ and $\text{Var}[z] = N/(N-2)$ if $N > 2$.
The Student density and distribution functions for $N = 10$ are plotted in Figure C.2
by means of the script stu.R in the package gbcode.
Figure C.1: $\chi^2_N$ probability distribution ($N = 10$).
Figure C.2: Student probability distribution ($N = 10$).
Figure C.3: F probability distribution (N = 10)
C.2.4 F-distribution
It describes the distribution of the ratio of chi-squared r.v.s. Let $x \sim \chi^2_M$ and $y \sim \chi^2_N$ be
two independent r.v.s. An r.v. $z$ has an F-distribution with $M$ and $N$ degrees of freedom
(written as $z \sim F_{M,N}$) if
$$z = \frac{x/M}{y/N} \qquad (C.2.4)$$
Note that if $z \sim F_{M,N}$ then $1/z \sim F_{N,M}$, while if $z \sim \mathcal{T}_N$ then $z^2 \sim F_{1,N}$. The F density
and distribution functions are plotted in Figure C.3 by means of the script f.R in the
package gbcode.
C.3 Common statistical hypothesis tests
C.3.1 χ²-test: single sample and two-sided
Consider a random sample from $\mathcal{N}(\mu, \sigma^2)$ with $\mu$ known. Let
$$H: \sigma^2 = \sigma_0^2; \qquad \bar{H}: \sigma^2 \ne \sigma_0^2$$
Let $\widehat{SS} = \sum_i (z_i - \mu)^2$. From Section 5.7 it follows that, if $H$ is true, then $\widehat{SS}/\sigma_0^2 \sim \chi^2_N$
(Section C.2.2).
The level $\alpha$ χ²-test rejects $H$ if $\widehat{SS}/\sigma_0^2 < a_1$ or $\widehat{SS}/\sigma_0^2 > a_2$, where
$$\text{Prob}\left\{\frac{\widehat{SS}}{\sigma_0^2} < a_1\right\} + \text{Prob}\left\{\frac{\widehat{SS}}{\sigma_0^2} > a_2\right\} = \alpha$$
A slight modification is necessary if $\mu$ is unknown. In this case one must replace $\mu$
with $\hat{\mu}$ in the quantity $\widehat{SS}$ and use a $\chi^2_{N-1}$ distribution.
C.3.2 t-test: two samples, two-sided
Consider two r.v.s $x \sim \mathcal{N}(\mu_1, \sigma^2)$ and $y \sim \mathcal{N}(\mu_2, \sigma^2)$ with the same variance. Let
$D_N^x \leftarrow x$ and $D_M^y \leftarrow y$ be two independent sets of samples of size $N$ and $M$, respectively.
We want to test $H: \mu_1 = \mu_2$ against $\bar{H}: \mu_1 \ne \mu_2$.
Let
$$\hat{\mu}_x = \frac{\sum_{i=1}^N x_i}{N}, \quad \widehat{SS}_x = \sum_{i=1}^N (x_i - \hat{\mu}_x)^2, \quad \hat{\mu}_y = \frac{\sum_{i=1}^M y_i}{M}, \quad \widehat{SS}_y = \sum_{i=1}^M (y_i - \hat{\mu}_y)^2$$
It can be shown that, if $H$ is true, the statistic
$$t(D_N) = \frac{\hat{\mu}_x - \hat{\mu}_y}{\sqrt{\left(\frac{1}{M} + \frac{1}{N}\right) \frac{\widehat{SS}_x + \widehat{SS}_y}{M + N - 2}}} \sim \mathcal{T}_{M+N-2}$$
It follows that the test of size $\alpha$ rejects $H$ if
$$|t(D_N)| > t_{\alpha/2, M+N-2}$$
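R script
The statistic can be checked against the built-in pooled-variance test (sample data are illustrative):

set.seed(0)
N <- 30; M <- 40
x <- rnorm(N); y <- rnorm(M, mean = 0.5)
SSx <- sum((x - mean(x))^2); SSy <- sum((y - mean(y))^2)
t.stat <- (mean(x) - mean(y)) /
  sqrt((1/M + 1/N) * (SSx + SSy) / (M + N - 2))
c(by.hand = t.stat,
  t.test = t.test(x, y, var.equal = TRUE)$statistic)  # identical values
alpha <- 0.05
abs(t.stat) > qt(1 - alpha/2, M + N - 2)              # rejection decision
•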
C.3.3 F-test: two samples, two-sided
Consider a random sample $\{x_1, \dots, x_M\} \leftarrow x \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and a random sample
$\{y_1, \dots, y_N\} \leftarrow y \sim \mathcal{N}(\mu_2, \sigma_2^2)$ with $\mu_1$ and $\mu_2$ unknown. Suppose we want to test
$$H: \sigma_1^2 = \sigma_2^2; \qquad \bar{H}: \sigma_1^2 \ne \sigma_2^2$$
Let us consider the statistic
$$f = \frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2} = \frac{\widehat{SS}_1/(M-1)}{\widehat{SS}_2/(N-1)} \sim \frac{\sigma_1^2\, \chi^2_{M-1}/(M-1)}{\sigma_2^2\, \chi^2_{N-1}/(N-1)} = \frac{\sigma_1^2}{\sigma_2^2} F_{M-1, N-1}$$
It can be shown that, if $H$ is true, the ratio $f$ has an F-distribution $F_{M-1, N-1}$ (Section
C.2.4). The F-test rejects $H$ if the ratio $f$ is large, i.e. $f > F_{\alpha, M-1, N-1}$, where
$$\text{Prob}\{f > F_{\alpha, M-1, N-1}\} = \alpha$$
if $f \sim F_{M-1, N-1}$.
C.4 Transformation of random variables and vectors
Theorem 4.1 (Jensen's inequality). Let $x$ be a continuous r.v. and $f$ a convex function.
Then $E[f(x)] \ge f(E[x])$, while if $f$ is concave then $E[f(x)] \le f(E[x])$.
Given a $[n \times 1]$ constant vector $a$ and a random vector $z$ of dimension $[n \times 1]$ with
expected value $E[z] = \mu$ and covariance matrix $\text{Var}[z] = \Sigma$, then
$$E[a^T z] = a^T \mu, \qquad \text{Var}[a^T z] = a^T \Sigma a$$
Also, if $z \sim \mathcal{N}(\mu, \Sigma)$ then $a^T z \sim \mathcal{N}(a^T \mu, a^T \Sigma a)$.
Given a $[n \times n]$ constant matrix $A$ and a random vector $z$ of dimension $[n \times 1]$ with
expected value $E[z] = \mu$ and covariance matrix $\text{Var}[z] = \Sigma$, then
$$E[Az] = A\mu, \qquad \text{Var}[Az] = A \Sigma A^T$$
R script
The relation above may be used to sample a $[n, 1]$ random vector $x$ with covariance
$\text{Var}[x] = \Sigma$, starting from the sampling of a $[n, 1]$ random vector $z$ with $\text{Var}[z] = I_n$. If we
factorise $\Sigma = A A^T$, then $\text{Var}[x] = \text{Var}[Az] = A I_n A^T = \Sigma$. In the script chol2cor.R, we
first define the symmetric matrix $\Sigma$, then we sample $N$ times the $z$ vector into the dataset
$D_N$ and we multiply $D_N$ by $A$: it is possible to verify numerically that this is equivalent
to sampling $N$ times a vector $x$ with covariance $\Sigma$.
•
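R script
In the spirit of chol2cor.R, a compact sketch (the target covariance below is illustrative); note that chol() returns the upper triangular factor, so A is its transpose:

set.seed(0)
Sigma <- matrix(c(1, 0.8, 0.8, 2), 2, 2)   # target covariance
A <- t(chol(Sigma))                        # Sigma = A A^T
N <- 10000
Z <- matrix(rnorm(2 * N), ncol = 2)        # N samples of z with Var[z] = I
X <- Z %*% t(A)                            # each row is x = A z
round(cov(X), 2)                           # ~ Sigma
•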
Theorem 4.2. Given a random vector $z$ of dimension $[n \times 1]$ with expected value $E[z] = \mu$
and covariance matrix $\text{Var}[z] = \sigma^2 I$, for a generic matrix $A$ of dimension $[n \times n]$ the
following relation holds:
$$E[z^T A z] = \sigma^2\, \text{tr}(A) + \mu^T A \mu \qquad (C.4.5)$$
where $\text{tr}(A)$ is the trace of the matrix $A$.
C.5 Correlation and covariance matrices
Given $n$ r.v.s $z_1, \dots, z_n$, the correlation matrix $C$ is a symmetric positive-semidefinite $[n, n]$
matrix whose $(i, j)$ entry is the correlation coefficient $\rho(z_i, z_j)$ (Equation (3.6.68)).
The following relation exists between the covariance matrix $\Sigma$ of the $n$ variables and the
correlation matrix:
$$C = (\text{diag}(\Sigma))^{-1/2}\, \Sigma\, (\text{diag}(\Sigma))^{-1/2}, \qquad \Sigma = \begin{bmatrix} \sigma_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \sigma_n \end{bmatrix} C \begin{bmatrix} \sigma_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \sigma_n \end{bmatrix}$$
where $\sigma_i$ denotes the standard deviation of $z_i$. By using the formula above, the script
corcov.R shows how it is possible to generate a set of examples with a predefined pairwise
correlation $\bar{\rho}$.
C.6 Convergence of random variables
Let { zN } ,N = 1, 2,..., be a sequence of random variables and let z be another random
variable. Let FN ( · ) denote the distribution function of zN and Fz the distribution of z.
We introduce the following definitions:
Definition 6.1 (Convergence in probability). We say that
lim
N→∞z N =zin probability (C.6.6)
and we note zN
P
→zif for each ε > 0
lim
N→∞P{|z N −z| ≥ ε } = 0 (C.6.7)
Definition 6.2 (Convergence with probability one).We say that
lim
N→∞z N =zwith probability one (or almost surely) (C.6.8)
and we note zN
a.s.
→zif
P{ ω: lim
N→∞z N (ω ) = z (ω )} = 1 (C.6.9)
Definition 6.3 (Convergence in Lp ) . For a fixed number p≥ 1 we say that
lim
N→∞z N =zin L p (C.6.10)
if
lim
N→∞E[|z N −z| p ] = 0 (C.6.11)
The following theorems hold:

Theorem 6.4. Convergence in $L^p$ implies convergence in probability.

Theorem 6.5. Convergence with probability one implies convergence in probability.

Note however that convergence in probability does not imply convergence in $L^2$.
Definition 6.6 (Convergence in distribution). The sequence $\mathbf{z}_N$ converges in distribution to $\mathbf{z}$, and we write $\mathbf{z}_N \xrightarrow{D} \mathbf{z}$, if
$$\lim_{N\to\infty} F_N(z) = F(z) \tag{C.6.12}$$
for all $z$ for which $F$ is continuous.
It can be shown that:

Theorem 6.7. Convergence in probability implies convergence in distribution.

Note however that convergence in distribution does not imply convergence in probability.
As a summary:
$$\mathbf{z}_N \xrightarrow{a.s.} \mathbf{z} \implies \mathbf{z}_N \xrightarrow{P} \mathbf{z} \implies \mathbf{z}_N \xrightarrow{D} \mathbf{z}$$
C.6.1 Example
Let $\mathbf{z} \sim \mathcal{U}(1, 2)$ and $\theta = 0$. Consider the two estimators (stochastic processes) $\hat{\theta}_N^{(1)}$ and $\hat{\theta}_N^{(2)}$ for $N \to \infty$, where
$$\hat{\theta}_N^{(1)} = \exp(-\mathbf{z}N), \qquad \hat{\theta}_N^{(2)} = \begin{cases} \exp(-N) & \text{with probability } 1 - 1/N\\ 1 & \text{with probability } 1/N \end{cases}$$
For the first estimator, all the trajectories converge to $\theta$ (strong consistency). For the second process, the trajectory that does not converge has a probability decreasing to zero for $N \to \infty$ (weak consistency).
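A minimal simulation sketch of the two processes (assuming a trajectory is one realisation followed as $N$ grows):

set.seed(0)
Nmax <- 50
z <- runif(1, 1, 2)                      # one realisation of z
theta1 <- exp(-z * (1:Nmax))             # first estimator: every trajectory decays to 0
theta2 <- sapply(1:Nmax, function(N)
  ifelse(runif(1) < 1/N, 1, exp(-N)))    # second estimator: equals 1 with probability 1/N
## theta2 still jumps to 1 occasionally, but ever more rarely as N grows
•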
C.7 The central limit theorem
Theorem 7.1. Assume that $\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_N$ are i.i.d. random variables, discrete or continuous, each having the same probability distribution with finite mean $\mu$ and finite variance $\sigma^2$. As $N \to \infty$, the standardised random variable
$$\frac{(\bar{\mathbf{z}} - \mu)\sqrt{N}}{\sigma}$$
which is identical to
$$\frac{S_N - N\mu}{\sqrt{N}\sigma}$$
converges in distribution (Definition 6.6) to a r.v. having the standardised normal distribution $\mathcal{N}(0,1)$.
This theorem, which holds regardless of the common distribution of the $\mathbf{z}_i$, justifies the importance of the normal distribution, since many r.v.s of interest are either sums or averages. Think, for example, of the commute time in the example of Section 3.2, which can be considered as the combined effect of several causes.
An illustration of the theorem by simulation is obtained by running the R script
central.R.
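A minimal sketch of such a simulation (central.R itself may differ), using averages of uniform r.v.s:

set.seed(0)
R <- 10000; N <- 100
mu <- 0.5; sigma <- sqrt(1/12)           # mean and standard deviation of U(0,1)
zbar <- replicate(R, mean(runif(N)))     # R independent sample averages
s <- (zbar - mu) * sqrt(N) / sigma       # standardised averages
hist(s, freq = FALSE, breaks = 50)       # histogram close to the N(0,1) density
curve(dnorm(x), add = TRUE)
•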
C.8 Chebyshev's inequality
Let $\mathbf{z}$ be a generic random variable, discrete or continuous, having mean $\mu$ and variance $\sigma^2$. Chebyshev's inequality states that, for any positive constant $d$,
$$\text{Prob}\{|\mathbf{z} - \mu| \ge d\} \le \frac{\sigma^2}{d^2} \tag{C.8.13}$$
An illustration of Chebyshev's inequality by simulation can be found in the R script cheby.R.
Note that if we set $\mathbf{z}$ equal to the quantity in (3.10.89), then from (3.10.90) and (C.8.13) we obtain
$$\text{Prob}\{|\bar{\mathbf{z}} - \mu| \ge d\} \le \frac{\sigma^2}{Nd^2} \tag{C.8.14}$$
i.e. the weak law of large numbers (Section 3.1.5). This law states that the average of a large sample converges in probability to the mean of the distribution.
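As a quick illustrative check (independent of cheby.R), consider an exponential r.v. with rate 1, so that $\mu = 1$ and $\sigma^2 = 1$:

set.seed(0)
R <- 100000; d <- 2
z <- rexp(R, rate = 1)
mean(abs(z - 1) >= d)    # empirical probability, about exp(-3), i.e. 0.05
1 / d^2                  # Chebyshev bound sigma^2/d^2 = 0.25, indeed larger
•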
C.9 Empirical distribution properties
Let (5.2.2) be the empirical distribution of $\mathbf{z}$ obtained from a dataset $D_N$. Note that, since $D_N$ is a random vector, the function $\hat{F}_z(\cdot)$ is random, too. The following two properties (unbiasedness and consistency) hold:
Theorem 9.1. For any fixed $z$
$$E_{D_N}[\hat{F}_z(z)] = F_z(z) \tag{C.9.15}$$
$$\text{Var}\left[\hat{F}_z(z)\right] = \frac{F_z(z)(1 - F_z(z))}{N} \tag{C.9.16}$$
Theorem 9.2 (Glivenko-Cantelli theorem).
$$\sup_{-\infty < z < \infty} |\hat{F}_z(z) - F_z(z)| \xrightarrow{N\to\infty} 0 \quad \text{almost surely} \tag{C.9.17}$$
where the definition of almost sure convergence is given in Definition 6.2.
The two theoretical results can be simulated by running the R scripts cumdis_2.R and cumdis_1.R, respectively.
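A minimal sketch of the consistency property, based on the built-in ecdf function (the cumdis scripts may differ):

set.seed(0)
grid <- seq(-4, 4, by = 0.01)
for (N in c(50, 500, 5000)) {
  z <- rnorm(N)
  Fhat <- ecdf(z)                                   # empirical distribution function
  cat(N, max(abs(Fhat(grid) - pnorm(grid))), "\n")  # sup deviation shrinks with N
}
•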
C.10 Useful relations
Some useful relations:
$$E[(\mathbf{z} - \mu)^2] = \sigma^2 = E[\mathbf{z}^2 - 2\mu\mathbf{z} + \mu^2] = E[\mathbf{z}^2] - 2\mu E[\mathbf{z}] + \mu^2 = E[\mathbf{z}^2] - 2\mu\mu + \mu^2 = E[\mathbf{z}^2] - \mu^2$$
For $N = 2$ i.i.d. variables:
$$E[(\mathbf{z}_1 + \mathbf{z}_2)^2] = E[\mathbf{z}_1^2] + E[\mathbf{z}_2^2] + 2E[\mathbf{z}_1\mathbf{z}_2] = 2E[\mathbf{z}^2] + 2\mu^2 = 4\mu^2 + 2\sigma^2$$
For $N = 3$:
$$E[(\mathbf{z}_1 + \mathbf{z}_2 + \mathbf{z}_3)^2] = E[\mathbf{z}_1^2] + E[\mathbf{z}_2^2] + E[\mathbf{z}_3^2] + 2E[\mathbf{z}_1\mathbf{z}_2] + 2E[\mathbf{z}_1\mathbf{z}_3] + 2E[\mathbf{z}_2\mathbf{z}_3] = 3E[\mathbf{z}^2] + 6\mu^2 = 9\mu^2 + 3\sigma^2$$
In general, for $N$ i.i.d. $\mathbf{z}_i$,
$$E[(\mathbf{z}_1 + \mathbf{z}_2 + \dots + \mathbf{z}_N)^2] = N^2\mu^2 + N\sigma^2.$$
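A quick Monte Carlo check of the general relation (illustrative values $\mu = 2$, $\sigma = 3$, $N = 5$):

set.seed(0)
R <- 100000; N <- 5; mu <- 2; sigma <- 3
S <- replicate(R, sum(rnorm(N, mu, sigma)))
mean(S^2)                  # empirical value of E[(z1 + ... + zN)^2]
N^2 * mu^2 + N * sigma^2   # theoretical value: 145
•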
C.11 Minimum of expectation vs. expectation of minimum
Theorem 11.1. Let us consider $M$ random variables $\mathbf{z}_m$, $m = 1, \dots, M$. Then
$$E[\min_m \mathbf{z}_m] \le \min_m E[\mathbf{z}_m]$$
Proof. For each $m$ define the r.v. $\mathbf{x}_m = \mathbf{z}_m - \min_m \mathbf{z}_m$. Now $E[\mathbf{x}_m] \ge 0$ since $\mathbf{z}_m \ge \min_m \mathbf{z}_m$. Then $E[\mathbf{z}_m] - E[\min_m \mathbf{z}_m] \ge 0$. It follows that
$$\forall m, \quad E[\min_m \mathbf{z}_m] \le E[\mathbf{z}_m]$$
and then
$$E[\min_m \mathbf{z}_m] \le \min_m E[\mathbf{z}_m]$$
The difference $\min_m E[\mathbf{z}_m] - E[\min_m \mathbf{z}_m]$ then quantifies the selection bias that occurs in a selection process (e.g. by minimisation) that relies on observed data in a random setting.
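A minimal simulation sketch of this gap: for $M$ identically distributed standard normal variables, $\min_m E[\mathbf{z}_m] = 0$ while $E[\min_m \mathbf{z}_m]$ is clearly negative.

set.seed(0)
R <- 10000; M <- 10
mins <- replicate(R, min(rnorm(M)))  # minimum over M standard normals
mean(mins)                           # about -1.54, well below min_m E[z_m] = 0
•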
C.12 Taylor expansion of function
Let $J(\cdot)$ be a function with a $p$-dimensional argument of the form $\alpha = [\alpha_1, \dots, \alpha_p]$. The Taylor expansion of the function $J(\cdot)$ about $\bar{\alpha}$ can be written as follows:
$$J(\alpha) = J(\bar{\alpha}) + \sum_{j=1}^{p} (\alpha_j - \bar{\alpha}_j) \left.\frac{\partial J(\alpha)}{\partial \alpha_j}\right|_{\alpha = \bar{\alpha}} + \sum_{i=1}^{p}\sum_{j=1}^{p} \frac{(\alpha_i - \bar{\alpha}_i)(\alpha_j - \bar{\alpha}_j)}{2} \left.\frac{\partial^2 J(\alpha)}{\partial \alpha_i\,\partial \alpha_j}\right|_{\alpha = \bar{\alpha}} + \dots$$
which can be written in vector form as
$$J(\alpha) \approx J(\bar{\alpha}) + (\alpha - \bar{\alpha})^T \nabla J(\bar{\alpha}) + \frac{1}{2}(\alpha - \bar{\alpha})^T H(\bar{\alpha})(\alpha - \bar{\alpha})$$
where $\nabla J(\alpha)$ is the gradient vector and $H(\alpha) = [H_{ij}]$ is the $[p,p]$ Hessian matrix of all second-order derivatives
$$H_{ij} = \frac{\partial^2 J(\alpha)}{\partial \alpha_i\,\partial \alpha_j}$$
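A quick numeric check of the second-order expansion for a hypothetical function $J(\alpha) = \sum_j \alpha_j^4$, whose gradient and Hessian are available in closed form:

J <- function(a) sum(a^4)
gradJ <- function(a) 4 * a^3
hessJ <- function(a) diag(12 * a^2)
abar <- c(1, 2); a <- abar + c(0.05, -0.05)
approx2 <- J(abar) + t(a - abar) %*% gradJ(abar) +
  0.5 * t(a - abar) %*% hessJ(abar) %*% (a - abar)
c(J(a), approx2)   # the two values agree closely for a near abar
•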
C.13 Proof of Eq. (7.5.28)
Since the cost is quadratic, the input uniform density is $\pi(x) = \frac{1}{4}$ in the interval $[-2, 2]$, the regression function is $x^3$ and the noise is i.i.d. with unit variance, from Eq. (7.2.3) we obtain
$$R(\alpha) = \int_{\mathcal{X},\mathcal{Y}} C(y, \alpha x)\, p_f(y|x)\, \pi(x)\, dy\, dx \tag{C.13.18}$$
$$= \int_{x=-2}^{2} \int_{\mathcal{Y}} (y - \alpha x)^2\, p_f(y|x)\, \pi(x)\, dx\, dy \tag{C.13.19}$$
$$= \int_{x=-2}^{2} \int_{\mathcal{W}} (x^3 + w - \alpha x)^2\, p_{\mathbf{w}}(w)\, \frac{1}{4}\, dx\, dw \tag{C.13.20}$$
$$= \frac{1}{4}\Big[\int_{\mathcal{W}} p_{\mathbf{w}}(w)\, dw \int_{x=-2}^{2} (x^3 - \alpha x)^2\, dx \tag{C.13.21}$$
$$\qquad + \int_{x=-2}^{2} dx \int_{\mathcal{W}} w^2\, p_{\mathbf{w}}(w)\, dw + \int_{\mathcal{W}}\int_{x=-2}^{2} 2w(x^3 - \alpha x)\, p_{\mathbf{w}}(w)\, dw\, dx\Big] \tag{C.13.22}$$
$$= \frac{1}{4}\Big[\int_{-2}^{2} (x^3 - \alpha x)^2\, dx + 4\sigma_{\mathbf{w}}^2\Big] \tag{C.13.23}$$
$$= \frac{1}{4}\int_{-2}^{2} (x^3 - \alpha x)^2\, dx + \sigma_{\mathbf{w}}^2 \tag{C.13.24}$$
where the cross term vanishes since $E[\mathbf{w}] = 0$.
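A quick Monte Carlo check of (C.13.24) for a hypothetical value $\alpha = 3$:

set.seed(0)
alpha <- 3; R <- 500000
x <- runif(R, -2, 2)      # pi(x) = 1/4 on [-2, 2]
w <- rnorm(R)             # unit-variance noise
mean((x^3 + w - alpha * x)^2)              # empirical risk R(alpha)
integrate(function(x) (x^3 - alpha * x)^2 / 4,
          -2, 2)$value + 1                 # analytic value with sigma_w^2 = 1
•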
C.14 Biasedness of the quadratic empirical risk
Consider a regression framework where $\mathbf{y} = f(x) + \mathbf{w}$, with $E[\mathbf{w}] = 0$ and $\text{Var}[\mathbf{w}] = \sigma_{\mathbf{w}}^2$, and $h_N$ an estimation of $f$ obtained by minimising the empirical risk on a dataset $D_N \sim \mathbf{y}$. According to the derivation in [64], let us consider the quantity
$$g_N(x) = E_{D_N,\mathbf{y}}[(\mathbf{y} - h(x, \alpha(D_N)))^2] = E_{D_N}[E_{\mathbf{y}}[(\mathbf{y} - h_N)^2]]$$
where $h_N$ stands for $h(x, \alpha(D_N))$. Since
$$(\mathbf{y} - h_N)^2 = (\mathbf{y} - f + f - h_N)^2 = (\mathbf{y} - f)^2 + (f - h_N)^2 + 2(\mathbf{y} - f)(f - h_N)$$
we obtain
$$(\mathbf{y} - f)^2 + (f - h_N)^2 = (\mathbf{y} - h_N)^2 + 2(\mathbf{y} - f)(h_N - f)$$

Note that, since $E_{\mathbf{y}}[\mathbf{y}] = f$, for a given $h_N$
$$E_{\mathbf{y}}[(\mathbf{y} - h_N)^2] = E_{\mathbf{y}}[(\mathbf{y} - f)^2 + (f - h_N)^2 + 2(\mathbf{y} - f)(f - h_N)] = E_{\mathbf{y}}[(\mathbf{y} - f)^2 + (f - h_N)^2]$$

Since $E_{\mathbf{y}}[(\mathbf{y} - f)^2] = E_{D_N}[(\mathbf{y} - f)^2]$, it follows that
$$g_N(x) = E_{D_N}[E_{\mathbf{y}}[(\mathbf{y} - h_N)^2]] = E_{D_N}[E_{\mathbf{y}}[(\mathbf{y} - f)^2] + (f - h_N)^2] = E_{D_N}[(\mathbf{y} - f)^2 + (f - h_N)^2] = E_{D_N}[(\mathbf{y} - h_N)^2 + 2(\mathbf{y} - f)(h_N - f)]$$

By averaging the quantity $g_N(x)$ over the $\mathcal{X}$ domain we obtain
$$\text{MISE} = E_{D_N,\mathbf{y},\mathbf{x}}[(\mathbf{y} - h(\mathbf{x}, \alpha(D_N)))^2] = E_{D_N,\mathbf{x}}[(\mathbf{y} - h_N)^2] + 2E_{D_N,\mathbf{x}}[(\mathbf{y} - f)(h_N - f)] = E_{D_N}[\widehat{\text{MISE}}_{\text{emp}}] + 2\,\text{Cov}[h_N, \mathbf{y}]$$
where $\text{Cov}[h_N, \mathbf{y}] = E_{D_N,\mathbf{x}}[(\mathbf{y} - f)(h_N - f)]$ and $\widehat{\text{MISE}}_{\text{emp}}$ is the quantity (7.2.7) for a quadratic error loss. This means that we have to add a covariance penalty term to the apparent error $\widehat{\text{MISE}}_{\text{emp}}$ in order to have an unbiased estimate of MISE.

Suppose that $h_N$ is a linear estimator, i.e.
$$h_N = S\mathbf{y}$$
where $S$ is known as the smoother matrix. Note that in least-squares regression $S$ is the hat matrix $H = X(X^T X)^{-1}X^T$. In the linear case, since $H^T = H$,
$$N\,\text{Cov}[h_N, \mathbf{y}] = E_{D_N}[(Y - F)^T(HY - F)] = E_{D_N}[Y^T H Y - Y^T F - F^T H Y + F^T F]$$
$$= \sigma^2\,\text{tr}(H) + F^T H F - F^T F - F^T H F + F^T F = \sigma^2\,\text{tr}(H) = \sigma^2\,\text{tr}((X^T X)^{-1}X^T X) = \sigma^2 p$$
where the first term follows from (C.4.5), $\text{tr}(H)$ is the trace of the matrix $H$, $Y$ is a random vector of size $[N,1]$ and $F$ is the vector of the $N$ regression function values $f(x_i)$. It follows that $\text{Cov}[h_N, \mathbf{y}] = \sigma^2 p/N$ and then the $C_p$ formula (8.8.33). Note that the trace of $H$ is also known as the effective number of parameters.
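A minimal simulation sketch of the relation $N\,\text{Cov}[h_N, \mathbf{y}] = \sigma^2 p$ for least-squares regression (hypothetical sizes $N = 50$, $p = 3$, fixed design):

set.seed(0)
N <- 50; p <- 3; sigma <- 1; R <- 5000
X <- matrix(rnorm(N * p), N, p)
H <- X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix, tr(H) = p
f <- X %*% c(1, -1, 2)                   # vector F of regression function values
cv <- replicate(R, {
  y <- f + rnorm(N, sd = sigma)
  sum((y - f) * (H %*% y - f))           # (Y - F)^T (HY - F)
})
mean(cv)                                 # close to sigma^2 * p = 3
•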
Appendix D
Plug-in estimators
This appendix contains the expression of the plug-in estimators of some interesting parameters:
• Skewness of a random variable $\mathbf{z}$: given a dataset $D_N = \{z_1, \dots, z_N\}$, the plug-in estimate of the skewness (3.3.36) is
$$\hat{\gamma} = \frac{\sum_{i=1}^{N}(z_i - \hat{\mu})^3}{N\hat{\sigma}^3} \tag{D.0.1}$$
where $\hat{\mu}$ and $\hat{\sigma}$ are defined in (5.3.4) and (5.3.5), respectively.
• Kurtosis of a random variable $\mathbf{z}$: given a dataset $D_N = \{z_1, \dots, z_N\}$, the plug-in estimate of the kurtosis (3.3.37) is
$$\hat{\kappa} = \frac{\sum_{i=1}^{N}(z_i - \hat{\mu})^4}{N\hat{\sigma}^4} \tag{D.0.2}$$
where $\hat{\mu}$ and $\hat{\sigma}$ are defined in (5.3.4) and (5.3.5), respectively.
• Correlation of two random variables $\mathbf{x}$ and $\mathbf{y}$: given a dataset $D_N = \{\langle x_1, y_1\rangle, \dots, \langle x_N, y_N\rangle\}$ where $x_i \in \mathbb{R}$, $y_i \in \mathbb{R}$, the plug-in estimate of the correlation (3.6.68) is
$$\hat{\rho} = \frac{\sum_{i=1}^{N}(x_i - \hat{\mu}_x)(y_i - \hat{\mu}_y)}{N\hat{\sigma}_x \hat{\sigma}_y} \tag{D.0.3}$$
where $\hat{\mu}_x$ ($\hat{\mu}_y$) and $\hat{\sigma}_x^2$ ($\hat{\sigma}_y^2$) denote the sample mean and sample variance of $\mathbf{x}$ ($\mathbf{y}$).
• Covariance matrix of an $n$-dimensional random vector $\mathbf{z}$: given a dataset $D_N = \{z_1, \dots, z_N\}$ where $z_i = [z_{i1}, \dots, z_{in}]^T$ is a $[n,1]$ vector, the plug-in estimator of the covariance matrix (3.7.72) is the $[n,n]$ matrix
$$\hat{\Sigma} = \frac{\sum_{i=1}^{N}(z_i - \hat{\mu})(z_i - \hat{\mu})^T}{N-1} \tag{D.0.4}$$
whose $jk$ entry is
$$\hat{\Sigma}_{jk} = \frac{\sum_{i=1}^{N}(z_{ij} - \hat{\mu}_j)(z_{ik} - \hat{\mu}_k)}{N-1}$$
and $\hat{\mu}$ is the $[n,1]$ vector
$$\hat{\mu} = \frac{\sum_{i=1}^{N} z_i}{N}, \qquad \hat{\mu}_j = \frac{\sum_{i=1}^{N} z_{ij}}{N}.$$
Note that (D.0.4) can also be written in matrix form as
$$\hat{\Sigma} = \frac{(Z - 1_N\hat{\mu}^T)^T(Z - 1_N\hat{\mu}^T)}{N-1}$$
where $Z$ is a $[N,n]$ matrix whose $i$th row is $z_i^T$ and $1_N$ is a $[N,1]$ vector of ones.
• Correlation matrix of an $n$-dimensional random vector $\mathbf{z}$: the correlation matrix is a symmetric $[n,n]$ matrix whose $jk$ entry is the correlation between the scalar random variables $\mathbf{z}_j$ and $\mathbf{z}_k$. Given a dataset $D_N = \{z_1, \dots, z_N\}$ where $z_i = [z_{i1}, \dots, z_{in}]^T$ is a $[n,1]$ vector, the plug-in estimator can be written as the covariance¹
$$\hat{P} = \frac{\tilde{Z}^T\tilde{Z}}{N}$$
of the scaled matrix
$$\tilde{Z} = CZD^{-1}$$
where
$$C = I_N - \frac{1_N 1_N^T}{N}$$
is the $[N,N]$ centring matrix, $I_N$ is the identity matrix, $1_N$ is a $[N,1]$ vector of ones, and
$$D = \operatorname{diag}(\hat{\sigma}_1, \dots, \hat{\sigma}_n)$$
is a diagonal $[n,n]$ scaling matrix where $\hat{\sigma}_j^2$ is the sample variance of $\mathbf{z}_j$.
The diagonal entries of $\hat{P}$ are all 1. The $jk$ entry ($j \ne k$) of the matrix $\hat{P}$ can also be obtained by applying (D.0.3) to the $j$th and $k$th columns of $Z$.
¹see also http://users.stat.umn.edu/~helwig/notes/datamat-Notes.pdf
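A minimal sketch of the constructions above on illustrative data (all names are arbitrary):

set.seed(0)
N <- 1000; n <- 3
Z <- matrix(rnorm(N * n), N, n)
z <- Z[, 1]; s <- sqrt(mean((z - mean(z))^2))   # plug-in standard deviation
gamma.hat <- sum((z - mean(z))^3) / (N * s^3)   # skewness (D.0.1)
kappa.hat <- sum((z - mean(z))^4) / (N * s^4)   # kurtosis (D.0.2)
C <- diag(N) - matrix(1, N, N) / N              # centring matrix
D <- diag(apply(Z, 2, function(v) sqrt(mean((v - mean(v))^2))))
Ztilde <- C %*% Z %*% solve(D)                  # centred and scaled data
P.hat <- t(Ztilde) %*% Ztilde / N               # correlation matrix, unit diagonal
•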
Appendix E
Kernel functions
A kernel function $K$ is a nonnegative function
$$K: \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^+ \to \mathbb{R}^+$$
where the first argument is an $n$-dimensional input, the second argument is typically called the centre and the third argument is called the width or bandwidth. Once a distance function between the input and the centre
$$d: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^+ \tag{E.0.1}$$
is defined, the kernel function can be expressed as a function
$$K: \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}^+ \tag{E.0.2}$$
of the distance $d$ and the bandwidth parameter. The maximum value of a kernel function is located at zero distance, and the function decays smoothly as the distance increases. Some examples of kernel functions follow.
Inverse distance:
$$K(d, B) = \frac{1}{(d/B)^p} \tag{E.0.3}$$
This function goes to infinity as the distance approaches zero.

Corrected inverse distance:
$$K(d, B) = \frac{1}{1 + (d/B)^p} \tag{E.0.4}$$

Gaussian kernel:
$$K(d, B) = \exp\left(-\frac{d^2}{B^2}\right) \tag{E.0.5}$$

Exponential kernel:
$$K(d, B) = \exp\left(-\frac{d}{B}\right) \tag{E.0.6}$$

Quadratic or Epanechnikov kernel:
$$K(d, B) = \begin{cases} 1 - \left(\frac{d}{B}\right)^2 & \text{if } |d| < B\\ 0 & \text{otherwise} \end{cases} \tag{E.0.7}$$

Tricube kernel:
$$K(d, B) = \begin{cases} \left(1 - \left(\frac{d}{B}\right)^3\right)^3 & \text{if } |d| < B\\ 0 & \text{otherwise} \end{cases} \tag{E.0.8}$$

Uniform kernel:
$$K(d, B) = \begin{cases} 1 & \text{if } |d| < B\\ 0 & \text{otherwise} \end{cases} \tag{E.0.9}$$

Triangular kernel:
$$K(d, B) = \begin{cases} 1 - \frac{d}{B} & \text{if } |d| < B\\ 0 & \text{otherwise} \end{cases} \tag{E.0.10}$$
Appendix F
Companion R package
Several scripts are used in the main text to illustrate statistical and machine learning
notions. All the scripts have been implemented in R and are contained in the R package
gbcode.
To install the R package gbcode containing all the scripts mentioned in the text you
should run the following R instructions in the R console.
> library(devtools)
> install_github("gbonte/gbcode")
> require(gbcode)
Once installed, all the scripts will be available in the root directory of the package. In
order to retrieve the directory containing the gbcode package you should type
> system.file(package = "gbcode")
To change the working directory to the one containing the scripts, run
> setwd(find.package("gbcode"))
If you wish to run a script mentioned in the main text (e.g. the script freq.R) without
changing the local directory you should run
> source(system.file("scripts","freq.R",package = "gbcode"))
If you wish to edit a script mentioned in the main text (e.g. the script freq.R) without
changing the local directory you should run
> edit(file=system.file("scripts","freq.R",package = "gbcode"))
If you wish to execute a Shiny dashboard (e.g. leastsquares.R) you should run
> library(shiny)
> source(system.file("shiny","leastsquares.R",package = "gbcode"))
Appendix G
Companion R Shiny dashboards
Several Shiny dashboards are used in the main text to illustrate statistical and machine
learning notions. All the Shiny dashboards are contained in the directory shiny of the R
package gbcode and require the installation of the library shiny. To run a Shiny dashboard
(e.g. condpro.R ) you should first move to their directory by
> setwd(paste(find.package("gbcode"),"shiny",sep="/"))
and then run
> runApp("condpro.R")
The Shiny dashboards are also active under Shinyapps (https://www.shinyapps.io). To run a Shiny dashboard named NAME.R go to https://gbonte.shinyapps.io/NAME. For instance, to run the Shiny dashboard condpro.R go to:
https://gbonte.shinyapps.io/condpro
G.1 List of Shiny dashboards
•mcarlo.R: visualisation, by means of Monte Carlo simulation, of:
1. transformation of a r.v.
2. result of operation on two r.v.s
3. central limit theorem
4. result of linear combination of two independent r.v.s
•condpro.R: visualisation of conditional probability vs. marginal probability in the bivariate Gaussian case and in the regression function case.
•estimation.R: visualisation of different problems of estimation:
1. estimation of mean and variance of a univariate normal r.v.: bias/variance visualisation
2. estimation of mean and variance of a univariate uniform r.v.: bias/variance visualisation
3. estimation of confidence interval of the mean of a univariate normal r.v.
4. maximum-likelihood estimation of the mean of a univariate normal r.v.: visualisation of the log-likelihood function together with the value of the maximum-likelihood estimator
5. maximum-likelihood estimation of the mean and the variance of a univariate normal r.v.: visualisation of the bivariate log-likelihood function together with the value of the maximum-likelihood estimator
6. estimation of mean and covariance of a bivariate normal r.v.: bias/variance visualisation
7. least-squares estimation of the parameters of a linear target function: visualisation of bias/variance of the predicted conditional expectation and of the parameter estimators
8. least-squares estimation of the parameters of a nonlinear target function: visualisation of bias/variance of the predicted conditional expectation
•bootstrap.R: study of the accuracy of the bootstrap estimation of the sampling
distribution, estimator variance and estimator bias. The dashboard considers the
case of sample average for which it is known that bias is null and variance is inversely
proportional to N (Section 5.5.3). The dashboard shows that the bootstrap returns
an accurate estimation of bias and variance of sample average.
•leastsquares.R: visualisation of the minimisation of the empirical risk with gradient-based iteration. 3 dashboards:
1. linear least-squares: visualisation of the convex empirical risk function and position of the estimation as gradient-based iteration proceeds
2. NNet least-squares: visualisation of the estimated regression function (single layer, 3 hidden nodes NNET) and the associated empirical risk as gradient-based iteration proceeds
3. KNN cross-validation: illustration of the points used for test as cross-validation proceeds in the case of a KNN regressor (variable number of neighbours).
•regression.R: visualisation of the model selection trade-off in regression by showing the impact of different kinds of hyper-parameters (degree of polynomial model, number of neighbours in locally constant and locally linear fitting, number of trees in Random Forest) on the bias, variance and generalisation error.
•classif.R: visualisation of different classification notions in 4 dashboards:
1. Univariate: visualise the relation between posterior probability and class conditional densities in a univariate binary classification task
2. Linear discriminant: visualise the relation between bivariate class conditional densities and linear discriminant
3. Perceptron: visualise the evolution of the perceptron hyperplane during the gradient-based minimisation of the misclassification cost, together with the SVM hyperplane
4. Assessment: visualise the relation between ROC curve, PR curve, confusion
matrix and classifier threshold in a univariate binary classification task.
•classif2.R: visualisation of direct and inverse conditional distributions in the unimodal and bimodal case.
Bibliography
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig
Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghe-
mawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia,
Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané,
Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon
Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Van-
houcke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin
Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-
scale machine learning on heterogeneous systems, 2015. Software available from
tensorflow.org.
[2] Charu C. Aggarwal. Linear Algebra and Optimization for Machine Learning - A
Textbook. Springer, 2020.
[3] D. W. Aha. Incremental, instance-based learning of independent and graded concept
descriptions. In Sixth International Machine Learning Workshop , pages 387–391, San
Mateo, CA, 1989. Morgan Kaufmann.
[4] D. W. Aha. A Study of Instance-Based Algorithms for Supervised Learning Tasks:
Mathematical, Empirical and Psychological Observations. PhD thesis, University of
California, Irvine, Department of Information and Computer Science, 1990.
[5] D. W. Aha. Editorial of special issue on lazy learning. Artificial Intelligence Review,
11(1–5):1–6, 1997.
[6] H. Akaike. Fitting autoregressive models for prediction. Annals of the Institute of
Statistical Mathematics, 21:243–247, 1969.
[7] D. M. Allen. The relationship between variable and data augmentation and a method
of prediction. Technometrics, 16:125–127, 1974.
[8] Christophe Ambroise and Geoffrey J. McLachlan. Selection bias in gene extraction
on the basis of microarray gene-expression data. PNAS , 99(10):6562–6566, 2002.
[9] B. D. O. Anderson and M. Deistler. Identifiability in dynamic errors-in-variables
models. Journal of Time Series Analysis , 5:1–13, 1984.
[10] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial
Intelligence Review, 11(1–5):11–73, 1997.
[11] R. Babuska. Fuzzy Modeling and Identification. PhD thesis, Technische Universiteit
Delft, 1996.
[12] R. Babuska and H. B. Verbruggen. Fuzzy set methods for local modelling and
identification. In R. Murray-Smith and T. A. Johansen, editors, Multiple Model
Approaches to Modeling and Control, pages 75–100. Taylor and Francis, 1997.
[13] D. Barber. Bayesian reasoning and machine learning. Cambridge University Press,
2012.
[14] A. R. Barron. Predicted squared error: a criterion for automatic model selection. In
S. J. Farlow, editor, Self-Organizing Methods in Modeling, volume 54, pages 87–103,
New York, 1984. Marcel Dekker.
[15] Thomas Lumley based on Fortran code by Alan Miller. leaps: Regression Subset
Selection, 2020. R package version 3.1.
[16] W. G. Baxt. Improving the accuracy of an artificial neural network using multiple
differently trained networks. Neural Computation , 4:772–780, 1992.
[17] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jef-
frey Mark Siskind. Automatic differentiation in machine learning: a survey, 2015.
arXiv:1502.05767.
[18] M. G. Bello. Enhanced training algorithms, and integrated training/architecture se-
lection for multilayer perceptron networks. IEEE Transactions on Neural Networks,
3(6):864–875, 1992.
[19] H. N. Bensusan. Automatic bias learning: an inquiry into the inductive basis of
induction. PhD thesis, University of Sussex, 1999.
[20] H. Bersini and G. Bontempi. Fuzzy models viewed as multi-expert networks. In
IFSA '97 (7th International Fuzzy Systems Association World Congress, Prague),
pages 354–359, Prague, 1997. Academia.
[21] H. Bersini and G. Bontempi. Now comes the time to defuzzify the neuro-fuzzy
models. Fuzzy Sets and Systems , 90(2):161–170, 1997.
[22] H. Bersini, G. Bontempi, and C. Decaestecker. Comparing RBF and fuzzy inference
systems on theoretical and practical basis. In F. Fogelman-Soulié and P. Gallinari,
editors, ICANN '95,International Conference on Artificial Neural Networks, pages
169–174, 1995.
[23] M. Birattari and G. Bontempi. The lazy package for r. lazy learning for local regres-
sion. Technical Report 38, IRIDIA ULB, 2003.
[24] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least-
squares algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS 11,
pages 375–381, Cambridge, 1999. MIT Press.
[25] C. M. Bishop. Neural Networks for Statistical Pattern Recognition. Oxford University
Press, Oxford, UK, 1994.
[26] S. Bittanti. Model Identification and Data Analysis. Wiley, 2019.
[27] Joseph K. Blitzstein and Jessica Hwang. Introduction to Probability Second Edition.
2019.
[28] G Bontempi. A blocking strategy to improve gene selection for classification of gene
expression data. Computational Biology and Bioinformatics, IEEE/ACM Transac-
tions on, 4(2):293–300, 2007.
[29] G. Bontempi and H. Bersini. Identification of a sensor model with hybrid neuro-fuzzy
methods. In A. B. Bulsari and S. Kallio, editors, Neural Networks in Engineering
systems (Proceedings of the 1997 International Conference on Engineering Applica-
tions of Neural Networks (EANN '97), Stockholm, Sweden), pages 325–328, 1997.
[30] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for modeling and control
design. International Journal of Control , 72(7/8):643–658, 1999.
[31] G. Bontempi, M. Birattari, and H. Bersini. A model selection approach for local
learning. Artificial Intelligence Communications , 121(1), 2000.
[32] G. Bontempi, B. Haibe-Kains, C. Desmedt, C. Sotiriou, and J. Quackenbush.
Multiple-input multiple-output causal strategies for gene selection. BMC bioinfor-
matics, 12(1):458, 2011.
[33] G. Bontempi and P.E. Meyer. Causal filter selection in microarray data. In Proceed-
ing of the ICML2010 conference, 2010.
[34] G. Bontempi, C. Olsen, and M. Flauder. D2C: Predicting Causal Direction from
Dependency Features, 2014. R package version 1.1.
[35] Gianluca Bontempi and Maxime Flauder. From dependency to causality: A machine
learning approach. Journal of Machine Learning Research, 16:2437–2457, 2015.
[36] L. Breiman. Stacked regressions. Machine Learning, 24(1):49–64, 1996.
[37] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and
Regression Trees. Wadsworth International Group, Belmont, CA, 1984.
[38] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[39] D. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive
networks. Complex Systems , 2:321–355, 1988.
[40] Gavin C. Cawley. Over-fitting in model selection and its avoidance. In Jaakko
Hollmén, Frank Klawonn, and Allan Tucker, editors, IDA, volume 7619 of Lecture
Notes in Computer Science, page 1. Springer, 2012.
[41] A. Chalmers. What is this thing called science? (new and extended) . Open University
Press, 2012.
[42] V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and Methods.
Wiley, New York, 1998.
[43] F. Chollet and J.J. Allaire. Deep Learning with R. Manning, 2018.
[44] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots.
Journal of the American Statistical Association, 74:829–836, 1979.
[45] W. S. Cleveland and S. J. Devlin. Locally weighted regression: an approach to
regression analysis by local fitting. Journal of American Statistical Association,
83:596–610, 1988.
[46] W. S. Cleveland and C. Loader. Smoothing by local regression: Principles and
methods. Computational Statistics , 11, 1995.
[47] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
[48] P. Craven and G. Wahba. Smoothing noisy data with spline functions: Estimat-
ing the correct degree of smoothing by the method of generalized cross-validation.
Numer. Math., 31:377–403, 1979.
[49] G. Cybenko. Just-in-time learning and estimation. In S. Bittanti and G. Picci,
editors, Identification, Adaptation, Learning. The Science of Learning Models from
data, NATO ASI Series, pages 423–434. Springer, 1996.
[50] Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca
Bontempi. Credit card fraud detection: a realistic modeling and a novel learning
strategy. IEEE transactions on neural networks and learning systems, 29(8):3784–
3797, 2017.
[51] Andrea Dal Pozzolo, Olivier Caelen, and Gianluca Bontempi. When is undersam-
pling effective in unbalanced classification tasks? In Joint European Conference on
Machine Learning and Knowledge Discovery in Databases, pages 200–215. Springer,
Cham, 2015.
[52] Andrea Dal Pozzolo, Olivier Caelen, Yann-Ael Le Borgne, Serge Waterschoot, and
Gianluca Bontempi. Learned lessons in credit card fraud detection from a practi-
tioner perspective. Expert systems with applications, 41(10):4915–4928, 2014.
[53] Peter Dalgaard. Introductory statistics with R. Springer, 2002.
[54] P. Daniusis, D. Janzing, J. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and
B. Schölkopf. Inferring deterministic causal relations. In Proceedings of the 26th Con-
ference on Uncertainty in Artificial Intelligence (UAI-2010), pages 143–150, 2010.
[55] A. Dean and D. Voss. Design and Analysis of Experiments. Springer Verlag, New
York, NY, USA, 1999.
[56] Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for
Machine Learning. Cambridge University Press, 2020.
[57] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete
data via the em algorithm. Journal of the Royal Statistical Society, B, 39(1):1–38,
1977.
[58] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition.
Springer Verlag, 1996.
[59] N. R. Draper and H. Smith. Applied Regression Analysis. John Wiley and Sons,
New York, 1981.
[60] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1976.
[61] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification. John Wiley and
sons, 2001.
[62] B. Efron. Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7:1–26, 1979.
[63] B. Efron. The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, 1982. Monograph 38.
[64] B. Efron. The estimation of prediction error: Covariance penalties and cross-validation. JASA, 99:619–642, 2004.
[65] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall,
New York, NY, 1993.
[66] B. Efron and R. J. Tibshirani. Cross-validation and the bootstrap: estimating the
error rate of a prediction rule. Technical report, Stanford University, 1995.
[67] Gamaleldin F. Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alexey
Kurakin, Ian J. Goodfellow, and Jascha Sohl-Dickstein. Adversarial examples that
fool both computer vision and time-limited humans. In Samy Bengio, Hanna M. Wal-
lach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett,
editors, NeurIPS , pages 3914–3924, 2018.
[68] J. Fan and I. Gijbels. Variable bandwidth and local linear regression smoothers. The
Annals of Statistics, 20(4):2008–2036, 1992.
[69] J. Fan and I. Gijbels. Adaptive order polynomial fitting: bandwidth robustification
and bias reduction. J. Comp. Graph. Statist. , 4:213–227, 1995.
[70] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman
and Hall, 1996.
[71] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting
useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34,
November 1996.
[72] V. Fedorov. Theory of Optimal Experiments. Academic Press, 1972.
[73] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do
we need hundreds of classifiers to solve real world classification problems? Journal
of Machine Learning Research, 15(1):3133–3181, January 2014.
[74] F. Fleuret. Fast binary feature selection with conditional mutual information. Jour-
nal of Machine Learning Research, 5:1531–1555, 2004.
[75] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learn-
ing and an application to boosting. Journal of Computer and System Sciences,
55(1):119–139, 1997.
[76] J. H. Friedman. Flexible metric nearest neighbor classification. Technical report,
Stanford University, 1994.
[77] Jerome H. Friedman. Stochastic gradient boosting. Comput. Stat. Data Anal.,
38(4):367–378, February 2002.
[78] A. Gelman. Bayesian Data Analysis . Chapman and Hall, 2004.
[79] Z. Ghahramani and M. I. Jordan. Supervised learning from incomplete data via an
em approach. In J. D. Cowan, G. T. Tesauro, and J. Alspector, editors, Advances in
Neural Information Processing Systems, volume 6, pages 120–127, San Mateo, CA,
1994. Morgan Kaufmann.
[80] Clark Glymour, Kun Zhang, and Peter Spirtes. Review of causal discovery methods
based on graphical models. Frontiers in Genetics, 10:524, 2019.
[81] P. Godfrey-Smith. Theory and reality: an introduction to the philosophy of science.
The University of Chicago Press, 2003.
[82] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning.
Addison-Wesley, Reading, MA, 1989.
[83] G.H. Golub and C.F. Van Loan. Matrix computations . Johns Hopkins University
Press, 1996.
[84] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, and M. Gaasenbeek. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
[85] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press,
2016. http://www.deeplearningbook.org.
[86] S. Guo and M.W. Fraser. Propensity Score Analysis: Statistical Methods and Appli-
cations. SAGE, 2014.
[87] I. Guyon. Results and analysis of the 2013 ChaLearn cause-effect pair challenge.
JMLR Workshop and Conference Proceedings, 2014.
[88] I. Guyon, C. Aliferis, and A. Elisseeff. Computational Methods of Feature Selection,
chapter Causal Feature Selection, pages 63–86. Chapman and Hall, 2007.
[89] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal
of Machine Learning Research, 3:1157–1182, 2003.
[90] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal
of Machine Learning Research, 3:1157–1182, 2003.
[91] Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lotfi A. Zadeh. Feature Ex-
traction: Foundations and Applications. Springer-Verlag New York, Inc., 2006.
[92] B. Haibe-Kains, C. Desmedt, S. Loi, M. Delorenzi, C. Sotiriou, and G. Bontempi.
Computational Intelligence in Clinical Oncology: Lessons Learned from an Analysis
of a Clinical Study, pages 237–268. Springer Berlin Heidelberg, Berlin, Heidelberg,
2008.
[93] D. J. Hand. Discrimination and classification. John Wiley, New York, 1981.
[94] D.J. Hand. Statistics: a very short introduction, volume 196. Oxford University
Press, USA, 2008.
[95] W. Hardle and J. S. Marron. Fast and simple scatterplot smoothing. Comp. Statist.
Data Anal., 20:1–17, 1995.
[96] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall,
London, UK, 1990.
[97] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–615,
1996.
[98] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning.
Springer, 2001.
[99] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning (2nd
edition). Springer, 2009.
[100] Trevor Hastie and Robert Tibshirani. Efficient quadratic regularization for expres-
sion arrays. Biostatistics (Oxford, England), 5(3):329–40, Jul 2004.
[101] J. S. U. Hjorth. Computer Intensive Statistical Methods. Chapman and Hall, 1994.
[102] Tin Kam Ho. The random subspace method for constructing decision forests. IEEE
Trans. Pattern Anal. Mach. Intell., 20(8):832–844, 1998.
[103] W. Hoeffding. Probability inequalities for sums of bounded random variables. Jour-
nal of American Statistical Association, 58:13–30, 1963.
[104] PO Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Scholkopf. Nonlinear causal
discovery with additive noise models. In Advances in Neural Information Processing
Systems, pages 689–696, 2009.
[105] P. J. Huber. Robust Statistics . Wiley, New York, 1981.
[106] P. Hurley. A Concise Introduction to Logic. CENGAGE Learning Custom Publish-
ing, 2011.
[107] A. K. Jain, R. C. Dubes, and C. Chen. Bootstrap techniques for error estimation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 5:628–633, 1987.
[108] J.-S. R. Jang. ANFIS: Adaptive-network-based fuzzy inference systems. IEEE Transactions on Systems, Man, and Cybernetics, 23(3):665–685, 1993.
[109] J. S. R. Jang, C. T. Sun, and E. Mizutani. Neuro-Fuzzy and Soft Computing . Matlab
Curriculum Series. Prentice Hall, 1997.
[110] E.T. Jaynes. Probability theory : the logic of science. Cambridge University Press,
2003.
[111] T. A. Johansen and B. A. Foss. Constructing narmax models using armax models.
International Journal of Control, 58:1125–1153, 1993.
[112] M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection
for density estimation. Journal of American Statistical Association, 90, 1995.
[113] M. I. Jordan and T. J. Sejnowski, editors. Graphical models: foundations of neural
computation. The MIT Press, 2001.
[114] V. Y. Katkovnik. Linear and nonlinear methods of nonparametric regression analysis.
Soviet Automatic Control, 5:25–34, 1979.
[115] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing.
Science, 220(4598):671–680, 1983.
[116] R. Kohavi. A study of cross-validation and bootstrap for accuracy estima-
tion and model selection. In Proceedings of IJCAI-95, 1995. available at
http://robotics.stanford.edu/users/ronnyk/ronnyk-bib.html.
[117] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelli-
gence, 97(1-2):273–324, 1997.
[118] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelli-
gence, 97(1-2):273–324, 1997.
[119] D. Koller and N. Friedman. Probabilistic graphical models. The MIT Press, 2009.
[120] A. N. Kolmogorov. Foundations of Probability. Berlin, 1933.
[121] J. Kolodner. Case-Based Reasoning. Morgan Kaufmann, 1993.
[122] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature,
521(7553):436–444, 2015.
[123] Raoul LePage and Lynne Billard. Exploring the limits of bootstrap. John Wiley &
Sons, 1992.
[124] R. J.A. Little and D. B. Rubin. Statistical analysis with missing data. Wiley, 2002.
[125] L. Ljung. System identification: Theory for the User . Prentice-Hall, Englewood
Cliffs, NJ, 1987.
[126] Sherene Loi, Benjamin Haibe-Kains, Christine Desmedt, Pratyaksha Wirapati,
Françoise Lallemand, Andrew M Tutt, Cheryl Gillet, Paul Ellis, Kenneth Ryder,
James F Reid, Gianluca Bontempi, et al. Predicting prognosis using molecular
profiling in estrogen receptor-positive breast cancer treated with tamoxifen. BMC
genomics, 9(1):1–12, 2008.
[127] D. Lopez-Paz. From Dependence to Causation. PhD thesis, Cambridge University,
2016.
[128] D. G. Luenberger. Linear and Nonlinear Programming. Addison Wesley, Reading,
MA, 1984.
[129] C. Mallows. Discussion of a paper of Beaton and Tukey. Technometrics, 16:187–188,
1974.
[130] C. L. Mallows. Some comments on Cp. Technometrics, 15:661, 1973.
[131] O. Maron and A. Moore. The racing algorithm: Model selection for lazy learners.
Artificial Intelligence Review, 11(1–5):193–225, 1997.
[132] A. Miller. Subset Selection in Regression (2nd ed.). Chapman and Hall, 2002.
[133] J. Moody. The effective number of parameters: An analysis of generalization and
regularization in nonlinear learning systems. In J. Moody, Hanson, and Lippmann,
editors, Advances in Neural Information Processing Systems, volume 4, pages 847–
854, Palo Alto, 1992. Morgan Kaufmann.
[134] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing
units. Neural Computation , 1(2):281–294, 1989.
[135] A. W. Moore, D. J. Hill, and M. P. Johnson. An empirical investigation of brute force
to choose features, smoothers and function approximators. In S. Janson, S. Judd, and
T. Petsche, editors, Computational Learning Theory and Natural Learning Systems,
volume 3. MIT Press, Cambridge, MA, 1992.
[136] K. P. Murphy. An introduction to graphical models. Technical report, 2001.
[137] R. Murray-Smith. A local model network approach to nonlinear modelling. PhD
thesis, Department of Computer Science, University of Strathclyde, Strathclyde,
UK, 1994.
[138] R. Murray-Smith and T. A. Johansen. Local learning in local model networks.
In R. Murray-Smith and T. A. Johansen, editors, Multiple Model Approaches to
Modeling and Control, chapter 7, pages 185–210. Taylor and Francis, 1997.
[139] R. H. Myers. Classical and Modern Regression with Applications. PWS-KENT
Publishing Company, Boston, MA, second edition, 1994.
[140] E. Nadaraya. On estimating regression. Theory of Prob. and Appl. , 9:141–142, 1964.
[141] Cathy O'Neil. Weapons of Math Destruction. Crown, New York, 2016.
[142] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw Hill,
1991.
[143] Simon Parsons and Anthony Hunter. A Review of Uncertainty Handling Formalisms,
pages 8–37. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.
[144] E. Parzen. On estimation of a probability density function and mode. Annals of
Mathematical Statistics, 33:1065–1076, 1962.
[145] Y. Pawitan. In all likelihood: statistical modelling and inference using likelihood.
Oxford Science, 2001.
[146] J. Pearl. Causality. Cambridge University Press, 2000.
[147] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, San Francisco, Calif., 2009.
[148] Judea Pearl. Comment: Understanding Simpson's paradox. The American Statisti-
cian, 68:8–13, 2014.
[149] Judea Pearl and Dana Mackenzie. The Book of Why. Basic Books, New York, 2018.
[150] Jean-Philippe Pellet and André Elisseeff. Using Markov blankets for causal structure
learning. J. Mach. Learn. Res., 9:1295–1342, 2008.
[151] J.P. Pellet and A. Elisseeff. Using Markov blankets for causal structure learning.
Journal of Machine Learning Research, 9:1295–1342, 2008.
[152] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information:
criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 27, 2005.
[153] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for
hybrid neural networks. In R. J. Mammone, editor, Artificial Neural Networks for
Speech and Vision, pages 126–142. Chapman and Hall, 1993.
[154] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Infer-
ence: Foundations and Learning Algorithms. Adaptive Computation and Machine
Learning. MIT Press, Cambridge, MA, 2017.
[155] D. Plaut, S. Nowlan, and G. E. Hinton. Experiments on learning by back propaga-
tion. Technical Report CMU-CS-86-126, Department of Computer Science, Carnegie
Mellon University, Pittsburgh, PA, 1986.
[156] M. J. D. Powell. Algorithms for Approximation , chapter Radial Basis Functions
for multivariable interpolation: a review, pages 143–167. Clarendon Press, Oxford,
1987.
[157] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical
Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1992.
Second ed.
[158] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
[159] J. R. Quinlan. Simplifying decision trees. International Journal of Man-Machine
Studies, 27:221–234, 1987.
[160] R Development Core Team. R: A language and environment for statistical computing.
R Foundation for Statistical Computing, Vienna, Austria, 2004. ISBN 3-900051-07-0.
[161] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
[162] M. Rosenblatt. Remarks on some nonparametric estimates of a density function.
Annals of Mathematical Statistics, 27:832–837, 1956.
[163] K. J. Rothman, S. Greenland, and Timothy L. Lash. Modern Epidemiology . Lippin-
cott Williams & Wilkins, Philadelphia, PA, 3rd edition, 2008.
[164] D. B. Rubin. Inference and missing data (with discussion). Biometrika , 63:581–592,
1976.
[165] D. E. Rumelhart, G. E. Hinton, and R. K. Williams. Learning representations by
backpropagating errors. Nature, 323(9):533–536, 1986.
[166] S. Russell and Peter Norvig. Artificial Intelligence: a modern approach. Pearson,
2016.
[167] Y. Saeys, I. Inza, and P. Larranaga. A review of feature selection techniques in
bioinformatics. Bioinformatics , 23:2507–2517, 2007.
[168] R. E. Schapire. Nonlinear Estimation and Classification, chapter The boosting approach to machine learning: An overview. Springer.
[169] L. Schneps and C. Colmez. Math on Trial: How Numbers Get Used and Abused in
the Courtroom. EBL ebooks online. Basic Books, 2013.
[170] D. W. Scott. Multivariate density estimation . Wiley, New York, 1992.
[171] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis.
Cambridge University Press, illustrated edition edition, 2004.
[172] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Springer
Verlag, Berlin, 2000.
[173] P. Spirtes and K. Zhang. Causal discovery and inference: concepts and recent
methodological advances. Applied Informatics , 3, 2016.
[174] C. Stanfill and D. Waltz. Toward memory-based reasoning. Communications of the
ACM, 29(12):1213–1228, 1987.
[175] C. Stone. Consistent nonparametric regression. The Annals of Statistics , 5:595–645,
1977.
[176] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal
of the Royal Statistical Society B, 36(1):111–147, 1974.
[177] M. Stone. An asymptotic equivalence of choice of models by cross-validation and
akaike's criterion. Journal of Royal Statistical Society, Series B, 39:44–47, 1977.
[178] T. Takagi and M. Sugeno. Fuzzy identification of systems and its applications
to modeling and control. IEEE Transactions on Systems, Man, and Cybernetics,
15(1):116–132, 1985.
[179] M. Taniguchi and V. Tresp. Averaging regularized estimators. Neural Computation,
(9), 1997.
[180] H. Tijms. Understanding probability. Cambridge, 2004.
[181] V. Tresp. Handbook for neural network signal processing, chapter Committee ma-
chines. CRC Press, 2001.
[182] I. Tsamardinos and C. Aliferis. Towards principled feature selection: Relevancy. In
Proceedings of the 9th International Workshop on Artificial Intelligence and Statis-
tics, 2003.
[183] I Tsamardinos and CF Aliferis. Towards Principled Feature Selection: Relevancy,
Filters and Wrappers. In Ninth International Workshop on Artificial Intelligence
and Statistics, AISTAT, 2003.
[184] B. van Fraassen. The Scientific Image. Oxford University Press, 1980.
[185] V. N. Vapnik. Principles of risk minimization for learning theory. In Advances in
Neural Information Processing Systems, volume 4, Denver, CO, 1992.
[186] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, NY,
1995.
[187] V. N. Vapnik. Statistical Learning Theory. Springer, 1998.
[188] V. N. Vapnik and A. J. Chervonenkis. The necessary and sufficient conditions for
consistency of the method of empirical risk. Pattern Recognition and Image Analysis,
1(3):284–305, 1991.
[189] W. N. Venables and D. M. Smith. An Introduction to R. Network Theory, 2002.
[190] Tyler Vigen. Spurious correlations. Hachette Book, 2015.
[191] T. P. Vogl, J. K. Mangis, A. K. Rigler, W. T. Zink, and D. L. Alkon. Accelerating
the convergence of the back-propagation method. Biological Cybernetics, 59:257–263,
1988.
[192] L. Wasserman. All of statistics . Springer, 2004.
[193] G. Watson. Smooth regression analysis. Sankhya, Series, A(26):359–372, 1969.
[194] S. M. Weiss and C. A. Kulikowski. Computer Systems that learn. Morgan Kaufmann,
San Mateo, California, 1991.
[195] B. Widrow and M.E. Hoff. Adaptive switching circuits. In WESCON Convention
Record Part IV, 1960.
[196] D. H. Wolpert. Stacked generalization. Technical Report LA-UR-90-3460, Los Alamos, NM, 1990.
[197] D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural
Computation, 8:1341–1390, 1996.
[198] D. H. Wolpert and R. Kohavi. Bias plus variance decomposition for zero-one loss
functions. In Proceedings of the 13th International Conference on Machine Learning,
pages 275–283, 1996.
[199] Zenglin Xu, Rong Jin, Jieping Ye, Michael R. Lyu, and Irwin King. Non-monotonic
feature selection. In Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L.
Littman, editors, ICML , volume 382 of ACM International Conference Proceeding
Series, pages 1145–1152. ACM, 2009.
[200] L. A. Zadeh. Fuzzy sets. Information and Control, 8:338–353, 1965.
... La définition d'un modèle de pendule dépend de plusieurs facteurs, comme discuté en Section 1.2. 1. On pourrait envisager un modèle pour calculer la période T du pendule pour une longueur fixée à l'avance ou au contraire établir un modèle pour déterminer quelle longueur l * du fil détermine une période T * . ...
... Afin de pouvoir étudier un système dynamique il est nécessaire d'introduire, comme dans l'exemple précédent, une troisième grandeur, appelée état, qui jouit des propriétés suivantes : 1. elle résume l'ensemble de l'information sur le passé et le présent du système (mémoire). ...
... Définition 2. 6. Un état x (2) est accessible à l'instant t 2 à partir d'un état x (1) s'il existe un instant t 1 et une fonction d'entrée u(·) ∈ Ω telle que ϕ(t 2 , t 1 , x (1) , u(·)) = x (2) Parfois il est intéressant de considérer des systèmes où chaque état x (2) à l'instant t est accessible à partir d'un quelconque état x (1) . ...
Ce syllabus est destiné, en premier lieu, aux étudiants de Bachelier en Sciences Informatiques de l'Université Libre de Bruxelles. Le cours de Modélisation et Simulation vise à fournir une présentation des fondements théoriques de la modélisation des systèmes dynamiques et des outils pour la simulation. La motivation pour le cours naquit d'un constat: nonobstant la réputation mondiale (notamment un prix Nobel et plusieurs prix Francqui) de l'ULB dans le domaine de la modélisation de systèmes dynamiques, aucun cours sur ce sujet n'était dispensé aux élèves informaticiens. Ceci était d'autant plus surprenant si on considère i) que des grandes avancées ont été rendues possible dans la modélisation par l'utilisation de l'ordinateur (il suffit penser aux fractales et à l'étude des phénomènes chaotiques) et ii) que les progrès des méthodes numériques et l'augmentation des performances des ordinateurs permettent aujourd'hui de simuler et prédire le comportement de systèmes de plus en plus complexes. Aussi plusieurs domaines industriels utilisent couramment la simulation numérique afin de raccourcir le cycle de conception et développement de nouveaux produits. La matière enseignée se répartit sur 9 sujets principaux: l'introduction à la modélisation et à la simulation, l'introduction aux systèmes dynamiques, les systèmes à états discrets et à temps discret, les systèmes dynamiques à temps continu, les systèmes linéaires continus, les systèmes non-linéaires continus, les systèmes à temps discret, la simulation Monte-Carlo et la simulation à événements discrets. On trouvera dans l'annexe des rappels d'équations différentielles et de probabilité. Le côté pratique de la modélisation et de la simulation n'est pas négligé et chaque chapitre introduit des applications à des problèmes concrets. Ce manuel ne vise ni à l'originalité, ni à être complet. Il n'a d'autre but que de fournir un supplément à l'étudiant qui suit régulièrement le cours. De nombreuses références à des ouvrages publiés sont faites tout au long de la présentation. En particulier, la présentation des fondements des systèmes dynamiques est inspirée du livre du Pr. Sergio Rinaldi (Politecnico di Milano, Italie) at de son cours de Teoria dei sistemi que j'ai eu l'honneur de suivre dans le lointain 1989. Et, afin de stimuler l'étudiant à entreprendre la découverte de la complexité qui se cache derrière la dynamique d'un système apparemment simple, voici une citation extraite de Gleick: Tout irait bien mieux, non seulement en recherche, mais aussi dans le monde quotidien de la politique et de l'économie, si davantage de gens prenaient conscience du fait que les systèmes élémentaires ne possèdent pas nécessairement des propriétés dynamiques simples.
... Our research considered the implementation of a streaming algorithm capable of performing analysis on a univariate time series as new data becomes available. To make this possible, recursive least squares techniques can be applied to an autoregressive (AR) model [5]. The AR model is commonly used for time series forecasting. ...
... We can then apply recursive least squares to allow the model to work on a data stream [5]. The regression coefficients can be updated as a new sample ( +1 , +1 ) becomes available. ...
... Nonlinear regression is based on developing predictive models, which combine basic functions, such as polynomial, sigmoid, and spline [13]. The polynomial regression is one of the simplest approaches, and it aims at fitting a model by using curves of order n > 2 (quadratic, cubic, etc. ), while the spline approach aims at producing a piece-wise model in which each model is trained with only the value lying in a specified interval. ...
... It performs this through an unsupervised process that projects the data from the original space to a lower dimensional one, where the axis, which are called Principal Components (PC), of this new space are computed by combining the original variables. The first PC is oriented over the direction with the maximum variance of data [13]. This mathematically corresponds to find the vector a = [a 1 , . . . ...
The large-scale deployment of pervasive sensors and decentralized computing in modern smart grids is expected to exponentially increase the volume of data exchanged by power system applications. In this context, the research for scalable, and flexible methodologies aimed at supporting rapid decisions in a data rich, but information limited environment represents a relevant issue to address. To this aim, this paper outlines the potential role of Knowledge Discovery from massive Datasets in smart grid computing, presenting the most recent activities developed in this field by the Task Force on "En-abling Paradigms for High-Performance Computing in Wide Area Monitoring Protective and Control Systems" of the IEEE PSOPE Technologies and Innovation Subcommittee.
... The choice of the optimal number of neighbors k will be performed through automatic leave-one-out selection as described in [5]. Our implementation of the kNN models is based on the R package gbcode [7]. ...
In finance, volatility is defined as a measure of variation of a trading price series over time. As volatility is a latent variable, several measures, named proxies, have been proposed in the literature to represent such quantity. The purpose of our work is twofold. On one hand, we aim to perform a statistical assessment of the relationships among the most used proxies in the volatility literature. On the other hand, while the majority of the reviewed studies in the literature focuses on a uni-variate time series model (NAR), using a single proxy, we propose here a NARX model, combining two proxies to predict one of them, showing that it is possible to improve the prediction of the future value of some proxies by using the information provided by the others. Our results , employing artificial neural networks (ANN), k-Nearest Neighbours (kNN) and support vector regression (SVR), show that the supplementary information carried by the additional proxy could be used to reduce the forecasting error of the aforementioned methods. We conclude by explaining how we wish to further investigate such relationship.
... The coefficient of determination R 2 is in the range from zero to one, the closer the coefficient to one, the better the results. However, this criterion is not appropriate for the comparison of candidate models because overfitting increases artificially its value (Bontempi & Ben Taieb, 2011), as well as the number of features. Finally, this measure is useful to check how well the independent variables explain the variability of the dependent one. ...
The design of experiments and the validation of the results achieved with them are vital in any research study. This paper focuses on the use of different Machine Learning approaches for regression tasks in the field of Computational Intelligence and especially on a correct comparison between the different results provided for different methods, as those techniques are complex systems that require further study to be fully understood. A methodology commonly accepted in Computational intelligence is implemented in an R package called RRegrs. This package includes ten simple and complex regression models to carry out predictive modeling using Machine Learning and well-known regression algorithms. The framework for experimental design presented herein is evaluated and validated against RRegrs. Our results are different for three out of five state-of-the-art simple datasets and it can be stated that the selection of the best model according to our proposal is statistically significant and relevant. It is of relevance to use a statistical approach to indicate whether the differences are statistically significant using this kind of algorithms. Furthermore, our results with three real complex datasets report different best models than with the previously published methodology. Our final goal is to provide a complete methodology for the use of different steps in order to compare the results obtained in Computational Intelligence problems, as well as from other fields, such as for bioinformatics, cheminformatics, etc., given that our proposal is open and modifiable.
La cerveza es un bien de alto consumo en la actualidad que se caracteriza, en países latinos, por su alta concentración en la producción nacional. Estas cerveceras han creado niveles elevados de lealtad en los consumidores, pero carecen de características diferenciadoras debido a la baja diversidad de productos de este tipo. En los últimos años se ha observado un auge de las microcervecerías, las cuales en menos de diez años han crecido exponencialmente haciéndose fuerte en diferentes lugares como bares, pubs, cafés y mostrando aspectos de fortalecimiento ante un mercado dominado por una única empresa. Frente a esto, los consumidores han respondido generando patrones en los que, para diferentes ocasiones o eventos, se inclinan por cierto tipo de cerveza (artesanal o comercial). El estudio apunta a mostrar el panorama actual de la cerveza artesanal en Colombia y generar bases suficientes para enlazar tendencias actuales de la cerveza artesanal y cómo esto influye en el sentimiento y motivaciones del consumidor. La investigación pretende destacar en un campo que no es muy conocido: las microcervecerías en la producción de cerveza artesanal, y obtener una respuesta satisfactoria desde el punto de vista del consumidor. Los resultados obtenidos son ideales para entender el comportamiento del consumidor durante el consumo de cerveza artesanal.
-
Dalila Hattab
The aim of this paper is to investigate the effect of volatility surges during the COVID-19 pandemic crisis on long-term investment trading rules. These trading rules are derived from stock return forecasting based on a Multiple Step Ahead Direct Strategy, and built on the combination of machine learning models and the Autoregressive Fractionally Integrated Moving Average (ARFIMA) model. ARFIMA has the feature to account for the long memory and structural change in conditional variance process. The machine learning models considered are a particular Neural Network model (MLP), K-Nearest Neighbors (KNN) and Support Vector Regression (SVR). The trading performances of the produced models are evaluated in terms of economical metrics reflecting profitability and risk like: Annualized Return, Sharpe Ratio and Profit Ratio. The hybrid model performances are compared to the simple machine learning models and to the classical ARMA-GARCH model using a Volatility Proxy as external regressor. When applying these long-term investment trading rules to the CAC40 index, from May 2016 to May 2020, the finding is that both MLP-based and hybrid ARFIMA-MLP-based trading models show higher performances with a Sharpe Ratio close to 2 and a Profit Ratio around 40% despite the COVID-19 crisis.
Lectures and exercise classes of the ULB/VUB Big Data course (Scalable Analytics part). The GitHub archive contains the lectures and the practical classes on how to implement different ML algorithms from scratch (ordinary least squares, gradient descent, k-means, alternating least squares) using Python NumPy, and how to then make these implementations scalable using Map/Reduce and Spark. Code: https://github.com/Yannael/BigDataAnalytics_INFOH515 Slides: https://github.com/Yannael/BigDataAnalytics_INFOH515/tree/master/Analytics_Course_Slides
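In the spirit of those from-scratch exercises, here is a minimal NumPy sketch of ordinary least squares fitted by batch gradient descent; the synthetic data, learning rate and iteration count are illustrative choices, not the course's material:

    # Fit a linear model by minimizing mean squared error with gradient descent.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + rng.normal(scale=0.1, size=200)

    w, lr, n = np.zeros(3), 0.1, len(y)
    for _ in range(500):
        grad = (2 / n) * X.T @ (X @ w - y)  # gradient of the MSE loss
        w -= lr * grad
    print("estimated weights:", w)          # should be close to true_w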
- Clark Glymour
- Kun Zhang
- Peter Spirtes
A fundamental task in various disciplines of science, including biology, is to find underlying causal relations and make use of them. Causal relations can be identified when interventions are properly applied; in many cases, however, interventions are difficult or even impossible to conduct. It is then necessary to discover causal relations by analyzing the statistical properties of purely observational data, a task known as causal discovery or causal structure search. This paper aims to give an introduction to, and a brief review of, the computational methods for causal discovery developed in the past three decades, including constraint-based and score-based methods and those based on functional causal models, supplemented by some illustrations and applications.
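As a flavor of the constraint-based family, here is a minimal Python sketch of its elementary building block: a conditional independence test based on partial correlation with a Fisher z-transform. The data-generating process and the linear residualization are illustrative assumptions, not a method taken from the paper:

    # Test whether x and y are independent given z via partial correlation.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    z = rng.normal(size=2000)
    x = z + 0.5 * rng.normal(size=2000)  # x and y depend on each other only via z
    y = z + 0.5 * rng.normal(size=2000)

    # Correlate the residuals of x and y after linearly regressing out z.
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    r = np.corrcoef(rx, ry)[0, 1]

    n, k = len(x), 1                      # k = size of the conditioning set
    z_stat = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - k - 3)
    p = 2 * (1 - stats.norm.cdf(abs(z_stat)))
    print(f"partial corr={r:.3f}  p={p:.3f}  (large p: independence not rejected)")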
- Gamaleldin F. Elsayed
- Shreya Shankar
- Brian Cheung
- Jascha Sohl-Dickstein
Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich. However, it is still an open question whether humans are prone to similar mistakes. Here, we create the first adversarial examples designed to fool humans, by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by modifying models to more closely match the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers.
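To show the generic attack on which such transfer techniques build, here is a minimal NumPy sketch of the fast gradient sign method (FGSM) applied to a toy logistic-regression classifier; the untrained placeholder weights and the perturbation size are assumptions for illustration, and this is not the paper's human-perception setup:

    # FGSM: perturb the input in the direction that most increases the loss.
    import numpy as np

    rng = np.random.default_rng(0)
    w, b = rng.normal(size=64), 0.0      # stand-in for a trained model's weights
    x, y = rng.normal(size=64), 1.0      # a flattened input "image" and its label

    def sigmoid(t):
        return 1 / (1 + np.exp(-t))

    # Gradient of the cross-entropy loss with respect to the input x.
    grad_x = (sigmoid(w @ x + b) - y) * w
    eps = 0.1
    x_adv = x + eps * np.sign(grad_x)    # small step, maximal loss increase

    print("clean score:      ", sigmoid(w @ x + b))
    print("adversarial score:", sigmoid(w @ x_adv + b))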
Detecting frauds in credit card transactions is perhaps one of the best testbeds for computational intelligence algorithms. In fact, this problem involves a number of relevant challenges, namely: concept drift (customers' habits evolve and fraudsters change their strategies over time), class imbalance (genuine transactions far outnumber frauds), and verification latency (only a small set of transactions are timely checked by investigators). However, the vast majority of learning algorithms that have been proposed for fraud detection rely on assumptions that hardly hold in a real-world fraud-detection system (FDS). This lack of realism concerns two main aspects: 1) the way and timing with which supervised information is provided and 2) the measures used to assess fraud-detection performance. This paper has three major contributions. First, we propose, with the help of our industrial partner, a formalization of the fraud-detection problem that realistically describes the operating conditions of FDSs that analyze massive streams of credit card transactions every day. We also illustrate the most appropriate performance measures to be used for fraud-detection purposes. Second, we design and assess a novel learning strategy that effectively addresses class imbalance, concept drift, and verification latency. Third, in our experiments, we demonstrate the impact of class imbalance and concept drift in a real-world data stream containing more than 75 million transactions, authorized over a time window of three years.
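As one deliberately simple answer to the class imbalance described above, here is a minimal Python sketch that undersamples the genuine class before training; the simulated data, the 10:1 ratio and the random forest are illustrative assumptions, not the learning strategy proposed in the paper:

    # Rebalance an extremely skewed dataset by undersampling the majority class.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 5))
    y = (rng.random(10000) < 0.002).astype(int)   # ~0.2% simulated frauds

    fraud = np.flatnonzero(y == 1)
    genuine = rng.choice(np.flatnonzero(y == 0),  # keep 10 genuine per fraud
                         size=10 * len(fraud), replace=False)
    idx = np.concatenate([fraud, genuine])

    clf = RandomForestClassifier(random_state=0).fit(X[idx], y[idx])
    print(f"trained on {len(idx)} of {len(y)} transactions")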
- K.J. Hastings
Updated to conform to Mathematica® 7.0, Introduction to Probability with Mathematica®, Second Edition continues to show students how to easily create simulations from templates and solve problems using Mathematica. It provides a real understanding of probabilistic modeling and the analysis of data and encourages the application of these ideas to practical problems. The accompanying CD-ROM offers instructors the option of creating class notes, demonstrations, and projects. New to the Second Edition:
- Expanded section on Markov chains that includes a study of absorbing chains
- New sections on order statistics, transformations of multivariate normal random variables, and Brownian motion
- More example data of the normal distribution
- More attention on conditional expectation, which has become significant in financial mathematics
- Additional problems from Actuarial Exam P
- New appendix that gives a basic introduction to Mathematica
- New examples, exercises, and data sets, particularly on the bivariate normal distribution
- New visualization and animation features from Mathematica 7.0
- Updated Mathematica notebooks on the CD-ROM
After covering topics in discrete probability, the text presents a fairly standard treatment of common discrete distributions. It then transitions to continuous probability and continuous distributions, including normal, bivariate normal, gamma, and chi-square distributions. The author goes on to examine the history of probability, the laws of large numbers, and the central limit theorem. The final chapter explores stochastic processes and applications, ideal for students in operations research and finance.
- T.J. Hastie
- R.J. Tibshirani
This book describes an array of power tools for data analysis that are based on nonparametric regression and smoothing techniques. These methods relax the linear assumption of many standard models and allow analysts to uncover structure in the data that might otherwise have been missed. While McCullagh and Nelder's Generalized Linear Models shows how to extend the usual linear methodology to cover analysis of a range of data types, Generalized Additive Models enhances this methodology even further by incorporating the flexibility of nonparametric regression. Clear prose, exercises in each chapter, and case studies enhance this popular text.
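To give a flavor of the underlying machinery, here is a minimal NumPy sketch of the backfitting idea on which additive models rest, with a crude running-mean smoother standing in for a proper scatterplot smoother; the data and the smoother are illustrative assumptions, not the book's algorithms:

    # Backfitting: fit y = alpha + f1(x1) + f2(x2) by cycling through the
    # predictors, smoothing the partial residuals of each in turn.
    import numpy as np

    def smooth(x, r, window=31):
        order = np.argsort(x)                       # running mean of r along x
        padded = np.pad(r[order], window // 2, mode="edge")
        sm = np.convolve(padded, np.ones(window) / window, mode="valid")
        out = np.empty_like(sm)
        out[order] = sm
        return out

    rng = np.random.default_rng(0)
    n = 500
    x1, x2 = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
    y = np.sin(x1) + x2 ** 2 + rng.normal(scale=0.2, size=n)

    alpha, f1, f2 = y.mean(), np.zeros(n), np.zeros(n)
    for _ in range(20):                             # backfitting iterations
        f1 = smooth(x1, y - alpha - f2)
        f1 -= f1.mean()                             # center for identifiability
        f2 = smooth(x2, y - alpha - f1)
        f2 -= f2.mean()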
- Leo Breiman
- Jerome H. Friedman
- Richard A. Olshen
- Charles J. Stone
The methodology used to construct tree-structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, the tree methods presented here were unthinkable before computers. The authors' study of tree methods develops both the practical and the theoretical sides. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method and, in a more mathematical framework, proving some of their fundamental properties.
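To illustrate the splitting criterion at the heart of regression trees, here is a minimal Python sketch that searches for the single binary split of a node minimizing the within-node sum of squared errors; the synthetic data and the exhaustive threshold scan are a toy illustration, not the book's software:

    # Find the best binary split of one predictor for a regression tree node.
    import numpy as np

    def best_split(x, y):
        best_t, best_sse = None, np.inf
        for t in np.unique(x)[:-1]:                 # candidate thresholds
            left, right = y[x <= t], y[x > t]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_t, best_sse = t, sse
        return best_t, best_sse

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 200)
    y = np.where(x < 0.4, 1.0, 3.0) + rng.normal(scale=0.1, size=200)
    t, sse = best_split(x, y)
    print(f"best threshold={t:.3f}  sse={sse:.2f}  (true changepoint at 0.4)")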