
Handbook

Statistical foundations of machine learning

Second edition

Gianluca Bontempi

Machine Learning Group

Computer Science Department

ULB, Université Libre de Bruxelles,

Brussels, Belgium

mlg.ulb.ac.be

September 17, 2021


And indeed all things that are known have number. For it is not possible that anything whatsoever be understood or known without this.

Philolaus, 400 BC

Not everything that can be counted counts, and not everything that counts can be counted.

W. B. Cameron, 1963

Preface to the 2021 edition

The book is dedicated to all students interested in machine learning who are not content with only running lines of (deep-learning) code but who are eager to learn about this discipline's assumptions, limitations, and perspectives. When I was a student, my dream was to become an AI researcher and save humankind with intelligent robots. For several reasons, I abandoned such ambitions (but you never know). In exchange, I discovered that machine learning is much more than a conventional research domain since it is intimately associated with the scientific process transforming observations into knowledge.

The first version of this book was made publicly available in 2004 with two objectives and one ambition. The first objective was to provide a handbook to ULB students since I was (and still am) strongly convinced that a decent course should come with a decent handbook. The second objective was to group together all the material that I consider fundamental (or at least essential) for a Ph.D. student to undertake a thesis in my lab. At that time, there were already plenty of excellent machine learning reference books. However, most of the existing work did not sufficiently acknowledge what machine learning owes to statistics and concealed (or did not make explicit enough, notably because of incomplete or implicit notation) important assumptions underlying the process of inferring models from data.

The ambition was to make a free academic reference on the foundations of machine learning available on the web. There are several reasons for providing free access to this work: I am a civil servant in an institution that already takes care of my salary; most of the material is not original (though its organisation, notation definition, exercises, code and structure represent the primary added value of the author); in many parts of the world access to expensive textbooks or reference material is still difficult for the majority of students; most of the knowledge underlying this book was obtained by the author thanks to free (or at least non-charged) references and, last but not least, education seems to be the last societal domain where a communist approach may be as effective as rewarding. Personally, I would be delighted if this book could be used to facilitate the access of underfunded educational and research communities to state-of-the-art scientific notions.

Though machine learning was already a hot topic at the end of the 20th century, nowadays it is definitely surrounded by a lot of hype and excitement. The number of publications describing or using a machine learning approach in the last decades is countless, making it impossible to address the heterogeneity of the domain in a single book. Therefore, it is interesting to check how much material from the first edition is still useful: reassuringly enough, the more the nature of the content is fundamental, the less it is prone to obsolescence. Nevertheless, a lot of new things (not only deep learning) happened in the domain, and, more specifically, I realised the importance of some fundamental concepts that were neglected in the first edition.

In particular, during those years, I realised the importance of exposing young researchers to notions of multivariate dependency and independence. These notions are brilliantly summarised in the topic of graphical models, whose knowledge is essential to grasp aspects of dimensionality reduction and feature selection. Secondly, I (re)discovered that the foundations of machine learning lie in epistemology, the branch of philosophy aiming to explain the meaning of knowledge and the process of discovering it. Third, I became convinced that a process of discovering knowledge from data should not be limited to modelling associations but aimed at discovering causal mechanisms. Finally, I added a number of exercises, R scripts, and Shiny dashboards to visualise and illustrate (sometimes too abstract) probabilistic and estimation notions. In this sense, I am convinced that the adoption of Monte Carlo simulation to introduce probabilistic concepts should be a more common habit in introductory statistics classes.
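To give a flavour of what is meant here, the following minimal sketch (an illustration of ours, not taken from the book's companion code) estimates by Monte Carlo simulation a probability that beginners often find abstract, Prob{z > 1} for a standard Normal z, and compares it with the closed-form value:

# Monte Carlo estimate of Prob(z > 1) for z ~ N(0, 1),
# compared with the closed-form value 1 - pnorm(1).
set.seed(0)
R <- 100000            # number of Monte Carlo trials
z <- rnorm(R)          # R draws from the standard Normal density
cat("Monte Carlo estimate:", mean(z > 1), "\n")
cat("Exact value:         ", 1 - pnorm(1), "\n")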

For sure, I am strongly indebted to a lot of authors and their publications. I hope I acknowledged them adequately in the bibliography. If I did not give enough credit to some of the existing works, please do not hesitate to contact me. Last but not least, the book is dedicated to all my ULB students and MLG researchers in whom I have tried for many years to inculcate complex concepts of statistical learning. Their eyes staring at my hand-waving, while I was trying to elucidate some abstruse notions, were the best indicators of how to adapt, select and improve the book's content.

To all those who want to send a note or continue to follow my machine learning journey, see you on my blog https://datascience741.wordpress.com.

Acknowledgements

Though the book is not peer-reviewed, the added value of writing a handbook for students and researchers is that they are typically very careful readers, willing to pinpoint mistakes, inconsistencies, bad English and (a lot of) typos. First, I would like to thank (in random order) the MLG researchers who sent me very useful comments: Abhilash Miranda, Yann-Aël Le Borgne, Souhaib Ben Taieb, Jacopo De Stefani, Patrick Meyer, Olivier Caelen, Liran Lerman. Thanks as well to the following students and readers (in random order) for their comments and remarks: Robin de Haes, Mourad Akandouch, Zheng Liangliang, Olga Ibanez Solé, Maud Destree, Wolf De Wulf, Dieter Vandesande, Miro-Manuel Matagne, Henry Morgan, Pascal Tribel. A big thank you to all of you! And do not hesitate to drop me an email if you have comments or remarks!

Contents

Index

1 Introduction
  1.1 Notations

2 Setting the foundations
  2.1 Deductive logic
  2.2 Formal and empirical science
  2.3 Induction, projection, and abduction
  2.4 Hume and the induction problem
  2.5 Logical positivism and verificationism
  2.6 Popper and the problem of induction
  2.7 Instrumentalism
  2.8 Epistemology and machine learning: the cross-fertilisation

3 Foundations of probability
  3.1 The random model of uncertainty
    3.1.1 Axiomatic definition of probability
    3.1.2 Visualisation of probability measures
    3.1.3 Symmetrical definition of probability
    3.1.4 Frequentist definition of probability
    3.1.5 The Law of Large Numbers
    3.1.6 Independence and conditional probability
    3.1.7 The chain rule
    3.1.8 The law of total probability and the Bayes' theorem
    3.1.9 Direct and inverse conditional probability
    3.1.10 Logics and probabilistic reasoning
    3.1.11 Combined experiments
    3.1.12 Array of joint/marginal probabilities
  3.2 Random variables
  3.3 Discrete random variables
    3.3.1 Parametric probability function
    3.3.2 Expected value, variance and standard deviation of a discrete r.v.
    3.3.3 Entropy and relative entropy
  3.4 Continuous random variable
    3.4.1 Mean, variance, moments of a continuous r.v.
    3.4.2 Univariate Normal (or Gaussian) distribution
  3.5 Joint probability
    3.5.1 Marginal and conditional probability
    3.5.2 Independence
    3.5.3 Chain rule
    3.5.4 Conditional independence
    3.5.5 Entropy in the continuous case
      3.5.5.1 Joint and conditional entropy
  3.6 Bivariate continuous distribution
    3.6.1 Correlation
  3.7 Normal distribution: the multivariate case
    3.7.1 Bivariate normal distribution
    3.7.2 Gaussian mixture distribution
    3.7.3 Linear transformations of Gaussian variables
  3.8 Mutual information
    3.8.1 Conditional mutual information
    3.8.2 Joint mutual information
    3.8.3 Partial correlation coefficient
  3.9 Functions of random variables and Monte Carlo simulation
  3.10 Linear combinations of r.v.
    3.10.1 The sum of i.i.d. random variables
  3.11 Conclusion
  3.12 Exercises

4 Graphical models
  4.1 Conditional independence and multivariate distributions
  4.2 Directed acyclic graphs
  4.3 Bayesian networks
    4.3.1 Bayesian network and d-separation
    4.3.2 D-separation and I-map
      4.3.2.1 D-separation and faithfulness
    4.3.3 Skeleton and I-equivalence
    4.3.4 Stable distributions
  4.4 Markov networks
    4.4.1 Separating vertices, separated subsets and independence
    4.4.2 Directed and undirected representations
  4.5 Conclusions

5 Parametric estimation
  5.1 Classical approach
    5.1.1 Point estimation
  5.2 Empirical distributions
  5.3 Plug-in principle to define an estimator
    5.3.1 Sample average
    5.3.2 Sample variance
  5.4 Sampling distribution
    5.4.1 Shiny dashboard
  5.5 The assessment of an estimator
    5.5.1 Bias and variance
    5.5.2 Estimation and the game of darts
    5.5.3 Bias and variance of µ̂
    5.5.4 Bias of the estimator σ̂²
    5.5.5 A tongue-twister exercise
    5.5.6 Bias/variance decomposition of MSE
    5.5.7 Consistency
    5.5.8 Efficiency
  5.6 The Hoeffding's inequality
  5.7 Sampling distributions for Gaussian r.v.s
  5.8 The principle of maximum likelihood
    5.8.1 Maximum likelihood computation
    5.8.2 Maximum likelihood in the Gaussian case
    5.8.3 Cramer-Rao lower bound
    5.8.4 Properties of m.l. estimators
  5.9 Interval estimation
    5.9.1 Confidence interval of µ
  5.10 Combination of two estimators
    5.10.1 Combination of m estimators
      5.10.1.1 Linear constrained combination
  5.11 Testing hypothesis
    5.11.1 Types of hypothesis
    5.11.2 Types of statistical test
    5.11.3 Pure significance test
    5.11.4 Tests of significance
    5.11.5 Hypothesis testing
    5.11.6 The hypothesis testing procedure
    5.11.7 Choice of test
    5.11.8 UMP level-α test
    5.11.9 Likelihood ratio test
  5.12 Parametric tests
    5.12.1 z-test (single and one-sided)
    5.12.2 t-test: single sample and two-sided
  5.13 A posteriori assessment of a test
  5.14 Conclusion
  5.15 Exercises

6 Nonparametric estimation and testing
  6.1 Nonparametric methods
  6.2 Estimation of arbitrary statistics
  6.3 Jackknife
    6.3.1 Jackknife estimation
  6.4 Bootstrap
    6.4.1 Bootstrap sampling
    6.4.2 Bootstrap estimate of the variance
    6.4.3 Bootstrap estimate of bias
    6.4.4 Bootstrap confidence interval
    6.4.5 The bootstrap principle
  6.5 Randomisation tests
    6.5.1 Randomisation and bootstrap
  6.6 Permutation test
  6.7 Considerations on nonparametric tests
  6.8 Exercises

7 Statistical supervised learning
  7.1 Introduction
  7.2 Estimating dependencies
  7.3 Dependency and classification
    7.3.1 The Bayes classifier
    7.3.2 Inverse conditional distribution
  7.4 Dependency and regression
  7.5 Assessment of a learning machine
    7.5.1 An illustrative example
  7.6 Functional and empirical risk
    7.6.1 Consistency of the ERM principle
    7.6.2 Key theorem of learning
      7.6.2.1 Entropy of a set of functions
      7.6.2.2 Distribution independent consistency
    7.6.3 The VC dimension
  7.7 Generalisation error
    7.7.1 The decomposition of the generalisation error in regression
    7.7.2 The decomposition of the generalisation error in classification
  7.8 The hypothesis-based vs the algorithm-based approach
  7.9 The supervised learning procedure
  7.10 Validation techniques
    7.10.1 The resampling methods
  7.11 Concluding remarks
  7.12 Exercises

8 The machine learning procedure
  8.1 Introduction
  8.2 Problem formulation
  8.3 Experimental design
  8.4 Data pre-processing
  8.5 The dataset
  8.6 Parametric identification
    8.6.1 Error functions
    8.6.2 Parameter estimation
      8.6.2.1 The linear least-squares method
      8.6.2.2 Iterative search methods
      8.6.2.3 Gradient-based methods
      8.6.2.4 Gradient descent
      8.6.2.5 The Newton method
      8.6.2.6 The Levenberg-Marquardt algorithm
    8.6.3 Online gradient-based algorithms
    8.6.4 Alternatives to gradient-based methods
  8.7 Regularisation
  8.8 Structural identification
    8.8.1 Model generation
    8.8.2 Validation
      8.8.2.1 Testing
      8.8.2.2 Holdout
      8.8.2.3 Cross-validation in practice
      8.8.2.4 Bootstrap in practice
      8.8.2.5 Complexity based criteria
      8.8.2.6 A comparison of validation methods
    8.8.3 Model selection criteria
      8.8.3.1 The winner-takes-all approach
      8.8.3.2 The combination of estimators approach
  8.9 Partition of dataset in training, validation and test
  8.10 Evaluation of a regression model
  8.11 Evaluation of a binary classifier
    8.11.1 Balanced Error Rate
    8.11.2 Specificity and sensitivity
    8.11.3 Additional assessment quantities
    8.11.4 Receiver Operating Characteristic curve
    8.11.5 Precision-recall curves
  8.12 Multi-class problems
  8.13 Concluding remarks
  8.14 Exercises

9 Linear approaches
  9.1 Linear regression
    9.1.1 The univariate linear model
    9.1.2 Least-squares estimation
    9.1.3 Maximum likelihood estimation
    9.1.4 Partitioning the variability
    9.1.5 Test of hypotheses on the regression model
      9.1.5.1 The t-test
    9.1.6 Interval of confidence
    9.1.7 Variance of the response
    9.1.8 Coefficient of determination
    9.1.9 Multiple linear dependence
    9.1.10 The multiple linear regression model
    9.1.11 The least-squares solution
    9.1.12 Least-squares and non full-rank configurations
    9.1.13 Properties of least-squares estimators
    9.1.14 Variance of the prediction
    9.1.15 The HAT matrix
    9.1.16 Generalisation error of the linear model
      9.1.16.1 The expected empirical error
      9.1.16.2 The PSE and the FPE
    9.1.17 The PRESS statistic
    9.1.18 Dual linear formulation
    9.1.19 The weighted least-squares
    9.1.20 Recursive least-squares
      9.1.20.1 1st Recursive formulation
      9.1.20.2 2nd Recursive formulation
      9.1.20.3 RLS initialisation
      9.1.20.4 RLS with forgetting factor
  9.2 Linear approaches to classification
    9.2.1 Linear discriminant analysis
      9.2.1.1 Discriminant functions in the Gaussian case
      9.2.1.2 Uniform prior case
      9.2.1.3 LDA parameter identification
    9.2.2 Perceptrons
    9.2.3 Support vector machines
  9.3 Conclusion
  9.4 Exercises

10 Nonlinear approaches
  10.1 Nonlinear regression
    10.1.1 Artificial neural networks
      10.1.1.1 Feed-forward architecture
      10.1.1.2 Back-propagation
      10.1.1.3 Approximation properties
    10.1.2 From shallow to deep learning architectures
    10.1.3 From global modelling to divide-and-conquer
    10.1.4 Classification and Regression Trees
      10.1.4.1 Learning in Regression Trees
      10.1.4.2 Parameter identification
      10.1.4.3 Structural identification
    10.1.5 Basis Function Networks
    10.1.6 Radial Basis Functions
    10.1.7 Local Model Networks
    10.1.8 Neuro-Fuzzy Inference Systems
    10.1.9 Learning in Basis Function Networks
      10.1.9.1 Parametric identification: basis functions
      10.1.9.2 Parametric identification: local models
      10.1.9.3 Structural identification
    10.1.10 From modular techniques to local modelling
    10.1.11 Local modelling
      10.1.11.1 Nadaraya-Watson estimators
      10.1.11.2 Higher order local regression
      10.1.11.3 Parametric identification in local regression
      10.1.11.4 Structural identification in local regression
      10.1.11.5 The kernel function
      10.1.11.6 The local polynomial order
      10.1.11.7 The bandwidth
      10.1.11.8 The distance function
      10.1.11.9 The selection of local parameters
      10.1.11.10 Bias/variance decomposition of the local constant model
  10.2 Nonlinear classification
    10.2.1 Direct estimation via regression techniques
      10.2.1.1 The nearest-neighbour classifier
    10.2.2 Direct estimation via cross-entropy
    10.2.3 Density estimation via the Bayes theorem
      10.2.3.1 Naive Bayes classifier
      10.2.3.2 SVM for nonlinear classification
  10.3 Is there a best learner?
  10.4 Conclusions
  10.5 Exercises

11 Model averaging approaches
  11.1 Stacked regression
  11.2 Bagging
  11.3 Boosting
    11.3.1 The Ada Boost algorithm
    11.3.2 The arcing algorithm
    11.3.3 Bagging and boosting
  11.4 Random Forests
    11.4.1 Why are Random Forests successful?
  11.5 Gradient boosting trees
  11.6 Conclusion
  11.7 Exercises

12 Feature selection
  12.1 Curse of dimensionality
  12.2 Approaches to feature selection
  12.3 Filter methods
    12.3.1 Principal component analysis
      12.3.1.1 PCA: the algorithm
    12.3.2 Clustering
    12.3.3 Ranking methods
  12.4 Wrapping methods
    12.4.1 Wrapping search strategies
    12.4.2 The Cover and van Campenhout theorem
  12.5 Embedded methods
    12.5.1 Shrinkage methods
      12.5.1.1 Ridge regression
      12.5.1.2 Lasso
    12.5.2 Kernel methods
    12.5.3 Dual ridge regression
    12.5.4 Kernel function
  12.6 Similarity matrix and non numeric data
  12.7 Averaging and feature selection
  12.8 Information-theoretic perspective
    12.8.1 Relevance, redundancy and interaction
    12.8.2 Information-theoretic filters
    12.8.3 Information-theoretic notions and generalisation
  12.9 Assessment of feature selection
  12.10 Conclusion
  12.11 Exercises

13 From prediction to causal knowledge
  13.1 About the notion of cause
  13.2 Causality and dependencies
    13.2.1 Simpson's paradox
  13.3 Causal vs associational knowledge
  13.4 The two main problems in causality
  13.5 Causality and potential outcomes
    13.5.1 Causal effect
    13.5.2 Estimation of causal effect
    13.5.3 Assignment mechanisms assumptions
    13.5.4 About unconfoundness
    13.5.5 Randomised designs
      13.5.5.1 Estimation of the treatment effect
      13.5.5.2 Stratified (or conditionally) randomised experiments
    13.5.6 Observational study
    13.5.7 Strategies for estimation in observational studies
  13.6 From potential outcomes to graphical models
  13.7 Causal Bayesian network
    13.7.1 Causal networks and Structural Causal Models
    13.7.2 Pre and post-intervention distributions
    13.7.3 Causal effect estimation and identification
      13.7.3.1 Backdoor criterion
      13.7.3.2 Beyond sufficient set: do-calculus
    13.7.4 Selection bias
  13.8 Counterfactual
  13.9 Causal structure identification
    13.9.1 Constraint-based approaches
      13.9.1.1 Normal conditional independence test
      13.9.1.2 Skeleton discovery
      13.9.1.3 Dealing with immoralities in the skeleton
      13.9.1.4 Limitations
  13.10 Beyond conditional independence
    13.10.1 Causality and feature selection
    13.10.2 Beyond observational equivalence
      13.10.2.1 Learning directionality in bivariate associations
  13.11 Concluding remarks

14 Conclusions
  14.1 About ML limitations
  14.2 A bit of ethics
  14.3 Take-home notions
  14.4 Recommendations

A Unsupervised learning
  A.1 Probability density estimation
    A.1.1 Nonparametric density estimation
      A.1.1.1 Kernel-based methods
      A.1.1.2 k-Nearest Neighbors methods
    A.1.2 Semi-parametric density estimation
      A.1.2.1 Mixture models
      A.1.2.2 The EM algorithm
      A.1.2.3 The EM algorithm for the mixture model
  A.2 K-means clustering

B Linear algebra notions
  B.1 Rank of a matrix
  B.2 Inner product
  B.3 Diagonalisation
  B.4 QR decomposition
  B.5 Singular Value Decomposition
  B.6 Chain rules of differential calculus
  B.7 Quadratic norm
  B.8 Quadratic programming
  B.9 The matrix inversion formula

C Probabilistic notions
  C.1 Common univariate discrete probability functions
    C.1.1 The Bernoulli trial
    C.1.2 The Binomial probability function
  C.2 Common univariate continuous distributions
    C.2.1 Uniform distribution
    C.2.2 The chi-squared distribution
    C.2.3 Student's t-distribution
    C.2.4 F-distribution
  C.3 Common statistical hypothesis tests
    C.3.1 χ²-test: single sample and two-sided
    C.3.2 t-test: two samples, two sided
    C.3.3 F-test: two samples, two sided
  C.4 Transformation of random variables and vectors
  C.5 Correlation and covariance matrices
  C.6 Convergence of random variables
    C.6.1 Example
  C.7 The central limit theorem
  C.8 The Chebyshev's inequality
  C.9 Empirical distribution properties
  C.10 Useful relations
  C.11 Minimum of expectation vs. expectation of minimum
  C.12 Taylor expansion of function
  C.13 Proof of Eq. (7.5.28)
  C.14 Biasedness of the quadratic empirical risk

D Plug-in estimators

E Kernel functions

F Companion R package

G Companion R Shiny dashboards
  G.1 List of Shiny dashboards

Chapter 1

Introduction

Over the last decades, a growing number of organisations have been allocating a vast amount of resources to construct and maintain databases and data warehouses. In scientific endeavours, data refers to carefully collected observations about some phenomenon under study. In business, data capture information about economic trends, critical markets, competitors, and customers. In manufacturing, data record machinery performances and production rates in different conditions. There are essentially two reasons why people gather increasing volumes of data. First, they think some valuable assets are implicitly coded within them, and, second, computer technology enables effective data storage and processing at reduced costs.

The idea of extracting useful knowledge from volumes of data is common to many disciplines, from statistics to physics, from econometrics to system identification and adaptive control. The procedure for finding useful patterns in data is known by different names in different communities, viz., knowledge extraction, pattern analysis, data processing. In the artificial intelligence community, the most common name is machine learning [71]. More recently, the set of computational techniques and tools to support the modelling of a large amount of data is grouped under the more general label of data science.

The need for programs that can learn was stressed by Alan Turing, who argued that it might be too ambitious to write from scratch programs for tasks that even humans must learn to perform. This handbook aims to present the statistical foundations of machine learning intended as the discipline which deals with the automatic design of models from data. In particular, we focus on supervised learning problems (Figure 1.1), where the goal is to model the relation between a set of input variables and one or more output variables, which are considered to be dependent on the inputs in some manner.

Since the handbook deals with artificial learning methods, we do not take into consideration any argument of biological or cognitive plausibility of the learning methods we present. Learning is postulated here as a problem of statistical estimation of the dependencies between variables on the basis of empirical data.

The relevance of statistical analysis arises as soon as there is a need to extract useful information from data records obtained by repeatedly measuring an observed phenomenon. Suppose we are interested in learning about the relationship¹ between two observed variables x (e.g. the height of a child) and y (e.g. the weight of a child), which are quantitative observations of some phenomenon of interest (e.g. obesity during childhood). Sometimes, the a priori knowledge that describes the relation between x and y is available. In other cases, no satisfactory theory exists, and all that we can use are repeated measurements of x and y.

¹Note that the term relation simply denotes the statistical association (due to a probabilistic dependency) between the two variables and has no causal connotation.


Figure 1.1: The supervised learning setting. Machine learning aims to infer from observed data the best model of the stochastic input/output dependency.

In this book, our focus is the second situation, where we assume that only a set of observed data is available. The reasons for addressing this problem are essentially two. First, the more complex the input/output relation is, the less effective the contribution of a human expert in extracting a model of the relation will be. Second, data-driven modelling may be a valuable support for the designer also in modelling tasks where she can take advantage of existing knowledge.

Though machine learning is becoming a central component in many (so-called) intelligent applications, we deem that simply considering it as a powerful computational technology would be utterly reductive. The process of extracting knowledge from observations lies at the root of the modern scientific process, and the most challenging issues in machine learning relate to well-established philosophical and epistemological problems, notably induction or the notion of truth. This is the reason why we added in this new version of the handbook a preliminary chapter to situate the machine learning problem into the broader context of human knowledge acquisition.

Modelling from data

Modelling from data is often viewed as an art, mixing an expert's insight with the information contained in the observations. A typical modelling process cannot be considered as a sequential process but is better represented as a loop with many feedback paths and interactions with the model designer. Various steps are repeated several times aiming to reach, through continuous refinements, a good description of the phenomenon underlying the data.

The modelling process consists of a preliminary phase that brings the data from their original form to a structured configuration and a learning phase that aims to select the model, or hypothesis, that best approximates the data (Figure 1.2).

Figure 1.2: The modelling process and its decomposition in the preliminary phase and learning phase.

The preliminary phase can be decomposed into the following steps:

Problem formulation. Here the model designer chooses a particular application domain, a phenomenon to be studied, a number of descriptive variables and hypothesises the existence of a (stochastic) relation (or dependency) between the measurable variables. The definition of the input variables (and where necessary their transformations) is a very crucial step and is called feature engineering. It is important to stress here the proactive role played by the human (in contrast to a tabula rasa approach), and that this role is a necessary condition for any knowledge process.

Experimental design. This step aims to return a dataset which, ideally, should be made of observations that are well-representative of the phenomenon in order to maximise the performance of the modelling process [55].

Pre-processing. In this step, raw data are cleaned to make learning easier. Pre-processing includes a large set of actions on the observed data, such as noise filtering, outlier removal, missing data treatment [124], feature selection, and so on (see the sketch after this list).
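By way of illustration, here is a minimal R sketch of two such pre-processing actions on a toy dataset; all variable names, values and thresholds are hypothetical, not taken from the book:

# Toy pre-processing: outlier removal and missing data treatment
# on a hypothetical height (x) / weight (y) dataset.
set.seed(0)
x <- c(rnorm(50, mean = 120, sd = 10), 500)  # heights; 500 is a gross outlier
y <- rnorm(51, mean = 25, sd = 4)            # weights
y[10] <- NA                                  # a missing measurement
D <- data.frame(x = x, y = y)
# crude outlier filtering: drop points more than 3 standard deviations from the mean
D <- D[abs(D$x - mean(D$x)) < 3 * sd(D$x), ]
# missing data treatment: mean imputation of y
D$y[is.na(D$y)] <- mean(D$y, na.rm = TRUE)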

Once the preliminary phase has returned the dataset in a structured input/output form (e.g. a two-column table), called training set, the learning phase begins. A graphical representation of a training set for a simple learning problem with one input variable x and one output variable y is given in Figure 1.3. This manuscript will mostly focus on this second phase, assuming that the preliminary steps have already been performed by the model designer.

Figure 1.3: A training set for a simple supervised learning problem with one input variable x and one output variable y. The dots represent the observed samples.

Suppose that, on the basis of the collected data, we wish to learn the unknown dependency existing between the x variable and the y variable. The knowledge of this dependency could shed light on the observed phenomenon and let us predict the value of the output y for a given input (e.g. what is the expected weight of a child who is 120 cm tall?). What is difficult and tricky in this task is the finiteness and the random nature of data. For instance, a second set of observations of the same pair of variables could produce a dataset (Figure 1.4) that is not identical to the one in Figure 1.3, though both originate from the same measurable phenomenon. This simple fact suggests that a simple interpolation of the observed data would not produce an accurate model of the data.

Figure 1.4: A second realisation of the training set for the same phenomenon observed in Figure 1.3. The dots represent the observed examples.
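To see concretely why interpolating finite and noisy data is risky, here is a minimal R sketch of ours; the noisy linear dependency is a hypothetical stand-in for the height/weight phenomenon, not the book's actual data. Two datasets drawn from the same stochastic process differ, and a model interpolating the first one exactly predicts the second one worse than a simpler fit does:

# Two realisations of the same stochastic process y = f(x) + noise.
set.seed(0)
f  <- function(x) 0.3 * x - 11          # hypothetical "true" dependency
x  <- seq(100, 140, by = 2)
y1 <- f(x) + rnorm(length(x), sd = 2)   # first training set
y2 <- f(x) + rnorm(length(x), sd = 2)   # second realisation, same phenomenon
interp <- splinefun(x, y1)              # interpolator: fits y1 exactly, noise included
lin    <- lm(y1 ~ x)                    # less flexible linear model
mean((y2 - interp(x))^2)                # interpolation error on the new realisation
mean((y2 - predict(lin, data.frame(x = x)))^2)  # typically smaller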

The goal of machine learning is to formalise and optimise the procedure which brings from data to model and consequently from data to predictions. A learning procedure can be concisely defined as a search, in a space of possible model configurations, of the model which best represents the phenomenon underlying the data. As a consequence, a learning procedure requires both a search space, where possible solutions may be found, and an assessment criterion that measures the quality of the solutions in order to select the best one.

The search space is defined by the designer using a set of nested classes with increasing capacity (or representation power). For our introductory purposes, it is sufficient to consider here a class as a set of input/output models (e.g. the set of polynomial models) with the same model structure (e.g. second-order degree) and the capacity of the class as a measure of the set of input/output mappings which can be approximated by the models belonging to the class.

Figure 1.5 shows the training set of Figure 1.3 together with three parametric models which belong to the class of first-order polynomials. Figure 1.6 shows the same training set with three parametric models, which belong to the class of second-order polynomials.

Figure 1.5: Training set and three parametric models which belong to the class of first-order polynomials.

Figure 1.6: Training set and three parametric models which belong to the class of second-order polynomials.

The reader could visually decide whether the class of second-order models is more suitable than the first-order class to model the dataset. At the same time, she could guess which among the three plotted models is the one that produces the best fit.
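As a sketch of what such a comparison involves (the data generation below is hypothetical, not the book's example), one can fit one model from each class by least squares and plot them over the training set:

# Fitting models from two classes of increasing capacity:
# first-order and second-order polynomials.
set.seed(0)
x <- runif(30, 100, 140)
y <- 0.002 * (x - 100)^2 + 20 + rnorm(30, sd = 1)  # hypothetical phenomenon
h1 <- lm(y ~ x)             # class of first-order polynomials
h2 <- lm(y ~ poly(x, 2))    # class of second-order polynomials
plot(x, y, main = "Training set and fitted models")
xs <- seq(100, 140, length.out = 200)
lines(xs, predict(h1, data.frame(x = xs)), lty = 2)  # first-order fit
lines(xs, predict(h2, data.frame(x = xs)), lty = 1)  # second-order fit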

In real high-dimensional settings, however, a visual assessment of the quality of a model is neither possible nor sufficient. Data-driven quantitative criteria are therefore required. We will assume that the goal of learning is to achieve a good statistical generalisation. This means that the learned model is expected to return an accurate prediction of the dependent (output) variable for new (unseen) values of the independent (input) variables. By new values we mean values which are not part of the training set but are generated by the same stochastic process.

Once the classes of models and the assessment criteria are fixed, the goal of a learning algorithm is to search i) for the best class of models and ii) for the best parametric model within such a class. Any supervised learning algorithm is then made of two nested loops denoted as the structural identification loop and the parametric identification loop.


Structural identification is the outer loop that seeks the model structure which is expected to have the best accuracy. It is composed of a validation phase, which assesses each model structure on the basis of the chosen assessment criterion, and a selection phase, which returns the best model structure on the basis of the validation output. Parametric identification is the inner loop that returns the best model for a fixed model structure. We will show that the two procedures are intertwined, since the structural identification requires the outcome of the parametric step in order to assess the goodness of a class.
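To make the two nested loops concrete, here is a minimal R sketch of ours (hypothetical data, not the book's implementation): the outer loop ranges over polynomial degrees (the structures), the call to lm inside it performs the parametric identification, and a held-out validation set provides the assessment criterion:

# Structural (outer) and parametric (inner) identification loops.
set.seed(0)
x <- runif(60, 100, 140)
y <- 0.002 * (x - 100)^2 + 20 + rnorm(60, sd = 1)  # hypothetical data
itr <- sample(60, 40)                     # training/validation split
val.err <- numeric(4)
for (s in 1:4) {                          # structural loop over model structures
  h <- lm(y ~ poly(x, s), subset = itr)   # parametric identification by least squares
  e <- y[-itr] - predict(h, data.frame(x = x[-itr]))
  val.err[s] <- mean(e^2)                 # validation phase
}
which.min(val.err)                        # selection phase: best structure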

Statistical machine learning

On the basis of the previous section, we could argue that learning is nothing more than a standard problem of optimisation. Unfortunately, the reality is far more complex. In fact, because of the finite amount of data and their random nature, there exists a strong correlation between the parametric and structural identification steps, which makes the problem of assessing and, finally, choosing the prediction model non-trivial. The random nature of the data demands a definition of the problem in stochastic terms and the adoption of statistical procedures to choose and assess the quality of a prediction model. In this context, a challenging issue is how to determine the class of models most appropriate to our problem. Since the results of a learning procedure are found to be sensitive to the class of models chosen to fit the data, statisticians and machine learning researchers have proposed over the years a number of machine learning algorithms. Well-known examples are linear models, neural networks, local modelling techniques, support vector machines, and regression trees. The aim of such learning algorithms, many of which are presented in this book, is to combine high generalisation with an effective learning procedure.

However, the ambition of this handbook is to present machine learning as a scientific domain that goes beyond the mere collection of computational procedures. Since machine learning is deeply rooted in conventional statistics, any introduction to this topic must include some introductory chapters on the foundations of probability, statistics and estimation theory. At the same time, we intend to show that machine learning widens the scope of conventional statistics by focusing on a number of topics often overlooked by statistical literature, like nonlinearity, large dimensionality, adaptivity, optimisation and analysis of massive datasets.

It is important to remark, also, that the recent adoption of machine learning models is showing the limitation of pure black-box approaches, targeting accuracy at the cost of interpretability. This is made evident by the embedding of automatic approaches in decision-making processes with impact on ethical, social, political, or juridical aspects. While we are personally skeptical about gaining any interpretability from a large number of parameters and hyperparameters underlying a supervised learner, we are confident that human insight can be obtained by techniques able to reduce or modularise large variate tasks. In this direction, feature selection and causal inference techniques are promising approaches to master the complexity of data-driven modelling and return human accessible descriptions (e.g. in the form of mechanisms).

This manuscript aims to find a good balance between theory and practice by situating most of the theoretical notions in a real context with the help of practical examples and real datasets. All the examples are implemented in the statistical programming language R [160] and made available by the companion package gbcode (Appendix F). In this second edition, we provide as well a number of Shiny dashboards (Appendix G) to give the reader a more tangible idea of somewhat abstract concepts. For an introduction to R we refer the reader to [53, 189]. This practical connotation is particularly important since machine learning techniques are nowadays more and more embedded in plenty of technological domains, like bioinformatics, robotics, intelligent control, speech and image recognition, multimedia, web and data mining, computational finance, business intelligence.

Outline

The outline of the book is as follows. Chapter 2 is one of the novelties of the second edition. Its aim is to situate the process of modelling from data in a larger epistemological domain dealing with the problem of extracting knowledge from observations. We deem it interesting to show how some of the formal problems addressed in the book date back to old philosophical disputes and works. Chapter 3 summarises the relevant background material in probability. Chapter 4 has been added to introduce graphical modelling, a flexible and interpretable way of representing large variate problems in probabilistic terms. In particular, this formalism puts into evidence the importance of conditional independence as a key notion to illustrate the properties of dependencies and simplify the modelling of large dimensional tasks. Chapter 5 introduces the parametric approach to estimation and hypothesis testing. Chapter 6 presents some nonparametric alternatives to the parametric techniques discussed in Chapter 5. Chapter 7 introduces supervised learning as the statistical problem of assessing and selecting a hypothesis function on the basis of input/output observations. Chapter 8 reviews the steps which lead from raw observations to a final model. This is a methodological chapter that introduces some algorithmic procedures underlying most of the machine learning techniques. Chapter 9 presents conventional linear approaches to regression and classification. Chapter 10 introduces some machine learning techniques which deal with nonlinear regression and classification tasks. Chapter 11 presents the model averaging approach, a recent and powerful way for obtaining improved generalisation accuracy by combining several learning machines. Chapter 12 deals with the problem of dimensionality reduction and in particular with feature selection strategies. Chapter 13 has been added in the 2nd edition to make clear the limitations of associational approaches and to stress the risk of wrong extrapolations and biases if pure statistical results are interpreted in a causal manner. We believe that causal reasoning represents the ultimate step in the data analytics process going from data to knowledge.

Although the book focuses on supervised learning, some related notions of unsupervised learning and density estimation are presented in Appendix A.


1.1 Notations

Throughout this manuscript, boldface denotes random variables and normal font is used for instances (realisations) of random variables. Strictly speaking, one should always distinguish in notation between a random variable and its realisation. However, we will adopt this extra notational burden only when the meaning is not clear from the context. Then we will use Prob{z} (or p(z)) as a shorthand for Prob{z = z} (or p_z(z)) when the identity of the random variable is clear from the context.

As far as variables are concerned, lowercase letters denote scalars or vectors of observables, Greek letters denote parameter vectors, and uppercase denotes matrices. Uppercase in italics denotes generic sets, while uppercase Greek letters denote sets of parameters.

Gender-neutral pronoun: computer sciences suffer from the gender issue and probably much more than other sciences. Of course, you won't find any solution in this book but the author (a man) felt odd in referring to a generic reader by using a masculine pronoun only. He then decided to use as much as possible a "(s)he" notation or, alternatively, a (balanced) random gender choice.

Generic notation

- θ: Parameter vector.
- θ (boldface in the book): Random parameter vector.
- M: Matrix.
- [N × n] or [N, n]: Dimensionality of a matrix with N rows and n columns.
- M^T: Transpose of the matrix M.
- diag[m_1, ..., m_N]: Diagonal matrix with diagonal [m_1, ..., m_N].
- M (boldface): Random matrix.
- θ̂: Estimate of θ.
- θ̂ (boldface): Estimator of θ.
- τ: Index in an iterative algorithm.

Probability Theory notation

- Ω : Set of possible outcomes.

-ω : Outcome (or elementary event).

-{E} : Set of possible events.

-E : Event.

- Prob {E} : Probability of the event E.

- (Ω, {E}, Prob {·} ): Probabilistic model of an experiment.

-Z : Domain of the random variable z.

1.1. NOTATIONS 23

-P (z ): Probability distribution of a discrete random variable z . Also Pz (z).

-F (z ) = Prob {z z } : Distribution function of a continuous random variable

z. Also Fz (z ).

-p (z ): Probability density of a continuous r.v.. Also pz (z).

-E [z ]: Expected value of the random variable z.

- E_x[z] = ∫_X z(x, y) p(x) dx: Expected value of the random variable z averaged over x.

- Var [z ]: Variance of the random variable z.

- L_N(θ): Likelihood of a parameter θ given the dataset D_N.

- l_N(θ): Log-likelihood of a parameter θ given the dataset D_N.

- U(a, b): univariate uniform probability density between a and b (a ≤ b).

- N(µ, σ²): univariate Normal probability density with mean µ and variance σ² (Section 3.4.2).

- z ∼ p_z(z): random variable z with probability density p_z(z).

- z ∼ N(µ, σ²): random variable z with Normal density with mean µ and variance σ².

Learning Theory notation

- x: Multidimensional random input variable.

- x_j: jth component of the multidimensional input variable.

- X ⊂ R^n: Input space.

- y: Multidimensional output variable.

- Y ⊂ R: Output space.

- x_i: ith observation of the random vector x.

- x_ij: ith observation of the jth component of the random vector x.

-f (x ): Target regression function.

-w : Random noise variable.

- z_i = ⟨x_i, y_i⟩: Input-output example (also observation or data point): ith case in the training set.

-N : Number of observed examples in the training set.

-DN ={z1 , z2, . . . , zN } : Training set.

- Λ: Class of hypotheses.

-α : Hypothesis parameter vector.

-h (x, α ): Hypothesis function.


- Λs : Hypothesis class of capacity (or complexity) s.

-L (y, f (x, α )): Loss function.

-R (α ): Functional risk.

- α_0: arg min_{α ∈ Λ} R(α).

-Remp (α ): Empirical functional risk.

- α_N: Parameter which minimises the empirical risk of D_N.

-GN : Mean integrated squared error (MISE).

-l : Number of folds in cross-validation.

- Ĝ_cv: Cross-validation estimate of G_N.

- Ĝ_loo: Leave-one-out estimate of G_N.

-Ntr : Number of examples used for training in cross-validation.

-Nts : Number of examples used for test in cross-validation.

-D(i) : Training set with the ith example set aside.

- α_N^(i): Parameter which minimises the empirical risk of D^(i).

- Ĝ_bs: Bootstrap estimate of G_N.

- D^(b): Bootstrap training set of size N generated from D_N by sampling with replacement.

-α(b) : Parameter which minimises the empirical risk of the bootstrap set D(b) .

- B: Number of bootstrap sets (replicates).

Data analysis notation

- x_i: ith row of matrix X.

- x_·j: jth column of matrix X.

- x_ij: jth element of vector x_i.

- X_ij: (i, j)th element of matrix X.

-q : Query point (point in the input space where a prediction is required).

- ŷ_q: Prediction at the query point.

- ŷ_i^j: Leave-one-out prediction in x_i with the jth example set aside.

- e_loo^j = y_j - ŷ_j^j: Leave-one-out error with the jth example set aside.

-K (· ): Kernel function.

-B : Bandwidth.


-β : Linear coefficients vector.

- β̂: Least-squares parameter vector.

- β̂_j: Least-squares parameter vector with the jth example set aside.

- h_j(x, α): jth (j = 1, . . . , m) local model in a modular architecture.

-ρj : Activation or basis function.

-ηj : Set of parameters of the activation function.


Chapter 2

Setting the foundations: machine learning and epistemology

Machine learning is a relatively new discipline, but its foundations rest on much

older notions like modelling, reasoning, information, truth, knowledge, uncertainty,

induction. Nowadays, many of those notions have a mathematical and/or computational interpretation, also thanks to machine learning. Nevertheless, before

reaching a mathematical formalisation, they have been the object of an extensive

philosophical inquiry and discussion. The aim of this chapter (primarily inspired by

the book [81]) is to provide a rapid historical journey over the most important con-

tributions of philosophy to epistemology, the branch of philosophy that

investigates how humans extract and attain knowledge in the scientific process.

The two main phases of human reasoning are the acquisition of true knowledge

and its manipulation in a truth-preserving manner. Induction is concerned with the

first part, while deductive logic addresses the second one. In ancient times, logic

was the only aspect of knowledge that deserved the attention of philosophers and

epistemologists. A possible reason was that, until the scientific revolution, it was a

common belief that either truth was inaccessible (e.g. the allegory of Plato's cave)

or could be attained only by an initiatory process of inspiration, made possible by

the benevolence of God.

2.1 Deductive logic

The most ancient discipline formalising the notions of truth, reasoning, and knowl-

edge is logic, whose origin dates back to Aristotle. Logic is concerned with defining

the properties that reasoning mechanisms should have in order to transform con-

sistently true statements into other true statements. The objects of reasoning are

arguments, i.e. groups of propositions where a proposition is a statement that can

be either true or false. According to [106] an argument (or inference) is made of

two groups of statements, one of which (premises) is claimed to provide support for

the other (conclusions). For instance

If A, then C
A
----------
C



is an argument where the group of premises is made of the two propositions ("If A, then C" and "A") and the conclusion is the proposition "C". Premises are the

statements that define the evidence while the conclusion is the statement that the

evidence is supposed to imply. An argument consisting of exactly two premises and

one conclusion, like the one above, is called a syllogism. If one of the two premises is

in the conditional form (as the example above), it is called a hypothetical syllogism.

Logic cannot, in general, tell whether premises are true or false (factual claim). It

is instead concerned with the quality of the reasoning process, which links premises

to conclusions (inferential claim). Its purpose is to develop methods and techniques

that allow us to distinguish good arguments (where the premises do support the

conclusion) from bad ones. In particular, logic distinguishes between validity and

sound arguments. An argument is valid if

it is logically impossible for the conclusion to be false when the premises are

true,

conclusion is a logical consequence of (it follows from) the premises,

it is truth-preserving, i.e. the conclusion is implicitly contained in the premises.

Two examples of valid arguments are

1. Premises: "If A, then C." and "A is true". Conclusion: "C is true".

2. Premises: "Every F is G." and "b is F". Conclusion: "b is G".

Validity is determined by the relationship between premises and conclusion ("do the premises support the conclusion?") and not by the actual truth of premises and/or conclusions¹. It follows that valid arguments are risk-free

arguments. Note also that the validity of an argument depends only on its form

(or pattern) and not on the content (i.e. no matter what is substituted for A

and C in the first argument). A valid argument is also called a deductive argument.

Examples of deductive arguments are arguments in which the conclusion depends

on some arithmetic or geometric computations or mathematical demonstrations.

All arguments in pure mathematics are deductive.

An argument is sound if it is valid and its premises are true. Soundness for

deductive logic has to do with both the validity and truth of the premises. Every

sound argument, by definition, will have a true conclusion as well. For instance, the

argument

All Italians play pretty good football
Gianluca is Italian
----------
Gianluca plays pretty good football

is valid since the conclusion follows necessarily from the premises but not sound

(otherwise, Gianluca would have been playing for Fiorentina AC).

2.2 Formal and empirical science

Epistemologists are used to distinguishing between formal and empirical sciences.

Deductive arguments are the workhorse of formal sciences like geometry and math-

ematics. Those disciplines are built on a number of axioms, taken as true, and

¹ With the exception that a deductive argument with true premises and a false conclusion is necessarily invalid.


on an effective truth-preserving mechanism. As such, they reason about a concep-

tual world, not necessarily in relation to the material world, where it is possible

to define notions of truth, correctness, and soundness. On the empirical side, we

find disciplines like physics, biology, and economics, whose statements are supposed

to have a strong relationship with (some aspects of) sensible human experiences.

Though empirical sciences often rely on formal sciences to define notions, concepts,

and models, the validity of an empirical science proposition does not derive exclu-

sively from its formal truth but essentially from the fact that its predictions are

in accordance with experimental observations. Empirical sciences then make use of

inductive arguments where the content of the conclusion is in some way intended

to go beyond the content of the premises: a typical example is a prediction about

a future event based on the observation of some events, i.e. the supervised learning

scenario illustrated in Figure 1.1.

Modern empirical science, and the critical analysis of its inductive basis, be-

gan around the 16th and 17th centuries when the demand for new technologies

(e.g. for military or exploration reasons) stimulated the inquiry into the origins of

knowledge. In 1620 Francis Bacon, an English philosopher (1561-1626), published

the Novum Organum , which presented an inductivist view of science. According to

Bacon, scientific reasoning consists of making generalisations, or inductions , from

observations to general laws of nature (e.g. moving to the conclusion that all swans

are white after a number of historical observations). In other terms, the observa-

tions are supposed to induce the formulation of natural laws in the mind of the

scientist.

2.3 Induction, projection, and abduction

Induction is defined as an inference in which one takes the past as grounds for

beliefs about the future or the observed as grounds for beliefs about the unob-

served. In other words, an inductive inference is ampliative, i.e. it has more content in the

conclusion than in the premises, unlike logical reasoning, which is deductive and

non-ampliative. In inductive inference, the premises or departure points are called

data or observations, and the conclusions are referred to as hypotheses². A probabilistic language is usually adopted to express a hypothesis derived from induction.

Induction has the following properties that contrast with the deductive pattern of

inference [19]

1. The conclusion (e.g. hypothesis h(D)) follows non-monotonically from the premises (e.g. the dataset D). The addition of an extra premise (i.e. more data) might change the conclusion even when the extra premise does not contradict any of the other premises. In other terms, D1 ⊆ D2 ⇏ h(D1) ⊆ h(D2), where h(D) is the inductive consequence of the set of observations D.

2. The truth of the premises is not enough to guarantee the truth of the conclu-

sion as there is no correspondence to the notion of deductive validity.

3. There is an information gain in induction since a hypothesis asserts more than

data alone.

Another substantial difference is that, while logical arguments derive their va-

lidity from their form, this does not apply to inductive arguments: two inductive

arguments may have the same form, but one may be good and the other not. So

inductive inference is both useful and unsafe: no conclusion is a guaranteed truth,

and it can dissolve even if no premise is removed.

² Note that this should not be confused with what mathematicians call mathematical induction, which is a kind of deduction.


There are several forms of inductive arguments:

1. Statement about a sample drawn from a population → Statement about the population as a whole

2. Statement about a population → Statement about a sample

3. Statement about a sample → Statement about a new sample

4. Observation of facts → Hypothesis

The third form of inference is also called projection [81] and is implemented in statis-

tical learning by memory-based (e.g. lazy learning in Section 10.1.11) or transduc-

tion algorithms. The fourth form is also known as abduction , explanatory inference,

deduction in reverse or inference to the best explanation. Abduction is a less ambi-

tious form of induction since it does not infer to a generalisation but to a hypothesis

that explains the data. In abduction, given h D and the observation of D we

infer the condition h. The rationale is that explanatory considerations are a guide

to inference: in other words, the hypothesis that would (if correct) best explains

the evidence is the hypothesis that is most likely to be correct. Note that this is

the mechanism typically used in statistical hypothesis testing (Section 5.11).

An example of abduction is Darwin's theory. In his time, Darwin inferred the

hypothesis of natural selection because, though not entailed by biological evidence,

natural selection would provide the best explanation of that evidence. Darwin

did not witness specific cases of evolution but formulated his hypothesis as an

explanation of the available observations.

2.4 Hume and the induction problem

The downside of induction's success is its problematic and unsafe aspect, i.e. the

projection of regularity onto unseen cases. The main problem of induction is how

to justify the inference from the observed (data) to the unobserved (laws of nature),

from the past (historical time series) to the future (e.g. prediction).

David Hume (1711-1776) was a Scottish philosopher who studied the problem of

induction from a philosophic perspective. In 1739 he published A treatise of human

nature, one of the most influential books of Western philosophy. According to Hume,

all reasonings concerning nature are founded on experience, and all reasonings from

experience are founded on the supposition that the course of nature will continue

uniformly the same or in other terms that the future will be like the past. Any

attempt to show, based on experience, that a regularity that has held in the past

will hold in the future too will be circular (since based on the principle of regularity

itself).

So empirical sciences rely on a supposition that, as shown by Hume, has no

logical necessity. In other words, there is no contradiction in supposing that the

future could be totally unlike the past (Figure 2.1) since we have no logical reason

to expect that the past resembles the future.

So why do humans expect the future to be like the past? According to Hume,

this is part of human nature: we have inductive habits, but we cannot justify them.

The principle of uniformity of nature is not a priori true, nor can it be proved

empirically. There is no reason beyond induction to justify inductive reasoning.

Thus, Hume offers a naturalistic explanation of the psychological mechanism by

which empirical predictions are made but not any rational justification for this

practice. Our inductive practices rest on habit and custom and cannot be justified

by rational argument. Induction is psychologically natural to us [81].


Figure 2.1: Falsification of the inductive hypothesis "Are all swans white?"

2.5 Logical positivism and verificationism

Logical positivism is a philosophical movement belonging to the wider family of

empiricism, which developed in Europe after World War I and was established by a

group of people (including Schlick, Neurath, and Carnap), also known as the Vienna

Circle. They were inspired by the developments in sciences at the beginning of the

20th century, notably the work of Einstein. The two central ideas (or dogmas) of logical positivism are the distinction between analytic and synthetic sentences and the verifiability theory of meaning [81].

Analytic sentences are true or false whatever the state of the world. Analytical truths (e.g. in mathematics and logic) are necessary but somewhat empty. Mathematics does not describe the world and is independent of experience: it is a convention for using symbols in a particular way.

A synthetic sentence is true or false according to the actual state of the world.

The value of synthetic sentences then resides in their method of verification. In other words, knowing the meaning of a sentence boils down to knowing how to verify

it through observation. Verificationism is a strong empiricist principle: the only

source of knowledge and the only source of meaning is observation. There are two

categories of verifiable statements: i) observation statements (e.g. the temperature

is below zero) which are directly verifiable, and ii) theoretical statements (indirectly

verifiable) from which we can deduce observation statements.

Verificationists reject as "meaningless" statements specific to entire fields such as metaphysics, theology, and ethics, since they do not imply verifiable observations.

Such statements may be meaningful in influencing emotions or human behaviour

but provide no truth value, information, or factual content.

Science then consists of verifiable, and therefore meaningful, claims. According to

the philosophy of logical positivism, a general statement or theory can be arrived

at by inductive reasoning. Moreover, if such a theory is verified by observation or

experiment, it can be promoted to a law. It follows that verifiability is the criterion

of what is and what is not science (demarcation criterion).

Logical positivists stress that almost none of the evidence in everyday life and

science may have the same degree of necessity as deductive logic. No evidence for a

scientific theory is ultimately decisive since there is always the possibility of error,

but this does not prevent science from being supported by evidence. The great aim

of science is to discover and establish generalisations since there is no alternative to

knowledge besides experience.


However, the verificationist ambition of grounding scientific truth in experience

encountered some major problems related to the real possibility of verifying hypotheses in practice:

1. pure observations do not exist: observations are always theory-laden , i.e. they

are inevitably affected by the theoretical beliefs (or expectations) of the inves-

tigator. Observations are neither neutral nor exhaustive, even in a big data

world. To observe means to select what seems to be pertinent for the analysis,

and this demands a specific and voluntary action from the experimenter (e.g.

selection of the instrumentation or the language to communicate the results).

Unfortunately, the analyst is often unaware of such a selection, thereby inducing dangerous biases in the possible conclusions (Section 13.7.4).

2. no scientific assumption is testable in complete isolation (also known as the

problem of holism about testing): the dogma of verificationism is naive since,

in practice, only whole complex structured hypotheses may be submitted to

empirical tests. Our ideas and hypothesis have contact with the experience

only as a whole. Whenever we assess a theory by comparing it with observa-

tions, we need many additional assumptions to put a theoretical statement at

the same level of observations.

3. unobservable entities escape from verification: one of the basic claims of logical

positivists is that all aspects of science can be reduced to observational state-

ments and submitted to verification (in science, there are no depths, there is

surface everywhere ). However, many successful and universally accepted sci-

entific formulations rely on hidden structures and notions that are not directly

observable (or mapped to observations in a univocal manner). Consider, for

instance, the notions of gene or electron and the significant impact they have

on the human understanding of reality.

Such criticisms contributed to the decline of the positivist program and opened the

way to alternative interpretations of the knowledge discovery process.

2.6 Popper and the problem of induction

Karl Popper (1902-1994) is generally regarded as one of the greatest philosophers

of science of the 20th century. His first achievement was an original definition of

science based on the distinction between scientific and pseudo-scientific statements

(also known as the demarcation problem). The solution he proposes is called falsifi-

cationism in opposition to the verificationism of positivists. Falsificationism claims

that a hypothesis is scientific if and only if it has the potential to be refuted by

some possible observation. To be scientific, a hypothesis has to entail testable pre-

diction; in other words, it has to be bold, to take a risk. For instance, "All F is G" is a scientific statement while "Some F is G" is not. All scientific theories are univer-

sal in nature, and no finite collection of observation statements, however great, is

logically equivalent to or can justify an unrestricted universal proposition. At the

same time, we are never completely sure that a theory is true (aka fallibilism ). A

well-known example is Newton's physics which was considered for a long time as a

gold standard of scientific theory until it was shown to be false in several respects.

Popper was sceptical about all forms of confirmation and notably about the

theory of confirmation proposed by empiricists. According to him, the only good

reasoning is deductively valid reasoning. According to Popper, humans or scientists

do not make inductions; they make conjectures (or hypotheses) and test them (or

their logical consequences obtained by deduction). If the test is successful, the

conjecture is corroborated but never verified or proven. Confirmation is thus a


myth : no theory or belief about the world can be proven. Though no number of

positive experimental outcomes can demonstrate the truth of a scientific theory, a

single genuine counter instance can refute it (modus tollens ). It follows that we learn

something by deduction and not by induction. If the empirical test of the conjecture

is not successful, the conjecture is refuted. The refutation of a hypothesis leads us

(or the scientific community) to revise it or devise a more robust one. The final

result is that scientific laws are falsifiable yet strictly unverifiable.

Scientific knowledge evolves via a two-step cycle that repeats endlessly: the first

stage is conjecture making; the second stage is attempted refutation, when the hypothesis is submitted to critical testing. The most important qualities of a scientist are then imaginative (almost artistic) creativity and rigorous testing.

Also, according to Popper, there are no "pure" or theory-free observations. Ob-

servation is always selective: it needs a chosen object, a definite task, an interest, a

point of view, a problem. Observation is theory-laden and involves applying theoret-

ical terms, descriptive language, and a conceptual scheme to particular experimental

situations.

2.7 The hypothetico-deductive method and instrumentalism

Nowadays, the most commonly agreed vision of science (hypothetico-deductivism )

merges the main ideas of logic and induction, realism and empiricism, of verifica-

tionism and falsificationism. According to this vision, science is a process where

scientists formulate hypotheses (e.g. inductive step after a preliminary stage where

observations were collected) and then deduce observational predictions from them.

If predictions are accurate, then the theory is supported (in agreement with logical

positivists) or (e.g. in Bayesian terms) its degree of truth increases. If predictions

are not accurate, the theory is disconfirmed (this is coherent with Popper). The

more tests a theory passes, the more confidence we can have in its truth3.

If the value of scientific models is intimately related to the quality of prediction,

they should be seen more as useful tools (or instruments) than a faithful representa-

tion of reality. The notion of instrumentalism was introduced by Van Fraassen [184].

An instrumentalist does not worry about whether a theory is a true description of

the world (e.g. if electrons really exist). The role of a theory is to establish a good

prediction. The question of whether our theory has some deeper match in the real

world will never have an answer so we should stop asking it.

Van Fraassen thinks that the only aim of theories is to accurately describe the

observable parts of the world. If this happens, they are empirically adequate. Trying

to address the hidden nature of reality is of no interest to science.

2.8 Epistemology and machine learning: the cross-fertilisation

This chapter sketched some contributions of epistemology to the understanding of

how humans extract and attain knowledge from observations, in particular during

the scientific endeavour.

Machine learning, the topic of this book, is a computationally based approach

aiming to produce knowledge from observed data. If we make the basic assumption

³ Note also that the predominant use of probabilistic hypotheses to take into account noisy observations is in contradiction with the restrictive vision of Popper on logical deduction.


that both epistemology and machine learning refer to the same notion of knowl-

edge (i.e. knowledge useful for human beings), an epistemological approach can be

useful to understand both limits and potential of machine learning. The author

is convinced that a fruitful cross-fertilisation can derive from a stronger synergy

between epistemology and machine learning. In particular, he expects the following

contributions from a machine learning approach to the study of knowledge discov-

ery:

• machine learning deals intimately with induction, i.e. how observations can induce and/or confirm a theory, one of the most fundamental problems of the philosophy of science. Also, it implements in a reproducible and testable way the mechanisms of learning, hypothesis generation, and testing.

• machine learning is today unavoidable in supporting discovery in scientific domains where human experts would be overwhelmed by complexity and dimensionality.

• machine learning is a key factor of the revolution transforming all empirical sciences into data sciences, i.e. inductive disciplines where the quality and the accuracy of the discoveries are strictly dependent on the capacity of extracting accurate information, predictions, or models from large amounts of observed data.

• machine learning generalises and democratises the notion of observed evidence by making it converge with the notion of data. Every instrument (or tool or simulator) producing data can be taken as the starting point of a knowledge discovery process. This extends the common notion of experimental evidence adopted in conventional sciences, like physics. A financial transaction, a tweet, or a GPS trace may be for some domains as informative as a multi-million CERN experiment in physics.

• machine learning is the ultimate step in the scientific process moving from the optimistic objective of finding true descriptions of reality to the more realistic goal of attaining accurate models of observations.

At the same time, there are a number of lessons that young data scientists could learn from ancient and recent philosophers of science:

• A critical analysis of the role of observations and data: all empirical sciences derive their justification from the fact of being firmly founded on experiments. The distinctive nature of machine learning, and a reason for its success, is the automatic process of extracting knowledge from data. Observations and data are then necessary conditions for triggering any knowledge discovery procedure. There is, however, a risk of sanctifying the role of data (or facts) as an unquestionable and objective foundation of truth. This excess has been discussed and criticised several times by epistemologists (notably the critics of logical positivism). Pure facts and theory-neutral observations do not exist, not even in a big-data world. Observations (and more specifically experiments) are never passive or beyond any suspicion: they are the results of a specific human initiative (or intervention) that can be dictated by specific objectives, constraints, and motivations. The presupposition that the truth of empirical statements can be securely established only by observation is a naive attitude that could lead to disastrous consequences (e.g. sexist or racist AI applications due to sampling bias) [41].

• Skepticism about induction: Hume's analysis, confirmed by theoretical analysis in machine learning (notably the no-free-lunch theorem), reminds us that a tabula rasa approach going from data to knowledge is not possible. There is no univocal (or optimal) way of proceeding from observations to models since every learning process relies on (explicit or, more often, implicit) assumptions. This is also related to the notion of underdetermination of theory by evidence, which means that there will always be a range of alternative theories compatible with the observations.

• Importance of hypothesis generation and validation: this important lesson comes straight from Popper and ties the scientific character of a knowledge discovery process to the possibility of falsification. In that sense, machine learning complies with the Popperian interpretation of science and goes further by proposing a set of strategies for automatically generating hypotheses and validating them against empirical evidence. In more current terms, the best way to ensure the falsifiability of computational sciences is reproducibility and interpretability. These two aspects are essential to guarantee the respect of high standards of quality and rigour in computational approaches to knowledge discovery. Forgetting the assumptions underlying any data-driven effort may lead to accepting biased conclusions and misinterpretations (e.g. from a causal perspective), which are dangerously endorsed by the size of the dataset or the complexity of the algorithmic approach.

• Models as tools: the adoption of a complex representation of reality (though characterised by high-level notions and principles) makes the validation of all the components of a model difficult, if not unrealistic. As a consequence, a model should not be considered a faithful copy of reality but a convenient abstraction which, if confirmed by experimental validation, becomes a useful instrument for prediction and decision making.

• The confirmation of a hypothesis requires taking into account the procedures involved in generating the data: confirming a hypothesis with observations is not a go/no-go process. Since new evidence changes degrees of validity (or degrees of belief), a probabilistic approach is necessary. This is why any introduction to machine learning needs first an introduction to probability and probabilistic reasoning, and then statistics.


Chapter 3

Foundations of probability

Uncertainty is inescapable in the real world. Even without resort to indeterminism,

its pervasiveness is due to the complexity of reality and the limitations of human

observational skills and modelling capabilities. According to [119] uncertainty arises

because of limitations in our ability to observe the world, limitations in our ability

to model it, and possibly even because of innate nondeterminism. Probability theory

is one of many disciplines [143] concerned with the study of uncertain (or random)

phenomena. It is also, according to the author, one of the most successful ones

in terms of formalisation, theoretical and algorithmic developments and practical

applications. For this reason, in this book, we will adopt probability as the math-

ematical language to describe and quantify uncertainty. Uncertain phenomena,

although not predictable in a deterministic fashion, may present some regularities

and consequently be described mathematically by idealised probabilistic models.

These models consist of a list of all possible outcomes together with the respective

probabilities. The theory of probability makes it possible to infer from these models

the patterns of future behaviour.

This chapter presents the basic notions of probability which serve as a necessary

background to understand the statistical aspects of machine learning. We ask the

reader to become acquainted with two aspects: the notion of a random variable

as a compact representation of uncertain knowledge and the use of probability as

an effective formal tool to manipulate and process such uncertain information. In

particular, we suggest the reader give special attention to the notions of conditional

and joint probability. As we will see in the following, these two related notions

are extensively used by statistical modelling and machine learning to define the

dependence and the relationships between random variables.

3.1 The random model of uncertainty

We define a random experiment as any action or process which generates results

or observations which cannot be predicted with certainty. Uncertainty stems from

the existence of alternatives. In other words, each uncertain phenomenon is charac-

terised by a multiplicity of possible configurations or outcomes. Weather is uncer-

tain since it can take multiple forms (e.g. sunny, rainy, cloudy,...). Other examples

of random experiments are tossing a coin, rolling dice, passing an exam or measuring

the time to reach home.

A random experiment is then characterised by a sample space Ω that is a (finite

or infinite) set of all the possible outcomes (or configurations) ω of the experiment.

The elements of the set Ω are called experimental outcomes or realisations. For

example, in the die experiment, Ω = {ω1 , ω2, . . . , ω6 } and ωi stands for the outcome



corresponding to getting the face with the number i . If ω is the outcome of a

measurement of some physical quantity, e.g. pressure, then we could have Ω = R+ .

The representation of an uncertain phenomenon is the result of a modelling

activity and, as such, it is not necessarily unique. In other terms different repre-

sentations of a random experiment are possible. In the die experiment, we could

define an alternative sample space made of only two outcomes: numbers equal to 1 and numbers different from 1. Also, we could be interested in representing the uncertainty of two consecutive tosses. In that case, the outcome would be the pair (ω(t), ω(t+1))

where ω(t) is the outcome at time t.

Uncertainty stems from variability. Each time we observe a random phenomenon,

we may observe different outcomes. In probabilistic jargon, observing a random

phenomenon is interpreted as the realisation of a random experiment. A single

performance of a random experiment is called a trial. This means that after each

trial, we observe one outcome ωi Ω.

A subset of experimental outcomes is called an event. Consider a trial that

generated the outcome ωi: we say that an event E occurred during the trial if the

set E contains the element ωi . For example, in the die experiment, an event (denoted

odd number ) is the set of odd values E = {ω1 , ω3, ω5 } . This means that when we

observe the outcome ω5 the event odd number takes place.

An event composed of a single outcome, e.g. E = {ω1}, is called an elementary

event.

Note that since events E are subsets, we can apply to them the terminology of set theory:

• Ω refers to the certain event, i.e. the event that occurs in every trial.

• The notation

E^c = {ω ∈ Ω : ω ∉ E}

denotes the complement of E.

• The notation

E1 ∪ E2 = {ω ∈ Ω : ω ∈ E1 OR ω ∈ E2}

refers to the event that occurs when E1 or E2 or both occur.

• The notation

E1 ∩ E2 = {ω ∈ Ω : ω ∈ E1 AND ω ∈ E2}

refers to the event that occurs when both E1 and E2 occur.

• Two events E1 and E2 are mutually exclusive or disjoint if

E1 ∩ E2 = ∅    (3.1.1)

that is, each time that E1 occurs, E2 does not occur as well.

• A partition of Ω is a set of disjoint sets Ej, j = 1, . . . , J (i.e. Ej1 ∩ Ej2 = ∅ for j1 ≠ j2) such that

∪_{j=1}^{J} Ej = Ω

• Given an event E, we define the indicator function of E by

I_E(ω) = 1 if ω ∈ E, 0 if ω ∉ E    (3.1.2)


Let us consider now the notion of class of events. An arbitrary collection of

subsets of Ω is not a class of events. We require that if E1 and E2 are events, the

same also holds for the intersection E1 ∩ E2 and the union E1 ∪ E2. A set of events that

satisfies these conditions is called, in mathematical terms, a Borel field [142]. We

will consider only Borel fields since we want to deal not only with the probabilities

of single events but also with the probabilities of their unions and intersections.
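Since events are sets of outcomes, the operations above are straightforward to reproduce in code. Below is a minimal R sketch (illustrative only, not part of the gbcode package) representing die events as vectors of outcomes:

    # Events of the die experiment represented as subsets of the sample space
    omega <- 1:6                        # sample space Omega
    E1 <- c(1, 3, 5)                    # event "odd number"
    E2 <- c(4, 5, 6)                    # event "number greater than 3"
    union(E1, E2)                       # E1 OR E2: 1 3 5 4 6
    intersect(E1, E2)                   # E1 AND E2: 5
    setdiff(omega, E1)                  # complement of E1: 2 4 6
    I_E <- function(E, w) as.integer(w %in% E)   # indicator function (3.1.2)
    I_E(E1, 5)                          # 1: outcome 5 makes the event occur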

3.1.1 Axiomatic definition of probability

Probability is a measure of uncertainty. Once a random experiment is defined, this

measure associates to each possible outcome ω a number between 0 and 1. It follows

that we can assign to each event E a real number Prob{E} ∈ [0, 1] which denotes the probability of the event E. The measure associated with the event including all possibilities is 1. The function Prob{·} : 2^Ω → [0, 1] is called probability measure or

probability distribution and must satisfy the following three axioms:

1. Prob {E} ≥ 0 for any E.

2. Prob{Ω} = 1

3. Prob {E1 ∪ E2 } = Prob {E1 } + Prob {E2 } if E1 and E2 are mutually exclusive

(Equation (3.1.1)).

These conditions are known as the axioms of the theory of probability [120]. The

first axiom states that all the probabilities are nonnegative real numbers. The sec-

ond axiom attributes a probability of unity to the universal event Ω, thus providing

a normalisation of the probability measure. The third axiom states that the prob-

ability function must be additive for disjoint events, consistently with the intuitive

idea of how probabilities behave.

So from a mathematician's perspective, probability is easy to define: it is a countably additive set function defined on a Borel field, with a total mass of one. Every probabilistic property, for instance E1 ⊂ E2 ⇒ Prob{E1} ≤ Prob{E2} or Prob{E^c} = 1 - Prob{E}, can be derived directly or indirectly from the axioms (and only the axioms).
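As a finite-space illustration (a minimal sketch with an assumed fair-die assignment, not taken from the book's code), the axioms and a derived property can be checked in R:

    # A probability measure on the die sample space: a mass on each outcome
    mass <- rep(1/6, 6)                     # fair-die assignment (assumed)
    P <- function(E) sum(mass[E])           # Prob{E} = sum of outcome masses
    P(1:6)                                  # axiom 2: Prob{Omega} = 1
    E1 <- c(1, 3); E2 <- 5                  # two disjoint events
    isTRUE(all.equal(P(c(E1, E2)), P(E1) + P(E2)))    # axiom 3: additivity
    isTRUE(all.equal(P(setdiff(1:6, E1)), 1 - P(E1))) # Prob{E^c} = 1 - Prob{E}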

There are many interpretations and justifications of these axioms; we briefly discuss the frequentist and the Bayesian interpretations in Section 3.1.4. What

is relevant here is that the probability function is a formalisation of uncertainty

and that most of its properties and results appear to be coherent with the human

perception of uncertainty [110].

3.1.2 Visualisation of probability measures

Since probabilistic events are sets of outcomes, Venn diagrams are a convenient

manner to illustrate the relations between events and the notion of probability

measure. Suppose that you are a biker and you are interested in representing

the variability of weather and traffic conditions in your town in the morning. In

particular, you are interested in the probability that the morning will be sunny (or

not) and the road busy (or not). In order to formalise your practical issue, you

could define the uncertainty about the morning state by defining a sample space

which is the set of all possible morning conditions. Two events are of interest here:

sunny mornings and traffic conditions. What is the relationship and probability of

these two events? Figure 3.1 illustrates the sample space, the two events, and the

(hypothetical) probability measures by means of a Venn diagram and two different

tabular representations. The three representations in Figure 3.1 convey the same

information in different manners. Notwithstanding, they do not necessarily scale up in the same manner if we take into consideration a larger number of events.


Figure 3.1: Visualisation of two events and probability measures: Venn diagram (left), two-way table (center), probability distribution table (right). [Recovered values: Prob{Sunny ∩ Traffic} = 0.1, Prob{Sunny ∩ No Traffic} = 0.2, Prob{Not Sunny ∩ Traffic} = 0.25, Prob{Not Sunny ∩ No Traffic} = 0.45.]

Figure 3.2: Visualisation of three events (Sunny, Traffic, Polluted) and related probability measures: Venn diagram (left), probability distribution table (right).

For instance, for n events the Venn diagram should contain all 2ⁿ hypothetically possible zones¹.

Suppose that you are also interested in another type of event, i.e. the air quality.

Adding such an event to your probability representation would make your Venn

representation more complicated and the two-way table inadequate (Figure 3.2).

The visualisation will be even more difficult to handle and interpret if we deal with

more than three events.

Given their difficulty of encoding information in realistic probabilistic settings,

Venn diagrams are a pedagogical yet very limited tool for representing uncertainty.
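A tabular representation, on the other hand, is directly machine-manipulable. The following minimal R sketch (using the probability values recovered from Figure 3.1; illustrative, not from gbcode) encodes the Sunny/Traffic measure as a data frame:

    # Probability distribution table of Figure 3.1 as a data frame
    joint <- expand.grid(Sunny = c(1, 0), Traffic = c(1, 0))
    joint$P <- c(0.10, 0.25, 0.20, 0.45)  # one mass per (Sunny, Traffic) pair
    joint                                 # the tabular representation
    sum(joint$P)                          # a valid measure: total mass is 1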

Once the notion of probability has been introduced, a major question remains open:

how to compute the probability value Prob {E} for a generic event E? The assign-

ment of probabilities is perhaps the most difficult aspect of constructing probabilistic

models. Although the theory of probability is neutral, that is it can make infer-

ences regardless of the actual probability values, its results will be strongly affected

by the choice of a particular assignment. This means that if the assignments are

inaccurate, the predictions of the model will be misleading and will not reflect the

real behaviour of the modelled phenomenon. In the following sections, we are going

to present some procedures which are typically adopted in practice.

3.1.3 Symmetrical definition of probability

Consider a random experiment where the sample space is made of a finite number M of symmetric outcomes (i.e., they are equally likely to occur). Let the number of outcomes that are favourable to the event E (i.e. the event E takes place if one of them occurs) be M_E.

An intuitive definition of the probability (also known as the classical definition) of the event E, which adheres to the axioms, is

Prob{E} = M_E / M    (3.1.3)

¹ See Wikipedia: https://en.wikipedia.org/wiki/Venn_diagram


In other words, according to the principle of indifference (a term popularised by

J.M. Keynes in 1921), we have that the probability of an event equals the ratio of its

favourable outcomes to the total number of outcomes provided that all outcomes are

equally likely [142]. The computation of this quantity requires combinatorial meth-

ods for counting the favourable outcomes. This is typically the approach adopted

for a fair die. Also, in most cases, the symmetric hypothesis is accepted as self-

evident: if a ball is selected at random from a bowl containing W white balls and B

black balls, the probability that we select a white one is W/( W+ B ).
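The classical definition amounts to simple enumeration. A minimal R sketch (with made-up counts W = 3 and B = 7, for illustration only):

    # Classical probability by counting equally likely outcomes
    W <- 3; B <- 7
    bowl <- c(rep("white", W), rep("black", B))  # the M = W + B outcomes
    M_E <- sum(bowl == "white")                  # outcomes favourable to E
    M <- length(bowl)
    M_E / M                                      # Prob{E} = W/(W + B) = 0.3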

Note that this number is determined without any experimentation and is based on symmetrical and finite space assumptions. But how can we be sure that the symmetrical hypothesis holds? And that it is invariant? Think, for instance, of the probability that a newborn is a boy. Is this a symmetric case? More generally, how would one define the probability of an event if the symmetrical hypothesis does not necessarily hold or the space is not finite?

3.1.4 Frequentist definition of probability

Let us consider a random experiment and an event E . Suppose we repeat the

experiment N times and that we record the number of times NE that the event E

occurs. The quantity

N_E / N    (3.1.4)

comprised between 0 and 1 is known as the relative frequency of E. It can be

observed that if the experiment is carried out a large number of times under exactly

the same conditions, the frequency converges to a fixed value for increasing N . This

observation led von Mises to use the notion of frequency as a foundation for the

notion of probability.

Definition 1.1 (von Mises). The probability Prob{E} of an event E is the limit

Prob{E} = lim_{N→∞} N_E / N    (3.1.5)

where N is the number of observations and N_E is the number of times that E occurred.

This definition appears reasonable, and it is compatible with the axioms in

Section 3.1.1. However, in practice, in any physical experiment, the number N is finite², and the limit has to be accepted as a hypothesis, not as a number that can

be determined experimentally [142].
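The behaviour of the relative frequency for finite N is easy to simulate. A minimal R sketch (illustrative, in the spirit of the gbcode dashboards but not taken from them):

    # Relative frequency N_E/N of the event E = "die shows an odd number"
    set.seed(0)
    N <- 10000
    tosses <- sample(1:6, N, replace = TRUE)
    N_E <- cumsum(tosses %% 2 == 1)       # occurrences of E after each trial
    rel_freq <- N_E / (1:N)
    rel_freq[c(10, 100, 1000, 10000)]     # drifts towards Prob{E} = 0.5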

Moreover, the assumption under exactly the same conditions is not as innocuous

as it seems. How could you ensure that two experiments occur under exactly the

same conditions? And what do those conditions refer to? Temperature, humidity,

obsolescence of the equipment? Are humans really able to control exactly all of

them? Would you be able to reproduce the exact same conditions of an experiment?

Notwithstanding, the frequentist interpretation is very important to show the

links between theory and application. At the same time, it appears inadequate to

represent probability when it is used to model a subjective degree of belief. Think,

for instance, of the probability that your professor wins a Nobel Prize: how would one define in such a case a number N of repetitions?

An important alternative interpretation of the probability measure comes then

from the Bayesian approach. This approach proposes a degree-of-belief interpreta-

tion of probability according to which Prob {E} measures an observer's strength of

belief that E is or will be true [192]. This manuscript will not cover the Bayesian

² As Keynes said, "In the long run we are all dead".


approach to statistics and data analysis for the sake of compactness, though the

author is well aware that Bayesian machine learning approaches are more and more

common and successful. Readers interested in the foundations of the Bayesian in-

terpretation of probability are referred to [110]. Readers interested in introductions

to Bayesian machine learning are referred to [78, 13].

3.1.5 The Law of Large Numbers

A well-known justification of the frequentist approach is provided by the Weak Law

of Large Numbers, proposed by Bernoulli.

Theorem 1.2. Let Prob{E} = p and suppose that the event E occurs N_E times in N trials. Then N_E/N converges to p in probability, that is, for any ε > 0,

Prob{|N_E/N - p| ≤ ε} → 1   as N → ∞

According to this theorem, the ratio N_E/N is close to p in the sense that, for any ε > 0, the probability that |N_E/N - p| ≤ ε tends to 1 as N → ∞. This result justifies

the widespread use of the frequentist approach (e.g. in Monte Carlo simulation) to

illustrate or numerically solve probability problems. The relation between frequency

and probability is illustrated by the Shiny dashboard lawlarge.R (package gbcode).

Note that such a result does not imply that the number N_E will be close to Np, as one could naively infer from (3.1.5). In fact,

Prob{N_E = Np} ≈ 1/√(2πNp(1 - p)) → 0,   as N → ∞    (3.1.6)

For instance, in a fair coin-tossing game, this law does not imply that the ab-

solute difference between the number of heads and tails should oscillate close to

zero [180] (Figure 3.3). On the contrary, it could happen that the absolute differ-

ence keeps growing (though at a slower rate than the number of tosses) as shown

in the R script freq.R and the Shiny dashboard lawlarge.R.
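A sketch of the same phenomenon in R (inspired by the description of freq.R, though not reproducing it) is below: the relative frequency of heads approaches 1/2, while the absolute difference between heads and tails typically keeps growing.

    # Fair coin tossing: frequency converges, absolute difference grows
    set.seed(1)
    N <- 1e6
    heads <- cumsum(sample(c(0, 1), N, replace = TRUE))
    n <- 1:N
    abs_diff <- abs(2 * heads - n)        # |#heads - #tails| after n tosses
    round(heads[c(1e2, 1e4, 1e6)] / c(1e2, 1e4, 1e6), 3)  # -> 0.5
    abs_diff[c(1e2, 1e4, 1e6)]            # typically grows, roughly like sqrt(n)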

3.1.6 Independence and conditional probability

Let us consider two different events. We have already introduced the notions of

complementary and disjoint events. Another important definition is the definition

of independent events and the related notion of conditional probability. This notion

is essential in machine learning since supervised learning aims to detect and model

(in)dependencies by estimating conditional probabilities.

Definition 1.3 (Independent events) . Two events E1 and E2 are independent if and

only if

Prob {E1 ∩ E2 } = Prob {E1 } Prob {E2 } (3.1.7)

and we write E1 ⊥ E2 .

The probability Prob {E1 ∩ E2 } of seeing two events occurring together is also

known as the joint probability and often denoted Prob{E1, E2}. If two events are inde-

pendent the joint probability depends only on the two individual probabilities. As

an example of two independent events, think of two outcomes of a roulette wheel

or of two coins tossed simultaneously.
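Definition 1.3 is easy to check empirically. A minimal R sketch with two dice rolled together (an assumed example, not from the book):

    # Empirical check of independence for two physically unrelated events
    set.seed(2)
    N <- 1e5
    d1 <- sample(1:6, N, replace = TRUE)   # first die
    d2 <- sample(1:6, N, replace = TRUE)   # second die
    E1 <- d1 %% 2 == 0                     # "first die is even",  p = 1/2
    E2 <- d2 >= 5                          # "second die is >= 5", p = 1/3
    mean(E1 & E2)                          # joint frequency, ~ 1/6
    mean(E1) * mean(E2)                    # product of marginals, ~ the same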

From an uncertain reasoning perspective, independence is a very simplistic as-

sumption since the occurrence (or the observation) of one event has no influence

on the occurrence of the other or, similarly, since the second event has no memory


Figure 3.3: Fair coin-tossing random experiment: evolution of the relative frequency

(left) and of the absolute difference (right) between the number of heads and tails

(R script freq.R in gbcode).

of the first. In other words, independence considers the uncertainty of a complex

joint event as a function of the uncertainties of its components³. This makes the

reasoning much simpler but, at the same time, too rough.

Exercise

Suppose that a fair die is rolled and that the number ω appears. Let E1 be the event that the number ω is even, E2 be the event that the number ω is greater than or equal to 3, and E3 be the event that the number ω is a 4, 5 or 6.

Are the events E1 and E2 independent? Are the events E1 and E3 independent?

Let E1 be an event such that Prob{E1} > 0 and E2 a second event. We define the conditional probability of E2, given that E1 has occurred, as the revised probability of E2 after we learn about the occurrence of E1:

Definition 1.4 (Conditional probability). If Prob{E1} > 0 then the conditional probability of E2 given E1 is

Prob{E2|E1} = Prob{E1 ∩ E2} / Prob{E1}    (3.1.8)

The following result derives from the definition of conditional probability.

Lemma 1. If E1 and E2 are independent events, then

Prob {E1|E2 } = Prob {E1 } (3.1.9)

In qualitative terms, the independence of two events means that the fact of

observing (or knowing) that one of these events (e.g. E1 ) occurred does not change

the probability that the other (e.g. E2 ) will occur.

³ We refer the interested reader to the distinction between extensional and intensional reasoning in [147]. Extensional reasoning (e.g. logics) always makes an assumption of independence, while intensional reasoning (e.g. probability) considers independence as an exception.


Example

Let E1 and E2 be two disjoint events with positive probability. Can they be independent? The answer is no, since

Prob{E1 ∩ E2} = Prob{∅} = 0 ≠ Prob{E1} Prob{E2} > 0

or equivalently Prob {E1|E2 } = 0. We can interpret this result by noting that if

two events are disjoint, the realisation of one of them is highly informative about

the realisation of the other. For instance, though it is very probable that Italy

will win the next football World Cup (Prob{E1} >> 0), this probability goes to zero if the (rare yet possible) event E2 ("World Cup won by Belgium") occurs (Prob{E1|E2} = 0). The two events are then dependent.

Exercise

Let E1 and E2 be two independent events, and E1^c the complement of E1. Are E1^c and E2 independent?

Exercise

Consider the sample space Ω and the two events E1 and E2 in Figure 3.4. Suppose

that the probability of the two events is proportional to the surface of the regions.

From the Figure we compute

Prob{E1} = 9/100 = 0.09    (3.1.10)

Prob{E2} = 20/100 = 0.2    (3.1.11)

Prob{E1 ∩ E2} = 1/100 = 0.01 ≠ Prob{E1} Prob{E2}    (3.1.12)

Prob{E1 ∪ E2} = 0.28 = Prob{E1} + Prob{E2} - Prob{E1 ∩ E2}    (3.1.13)

Prob{E1|E2} = 1/20 = 0.05 ≠ Prob{E1}    (3.1.14)

Prob{E2|E1} = 1/9 ≠ Prob{E2}    (3.1.15)

and then derive the following conclusions: the events E1 and E2 are neither disjoint

nor independent. Also, it is more probable that E2 occurs given that E1 occurred

rather than the opposite.
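The computations above can be checked mechanically; a short R sketch using the surface-based probabilities of Figure 3.4:

    # Verification of (3.1.12)-(3.1.15)
    p1 <- 9/100; p2 <- 20/100; p12 <- 1/100
    p12 == p1 * p2          # FALSE: E1 and E2 are not independent (3.1.12)
    p1 + p2 - p12           # 0.28 = Prob{E1 OR E2} (3.1.13)
    p12 / p2                # 0.05 = Prob{E1|E2} (3.1.14)
    p12 / p1                # 0.111 = Prob{E2|E1} (3.1.15)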

From (3.1.8) we derive

Prob{E1, E2} = Prob{E1} Prob{E2|E1}    (3.1.16)

If we replace the event E2 with the intersection of two events E2 and E3 , from (3.1.16)

we obtain

Prob{E1, E2, E3} = Prob{E1} Prob{E2, E3|E1} = Prob{E1} Prob{E2|E3, E1} Prob{E3|E1} = Prob{E1, E3} Prob{E2|E3, E1}

If we divide both sides by Prob{E3} we obtain

Prob{E1, E2|E3} = Prob{E1|E3} Prob{E2|E1, E3}    (3.1.17)

which is the conditioned version of (3.1.16).


Figure 3.4: Events in a sample space.

3.1.7 The chain rule

The equation (3.1.16) shows that a joint probability can be factorised as the prod-

uct of a conditional and an unconditional probability. In more general terms, the

following rule holds.

Definition 1.5 (Chain rule). For any sequence of events E1, E2, . . . , En,

Prob{E1, E2, . . . , En} = Prob{E1} Prob{E2|E1} Prob{E3|E1, E2} · · · Prob{En|E1, E2, . . . , En-1}

We will see in Chapter 4 that the chain rule factorisation and the notion of

conditional independence play a major role in the adoption of graphical models to

represent probability distributions.
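The chain rule can be verified on any finite joint distribution. A minimal R sketch with a made-up joint distribution over three binary events:

    # Chain rule on three binary events; index 2 means "the event occurs"
    set.seed(3)
    p <- array(runif(8), dim = c(2, 2, 2))
    p <- p / sum(p)                          # a made-up joint distribution
    P1    <- sum(p[2, , ])                   # Prob{E1}
    P2g1  <- sum(p[2, 2, ]) / P1             # Prob{E2|E1}
    P3g12 <- p[2, 2, 2] / sum(p[2, 2, ])     # Prob{E3|E1, E2}
    isTRUE(all.equal(p[2, 2, 2], P1 * P2g1 * P3g12))  # TRUE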

3.1.8 The law of total probability and the Bayes' theorem

Let us consider an indeterminate practical situation where a set of events E1, E2, . . . , Ek may occur. Suppose that no two such events may occur simultaneously, but at least one of them must occur. This means that E1, E2, . . . , Ek are mutually exclusive

and exhaustive or, in other terms, that they form a partition of Ω. The following

two theorems can be proven.

Theorem 1.6 (Law of total probability). Let Prob{Ei}, i = 1, . . . , k denote the probability of the ith event Ei and Prob{E|Ei}, i = 1, . . . , k the conditional probability of a generic event E given that Ei has occurred. It can be shown that

Prob{E} = Σ_{i=1}^{k} Prob{E|Ei} Prob{Ei} = Σ_{i=1}^{k} Prob{E ∩ Ei}    (3.1.18)

The quantity Prob{E} is referred to as the marginal probability and denotes the probability of the event E irrespective of the occurrence of other events. A common-sense interpretation of this theorem is that if an event E (e.g. an effect) depends on the realisation of k disjoint events (e.g. causes), the probability of observing E is a weighted average of the single conditional probabilities Prob{E|Ei}, where the weights are given by the marginal probabilities of each event Ei, i = 1, . . . , k. For instance, we can compute the probability that the highway is busy once we


know the probability that an accident occurred or not (two disjoint events) and the

conditional probabilities of traffic given the occurrence (or not) of an accident.
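With made-up numbers for the accident example (all probability values below are assumptions for illustration), the law of total probability reads as follows in R:

    # Law of total probability (3.1.18) for the busy-highway example
    p_acc <- 0.05                  # Prob{accident} (assumed)
    p_busy_acc <- 0.9              # Prob{busy | accident} (assumed)
    p_busy_none <- 0.2             # Prob{busy | no accident} (assumed)
    p_busy <- p_busy_acc * p_acc + p_busy_none * (1 - p_acc)
    p_busy                         # weighted average: 0.235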

Theorem 1.7 (Bayes' theorem). The conditional ("inverse") probability of any Ei, i = 1, . . . , k given that E has occurred is given by

Prob{Ei|E} = Prob{E|Ei} Prob{Ei} / Σ_{j=1}^{k} Prob{E|Ej} Prob{Ej} = Prob{E, Ei} / Prob{E},   i = 1, . . . , k    (3.1.19)

It follows that the Bayes theorem is the only sound way to derive from a conditional probability Prob{E2|E1} its inverse

Prob{E1|E2} = Prob{E2|E1} Prob{E1} / Prob{E2}    (3.1.20)

Any alternative derivation (or shortcut) will inevitably lead to fallacious reasoning and inconsistent results (see the prosecutor fallacy discussion in Section 3.1.9).

It may also be useful to write a conditional version of the law of total probability. Given an event E0 and the set E1, E2, . . . , Ek of mutually exclusive events:

Prob{E|E0} = Σ_{i=1}^{k} Prob{E|Ei, E0} Prob{Ei|E0}    (3.1.21)

From (3.1.20) and by conditioning on a third event E3, we obtain a conditional version of the Bayes theorem

Prob{E1|E2, E3} = Prob{E2|E1, E3} Prob{E1|E3} / Prob{E2|E3}    (3.1.22)

as long as Prob{E2|E3} > 0.

Example

Suppose that k = 2 and

• E1 is the event: "Tomorrow is going to rain".

• E2 is the event: "Tomorrow is not going to rain".

• E is the event: "Tonight is chilly and windy".

The knowledge of Prob{E1}, Prob{E2} and Prob{E|Ek}, k = 1, 2 makes possible the computation of Prob{Ek|E}, as sketched below.
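A numerical sketch of this computation (all probability values here are assumed, for illustration only) via Bayes' theorem in R:

    # Bayes' theorem (3.1.19): Prob{E1|E} from Prob{E|Ek} and Prob{Ek}
    p_rain <- 0.3                       # Prob{E1} (assumed); Prob{E2} = 0.7
    p_E_rain <- 0.8                     # Prob{E|E1} (assumed)
    p_E_norain <- 0.25                  # Prob{E|E2} (assumed)
    p_E <- p_E_rain * p_rain + p_E_norain * (1 - p_rain)  # total probability
    p_E_rain * p_rain / p_E             # Prob{E1|E} ~ 0.578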

Exercise

Verify the validity of the law of total probability and of the Bayes theorem for the

problem in Figure 3.5.


Figure 3.5: Events in a sample space

3.1.9 Direct and inverse conditional probability

The notion of conditional probability is central in probability and machine learning,

but it is often prone to dangerous misunderstanding, for instance, when inappropri-

ately used in domains like medical sciences or law. The most common error consists

of taking a conditional probability Prob {E1|E2 } for its inverse Prob {E2|E1 } . This

is also known as the prosecutor fallacy, as discussed in an example later.

The first important element to keep in mind is that, for any fixed E_1, the quantity Prob{· | E_1} still satisfies the axioms of probability, i.e. the function Prob{· | E_1} is itself a probability measure. Conditional probabilities are probabilities [27]. However, this does not generally hold for Prob{E_1 | ·}, which corresponds to fixing the term E_1 on the left of the conditional bar. For instance, if E_2, E_3 and E_4 are disjoint events, we have

Prob{E_2 ∪ E_3 ∪ E_4 | E_1} = Prob{E_2 | E_1} + Prob{E_3 | E_1} + Prob{E_4 | E_1}

in agreement with the third axiom (Section 3.1.1), but

Prob{E_1 | E_2 ∪ E_3 ∪ E_4} ≠ Prob{E_1 | E_2} + Prob{E_1 | E_3} + Prob{E_1 | E_4}

It is also generally not the case that Prob{E_2 | E_1} = Prob{E_1 | E_2}. As a consequence, if E_1 and E_2 are not independent, then

Prob{E_1^c | E_2} = 1 − Prob{E_1 | E_2}

but

Prob{E_1 | E_2^c} ≠ 1 − Prob{E_1 | E_2}    (3.1.23)

where E^c denotes the complement of E.

Another remarkable property of conditional probability, which is also a distinctive aspect of probabilistic reasoning, is its non-monotonicity. Given an unconditional probability Prob{E_1} > 0 a priori, we cannot say anything about the conditional term Prob{E_1 | E_2}: this term can be larger than, equal to or smaller than Prob{E_1}. For instance, if observing the event E_2 makes the event E_1 more (less) probable, then Prob{E_1 | E_2} > Prob{E_1} (Prob{E_1 | E_2} < Prob{E_1}). If the two events are independent, the probability of E_1 does not change by conditioning. It follows that the degree of belief in an event (or statement) depends on the context. Note that this does not apply to conventional logical reasoning, where the validity of a statement is context-independent.


Figure 3.6: Italians and football supporters within the world population.

In more general terms, any probability statement is conditional, since it is formulated on the basis of an often implicit background knowledge K. For instance, if we say that the probability of the event E = "rain tomorrow" is Prob{E} = 0.9, we are implicitly taking into consideration the season, our location and probably the weather today. So we had better write it as Prob{E | K} = 0.9. As succinctly stated in [27], all probabilities are conditional, and conditional probabilities are probabilities.

Exercise

Consider as sample space Ω the set of all human beings. Let us define two events:

the set E1 of Italians and the set E2 of football supporters. Suppose that the prob-

ability of the two events is proportional to the surface of the regions in Figure 3.6.

Are these events disjoint? Are they independent? What about Prob {E1|E2 } and

Prob {E2|E1 } ? Are they equal? If not, which one is the largest?

The prosecutor fallacy

Consider the following story: a crime occurs in a big city (1M inhabitants), and a deteriorated DNA trace of the murderer is collected. The DNA profile matches the profile of a person in a police database. A geneticist is contacted, and she states that the probability of finding a person with the same DNA profile is one out of 100 thousand (i.e. 1e-5). The prosecution lawyer asks for condemnation with the following argument: "since the chance of finding an innocent man with such characteristics is so tiny, the probability that he is innocent is tiny as well". The jury is impressed and ready to proceed with a life sentence. Then the defendant replies: "Do you know that the population of the city is 1M? So the average number of persons matching such a DNA profile is 10. My chance of being innocent is not so tiny, since it is 9/10 and not one in 100000". Lacking any additional evidence, the suspect is acquitted.

This short story is inspired by a number of real court cases marred by the serious error of confounding direct and inverse conditional probability [169]. The impact of such false reasoning is so relevant in law that it is known as the Prosecutor's fallacy: the fallacy of treating the small probability of the evidence given innocence as if it were the probability of innocence given the evidence.


Let us analyse in probabilistic terms the fallacious reasoning that occurred in the example above. Consider a criminal case with 10 suspects matching the DNA profile, i.e. the culprit and 9 innocent persons (out of a 1 million population). The probability of matching evidence (M) given that someone is innocent (I) is very low:

Prob{M | I} = 9/999999 ≈ 1e-5

However, what is relevant here is not the probability of the evidence given that the suspect is innocent (Prob{M | I}) but the probability that he is innocent given the evidence:

Prob{I | M} = Prob{M | I} Prob{I} / Prob{M} = (9/999999 × 999999/1000000) / (10/1000000) = 9/10.

We can rephrase the issue in the following frequentist terms. Given N inhabitants, m persons with matching DNA profiles and a single murderer, the following table shows the distribution of persons:

            Match    No match
Innocent    m − 1    N − m
Guilty      1        0

From the table above, it is easy to derive the inconsistency of the prosecutor fallacy reasoning, since

Prob{M | I} = (m − 1)/(N − 1),   Prob{M} = m/N

Prob{I | M} = (m − 1)/m >> Prob{M | I}
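The numbers of the story can be checked with a few lines of R (a minimal sketch; N and m are the values of the example above).

N <- 1e6      # city population
m <- 10       # persons matching the DNA profile (one of them guilty)

p.M.given.I <- (m - 1) / (N - 1)   # Prob{M|I}: match given innocent
p.I         <- (N - 1) / N         # Prob{I}: prior probability of innocence
p.M         <- m / N               # Prob{M}: marginal probability of a match

p.I.given.M <- p.M.given.I * p.I / p.M   # Bayes: Prob{I|M}
print(p.M.given.I)   # about 9e-06
print(p.I.given.M)   # 0.9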

3.1.10 Logics and probabilistic reasoning

This section aims to present some interesting relationships between logic deduction

and probabilistic reasoning.

First, we show that we can write down a probabilistic version of the deductive modus ponens rule of propositional logic (Section 2.1):

If E_1 ⇒ E_2 and E_1 is true, then E_2 is true as well.

Since E_1 ⇒ E_2 is equivalent in set terms to E_1 ⊂ E_2, we obtain

Prob{E_2 | E_1} = Prob{E_1, E_2} / Prob{E_1} = Prob{E_1} / Prob{E_1} = 1

i.e. a translation of the modus ponens argument in the probabilistic language.

Interestingly enough, probability theory provides us with a result also in the case where E_2 is true. It is well known that in propositional logic, if E_1 ⇒ E_2 and E_2 is true, then nothing can be inferred about E_1. Probability theory is more informative since, in this case, we may derive from E_1 ⊂ E_2 that

Prob{E_1 | E_2} = Prob{E_1} / Prob{E_2} ≥ Prob{E_1}

Note that this is a probabilistic formulation of the abduction principle (Section 2.3). In other words, probability supports the following common-sense reasoning: if both E_1 ⇒ E_2 and E_2 apply, then the conditional probability of E_1 (i.e. the probability of E_1 once we know that E_2 occurred) cannot be smaller than the unconditional probability (i.e. the probability of E_1 if we knew nothing about E_2).


The properties of transitivity and inverse modus ponens also hold in probability. Let us consider three events E_1, E_2, E_3. The transitivity principle in logics states that

If E_1 ⇒ E_2 and E_2 ⇒ E_3 then E_1 ⇒ E_3

In probabilistic terms we can rewrite E_1 ⇒ E_2 as

Prob{E_2 | E_1} = 1

and E_2 ⇒ E_3 as

Prob{E_3 | E_2} = 1

respectively. From the conditioned law of total probability (Equation (3.1.21)) we obtain

Prob{E_3 | E_1} = Prob{E_3 | E_2^c, E_1} Prob{E_2^c | E_1} + Prob{E_3 | E_2, E_1} Prob{E_2 | E_1} = 1

where the first term vanishes since Prob{E_2^c | E_1} = 0, while in the second term both Prob{E_3 | E_2, E_1} and Prob{E_2 | E_1} are equal to 1.

Inverse modus ponens in logics states that

If E_1 ⇒ E_2 then ¬E_2 ⇒ ¬E_1

In probabilistic terms, from Prob{E_2 | E_1} = 1 it follows that

Prob{E_1^c | E_2^c} = 1 − Prob{E_1 | E_2^c} = 1 − Prob{E_2^c | E_1} Prob{E_1} / Prob{E_2^c} = 1

since Prob{E_2^c | E_1} = 1 − Prob{E_2 | E_1} = 0.

These results show that deductive logic rules can be seen as limiting cases of probabilistic reasoning and confirm the compatibility of probabilistic reasoning with human common sense.

3.1.11 Combined experiments

So far we assumed that all the events belong to the same sample space. However, the most interesting use of probability concerns combined (or multivariate) random experiments whose sample space

Ω = Ω_1 × Ω_2 × ··· × Ω_n

is the Cartesian product of the spaces Ω_i, i = 1, ..., n. For instance, if we want to study the probabilistic dependence between the height and the weight of a child, we define a joint sample space

Ω = {(w, h) : w ∈ Ω_w, h ∈ Ω_h}

made of all pairs (w, h), where Ω_w is the sample space of the random experiment describing the weight and Ω_h is the sample space of the random experiment describing the height.

Note that all the properties studied so far also hold for events that do not belong to the same univariate sample space. For instance, given a combined experiment Ω = Ω_1 × Ω_2, two events E_1 ⊂ Ω_1 and E_2 ⊂ Ω_2 are independent iff Prob{E_1 | E_2} = Prob{E_1}.

Some examples of real problems modelled by random combined experiments are

presented in the following.


Gambler's fallacy

Consider a fair coin-tossing game. The outcomes of two consecutive tosses can be considered independent. Now, suppose that we observe a sequence of 10 consecutive tails. We could be tempted to think that the chances that the next toss will be heads are now very large. This is known as the gambler's fallacy [180]. In fact, witnessing a very rare event (like 10 consecutive tails) does not imply that the probability of the next outcome changes, nor that it suddenly becomes dependent on the past.

Example [192]

Let us consider a medical study about the relationship between the outcome of a

medical test and the presence of a disease. We model this study as a combination

of two random experiments:

1. the random experiment which models the state of the patient. Its sample space is Ω_s = {H, S}, where H and S stand for a healthy and a sick patient, respectively.

2. the random experiment which models the outcome of the medical test. Its sample space is Ω_o = {+, −}, where + and − stand for a positive and a negative outcome of the test, respectively.

The dependency between the state of the patient and the outcome of the test

can be studied in terms of conditional probability.

Suppose that out of 1000 patients, 108 respond positively to the test and that, among them, 9 turn out to be affected by the disease. Also, among the 892 patients who responded negatively to the test, only 1 is sick. According to the frequentist interpretation, the probabilities of the joint events Prob{E_s, E_o} can be approximated according to expression (3.1.5) by

            E_s = S           E_s = H
E_o = +     9/1000 = .009     99/1000 = .099
E_o = −     1/1000 = .001     891/1000 = .891

Doctors are interested in answering the following questions. What is the proba-

bility of having a positive (negative) test outcome when the patient is sick (healthy)?

What is the probability of being in front of a sick (healthy) patient when a positive

(negative) outcome is obtained? From the definition of conditional probability we

derive

Prob{E_o = + | E_s = S} = Prob{E_o = +, E_s = S} / Prob{E_s = S} = .009 / (.009 + .001) = .9

Prob{E_o = − | E_s = H} = Prob{E_o = −, E_s = H} / Prob{E_s = H} = .891 / (.891 + .099) = .9

According to these figures, the test appears to be accurate. Does this mean that

we should be scared if we test positive? Though the test is accurate, the answer is

negative, as shown by the quantity

Prob{E_s = S | E_o = +} = Prob{E_o = +, E_s = S} / Prob{E_o = +} = .009 / (.009 + .099) ≈ .08

This example confirms that humans sometimes tend to confound Prob{E_s | E_o} with Prob{E_o | E_s} and that the most intuitive response is not always the right one (see the example in Section 3.1.9).
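The whole computation can be reproduced in R from the joint table; this is a minimal sketch of the example above.

# joint probabilities Prob{Eo, Es}: rows = test outcome, columns = state
J <- matrix(c(0.009, 0.099,
              0.001, 0.891),
            nrow = 2, byrow = TRUE,
            dimnames = list(test = c("+", "-"), state = c("S", "H")))

p.state <- colSums(J)          # marginals Prob{Es}
p.test  <- rowSums(J)          # marginals Prob{Eo}

J["+", "S"] / p.state["S"]     # Prob{Eo=+|Es=S} = 0.9
J["-", "H"] / p.state["H"]     # Prob{Eo=-|Es=H} = 0.9
J["+", "S"] / p.test["+"]      # Prob{Es=S|Eo=+} is only about 0.08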


3.1.12 Array of joint/marginal probabilities

Let us consider the combination of two random experiments whose sample spaces are Ω_A = {A_1, ..., A_n} and Ω_B = {B_1, ..., B_m}, respectively. Assume that for each pair of events (A_i, B_j), i = 1, ..., n, j = 1, ..., m, we know the joint probability value Prob{A_i, B_j}. The joint probability array contains all the necessary information for computing all marginal and conditional probabilities by means of (3.1.18) and (3.1.8).

           B_1               B_2               ···   B_m               Marginal
A_1        Prob{A_1, B_1}    Prob{A_1, B_2}    ···   Prob{A_1, B_m}    Prob{A_1}
A_2        Prob{A_2, B_1}    Prob{A_2, B_2}    ···   Prob{A_2, B_m}    Prob{A_2}
...        ...               ...                      ...               ...
A_n        Prob{A_n, B_1}    Prob{A_n, B_2}    ···   Prob{A_n, B_m}    Prob{A_n}
Marginal   Prob{B_1}         Prob{B_2}         ···   Prob{B_m}         Sum = 1

where Prob{A_i} = Σ_{j=1,...,m} Prob{A_i, B_j} and Prob{B_j} = Σ_{i=1,...,n} Prob{A_i, B_j}. Using an entry of the joint probability matrix and the sum of the corresponding row/column, we may use (3.1.8) to compute the conditional probability, as shown in the following example.

Example: dependent/independent scenarios

Let us model the commute time to go back home for a ULB student living in St. Gilles as a random experiment. Suppose that its sample space is Ω_t = {LOW, MEDIUM, HIGH}. Consider also an (extremely :-) random experiment representing the weather in Brussels, whose sample space is Ω_w = {G = GOOD, B = BAD}. Suppose that the array of joint probabilities is

           G (in Bxl)      B (in Bxl)      Marginal
LOW        0.15            0.05            Prob{LOW} = 0.2
MEDIUM     0.1             0.4             Prob{MEDIUM} = 0.5
HIGH       0.05            0.25            Prob{HIGH} = 0.3
           Prob{G} = 0.3   Prob{B} = 0.7   Sum = 1

According to the above probability function, is the commute time dependent on the weather in Bxl? Note that if the weather is good,

              LOW              MEDIUM           HIGH
Prob{· | G}   0.15/0.3 = 0.5   0.1/0.3 ≈ 0.33   0.05/0.3 ≈ 0.17

while if the weather is bad,

              LOW               MEDIUM           HIGH
Prob{· | B}   0.05/0.7 ≈ 0.07   0.4/0.7 ≈ 0.57   0.25/0.7 ≈ 0.36

Since Prob{· | G} ≠ Prob{· | B}, i.e. the probability of having a certain commute time changes according to the value of the weather, the relation (3.1.9) is not satisfied.

Consider now the dependency between an event representing the commute time

and an event describing the weather in Rome.

           G (in Rome)     B (in Rome)     Marginal
LOW        0.18            0.02            Prob{LOW} = 0.2
MEDIUM     0.45            0.05            Prob{MEDIUM} = 0.5
HIGH       0.27            0.03            Prob{HIGH} = 0.3
           Prob{G} = 0.9   Prob{B} = 0.1   Sum = 1

Our question now is: is the commute time dependent on the weather in Rome?

If the weather in Rome is good, we obtain

              LOW              MEDIUM           HIGH
Prob{· | G}   0.18/0.9 = 0.2   0.45/0.9 = 0.5   0.27/0.9 = 0.3


E_1       E_2       E_3    P(E_1, E_2, E_3)
CLEAR     RISING    DRY    0.40
CLEAR     RISING    WET    0.07
CLEAR     FALLING   DRY    0.08
CLEAR     FALLING   WET    0.10
CLOUDY    RISING    DRY    0.09
CLOUDY    RISING    WET    0.11
CLOUDY    FALLING   DRY    0.03
CLOUDY    FALLING   WET    0.12

Table 3.1: Joint probability distribution of the three-variable probabilistic model of the weather

while if the weather in Rome is bad,

              LOW              MEDIUM           HIGH
Prob{· | B}   0.02/0.1 = 0.2   0.05/0.1 = 0.5   0.03/0.1 = 0.3

Note that the probability of a commute-time event does NOT change according to the value of the weather in Rome, e.g. Prob{LOW | B} = Prob{LOW}. Try now to answer the following question: if you would like to predict the commute time in Brussels, which event would return more information about it, the weather in Rome or the weather in Brussels?
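A minimal R sketch of these computations on the Brussels table (values from the example above):

# joint probabilities: rows = commute time, columns = weather in Brussels
J <- matrix(c(0.15, 0.05,
              0.10, 0.40,
              0.05, 0.25),
            nrow = 3, byrow = TRUE,
            dimnames = list(time = c("LOW", "MEDIUM", "HIGH"),
                            weather = c("G", "B")))

rowSums(J)                    # marginal Prob{time}
colSums(J)                    # marginal Prob{weather}

# conditional distributions Prob{time|weather}: normalise each column
sweep(J, 2, colSums(J), "/")  # the columns differ => time depends on weather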

Example: three sample spaces

Consider a probabilistic model of the day's weather based on the combination of the following random descriptors:

1. the first represents the sky condition and its sample space is Ω_s = {CLEAR, CLOUDY};

2. the second represents the barometer trend and its sample space is Ω_b = {RISING, FALLING};

3. the third represents the humidity in the afternoon and its sample space is Ω_h = {DRY, WET}.

Let the joint probability values be given by Table 3.1. From the joint values we can calculate the probabilities P(CLEAR, RISING) = 0.47 and P(CLOUDY) = 0.35 and the conditional probability value

P(DRY | CLEAR, RISING) = P(DRY, CLEAR, RISING) / P(CLEAR, RISING) = 0.40/0.47 ≈ 0.85

Take the time now to compute some other probabilities yourself: for instance, what is the probability of having a cloudy sky in wet conditions? Does a rising barometer increase this probability or not? Is the event "clear sky and falling barometer" independent of the event "dry weather"? The R sketch below may be used to check your answers.
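A minimal sketch loading Table 3.1 into a data frame (the probability values are those of the table):

w <- expand.grid(sky = c("CLEAR", "CLOUDY"),
                 barometer = c("RISING", "FALLING"),
                 humidity = c("DRY", "WET"))
# probabilities in the row order generated by expand.grid above
w$p <- c(0.40, 0.09, 0.08, 0.03, 0.07, 0.11, 0.10, 0.12)

# P(CLEAR, RISING) and P(DRY | CLEAR, RISING)
p.cr <- sum(w$p[w$sky == "CLEAR" & w$barometer == "RISING"])
p.dry.cr <- sum(w$p[w$sky == "CLEAR" & w$barometer == "RISING" &
                    w$humidity == "DRY"]) / p.cr
print(c(p.cr, p.dry.cr))   # 0.47 and about 0.85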


3.2 Random variables

Machine learning and statistics are concerned with numeric data and measurements

while so far we have mainly been dealing with categories. What is then the link

between the notion of random experiment and data? The answer is provided by the

concept of random variable.

Consider a random experiment and the associated triple (Ω, {E}, Prob{·}). Suppose that we have a mapping rule z : Ω → Z ⊂ R such that we can associate with each experimental outcome ω a real value z = z(ω) in the domain Z. We say that z is the value taken by the random variable z when the outcome of the random experiment is ω. Henceforth, in order to clarify the distinction between a random variable and its value, we will use boldface notation for denoting a random variable (as in z) and normal face notation for the observed value (as in z = 11).

Since there is a probability associated with each event E and we have a mapping from events to real values, a probability distribution can be associated with z.

Definition 2.1 (Random variable). Given a random experiment (Ω, {E}, Prob{·}), a random variable z is the result of a mapping z : Ω → Z that assigns a number z to every outcome ω. This mapping must satisfy the following two conditions:

• the set {z ≤ z} is an event for every z;

• the probabilities Prob{z = ∞} = 0 and Prob{z = −∞} = 0.

Given a random variable z ∈ Z and a subset I ⊂ Z, we define the inverse mapping

z^{-1}(I) = {ω | z(ω) ∈ I}    (3.2.24)

where z^{-1}(I) ∈ {E} is an event. On the basis of the above relation we can associate a probability measure to z according to

Prob{z ∈ I} = Prob{z^{-1}(I)} = Prob{ω | z(ω) ∈ I}    (3.2.25)

Prob{z = z} = Prob{z^{-1}(z)} = Prob{ω | z(ω) = z}    (3.2.26)

In other words, a random variable is a numerical quantity, linked to some experiment involving some degree of randomness, which takes its value from some set Z of possible real values. The notion of r.v. formalises the notion of numeric measurement, which is indeed a mapping between an outcome (e.g. your body temperature) and a number (e.g. in the range Z = {35, ..., 41} returned by the thermometer). Another experiment might be the rolling of two six-sided dice, where the r.v. z might be the sum (or the maximum) of the two numbers showing on the dice. In this case, the set of possible values is Z = {2, ..., 12} (or Z = {1, ..., 6}), as simulated in the R sketch below.
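A minimal R simulation of the dice random variable (pure simulation, no book script assumed):

set.seed(0)
R <- 10000                       # number of simulated experiments
die1 <- sample(1:6, R, replace = TRUE)
die2 <- sample(1:6, R, replace = TRUE)

z <- die1 + die2                 # r.v. z: sum of the two dice
table(z) / R                     # empirical probability function over Z = {2,...,12}
mean(z)                          # close to the theoretical expectation 7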

Example

Suppose that we have to decide when to go home and watch Fiorentina AC playing

the Champion's League final match against Anderlecht. In order to make such a

decision, a quantity of interest is the (random) commute time z for getting from

ULB to home. Our personal experience is that this time is a positive number that

is not constant: for example, z1 = 10 minutes, z2 = 23 minutes, z3 = 17 minutes,

where zi is the time taken on the ith day of the week. The variability of this quantity

is related to a complex random process with a large sample space Ω (depending, for

example, on the weather condition, the weekday, the sports events in town, and so

on). The probabilistic approach uses a random variable to represent this uncertainty


and considers each measure z_i as the consequence of a random outcome ω_i. The use of a random variable z to represent the commute time then becomes a compact (and approximate) way of modelling the disparate set of causes underlying the uncertainty of this phenomenon. Whatever its limits, the probabilistic representation provides us with a computational way to decide when to leave if we want to bound the probability of missing the start of the game.

3.3 Discrete random variables

The probability (mass) function of a discrete r.v. z is the combination of

1. the countable set Z of values that the r.v. can take (also called the range),

2. the set of probabilities associated with each value of Z.

This means that we can attach to the random variable a specific mathematical function P_z(z) that gives for each z ∈ Z the probability that z assumes the value z:

P_z(z) = Prob{z = z}    (3.3.27)

This function is called probability function or probability mass function. Note that henceforth we will use P(z) as a shorthand for Prob{z = z} when the identity of the random variable is clear from the context.

As depicted in the following example, the probability function can be tabulated for a few sample values of z. If we toss a fair coin twice, and the random variable z is the number of heads that eventually turn up, the probability function can be tabulated as follows:

Values of the random variable z    0      1      2
Associated probabilities           0.25   0.50   0.25

3.3.1 Parametric probability function

Sometimes the probability function is not precisely known but can be expressed as a function of z and a quantity θ. An example is the discrete r.v. z that takes its value from Z = {1, 2, 3} and whose probability function is

P_z(z, θ) = θ^{2z} / (θ^2 + θ^4 + θ^6)

where θ is some fixed nonzero real number. Whatever the value of θ, P_z(z) > 0 for z = 1, 2, 3 and P_z(1) + P_z(2) + P_z(3) = 1.

Therefore z is a well-defined random variable, even if the value of θ is unknown. We call θ a parameter, that is, some constant, usually unknown, involved in the analytical expression of a probability function. We will see in the following that the parametric form is a convenient way to formalise a family of probabilistic models and that the problem of estimation can be seen as a parameter identification task.
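A quick R check that this parametric function is a well-defined probability function for an arbitrary nonzero θ:

Pz <- function(z, theta) theta^(2 * z) / (theta^2 + theta^4 + theta^6)

theta <- 0.7                  # any fixed nonzero value (chosen arbitrarily)
p <- Pz(1:3, theta)
print(p)                      # all values are positive
sum(p)                        # equals 1 for every nonzero theta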

3.3.2 Expected value, variance and standard deviation of a

discrete r.v.

Though the probability function P_z provides a complete description of the uncertainty of z, it is often not practical to use, since it requires keeping in mind (or in memory) as many values as the size of Z. Therefore, it is more convenient to deal with some compact representation of P_z obtained by computing a functional (i.e. a function of a function) of P_z. The most common single-number summary of the distribution P_z is the expected value, which is a measure of central tendency⁴.

Definition 3.1 (Expected value). The expected value of a discrete random variable z is

E[z] = µ = Σ_{z∈Z} z P_z(z)    (3.3.28)

assuming that the sum is well-defined.

An interesting property of the expected value is that it is the value that minimises the squared deviation:

µ = arg min_m E[(z − m)^2]    (3.3.29)

Note that the expected value does not necessarily belong to the domain Z of the random variable. It is also important to remark that, while the term mean is used as a synonym of expected value, this is not the case for the term average. We will discuss in detail the difference between mean and sample average in Section 5.3.2.

Example [180]

Let us consider a European roulette with numbers 0, 1, ..., 36, where the number 0 is considered as winning for the house. The gain of a player who places a 1$ bet on a single number is a random variable z whose sample space is Z = {−1, 35}. In other words, only two outcomes are possible: either he loses 1$ (z_1 = −1) with probability p_1 = 36/37, or he wins 35$ (z_2 = 35) with probability p_2 = 1/37. The expected gain is then

E[z] = p_1 z_1 + p_2 z_2 = (36/37) · (−1) + (1/37) · 35 = −36/37 + 35/37 = −1/37 ≈ −0.027

This means that while casinos gain on average 2.7 cents for every staked dollar, players on average are giving away 2.7 cents (however sophisticated their betting strategy is).
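A minimal R simulation of the expected gain (pure simulation of the example above):

set.seed(0)
R <- 1e6                                   # number of simulated bets
# gain of a 1$ single-number bet: -1 with prob 36/37, +35 with prob 1/37
z <- sample(c(-1, 35), R, replace = TRUE, prob = c(36, 1) / 37)
mean(z)                                    # close to -1/37 = -0.027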

A common way to summarise the spread of a distribution is provided by the

variance.

Definition 3.2 (Variance). The variance of a discrete random variable z is

Var[z] = σ^2 = E[(z − E[z])^2] = Σ_{z∈Z} (z − E[z])^2 P_z(z)

The variance is a measure of the dispersion of the probability function of the random variable around its mean µ. Note that the following relation holds,

σ^2 = E[(z − E[z])^2] = E[z^2 − 2z E[z] + (E[z])^2]    (3.3.30)
    = E[z^2] − (E[z])^2 = E[z^2] − µ^2    (3.3.31)

whatever the probability function of z. Figure 3.7 illustrates two discrete r.v. probability functions that have the same mean but different variance. Note that the variance Var[z] does not have the same dimension as the values of z: for instance, if z is measured in [m], Var[z] is expressed in [m]^2. The standard deviation is a measure of spread that has the same dimension as z. An alternative measure of spread is E[|z − µ|], but this quantity is less used since it is more difficult to manipulate analytically than the variance.

⁴This concept was first introduced in the 17th century by C. Huygens in order to study games of chance.


Figure 3.7: Two discrete probability functions with the same mean and different

variance

Definition 3.3 (Standard deviation). The standard deviation of a discrete random variable z is the positive square root of the variance:

Std[z] = √Var[z] = σ

Example

Let us consider a binary random variable z ∈ Z = {0, 1} with P_z(1) = p, 0 ≤ p ≤ 1, and P_z(0) = 1 − p. In this case

E[z] = p · 1 + (1 − p) · 0 = p    (3.3.32)

E[z^2] = p · 1 + (1 − p) · 0 = p    (3.3.33)

Var[z] = E[z^2] − (E[z])^2 = p − p^2 = p(1 − p)    (3.3.34)

Definition 3.4 (Moment). For any positive integer r, the r-th moment of the probability function is

µ_r = E[z^r] = Σ_{z∈Z} z^r P_z(z)    (3.3.35)

Note that the first moment coincides with the mean µ , while the second moment

is related to the variance according to Equation (3.3.30). Higher-order moments

provide additional information, other than the mean and the spread, about the

shape of the probability function.

Definition 3.5 (Skewness). The skewness of a discrete random variable z is defined as

γ = E[(z − µ)^3] / σ^3    (3.3.36)

Skewness is a parameter that describes asymmetry in a random variable's prob-

ability function. Probability functions with positive skewness have long tails to the

right, and functions with negative skewness have long tails to the left (Figure 3.8).

Definition 3.6 (Kurtosis). The kurtosis of a discrete random variable z is defined as

κ = E[(z − µ)^4] / σ^4    (3.3.37)

Kurtosis is always positive. Its interpretation is that the probability function

of a distribution with large kurtosis has fatter tails, compared with the probability

function of a distribution with smaller kurtosis.


Figure 3.8: A discrete probability function with positive skewness (left) and one

with a negative skewness (right).

3.3.3 Entropy and relative entropy

Definition 3.7 (Entropy). Given a discrete r.v. z, the entropy of the probability function P_z(z) is defined by

H(z) = − Σ_{z∈Z} P_z(z) log P_z(z)

H(z) is a measure of the unpredictability of the r.v. z. Suppose that there are M possible values for the r.v. z. The entropy is maximised (and takes the value log M) if P_z(z) = 1/M for all z. It is minimised iff P(z) = 1 for a single value of z (i.e. all the other probability values are zero).

Although entropy, like variance, measures the uncertainty of a r.v., it differs from the variance since it depends only on the probabilities of the different values and not on the values themselves. In other terms, H can be seen as a function of the probability function P_z rather than of z.

Let us now consider two different discrete probability functions on the same set of values,

P_0 = P_{z0}(z),   P_1 = P_{z1}(z)

where P_0(z) > 0 if and only if P_1(z) > 0. The relative entropies (or Kullback-Leibler divergences) associated with these two functions are

H(P_0 || P_1) = Σ_z P_0(z) log (P_0(z)/P_1(z)) = Σ_z P_0(z) log P_0(z) − Σ_z P_0(z) log P_1(z)    (3.3.38)

H(P_1 || P_0) = Σ_z P_1(z) log (P_1(z)/P_0(z)) = Σ_z P_1(z) log P_1(z) − Σ_z P_1(z) log P_0(z)    (3.3.39)

where the term

− Σ_z P_0(z) log P_1(z) = −E_{P_0}[log P_1]    (3.3.40)

is also called the cross-entropy. These asymmetric quantities measure the dissimilarity between the two probability functions. A symmetric formulation of the dissimilarity is provided by the divergence quantity

J(P_0, P_1) = H(P_0 || P_1) + H(P_1 || P_0).
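A small R sketch computing entropy, cross-entropy and KL divergence for two discrete probability functions (the numeric values are invented for illustration):

entropy <- function(p) -sum(p * log(p))
kl      <- function(p0, p1) sum(p0 * log(p0 / p1))

p0 <- c(0.25, 0.50, 0.25)        # assumed probability function P0
p1 <- c(0.40, 0.40, 0.20)        # assumed probability function P1

entropy(p0)                      # H(z) under P0
kl(p0, p1); kl(p1, p0)           # asymmetric: H(P0||P1) != H(P1||P0)
kl(p0, p1) + kl(p1, p0)          # symmetric divergence J(P0, P1)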


3.4 Continuous random variable

An r.v. z is said to be a continuous random variable if it can assume any of the infinite values within a range of real numbers. The following quantities can be defined:

Definition 4.1 (Cumulative distribution function). The (cumulative) distribution function of z is the function F_z : R → [0, 1]

F_z(z) = Prob{z ≤ z}    (3.4.41)

This function satisfies the following conditions:

1. it is right-continuous: F_z(z) = lim_{y→z⁺} F_z(y),

2. it is non-decreasing: z_1 < z_2 implies F_z(z_1) ≤ F_z(z_2),

3. it is normalised, i.e.

lim_{z→−∞} F_z(z) = 0,   lim_{z→∞} F_z(z) = 1

Definition 4.2 (Density function). The density function of a real random variable z is the derivative of the distribution function,

p_z(z) = dF_z(z)/dz    (3.4.42)

at all points z where F_z(·) is differentiable.

Probabilities of continuous r.v.s are not allocated to specific values but rather to intervals of values. Specifically,

Prob{a ≤ z ≤ b} = ∫_a^b p_z(z) dz,   ∫_Z p_z(z) dz = 1

Some considerations about continuous r.v.s are worth mentioning:

• the quantity Prob{z = z} = 0 for all z;

• the quantity p_z(z) can be bigger than one (since it is a density and not a probability) and even unbounded;

• two r.v.s z_1 and z_2 with the same domain Z are equal in distribution if F_{z_1}(z) = F_{z_2}(z) for all z ∈ Z.

Note that henceforth we will use p(z) as a shorthand for p_z(z) when the identity of the random variable is clear from the context.

3.4.1 Mean, variance, moments of a continuous r.v.

Consider a continuous scalar r.v. with range Z = (l, h) and density function p(z). We may define the following quantities.

Definition 4.3 (Expectation or mean). The mean of a continuous scalar r.v. z is the scalar value

µ = E[z] = ∫_l^h z p(z) dz    (3.4.43)


Figure 3.9: Cumulative distribution function and upper critical point.

Definition 4.4 (Variance). The variance of a continuous scalar r.v. z is the scalar value

σ^2 = E[(z − µ)^2] = ∫_l^h (z − µ)^2 p(z) dz    (3.4.44)

Definition 4.5 (Moments). The r-th moment of a continuous scalar r.v. z is the scalar value

µ_r = E[z^r] = ∫_l^h z^r p(z) dz    (3.4.45)

Note that the moment of order r = 1 coincides with the mean of z.

Definition 4.6 (Quantile function). Given the cumulative function F_z, the quantile (or inverse cumulative) function is the function F_z^{-1} : [0, 1] → R such that

F_z^{-1}(q) = inf {z : F_z(z) > q}

The quantities F_z^{-1}(1/4), F_z^{-1}(1/2) and F_z^{-1}(3/4) are called the first quartile, the median and the third quartile, respectively.

Definition 4.7 (Upper critical point). For a given 0 ≤ α ≤ 1, the upper critical point of a continuous r.v. z is the value z_α such that

1 − α = Prob{z ≤ z_α} = F(z_α)   ⇔   z_α = F^{-1}(1 − α)
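For instance, in R the distribution and quantile functions of the standard normal give the quartiles and the upper critical point directly (a minimal sketch):

alpha <- 0.05
qnorm(c(0.25, 0.50, 0.75))     # first quartile, median, third quartile of N(0,1)
z.alpha <- qnorm(1 - alpha)    # upper critical point: Prob{z <= z.alpha} = 0.95
pnorm(z.alpha)                 # check: returns 1 - alpha = 0.95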

Figure 3.9 shows an example of cumulative distribution together with the upper

critical point. A compact review of univariate discrete and continuous distributions

is available in Appendix C.1. In what follows we will detail only the univariate

normal case.

3.4.2 Univariate Normal (or Gaussian) distribution

A continuous scalar random variable x is said to be normally distributed with parameters µ and σ^2 (written x ∼ N(µ, σ^2)) if its probability density function is Normal (or Gaussian). The analytical form of a Normal probability density function is

p_x(x) = (1/(√(2π) σ)) e^{−(x−µ)^2/(2σ^2)}    (3.4.46)

where the coefficient before the exponential ensures that ∫ p_x(x) dx = 1. The mean of the Normal random variable x is µ and its variance is σ^2. An interesting property of a normal r.v. is that the probability that an observation x lies within 1 (respectively 2) standard deviations of the mean is 0.68 (respectively 0.95). You may find more probabilistic relationships in Table 3.2. When µ = 0 and σ^2 = 1, the distribution is called standard normal (Figure 3.10) and its distribution function is denoted F_z(z) = Φ(z).


Figure 3.10: Density of a standard normal r.v. N(0, 1)

Prob{µ − σ ≤ x ≤ µ + σ} ≈ 0.683
Prob{µ − 1.282σ ≤ x ≤ µ + 1.282σ} ≈ 0.8
Prob{µ − 1.645σ ≤ x ≤ µ + 1.645σ} ≈ 0.9
Prob{µ − 1.96σ ≤ x ≤ µ + 1.96σ} ≈ 0.95
Prob{µ − 2σ ≤ x ≤ µ + 2σ} ≈ 0.954
Prob{µ − 2.57σ ≤ x ≤ µ + 2.57σ} ≈ 0.99
Prob{µ − 3σ ≤ x ≤ µ + 3σ} ≈ 0.997

Table 3.2: Some probabilistic relations holding for x ∼ N(µ, σ^2)

All random variables x ∼ N(µ, σ^2) are linked to a standard variable z by the relation

z = (x − µ)/σ    (3.4.47)

It follows that z ∼ N(0, 1) ⇔ x = µ + σz ∼ N(µ, σ^2).

The practitioner might now wonder why the Normal distribution is so ubiquitous in statistics books and literature. There are plenty of reasons, both theoretical and practical. From a theoretical perspective, the adoption of a Normal distribution is justified by the Central Limit theorem (Appendix C.7), which states that, under conditions almost always satisfied in practice, a linear combination of many random variables converges to a Normal distribution. This is particularly useful if we wish to represent in a compact, lumped form the variability that escapes a modelling effort (e.g. the regression plus noise form in Section 10.1). Another relevant property of Gaussian distributions is that they are invariant to linear transformations: a linear transformation of a Gaussian r.v. is still Gaussian, and its mean (variance) depends on the mean (variance) of the original r.v.. From a more pragmatic perspective, an evident asset of a Gaussian representation is that a finite number of parameters (two in the univariate case) suffices to characterise the entire distribution.

Exercise

Verify the relations in Table 3.2 yourself by random sampling and simulation, using the script norm.R.
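In case the script is not at hand, the following minimal sketch performs the same check by simulation (it uses only rnorm and does not reproduce norm.R itself):

set.seed(0)
R <- 1e6
mu <- 0; sigma <- 1
x <- rnorm(R, mu, sigma)

# empirical counterparts of some rows of Table 3.2
mean(abs(x - mu) <= sigma)          # about 0.683
mean(abs(x - mu) <= 1.96 * sigma)   # about 0.95
mean(abs(x - mu) <= 3 * sigma)      # about 0.997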

3.5 Joint probability

So far, we considered scalar random variables only. However, the most interesting probabilistic (and machine learning) applications are multivariate, i.e. they concern more than one variable. Let us consider a probabilistic model described by n discrete random variables. A fully-specified probabilistic model gives the joint probability for every combination of the values of the n r.v.s. In other terms, the joint probability contains all the information about the random variables.

In the discrete case, the model is specified by the values of the probabilities

Prob{z_1 = z_1, z_2 = z_2, ..., z_n = z_n} = P(z_1, z_2, ..., z_n)    (3.5.48)

for every possible assignment of values z_1, ..., z_n to the variables.

Spam mail example

Let us consider a bivariate probabilistic model describing the relation between the validity of a received email and the presence of the word Viagra in the text. Let z_1 be the random variable describing the validity of the email (z_1 = 0 for no-spam and z_1 = 1 for spam) and z_2 the r.v. describing the presence (z_2 = 1) or the absence (z_2 = 0) of the word Viagra. The stochastic relationship between these two variables can be defined by the joint probability distribution given by the table

           z_1 = 0   z_1 = 1   P_{z_2}
z_2 = 0    0.8       0.08      0.88
z_2 = 1    0.01      0.11      0.12
P_{z_1}    0.81      0.19      1

In the case of n continuous random variables, the model is specified by the joint distribution function

Prob{z_1 ≤ z_1, z_2 ≤ z_2, ..., z_n ≤ z_n} = F(z_1, z_2, ..., z_n)

which returns a value for every possible assignment of values z_1, ..., z_n to the variables.

3.5.1 Marginal and conditional probability

Let {z_1, ..., z_m} be a subset of size m of the n discrete r.v.s for which a joint probability function (3.5.48) is defined. The marginal probabilities for the subset can be derived from expression (3.5.48) by summing over all possible combinations of values of the remaining variables:

P(z_1, ..., z_m) = Σ_{z̃_{m+1}} ··· Σ_{z̃_n} P(z_1, ..., z_m, z̃_{m+1}, ..., z̃_n)    (3.5.49)

Exercise

Compute the marginal probabilities P (z1 = 0) and P (z1 = 1) from the joint prob-

ability of the spam mail example.

For continuous random variables the marginal density is

p(z_1, ..., z_m) = ∫ p(z_1, ..., z_m, z_{m+1}, ..., z_n) dz_{m+1} ... dz_n    (3.5.50)

This is also known as the sum rule or the marginalisation property. The following definition for r.v.s derives directly from Equation (3.1.8).


Definition 5.1 (Conditional probability function) . The conditional probability func-

tion for one subset of discrete variables {zi :i S1 } given values for another disjoint

subset {zj :j S2 } where S1 S2 = , is defined as the ratio

P({ zi : i S1 }|{zj : j S2 }) = P({zi :i S1 } , {zj :j S2 } )

P({ zj : j S2 })

Definition 5.2 (Conditional density function) . The conditional density function

for one subset of continuous variables {zi :i S1 } given values for another disjoint

subset {zj :j S2 } where S1 S2 = , is defined as the ratio

p({ zi : i S1 }|{zj : j S2 }) = p({zi :i S1 } , {zj :j S2 } )

p({ zj : j S2 })(3.5.51)

where p ({ zj :j S2 } ) is the marginal density of the set S2 of variables. When

p({ zj : j S2 } ) = 0 this quantity is not defined.

The simplified version of (3.5.51) for two r.v.s z1 and z2 is

p(z1 = z1 ,z2 = z2 ) =

=p (z2 =z2 |z1 =z1 )p(z1 =z1 ) = p (z1 =z1 |z2 =z2 )p (z2 =z2 ) (3.5.52)

which is also known as the product rule.

By combining (3.4.43), the sum rule (3.5.50) and the product rule (3.5.52), we obtain

p(z_1) = ∫ p(z_1, z_2) dz_2 = ∫ p(z_1 | z_2) p(z_2) dz_2 = E_{z_2}[p(z_1 | z_2)]

where the subscript z_2 makes clear that the expectation is computed with respect to the distribution of z_2 only (while z_1 is fixed).

3.5.2 Independence

Having defined the joint and the conditional probability, we can now define when

two random variables are independent.

Definition 5.3 (Independent discrete random variables). Let x and y be two discrete random variables. The two variables x and y are defined to be statistically independent (written as x ⊥ y) if the joint probability

Prob{x = x, y = y} = Prob{x = x} Prob{y = y},   ∀x, y    (3.5.53)

The definition can be easily extended to the continuous case.

Definition 5.4 (Independent continuous random variables). Two continuous variables x and y are defined to be statistically independent (written as x ⊥ y) if the joint density

p(x = x, y = y) = p(x = x) p(y = y),   ∀x, y    (3.5.54)

From the definition of independence and conditional density it follows that

x ⊥ y ⇒ p(x = x | y = y) = p(x = x),   ∀x, y    (3.5.55)

In layman's terms, the independence of two variables means that we do not

expect that the observed outcome of one variable will affect the probability of ob-

serving the other, or equivalently that knowing something about one variable adds

no information about the other. For instance, hair colour and gender are indepen-

dent. Knowing someone's hair colour adds nothing to the knowledge of his gender.


Height and weight are dependent, however. Knowing someone's height does not

determine precisely their weight: nevertheless, you have less uncertainty about his

probable weight after you have been told the height.

Though independence is symmetric,

x ⊥ y ⇔ y ⊥ x

it is neither reflexive (i.e. a variable is not independent of itself) nor transitive: if x and y are independent and y and z are independent, then x and z need not be independent.

If we consider three variables instead of two, they are said to be mutually independent if and only if each pair of r.v.s is independent and

p(x, y, z) = p(x) p(y) p(z)

Also the relationship

x ⊥ (y, z) ⇒ x ⊥ z, x ⊥ y

holds, but not the one in the opposite direction.

Note that in mathematical terms an independence assumption implies that a bivariate density function can be written in a simple form, i.e. as the product of two univariate densities. This results in an important benefit in terms of the size of the parametrisation. For instance, consider two discrete random variables z_1 ∈ Z_1, z_2 ∈ Z_2 such that the cardinalities of the two ranges are k_1 and k_2, respectively. In the generic case, if z_1 and z_2 are not independent, the definition of the joint probability requires the definition of k_1 k_2 − 1 terms⁵ (or parameters). In the independent case, because of the property (3.5.54), the definition requires k_1 − 1 terms for z_1 and k_2 − 1 terms for z_2, so overall k_1 + k_2 − 2. This makes a big difference for large values of k_1 and k_2.

Independence allows an economic parametrisation in the multivariate case as well. Consider the case of a large number n of binary discrete r.v.s, i.e. each having a range made of two values. If we need to define the joint probability, we require 2^n − 1 terms (or parameters) in the generic case. If the n variables are independent, this number is reduced to n.

Exercise

Check whether the variables z_1 and z_2 of the spam mail example are independent.

Note that henceforth, for the sake of brevity, we will introduce definitions for continuous random variables only. All of them can, however, be extended to the discrete case too.

3.5.3 Chain rule

Given a set of n random variables, the chain rule (also called the general product rule) returns the joint density as a function of conditional densities:

p(z_n, ..., z_1) = p(z_n | z_{n−1}, ..., z_1) p(z_{n−1} | z_{n−2}, ..., z_1) ··· p(z_2 | z_1) p(z_1)    (3.5.56)

This rule is convenient to simplify the representation of large multivariate distributions by describing them in terms of conditional probabilities.

⁵Minus one because of the normalisation constraint.


3.5.4 Conditional independence

Independence is not a stable relation. Though x ⊥ y, the r.v. x may become dependent on y once we observe the value z of a third variable z. In the same way, two dependent variables x and y may become independent once the value of z is known. This leads us to introduce the notion of conditional independence.

Definition 5.5 (Conditional independence). Two r.v.s x and y are conditionally independent given the value z = z (written x ⊥ y | z = z) iff

p(x = x, y = y | z = z) = p(x = x | z = z) p(y = y | z = z),   ∀x, y    (3.5.57)

Two r.v.s x and y are conditionally independent given z (written x ⊥ y | z) iff they are conditionally independent for all values of z.

Since from the chain rule (3.5.56) we may write

p(x = x, y = y | z = z) = p(x = x | z = z) p(y = y | x = x, z = z)

it follows that x ⊥ y | z = z implies the relation

p(y = y | x = x, z = z) = p(y = y | z = z)    (3.5.58)

In plain words, the notion of conditional independence makes formal the intuition that a variable may bring (or not bring) information about a second one, according to the context. Note that the statement x ⊥ y | z = z means that x and y are independent if z = z occurs, but it does not say anything about the relation between x and y if z = z does not occur. It may happen that two variables are independent but not conditionally independent, or the other way round. In general, independence does not imply conditional independence, and conditional independence does not imply independence [27] (as in the example below).

Example: pizzas, dependence and conditional independence

Let y be a variable representing the quality of a pizza restaurant and x a variable quantifying the Italian assonance of the restaurant name. Intuitively, you would prefer (because of higher quality y) a pizza served in the restaurant "Sole Mio" (large x) rather than in the restaurant "Tot Straks" (low x). In probabilistic terms, this means that x and y are dependent (x ⊥̸ y), i.e. knowing x reduces the uncertainty we have about y. However, it is not the restaurant owner who makes your pizza, but the cook (pizzaiolo). Let z represent the Italian assonance of his name. Now you would prefer eating a pizza in a Belgian restaurant where the pizzaiolo has Italian origins rather than in an Italian restaurant with a Flemish cook. In probabilistic terms, x and y become independent once z (the pizzaiolo's name) is known (x ⊥ y | z).

It can be shown that the following two assertions are equivalent:

(x ⊥ (z_1, z_2) | y)   ⇔   (x ⊥ z_1 | (y, z_2)), (x ⊥ z_2 | (y, z_1))

Also,

(x ⊥ y | z), (x ⊥ z | y)   ⇒   (x ⊥ (y, z))

If (x ⊥ y | z), (z ⊥ y | x) and (z ⊥ x | y), then x, y, z are mutually independent. If z is a random vector, the order of the conditional independence is equal to the number of variables in z.


3.5.5 Entropy in the continuous case

Consider a continuous r.v. y. The (differential) entropy of y is defined by

H(y) = − ∫ log(p(y)) p(y) dy = −E_y[log(p(y))] = E_y[log(1/p(y))]

with the convention that 0 log 0 = 0. Entropy is a functional of the distribution of y and is a measure of the predictability of the r.v. y: the higher the entropy, the less reliable are our predictions about y. For a scalar normal r.v. y ∼ N(µ, σ^2),

H(y) = (1/2)(1 + ln(2πσ^2)) = (1/2) ln(2πeσ^2)    (3.5.59)

In the case of a normal random vector Y = {y_1, ..., y_n} ∼ N(0, Σ),

H(Y) = (1/2) ln((2πe)^n det(Σ))

3.5.5.1 Joint and conditional entropy

Consider two continuous r.v.s x and y and their joint density p(x, y). The joint entropy of x and y is defined by

H(x, y) = − ∫∫ log(p(x, y)) p(x, y) dx dy = −E_{x,y}[log(p(x, y))] = E_{x,y}[log(1/p(x, y))]

The conditional entropy is defined as

H(y | x) = − ∫∫ log(p(y | x)) p(x, y) dx dy = −E_{x,y}[log(p(y | x))] = E_{x,y}[log(1/p(y | x))] = E_x[H(y | x = x)]

This quantity measures the remaining uncertainty about y once x is known. Note that in general H(y | x) ≠ H(x | y), that H(y) − H(y | x) = H(x) − H(x | y), and that the chain rule holds:

H(y, x) = H(y | x) + H(x)    (3.5.60)

Also, conditioning reduces entropy:

H(y | x) ≤ H(y)

with equality if and only if x and y are independent, i.e. x ⊥ y. This property formalises a fundamental principle underlying machine learning, data science and prediction in general: by conditioning on some variables x (e.g. inputs), we may reduce the uncertainty about a variable y (target). Another interesting property is the independence bound

H(y, x) ≤ H(y) + H(x)

with equality if x ⊥ y.


Figure 3.11: 3D visualisation of a bivariate joint density.

3.6 Bivariate continuous distribution

Let us consider two continuous r.v.s x and y and their bivariate joint density function p_{x,y}(x, y). An example of a bivariate joint density function is illustrated in Figure 3.11. From (3.5.50), we define as marginal density the quantity

p_x(x) = ∫_{−∞}^{∞} p_{x,y}(x, y) dy

and as conditional density the quantity

p_{y|x}(y | x) = p(x, y) / p(x)    (3.6.61)

which is, in loose terms, the probability that y belongs to an interval dy about y, assuming that x = x. Note that, if x and y are independent,

p_{x,y}(x, y) = p_x(x) p_y(y),   p(y | x) = p_y(y)

The definition of conditional expectation is obtained from (3.6.61) and (3.4.43).

Definition 6.1 (Conditional expectation). The conditional expectation of y given x = x is

E_y[y | x = x] = ∫ y p_{y|x}(y | x) dy = µ_{y|x}(x)    (3.6.62)

From (3.3.29) we may derive that

E_y[y | x = x] = arg min_m E_y[(y − m)^2 | x = x]    (3.6.63)

Note that E_y[y | x = x] is a function of x, also known as the regression function.

The definition of conditional variance derives from (3.6.61) and (3.4.44).

Definition 6.2 (Conditional variance).

Var[y | x = x] = ∫ (y − µ_{y|x}(x))^2 p_{y|x}(y | x) dy    (3.6.64)


Figure 3.12: Bivariate distribution: the figure shows the two marginal distribu-

tions (beside the axis), the conditional expectation function (dashed line) and some

conditional distributions (dotted).

Note that both these quantities are functions of x. If we replace the given value x by the r.v. x, the terms E_y[y | x] and Var[y | x] are random, too.

Some important results on their expectation are contained in the following the-

orems [192].

Theorem 6.3. For two r.v.s x and y, assuming their expectations exist, we have that

E_x[E_y[y | x = x]] = E_y[y]    (3.6.65)

and

Var[y] = E_x[Var[y | x = x]] + Var[E_y[y | x = x]]    (3.6.66)

where Var[y | x = x] and E_y[y | x = x] are functions of x.

We remind that for a bivariate function f(x, y),

E_y[f(x, y)] = ∫ f(x, y) p_y(y) dy,   E_x[f(x, y)] = ∫ f(x, y) p_x(x) dx.
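These two identities can be checked numerically; here is a minimal R sketch, assuming for illustration that x ∼ N(0, 1) and y | x = x ∼ N(2x, 1):

set.seed(0)
R <- 1e6
x <- rnorm(R)                  # x ~ N(0,1) (assumed model)
y <- rnorm(R, mean = 2 * x)    # y|x ~ N(2x, 1): E[y|x] = 2x, Var[y|x] = 1

mean(y)                        # E[y]
mean(2 * x)                    # E_x[E[y|x]]: same value (about 0)

var(y)                         # Var[y]
1 + var(2 * x)                 # E_x[Var[y|x]] + Var[E[y|x]] = 1 + 4 = 5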

A 2D representation of a bivariate continuous distribution is illustrated in Figure 3.12. It is worth noting that, although the conditional distributions are bell-shaped, this is not necessarily the case for the marginal distributions.

3.6.1 Correlation

Consider two random variables x and y with means µ_x and µ_y and standard deviations σ_x and σ_y.

Definition 6.4 (Covariance). The covariance between x and y is defined as

Cov[x, y] = E[(x − µ_x)(y − µ_y)] = E[xy] − µ_x µ_y    (3.6.67)


Figure 3.13: Dependent but uncorrelated random variables

A positive (negative) covariance means that the two variables are positively (inversely) related, i.e. that once one is above its mean, the other tends to be above (below) its mean as well. The covariance can take any real value. A limitation of covariance is that it depends on the variables' scales and units: for instance, if the variables were measured in metres instead of centimetres, this would change their covariance. For this reason, it is common to replace covariance with correlation, a dimensionless measure of linear association.

Definition 6.5 (Correlation). The correlation coefficient is defined as

ρ(x, y) = Cov[x, y] / √(Var[x] Var[y])    (3.6.68)

It is easily shown that −1 ≤ ρ(x, y) ≤ 1. For this reason, the correlation is sometimes expressed as a percentage.

Definition 6.6 (Uncorrelated variables). Two r.v.s x and y are said to be uncorrelated if ρ(x, y) = 0 or, equivalently, if

E[xy] = E[x] E[y]    (3.6.69)

Note that if x and y are two independent random variables, then

E[xy] = ∫∫ xy p(x, y) dx dy = ∫∫ xy p(x) p(y) dx dy = ∫ x p(x) dx ∫ y p(y) dy = E[x] E[y]

This means that independence implies uncorrelation. However, the contrary does not hold for a generic distribution. The equivalence between independence and uncorrelation,

ρ(x, y) = 0 ⇔ x ⊥ y    (3.6.70)

holds only if x and y are jointly Gaussian.

See Figure 3.13 for an example of uncorrelated but dependent variables.
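A minimal R illustration of uncorrelated but dependent variables (taking, for instance, x ∼ N(0, 1) and y = x^2):

set.seed(0)
x <- rnorm(1e6)
y <- x^2                 # y is a deterministic function of x: clearly dependent

cor(x, y)                # close to 0: x and y are uncorrelated
# yet knowing x removes all uncertainty about y, so they are not independent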

Exercises

1. Let x and y be two discrete independent r.v.s such that

P_x(−1) = 0.1, P_x(0) = 0.8, P_x(1) = 0.1

and

P_y(1) = 0.1, P_y(2) = 0.8, P_y(3) = 0.1

If z = x + y, show that E[z] = E[x] + E[y].


2. Let x be a discrete r.v. which assumes values in {−1, 0, 1} with probability 1/3 each, and let y = x^2. Let z = x + y. Show that

• E[z] = E[x] + E[y];

• x and y are uncorrelated but dependent random variables.

3.7 Normal distribution: the multivariate case

Let z = [z_1, ..., z_n]^T be an [n, 1] random vector. The vector is said to be normally distributed with parameters µ and Σ (written z ∼ N(µ, Σ)) if its probability density function is given by

p_z(z) = (1 / ((√(2π))^n √det(Σ))) exp( −(1/2) (z − µ)^T Σ^{-1} (z − µ) )    (3.7.71)

where det(Σ) denotes the determinant of the matrix Σ. It follows that

• the mean E[z] = µ is an [n, 1] vector,

• the matrix

Σ = E[(z − µ)(z − µ)^T]    (3.7.72)

is the [n, n] covariance matrix. This matrix is symmetric and positive semidefinite. It has n(n + 1)/2 parameters: the diagonal terms Σ_jj are the variances Var[z_j] of the vector components and the off-diagonal terms Σ_jk, j ≠ k, are the covariance terms Cov[z_j, z_k]. The inverse Σ^{-1} is also called the concentration matrix.

The quantity

Δ^2 = (z − µ)^T Σ^{-1} (z − µ)    (3.7.73)

which appears in the exponent of p_z is the squared Mahalanobis distance from z to µ. It can be shown that the n-dimensional surfaces of constant probability density are hyper-ellipsoids on which Δ^2 is constant;

• their principal axes are given by the eigenvectors u_j, j = 1, ..., n, of Σ, which satisfy

Σ u_j = λ_j u_j,   j = 1, ..., n

where λ_j are the corresponding eigenvalues;

• the eigenvalues λ_j give the variances along the principal directions (Figure 3.14).

If the covariance matrix Σ is diagonal, then

• the contours of constant density are hyper-ellipsoids with the principal directions aligned with the coordinate axes;

• the components of z are statistically independent, since the distribution of z can be written as the product of the distributions of each of the components separately, in the form

p_z(z) = Π_{j=1}^{n} p_{z_j}(z_j)

• the total number of independent parameters in the distribution is 2n (n for the mean vector and n for the diagonal covariance matrix);

• if σ_j = σ for all j, the contours of constant density are hyper-spheres.
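A minimal R check of (3.7.71)-(3.7.72) by sampling, using MASS::mvrnorm (the numeric Σ is invented for illustration):

library(MASS)
set.seed(0)

mu <- c(0, 0)
Sigma <- matrix(c(1.0, 0.5,
                  0.5, 2.0), 2, 2)   # assumed covariance matrix

Z <- mvrnorm(1e5, mu, Sigma)         # matrix of samples, one row per draw
colMeans(Z)                          # close to mu
cov(Z)                               # close to Sigma

# squared Mahalanobis distances of the samples from mu
D2 <- mahalanobis(Z, center = mu, cov = Sigma)
mean(D2)                             # close to n = 2 for Gaussian samples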


Figure 3.14: Contour curves of normal distribution for n = 2.

3.7.1 Bivariate normal distribution

Let us consider a bivariate (n = 2) normal density whose mean is µ = [µ_1, µ_2]^T and whose covariance matrix is

Σ = [ σ_1^2   σ_12  ]
    [ σ_21    σ_2^2 ]

The correlation coefficient is

ρ = σ_12 / (σ_1 σ_2)

It can be shown that the general bivariate normal density has the form

p(z_1, z_2) = (1 / (2π σ_1 σ_2 √(1 − ρ^2))) exp{ −(1/(2(1 − ρ^2))) [ ((z_1 − µ_1)/σ_1)^2 − 2ρ ((z_1 − µ_1)/σ_1)((z_2 − µ_2)/σ_2) + ((z_2 − µ_2)/σ_2)^2 ] }

A plot of a bivariate normal density with µ = [0, 0]^T and Σ = [1.2919, 0.4546; 0.4546, 1.7081], together with a corresponding contour curve, is traced in Figure 3.15 by means of the script gaussXYZ.R. We suggest that the reader play with the Shiny dashboard gaussian.R in order to visualise the impact of the parameters on the Gaussian distribution.

One of the important properties of the multivariate normal density is that all conditional and marginal probabilities are also normal. Using the relation

p(z_2 | z_1) = p(z_1, z_2) / p(z_1)

we find that p(z_2 | z_1) is a normal distribution N(µ_{2|1}, σ_{2|1}^2), where

µ_{2|1} = µ_2 + ρ (σ_2/σ_1)(z_1 − µ_1)

σ_{2|1}^2 = σ_2^2 (1 − ρ^2)

Note that

• µ_{2|1} is a linear function of z_1: if the correlation coefficient ρ is positive, the larger z_1, the larger µ_{2|1};

• if there is no correlation between z_1 and z_2, the two variables are independent, i.e. we can ignore the value of z_1 when estimating µ_2.
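A minimal R check of the conditional parameters by simulation (same assumed Σ as in the previous sketch, µ = (0, 0)):

library(MASS)
set.seed(0)
Sigma <- matrix(c(1.0, 0.5,
                  0.5, 2.0), 2, 2)
Z <- mvrnorm(2e6, c(0, 0), Sigma)

rho <- Sigma[1, 2] / sqrt(Sigma[1, 1] * Sigma[2, 2])
# keep the samples where z1 falls near the value z1 = 1
sel <- abs(Z[, 1] - 1) < 0.01
mean(Z[sel, 2])   # close to mu2|1 = rho*(sigma2/sigma1)*1 = 0.5
var(Z[sel, 2])    # close to sigma2^2*(1 - rho^2) = 1.75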


Figure 3.15: Bivariate normal density function

3.7.2 Gaussian mixture distribution

A continuous r.v. z has a Gaussian mixture distribution with m components if

p(z = z) = Σ_{k=1}^{m} w_k N(z; µ_k, Σ_k)    (3.7.74)

where N(z; µ_k, Σ_k) denotes the Normal density with mean µ_k and covariance Σ_k, and the mixture weights w_k satisfy

Σ_{k=1}^{m} w_k = 1,   0 ≤ w_k ≤ 1

A Gaussian mixture is a linear superposition of m Gaussian components and, as such, has a higher expressive power than a unimodal Gaussian distribution: for instance, it can be used to model multimodal density distributions. The script gmm.R samples a bidimensional mixture of Gaussians with 3 components with diagonal covariances. The density and the sampled points are shown in Figure 3.16. An interesting property of Gaussian mixtures is that they are universal approximators of densities, which means that any smooth density can be approximated, with any specific nonzero amount of error, by a Gaussian mixture model (GMM) with enough components.
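For readers without the book scripts, here is a minimal univariate sketch of sampling from a mixture (weights and component parameters invented for illustration):

set.seed(0)
R <- 1e5
w  <- c(0.3, 0.5, 0.2)            # mixture weights, sum to 1 (assumed)
mu <- c(-2, 0, 3); s <- c(0.5, 1, 0.7)

k <- sample(1:3, R, replace = TRUE, prob = w)  # latent component label
z <- rnorm(R, mean = mu[k], sd = s[k])         # draw from the selected component

hist(z, breaks = 100, freq = FALSE)            # a multimodal density appears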

3.7.3 Linear transformations of Gaussian variables

If z_1 ∼ N(µ_1, Σ_1) and z_2 ∼ N(µ_2, Σ_2) are independent Gaussian r.v.s, then the sum z = z_1 + z_2 is a Gaussian r.v. z ∼ N(µ_1 + µ_2, Σ_1 + Σ_2). Given two real constants c_1 and c_2, the linear combination z = c_1 z_1 + c_2 z_2 is a Gaussian r.v. z ∼ N(c_1 µ_1 + c_2 µ_2, c_1^2 Σ_1 + c_2^2 Σ_2). If z ∼ N(µ, Σ) is an [n, 1] Gaussian random vector and y = Az, with A an [n, n] real matrix, then y ∼ N(Aµ, A Σ A^T) is a Gaussian vector.
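A minimal R check of the last property by simulation (A and Σ invented for illustration):

library(MASS)
set.seed(0)
mu <- c(1, -1)
Sigma <- matrix(c(1.0, 0.3,
                  0.3, 0.5), 2, 2)
A <- matrix(c(2, 0,
              1, 1), 2, 2, byrow = TRUE)

Z <- mvrnorm(1e5, mu, Sigma)
Y <- Z %*% t(A)                  # y = A z, applied row-wise to the samples

colMeans(Y); A %*% mu            # empirical vs theoretical mean A mu
cov(Y); A %*% Sigma %*% t(A)     # empirical vs theoretical covariance A Sigma A^T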


Figure 3.16: Density and observations of a bidimensional mixture of Gaussians with

3 components. Each colour corresponds to a different component.

3.8 Mutual information

Mutual information is one of the most widely used measures to convey the depen-

dency of variables. It is a measure of the amount of information that one random

variable contains about another random variable. It can also be considered as the

distance from independence between the two variables. This quantity is always non-

negative and zero if and only if the two variables are stochastically independent.

Given two random variables x and y, their mutual information is defined in terms of their marginal probability density functions px(x), py(y) and the joint density p(x, y):

I(x; y) = ∫∫ log [ p(x, y) / (p(x) p(y)) ] p(x, y) dx dy = H(y) − H(y|x) = H(x) − H(x|y)   (3.8.75)

with the convention that 0 log(0/0) = 0. From (3.5.60), we derive

I(x; y) = H(y) − H(y|x) = H(y) + H(x) − H(x, y)   (3.8.76)

Mutual information is null if and only if x and y are independent, i.e.

I(x; y) = 0 ⇔ x ⊥ y.   (3.8.77)

In other words, the larger the mutual information term, the stronger is the degree

of dependency between two variables.

In the Gaussian case, an analytical link between correlation and mutual information exists. Let (x, y) be a normally distributed random vector with correlation coefficient ρ. The mutual information between x and y is given by

I(x; y) = −(1/2) log(1 − ρ²)


Equivalently, the correlation coefficient (3.6.68) can be written as

ρ = √(1 − exp(−2 I(x; y)))

In agreement with (3.8.77) and (3.6.70), it follows that in the Gaussian case

ρ(x, y) = 0 ⇔ I(x; y) = 0   (3.8.78)
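As a quick numerical illustration of these two formulas (the value of ρ below is arbitrary):

rho <- 0.8
I   <- -0.5 * log(1 - rho^2)  # mutual information (in nats) of a bivariate Gaussian
I
sqrt(1 - exp(-2 * I))         # recovers |rho| = 0.8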

3.8.1 Conditional mutual information

Consider three r.v.s x, y and z. The conditional mutual information is defined by

I(y; x|z) = H(y|z) − H(y|x, z)   (3.8.79)

It can also be written as

I(y; x|z) = ∫∫∫ log [ p(x, y|z) / (p(x|z) p(y|z)) ] p(x, y, z) dx dy dz

While mutual information quantifies the degree of (in)dependence between two variables, conditional mutual information quantifies the degree of conditional (in)dependence (Section 3.5.4) between three variables. The conditional mutual information is null iff x and y are conditionally independent given z, i.e.

I(x; y|z) = 0 ⇔ x ⊥ y | z   (3.8.80)

Note that I(x; y|z) can be null though I(x; y) > 0, like in the pizzas example in Section 3.5.4. Also a symmetric configuration is possible, e.g. I(x; y) = 0 but I(x; y|z) > 0, as in the case of complementary variables, which will be discussed in Section 12.8.

3.8.2 Joint mutual information

This section derives the information of a pair of variables (x1, x2) about a third one y. From (3.8.79) and (3.5.60) it follows:

I(x; y|z) = H(y|z) − H(y|x, z) = H(y|z) + H(x|z) − H((x, y)|z) =
          = H((x, z)) + H((y, z)) − H(z) − H((x, y, z))   (3.8.81)

From (3.8.76) it follows

I((x1, x2); y) = H(x1, x2) + H(y) − H(x1, x2, y)

and

I(x1; y) = H(x1) + H(y) − H(x1, y)

From (3.8.81) it follows

I(x2; y|x1) = H(y|x1) − H(y|x1, x2) =
            = H(y, x1) − H(x1) − H(y, x1, x2) + H(x1, x2)

On the basis of the results above, we derive the chain rule of mutual information

I(x1; y) + I(x2; y|x1) =
  = H(x1) + H(y) − H(x1, y) + H(y, x1) − H(x1) − H(y, x1, x2) + H(x1, x2) =
  = H(y) − H(y, x1, x2) + H(x1, x2) = I((x1, x2); y)   (3.8.82)


This formula shows that the information that a pair of variables (x1, x2) brings about a third variable y is not simply the sum of the two mutual information terms I(x1; y) and I(x2; y) but is the sum of I(x1; y) and the conditional information of x2 and y given x1. This aspect is particularly important in the feature selection context (Section 12.8), where simplistic assumptions of monotonicity and additivity do not hold.

For n > 2 variables X = {x1, ..., xn} the chain rule formulation is

I(X; y) = I(X−i; y | xi) + I(xi; y) = I(xi; y | X−i) + I(X−i; y),   i = 1, ..., n   (3.8.83)

where X−i denotes the set X with the i-th term set aside.
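The chain rule can be checked numerically on any discrete joint distribution. The following sketch draws an arbitrary random joint over three binary variables (x1, x2, y), computes the relevant entropies (in nats) and verifies that I(x1; y) + I(x2; y|x1) − I((x1, x2); y) vanishes.

H <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }  # entropy of a distribution

set.seed(0)
P <- array(runif(8), dim = c(2, 2, 2))  # joint over (x1, x2, y)
P <- P / sum(P)

Hx1   <- H(apply(P, 1, sum))
Hy    <- H(apply(P, 3, sum))
Hx1y  <- H(apply(P, c(1, 3), sum))
Hx1x2 <- H(apply(P, c(1, 2), sum))

I.x1y   <- Hx1 + Hy - Hx1y            # I(x1; y)
I.x2y.1 <- Hx1y - Hx1 - H(P) + Hx1x2  # I(x2; y | x1), from (3.8.81)
I.joint <- Hx1x2 + Hy - H(P)          # I((x1, x2); y), from (3.8.76)

I.x1y + I.x2y.1 - I.joint             # 0 up to floating-point error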

3.8.3 Partial correlation coefficient

We have seen in Section 3.6.1 that correlation is a good measure of independence

in the case of Gaussian distributions. The same role for conditional independence

is played by partial correlation.

Definition 8.1 (First-order partial correlation). Let us consider three r.v.s x, y and z. The first-order partial correlation is

ρ_{xy|z} = (ρ_{xy} − ρ_{xz} ρ_{zy}) / √((1 − ρ_{xz}²)(1 − ρ_{yz}²))

where ρ_{xy} is defined in (3.6.68).

This quantity returns a measure of the correlation between x and y once the value of z is known. It is possible to extend the partial correlation to conditioning on two variables.

Definition 8.2 (Second-order partial correlation).

ρ_{x1 y | z x2} = (ρ_{x1 y|z} − ρ_{x1 x2|z} ρ_{y x2|z}) / √((1 − ρ_{x1 x2|z}²)(1 − ρ_{y x2|z}²))

This can also be used to define a recurrence relationship where the q-th order partial correlations can be computed from the (q−1)-th order partial correlations.

Another interesting property is the link between partial correlation and the concentration matrix (Section 3.7). Let Σ and Ω = Σ⁻¹ denote the covariance and the concentration matrix of the normal set of variables Z ∪ {x, y}. The partial correlation coefficient ρ_{xy|Z} can be obtained by matrix inversion:

ρ_{xy|Z} = −ω_{xy} / √(ω_{xx} ω_{yy})

where ω_{xy} is the element of the concentration matrix corresponding to x and y.

Consider a multivariate normal vector X, such that xi, xj ∈ X, XS ⊂ X and s is the dimension of XS. Then

ρ_{xi xj | XS} = 0 ⇔ I(xi; xj | XS) = 0

Note that this is the conditional version of the relation (3.8.78).
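The matrix-inversion route can be checked against Definition 8.1 in a few lines of R; the covariance matrix below is an arbitrary positive-definite example for the three variables (x, y, z), with Z = {z}.

Sigma <- matrix(c(1,   0.6, 0.3,
                  0.6, 1,   0.5,
                  0.3, 0.5, 1), 3, 3)  # covariance of (x, y, z)
Omega <- solve(Sigma)                  # concentration matrix

## partial correlation of x and y given z, from the precision matrix
-Omega[1, 2] / sqrt(Omega[1, 1] * Omega[2, 2])

## the same quantity from the first-order formula of Definition 8.1
(Sigma[1, 2] - Sigma[1, 3] * Sigma[3, 2]) /
  sqrt((1 - Sigma[1, 3]^2) * (1 - Sigma[2, 3]^2))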


3.9 Functions of random variables and Monte Carlo simulation

For any function g(·) of the random variable z

E[g(z)] = ∫ g(z) pz(z) dz   (3.9.84)

This is also known as the law of the unconscious statistician (LOTUS). Note that in general E[g(z)] ≠ g(E[z]), with the exception of the linear function g(z) = az + b, which will be discussed in the following section.

Exercise

Let z be a scalar r.v. and

g(z) = 1 if z ∈ [a, b],  0 otherwise

with a < b. Compute E[g(z)].

For a generic g, the analytical computation or numerical integration of (3.9.84) may be extremely complex. A numerical alternative is represented by Monte Carlo simulation, which requires a pseudo-random generator of examples according to the distribution of z. In a nutshell, Monte Carlo computes E[g(z)] by

1. generating a large number S of sample points zi ∼ Fz, i = 1, ..., S,

2. computing g(zi),

3. returning the estimation

E[g(z)] ≈ (Σ_{i=1}^{S} g(zi)) / S

If S is sufficiently large, we may consider such approximation as reliable. The same

procedure may be used to approximate other parameters of the distribution (e.g.

the variance). In this book, we will have recourse to Monte Carlo simulation to

provide a numerical illustration of probabilistic formulas or concepts (e.g. bias,

variance and generalisation error), which otherwise might appear too abstract for

the reader.

Monte Carlo computation

The script mcarlo.R contains the Monte Carlo computation of the mean and vari-

ance of z ∼ N (µ, σ 2 ) as well as the computation of E [z2 ] and E [ |z | ].

The Shiny dashboard mcarlo.R visualises the result of some operations on a

single and two random variables by using a Monte Carlo simulation.
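As an illustration, here is a minimal sketch in the spirit of mcarlo.R (not necessarily its actual code); the closed-form value used to check E[|z|] is the mean of the folded normal distribution.

set.seed(0)
mu <- 1; sigma <- 2; S <- 1e6
z <- rnorm(S, mu, sigma)    # pseudo-random sample from N(mu, sigma^2)

mean(z^2)                   # Monte Carlo estimate of E[z^2]
mu^2 + sigma^2              # analytical value
mean(abs(z))                # Monte Carlo estimate of E[|z|]
sigma * sqrt(2/pi) * exp(-mu^2 / (2 * sigma^2)) +
  mu * (1 - 2 * pnorm(-mu / sigma))   # folded-normal mean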


3.10 Linear combinations of r.v.

The expected value of a linear combination of r.v.s is simply the linear combination of their respective expected values

E[ax + by] = a E[x] + b E[y],   ∀a, b ∈ R

i.e., expectation is a linear statistic. On the contrary, the variance is not a linear statistic. We have

Var[ax + by] = a² Var[x] + b² Var[y] + 2ab (E[xy] − E[x] E[y])   (3.10.85)
             = a² Var[x] + b² Var[y] + 2ab Cov[x, y]   (3.10.86)

where the quantity Cov[x, y] is defined in (3.6.67).

Given n r.v.s zj, j = 1, ..., n,

Var[ Σ_{j=1}^{n} cj zj ] = Σ_{j=1}^{n} cj² Var[zj] + 2 Σ_{i<j} ci cj Cov[zi, zj]   (3.10.87)

Let us now consider n random variables with the same variance σ² and mutual correlation ρ. Then the variance of their average is

Var[ (Σ_{j=1}^{n} zj) / n ] = nσ²/n² + 2 (1/n²) (n(n−1)/2) ρσ² =
                            = σ²/n + ρσ² − ρσ²/n = (1 − ρ) σ²/n + ρσ²   (3.10.88)
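Formula (3.10.88) can be verified by simulation, as in the sketch below (the values of n, σ and ρ are arbitrary; the equicorrelated covariance matrix is built explicitly). Note that for ρ > 0 the variance of the average does not vanish as n grows but tends to ρσ².

library(MASS)
set.seed(0)
n <- 10; sigma <- 2; rho <- 0.5
Sigma <- sigma^2 * (rho * matrix(1, n, n) + (1 - rho) * diag(n))
                                         # sigma^2 on the diagonal, rho*sigma^2 elsewhere

Z <- mvrnorm(1e5, mu = rep(0, n), Sigma = Sigma)
var(rowMeans(Z))                         # empirical variance of the average
(1 - rho) * sigma^2 / n + rho * sigma^2  # analytical value (3.10.88)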

3.10.1 The sum of i.i.d. random variables

Suppose that z1, z2, ..., zN are i.i.d. (identically and independently distributed) random variables, discrete or continuous, each having a probability distribution with mean µ and variance σ². Let us consider the two derived r.v.s, that is the sum

SN = z1 + z2 + ··· + zN

and the average

z̄ = (z1 + z2 + ··· + zN) / N   (3.10.89)

The following relations hold

E[SN] = Nµ,   Var[SN] = Nσ²   (3.10.90)

E[z̄] = µ,   Var[z̄] = σ²/N   (3.10.91)

An illustration of these relations by simulation can be obtained by running the R

script sum rv.R.
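For readers without access to the script, an equivalent minimal simulation might look as follows (R = 10^5 datasets of size N = 20 are generated and the four empirical moments are compared with (3.10.90) and (3.10.91)).

set.seed(0)
N <- 20; mu <- 1; sigma <- 2; R <- 1e5

D    <- matrix(rnorm(R * N, mu, sigma), R, N)  # R datasets of size N
S.N  <- rowSums(D)
zbar <- rowMeans(D)

mean(S.N);  N * mu         # E[S_N] = N mu
var(S.N);   N * sigma^2    # Var[S_N] = N sigma^2
mean(zbar); mu             # E[zbar] = mu
var(zbar);  sigma^2 / N    # Var[zbar] = sigma^2 / N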

3.11 Conclusion

The reader (in particular, if a practitioner) might think that a chapter on probability theory is an unnecessary frill in a book on machine learning. The author has a different opinion. Probability extends the logical formalism and formalises human patterns of reasoning under uncertainty (e.g. abduction). Also, probability

provides an effective language to formalise the task of machine learning, i.e. using


some variables (e.g. inputs) to explain, provide information about (or reduce uncertainty on) other ones (e.g. targets). According to Aristotle, philosophy begins with

wonder. From a scientific perspective, wonder originates from uncertainty, and sci-

ence has the role of reducing it by explanation. The author hopes that this chapter

showed that uncertainty and information are not only philosophical concepts but

quantities whose nature and relationship can be described in probabilistic terms.

So far, we have only considered low variate settings, although the ambition of statistical machine learning is to attack complex high variate problems. For this reason, the next chapter will provide a probabilistic formalism to deal with high variate (and thus complex) settings. What is still missing for the moment is the second

major ingredient (besides uncertainty) of machine learning: data. Please be patient:

the relation between uncertainty and observations will be discussed in Chapter 5,

which introduces estimation as the statistical way of combining probabilistic models

with real-world data.

3.12 Exercises

1. Suppose you collect a dataset about spam in emails. Let the binary variables x1, x2 and x3 represent the occurrence of the words "Viagra", "Lottery" and "Won", respectively, in an email. Let the dataset of 20 emails be summarised as follows

Document x1 (Viagra) x2 (Lottery) x3 (Won) y(Class)

E1 0 0 0 NOSPAM

E2 0 1 1 SPAM

E3 0 0 1 NOSPAM

E4 0 1 1 SPAM

E5 1 0 0 SPAM

E6 1 1 1 SPAM

E7 0 0 1 NOSPAM

E8 0 1 1 SPAM

E9 0 0 0 NOSPAM

E10 0 1 1 SPAM

E11 1 0 0 NOSPAM

E12 0 1 1 SPAM

E13 0 0 0 NOSPAM

E14 0 1 1 SPAM

E15 0 0 1 NOSPAM

E16 0 1 1 SPAM

E17 1 0 0 SPAM

E18 1 1 1 SPAM

E19 0 0 1 NOSPAM

E20 0 1 1 SPAM

where

• 0 stands for the case-insensitive absence of the word in the email,
• 1 stands for the case-insensitive presence of the word in the email.

Let y = 1 denote a spam email and y = 0 a no-spam email.

The student should estimate, on the basis of the frequency of the data above,

• Prob{x1 = 1, x2 = 1}
• Prob{y = 0 | x2 = 1, x3 = 1}
• Prob{x1 = 0 | x2 = 1}
• Prob{x3 = 1 | y = 0, x2 = 0}
• Prob{y = 0 | x1 = 0, x2 = 0, x3 = 0}


• Prob{x1 = 0 | y = 0}
• Prob{y = 0}

Solution:

• Prob{x1 = 1, x2 = 1} = 0.1
• Prob{y = 0 | x2 = 1, x3 = 1} = 0
• Prob{x1 = 0 | x2 = 1} = 0.8
• Prob{x3 = 1 | y = 0, x2 = 0} = 0.5
• Prob{y = 0 | x1 = 0, x2 = 0, x3 = 0} = 1
• Prob{x1 = 0 | y = 0} = 0.875
• Prob{y = 0} = 0.4
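These frequencies can be reproduced in R by encoding the table above as binary vectors (rows E1 to E20, in order) and computing conditional relative frequencies:

x1 <- c(0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,1,0,0)
x2 <- c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1)
x3 <- c(0,1,1,1,0,1,1,1,0,1,0,1,0,1,1,1,0,1,1,1)
y  <- c(0,1,0,1,1,1,0,1,0,1,0,1,0,1,0,1,1,1,0,1)  # 1 = SPAM, 0 = NOSPAM

mean(x1 == 1 & x2 == 1)                    # Prob{x1=1, x2=1} = 0.1
mean(y[x2 == 1 & x3 == 1] == 0)            # Prob{y=0 | x2=1, x3=1} = 0
mean(x1[x2 == 1] == 0)                     # Prob{x1=0 | x2=1} = 0.8
mean(x3[y == 0 & x2 == 0] == 1)            # Prob{x3=1 | y=0, x2=0} = 0.5
mean(y[x1 == 0 & x2 == 0 & x3 == 0] == 0)  # Prob{y=0 | x1=x2=x3=0} = 1
mean(x1[y == 0] == 0)                      # Prob{x1=0 | y=0} = 0.875
mean(y == 0)                               # Prob{y=0} = 0.4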

2. Let us consider a fraud detection problem. Suppose we collect the following trans-

actional dataset where v = 1 means that the transaction came from a suspicious

web site and f= 1 means that the transaction is fraudulent.

        f = 1   f = 0
v = 1    500    1000
v = 0      1   10000

Estimate the following quantities by using the frequency as estimator of probability:

• Prob{f = 1}
• Prob{v = 0}
• Prob{f = 1 | v = 1}
• Prob{v = 1 | f = 1}

Use the Bayes theorem to compute Prob {v = 1|f = 1} and show that the result is

identical to the one computed before.

Solution:

• Prob{f = 1} = 501/11501 = 0.043
• Prob{v = 0} = 10001/11501 = 0.869
• Prob{f = 1 | v = 1} = 500/1500 = 1/3
• Prob{v = 1 | f = 1} = 500/501

By Bayes theorem:

Prob{v = 1 | f = 1} = Prob{f = 1 | v = 1} Prob{v = 1} / Prob{f = 1} = (1/3)(1500/11501) / (501/11501) = 500/501
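The same estimates, including the Bayes-theorem check, can be reproduced in R from the contingency table:

n <- matrix(c(500, 1, 1000, 10000), 2, 2,
            dimnames = list(v = c(1, 0), f = c(1, 0)))
N <- sum(n)                   # 11501

sum(n[, "1"]) / N             # Prob{f=1}
sum(n["0", ]) / N             # Prob{v=0}
n["1", "1"] / sum(n["1", ])   # Prob{f=1 | v=1}
n["1", "1"] / sum(n[, "1"])   # Prob{v=1 | f=1}

## Bayes theorem check
p.f1.v1 <- n["1", "1"] / sum(n["1", ])
p.v1    <- sum(n["1", ]) / N
p.f1    <- sum(n[, "1"]) / N
p.f1.v1 * p.v1 / p.f1         # Prob{v=1 | f=1} = 500/501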

3. Let us consider a dataset with 4 binary variables

x1 x2 x3 y

1 1 0 1

0 0 1 0

0 1 0 0

1 1 1 1

0 0 0 0

0 1 0 0

0 1 1 0

0 0 1 0

0 0 0 0

0 1 0 0

1 1 1 1

Estimate the following quantities by using the frequency as estimator of probability


• Prob{y = 1}
• Prob{y = 1 | x1 = 0}
• Prob{y = 1 | x1 = 0, x2 = 0, x3 = 0}

Solution:

• Prob{y = 1} = 3/11
• Prob{y = 1 | x1 = 0} = 0
• Prob{y = 1 | x1 = 0, x2 = 0, x3 = 0} = 0

4. Let us consider a task with three binary inputs and one binary target where the

input distribution is

x1 x2 x3 P(x1, x2, x3)

0 0 0 0.2

0 0 1 0.1

0 1 0 0.1

0 1 1 0.1

1 0 0 0.1

1 0 1 0.1

1 1 0 0.1

1 1 1 0.2

and the conditional probability is

x1 x2 x3 P(y = 1 | x1, x2, x3)

0 0 0 0.8

0 0 1 0.1

0 1 0 0.5

0 1 1 0.9

1 0 0 0.05

1 0 1 0.1

1 1 0 0.05

1 1 1 0.5

Compute

• Prob{x1 = 1, x2 = 1}
• Prob{y = 0 | x2 = 1, x3 = 0}
• Prob{x1 = 0 | x2 = 1}
• Prob{x3 = 1 | y = 0, x2 = 1}
• Prob{y = 0 | x1 = 0, x2 = 0, x3 = 0}
• Prob{x1 = 0 | y = 0}

Solution:

• Prob{x1 = 1, x2 = 1} = 0.1 + 0.2 = 0.3

• By using (3.1.21) (where E0 stands for x2 = 1, x3 = 0) we obtain:

Prob{y = 0 | x2 = 1, x3 = 0} =
  Prob{y = 0 | x1 = 0, x2 = 1, x3 = 0} · Prob{x1 = 0 | x2 = 1, x3 = 0} +
  + Prob{y = 0 | x1 = 1, x2 = 1, x3 = 0} · Prob{x1 = 1 | x2 = 1, x3 = 0} =
  = 0.5 · 0.5 + 0.95 · 0.5 = 0.725

• Prob{x1 = 0 | x2 = 1} = (0.1 + 0.1) / (0.2 + 0.3) = 0.4

• From the joint four-variate distribution computed in the exercise below:

Prob{x3 = 1 | y = 0, x2 = 1} = Prob{x3 = 1, y = 0, x2 = 1} / Prob{y = 0, x2 = 1} = 0.11 / 0.255 = 0.4313725


• Prob{y = 0 | x1 = 0, x2 = 0, x3 = 0} = 1 − 0.8 = 0.2

• From the joint four-variate distribution computed in the exercise below:

Prob{x1 = 0 | y = 0} = Prob{x1 = 0, y = 0} / Prob{y = 0} = 0.19 / 0.57 = 0.3333

5. Consider the probability distribution of the previous exercise. Is y conditionally independent of x1 given x2?

Solution:

According to Section 3.5.4, y is conditionally independent of x1 given x2 if for all values x2:

Prob{y = y | x1 = x1, x2 = x2} = Prob{y = y | x2 = x2}

Let us compute Prob{y = 1 | x1 = 1, x2 = x2} and Prob{y = 1 | x2 = x2} for x2 = 0. From (3.1.21)

Prob{y = 1 | x2 = 0, x1 = 1} =
  = Σ_{x3} Prob{y = 1 | x2 = 0, x1 = 1, x3 = x3} Prob{x3 = x3 | x2 = 0, x1 = 1} =
  = Prob{y = 1 | x2 = 0, x1 = 1, x3 = 0} Prob{x3 = 0 | x2 = 0, x1 = 1} +
  + Prob{y = 1 | x2 = 0, x1 = 1, x3 = 1} Prob{x3 = 1 | x2 = 0, x1 = 1} =
  = 0.05 · 0.1/0.2 + 0.1 · 0.1/0.2 = 0.075

and

Prob{y = 1 | x2 = 0} =
  = Σ_{x1,x3} Prob{y = 1 | x2 = 0, x1 = x1, x3 = x3} Prob{x1 = x1, x3 = x3 | x2 = 0} =
  = Prob{y = 1 | x2 = 0, x1 = 0, x3 = 0} Prob{x1 = 0, x3 = 0 | x2 = 0} +
  + Prob{y = 1 | x2 = 0, x1 = 0, x3 = 1} Prob{x1 = 0, x3 = 1 | x2 = 0} +
  + Prob{y = 1 | x2 = 0, x1 = 1, x3 = 0} Prob{x1 = 1, x3 = 0 | x2 = 0} +
  + Prob{y = 1 | x2 = 0, x1 = 1, x3 = 1} Prob{x1 = 1, x3 = 1 | x2 = 0} =
  = 0.8 · 0.2/0.5 + 0.1 · 0.1/0.5 + 0.05 · 0.1/0.5 + 0.1 · 0.1/0.5 = 0.37

Since those two values are different, the two variables are not conditionally independent.

An alternative would be first computing the joint distribution of the 4 variables and

then deriving the conditional terms. Since

Prob{y, x1, x2, x3} = Prob{y | x1, x2, x3} Prob{x1, x2, x3}

the joint distribution is:


y x1 x2 x3 P(y, x1, x2, x3)

0 0 0 0 (1-0.8)*0.2=0.04

0 0 0 1 (1-0.1)*0.1=0.09

0 0 1 0 0.05

0 0 1 1 0.01

0 1 0 0 0.095

0 1 0 1 0.09

0 1 1 0 0.095

0 1 1 1 0.1

1 0 0 0 0.8*0.2=0.16

1 0 0 1 0.1*0.1=0.01

1 0 1 0 0.05

1 0 1 1 0.09

1 1 0 0 0.005

1 1 0 1 0.01

1 1 1 0 0.005

1 1 1 1 0.1

From the table above we compute the conditional terms as

Prob{y = 1 | x2 = 0} = Prob{y = 1, x2 = 0} / Prob{x2 = 0} =
  = (0.16 + 0.01 + 0.01 + 0.005) / (0.04 + 0.09 + 0.095 + 0.09 + 0.16 + 0.01 + 0.005 + 0.01) = 0.37

and

Prob{y = 1 | x2 = 0, x1 = 1} = Prob{y = 1, x1 = 1, x2 = 0} / Prob{x1 = 1, x2 = 0} =
  = (0.005 + 0.01) / (0.095 + 0.09 + 0.005 + 0.01) = 0.075

Since the results are (obviously) identical to the ones obtained with the first method,

the conclusion is the same, i.e. the variables are conditionally dependent.
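The second method translates directly into R. The sketch below builds the joint distribution over (y, x1, x2, x3) from the two tables of the exercise and recovers the conditional terms computed above.

J <- data.frame(
  x1  = c(0,0,0,0,1,1,1,1),
  x2  = c(0,0,1,1,0,0,1,1),
  x3  = c(0,1,0,1,0,1,0,1),
  px  = c(0.2,0.1,0.1,0.1,0.1,0.1,0.1,0.2),     # P(x1, x2, x3)
  py1 = c(0.8,0.1,0.5,0.9,0.05,0.1,0.05,0.5))   # P(y = 1 | x1, x2, x3)

## joint distribution over (y, x1, x2, x3)
J1 <- transform(J, y = 1, p = px * py1)
J0 <- transform(J, y = 0, p = px * (1 - py1))
JJ <- rbind(J0, J1)

## conditional terms computed in the solution
with(JJ, sum(p[y == 1 & x2 == 0]) / sum(p[x2 == 0]))             # 0.37
with(JJ, sum(p[y == 1 & x2 == 0 & x1 == 1]) /
          sum(p[x2 == 0 & x1 == 1]))                             # 0.075
with(JJ, sum(p[x3 == 1 & y == 0 & x2 == 1]) /
          sum(p[y == 0 & x2 == 1]))                              # 0.4313725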

6. Let x, y, z be three binary random variables denoting the pathological mutation of

a given gene of the father, mother and child, respectively. The values 0 and 1 stand

for the absence and presence of the mutation, respectively. Suppose that

• the two parents have the same probability 0.5 of having a pathological mutation in a given gene,
• the variables x and y are independent,
• the child may inherit the mutation according to this conditional probability table

Prob{z = 1 | x = x, y = y}   x   y

0 0 0

0.6 0 1

0.4 1 0

0.7 1 1

1. What is the probability that the child has no mutation if both parents are not

affected?

2. What is the probability that the father had a mutated gene if the child has the

mutation and the mother is not affected?

3. What is the probability that the father has a mutated gene if the child has the

mutation and the mother is affected?


4. What is the probability that the child has the mutation if the father has none?

5. What is the probability that the father has a mutated gene if the child has the

mutation?

6. What is the probability that the father has a mutated gene if the child has no

mutation?

Solution:

Let us derive first

P(z = 1 | y = 0) =
  = P(z = 1 | y = 0, x = 1) P(x = 1 | y = 0) + P(z = 1 | y = 0, x = 0) P(x = 0 | y = 0) =
  = P(z = 1 | y = 0, x = 1) P(x = 1) + P(z = 1 | y = 0, x = 0) P(x = 0) =
  = 0.4 · 0.5 + 0 · 0.5 = 0.2

P(z = 1 | y = 1) =
  = P(z = 1 | y = 1, x = 1) P(x = 1 | y = 1) + P(z = 1 | y = 1, x = 0) P(x = 0 | y = 1) =
  = P(z = 1 | y = 1, x = 1) P(x = 1) + P(z = 1 | y = 1, x = 0) P(x = 0) =
  = 0.7 · 0.5 + 0.6 · 0.5 = 0.65

P(z = 1 | x = 1) =
  = P(z = 1 | x = 1, y = 0) P(y = 0 | x = 1) + P(z = 1 | x = 1, y = 1) P(y = 1 | x = 1) =
  = 0.4 · 0.5 + 0.7 · 0.5 = 0.55

It follows

1. P(z = 0 | x = 0, y = 0) = 1

2. P(x = 1 | z = 1, y = 0) = P(z = 1 | x = 1, y = 0) P(x = 1 | y = 0) / P(z = 1 | y = 0) = 0.4 · 0.5 / 0.2 = 1

3. P(x = 1 | z = 1, y = 1) = P(z = 1 | x = 1, y = 1) P(x = 1 | y = 1) / P(z = 1 | y = 1) = 0.7 · 0.5 / 0.65 = 0.538

4. P(z = 1 | x = 0) = P(z = 1 | x = 0, y = 1) P(y = 1 | x = 0) + P(z = 1 | x = 0, y = 0) P(y = 0 | x = 0) = 0.6 · 0.5 + 0 · 0.5 = 0.3

5. P(x = 1 | z = 1) = P(z = 1 | x = 1) P(x = 1) / P(z = 1) = 0.55 · 0.5 / (0.55 · 0.5 + 0.3 · 0.5) = 0.647

6. P(x = 1 | z = 0) = P(z = 0 | x = 1) P(x = 1) / P(z = 0) = 0.45 · 0.5 / (0.45 · 0.5 + 0.7 · 0.5) = 0.3913


Chapter 4

Graphical models

Graphical Models combine probability theory and graph theory [113, 136, 119] to

deal with two pervasive issues in applied mathematics and engineering: uncertainty

and complexity. In particular, they rely on the notion of conditional independence

(Section 3.1.6) to simplify the representation of complex high-variate probability

distributions.

4.1 Conditional independence and multivariate distributions

One of the hardest challenges for machine learning is to model large variate tasks,

i.e. tasks characterised by a large number of variables. Section 3.5.2 shows that an

independence assumption reduces the size of the parameter set needed to describe

a probability distribution with many variables. Unfortunately, the assumption of

independence is very strong and rarely met in real tasks. Nevertheless, it is realistic

to assume the existence of conditional independence (Section 3.5.4) relationships in

large variate settings. This assumption implies sparseness, which is a dependence

pattern where variables tend to interact with few others. If conditional independence between some variables holds, then thanks to (3.5.56) we reduce the size of the parameter set required to describe the joint probability distribution.

Consider for instance the case of n = 4 binary discrete r.v.s. In the generic case, we need 2⁴ − 1 = 15 parameters to encode such a probability, i.e. a quantity exponential in the number of variables. This exponential nature makes probabilistic modelling unfeasible (i.e. too many parameters to elicit) and unmanageable (i.e. too large a required memory) in case of large n.

Let us now suppose that the 4 binary r.v.s are independent: in this case since

P( z4 , z3, z2, z1 ) = P( z4 ) P( z3 ) P( z2 ) P( z1 ) (4.1.1)

only 4 parameters are necessary to describe the joint distribution. No exponential

explosion of the number of required parameters happens. However, this is a very

simplistic and idealised setting, which rarely occurs in real interesting problems.

Moreover, if all the variables were independent, there would be no need for supervised learning and predictive modelling, since no variable would bring information about (or reduce the uncertainty of) the others.

A more realistic assumption is to consider some variables as conditionally independent of others. For instance, suppose that z4 is conditionally independent of z1 and z2 given z3 (z4 ⊥ (z1, z2) | z3):

P(z4 | z3, z2, z1) = P(z4 | z3)   (4.1.2)



and z3 is conditionally independent of z1 given z2 (z3 ⊥ z1 | z2):

P(z3 | z2, z1) = P(z3 | z2)   (4.1.3)

From the discrete version of (3.5.56) we can write

P( z4 , z3, z2, z1 ) = P( z4 | z3 , z2, z1 ) P( z3 |z2 , z1 ) P( z2 | z1 ) P( z1 )

From the conditional independence relations (4.1.2) and (4.1.3) we obtain the sim-

plified expression

P( z4 , z3, z2, z1 ) = P( z4 |z3 ) P( z3 |z2 ) P( z2 |z1 ) P( z1 )

Note that the conditional probability P(zj | zi) for two binary r.v.s can now be encoded by a conditional table with only two parameters, e.g. P(zj = 1 | zi = 1) and P(zj = 1 | zi = 0). It follows that, thanks to such assumptions, we may

describe the joint probability with 7 parameters only. The useful compactness of

the representation is still more striking in the case of large n, continuous variables,

or discrete r.v.s with a large range of values.

The representational advantage of conditional independence relationships is evi-

dent in Bayesian Networks, a formalism characterised by a correspondence between

topological properties (e.g. connectivity in a directed graph) and probabilistic ones

(notably independence). This formalism allows a compact, flexible, modular (since

localized and then natural for humans) representation of joint distributions.

4.2 Directed acyclic graphs

A directed graph G is a pair (V, E) where V is a finite non-empty set whose elements are called nodes, and E is a set of ordered pairs of distinct elements of V. The elements of E are called edges. A directed cycle is a path from a node to itself. A directed graph is called a directed acyclic graph (DAG) if it contains no directed cycles.

Given a DAG and two nodes z1 ∈ V and z2 ∈ V:

• z2 is called a parent of z1 if there is an edge from z2 to z1;
• z2 is called a descendant of z1, and z1 is called an ancestor of z2, if there is a directed path from z1 to z2;
• z2 is called a non-descendant of z1 if it is not a descendant of z1.

Note that a node is not considered a descendant of itself.

4.3 Bayesian networks

DAGs are an effective way of representing multivariate distributions where the nodes

denote random variables, and the topology (notably the absence of edges) encodes

conditional independence assertions (e.g. elicited from an expert in this domain).

A main advantage of the approach is the notion of modularity (a complex system

is made by combining simpler parts) which makes possible visual interpretability.

A Bayesian Network (BN) is a pair (G, P) where G is a Directed Acyclic Graph (DAG) (i.e. a graph with no loops from a variable back to itself) and P is a joint probability distribution over Z, which is associated with G by the Markov condition.

Definition 3.1 (Markov condition). Given a DAG G and the associated joint probability distribution P over Z, the Markov condition (MC) holds if every variable is independent of its graphical non-descendants conditional on its parents.


Figure 4.1: Bayesian Network.

If the Markov condition is satisfied (it is also said that G represents P ) the

following theorem holds.

Theorem 3.2. If (G, P) satisfies the Markov condition, then P is equal to the product of its conditional distributions of all nodes given the values of their parents, whenever these conditional distributions exist.

This means that if we order the set of r.v.s zi such that, if zj is a descendant of zk, then zj follows zk in the ordering (k < j), we have the product form

P(z1, ..., zn) = Π_{i=1}^{n} P(zi | Parents(zi))

where Parents(zi) is the set of parents of the node zi in G.

Example

An example of BN is shown in Figure 4.1. Note that the enumeration of the variable

indices satisfies the topological ordering mentioned before. Let us consider the node

z4 : the nodes z2 and z3 are its parents, z1 is its ancestor, and z6 is its descendant.

The associated probability distribution may be factorised as follows:

P( Z) = P( z6 |z4 ) P( z5 |z3 ) P( z4 |z3 , z2 ) P( z3 |z1 ) P( z2 |z1 ) P( z1 )

From the DAG, we can derive a number of conditional independence statements on the basis of the Markov Condition:

z6 ⊥ (z2, z3, z1, z5) | z4   (4.3.4)
z4 ⊥ (z1, z5) | (z2, z3)   (4.3.5)
z2 ⊥ (z3, z5) | z1   (4.3.6)

Note, for instance, that z4 is not independent of z6 since z6 is its descendant. Can you write more independence statements?


Figure 4.2: Alarm BN [166] (nodes: BURGLARY, EARTHQUAKE, ALARM, JOHN CALLS, MARY CALLS).

Example

This is a well-known example used in [166] to show a practical application of

Bayesian Networks. Suppose you want to model a burglar alarm that is fairly re-

liable at detecting a burglary but also responds on occasion to minor earthquakes.

You also have two neighbours, John and Mary, who promised to call you at work

when they hear the alarm. John always calls when he hears the alarm but some-

times confuses the telephone ringing with the alarm and calls then, too. Mary likes

loud music and sometimes misses the alarm. Given the evidence of who has or has

not called, we would like to estimate the probability of a burglary. We can describe

the problem by using a BN (Figure 4.3) where all the variables are Boolean and

denoted by capital letters to better remember their meaning.

The joint probability can be factorized as follows:

P( J, M, A, B , E) = P( J| A) P( M| A) P( A| B, E ) P( B) P( E)

Suppose that the unconditional probabilities of a burglary (B) and of an earthquake (E) are quite low, e.g. P(B) = 0.001 and P(E) = 0.002. Let the conditional probability tables be

B E P(A | B,E) P(¬ A| B,E)

T T 0.95 0.05

T F 0.94 0.06

F T 0.29 0.71

F F 0.001 0.999

A P(J| A) P(¬ J |A)

T 0.9 0.1

F 0.05 0.95

A P(M| A) P(¬ M| A )

T 0.7 0.3

F 0.01 0.99

What is the probability Prob{B = T | J = T} (denoted Prob{B|J} below), i.e. the probability that a burglar entered the house if John calls?

Prob{B|J} = Prob{J|B} Prob{B} / Prob{J} =
          = Prob{J|B} Prob{B} / (Prob{J|B} Prob{B} + Prob{J|¬B} Prob{¬B})


We have

Prob{J|B} = Prob{J|A, B} Prob{A|B} + Prob{J|¬A, B} Prob{¬A|B} =
          = Prob{J|A} Prob{A|B} + Prob{J|¬A} Prob{¬A|B}

Since

Prob{A|B} = Prob{A|B, E} Prob{E} + Prob{A|B, ¬E} Prob{¬E} = 0.95 · 0.002 + 0.94 · (1 − 0.002) = 0.94

Prob{A|¬B} = Prob{A|¬B, E} Prob{E} + Prob{A|¬B, ¬E} Prob{¬E} = 0.29 · 0.002 + 0.001 · (1 − 0.002) = 0.00158

Prob{¬A|¬B} = 1 − Prob{A|¬B} = 0.9984

it follows

Prob{J|B} = 0.9 · 0.94 + 0.05 · (1 − 0.94) = 0.8490

Prob{J|¬B} = Prob{J|A} Prob{A|¬B} + Prob{J|¬A} Prob{¬A|¬B} = 0.9 · 0.00158 + 0.05 · 0.9984 = 0.0513

and

Prob{B|J} = 0.8490 · 0.001 / (0.8490 · 0.001 + 0.0513 · (1 − 0.001)) = 0.016

You can retrieve the same results by running the script alarm.R, which relies on first computing the entire joint probability distribution and then the ratio Prob{B, J} / Prob{J}.
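For readers without access to alarm.R, a sketch of this enumeration approach is given below (the actual script may differ): the full joint P(B, E, A, J, M) is built from the conditional probability tables and Prob{B|J} is obtained as a ratio of sums.

joint <- expand.grid(B = c(TRUE, FALSE), E = c(TRUE, FALSE),
                     A = c(TRUE, FALSE), J = c(TRUE, FALSE), M = c(TRUE, FALSE))
pA <- function(B, E) ifelse(B & E, 0.95, ifelse(B, 0.94, ifelse(E, 0.29, 0.001)))
pJ <- function(A) ifelse(A, 0.9, 0.05)
pM <- function(A) ifelse(A, 0.7, 0.01)

joint$p <- with(joint,
  ifelse(B, 0.001, 0.999) * ifelse(E, 0.002, 0.998) *
  ifelse(A, pA(B, E), 1 - pA(B, E)) *
  ifelse(J, pJ(A), 1 - pJ(A)) * ifelse(M, pM(A), 1 - pM(A)))

sum(joint$p[joint$B & joint$J]) / sum(joint$p[joint$J])  # Prob{B|J} ~ 0.016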

Example

An interesting BN topology is known as Naive Bayes (Figure 4.3). In this case all variables zi, 0 < i ≤ n, are conditionally independent given the variable z0:

zi ⊥ zj | z0,   i > 0, j > 0, i ≠ j

It follows that the associated joint probability can be written as

P(Z) = P(z0) Π_{i=1}^{n} P(zi | z0)

Note that in this case, for binary variables, we need overall 2n + 1 parameters to encode the distribution, i.e. two parameters for each conditional distribution and one parameter to encode P(z0).

This probabilistic model is commonly used for probabilistic reasoning in medical

diagnostic where z0 denotes the pathology class (the cause) and z1 ,...,zn represent

nsymptoms (or effects) associated with the pathology. The assumption here is that

symptoms depend only on the underlying pathology. The Naive Bayes principle also

underlies a well-known classifier which will be presented in Section 10.2.3.1.

Definition 3.3 (Minimality condition). Consider a BN (G, P) satisfying the MC. The BN satisfies the minimality condition iff for every proper subgraph H of G the pair (H, P) does not satisfy the MC.


Figure 4.3: Naive Bayes topology.

4.3.1 Bayesian network and d-separation

The Markov Condition induces a set of conditional independence relations in a

Bayesian Network. However, it is not easy to determine which other conditional

independence relationships possibly hold.

In order to show the link between DAG topology and conditional independence, we introduce the criterion of d-separation. Let Z(i,j) be the set obtained by removing zi and zj from Z.

Definition 3.4 (d-separation). In a DAG (Directed Acyclic Graph), two nodes zi and zj are d-separated by the conditioning set S ⊆ Z(i,j) (denoted by (zi ⊥G zj | S)) if every path from zi to zj is blocked by S.

Two nodes are d-connected if they are not d-separated.

Definition 3.5 (Blocked path). A path from zi to zj in a DAG is blocked by the conditioning set S if

• at least one diverging or serially connected (i.e. non-collider) node of the path is in S, OR
• at least one converging node (collider) is such that neither it nor any of its descendants is in S.

Example

Consider the graph G shown in Figure 4.1. If we consider the path z2 → z4 ← z3, the node z4 is a collider. If we consider the path z2 → z4 → z6, the node z4 is a non-collider. It follows then that

• z6 is d-separated from z2 by the conditioning set S = {z4}

(z6 ⊥G z2 | z4)   (4.3.7)

since the only path is blocked (the serially connected node z4 ∈ S is in the path z2 → z4 → z6);

• z2 is not d-separated from z3 by the conditioning set S = {z4}

(z3 ⊥̸G z2 | z4)   (4.3.8)

since there is at least one path that is not blocked (the collider node z4 ∈ S is in the path z2 → z4 ← z3).

A non-blocked path is also called active. The activity of a path depends on the activity of its (collider and non-collider) nodes:


• Non-colliders: when the conditioning set is empty, they are active. When they belong to the conditioning set, they become inactive.
• Colliders: when the conditioning set is empty, they are inactive. They become active when they or some of their descendants are part of the conditioning set.

It follows that a path is not blocked (i.e. it is active) when all its nodes are active.

R example

The R package bnlearn allows us to encode DAGs and to perform checks of d-

separation between sets of variables. This package is used in the script dsep.R to

encode and then visualise the DAG in Figure 4.1. The script also uses the function

dsep provided by the package bnlearn to check the existence of the d-separations

corresponding to the conditional independence statements (4.3.7) and (4.3.8).
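A minimal sketch of this check is shown below, assuming the edge set implied by the factorisation of the example above (z1→z2, z1→z3, z2→z4, z3→z4, z3→z5, z4→z6); the actual dsep.R script may differ.

library(bnlearn)
dag <- model2network("[z1][z2|z1][z3|z1][z4|z2:z3][z5|z3][z6|z4]")

dsep(dag, "z6", "z2", "z4")  # TRUE:  d-separation (4.3.7)
dsep(dag, "z3", "z2", "z4")  # FALSE: (4.3.8), the collider z4 is in the conditioning set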

Note that, for the moment, d-separation is a pure graphical (and non probabilis-

tic) notion related to the topology of the graph G . In what follows, we will discuss

how and when it may be informative about probabilistic properties.

4.3.2 D-separation and I-map

Definition 3.6 (I-map property). A Bayesian Network (G, P) satisfies the I-map property if

∀zi, zj ∈ Z, ∀S ⊆ Z(i,j):  (zi ⊥G zj | S) ⇒ I(zi; zj | S) = 0

i.e. if the d-separation (zi ⊥G zj | S) implies the conditional independence zi ⊥ zj | S or, equivalently, the null conditional mutual information I(zi; zj | S) = 0.

We recall the relation (3.8.80) between conditional mutual information and conditional independence.

The I-map property implies that the set of d-separations of G (denoted by I(G)) is contained in the set of conditional independencies of P (denoted by I(P)):

I(G) ⊆ I(P)

Note that a completely connected DAG always satisfies the property above since I(G) = ∅.

It can be shown that if (and only if) a probability P satisfies the Markov condition for a given graph G, each d-separation in the graph implies a conditional independence (MC ⇔ I-map).

In other terms, if one takes any instance of a distribution P which factorises according to the graph structure and I(P) is the list of all the conditional independence statements that can be obtained from P, then, if z1 and z2 are d-separated by z3, the independence z1 ⊥ z2 | z3 belongs to I(P).

4.3.2.1 D-separation and faithfulness

If MC holds, d-separation implies independence. Can we also say the reverse, i.e.

that d-connection (i.e. absence of d-separation) implies dependence? Do all distributions P factorised according to a graph G possess the dependencies related to the d-connection of the graph? Unless we make additional assumptions, the answer

is no. The required additional assumption is called faithfulness between the graph

and the distribution.


Definition 3.7 (Faithfulness). A Bayesian Network (G, P) satisfies the faithfulness property if

∀zi, zj ∈ Z, ∀S ⊆ Z(i,j):  (zi ⊥̸G zj | S) ⇒ I(zi; zj | S) ≠ 0

or equivalently if the conditional independence zi ⊥ zj | S entails the d-separation:

∀zi, zj ∈ Z, ∀S ⊆ Z(i,j):  I(zi; zj | S) = 0 ⇒ (zi ⊥G zj | S)

Faithfulness means that independence between two variables zi and zj (in in-

formation theory terms I (zi ;zj | S ) = 0) implies d-separation, or equivalently that

d-connection implies dependency.

When both the Markov condition and the faithfulness hold, there is a bijection

between d-separation and conditional independence. The DAG is then said to be a perfect map of the joint probability distribution.

If faithfulness holds, we have a probabilistic independence interpretation of the

graphical d-separation. This means, for instance, that when the conditioning set

is empty non-colliders transmit information (dependence) while colliders do not

(independence).

Example

Consider the BN in Figure 4.1 and suppose it is faithful. For an empty conditioning

set, we can derive a number of relations of dependence, such as

z2 ⊥̸ z6,   z4 ⊥̸ z5

Are there two independent variables?

Beware that many distributions have no perfect map in DAGs. The spectrum

of probabilistic dependencies is, in fact, so rich that it cannot be cast into any

representation scheme that uses a polynomial amount of storage ([Verma, 1987]).

So, how strong is the assumption of faithfulness? It is possible to show that

a DAG is a perfect map for almost all distributions that factorise over G (i.e. all

distributions except a set of measure zero) [119]. This means that assuming faith-

fulness is reasonable in most practical settings. There are, however, well-known

counterexamples to faithfulness, as in the case of the XOR (eXclusive OR) function¹: if x1 → y ← x2 and y is the output of a XOR function with inputs x1 and x2, it follows that y ⊥ x1 and y ⊥ x2, though y is not d-separated from x1 and x2. In other terms, following the faithfulness assumption, there should be no edges connecting x1, x2 and y; however, there are.

Example

Consider a symptom (e.g. headache) with two possible causes: a serious one (can-

cer) and a less serious one (virus). We can describe the problem by using a BN

(Figure 4.4) where all the variables are Boolean. First of all, if we assume a perfect map, from d-separation we obtain that the variables C and V are marginally independent but not conditionally independent given H (i.e. conditioning on H, the path from C to V is unblocked).

Suppose that the a priori probability of the serious cause (P(C) = 0.1) is much lower than that of the less serious one (P(V) = 0.6) and that the conditional probability table is

1the XOR function returns TRUE if exactly one of the two inputs is TRUE and FALSE otherwise (in particular, FALSE if both inputs are TRUE).


Figure 4.4: Common effect configuration.

C V P(H | C,V) P(¬ H| C,V)

T T 0.95 0.05

T F 0.8 0.2

F T 0.8 0.2

F F 0.1 0.9

From the script expl away.R we compute the conditional probability of cancer in three different situations: headache only, headache and virus, and headache but no virus:

P(C = T | H) = 0.1597846   (4.3.9)
P(C = T | H, V) = 0.1165644   (4.3.10)
P(C = T | H, ¬V) = 0.4705882   (4.3.11)

Let us remark that the conditional probability (4.3.9) is higher than the a priori

(P(C )=0 .1) if we observe that the patient has a headache. If we know that the

patient is infected by a virus as well, the probability of cancer decreases to (4.3.10).

We say that the virus explains away the cancer possibility. On the contrary, if we

know that headache is present but the virus is absent, the cancer probability surges

again to (4.3.11).

This non-monotone behaviour is caused by what is called the explaining away

effect, i.e. if we have two common causes of an observed effect, knowing that one

occurs (or not) reduces (or increases) the probability of the other. This is due to the

fact that though virus and cancer are marginally independent (P (C |V ) = P (C)),

they are dependent once we condition on headache (P (C |V, H ) < P (C |H )).
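A sketch of the computation behind (4.3.9)-(4.3.11) follows (the actual expl away.R script may differ): the joint distribution P(C, V, H) is enumerated and the three conditional probabilities are obtained as ratios of sums.

joint <- expand.grid(C = c(TRUE, FALSE), V = c(TRUE, FALSE), H = c(TRUE, FALSE))
pH <- function(C, V) ifelse(C & V, 0.95, ifelse(C | V, 0.8, 0.1))
joint$p <- with(joint,
  ifelse(C, 0.1, 0.9) * ifelse(V, 0.6, 0.4) * ifelse(H, pH(C, V), 1 - pH(C, V)))

with(joint, sum(p[C & H]) / sum(p[H]))             # P(C|H)     = 0.1597846
with(joint, sum(p[C & H & V]) / sum(p[H & V]))     # P(C|H,V)   = 0.1165644
with(joint, sum(p[C & H & !V]) / sum(p[H & !V]))   # P(C|H,-V)  = 0.4705882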

4.3.3 Skeleton and I-equivalence

In the case of a perfect map, a BN is fully specified by its conditional independence statements. A definition of equivalence follows, then:

Definition 3.8 (I-equivalence). Two graphs are I-equivalent if they have the same

associated set of independencies.

All distributions that can be factorised on a graph G can also be factorised on an equivalent graph G′.

In order to check the notion of equivalence visually, we introduce the notion of

skeleton.


Definition 3.9 (Skeleton). The skeleton of a Bayesian Network graph G over Z is an undirected graph that contains an edge for every edge in G.

A sufficient (but not necessary) condition for equivalence of two graphs is that

they have the same skeleton and the same set of v-structures. A v-structure occurs in

a DAG when there is a node having two entering edges (e.g. Figure 4.4). Complete

graphs (e.g. completely connected triplets) are equivalent (no independence) but

may have different v-structures.

A v-structure with no direct edge between its parents is also called immorality

or unshielded collider. A sufficient and necessary condition for the equivalence of

two graphs is that they have the same skeleton and the same set of immoralities.

Definition 3.10 (Markov equivalence). Two DAGs are said to be Markov equivalent if and only if they have the same skeleton and the same v-structures.

Observational equivalence places a limit on the ability to infer directionality from

conditional probabilities alone. This means that there are classes of equivalence

of graphs that cannot be distinguished using conditional independence tests. Two

graphs that are I-equivalent cannot be distinguished without resorting to alternative

strategies (e.g. manipulative experimentation or temporal information).

By considering conditional independence only, it is not possible to detect changes

in graphs (e.g. arc reversing) that do not change the skeleton and which do not

introduce or destroy a v-structure. The simplest example is provided by the two graphs z1 → z2 and z2 → z1. They are Markov equivalent (same skeleton and no v-structure) and they have associated an empty set of independencies.

At the same time, the unshielded collider z1 → z2 ← z3 is a singleton, i.e. the only member of its equivalence class. This means that independence constraints alone suffice to determine its structure without ambiguity.

4.3.4 Stable distributions

The use of the independence relationships made so far implies that the set of inde-

pendences of the probability distribution associated with a graph depends only on

the structure of the graph and not on the parametrisation.

This restriction is also known as stability: in other terms, we consider distri-

butions whose independencies remain invariant to any change in the parameters.

The stability assumption presumes that unstable independencies (i.e. dependencies

that disappear with a parameter change) are unlikely to occur in the data, so all

the independencies are structural.

In general, it is important to be aware of the limited expressibility of BNs. BNs

cannot necessarily graphically represent all the independence properties of a given

distribution. Consider, for instance, the distribution associated to x1 → y1 ← z → y2 ← x2. If we marginalise the distribution w.r.t. z (i.e. z is not observable), there is no DAG containing only the vertices x1, x2, y1, y2 which represents the independence relations of the original DAG without adding spurious independencies.

4.4 Markov networks

Markov networks (MN) are an undirected graphical representation of conditional

independencies. Let us consider a set Z ={z1 ,...,zn } of n random variables.

Definition 4.1. The conditional independence graph of Z is the undirected graph G = (V, E) where V = {1, ..., n} and (i, j) is NOT in the edge set E iff

zi ⊥ zj | Z−{i,j}


This graph is also called a pairwise Markov graph. Note that for n variables, there are 2^(n(n−1)/2) potential undirected graphs.

4.4.1 Separating vertices, separated subsets and independence

As in the directed case, it is possible to use topological notions (e.g. separation

of vertices in the network) to deduce probabilistic properties (e.g. conditional

(in)dependencies). Given an undirected graph G = (V , E),

• a subset of vertices separates two vertices i and j if every path joining the two vertices contains at least one vertex from the separating subset;
• a subset of vertices separates two subsets Va and Vb in G if it separates every pair of vertices i ∈ Va, j ∈ Vb.

The last property is also called the global Markov property. In general, it can be shown that the set of distributions that satisfies the pairwise Markov property also satisfies the global Markov property.

Given an undirected independence graph G = (V, E ) it can be shown that:

• if V can be partitioned into two subsets Vb and Vc such that there is no path between any vertex in Vb and any vertex in Vc, then

xi ⊥ xj,   for all xi ∈ Vb and xj ∈ Vc

• if Va is any subset of vertices of G that separates i and j, then

xi ⊥ xj | Xa

4.4.2 Directed and undirected representations

BNs and MNs are two closely related, yet different, representations, and it is important not to confuse them. So why consider undirected representations besides directed Bayesian networks? What is their main difference? First of all, let us recall Box's golden rule of modelling: no model representation is perfect or exhaustive, all of them are wrong, but sometimes some of them are useful. Markov

networks visualise conditional independence properties in distributions without hav-

ing recourse to any notion of ordering. In that sense, they are more adequate when

the considered problem is not explicitly associated to a specific ordering of vari-

ables or is characterised by symmetric relations. At the same time, asymmetric

relationships (e.g. cause and effect, past and future) fit well the BN formalism.

As an example, let us consider two probabilistic distributions of n random vari-

ables. In the first case, the n variables represent a quantity measured during n

consecutive time instants. In the second case, the n variables measure the same

quantity over n different spatial locations. In both cases properties of conditional

independence might help in representing and reasoning on the distribution. How-

ever, only the first case takes advantage of a Bayesian Network representation which

encodes the explicit and asymmetric time ordering. An undirected representation,

where a notion of symmetric neighbourhood is present, is more suitable to the

spatial distribution task.

A second interesting issue is whether a MN is equivalent to a BN that has been deprived of the edge directionality. The answer is not so simple. Consider a DAG G and a faithful probability P. Let U be the undirected skeleton associated with G and U′ be the undirected conditional independence graph associated with P. Which relationship exists between U and U′? U and U′ generally are not the same, but the relation U ⊆ U′ holds. As shown by Wermuth and Lauritzen, U and U′ are the same iff G does not contain any unshielded collider.


4.5 Conclusions

The graphical modelling formalism enables a modular representation of large variate

problems thanks to the correspondence between topological properties and prob-

abilistic notions. Graphical models are then effective tools that can be used to

represent and communicate the relations between a large number of variables and

to perform probabilistic reasoning.

In general, an effective modelling approach has to manage the trade-off between

complexity and fidelity to reality. Graphical modelling uses the notion of conditional

independence to address such an issue.

Note that the adoption of conditional independence assumptions to simplify

representations is pervasive in mathematical modelling and human reasoning: think,

for instance, of the notion of state in dynamical systems, which makes the future behaviour independent of the past given the present. Simplifying by conditioning

is also a peculiar characteristic of human causal reasoning: once we find the cause

of a certain phenomenon, we can disregard all other variables as irrelevant.

We will see that in machine learning, graphical modelling is a powerful way

to explain why some variables are more important or relevant than others (Chap-

ter 12). At the same time, machine learning strategies may be used to infer compact,

graphical (and sometimes causal) representations from data (Chapter 13).

Chapter 5

Parametric estimation and testing

Given the correct probabilistic model of a phenomenon, we may derive the properties

of observable data by logical deduction. The theory of statistics is designed to

reverse the deductive process (Chapter 2). It takes measured data and uses them to

propose a probabilistic model, to estimate its parameters and eventually to validate

it. This chapter will focus on the estimation methodology, intended as the inductive

process which leads from observed data to a probabilistic description of reality.

We will focus here on the parametric approach, which assumes that we know all

about the probabilistic model except the value of a finite number of parameters.

Parametric estimation algorithms build estimates from data and, more important,

statistical measures to assess their quality. There are two main approaches to

parametric estimation:

Classical or frequentist: it is based on the idea that sample data are the sole

quantifiable form of relevant information and that the parameters are fixed

but unknown. It is related to the frequency view of probability (Section 3.1.4).

Bayesian approach: the parameters are supposed to be random variables , having

a distribution prior to data observation and a distribution posterior to data

observation. This approach assumes that there exists something beyond data,

(i.e. a human sense of uncertainty or a subjective degree of belief), and that

this belief can be described in the probabilistic form.

It is well known, however, that in large-sample problems, frequentist and Bayesian

approaches tend to produce similar numerical results and that in small-medium

settings, though the two outcomes may not coincide, their difference is usually

small. For those reasons and, mainly for reasons of space, we will limit here to

consider the classical approach. It is important, however, not to underestimate the

important role of the Bayesian estimation philosophy, which led recently to a large

amount of research in Bayesian data analysis and important applications in machine

learning [78].

5.1 Classical approach

The classical approach to parameter estimation dates back to the period 1920-35

when J. Neyman and E.S. Pearson, stimulated by problems in biology and industry,

concentrated on the principles for testing hypothesis and R.A. Fisher, interested in

agricultural issues, focused on the estimation from data.



We will introduce estimation by considering a simple univariate setting. Let z

be a continuous r.v. and suppose that

1. we know the analytical form of the distribution family

F_z(z) = F_z(z, θ)

but the parameter vector θ ∈ Θ is unknown;

2. we have access to a set D_N of N i.i.d. measurements of z, called sample data.

In the general case, few parameters are not enough to describe a function, like

the density function: in that sense, parametric densities are an obvious simplifica-

tion. An example of a parametric distribution function is the Normal distribution

(Section (3.4.2)), where the parameter vector is θ = [µ, σ]. The goal of the estimation procedure is to find a value θ̂ of the parameter θ so that the parameterised distribution F_z(z, θ̂) closely matches the distribution of the data.

The notation i.i.d. stands for identically and independently distributed. Identically distributed means that all the observations have been sampled from the same distribution, that is

Prob{zi = z} = Prob{zj = z}   for all i, j = 1, ..., N and z ∈ Z

Independently distributed means that the fact that we have observed a certain value zi does not influence the probability of observing the value zj, that is

Prob{zj = z | zi = zi} = Prob{zj = z}

Example

Here you find some examples of estimation problems:

1. Let D_N = {20, 31, 14, 11, 19, ...} be the times in minutes spent going home during the last 2 weeks. What is the mean time to reach my house from ULB?

2. Consider the car traffic in the boulevard Jacques. Suppose that the measures

of the inter-arrival times are DN = { 10,11, 1, 21,2, . . . } seconds. What does

this imply about the mean inter-arrival time?

3. Consider the students of the last year of Computer Science. What is the

variance of their grades?

4. Let z be the r.v. denoting tomorrow's temperature. How can I estimate its

mean value on the basis of past observations?

Parametric estimation is a mapping from the space of the sample data to the

space of parameters Θ. The two possible outcomes are:

1. some specific value of Θ. In this case, we have the so-called point estimation.

2. some particular region of Θ. In this case, we obtain an interval of confidence

on the value of the parameter.


5.1.1 Point estimation

Consider a random variable z with a parametric distribution F_z(z, θ), θ ∈ Θ. The unknown parameter can be written as a function(al) of F:

θ = t(F)

This corresponds to the fact that θ is a characteristic of the population described by F_z(·). For instance, the expected value parameter µ = t(F) = ∫ z dF(z) is a functional of F.

Suppose now that we have observed a set of N i.i.d. values D_N = {z1, z2, ..., zN}. A point estimate is an example of statistic, where by statistic we generally mean any function of the sample data D_N. In other terms, a point estimate is a function

θ̂ = g(D_N)   (5.1.1)

of the sample dataset D_N, where g(·) stands for the estimation algorithm, that is the procedure which returns the estimation starting from a dataset D_N. Note that, from a machine learning perspective, it is more appropriate to consider g, rather than a conventional mathematical function, as a generic algorithm taking the sample dataset as an input and returning an estimation as output1.

There are two main issues in estimation and, more generally, in data analysis, statistics and machine learning: how to construct an estimator (i.e. which form should g take) and how to assess the quality of the returned estimation θ̂. In Sections 5.3 and 5.8 we will discuss two strategies for defining an estimator: the plug-in principle and maximum likelihood. In Section 5.5 we will present the statistical measures most commonly adopted to assess an estimator's accuracy.

Before introducing the plug-in principle, we need, however, to present the notion of empirical distribution.

5.2 Empirical distributions

Suppose we have observed an i.i.d. random sample of size N from a probability distribution F_z(·):

F_z → {z1, z2, ..., zN}

The empirical distribution probability F̂ is defined as the discrete distribution that assigns probability 1/N to each value zi, i = 1, ..., N. In other words, F̂ assigns to a set A in the sample space of z its empirical probability

Prob{z ∈ A} ≈ #{zi ∈ A} / N

that is, the proportion of the observations in D_N which occur in A.

It can be proved that the vector of observed frequencies in F̂ is a sufficient statistic for the true distribution F(·), i.e. all the information about F(·) contained in D_N is also contained in F̂(·).

Consider now the distribution function F_z(z) of a continuous r.v. z and a set of N observations D_N = {z1, ..., zN}. Since

F_z(z) = Prob{z ≤ z}

we define N(z) as the number of observations in D_N that do not exceed z. We then obtain the empirical estimate of F(·):

F̂_z(z) = N(z)/N = #{zi ≤ z} / N   (5.2.2)

1For instance, an awkward, yet acceptable, estimation algorithm could take the dataset, discard

all the examples except the third one and return it as the estimation.


Figure 5.1: Empirical distribution.

This function is a staircase function with discontinuities at the points zi (Figure 5.1).

Example

Suppose that our dataset is made of the following N = 14 observations:

D_N = {20, 21, 22, 20, 23, 25, 26, 25, 20, 23, 24, 25, 26, 29}

The empirical distribution function F̂_z (which can be traced by running the script cumdis.R) is plotted in Figure 5.1.
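In R, the same staircase function can be obtained directly with the built-in ecdf function:

D <- c(20, 21, 22, 20, 23, 25, 26, 25, 20, 23, 24, 25, 26, 29)
Fhat <- ecdf(D)   # empirical distribution function (5.2.2)
Fhat(22)          # fraction of observations <= 22, here 5/14
plot(Fhat)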

5.3 Plug-in principle to define an estimator

Consider an r.v. z and a sample dataset D_N drawn from the parametric distribution F_z(z, θ). The main issue of estimation is how to define an estimate of θ. A possible solution is given by the plug-in principle, that is, a simple method of estimating parameters from observations. The plug-in estimate of a parameter (or target) θ is defined to be

θ̂ = t(F̂(z))   (5.3.3)

obtained by replacing the distribution function with the empirical distribution in the analytical expression of the parameter.

The following section will discuss the plug-in estimators of the first two moments of a probability distribution.


5.3.1 Sample average

Consider an r.v. z ∼ F_z(·) such that

θ = E[z] = ∫ z dF(z)

with θ unknown. Suppose we have available the sample F_z → D_N, made of N observations. The plug-in point estimate of θ is given by the sample average

θ̂ = (1/N) Σ_{i=1}^{N} zi = µ̂   (5.3.4)

Note that the sample average is not a parameter (i.e. it is not a function of the probability distribution F_z) but a statistic (i.e. a function of the dataset D_N).

5.3.2 Sample variance

Consider an r.v. z ∼ F_z(·) where the mean µ and the variance σ² are unknown. Suppose we have available the sample F_z → D_N. Once we have the sample average µ̂, the plug-in estimate of σ² is given by the sample variance

σ̂² = (1/(N − 1)) Σ_{i=1}^{N} (zi − µ̂)²   (5.3.5)

The presence of N − 1 instead of N in the denominator will be explained later. Note also that the following relation holds for any set of zi:

(1/N) Σ_{i=1}^{N} (zi − µ̂)² = ((1/N) Σ_{i=1}^{N} zi²) − µ̂²
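A short R sketch (on an arbitrary synthetic sample) showing that the plug-in formulas above coincide with R's built-in functions (var also uses the N − 1 denominator):

set.seed(0)
z <- rnorm(100, mean = 5, sd = 2)
N <- length(z)

mu.hat  <- sum(z) / N                     # sample average (5.3.4)
var.hat <- sum((z - mu.hat)^2) / (N - 1)  # sample variance (5.3.5)

c(mu.hat, mean(z))   # identical
c(var.hat, var(z))   # identical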

The expressions of the plug-in estimators of other interesting probabilistic parameters are given in Appendix D.

5.4 Sampling distribution

Given a dataset D_N of N observations sampled from z, let us consider a point estimate

θ̂ = g(D_N)   (5.4.6)

Note that since D_N is the outcome of N realisations of a r.v. z, the vector D_N can be considered as the realisation of a random vector D_N.²

By applying the transformation g to the random variable D_N we obtain the random variable

θ̂ = g(D_N)   (5.4.7)

which is called the point estimator of θ. A key point is the following: while θ is an (unknown) fixed value, the estimator θ̂ is a random variable. For instance, if we aim to estimate θ = µ (the expected value of z), the parameter µ is an unknown and fixed value, while the average µ̂ is a random variable (since it is a function of a random dataset).

2This is not a mathematical detail but an essential aspect of the data-driven discovery process

under uncertainty. Every model learned from data, or more in general all knowledge acquired from

data, is built on random foundations and, as such, it is a random quantity and has to be assessed

as such.


Figure 5.2: From the parametric parent distribution F_z(·, θ) (underlying the data generation) to the sampling distribution of the estimator θ̂_N. Each dataset has the same size N.

The probability distribution of the r.v. θ̂ is called the sampling distribution, while the distribution of the r.v. z (with parameter θ) is called the parent distribution. An example of the process leading from the parent to the sampling distribution is plotted in Figure 5.2. Note that the sampling distribution, though a theoretical quantity, is of great significance in estimation since it quantifies the estimator's accuracy in probabilistic terms or, in simpler words, the gap between the estimation and the parameter θ.

5.4.1 Shiny dashboard

The dashboard estimation.R (Appendix G) provides an interactive visualisation of the sampling distribution of the plug-in estimators of the parameters (mean and variance) of a Normal parent distribution z. We invite the reader to modify the values N, µ and σ and to observe the impact on the sampling distribution. Note that the sampling distribution is obtained by a Monte Carlo simulation of the process illustrated in Figure 5.2. The simulation (Algorithm 1 and related R code in Table 5.1) consists in repeating an (adjustable) number of trials where, for each trial, a sample dataset of size N is generated and the plug-in estimations are computed. The dashboard shows the histograms of the estimations.


Algorithm 1 Monte Carlo simulation to generate a sampling distribution

1: S = {}
2: for r = 1 to R do
3:   D_N = {z_1, z_2, ..., z_N} ∼ F_z   // pseudo-random sample generation
4:   θ̂ = g(D_N)                        // estimation computation
5:   S = S ∪ {θ̂}
6: end for
7: Plot histogram of S
8: Compute statistics of S (mean, variance)
9: Study distribution of S with respect to θ (e.g. estimate bias)

mu <- 0       # parameter to be estimated
R <- 10000    # number of trials
N <- 20       # size of each dataset
S <- numeric(R)
for (r in 1:R) {
  D <- rnorm(N, mean = mu, sd = 10)   # pseudo-random sample generation
  S[r] <- mean(D)                     # compute the estimate
}
hist(S)                 # plot histogram of S
bias <- mean(S) - mu    # estimate the bias

Table 5.1: R version of the Algorithm 1 pseudo-code to generate the sampling distribution of µ̂.

5.5 The assessment of an estimator

Once an estimator θ̂ has been defined (e.g. in algorithmic or mathematical form), it is possible to assess its accuracy from its sampling distribution.

5.5.1 Bias and variance

The following measures rely on the sampling distribution³ to assess the estimator's accuracy.

Definition 5.1 (Bias of an estimator). An estimator θ̂ of θ is said to be unbiased if and only if

E_{D_N}[θ̂] = θ

Otherwise, it is said to be biased with bias

Bias[θ̂] = E_{D_N}[θ̂] − θ    (5.5.8)

Definition 5.2 (Variance of an estimator). The variance of an estimator θ̂ of θ is the variance of its sampling distribution:

Var[θ̂] = E_{D_N}[(θ̂ − E[θ̂])²]

³Please note that we refer to the distribution of θ̂ and not to the distribution of z.


Definition 5.3 (Standard error). The square root of the variance

σ̂ = √Var[θ̂]

is called the standard error of the estimator θ̂.

An unbiased estimator is an estimator that, on average, has the right value. But averaged over what? It is important to retain that this average is over different realisations of the dataset D_N, as made explicit by the notation E_{D_N}[θ̂], represented visually by Figure 5.2 and simulated by the Monte Carlo repetitions in Section 5.4.1.

Note that different unbiased estimators may exist for a parameter θ. Also, a biased estimator with a known bias (i.e. one not depending on θ) is equivalent to an unbiased estimator, since we can easily compensate for the bias. We will see in Section 5.5.3 that for some specific estimators it is possible to derive the bias analytically. Unfortunately, in general, the bias is not measurable since this would require the knowledge of θ, which is in fact the target of our estimation procedure: nevertheless, the notion of bias is an important theoretical quantity to reason about the accuracy of an estimation process.

Sometimes we are accurate (e.g. unbiased) in estimating θ though we are interested in f(θ). Given a generic transformation f(·), if θ̂ is unbiased for θ this does not imply that f(θ̂) is unbiased for f(θ) as well. This implies, for instance, that the standard error σ̂ is not an unbiased estimator of the standard deviation σ despite σ̂² being an unbiased estimator of σ².

5.5.2 Estimation and the game of darts

An intuitive manner of visualising the notion of sampling distribution of an estimator and the related concepts of bias and variance is to use the analogy of the darts game. The unknown parameter θ can be seen as the target of the darts game and the estimator θ̂ as a player. Figure 5.3 shows the target (black dot) together with the distribution of the draws of two different players: the C (cross) player and the R (round) player. In terms of our analogy, the cross player/estimator has small variance but large bias, while the round one has small bias and large variance. Which one is the best? Now it is your turn to draw the shot distribution of a player with low bias and low variance, and of a player with large bias and large variance.

5.5.3 Bias and variance of µ̂

This section shows that for a generic r.v. z and an i.i.d. dataset D_N, the sample average µ̂ is an unbiased estimator of the mean E[z]. Consider a random variable z ∼ F_z(·). Let µ and σ² be the mean and the variance of F_z(·), respectively. Suppose we have observed the i.i.d. sample D_N ∼ F_z.

From (5.3.4) we obtain

E_{D_N}[µ̂] = E_{D_N}[(1/N) Σ_{i=1}^N z_i] = (Σ_{i=1}^N E[z_i])/N = Nµ/N = µ    (5.5.9)

This means that the sample average estimator is not biased, whatever the distribution F_z(·) is. And what about its variance? Since according to the i.i.d. assumption Cov[z_i, z_j] = 0 for i ≠ j, from (3.10.85) we obtain that the variance of the sample


Figure 5.3: The dart analogy: the target is the unknown parameter, the round dots represent some realisations of the estimator R, while the crosses represent some realisations of the estimator C.

average estimator is

Var[µ̂] = Var[(1/N) Σ_{i=1}^N z_i] = (1/N²) Var[Σ_{i=1}^N z_i] = (1/N²) Nσ² = σ²/N    (5.5.10)

In fact, µ̂ acts like the "round player" in the darts game (Figure 5.3), with some variance but no bias. You can visualise the bias and variance of the sample average estimator by running the Shiny dashboard estimation.R introduced in Section 5.4.
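A quick Monte Carlo check of (5.5.9) and (5.5.10), along the lines of the code in Table 5.1, might look as follows:

mu <- 0; sigma <- 10; N <- 20; R <- 10000
S <- replicate(R, mean(rnorm(N, mu, sigma)))
mean(S)   # close to mu = 0: no bias
var(S)    # close to sigma^2/N = 5: the variance of the estimator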

5.5.4 Bias of the estimator σ̂²

Let us now study the bias of the estimator of the variance of z:

E_{D_N}[σ̂²] = E_{D_N}[(1/(N−1)) Σ_{i=1}^N (z_i − µ̂)²]    (5.5.11)
           = (N/(N−1)) E_{D_N}[(1/N) Σ_{i=1}^N (z_i − µ̂)²]    (5.5.12)
           = (N/(N−1)) E_{D_N}[((1/N) Σ_{i=1}^N z_i²) − µ̂²]    (5.5.13)

Since E[z²] = µ² + σ² and Cov[z_i, z_j] = 0, the first term inside the E[·] is

E_{D_N}[(1/N) Σ_{i=1}^N z_i²] = (1/N) Σ_{i=1}^N E_{D_N}[z_i²] = (1/N) N(µ² + σ²) = µ² + σ²

Since E[(Σ_{i=1}^N z_i)²] = N²µ² + Nσ², the second term is

E_{D_N}[µ̂²] = (1/N²) E_{D_N}[(Σ_{i=1}^N z_i)²] = (1/N²)(N²µ² + Nσ²) = µ² + σ²/N

It follows that

E_{D_N}[σ̂²] = (N/(N−1)) [(µ² + σ²) − (µ² + σ²/N)] = (N/(N−1)) ((N−1)/N) σ² = σ²

This result justifies our definition (5.3.5): once the term N−1 is inserted in the denominator, the sample variance estimator is not biased.

Some points are worth considering:

• The results (5.5.9), (5.5.10) and (5.5.11) are independent of the family of the distribution F(·).

• According to (5.5.10), the variance of µ̂ is 1/N times the variance of z. This is a formal justification of why taking averages over a large number of observations is recommended: the larger N, the smaller Var[µ̂], so a bigger N for a given σ² implies a better estimate of µ.

• According to the central limit theorem (Section C.7), under quite general conditions on the distribution F_z, the distribution of µ̂ will be approximately normal as N gets large, which we can write as

µ̂ ∼ N(µ, σ²/N) for N → ∞

• The standard error √Var[µ̂] = σ/√N is a common measure of statistical accuracy. Roughly speaking, if the estimator is not biased and the conditions of the central limit theorem apply, we expect µ̂ to be less than one standard error away from µ about 68% of the time, and less than two standard errors away from µ about 95% of the time (see Table 3.2).

Script

You can visualise the bias and variance of the sample variance estimator by running the R script sam_dis2.R or the Shiny dashboard estimation.R introduced in Section 5.4.
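The script itself is not shown here; a minimal sketch of such a simulation (our own assumption of what sam_dis2.R does), contrasting the N−1 and N denominators, is:

mu <- 0; sigma <- 10; N <- 20; R <- 10000
S1 <- numeric(R); S2 <- numeric(R)
for (r in 1:R) {
  D <- rnorm(N, mu, sigma)
  S1[r] <- var(D)                  # denominator N-1, as in (5.3.5)
  S2[r] <- mean((D - mean(D))^2)   # denominator N
}
mean(S1) - sigma^2   # close to 0: unbiased
mean(S2) - sigma^2   # close to -sigma^2/N = -5: negative bias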

5.5.5 A tongue-twister exercise

It sounds like a tongue-twister, but it is important that the reader take some time to reason on the substantial difference between two quantities like

1. the variance of an estimator and

2. the estimator of the variance.

The first quantity is denoted by Var[θ̂]; it is a real number and measures the accuracy of an estimator. It was introduced in Section 5.5. The second is denoted σ̂²; it is a random quantity since it is an estimator, and its properties (e.g. bias) have been discussed in Section 5.5.4. Now, if you understand the difference between the two quantities above, you can reason about Var[σ̂²], which is nothing more than the variance of the estimator of the variance. Clear, isn't it? And what about the estimator of the variance of the estimator of the variance?


5.5.6 Bias/variance decomposition of MSE

Bias and variance are two independent criteria to assess the quality of an estimator. As shown in Figure 5.3, we could have two estimators behaving in opposite ways: the first has large bias and low variance, while the second has large variance and small bias. How can we choose among them? We need a measure able to combine the two into a single criterion. This is the role of the mean-square error (MSE) measure. When θ̂ is a biased estimator of θ, its accuracy is usually assessed by its MSE rather than simply by its variance. The MSE is defined by

MSE = E_{D_N}[(θ − θ̂)²]

For a generic estimator it can be shown that

MSE = (E[θ̂] − θ)² + Var[θ̂] = (Bias[θ̂])² + Var[θ̂]    (5.5.14)

i.e., the mean-square error is equal to the sum of the squared bias and the variance of the estimator. Here is the analytical derivation:

MSE = E_{D_N}[(θ − θ̂)²] = E_{D_N}[(θ − E[θ̂] + E[θ̂] − θ̂)²]    (5.5.15)
    = E_{D_N}[(θ − E[θ̂])²] + E_{D_N}[(E[θ̂] − θ̂)²] + E_{D_N}[2(θ − E[θ̂])(E[θ̂] − θ̂)]    (5.5.16)
    = E_{D_N}[(θ − E[θ̂])²] + E_{D_N}[(E[θ̂] − θ̂)²] + 2(θ − E[θ̂])(E[θ̂] − E[θ̂])    (5.5.17)
    = (E[θ̂] − θ)² + Var[θ̂]    (5.5.18)

This decomposition is typically called the bias-variance decomposition. Note that,

if an estimator is unbiased then its MSE is equal to its variance.

5.5.7 Consistency

Suppose that the sample data contain N independent observations z_1, ..., z_N of a univariate random variable. Let the estimator of θ based on N observations be denoted θ̂_N. As N becomes larger, we might reasonably expect that θ̂_N improves as an estimator of θ (in other terms, it gets closer to θ). The notion of consistency formalises this concept.

Definition 5.4. The estimator θ̂_N is said to be weakly consistent if θ̂_N converges to θ in probability, that is

∀ε > 0:  lim_{N→∞} Prob{|θ̂_N − θ| ≤ ε} = 1

Definition 5.5. The estimator θ̂_N is said to be strongly consistent if θ̂_N converges to θ with probability 1 (or almost surely), i.e.

Prob{lim_{N→∞} θ̂_N = θ} = 1

For a scalar θ, the property of convergence guarantees that the sampling distribution of θ̂_N becomes less disperse as N → ∞. In other terms, a consistent estimator is asymptotically unbiased. It can be shown that a sufficient condition for the weak consistency of an unbiased estimator θ̂_N is that Var[θ̂_N] → 0 as N → ∞. It is important to remark that the property of unbiasedness (for finite-size samples) and consistency are largely unrelated.

108 CHAPTER 5. PARAMETRIC ESTIMATION

Exercise

Consider an estimator of the mean that takes into consideration only the first 10 sample points, whatever the total number N > 10 of observations is. Is such an estimator consistent?

5.5.8 Efficiency

Suppose we have two unbiased and consistent estimators. How can we choose between them?

Definition 5.6 (Relative efficiency). Let us consider two unbiased estimators θ̂_1 and θ̂_2. If

Var[θ̂_1] < Var[θ̂_2]

we say that θ̂_1 is more efficient than θ̂_2.

If the estimators are biased, typically the comparison is done on the basis of the

mean square error.

Exercise

Suppose z_1, ..., z_N is a random sample of observations from a distribution with mean θ and variance σ². Study the unbiasedness and the consistency of the following three estimators of the mean µ:

θ̂_1 = µ̂ = (Σ_{i=1}^N z_i)/N

θ̂_2 = N θ̂_1/(N + 1)

θ̂_3 = z_1

5.6 Hoeffding's inequality

A probabilistic measure of the discrepancy between the estimator µ̂ and the quantity µ = E[z] to be estimated is returned by Hoeffding's inequality.

Theorem 6.1 ([103]). Let z_1, ..., z_N be independent bounded random variables such that z_i falls in the interval [a_i, b_i] with probability one. Let their sum be S_N = Σ_{i=1}^N z_i. Then for any ε > 0 we have

Prob{|S_N − E[S_N]| > ε} ≤ 2 exp(−2ε² / Σ_{i=1}^N (b_i − a_i)²)

Corollary 6.2. If the variables z_1, ..., z_N are independent and identically distributed, the following bound on the discrepancy between the sample mean µ̂ = (Σ_{i=1}^N z_i)/N and the expected value E[z] holds:

Prob{|µ̂ − E[z]| > ε} ≤ 2 exp(−2Nε² / (b − a)²)


Assume that δ is a confidence parameter, that is, we are 100(1−δ)% confident that the estimate µ̂ is within accuracy ε of the true expectation. Setting the right-hand side of the bound in Corollary 6.2 equal to δ, it is possible to derive the expression

ε(N) = √((b − a)² log(2/δ) / (2N))

which measures, with confidence 1−δ, how close the sample mean µ̂, estimated on the basis of N points, is to the expectation E[z]. We can also determine the number of observations N necessary to obtain an accuracy ε with confidence δ by using the relation

N > (b − a)² log(2/δ) / (2ε²)
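As a concrete illustration (our own, not from the original text): for observations bounded in [0, 1], an accuracy ε = 0.1 with confidence 95% requires

# sample size prescribed by the Hoeffding relation above
hoeffding.N <- function(eps, delta, a, b)
  ceiling((b - a)^2 * log(2/delta) / (2 * eps^2))
hoeffding.N(eps = 0.1, delta = 0.05, a = 0, b = 1)   # 185 observations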

Hoeffding's bound is a general bound that only relies on the assumption that

sample points are drawn independently. Bayesian bounds are another example of

statistical bounds which give tighter results under the assumption that the examples

are drawn from a normal distribution.

5.7 Sampling distributions for Gaussian r.v.s

The results in Section 5.5 are independent of the type of distribution function F_z. Additional results are available in the specific case of a normal random variable. Let z_1, ..., z_N be i.i.d. realisations of z ∼ N(µ, σ²) and let us consider the following sample statistics:

µ̂ = (1/N) Σ_{i=1}^N z_i,    ŜS = Σ_{i=1}^N (z_i − µ̂)²,    σ̂² = ŜS/(N−1)

It can be shown that the following relations hold:

• µ̂ ∼ N(µ, σ²/N) and N(µ̂ − µ)²/σ² ∼ χ²_1, where the χ² distribution is presented in Appendix C.2.2.

• z_i − µ ∼ N(0, σ²), so Σ_{i=1}^N (z_i − µ)²/σ² ∼ χ²_N.

• Σ_{i=1}^N (z_i − µ)² = ŜS + N(µ̂ − µ)².

• ŜS/σ² ∼ χ²_{N−1}, or equivalently (N−1)σ̂²/σ² ∼ χ²_{N−1}. See the R script sam_dis2.R.

• √N(µ̂ − µ)/σ̂ ∼ T_{N−1}, where T stands for the Student distribution (Section C.2.3).

• If E[|z − µ|⁴] = µ₄, then Var[σ̂²] = (1/N)(µ₄ − ((N−3)/(N−1))σ⁴).

5.8 The principle of maximum likelihood

Maximum likelihood is a major strategy used in statistics to design an estimator, i.e. the algorithm g in (5.4.7). Its rationale is to transform a problem of estimation into a problem of optimisation. Let us consider

1. a density distribution p_z(z, θ) which depends on a parameter θ ∈ Θ,

2. a dataset D_N = {z_1, z_2, ..., z_N} i.i.d. drawn from this distribution.


According to (3.5.54), the joint probability density of the i.i.d. dataset is the product

p_{D_N}(D_N, θ) = Π_{i=1}^N p_z(z_i, θ) = L_N(θ)    (5.8.19)

where, for a fixed D_N, L_N(·) is a function of θ and is called the empirical likelihood of θ given D_N.

The principle of maximum likelihood was first used by Lambert around 1760 and by D. Bernoulli about 13 years later. It was detailed by Fisher in 1920. The idea is simple: given an unknown parameter θ and a sample dataset D_N, the maximum likelihood estimate θ̂ is the value for which the empirical likelihood L_N(θ) attains its maximum:

θ̂_ml = arg max_{θ∈Θ} L_N(θ)

The estimator θ̂_ml is called the maximum likelihood estimator (m.l.e.). In practice, it is usual to consider the log-likelihood l_N(θ) instead of L_N(θ). Since log(·) is a monotone function, we have

θ̂_ml = arg max_{θ∈Θ} L_N(θ) = arg max_{θ∈Θ} log(L_N(θ)) = arg max_{θ∈Θ} l_N(θ)    (5.8.20)

The likelihood function quantifies the relative abilities of the various parameter values to explain the observed data. The principle of m.l. is that the value of the parameter under which the obtained data would have had the highest probability of arising must intuitively be our best estimate of θ. In other terms, the likelihood can be considered a measure of how plausible the parameter values are in light of the data. Note, however, that the likelihood function is NOT a probability function: for instance, in general, it does not integrate to 1 (with respect to θ). In terms of conditional probability, L_N(θ) represents the probability of the observed dataset given θ, and not the probability of θ (which is not a r.v. in the frequentist approach) given D_N.

Example

Consider a binary variable (e.g. a coin toss) which takes the value 1 (e.g. "Tail") z = 15 times in N = 40 trials. Suppose that the probabilistic model underlying the data is Binomial (Section C.1.2) with an unknown probability θ = p. We want to estimate the unknown parameter θ = p ∈ [0, 1] on the basis of the empirical evidence from the N trials. The likelihood L(p) is a function of (only) the unknown parameter p. By applying the maximum likelihood technique we have

θ̂_ml = p̂ = arg max_p L_N(p) = arg max_p (N choose z) p^z (1−p)^(N−z) = arg max_p (40 choose 15) p^15 (1−p)^25

Figure 5.4 plots L(p) versus p ∈ [0, 1] (R script ml_bin.R). The most likely value of p is the value where L(·) attains its maximum. According to Figure 5.4, this value is p̂ = z/N. The log-likelihood for this model is

l_N(p) = log L_N(p) = log (N choose z) + z log(p) + (N − z) log(1 − p) = log (40 choose 15) + 15 log p + 25 log(1 − p)

The reader can analytically find the maximum of this function by differentiating l(p) with respect to p.
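A short sketch in the spirit of ml_bin.R (our assumption, not necessarily the script's actual content) is:

z <- 15; N <- 40
L <- function(p) dbinom(z, size = N, prob = p)    # likelihood L(p)
curve(L(x), from = 0, to = 1, xlab = "p", ylab = "L(p)")
optimize(L, c(0, 1), maximum = TRUE)$maximum      # close to z/N = 0.375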


Figure 5.4: Likelihood function

5.8.1 Maximum likelihood computation

In many situations the log-likelihood l_N(θ) is particularly well behaved, being continuous with a single maximum away from the extremes of the range of variation of θ. Then θ̂_ml is obtained simply as the solution of

∂l_N(θ)/∂θ = 0

subject to

∂²l_N(θ)/∂θ² |_{θ̂_ml} < 0

to ensure that the identified stationary point is a maximum.

5.8.2 Maximum likelihood in the Gaussian case

Let D_N be a random sample from the r.v. z ∼ N(µ, σ²). It is possible to derive analytically the expression of the maximum likelihood estimators of the mean and variance of z. According to (5.8.19), the likelihood of the N observations is

L_N(µ, σ²) = Π_{i=1}^N p_z(z_i, µ, σ²) = Π_{i=1}^N (1/√(2πσ²)) exp(−(z_i − µ)²/(2σ²))

and the log-likelihood is

l_N(µ, σ²) = log L_N(µ, σ²) = Σ_{i=1}^N log p_z(z_i, µ, σ²) = −(Σ_{i=1}^N (z_i − µ)²)/(2σ²) + N log(1/√(2πσ²))

Note that, for a given σ, maximising the log-likelihood is equivalent to minimising the sum of squares of the differences between the z_i and the mean. Taking the derivatives with respect to µ and σ² and setting them equal to zero, we obtain

µ̂_ml = (Σ_{i=1}^N z_i)/N = µ̂    (5.8.21)

σ̂²_ml = (Σ_{i=1}^N (z_i − µ̂_ml)²)/N ≠ σ̂²    (5.8.22)


Note that the m.l. estimator (5.8.21) of the mean coincides with the sample average (5.3.4), but the m.l. estimator (5.8.22) of the variance differs from the sample variance (5.3.5) in terms of the denominator. In the multivariate Normal case, where z is a vector with [n, 1] mean µ and [n, n] covariance matrix Σ, the maximum likelihood estimators are

µ̂_ml = (Σ_{i=1}^N z_i)/N    (5.8.23)

Σ̂_ml = (Σ_{i=1}^N (z_i − µ̂_ml)(z_i − µ̂_ml)^T)/N    (5.8.24)

where the z_i and µ̂ are [n, 1] vectors.

Exercise

• Let z ∼ U(0, M) follow a uniform distribution and D_N = {z_1, ..., z_N} ∼ F_z. Find the maximum likelihood estimator of M.

• Let z have a Poisson distribution, i.e.

p_z(z, λ) = e^(−λ) λ^z / z!

If D_N = {z_1, ..., z_N} ∼ F_z(z, λ), find the m.l.e. of λ.

In the case of generic distributions F_z, computational difficulties may arise: for example, in some cases no explicit solution might exist for ∂l_N(θ)/∂θ = 0. Iterative numerical methods must be used in this case. The computational cost becomes heavier if we consider a vector of parameters instead of a scalar θ, or when there are several relative maxima of the function l_N. Another complex situation occurs when l_N(θ) is discontinuous, has a discontinuous first derivative, or attains a maximum at an extremal point.

R script

Suppose we know the analytical form of a one-dimensional function f(x): I → R but not the analytical expression of its extreme points. In this case, numerical optimisation methods can be applied. The implementation of some continuous optimisation routines is available in the R statistical tool. Consider for example the function f(x) = (x − 1/3)² and I = [0, 1]. The value of the point x where f takes its minimum can be approximated numerically by this set of R commands:

f <- function(x, a) (x - a)^2                          # function to be minimised
xmin <- optimize(f, c(0, 1), tol = 0.0001, a = 1/3)    # search on the interval [0, 1]
xmin                                                   # location of the minimum and objective value

These routines may be applied to solve the problem of maximum likelihood estimation, which is nothing more than a particular case of optimisation problem. Let D_N be a random sample drawn from z ∼ N(µ, σ²). The negative log-likelihood function of the N observations can be written in R as

eml <- function(m, D, var) {
  # negative log-likelihood of the mean m, for data D and known variance var
  N <- length(D)
  Lik <- 1
  for (i in 1:N)
    Lik <- Lik * dnorm(D[i], m, sqrt(var))
  -log(Lik)
}

and the numerical minimisation of l_N(µ, s²) for a given σ = s in the interval I = [−10, 10] can be written in R as

xmin <- optimize(eml, c(-10, 10), D = DN, var = s)

In order to run the above code and compute the m.l. solution numerically, we invite the reader to run the R script emp_ml.R.
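One practical remark (ours, not in the original script): multiplying N densities quickly underflows to zero for large N, so an equivalent and numerically safer variant sums the log-densities directly:

# same negative log-likelihood, computed on the log scale
eml2 <- function(m, D, var) -sum(dnorm(D, m, sqrt(var), log = TRUE))
xmin <- optimize(eml2, c(-10, 10), D = DN, var = s)   # DN and s as above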

5.8.3 Cramer-Rao lower bound

Assume that θ is a scalar parameter, that the first two derivatives of L_N(θ) with respect to θ exist for all θ, and that certain operations of integration and differentiation may be interchanged. Let θ̂ be an unbiased estimator of θ and l_N(θ) = log_e[L_N(θ)]. Suppose that the regularity condition

E[∂l_N(θ)/∂θ] = 0    (5.8.25)

holds, where the quantity ∂l(θ)/∂θ is called the score. The Cramer-Rao bound is a lower bound on the variance of the estimator θ̂ which states that

Var[θ̂] ≥ 1/E[(∂l_N(θ)/∂θ)²] = 1/(−E[∂²l_N(θ)/∂θ²]) = 1/I_N

where the denominator term I_N is known as the Fisher information. Note that ∂²l_N(θ)/∂θ² is the second derivative of l_N(·) and, as such, it defines the curvature of the log-likelihood function. At the maximum θ̂, the second derivative takes a negative value. Also, the larger its absolute value, the larger the curvature around the function peak, and the lower the uncertainty about the m.l. estimation [145]. An estimator having a variance as low as 1/I_N is called a Minimum Variance Bound (MVB) estimator.

Example

Consider a r.v. z ∼ N(µ, σ²) where σ² is known and the unknown parameter is θ = µ. Let us consider the bound on the variance of the estimator (5.8.21). Since

∂ log p(z, θ)/∂θ = (z − θ)/σ²,    ∂² log p(z, θ)/∂θ² = −1/σ²

it follows that

Var[θ̂] ≥ 1/(N/σ²) = σ²/N

From (5.5.10) it then derives that the m.l. estimator (5.8.21) of the mean µ is of minimum variance.


5.8.4 Properties of m.l. estimators

Under the (strong) assumption that the probabilistic model structure is known, the maximum likelihood technique features the following properties:

• θ̂_ml is asymptotically unbiased but usually biased in small-size samples (e.g. σ̂²_ml in (5.8.22)).

• θ̂_ml is consistent.

• If θ̂_ml is the m.l.e. of θ and γ(·) is a monotone function, then γ(θ̂_ml) is the m.l.e. of γ(θ).

• If γ(·) is a non-monotonic function, then even if θ̂_ml is an unbiased estimator of θ, the m.l.e. γ(θ̂_ml) of γ(θ) is usually biased.

• The variance of θ̂_ml is often difficult to determine. For large-size samples we can use as approximation (−E[∂²l_N/∂θ²])^(−1) or (−∂²l_N/∂θ² |_{θ̂_ml})^(−1).

• θ̂_ml is asymptotically normally distributed, that is, θ̂_ml ∼ N(θ, [I_N(θ)]^(−1)) as N → ∞.

5.9 Interval estimation

Unlike point estimation, which is based on a one-to-one mapping from the space of data to the space of parameters, interval estimation maps D_N to an interval of Θ. A point estimator is a function which, given a dataset D_N generated from F_z(z, θ), returns an estimate of θ. An interval estimator is a transformation which, given a dataset D_N, returns an interval estimate [θ̲, θ̄] of θ. While an estimator is a random variable, an interval estimator is a random interval. Let θ̲ and θ̄ be the random lower and upper bounds, respectively. While an interval either contains a certain value or it does not, a random interval has a certain probability of containing a value.

Suppose that

Prob{θ̲ ≤ θ ≤ θ̄} = 1 − α,    α ∈ [0, 1]    (5.9.26)

Then the random interval [θ̲, θ̄] is called a 100(1 − α)% confidence interval of θ.

If (5.9.26) holds, we expect that, by repeating the sampling of D_N and the construction of the confidence interval many times, our confidence interval will contain the true θ at least 100(1 − α)% of the time. Notice, however, that θ being a fixed unknown value, at each realisation D_N the interval [θ̲, θ̄] either contains the true θ or it does not. Therefore, from a frequentist perspective, it is erroneous to think that 1 − α is the probability of θ belonging to the interval [θ̲, θ̄] computed for a given D_N. In fact, 1 − α is not the probability of the event θ ∈ [θ̲, θ̄] (since θ is fixed) but the probability that the interval estimation procedure returns a (random) interval [θ̲, θ̄] containing θ.

While a point estimator is characterised by bias and variance (Section 5.5), an interval estimator is characterised by its endpoints θ̲ and θ̄ (or its width) and by its confidence α. In Figure 5.3 we used an analogy between point estimation and the game of darts to illustrate the bias/variance notions. In the case of interval estimation, the best analogy is provided by the horseshoes game⁴ (Figure 5.5).

4https://en.wikipedia.org/wiki/Horseshoes


Figure 5.5: Horseshoes game as an analogy of interval estimation

A horseshoe player is like an interval estimator, and her interval estimation corresponds to the tossing of a horseshoe. The horseshoe width corresponds to the interval size, and the probability of encircling the stake corresponds to the confidence α.

5.9.1 Confidence interval of µ

Consider a random sample D_N of a r.v. z ∼ N(µ, σ²) where σ² is known. Suppose we want to estimate µ with the estimator µ̂. From Section 5.7 we have that µ̂ ∼ N(µ, σ²/N) is Gaussian distributed. From (3.4.47) it follows that

(µ̂ − µ)/(σ/√N) ∼ N(0, 1)

and consequently, according to Definition 4.7,

Prob{−z_{α/2} ≤ (µ̂ − µ)/(σ/√N) ≤ z_{α/2}} = 1 − α    (5.9.27)

Prob{µ̂ − z_{α/2} σ/√N ≤ µ ≤ µ̂ + z_{α/2} σ/√N} = 1 − α    (5.9.28)

where z_α is the upper critical point of the standard Gaussian distribution. It follows that θ̲ = µ̂ − z_α σ/√N is a lower 1 − α confidence bound for µ, while θ̄ = µ̂ + z_α σ/√N is an upper 1 − α confidence bound for µ. By varying α we can vary the width and the confidence of the interval.

Example

Let z ∼ N(µ, 0.01) and D_N = {10, 11, 12, 13, 14, 15}. We want to estimate the confidence interval of µ with level α = 0.1. Since N = 6, µ̂ = 12.5 and

Δ = z_{α/2} σ/√N = 1.645 · 0.1/√6 = 0.0672

the 90% confidence interval for the given D_N is

{µ : |µ̂ − µ| ≤ Δ} = {12.5 − 0.0672 ≤ µ ≤ 12.5 + 0.0672}


Figure 5.6: Fraction of times that the confidence interval contains the parameter µ vs. the number of repetitions, for α = 0.1.

R script

The R script confidence.R allows testing formula (5.9.27) by simulation. The user sets µ, σ, N, α and a number of iterations N_iter. The script generates D_N ∼ N(µ, σ²) N_iter times and computes µ̂. The script returns the percentage of times that

µ̂ − z_{α/2} σ/√N < µ < µ̂ + z_{α/2} σ/√N

This percentage versus the number of iterations is plotted in Figure 5.6. We can easily check that this percentage converges to 100(1 − α)% as N_iter → ∞.
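A minimal coverage simulation in the spirit of confidence.R (our own sketch, not necessarily the script's actual code) is:

mu <- 0; sigma <- 1; N <- 10; alpha <- 0.1; Niter <- 10000
z <- qnorm(1 - alpha/2)
covered <- replicate(Niter, {
  mu.hat <- mean(rnorm(N, mu, sigma))
  (mu.hat - z * sigma/sqrt(N) < mu) & (mu < mu.hat + z * sigma/sqrt(N))
})
mean(covered)   # close to 1 - alpha = 0.9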

Consider now the interval of confidence of µ when the variance σ² is not known. Let µ̂ and σ̂² be the estimators of µ and σ² computed on the basis of the i.i.d. dataset D_N. From Section 5.7, it follows that

(µ̂ − µ)/√(σ̂²/N) ∼ T_{N−1}

Analogously to (5.9.28), we have

Prob{µ̂ − t_{α/2} σ̂/√N ≤ µ ≤ µ̂ + t_{α/2} σ̂/√N} = 1 − α    (5.9.29)

where t_α is the upper critical point of the Student distribution.

Example

Let z ∼ N(µ, σ²), with σ² unknown, and D_N = {10, 11, 12, 13, 14, 15}. We want to estimate the confidence region of µ with level α = 0.1. We have µ̂ = 12.5 and σ̂² = 3.5. According to (5.9.29),

Δ = t_{α/2,N−1} σ̂/√N = 2.015 · 1.87/√6 = 1.53

The (1 − α) confidence interval of µ is

µ̂ − Δ < µ < µ̂ + Δ


Example

We want to estimate θ, the proportion of people who support the politics of Mr. Berlusconi amongst a very large population. We want to define how many interviews are necessary to have a confidence interval of 6% width with a significance of 5%. We interview N persons and estimate θ as

θ̂ = (x_1 + ··· + x_N)/N = S/N

where x_i = 1 if the i-th person supports Berlusconi and x_i = 0 otherwise. Note that S is a binomial variable. We have

E[θ̂] = θ,    Var[θ̂] = Var[S/N] = Nθ(1 − θ)/N² = θ(1 − θ)/N ≤ 1/(4N)

If we approximate the distribution of θ̂ by N(θ, θ(1 − θ)/N), it follows that (θ̂ − θ)/√(θ(1 − θ)/N) ∼ N(0, 1). The following relation holds:

Prob{θ̂ − 0.03 ≤ θ ≤ θ̂ + 0.03}
= Prob{−0.03/√(θ(1 − θ)/N) ≤ (θ̂ − θ)/√(θ(1 − θ)/N) ≤ 0.03/√(θ(1 − θ)/N)}
= Φ(0.03/√(θ(1 − θ)/N)) − Φ(−0.03/√(θ(1 − θ)/N))
≥ Φ(0.03 √(4N)) − Φ(−0.03 √(4N))

In order for this probability to be at least 0.95, we need 0.03 √(4N) ≥ 1.96, or equivalently N ≥ 1068.
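The same sample size can be obtained in R (qnorm(0.975) returns the 1.96 critical point):

ceiling((qnorm(0.975) / (2 * 0.03))^2)   # 1068 interviews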

5.10 Combination of two estimators

Consider two unbiased estimators θ̂_1 and θ̂_2 of the same parameter θ:

E[θ̂_1] = θ,    E[θ̂_2] = θ

having equal and non-zero variance

Var[θ̂_1] = Var[θ̂_2] = v

and being uncorrelated, i.e. Cov[θ̂_1, θ̂_2] = 0. Let θ̂_cm be the combined estimator

θ̂_cm = (θ̂_1 + θ̂_2)/2

This estimator has the nice properties of being unbiased,

E[θ̂_cm] = (E[θ̂_1] + E[θ̂_2])/2 = θ    (5.10.30)

and of having a smaller variance than the original estimators:

Var[θ̂_cm] = (1/4) Var[θ̂_1 + θ̂_2] = (Var[θ̂_1] + Var[θ̂_2])/4 = v/2    (5.10.31)

This trivial computation shows that the simple average of two unbiased estimators with non-zero variance returns a combined estimator with reduced variance.


5.10.1 Combination of m estimators

Here, we report the general formula for the linear combination of a number m of estimators [179, 181]. Assume we want to estimate the unknown parameter θ by combining a set of m estimators {θ̂_j}, j = 1, ..., m. Let

E[θ̂_j] = µ_j,    Var[θ̂_j] = v_j,    Bias[θ̂_j] = b_j

be the expected values, the variances and the biases of the m estimators, respectively. We are interested in estimating θ by forming a linear combination

θ̂_cm = Σ_{j=1}^m w_j θ̂_j = w^T θ̂    (5.10.32)

where θ̂ = [θ̂_1, ..., θ̂_m]^T is the vector of estimators and w = [w_1, ..., w_m]^T is the weighting vector.

The mean-squared error of the combined system is

MSE = E[(θ̂_cm − θ)²] = E[(w^T θ̂ − E[w^T θ̂])²] + (E[w^T θ̂] − θ)²
    = E[(w^T (θ̂ − E[θ̂]))²] + (w^T µ − θ)²
    = w^T Ω w + (w^T µ − θ)²

where Ω is the [m × m] covariance matrix whose ij-th term is

Ω_ij = E[(θ̂_i − µ_i)(θ̂_j − µ_j)]

and µ = (µ_1, ..., µ_m)^T is the vector of expected values. Note that the MSE has a variance term (dependent on the covariance of the single estimators) and a bias term (dependent on the bias of the single estimators).

5.10.1.1 Linear constrained combination

A commonly used constraint is

Σ_{j=1}^m w_j = 1,    w_j ≥ 0,    j = 1, ..., m    (5.10.33)

This means that the combined estimator is unbiased if the individual estimators are unbiased. Let us write w as

w = (u^T g)^(−1) g

where u = (1, ..., 1)^T is an m-dimensional vector of ones, g = (g_1, ..., g_m)^T and g_j > 0, j = 1, ..., m. The constraint can be enforced in minimising the MSE by using the Lagrangian function

L = w^T Ω w + (w^T µ − θ)² + λ(w^T u − 1)

with λ the Lagrange multiplier. The optimum is achieved if we set

g* = [Ω + (µ − θu)(µ − θu)^T]^(−1) u

With unbiased estimators (µ = θu) we obtain g* = Ω^(−1) u, and with uncorrelated estimators

g*_j = 1/v_j,    j = 1, ..., m    (5.10.34)

This means that the optimal term g*_j of each estimator is inversely proportional to its own variance.
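As a small numerical illustration of (5.10.34) (the estimates and variances below are hypothetical values chosen by us):

v <- c(1, 4, 16)                 # (assumed known) variances of three unbiased estimators
w <- (1/v) / sum(1/v)            # weights of (5.10.34), normalised so that sum(w) = 1
theta.hat <- c(10.2, 9.5, 11.1)  # hypothetical estimates of the same theta
sum(w * theta.hat)               # combined estimate
1 / sum(1/v)                     # its variance, about 0.76: lower than min(v) = 1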


5.11 Testing hypothesis

Hypothesis testing is, together with estimation, a major area of statistical inference. A statistical hypothesis is an assertion or conjecture about the distribution of one or more random variables. A test of a statistical hypothesis is a rule or procedure for deciding whether to reject the assertion on the basis of the observed data. The basic idea is to formulate some statistical hypothesis and look to see whether the data provide any evidence to reject it. Examples of hypothesis tests follow:

• Consider the model of the traffic in the boulevard. Suppose that the measures of the inter-arrival times are D_N = {10, 11, 1, 21, 2, ...} seconds. Can we say that the mean inter-arrival time θ is different from 10?

• We want to know the effect of a drug on rats' survival to cancer. We randomly divide some rats into two groups and we administer the drug to only one of them. Is the survival rate of the two groups the same?

• Consider the grades of two different school sections. Section A had {15, 10, 12, 19, 5, 7}; Section B had {14, 11, 11, 12, 6, 7}. Can we say that Section A had better grades than Section B?

• Consider two protein-coding genes and their expression levels in a cell. Are the two genes differentially expressed?

A statistical test is a procedure that aims to answer such questions.

5.11.1 Types of hypothesis

We start by declaring the working (basic, null) hypothesis H to be tested, in the form θ = θ_0 or θ ∈ ω ⊂ Θ, where θ_0 or ω are given. The hypothesis can be

• simple: this means that it fully specifies the distribution of the r.v. z;

• composite: this means that it partially specifies the distribution of z.

For example, if D_N is a random sample of size N drawn from N(µ, σ²), the hypothesis H: µ = µ_0, σ = σ_0 (with µ_0 and σ_0 known values) is simple, while the hypothesis H: µ = µ_0 is composite since it leaves open the value of σ in (0, ∞).

5.11.2 Types of statistical test

Suppose we have sampled a dataset D_N = {z_1, ..., z_N} from a distribution F_z and we have declared a null hypothesis H about F. The three most common types of statistical test are:

Pure significance test: the data D_N are used to assess the inferential evidence against H.

Significance test: the inferential evidence against H is used to judge whether H is inappropriate. This test returns a decision rule for rejecting or not rejecting H.

Hypothesis test: the data D_N are used to assess the hypothesis H against a specific alternative hypothesis H̄. This test returns a rule for rejecting H in favour of H̄.

The three tests will be discussed in the following sections.


5.11.3 Pure significance test

Consider a simple null hypothesis H. Let t(D_N) be a statistic (i.e. a function of the dataset) such that the larger its value, the more it casts doubt on H. The quantity t(D_N) is called the test statistic or discrepancy measure. Suppose that the distribution of t(D_N) under H is known. This is possible since the function t(·) is fixed by the user and the simple hypothesis H entirely specifies the distribution of z and consequently the distribution of t(D_N). Let t_N = t(D_N) be the observed value of t calculated on the basis of the sample data D_N. Let us define the p-value as the quantity

p = Prob{t(D_N) > t_N | H}    (5.11.35)

i.e. the probability of observing a statistic greater than t_N if the hypothesis H were true. Note that in expression (5.11.35) the term t(D_N) is a random variable having a known distribution, while t_N is a value computed on the basis of the observed dataset.

If the quantity p is small, then the sample data D_N are highly inconsistent with H, and p (significance probability or significance level) is the measure of such inconsistency. If p is small, then either a rare event has occurred or perhaps H is not true. In other terms, if H were true, the quantity p would be the proportion of situations where we would observe a degree of inconsistency at least equal to the extent represented by t_N. The smaller the p-value, the stronger the evidence against H⁵. Note that p depends on D_N: different D_N would yield different values of t_N and consequently different values of p ∈ [0, 1]. Moreover, it can be shown that, if the null hypothesis is true, the p-value has a uniform U[0, 1] distribution. Also, from a frequentist perspective, we cannot say that p is the probability that H is true, but rather that p is the probability of observing the dataset D_N given that H is true.

5.11.4 Tests of significance

The test of significance proposes the following decision rule: if p is less than some stated value α, we reject H. Once a critical level α is chosen and the dataset D_N is observed, the rule rejects H at level α if

Prob{t(D_N) > t_α | H} = α    (5.11.36)

This is equivalent to choosing some critical value t_α and rejecting H if t_N > t_α.

This implies the existence of two regions in the space of sample data:

critical region: this is the set of values of D_N

S_0 = {D_N : t(D_N) > t_α}

such that if D_N ∈ S_0, we reject the null hypothesis H.

non-critical region: this is the set of values of D_N such that there is no reason to reject H on the basis of the level-α test.

The principle is that we will accept H unless what we observed has too small a probability of happening when H is true. The upper bound of this probability is α, i.e. the significance level α is the highest p-value for which we reject H. Note that the p-value changes with the observed data (i.e. it is a random variable) while α is a level fixed by the user.

⁵It is common habit in life-science research to consider a p-value smaller than 0.05 (0.01) as (very) strong evidence against H.


Example

Let D_N consist of N independent observations of x ∼ N(µ, σ²), with known variance σ². We want to test the hypothesis H: µ = µ_0 with µ_0 known. Consider as test statistic the quantity t(D_N) = |µ̂ − µ_0|, where µ̂ is the sample average estimator. If H is true, we know from Section 5.4 that µ̂ ∼ N(µ_0, σ²/N). Let us calculate the value t(D_N) = |µ̂ − µ_0| and fix a significance level α = 10%. This means that the decision rule needs the definition of the value t_α such that

Prob{t(D_N) > t_α | H} = Prob{|µ̂ − µ_0| > t_α | H} = Prob{(µ̂ − µ_0 > t_α) ∪ (µ̂ − µ_0 < −t_α) | H} = 0.1

For a Normal variable z ∼ N(µ, σ²) we have that

Prob{|z − µ| > 1.645σ} = Prob{|z − µ|/σ > 1.645} = 2 · 0.05

It follows that, µ̂ being distributed as N(µ_0, σ²/N),

Prob{|µ̂ − µ_0| > 1.645 σ/√N} = 0.05 + 0.05 = 0.1

and consequently

t_α = 1.645 σ/√N    (5.11.37)

The critical region is

S_0 = {D_N : |µ̂ − µ_0| > 1.645 σ/√N}

Example

Suppose that σ = 0.1 and that we want to test whether µ = µ_0 = 10 with a significance level of 10%. Let N = 6 and D_N = {10, 11, 12, 13, 14, 15}. From the dataset we compute

µ̂ = (10 + 11 + 12 + 13 + 14 + 15)/6 = 12.5

and

t(D_N) = |µ̂ − µ_0| = 2.5

Since, according to (5.11.37), t_α = 1.645 · 0.1/√6 = 0.0672 and t(D_N) > t_α, the observations D_N are in the critical region. The conclusion is: the hypothesis H: µ = 10 is rejected, and the probability that we are making an error by rejecting H is smaller than 0.1.

5.11.5 Hypothesis testing

So far we have dealt with single hypothesis tests. Let us now consider two mutually exclusive hypotheses: H and H̄. Suppose we have a dataset {z_1, ..., z_N} ∼ F drawn from a distribution F. On the basis of this dataset, one hypothesis will be accepted and the other one rejected. In this case, given the stochastic setting, two types of errors are possible.

Type I error. This is the kind of error we make when we reject H but H is true. For a given critical level t_α the probability of making this error is

Prob{t(D_N) > t_α | H} = α    (5.11.38)


Type II error. This is the kind of error we make when we accept H and H is false. In order to define this error, we are forced to declare an alternative hypothesis H̄ as a formal definition of what is meant by H being false. The probability of a type II error is

Prob{t(D_N) ≤ t_α | H̄}    (5.11.39)

that is, the probability that the test leads to the acceptance of H when in fact H̄ holds.

Note that

• when the alternative hypothesis is composite, there could be no unique Type II error;

• although H and H̄ are complementary events, the quantity (5.11.39) cannot be derived from (5.11.38) (see Equation (3.1.23)).

Example

In order to better illustrate these notions, let us consider the analogy with a murder trial, where the suspect is Mr. Bean. The null hypothesis H is "Mr. Bean is innocent". The dataset is the amount of evidence collected by the police against Mr. Bean. The Type I error is the error that we make if, Mr. Bean being innocent, we sentence him to death. The Type II error is the error that we make if, Mr. Bean being guilty, we acquit him. Note that the two hypotheses have a different philosophical status (asymmetry): H is a conservative hypothesis, not to be rejected unless the evidence against Mr. Bean's innocence is clear. This means that a type I error is more serious than a type II error (benefit of the doubt).

Example

Let us consider a professor who has to decide, on the basis of empirical evidence, whether a student copied or not during a class test. The null hypothesis H is that the student is honest. The alternative hypothesis H̄ is that the student cheated. Let the empirical evidence t_N be represented by the number of lines of the classwork that a student shares with at least one of her classmates. The decision rule of the professor is the following: a student passes (i.e. the null hypothesis that she is honest is accepted) if there is not enough empirical evidence against her (e.g. if t_N ≤ t_α = 2); otherwise she fails (i.e. the alternative hypothesis is chosen). Will the professor make any errors? Why? And what does this depend on?

5.11.6 The hypothesis testing procedure

In general terms, a hypothesis testing procedure can be decomposed into the following steps:

1. Declare the null and the alternative hypotheses.

2. Choose the numeric value α of the type I error (e.g. the risk I want to run when I reject the null hypothesis).

3. Define a test statistic.

4. Determine the critical value t_α of the test statistic that leads to a rejection of H according to the Type I error defined in Step 2.


5. Among the set of tests of level α, choose the test that minimises the probability of a type II error.

6. Obtain the data and determine whether the observed value of the test statistic leads to the acceptance or the rejection of H.

Note that a number of tests, having different type II errors, can guarantee the same type I error. An appropriate choice of test as a function of the type II error is therefore required and will be discussed in the following section.

5.11.7 Choice of test

The choice of test, and consequently the choice of the partition {S_0, S_1}, is based on two steps:

1. Define a significance level α, that is, the probability of a type I error (or the probability of incorrectly rejecting H):

Prob{reject H | H} = Prob{D_N ∈ S_0 | H} = α

2. Among the set of tests {S_0, S_1} of level α, choose the test that minimises the probability of a type II error

Prob{accept H | H̄} = Prob{D_N ∈ S_1 | H̄}

that is, the probability of incorrectly accepting H. This is equivalent to maximising the power of the test

Prob{reject H | H̄} = Prob{D_N ∈ S_0 | H̄} = 1 − Prob{D_N ∈ S_1 | H̄}

which is the probability of correctly rejecting H. Note that for a given significance level, the higher the power, the better!

Example

In order to reason about the Type II error, let us consider an r.v. z ∼ N(µ, σ²), where σ is known, and a set of N i.i.d. observations. We want to test the null hypothesis µ = µ_0 = 0 with α = 0.1. Consider three different tests and the associated critical regions S_0:

1. |µ̂ − µ_0| > 1.645 σ/√N

2. µ̂ − µ_0 > 1.282 σ/√N (Figure 5.7)

3. |µ̂ − µ_0| < 0.126 σ/√N (Figure 5.8)

Assume that the area blackened in Figure 5.7 equals the area blackened in Figure 5.8. For all these tests Prob{D_N ∈ S_0 | H} ≤ α, hence the significance level (i.e. Type I error) is the same. However, if H̄: µ_1 = 10, the type II error of the three tests is significantly different. Which test is the best one, that is, the one which guarantees the lowest Type II error?


Figure 5.7: On the left: distribution of the test statistic µ̂ if H: µ_0 = 0 is true. On the right: distribution of the test statistic µ̂ if H̄: µ_1 = 10 is true. The interval marked by S_1 denotes the set of observed µ̂ values for which H is accepted (non-critical region). The interval marked by S_0 denotes the set of observed µ̂ values for which H is rejected (critical region). The area of the black pattern region on the right equals Prob{D_N ∈ S_0 | H}, i.e. the probability of rejecting H when H is true (Type I error). The area of the grey shaded region on the left equals the probability of accepting H when H is false (Type II error).

Figure 5.8: On the left: distribution of the test statistic µ̂ if H: µ_0 = 0 is true. On the right: distribution of the test statistic µ̂ if H̄: µ_1 = 10 is true. The two intervals marked by S_1 denote the set of observed µ̂ values for which H is accepted (non-critical region). The interval marked by S_0 denotes the set of observed µ̂ values for which H is rejected (critical region). The area of the pattern region equals Prob{D_N ∈ S_0 | H}, i.e. the probability of rejecting H when H is true (Type I error). Which area corresponds to the probability of the Type II error?


5.11.8 UMP level-α test

Given a significance level α, we denote by uniformly most powerful (UMP) test the test

1. which satisfies

Prob{reject H | H} = Prob{D_N ∈ S_0 | H} = α

2. for which

Prob{reject H | H̄} = Prob{D_N ∈ S_0 | H̄}

is maximised simultaneously for all θ ∈ Θ_H̄.

How is it possible to find UMP tests? In a simple case, an answer is given by

the Neyman-Pearson lemma.

5.11.9 Likelihood ratio test

Consider the simplest case Θ = {θ_0, θ_1}, where H: θ = θ_0 and H̄: θ = θ_1, and θ_0, θ_1 are two different values of the parameter of a r.v. z. Let us denote the two likelihoods by L(θ_0) and L(θ_1), respectively. The idea of Neyman and Pearson was to base the acceptance/rejection of H on the relative values of L(θ_0) and L(θ_1). In other terms, we reject H if the likelihood ratio

L(θ_1)/L(θ_0)

is sufficiently large. We reject H only if the sample data D_N are sufficiently more probable when θ = θ_1 than when θ = θ_0.

Lemma 2 (Neyman-Pearson lemma). Let H: θ = θ_0 and H̄: θ = θ_1. If a partition {S_0, S_1} of the sample space D is defined by

S_0 = {D_N : L(θ_1) > kL(θ_0)},    S_1 = {D_N : L(θ_1) < kL(θ_0)}

with ∫_{S_0} p(D_N, θ_0) dD_N = α, then {S_0, S_1} is the most powerful level-α test of H against H̄.

This lemma demonstrates that, among all tests of level α, the likelihood ratio test is the optimal procedure, i.e. it has the smallest probability of type II error. Although, for a generic distribution, the definition of an optimal test is very difficult, all the tests that will be described in the following are optimal in the UMP sense.

5.12 Parametric tests

Suppose we want to test an assertion about a random variable with a known parametric distribution F(·, θ). Besides the distinction between simple and composite tests presented in Section 5.11.1, there are two more ways of classifying hypothesis tests:

One-sample vs. two-sample: one-sample tests concern a hypothesis about the properties of a single r.v. z ∼ N(µ, σ²), while two-sample tests concern the relationship between two r.v.s z_1 ∼ N(µ_1, σ_1²) and z_2 ∼ N(µ_2, σ_2²).


Single-sided (one-tailed) vs. two-sided (two-tailed): in single-sided tests the region of rejection concerns only one tail of the distribution of the null hypothesis; this means that H̄ indicates the predicted direction of the difference. In two-sided tests the region of rejection concerns both tails of the null distribution; this means that H̄ does not indicate the predicted direction of the difference.

The most common parametric tests rely on the hypothesis of normality. A non-exhaustive list of conventional parametric tests is given in the following table:

Name      single/two sample   known           H              H̄
z-test    single              σ²              µ = µ_0        µ ≠ µ_0
z-test    two                 σ_1² = σ_2²     µ_1 = µ_2      µ_1 ≠ µ_2
t-test    single                              µ = µ_0        µ ≠ µ_0
t-test    two                                 µ_1 = µ_2      µ_1 ≠ µ_2
χ²-test   single              µ               σ² = σ_0²      σ² ≠ σ_0²
χ²-test   single                              σ² = σ_0²      σ² ≠ σ_0²
F-test    two                                 σ_1² = σ_2²    σ_1² ≠ σ_2²

The columns H and H̄ contain the parameter taken into consideration by the test.

All the parametric test procedures can be decomposed into five main steps:

1. Define the null hypothesis and the alternative one.

2. Fix the probability α of having a Type I error.

3. Choose a test statistic t(D_N).

4. Define the critical value t_α that satisfies the Type I error constraint.

5. Collect the dataset D_N, compute t(D_N) and decide whether the hypothesis is accepted or rejected.

Note that the first four steps are independent of the data and should be carried out before the collection of the dataset. A more detailed description of some of these tests is contained in the following sections and in Appendix C.3.

5.12.1 z-test (single and one-sided)

Consider a random sample D_N of x ∼ N(µ, σ²) with µ unknown and σ² known. Let us see in detail how the five steps of the testing procedure are instantiated in this case.

STEP 1: consider the null hypothesis and the alternative (composite and one-sided)

H: µ = µ_0;    H̄: µ > µ_0

STEP 2: fix the value α of the type I error.

STEP 3: if H is true, then the distribution of µ̂ is N(µ_0, σ²/N). This means that the test statistic t(D_N) is

t_N = t(D_N) = (µ̂ − µ_0)√N/σ ∼ N(0, 1)

STEP 4: determine the critical value t_α. We reject the hypothesis H if t_N > t_α = z_α, where z_α is such that Prob{N(0, 1) > z_α} = α.


Example: for α = 0.05 we would take z_α = 1.645, since 5% of the standard normal distribution lies to the right of 1.645. Note that the value z_α for a given α can be obtained by the R command qnorm(alpha, lower.tail=FALSE).

STEP 5: once the dataset D_N is measured, the value of the test statistic is

t_N = (µ̂ − µ_0)√N/σ

and the hypothesis is either accepted (t_N ≤ z_α) or rejected.

Example z-test

Consider a r.v. z ∼ N(µ, 1). We want to test H: µ = 5 against H̄: µ > 5 with significance level 0.05. Suppose that the dataset is D_N = {5.1, 5.5, 4.9, 5.3}. Then µ̂ = 5.2 and t_N = (5.2 − 5)√4/1 = 0.4. Since this is less than 1.645, we do not reject the null hypothesis.
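In R, the whole example reduces to a few lines:

DN <- c(5.1, 5.5, 4.9, 5.3)
mu0 <- 5; sigma <- 1; alpha <- 0.05
tN <- (mean(DN) - mu0) * sqrt(length(DN)) / sigma   # 0.4
tN > qnorm(alpha, lower.tail = FALSE)               # FALSE: H is not rejected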

5.12.2 t-test: single sample and two-sided

Consider a random sample from N(µ, σ²) with σ² unknown. Let

H: µ = µ_0;    H̄: µ ≠ µ_0

and let

t(D_N) = t_N = √N(µ̂ − µ_0) / √((1/(N−1)) Σ_{i=1}^N (z_i − µ̂)²) = (µ̂ − µ_0)/√(σ̂²/N)

be a statistic computed using the dataset D_N. If the hypothesis H holds, from Sections C.2.3 and 5.7 it follows that t(D_N) ∼ T_{N−1} is a r.v. with a Student distribution with N − 1 degrees of freedom. The size-α t-test consists in rejecting H if

|t_N| > k = t_{α/2,N−1}

where t_{α/2,N−1} is the upper α/2 point of a T-distribution on N − 1 degrees of freedom, i.e.

Prob{t_{N−1} > t_{α/2,N−1}} = α/2,    Prob{|t_{N−1}| > t_{α/2,N−1}} = α

where t_{N−1} ∼ T_{N−1}. In other terms, H is rejected when |t_N| is too large. Note that the value t_{α/2,N−1} for a given N and α can be obtained by the R command qt(alpha/2, N-1, lower.tail=FALSE).

Example [65]

Suppose we want an answer to the following question: does jogging lead to a reduction in pulse rate? Let us engage eight non-jogging volunteers in a one-month jogging programme and let us take their pulses before and after the programme:

pulse rate before   74  86  98  102  78  84  79  70
pulse rate after    70  85  90  110  71  80  69  74
decrease             4   1   8   -8   7   4  10  -4

Let us assume that the decreases are randomly sampled from N(µ, σ²), where σ² is unknown. We want to test H: µ = µ_0 = 0 against H̄: µ ≠ 0 with significance α = 0.05. We have N = 8, µ̂ = 2.75, t_N = 1.263 and t_{α/2,N−1} = 2.365. Since |t_N| ≤ t_{α/2,N−1}, the data is not sufficient to reject the hypothesis H. In other terms, the experiment does not provide enough evidence that jogging leads to a reduction in pulse rate.
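The same conclusion can be obtained with R's built-in t.test() function:

decrease <- c(4, 1, 8, -8, 7, 4, 10, -4)
t.test(decrease, mu = 0)   # t = 1.263, df = 7, p-value about 0.25 > 0.05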


Figure 5.9: On the left: distribution of the test statistic (number of identical lines) if H is true, i.e. the student is honest. Typically, honest students have very few lines in common with others, though it could happen by chance that this number is more than 2. On the right: distribution of the test statistic (number of identical lines) if H̄ is true, i.e. the student is dishonest. Typically, dishonest students have several lines in common with others, though some of them are cunning enough to conceal it.

So far we have assumed that the distribution of the test statistic is known under the null hypothesis. In this case it is possible to fix the Type I error a priori. But what if we do not know anything about the distribution? Is it possible to assess a posteriori the quality (in terms of errors of Type I or II) of a certain test (e.g. one using a certain threshold)?

5.13 A posteriori assessment of a test

Let us consider the professor example (page 122) and the hypothesis test strategy which leads to the failure of a student when t_N > t_α = 2. In this case the distributions of the t_N statistic for an honest student (or a dishonest one) have no known parametric form (Figure 5.9). Moreover, the professor has no information about such distributions and, consequently, he has no way to measure or control the Type I error rate (i.e. the grey area in Figure 5.9). Nevertheless, it is possible to estimate a posteriori the Type I and Type II error rates if we have access to the decisions of the professor and the real nature of each student (honest or dishonest).

Suppose that N students took part in the exam and that N_N did not copy while N_P did. According to the professor's decision rule, N̂_N were considered honest and passed the exam, while N̂_P were considered dishonest and rejected. Because of the overlapping of the distributions in Figure 5.9, it happens that FP > 0 honest students (the ones in the grey area) failed and FN > 0 dishonest students (the ones in the blue area) passed. Note that the honest students who failed did indeed not copy, but they had by chance more than one line in common with a classmate. At the same time, there are dishonest students who succeeded by copying but who were clever enough to avoid more than 2 identical lines.

The resulting situation can be summarised in Table 5.2 and Table 5.3, where we associate the null hypothesis H with the minus sign (non-guilty or honest) and the hypothesis H̄ with the plus sign. In Table 5.2, FP denotes the number of False Positives, i.e. the number of times that the professor considered the student as guilty


                               Passed          Failed
H: honest student (−)          TN              FP              N_N = TN + FP
H̄: guilty student (+)          FN              TP              N_P = FN + TP
                               N̂_N = TN + FN   N̂_P = FP + TP   N

Table 5.2: Reality vs. decision: given N students (N_N honest and N_P dishonest ones), the table reports the breakdown of the N professor decisions (N̂_N passes and N̂_P rejections).

                                 H accepted   H rejected
H: null hypothesis (−)           1 − α        α
H̄: alternative hypothesis (+)    β            1 − β

Table 5.3: Reality vs. decision: the table reports the probability of correct and wrong decisions in a hypothesis test. In particular, α denotes the type I error while 1 − β denotes the test power.

In Table 5.2, FP denotes the number of False Positives, i.e. the number of times that the professor considered the student guilty (+) while in reality she was innocent (−). The ratio FP/N_N is an estimate of the Type I error (the probability of rejecting the null hypothesis when it is true), which is denoted by α in Table 5.3. The term FN denotes the number of False Negatives, i.e. the number of times that the professor considered a student honest (−) although he copied (+). The ratio FN/N_P is an estimate of the Type II error (the probability of accepting the null hypothesis when it is false), which is denoted by β in Table 5.3.

Note that the Type I and Type II errors are related. For instance, the professor could decide that he does not want to unfairly fail even a single student by setting t_N to infinity. In this case, all honest students, like the dishonest ones, would succeed: we would have a null Type I error (α = 0) at the cost of the highest Type II error (β = 1, since FN = N_P).

5.14 Conclusion

The reader wishing to know more about machine learning could be disappointed.

She has been reading more than one hundred pages and has still the sensation

that she did not learn much about machine learning. All she read seems very far

from intelligent agents, neural networks and fancy applications... Nevertheless, she

already came across the most important notions of machine learning: conditional

probability, estimation and bias/variance trade-off. Is it all about that? From an

abstract perspective, yes. All the fancy algorithms that will be presented afterwards

(or that the reader is used to hear about) are nothing more (often without the

designer's knowledge) estimators of conditional probability, and as such, submitted

to a bias/variance tradeoff. Such algorithms are accurate and useful only if they

manage well such trade-off.

But we can go a step further and see the bias/variance trade-off not only as a statistical concept but as a metaphor of the human attitude towards models and data, beliefs and experience, ideology and observations, preconceptions and events^6. Humans define models (not only in science but also in politics, economics and religion) to represent the regularity of nature. Now, reality often escapes or diverges from such regularity. Faced with the gap between the Eden of regularity and the natural Hell of observations, humans waver between two extremes: i) negate or discredit reality and reduce all divergences to some sort of noise (measurement error), or ii) adapt, i.e. change their beliefs to incorporate discordant data and measures into their model (or preconceptions).

^6 https://tinyurl.com/y25l4xyp

The first attitude is exposed to bias (or dogmatism, or worse, conspiracy thinking), the second to variance (or instability). A biased human learner behaves as an estimator which is insensitive to data: her strength derives from intrinsic robustness and coherence, and her weakness from the (in)sane attitude of disregarding data and flagrant evidence. On the other hand, a highly variant human learner adapts rapidly and swiftly to data and observations, but she can easily be criticised for her excessive instability, in simple words for going where the wind blows.

When the evidence does not confirm your expectations (or what your parents,

teachers or media told you), what is the best attitude to take? Is there an optimal

attitude? Which side are you on?

5.15 Exercises

1. Derive analytically the bias of the sample average estimator in a non i.i.d. setting.

2. Derive analytically the variance of the sample average estimator in an i.i.d. setting.

3. Consider a regression problem where

   y = sin(x) + w

   where x is uniformly distributed over the interval [0, 2π] and w ∼ N(1, 1) is a Normal variable with both mean and variance equal to 1. Let us consider a predictor h(x) that is distributed like w. Compute the bias and the variance of the predictor at the following coordinates: x = 0, x = π, x = π/2.

   Solution:
   • x = 0: Bias = 0, Var = 1
   • x = π: Bias = 0, Var = 1
   • x = π/2: Bias = −1, Var = 1

4. Let us consider a dataset D_N = {z_1, ..., z_20} of 20 observations generated according to a uniform distribution over the interval [−1, 1]. Suppose I want to estimate the expected value of the distribution. Compute the bias and the variance of the following estimators:

   θ̂_1 = (∑_{i=1}^{10} z_i)/10
   θ̂_2 = µ̂ = (∑_{i=1}^{20} z_i)/20
   θ̂_3 = 1
   θ̂_4 = −1
   θ̂_5 = z_2

   Suppose I want to estimate the variance of the distribution. Compute the bias of the following estimators:

   σ̂²_1 = ∑_i (z_i − µ̂)² / 19
   σ̂²_2 = ∑_i (z_i − µ̂)² / 20
   σ̂²_3 = 1/3


   Solution: Note that θ = 0 and σ²_z = 1/3.

   θ̂_1: B_1 = 0, V_1 = 1/30 ≈ 0.033. Justification: E[θ̂_1] = θ and Var[θ̂_1] = σ²_z/10.
   θ̂_2: B_2 = 0, V_2 = 1/60 ≈ 0.017. Justification: E[θ̂_2] = θ and Var[θ̂_2] = σ²_z/20.
   θ̂_3: B_3 = 1, V_3 = 0. Justification: E[θ̂_3] = 1 and Var[θ̂_3] = 0 since the estimator is constant.
   θ̂_4: B_4 = −1, V_4 = 0. Justification: E[θ̂_4] = −1 and Var[θ̂_4] = 0 since the estimator is constant.
   θ̂_5: B_5 = 0, V_5 = 1/3 ≈ 0.33. Justification: E[θ̂_5] = θ and Var[θ̂_5] = σ²_z.
   σ̂²_1: B = 0. Justification: the sample variance is unbiased, hence E[σ̂²_1] = σ²_z.
   σ̂²_2: B = −1/60 ≈ −0.017. Justification: note first that σ̂²_2 = (19/20) ∑_i (z_i − µ̂)²/19, so that

      E[σ̂²_2] = (19/20) E[∑_i (z_i − µ̂)²/19] = (19/20) σ²_z

   and hence

      E[σ̂²_2] − σ²_z = (19/20) σ²_z − σ²_z = −σ²_z/20 = −1/60.

   σ̂²_3: B = 0. Justification: E[1/3] = 1/3 = σ²_z.

5. Let us consider the following observations of the random variable z:

   D_N = {−0.1, 1, −0.3, −1.4}

   Write the analytical form of the likelihood function of the mean µ for a Gaussian distribution with variance σ² = 1. The student should:

   1. Trace the log-likelihood function on graph paper.
   2. Determine graphically the maximum likelihood estimator.
   3. Discuss the result.

   Solution: Since N = 4 and σ = 1,

   L(µ) = ∏_{i=1}^N p(z_i, µ) = ∏_{i=1}^N (1/√(2π)) exp(−(z_i − µ)²/2)

   [Figure: plot of the log-likelihood as a function of mu (x-axis: mu, y-axis: log likelihood).]

   Note that µ̂_ml coincides with the sample average µ̂ = −0.2 of D_N.
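   A short R sketch of the graphical solution (the grid bounds are an arbitrary choice):

   ## Log-likelihood of mu for the four observations (sigma = 1)
   DN <- c(-0.1, 1, -0.3, -1.4)
   loglik <- function(mu) sum(dnorm(DN, mean = mu, sd = 1, log = TRUE))
   mu.grid <- seq(-2, 2, by = 0.01)
   LL <- sapply(mu.grid, loglik)
   plot(mu.grid, LL, type = "l", xlab = "mu", ylab = "log likelihood")
   mu.grid[which.max(LL)]   ## maximum at about -0.2, the sample average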

6. Suppose you want to estimate the expectation µ of the uniform r.v. z ∼ U[−2, 3] by using a dataset of size N = 10. By using R and its random generator, first plot the sampling distribution, then estimate the bias and the variance of the following estimators:


   1. θ̂ = (∑_{i=1}^N z_i)/N
   2. θ̂ = min_{i=1,...,N} z_i
   3. θ̂ = max_{i=1,...,N} z_i
   4. θ̂ = z_1
   5. θ̂ = z_N
   6. θ̂ = (∑_{i=1}^N |z_i|)/N
   7. θ̂ = median_i z_i
   8. θ̂ = max_{i=1,...,N} w_i where w ∼ N(0, 1)
   9. θ̂ = 1

   Before each random generation set the seed to zero. A Monte Carlo sketch for the first estimator follows.
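   This is a minimal Monte Carlo sketch for the first estimator (the sample average); the other estimators can be assessed by replacing the statistic inside replicate. The number of simulation runs S = 10000 is an arbitrary choice.

   set.seed(0)
   N <- 10; S <- 10000
   mu <- (-2 + 3) / 2   ## true expectation of U[-2, 3] is 0.5
   thetahat <- replicate(S, mean(runif(N, min = -2, max = 3)))
   hist(thetahat, main = "Sampling distribution of the sample average")
   mean(thetahat) - mu   ## bias estimate (close to 0)
   var(thetahat)         ## variance estimate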

7. The student should first create a dataset of N = 1000 observations according to the dependency

   y = g(β_0 + β_1 x) + w

   where x ∼ U[−1, 1], β_0 = 1, β_1 = 1, w ∼ N(µ = 0, σ² = 0.1) and g(x) = e^x/(1 + e^x). Then, by using the same dataset, he should:

   • estimate by maximum likelihood the parameters β_0 and β_1;
   • plot the contour of the likelihood function, showing in the same graph the values of the parameters and their estimations (see the sketch below).

   Hint: use a grid search to perform the maximisation.
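   A hedged sketch of a possible solution, assuming (for simplicity) that the noise standard deviation is known; the grid bounds and step are arbitrary choices.

   set.seed(0)
   N <- 1000; beta0 <- 1; beta1 <- 1; sdw <- sqrt(0.1)
   g <- function(z) exp(z) / (1 + exp(z))
   x <- runif(N, -1, 1)
   y <- g(beta0 + beta1 * x) + rnorm(N, 0, sdw)
   ## Gaussian log-likelihood over a grid of (b0, b1) values
   b0 <- seq(0, 2, by = 0.05); b1 <- seq(0, 2, by = 0.05)
   LL <- outer(b0, b1, Vectorize(function(a, b)
     sum(dnorm(y, mean = g(a + b * x), sd = sdw, log = TRUE))))
   contour(b0, b1, LL, xlab = "beta0", ylab = "beta1")
   points(beta0, beta1, pch = 3)              ## true parameter values
   idx <- which(LL == max(LL), arr.ind = TRUE)
   points(b0[idx[1]], b1[idx[2]], pch = 19)   ## ML estimate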

8. The student should first create a dataset of N = 1000 observations according to the dependency

   Prob{y = 1|x} = g(β_0 + β_1 x)

   where x ∼ U[0, 1], β_0 = 1, β_1 = 1, g(x) = e^x/(1 + e^x) and y ∈ {0, 1}. Then, by using the same dataset, she should:

   • estimate by maximum likelihood the parameters β_0 and β_1;
   • plot the contour of the likelihood function, showing in the same graph the values of the parameters and their estimations (see the sketch below).

   Hint: use a grid search to perform the maximisation.
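   A hedged sketch for this exercise, analogous to the previous one but with the Bernoulli log-likelihood; the grid bounds are again arbitrary.

   set.seed(0)
   N <- 1000; beta0 <- 1; beta1 <- 1
   g <- function(z) exp(z) / (1 + exp(z))
   x <- runif(N, 0, 1)
   y <- rbinom(N, size = 1, prob = g(beta0 + beta1 * x))
   b0 <- seq(-1, 3, by = 0.05); b1 <- seq(-1, 3, by = 0.05)
   LL <- outer(b0, b1, Vectorize(function(a, b) {
     p <- g(a + b * x)
     sum(y * log(p) + (1 - y) * log(1 - p))   ## Bernoulli log-likelihood
   }))
   contour(b0, b1, LL, xlab = "beta0", ylab = "beta1")
   points(beta0, beta1, pch = 3)   ## true parameter values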

9. Let z ∼ N(1, 1), D_N a training set of N i.i.d. observations z_i and µ̂_N the related sample average estimator.

   1. Compute analytically E_{z,D_N}[(z − µ̂_N)²]. Hint: consider that z = θ + w where θ = E[z] and w ∼ N(0, 1).
   2. Compute analytically E_{z,D_N}[(z − µ̂_N)].
   3. Validate by Monte Carlo simulation the two theoretical results above.

   Solution: Since E[µ̂] = µ, Var[µ̂] = σ²_w/N and w is independent of D_N:

   E_{z,D_N}[(z − µ̂_N)²] = E_{z,D_N}[(θ + w − µ̂_N)²]
                         = E_{z,D_N}[w² + 2w(θ − µ̂_N) + (θ − µ̂_N)²]
                         = E_z[w²] + E_{D_N}[(θ − µ̂_N)²] = σ²_w + σ²_w/N = 1 + 1/N

   R code to perform the Monte Carlo validation:


   rm(list=ls())
   N <- 5
   S <- 10000
   sdw <- 1   ## noise standard deviation
   E <- NULL
   for (s in 1:S) {
     DN <- rnorm(N, 1, sdw)    ## training set of N observations
     muhat <- mean(DN)         ## sample average estimator
     z <- rnorm(1, 1, sdw)     ## independent test observation
     e <- z - muhat
     E <- c(E, e^2)
   }
   cat("th=", sdw^2 + sdw^2/N, "MC estimation=", mean(E), "\n")

10. Let us suppose that the only measurement of a Gaussian random variable z ∼ N(µ, 1) is the interval [−3.5, 1.5]. Estimate µ by maximum likelihood and plot the likelihood function L(µ). Hint: use the R function pnorm.

11. Let us suppose that 12 of the 31 days of August in Brussels are rainy. Estimate the

probability of a rainy day by maximum likelihood by using the Binomial distribution

(Section C.1.2).
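    A one-glance sketch of the solution: the Binomial log-likelihood is maximised at k/N = 12/31 ≈ 0.39.

    k <- 12; n <- 31
    p.grid <- seq(0.01, 0.99, by = 0.01)
    LL <- dbinom(k, size = n, prob = p.grid, log = TRUE)
    p.grid[which.max(LL)]   ## about 12/31 = 0.387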


Chapter 6

Nonparametric approaches

to estimation and testing

6.1 Nonparametric methods

In the previous chapter, we considered estimation problems where the probability distribution is known, apart from the values of some parameters (e.g. the mean and/or the variance). Such estimation methods are called parametric. The meaningfulness of a parametric test depends entirely on the validity of the assumptions made about the analytical form of the distribution. However, in real settings, it is not uncommon for the experimenter to question parametric assumptions.

Consider a random sample D_N of a r.v. z collected through some experimental observation and for which no hint about the underlying probability distribution F_z(·) is available. Suppose we want to estimate a parameter of interest θ of the distribution of z by using the plug-in estimate θ̂ = t(F̂) (Section 5.3). What can we say about the accuracy of the estimator θ̂? As shown in Section 5.5.3, for some specific parameters (e.g. mean and variance) the accuracy can be estimated independently of the parametric distribution. In most cases, however, the assessment of the estimator is not possible unless we know the underlying distribution. What to do, hence, if the distribution is not available? A solution is provided by the so-called nonparametric or distribution-free methods, which work independently of any specific assumption about the probability distribution.

The adoption of these methods has enjoyed considerable success in recent decades thanks to the evolution and parallelisation of computational processing power. In fact, most techniques for nonparametric estimation and testing are based on resampling procedures, which require a large number of repeated (and almost similar) computations on the data.

This chapter will deal with two resampling strategies for estimation and two

resampling strategies for hypothesis testing, respectively.

Jackknife: this approach to nonparametric estimation relies on repeated computations of the statistic of interest for all the combinations of the data where one or more of the original examples are removed. It will be presented in Section 6.3.

Bootstrap: this approach to nonparametric estimation aims to estimate the sampling distribution of an estimator by sampling (with replacement) from the original data. It will be introduced in Section 6.4.

Randomisation: this is a testing procedure based on resampling without replacement. It consists in taking the original data and scrambling either the order or the association of the original data. It will be discussed in Section 6.5.

Permutation: this is a resampling two-sample hypothesis-testing procedure based on repeated permutations of the dataset. It will be presented in Section 6.6.

6.2 Estimation of arbitrary statistics

Consider a set D_N of N data points sampled from a scalar r.v. z. Let E[z] = µ be the parameter to be estimated. In Section 5.3.1 we derived the bias and the variance of the estimator µ̂:

µ̂ = (1/N) ∑_{i=1}^N z_i,   Bias[µ̂] = 0,   Var[µ̂] = σ²/N

Consider now another quantity of interest, for example the median or a mode of the distribution. While it is easy to design a plug-in estimate of these quantities, their accuracy is difficult to compute. In other terms, given an arbitrary estimator θ̂, the analytical forms of the variance Var[θ̂] and of the bias Bias[θ̂] are typically not available.

Example

According to the plug-in principle (Section 5.3) we can design other estimators besides the sample mean and variance, such as:

• the estimation of the skewness (3.3.36) of z: see Equation (D.0.2);
• the estimation of the correlation (3.6.67) between x and y: see Equation (D.0.3).

What about the accuracy (e.g. bias, variance) of such estimators?

Example

Let us consider an example of estimation taken from an experimental medical study [65]. The goal of the study is to show the bioequivalence between an old and a new version of a patch designed to infuse a certain hormone into the blood. Eight subjects take part in the study. Each subject has his hormone level measured after wearing three different patches: a placebo, an "old" patch and a "new" patch. It is established by the Food and Drug Administration (FDA) that the new patch will be approved for sale only if it is bioequivalent to the old one according to the following criterion:

θ = |E(new) − E(old)| / (E(old) − E(placebo)) ≤ 0.2    (6.2.1)

Let us consider the following plug-in estimator (Section 5.3) of (6.2.1):

θ̂ = |µ̂_new − µ̂_old| / (µ̂_old − µ̂_placebo)

Suppose we have collected the following data (details in [65]):


subj   plac    old     new     z = old − plac   y = new − old
1      9243    17649   16449   8406             −1200
2      9671    12013   14614   2342             2601
3      11792   19979   17274   8187             −2705
...    ...     ...     ...     ...              ...
8      18806   29044   26325   10238            −2719
mean:                          6342             −452.3

The estimate is

θ̂ = t(F̂) = |µ̂_new − µ̂_old| / (µ̂_old − µ̂_placebo) = |µ̂_y| / µ̂_z = 452.3/6342 = 0.07

Can we say on the basis of this value that the new patch satisfies the FDA criterion (6.2.1)? What about the accuracy, bias or variance of the estimator? The techniques introduced in the following sections may provide an answer to these questions.

6.3 Jackknife

The jackknife (or leave-one-out) resampling technique aims at providing a computational procedure to estimate the variance and the bias of a generic estimator θ̂. The technique was first proposed by Quenouille in 1949 and is based on removing examples from the available dataset and recalculating the estimator. It is a general-purpose tool that is easy to implement and able to solve a number of estimation problems.

6.3.1 Jackknife estimation

In order to show the theoretical foundation of the jackknife, we first apply this technique to the estimator µ̂ of the mean. Let D_N = {z_1, ..., z_N} be the available dataset. Let us remove the i-th example from D_N and calculate the leave-one-out (l-o-o) mean estimate from the N − 1 remaining examples:

µ̂_(i) = (1/(N − 1)) ∑_{j≠i} z_j = (N µ̂ − z_i)/(N − 1)

Observe from above that the following relation holds:

z_i = N µ̂ − (N − 1) µ̂_(i)    (6.3.2)

that is, we can calculate the i-th example z_i, i = 1, ..., N, if we know both µ̂ and µ̂_(i). Suppose now we wish to estimate some parameter θ by using as estimator some complex statistic of the N data points

θ̂ = g(D_N) = g(z_1, z_2, ..., z_N)

The jackknife procedure consists in first computing

θ̂_(i) = g(z_1, z_2, ..., z_{i−1}, z_{i+1}, ..., z_N),   i = 1, ..., N

which is called the i-th jackknife replication of θ̂. Then, by analogy with the relation (6.3.2) holding for the mean estimator, we define the i-th pseudo-value by

η_(i) = N θ̂ − (N − 1) θ̂_(i).    (6.3.3)


These pseudo-values assume the same role as the z_i in calculating the sample average (5.3.4). Hence the jackknife estimate of θ is given by

θ̂_jk = (1/N) ∑_{i=1}^N η_(i) = (1/N) ∑_{i=1}^N [N θ̂ − (N − 1) θ̂_(i)] = N θ̂ − (N − 1) θ̂_(·)    (6.3.4)

where

θ̂_(·) = (∑_{i=1}^N θ̂_(i))/N.

The rationale of the jackknife technique is to use the quantity (6.3.4) in order to estimate the bias of the estimator. Since, according to (5.5.8), θ = E[θ̂] − Bias[θ̂], the jackknife approach consists in replacing θ by θ̂_jk and E[θ̂] by θ̂, thus obtaining

θ̂_jk = θ̂ − Bias_jk[θ̂].

It follows that the jackknife estimate of the bias of θ̂ is

Bias_jk[θ̂] = θ̂ − θ̂_jk = θ̂ − N θ̂ + (N − 1) θ̂_(·) = (N − 1)(θ̂_(·) − θ̂).

Note that in the particular case of a mean estimator (i.e. θ̂ = µ̂), we obtain, as expected, Bias_jk[µ̂] = 0.

A jackknife estimate of the variance of θ̂ can be obtained from the sample variance of the pseudo-values. We define the jackknife estimate of the variance of θ̂ as

Var_jk[θ̂] = Var[θ̂_jk]    (6.3.5)

Under the hypothesis of i.i.d. η_(i),

Var[θ̂_jk] = Var[(∑_{i=1}^N η_(i))/N] = Var[η_(i)]/N

From (6.3.3) we have

(∑_{i=1}^N η_(i))/N = N θ̂ − ((N − 1)/N) ∑_{i=1}^N θ̂_(i)

Since η_(i) = N θ̂ − (N − 1) θ̂_(i), it follows that

η_(i) − (∑_{i=1}^N η_(i))/N = −(N − 1) (θ̂_(i) − (∑_{i=1}^N θ̂_(i))/N)

and from (6.3.5) and (6.3.4) we obtain

Var_jk[θ̂] = ∑_{i=1}^N (η_(i) − θ̂_jk)² / (N(N − 1)) = ((N − 1)/N) ∑_{i=1}^N (θ̂_(i) − θ̂_(·))²

Note that in the case of the estimator of the mean (i.e. θ̂ = µ̂), since η_(i) = z_i and θ̂_jk = µ̂, we find again the result (5.5.10):

Var_jk[θ̂] = ∑_{i=1}^N (z_i − µ̂)² / (N(N − 1)) = σ̂²/N    (6.3.6)

which coincides with the estimate (5.5.10) of Var[µ̂].

The major motivation for jackknife estimates is that they reduce bias. Also, it can be shown that, under suitable conditions on the type of estimator θ̂, the quantity (6.3.6) converges in probability to Var[θ̂]. However, the jackknife can fail if the statistic θ̂ is not smooth, where smooth means that small changes in the data cause only small changes in the statistic. An example of a non-smooth statistic for which the jackknife works badly is the median. A minimal R sketch of the jackknife procedure is given below.
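The following sketch (not part of the original text) implements the jackknife estimates (6.3.4)-(6.3.6) for a generic statistic g:

## Jackknife bias and variance of a generic statistic g
jackknife <- function(DN, g) {
  N <- length(DN)
  thetahat <- g(DN)
  theta.i <- sapply(1:N, function(i) g(DN[-i]))   ## l-o-o replications
  theta.dot <- mean(theta.i)
  list(bias = (N - 1) * (theta.dot - thetahat),   ## jackknife bias estimate
       variance = (N - 1) / N * sum((theta.i - theta.dot)^2))  ## (6.3.6)
}
set.seed(0)
DN <- rnorm(20)
jackknife(DN, mean)   ## bias 0; variance equal to var(DN)/20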


6.4 Bootstrap

The bootstrap method was proposed by Efron [62] as a computer-intensive technique to estimate the accuracy of a generic estimator θ̂. Bootstrap relies on a data-based simulation method for statistical inference. The term bootstrap derives from the phrase to pull oneself up by one's bootstraps, based on the fictional Adventures of Baron Munchausen: the Baron had fallen to the bottom of a deep lake and, just when it looked like all was lost, he thought to pick himself up by his own bootstraps. In general terms, to pull yourself up by your bootstraps means to succeed in something very difficult without any outside help^1.

The idea of statistical bootstrap is very simple, namely that, in the absence of any other information, the sample itself offers the best guide to the sampling distribution. The method is completely automatic, requires no theoretical calculation, and is available no matter how mathematically complicated the estimator (5.4.6) is. By resampling with replacement from D_N we can build a set of B datasets D_(b), b = 1, ..., B. From the empirical distribution of the statistics g(D_(b)) we can construct confidence intervals and tests for significance.

6.4.1 Bootstrap sampling

Consider a dataset D_N. A bootstrap dataset D_(b), b = 1, ..., B, is created by randomly selecting N points from the original set D_N with replacement (Figure 6.1). Since D_N itself contains N points, there is nearly always duplication of individual points in a bootstrap dataset. Each point has an equal probability 1/N of being chosen on each draw. Hence, the probability that a point is chosen exactly k times is given by the binomial distribution (Section C.1.2):

Prob{k} = (N! / (k!(N − k)!)) (1/N)^k ((N − 1)/N)^{N−k},   0 ≤ k ≤ N

Given a set of N distinct values, there is a total of C(2N − 1, N) distinct bootstrap datasets. This number is quite large already for N > 10. For example, if N = 3 and D_N = {a, b, c}, we have 10 different bootstrap sets: {a,b,c}, {a,a,b}, {a,a,c}, {b,b,a}, {b,b,c}, {c,c,a}, {c,c,b}, {a,a,a}, {b,b,b}, {c,c,c}.

Under balanced bootstrap sampling, the B bootstrap sets are generated in such a way that each original data point is present exactly B times in the entire collection of bootstrap samples.

6.4.2 Bootstrap estimate of the variance

Given the estimator (5.4.6), for each bootstrap dataset D_(b), b = 1, ..., B, we can define a bootstrap replication

θ̂_(b) = g(D_(b)),   b = 1, ..., B

that is, the value of the statistic for the specific bootstrap sample. The bootstrap approach computes the variance of the estimator θ̂ through the variance of the set θ̂_(b), b = 1, ..., B, given by

Var_bs[θ̂] = ∑_{b=1}^B (θ̂_(b) − θ̂_(·))² / (B − 1)   where   θ̂_(·) = (∑_{b=1}^B θ̂_(b))/B    (6.4.7)

^1 This term does not have the same meaning (though the derivation is similar) as the one used in computer operating systems, where bootstrap stands for starting a computer from a hardwired set of core instructions.


Figure 6.1: Bootstrap replications of a dataset and bootstrap statistic computation


It can be shown that, if θ̂ = µ̂, then for B → ∞ the bootstrap estimate Var_bs[θ̂] converges to the variance Var[µ̂].

6.4.3 Bootstrap estimate of bias

Let θ̂ be a plug-in estimator (Equation (5.3.3)) based on the sample D_N and

θ̂_(·) = (∑_{b=1}^B θ̂_(b))/B    (6.4.8)

Since Bias[θ̂] = E[θ̂] − θ, the bootstrap estimate of the bias of the plug-in estimator θ̂ is obtained by replacing E[θ̂] with θ̂_(·) and θ with θ̂:

Bias_bs[θ̂] = θ̂_(·) − θ̂    (6.4.9)

Then, since

θ = E[θ̂] − Bias[θ̂]

the bootstrap bias-corrected estimate is

θ̂_bs = θ̂ − Bias_bs[θ̂] = θ̂ − (θ̂_(·) − θ̂) = 2θ̂ − θ̂_(·)    (6.4.10)

Note that if we want to estimate the bias of a generic non-plug-in estimator g(D_N), the θ̂ term on the right-hand side of (6.4.9) should still refer to the plug-in estimator t(F̂) (Equation (5.3.3)). A minimal sketch of the bootstrap estimation of variance and bias is given below.

R script

Run the R file patch.R for the estimation of bias and variance in the case of the

patch data example.

6.4.4 Bootstrap confidence interval

Standard bootstrap confidence limits are based on the assumption that the estimator θ̂ is normally distributed with mean θ and variance σ². Taking the bootstrap estimate of the variance, an approximate 100(1 − α)% confidence interval is given by

θ̂ ± z_{α/2} √(Var_bs[θ̂]) = θ̂ ± z_{α/2} √(∑_{b=1}^B (θ̂_(b) − θ̂_(·))² / (B − 1))    (6.4.11)

An improved interval is given by using the bootstrap correction for the bias:

2θ̂ − θ̂_(·) ± z_{α/2} √(∑_{b=1}^B (θ̂_(b) − θ̂_(·))² / (B − 1))    (6.4.12)

Another bootstrap approach for constructing a 100(1 − α)% confidence interval is to use the upper and lower α/2 values of the bootstrap distribution. This approach is referred to as the bootstrap percentile confidence interval. If θ̂_{L,α/2} denotes the value such that only a fraction α/2 of all bootstrap estimates is inferior to it, and likewise θ̂_{H,α/2} is the value exceeded by only α/2 of all bootstrap estimates, then the confidence interval is given by

[θ̂_{L,α/2}, θ̂_{H,α/2}]    (6.4.13)

where the two extremes are also called Efron's percentile confidence limits.
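A minimal sketch of the percentile interval (6.4.13), with α = 0.05 and B = 2000 as arbitrary choices:

set.seed(0)
DN <- rnorm(50)
theta.b <- replicate(2000, median(sample(DN, replace = TRUE)))
quantile(theta.b, c(0.025, 0.975))   ## Efron's percentile limits for the median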


6.4.5 The bootstrap principle

Given an unknown parameter θ of a distribution F_z and an estimator θ̂, the goal of any estimation procedure is to derive or approximate the distribution of θ̂ − θ. For example, the calculation of the variance of θ̂ requires the knowledge of F_z and the computation of E_{D_N}[(θ̂ − E[θ̂])²]. Now, in practical contexts F_z is unknown, and the computation of E_{D_N}[(θ̂ − E[θ̂])²] is not possible analytically. The rationale of the bootstrap approach is (i) to replace F_z by its empirical counterpart (5.2.2) and (ii) to compute E_{D_N}[(θ̂ − E[θ̂])²] by a Monte Carlo simulation approach (Section 3.9), where several samples of size N are generated by resampling D_N.

The outcome of a bootstrap technique is a Monte Carlo approximation of the distribution of θ̂_(b) − θ̂. In other terms, the variability of θ̂_(b) (based on the empirical distribution) around θ̂ is expected to mimic the variability of θ̂ (based on the true distribution) around θ.

The bootstrap principle relies on the two following properties: (i) as N gets larger and larger, the empirical distribution F̂_z(·) converges (almost surely) to F_z(·) (Glivenko-Cantelli theorem (C.9.17)) and (ii) as B gets larger, the quantity (6.4.7) converges (in probability) to the variance of the estimator θ̂ based on the empirical distribution (as stated in (C.8.14)). In other terms,

Var_bs[θ̂]  →(B → ∞)  Ê_{D_N}[(θ̂ − E[θ̂])²]  →(N → ∞)  E_{D_N}[(θ̂ − E[θ̂])²]    (6.4.14)

where Ê_{D_N}[(θ̂ − E[θ̂])²] stands for the plug-in estimate of the variance of θ̂ based on the empirical distribution.

In practice, for a small finite N, bootstrap estimation inevitably returns some error. This error is a combination of a statistical error and a simulation error. The statistical error component is due to the difference between the underlying distribution F_z(·) and the empirical distribution F̂_z(·). The magnitude of this error depends on the choice of the estimator θ̂(D_N) and decreases as the number N of observations increases.

The simulation error component is due to the use of the empirical (Monte Carlo) properties of θ̂(D_N) rather than its exact properties. The simulation error decreases as the number B of bootstrap replications increases.

Unlike the jackknife method, in the bootstrap the number of replications B can be adjusted to the computing resources. In practice, two rules of thumb are typically used:

1. Even a small number of bootstrap replications, e.g. B = 25, is usually informative, and B = 50 is often enough to give a good estimate of Var[θ̂].

2. Very seldom are more than B = 200 replications needed for estimating Var[θ̂]. Much bigger values of B are required for bootstrap confidence intervals.

Note that the use of rough statistics θ̂ (e.g. unsmooth or unstable ones) can make the resampling approach behave wildly. Examples of non-smooth statistics are sample quantiles and the median.

In general terms, for i.i.d. observations, the following conditions are required for the convergence of the bootstrap estimate:

1. the convergence of F̂ to F for N → ∞ (satisfied by the Glivenko-Cantelli theorem);

2. an estimator such that the estimate θ̂ is the corresponding functional of the empirical distribution:

   θ = t(F),   θ̂ = t(F̂)

   This is satisfied for sample means, standard deviations, variances, medians and other sample quantiles;

3. a smoothness condition on the functional. This is not true for extreme order statistics such as the minimum and the maximum values.

But what happens when the dataset D_N is not i.i.d. sampled from a distribution F? In such non-conventional configurations, the most basic version of bootstrap might fail. Examples are incomplete data (survival data, missing data), dependent data (e.g. the variance of a correlated time series) and dirty data (outliers). In these cases, specific adaptations of the bootstrap procedure are required. For reasons of space, we will not discuss them here. However, for a more exhaustive discussion of the limits of bootstrap, we invite the reader to refer to [123].

6.5 Randomisation tests

Randomisation tests were introduced by R.A. Fisher in 1935. The goal of a randomisation test is to help discover some regularity (e.g. a non-random property or pattern) in a complicated dataset. A classic example is to take a pack of playing cards and check whether they were well shuffled by our poker opponent. According to the hypothesis testing terminology, randomisation tests make the null hypothesis of randomness and test this hypothesis against the data. In order to test the randomness hypothesis, several random transformations of the data are generated.

Suppose we are interested in some property related to the order of the data. Let D_N = {x_1, ..., x_N} be the original dataset and t(D_N) some statistic which is a function of the order of the data in D_N. We want to test whether the value of t(D_N) is due only to randomness.

An empirical distribution is generated by scrambling (or shuffling) the N elements at random R times. For example, the j-th (j = 1, ..., R) scrambled dataset could be

D_N^(j) = {x_23, x_4, x_343, ...}

For each of the scrambled sets D_N^(j) we compute a statistic t_(j). The resulting distribution is called the resampling distribution.

• Suppose that the value of t(D_N) is exceeded by only k of the R values of the resampling distribution.

• The probability of observing t(D_N) under the null hypothesis (i.e. randomness) is then only p_t = k/R. The null hypothesis can be accepted/rejected on the basis of p_t.

The quantity p_t plays the role of a nonparametric p-value (Section 5.11.3) and it can be used, like its parametric counterpart, both to assess the evidence against the null hypothesis and to perform a decision test (e.g. refuse to play if we think the cards were not sufficiently shuffled).

A bioinformatics example

Suppose we have a DNA sequence and we think that the number of repeated sequences (e.g. AGTAGTAGT) in the sample is greater than expected by chance. Let t = 17 be the number of repetitions. How to test this hypothesis? Let us formulate the null hypothesis that the base order is random. We can construct an empirical distribution under the null hypothesis by taking the original sample and randomly scrambling the bases R = 1000 times. This creates samples with the same base frequencies as the original one but where the order of the bases is assigned at random. Suppose that only 5 of the 1000 randomised samples have a number of repetitions higher than or equal to 17. The p-value (i.e. the probability of seeing t = 17 under the null hypothesis) returned by the randomisation test then amounts to 0.005. You can run the randomisation test by using the R script file randomiz.R; a minimal self-contained sketch of the same idea is shown below.
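The sketch below is not the randomiz.R script: it is a hedged stand-alone version where the test statistic counts the occurrences of a single motif (AGT) in a randomly generated sequence.

set.seed(0)
count.motif <- function(s, motif = "AGT") {
  m <- gregexpr(motif, paste(s, collapse = ""), fixed = TRUE)[[1]]
  sum(m > 0)   ## number of motif occurrences (0 if none)
}
seq0 <- sample(c("A", "C", "G", "T"), 300, replace = TRUE)
t0 <- count.motif(seq0)                              ## observed statistic
t.null <- replicate(1000, count.motif(sample(seq0))) ## scrambled sequences
pt <- mean(t.null >= t0)                             ## nonparametric p-value
pt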

6.5.1 Randomisation and bootstrap

Both bootstrap and randomisation rely on resampling, but what are their peculiarities? A randomised sample is generated by scrambling the existing data (sampling without replacement), while a bootstrap sample is generated by sampling with replacement from the original sample. Also, randomisation tests are appropriate when the order of or the association between parts of the data is assumed to convey important information: they test the null hypothesis that the order or the association is random. On the other hand, bootstrap sampling aims to characterise the statistical distribution of some statistic t(D_N) where the order makes no difference to the statistic (e.g. the mean). Randomisation would be useless in that case, since t(D_N^(1)) = t(D_N^(2)) if D_N^(1) and D_N^(2) are obtained by resampling D_N without replacement.

6.6 Permutation test

The permutation test is used to perform a nonparametric two-sample test. Consider a random sample {z_1, ..., z_M} drawn from an unknown distribution z ∼ F_z(·) and a random sample {y_1, ..., y_N} drawn from an unknown distribution y ∼ F_y(·). For example, in a bioinformatics task the two datasets could be the expression measures of a gene under M normal and N pathological conditions. Let the null hypothesis be that the two distributions are the same, regardless of their analytical forms.

Consider an (order-independent) test statistic for the observed data and call it t(D_N, D_M). The rationale of the permutation test is to locate the statistic t(D_N, D_M) with respect to the distribution which could be obtained if the null hypothesis were true. In order to build the null hypothesis distribution, all the possible R = C(M + N, M) partitionings of the N + M observations into two subsets of size N and M are considered. If the null hypothesis were true, all the partitionings would be equally likely. Then, for each i-th partitioning (i = 1, ..., R), the permutation test computes the statistic t_(i). Eventually, the value t(D_N, D_M) is compared with the set of values t_(i): if the value t(D_N, D_M) falls in the α/2 tails of the t_(i) distribution, the null hypothesis is rejected with type I error α.

The permutation procedure will involve substantial computation unless M and N are small. When the number of permutations is too large, a random sample of a large number R of permutations can be taken. Note that, when the observations are drawn according to a normal distribution, it can be shown that the use of a permutation test gives results close to those obtained using the t-test.

Example

Let us consider D_M = [74, 86, 98, 102, 89] (M = 5) and D_N = [10, 25, 80] (N = 3). We run a permutation test with R = C(8, 3) = 56 partitionings to test the hypothesis that the two sets belong to the same distribution (R script s_perm.R; a minimal sketch follows below).

Let t(D_N, D_M) = µ̂(D_M) − µ̂(D_N) = 51.46. Figure 6.2 shows the position of t(D_N, D_M) with respect to the null sampling distribution.
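A minimal sketch of the computation (not the s_perm.R script):

DM <- c(74, 86, 98, 102, 89); DN <- c(10, 25, 80)
pooled <- c(DM, DN); M <- length(DM)
t.obs <- mean(DM) - mean(DN)      ## about 51.46
idx <- combn(length(pooled), M)   ## all 56 partitionings
t.null <- apply(idx, 2, function(i) mean(pooled[i]) - mean(pooled[-i]))
mean(abs(t.null) >= abs(t.obs))   ## two-sided p-value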


Figure 6.2: Null distribution returned by the permutation test and position (vertical

red line) of the observed statistic

6.7 Considerations on nonparametric tests

Nonparametric tests are a worthy alternative to parametric approaches when no assumptions about the probability distribution may be made (e.g. in bioinformatics). It is risky, however, to consider them a panacea, and a critical attitude towards them is to be preferred. In short, here are some of the major advantages and disadvantages of the nonparametric approach.

Advantages:

• If the sample size is very small, there may be no alternative to using a nonparametric test, unless the nature of the population distribution is known exactly.

• Nonparametric tests make fewer assumptions about the data.

• Nonparametric tests are available to analyse data that are inherently in ranks (e.g. taste of food), classificatory or categorical.

• Nonparametric tests are typically more intuitive and easier to implement.

Disadvantages:

• They involve high computational costs.

• The wide availability of statistical software makes the misuse of statistical measures possible.

• A nonparametric test is less powerful than a parametric one when the assumptions of the parametric test are met.

• Assumptions are associated with most nonparametric statistical tests too, namely that the observations are independent.


6.8 Exercises

1. Suppose you want to estimate the skewness γ of the uniform r.v. z ∼ U[−2, 3] by using a dataset of size N = 10. By using R and its random generator, first plot the sampling distribution, then estimate the bias and the variance of the following estimators:

   1. γ̂ = (1/N) ∑_i (z_i − µ̂)³ / σ̂³
   2. γ̂ = (1/N) ∑_i |z_i − µ̂|³ / σ̂³
   3. γ̂ = 1

   Before each random generation set the seed to zero. Hint: the skewness of a uniform continuous variable is equal to 0.

2. Suppose you want to estimate the skewness γ of the uniform r.v. z ∼ U[−2, 3] by using a dataset of size N = 10. By using R and its random generator, first generate a dataset D_N with N = 10. By using the jackknife, plot the sampling distribution, then estimate the bias and the variance of the following estimators:

   1. γ̂ = (1/N) ∑_i (z_i − µ̂)³ / σ̂³
   2. γ̂ = (1/N) ∑_i |z_i − µ̂|³ / σ̂³
   3. γ̂ = 1

   Compare the results with those of the previous exercise. Before each random generation set the seed to zero.

3. Suppose you want to estimate the skewness γ of the uniform r.v. z ∼ U[−2, 3] by using a dataset of size N = 10. By using R and its random generator, first generate a dataset D_N with N = 10. By using the bootstrap method, plot the sampling distribution, then estimate the bias and the variance of the following estimators:

   1. γ̂ = (1/N) ∑_i (z_i − µ̂)³ / σ̂³
   2. γ̂ = (1/N) ∑_i |z_i − µ̂|³ / σ̂³
   3. γ̂ = 1

   Compare the results with those of the two previous exercises. Before each random generation set the seed to zero. A Monte Carlo sketch for the first estimator follows.
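   A minimal Monte Carlo sketch for the first estimator of Exercise 1 (the choice of sd() for σ̂ and of S = 10000 runs are assumptions):

   set.seed(0)
   N <- 10; S <- 10000
   gamma.hat <- function(z) mean((z - mean(z))^3) / sd(z)^3
   G <- replicate(S, gamma.hat(runif(N, -2, 3)))
   hist(G, main = "Sampling distribution of the skewness estimator")
   mean(G)   ## bias estimate (the true skewness is 0)
   var(G)    ## variance estimate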

4. Let us consider a r.v. z such that E[z] = µ and Var[z] = σ². Suppose we want to estimate from an i.i.d. dataset D_N the parameter θ = µ² = (E[z])². Let us consider three estimators:

   θ̂_1 = ((∑_{i=1}^N z_i)/N)²
   θ̂_2 = (∑_{i=1}^N z_i²)/N
   θ̂_3 = (∑_{i=1}^N z_i)²/N

   • Are they unbiased?
   • Compute analytically the bias of the three estimators. Hint: use (3.3.30).
   • By using R, verify the result above by Monte Carlo simulation using different values of N.
   • By using R, estimate the bias of the three estimators by bootstrap (see the sketch below).

Solution: See the file Exercise1.pdf in the directory gbcode/exercises of the

companion R package (Appendix F).
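   As an illustration of the last point, a hedged sketch for θ̂_1 (the Gaussian distribution with µ = 2 is an arbitrary choice):

   set.seed(0)
   N <- 50
   z <- rnorm(N, mean = 2)   ## so theta = mu^2 = 4
   thetahat1 <- mean(z)^2
   theta.b <- replicate(200, mean(sample(z, replace = TRUE))^2)
   mean(theta.b) - thetahat1   ## bootstrap bias, close to sigma^2/N = 0.02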

Chapter 7

A statistical framework of

supervised learning

7.1 Introduction

A supervised learning problem can be described in statistical terms by the following elements:

1. A vector of n random input variables x ∈ X ⊂ R^n, whose values are i.i.d. according to an unknown probability distribution F_x(·).

2. A target operator which transforms the input values into outputs y ∈ Y according to an unknown conditional probability distribution F_y(y|x = x).

3. A collection D_N of N input/output data points ⟨x_i, y_i⟩, i = 1, ..., N, called the training set, drawn according to the joint input/output distribution F_{x,y}(x, y).

4. A learning machine or learning algorithm which, on the basis of the training set D_N, returns an estimation (or prediction) of the target for an input x. The input/output function estimated by the learning machine is called the hypothesis or the model.

Note that in this definition we encounter most of the notions presented in the previous chapters: probability distribution, conditional distribution, estimation.

Examples

Several practical problems can be seen as instances of a supervised learning problem:

• Predict whether a patient, hospitalised due to a heart attack, will have a second heart attack on the basis of demographic, diet and clinical measurements.

• Predict the price of a stock 6 months from now on the basis of company performance measures and economic data.

• Identify the risk factors for breast cancer based on clinical, demographic and genetic variables.

• Classify the category of a text email (spam or not) on the basis of its text content.

• Characterise the mechanical properties of a steel plate on the basis of its physical and chemical composition.



Figure 7.1: The supervised learning setting. The target operator returns an output for each input according to a fixed but unknown probabilistic law. The hypothesis predicts the value of the target output when fed with the same input.

In the case of the spam categorisation problem, the input vector may be a vector of size n, where n is the number of the most used English words and the i-th component of x represents the frequency of the i-th word in the email text. The output y is a binary class which takes two values: {SPAM, NO.SPAM}. The training set is a set of emails previously labelled by the user as SPAM or NO.SPAM. The goal of the learning machine is to create a classification function which, once a vector x of word frequencies is presented, is able to classify correctly the nature of the email.

A learning machine is nothing more than a particular instance of an estimator (5.4.7) whose goal is to estimate the parameters of the joint distribution F_{x,y}(x, y) (or sometimes of the conditional distribution F_y(y|x = x)) on the basis of a training set D_N, i.e. a set of i.i.d. realisations of the pair x and y. The goal of a learning machine is to return a hypothesis with low prediction error, i.e. a hypothesis which computes an accurate estimate of the output of the target when the same test value is input to both the target and the predictor (Fig. 7.1). The prediction error is also usually called the generalisation error, since it measures the capacity of the learned hypothesis to generalise to previously unseen test samples. A learning algorithm generalises well if it returns an accurate prediction for i.i.d. test data, i.e. input/output pairs which are independent of the training set yet generated by the same joint distribution F_{x,y}(x, y). We insist on the importance of the two "i"s in the i.i.d. assumption: test data are supposed i) to be generated by the same distribution underlying the training set but ii) to be independent of the training set.

We will only consider hypotheses of the form h(·, α), where α ∈ Λ is a vector of model parameters^1 or weights. Therefore, henceforth, we will denote a hypothesis h(·, α) by the corresponding vector α ∈ Λ. As we will see later, examples of hypotheses are linear models h(x, α) = x^T α (Section 9.1), where α represents the coefficients of the model, or feed-forward neural networks (Section 10.1.1), where α is the set of values taken by the weights of the neural architecture.

^1 It is important to remark that by model parameter we refer here to a tunable/trainable weight of the hypothesis function and not to the target of the estimation procedure as in Section 5.1.1.


Let α_N be the hypothesis returned by the learning machine on the basis of the training set, and let G_N denote its generalisation error. The goal of the learning machine is then to seek the hypothesis α_N which minimises the value G_N. In these terms, the learning problem could appear as a simple optimisation problem consisting in searching for the hypothesis α which yields the lowest generalisation error. Unfortunately, the reality is not that simple, since the learning machine cannot measure G_N directly but only return an estimate of this quantity, denoted by Ĝ_N. Moreover, what makes the problem still more complex is that the same finite training set is employed both to select α_N and to estimate G_N, thus inducing a strong correlation between these two quantities.

The common supervised learning practice to minimise the quantity G_N consists in:

1. decomposing the set of hypotheses Λ into a nested sequence of hypothesis classes (or model structures) Λ_1 ⊂ Λ_2 ⊂ ··· ⊂ Λ_S of increasing capacity (or expressiveness) s, with Λ = ∪_{s=1}^S Λ_s;

2. implementing a search procedure at two nested levels [125] (Fig. 7.2). The inner level, also known as parametric identification, considers a single class of hypotheses Λ_s and uses a method or algorithm to select a hypothesis h(·, α_N^s) from this class. The algorithm typically implements a procedure of multivariate optimisation in the space of model parameters of the class Λ_s, which can be solved by (conventional) optimisation techniques. Examples of parametric identification procedures which will be presented in subsequent chapters are linear least squares for linear models and back-propagated gradient descent for feedforward neural networks [165]. The outer level, also called structural identification, ranges over nested classes of hypotheses Λ_s (s = 1, ..., S) and executes for each of them the parametric routine, returning the vector α_N^s. The outcome of the parametric identification is used to assess the class Λ_s through a validation procedure which returns the estimate Ĝ_N^s on the basis of the finite training set. It is common to use nonparametric techniques to assess the quality of a predictor, like the bootstrap (Section 6.4) or cross-validation [176] based on the jackknife strategy (Section 6.3);

3. selecting the best hypothesis in the set {α_N^s}, s = 1, ..., S, according to the assessments {Ĝ_N^s} produced by the validation step. This final step, which returns the model to be used for prediction, is usually referred to as the model selection procedure. Instances of model selection include the problem of choosing the degree of a polynomial model or the problem of determining the best number of hidden nodes in a neural network [25].

The outline of the chapter is as follows. Section 7.2 introduces the supervised learning problem in statistical terms. We will show that classification (Section 7.3) and regression (Section 7.4) can be easily cast in this framework. Section 7.5 introduces the statistical assessment of a learning machine, while Section 7.6 reports some results from the work of Prof. Vapnik on statistical learning and in particular the formalisation of the notion of capacity of a learning machine. Section 7.7 discusses the notion of generalisation error and its bias/variance decomposition. Section 7.9 introduces the supervised learning procedure and its decomposition into structural and parametric identification. Model validation and in particular cross-validation, a technique for estimating the generalisation error on the basis of a finite amount of data, are introduced in Section 7.10.


Figure 7.2: The learning problem and its decomposition into parametric and structural identification. The larger the class of hypotheses Λ_s, the larger its expressive power in terms of functional relationships.

7.2 Estimating dependencies

This section details the main actors of the supervised learning problem:

• A data generator of random input vectors x ∈ X ⊂ R^n, independently and identically distributed (i.i.d.) according to some unknown (but fixed) probability distribution F_x(x). The variable x is called the independent variable. It is helpful to distinguish between the cases in which the experimenter has complete control over the values of x and those in which she does not. When the nature of the inputs is completely random, we consider x as a realisation of the random variable x having probability law F_x(·). When the experimenter's control is complete, we can regard F_x(·) as describing the relative frequencies with which the different values of x are set.

• A target operator, which transforms the input x into the output value y ∈ Y according to some unknown (but fixed) conditional distribution

  F_y(y|x = x)    (7.2.1)

  (this includes the simplest case where the target implements some deterministic function y = f(x)). The conditional distribution (7.2.1) formalises the stochastic dependency between inputs and output.

• A training set D_N = {⟨x_1, y_1⟩, ⟨x_2, y_2⟩, ..., ⟨x_N, y_N⟩} made of N pairs (or training examples) ⟨x_i, y_i⟩ ∈ Z = X × Y, independent and identically distributed (i.i.d.) according to the joint distribution

  F_z(z) = F_{x,y}(⟨x, y⟩)    (7.2.2)

  Note that, as in Section 5.4, the observed training set D_N ∈ Z^N = (X × Y)^N is considered here as the realisation of a random variable D_N.

• A learning machine having the following components:

  1. A class of hypothesis functions h(·, α) with α ∈ Λ. We consider only the case where the functions h(·, α) are single-valued mappings into Y.

  2. A loss function L(·, ·) associated with a particular y and a particular h(x), whose value L(y, h(x)) measures the discrepancy between the output y and the prediction h(x). For a given hypothesis h(·, α), the functional risk is the average of the loss over the X, Y domain:

     R(α) = E_{x,y}[L] = ∫_{X,Y} L(y, h(x, α)) dF_{x,y}(x, y) = ∫_{X,Y} L(y, h(x, α)) p(x, y) dx dy    (7.2.3)

     Note that L is random, since x and y are random test points (i.i.d. drawn from the same distribution (7.2.2) as the training set), while the hypothesis h(·, α) is given. This is the expected loss if we test the hypothesis h(·, α) over an infinite amount of i.i.d. input/output pairs generated by (7.2.2). For the class Λ of hypotheses we define

     α_0 = arg min_{α∈Λ} R(α)    (7.2.4)

     as the hypothesis in the class Λ which has the lowest functional risk. Here, we assume for simplicity that there exists a minimum value of R(α) achievable by a function in the class Λ. We denote by R(α_0) the functional risk of the class Λ of hypotheses.

functional risk of the class Λof hypotheses.

  3. If, instead of a single class of hypotheses, we consider the set Λ* containing all possible single-valued mappings h: X → Y, we may define the quantity

     α* = arg min_{α∈Λ*} R(α)    (7.2.5)

     and

     R* = R(α*)    (7.2.6)

     as the absolute minimum of the functional risk. Note that this quantity is ideal, since it requires the complete knowledge of the distribution underlying the data. In a classification setting, the optimal model with parameters α* is called the Bayes classifier and R(α*) the Bayes error (Section 7.3.1). In a regression setting (Section 7.4), where y = f(x) + w and the loss function is quadratic, h(·, α*) = f(·) and R(α*) amounts to the variance of w.

  4. An algorithm L of parametric identification which takes as input the training set D_N and returns as output one hypothesis function h(·, α_N) with α_N ∈ Λ. Here, we will consider only the case of deterministic and symmetric algorithms: this means, respectively, that they always return the same h(·, α_N) for the same dataset D_N and that they are insensitive to the ordering of the examples in D_N.

     The parametric identification of the hypothesis is done according to the ERM (Empirical Risk Minimisation) inductive principle [186], where

     α_N = α(D_N) = arg min_{α∈Λ} R_emp(α)    (7.2.7)

     minimises the empirical risk (also known as the training error or apparent error)

     R_emp(α) = (1/N) ∑_{i=1}^N L(y_i, h(x_i, α))    (7.2.8)

     constructed on the basis of the dataset D_N.
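To make the ERM principle concrete, here is a minimal R sketch; the data-generating process and the one-parameter linear hypothesis are assumptions made only for illustration.

set.seed(0)
N <- 100
x <- runif(N, -1, 1)
y <- 0.8 * x + rnorm(N, sd = 0.2)   ## assumed data-generating process
## Empirical risk (7.2.8) with quadratic loss for h(x, alpha) = alpha * x
Remp <- function(alpha) mean((y - alpha * x)^2)
alphas <- seq(-2, 2, by = 0.01)
alphaN <- alphas[which.min(sapply(alphas, Remp))]   ## ERM solution (7.2.7)
alphaN   ## close to 0.8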


This formulation of a supervised learning problem is quite general, given that it

includes two basic statistical problems:

1. the problem of classification (also known as pattern recognition),

2. the problem of regression estimation.

These two problems and their link with supervised learning will be discussed in the

following sections.

7.3 Dependency and classification

Classification is one of the most common problems in statistics. It consists in exploring the association between a categorical dependent variable and independent variables which can take either continuous or discrete values. The problem of classification is formulated as follows: consider an input/output stochastic dependence described by a joint distribution F_{x,y}(·) such that, once an input vector x is given, y ∈ Y = {c_1, ..., c_K} takes a value among K different classes. In the example of spam email classification, K = 2 and c_1 = SPAM, c_2 = NO.SPAM. We assume that the dependence is described by a conditional discrete probability distribution Prob{y = c_k|x = x} that satisfies

∑_{k=1}^K Prob{y = c_k|x} = 1

This means that observations are noisy and follow a probability distribution: given an input x, y does not always take the same value. Expecting a zero-error classification in this setting is then completely unrealistic.

Example

Consider a stochastic dependence where x represents a month of the year and y is a categorical variable representing the weather situation in Brussels. Suppose that y may take only the two values {RAIN, NO.RAIN}. The setting is stochastic since you might have a rainy August and some rare sunny December days. Suppose that the conditional probability distribution of y is represented in Figure 7.3. This figure plots Prob{y = RAIN|x = month} and Prob{y = NO.RAIN|x = month} for each month. Note that for each month the probability constraint is respected:

Prob{y = RAIN|x = month} + Prob{y = NO.RAIN|x = month} = 1

A classifier is a particular instance of estimator which, for a given x, is expected to return an estimate ŷ = ĉ = h(x, α) taking a value in {c_1, ..., c_K}. Once a cost function is defined, the problem of classification can be expressed in terms of the formalism introduced in the previous section. An example of cost function is the indicator function (taking only two values, zero and one)

L(c, ĉ) = 0 if c = ĉ,   1 if c ≠ ĉ    (7.3.9)

also called the 0/1 loss. However, we can imagine situations where some misclassifications are worse than others. In this case, it is better to introduce a loss matrix L_(K×K), where the element L_(jk) = L(c_j, c_k) denotes the cost of the misclassification


Figure 7.3: Conditional distribution Prob{y|x} where x is the current month and y is the random weather state. For example, the column corresponding to x = Dec and y = RAIN returns the conditional probability of RAIN in December.

when the predicted class is ĉ(x) = c_j and the correct class is c_k. This matrix must be null on the diagonal and non-negative everywhere else. In practical cases, the definition of a loss matrix can be quite challenging, since it should take into account and combine several criteria, some easy to quantify (e.g. financial costs) and some much less so (e.g. ethical considerations)^2. Note that in the case of the 0/1 loss function (Equation 7.3.9) all the elements outside the diagonal are equal to one.

The goal of the classification procedure for a given x is to find the predictor ĉ(x) = h(x, α) that minimises the quantity

∑_{k=1}^K L(ĉ(x), c_k) Prob{y = c_k|x}    (7.3.10)

which is the average of the ĉ(x) row of the loss matrix weighted by the conditional probabilities of observing y = c_k. Note that the average of the above quantity over the X domain,

∫_X ∑_{k=1}^K L(ĉ(x), c_k) Prob{y = c_k|x} dF_x = ∫_{X,Y} L(y, h(x, α)) dF_{x,y} = R(α)    (7.3.11)

corresponds to the functional risk (7.2.3). The problem of classification can then be seen as a particular instance of the more general supervised learning problem described in Section 7.2.

^2 By default, any automatic classifier (and the associated decision maker) implicitly or explicitly embeds a loss function weighting often highly heterogeneous criteria. For instance, the Tesla automatic braking system (implicitly or explicitly) assigns a cost to false positives (e.g. a bag wrongly identified as a pedestrian) and to false negatives (e.g. a pedestrian mistaken for a bag).


7.3.1 The Bayes classifier

It can be shown that the optimal classifier h(·, α_0), where α_0 is defined as in (7.2.4), is the one that returns for all x

c*(x) = h(x, α_0) = arg min_{c_j ∈ {c_1,...,c_K}} ∑_{k=1}^K L_(j,k) Prob{y = c_k|x}    (7.3.12)

The optimal classifier is also known as the Bayes classifier. In the case of a 0/1 loss function, the optimal classifier returns

c*(x) = arg min_{c_j ∈ {c_1,...,c_K}} ∑_{k=1:K, k≠j} Prob{y = c_k|x}    (7.3.13)
      = arg min_{c_j ∈ {c_1,...,c_K}} (1 − Prob{y = c_j|x})    (7.3.14)
      = arg min_{c_j ∈ {c_1,...,c_K}} Prob{y ≠ c_j|x} = arg max_{c_j ∈ {c_1,...,c_K}} Prob{y = c_j|x}    (7.3.15)

The Bayes decision rule selects the class c_j, j = 1, ..., K, that maximises the posterior probability Prob{y = c_j|x}.

Example

Consider a classification task where X = {1, 2, 3, 4, 5}, Y = {c_1, c_2, c_3} and the loss matrix and the conditional probability values are given in the following figures. Let us focus on the optimal classification for x = 2. According to (7.3.12), the Bayes classification rule for x = 2 returns

c*(2) = arg min_{j=1,2,3} {L_11 Prob{y = c_1|x = 2} + L_12 Prob{y = c_2|x = 2} + L_13 Prob{y = c_3|x = 2},
        L_21 Prob{y = c_1|x = 2} + L_22 Prob{y = c_2|x = 2} + L_23 Prob{y = c_3|x = 2},
        L_31 Prob{y = c_1|x = 2} + L_32 Prob{y = c_2|x = 2} + L_33 Prob{y = c_3|x = 2}}
      = arg min_{j=1,2,3} {0·0.2 + 1·0.8 + 5·0.0, 20·0.2 + 0·0.8 + 10·0.0, 2·0.2 + 1·0.8 + 0·0.0}
      = arg min_{j=1,2,3} {0.8, 4, 1.2} = 1

so the optimal decision is c_1. What would the Bayes classification have been in the 0/1 case? A hedged sketch of this computation is given below.
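This R sketch reproduces the computation above; the loss matrix and the conditional probabilities are the values assumed from the example figures.

## L[j, k] = cost of predicting c_j when the true class is c_k (assumed values)
L <- matrix(c( 0,  1,  5,
              20,  0, 10,
               2,  1,  0), nrow = 3, byrow = TRUE)
p <- c(0.2, 0.8, 0.0)   ## Prob{y = c_k | x = 2}, assumed from the figure
risk <- L %*% p         ## expected loss (7.3.10) of predicting each class
which.min(risk)         ## Bayes decision (7.3.12): class 1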


Figure 7.4: Class-conditional distributions: the green class is distributed as a mixture of two Gaussians while the red class as a unimodal Gaussian.

7.3.2 Inverse conditional distribution

An important quantity, often used in classification algorithms, is the inverse conditional distribution. According to the Bayes theorem (3.1.20) we have that

Prob{y = c_k|x = x} = Prob{x = x|y = c_k} Prob{y = c_k} / ∑_{k=1}^K Prob{x = x|y = c_k} Prob{y = c_k}    (7.3.16)

and that

Prob{x = x|y = c_k} = Prob{y = c_k|x = x} Prob{x = x} / ∑_x Prob{y = c_k|x = x} Prob{x = x}.    (7.3.17)

The above relation means that, by knowing the a-posteriori conditional distribution Prob{y = c_k|x = x} and the a-priori distribution Prob{x = x}, we can derive the inverse conditional distribution Prob{x = x|y = c_k}. This distribution is replaced by a density if x is continuous and is also known as the class-conditional density. It characterises the values of the input x for a given class c_k.

Shiny dashboard

The Shiny dashboard classif2.R illustrates a binary classification task where x ∈ R^2 and the two classes are green and red. The green and the red class-conditional distributions (7.3.17) are a mixture of two gaussians (Section 3.7.2) and a unimodal gaussian, respectively (Figure 7.4). Figure 7.5 illustrates the associated conditional distribution (7.3.16) if the two classes have an equal a-priori probability (Prob{y = red} = Prob{y = green}). Figure 7.6 shows the scattering of a set of N = 500 points sampled according to the class-conditional distributions in Figure 7.4.
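A dataset like the one in Figure 7.6 can be simulated in a few lines of R. The following sketch samples from hypothetical class-conditional distributions of this kind (the means and variances are illustrative and not those of the classif2.R dashboard):

## sample N points from two class-conditional distributions
set.seed(0)
N <- 500
y <- sample(c("green", "red"), N, replace = TRUE)  # equal a-priori probability
X <- t(sapply(y, function(cl) {
  if (cl == "green") {                 # mixture of two gaussians
    m <- if (runif(1) < 0.5) c(-2, 0) else c(2, 0)
    rnorm(2, mean = m, sd = 0.7)
  } else rnorm(2, mean = c(0, 0), sd = 0.7)        # unimodal gaussian
}))
plot(X, col = y, xlab = "x1", ylab = "x2")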


Figure 7.5: The a-posteriori conditional distribution associated with the class-conditional distributions (equal a-priori probability) in Figure 7.4.

Figure 7.6: Dataset sampled according to the class-conditional distributions (equal

a-priori probability) in Figure 7.4.


Figure 7.7: Inverse conditional distribution of the distribution in Figure 7.3

Example

Suppose we want to know during which months it is most probable to have rain. This boils down to having the distribution of x for y = RAIN. Figure 7.7 plots the inverse conditional distributions Prob{x = month|y = RAIN} and Prob{x = month|y = NO.RAIN} according to (7.3.17) when we assume that the a priori distribution is uniform (i.e. Prob{x = x} = 1/12 for all x).

Note that

$$\sum_{\text{month}} \text{Prob}\{x = \text{month}|y = \text{NO.RAIN}\} = \sum_{\text{month}} \text{Prob}\{x = \text{month}|y = \text{RAIN}\} = 1$$

7.4 Dependency and regression

Consider the stochastic relationship between two continuous random variables x ∈ R^n and y ∈ R described by
$$F_{x,y}(x, y) \qquad (7.4.18)$$

This means that to each vector x sampled according to F_x(x) there corresponds a scalar y sampled from F_y(y|x = x). Assume that a set of N input/output observations is available. The estimation of the stochastic dependence on the basis of the empirical dataset requires the estimation of the conditional distribution F_y(y|x).

This is known to be a difficult problem but for prediction purposes, most of the

time, it is sufficient to estimate the conditional expectation

$$f(x) = E_y[y|x] = \int_{\mathcal{Y}} y\, dF_y(y|x) \qquad (7.4.19)$$

also known as the regression function.


The regression function is also related to the functional risk
$$R(\alpha) = \int L(y, h(x, \alpha))\, dF_{x,y}(x, y) = \int (y - h(x, \alpha))^2\, dF_{x,y}(x, y) \qquad (7.4.20)$$
for the quadratic loss L(y, h) = (y - h)^2. From (3.6.63) it can be shown that the minimum (7.2.4) is attained by the regression function h(·, α_0) = f(·) if the function f belongs to the set h(x, α), α ∈ Λ.

Once the regression function f is defined, the input/output stochastic dependency (7.4.18) is commonly represented in the regression plus noise form
$$y = f(x) + w = E_y[y|x] + w \qquad (7.4.21)$$
where w denotes the noise term and satisfies E[w] = 0 and E[w^2] = σ_w^2. The role of

the noise is to make explicit that some variability of the target cannot be explained

by the regression function f. Notice that the assumption of an additive noise w

independent of x is common in statistical literature and is not overly restrictive. In

fact, many other conceivable signal/noise models can be transformed into this form.

The problem of estimating the regression function (7.4.19) is then a particular

instance of the supervised learning problem described in Section 7.2, where the

learning machine is assessed by a quadratic cost function. Examples of learning

algorithms for regression will be discussed in Section 9.1 and Section 10.1.

7.5 Assessment of a learning machine

A learning machine works well if it exhibits good generalisation, i.e. if it is able

to perform good predictions for unseen input values, which are not part of the

training set but that are generated by the same input/output distribution (7.2.2)

underlying the training set. This ability is commonly assessed by the amount of

bad predictions, measured by the generalisation error. The generalisation error of

a learning machine can be evaluated at two levels:

Hypothesis: Let αN be the hypothesis returned by a learning algorithm for a

training set DN according to the ERM principle (Eq. (7.2.7)). The functional

risk R (αN ) in (7.2.3) represents the generalisation error of the hypothesis αN .

This quantity is also known as conditional error rate [98] since it is conditional

on a given training set DN .

Algorithm: Let us define the average of the loss L for a given input x over the ensemble of training sets of size N as
$$g_N(x) = E_{D_N,y}[L|x = x] = \int_{\mathcal{Z}^N, \mathcal{Y}} L(y, h(x, \alpha_N))\, dF_y(y|x)\, dF^N_z(D_N) \qquad (7.5.22)$$

where F^N_z(D_N) is the distribution of the i.i.d. dataset D_N. In this expression L is a function of the random variables D_N (through h) and y, while the test input x is fixed. In the case of a quadratic loss function, this quantity corresponds to the mean squared error (MSE) defined in Section 5.5.6. By averaging the quantity (7.5.22) over the X domain we have

$$G_N = \int_{\mathcal{X}} g_N(x)\, dF_x(x) = E_{D_N} E_{x,y}[L(y, h(x, \alpha_N))] \qquad (7.5.23)$$
that is the generalisation error of the algorithm L (also known as expected error rate [66] or expected test error [98]).


Figure 7.8: Functional risk vs. MISE. (Diagram: the stochastic process generates several training sets of size N; for each of them the learning machine returns a hypothesis α_N with functional risk R(α_N), and G_N = E[R(α_N)].)

From (7.2.3) and (7.5.23) we obtain that
$$G_N = E_{D_N}[R(\alpha_N)]$$
where R(α_N) is random because of the dependence on D_N (Figure 7.8).

In the case of a quadratic loss function, the quantity
$$\text{MISE} = E_{D_N} E_{x,y}[(y - h(x, \alpha_N))^2] \qquad (7.5.24)$$
takes the name of mean integrated squared error (MISE).

The two criteria correspond to two different ways of assessing the learning machine:

the first is a measure to assess the specific hypothesis (7.2.7) chosen by ERM, the

second assesses the average performance of the algorithm over training sets with N

observations. According to the hypothesis-based approach the goal of learning is to

find, on the basis of observations, the hypothesis that minimises the functional risk.

According to the algorithm-based approach the goal is to find, on the basis of

observations, the algorithm which minimises the generalisation error. The two cri-

teria will be detailed in Sections 7.6 and 7.7, respectively. Note that both quantities require the knowledge of F_{x,y}, which is unfortunately unknown in real situations. A

key issue in machine learning is then to take advantage of observable quantities, i.e.

quantities that may be computed on the basis of the observed dataset, to estimate

or approximate the measures discussed above. An important quantity in this sense

is the empirical risk (7.2.8) which has however to be carefully considered in order

to avoid too optimistic evaluations of the learning machine accuracy.

7.5.1 An illustrative example

The notation introduced in Sections 7.2 and 7.5 is rigorous but it may appear hostile to the practitioner. In order to make the statistical concepts more accessible we


present a simple example to illustrate these concepts. We consider a supervised

learning regression problem where:

The input is a scalar random variable x ∈ R with a uniform probability distribution over the interval [-2, 2].

The target is distributed according to a conditional Gaussian distribution
$$p_y(y|x = x) = \mathcal{N}(x^3, 1) \qquad (7.5.25)$$
where the conditional expected value E[y|x] is the regression function f(x) = x^3 and the noise w has a unit variance.

The training set D_N = {⟨x_i, y_i⟩}, i = 1, ..., N consists of N = 100 i.i.d. pairs (Figure 7.9) generated according to the distribution (7.5.25). Note that this training set can be easily generated with the following R commands

## script regr.R
N <- 100
X <- runif(N, -2, 2)    # uniform input over [-2, 2]
Y <- X^3 + rnorm(N)     # regression function x^3 plus unit-variance Gaussian noise
plot(X, Y)

Figure 7.9: Training set (dots) obtained by sampling uniformly in the interval [-2, 2] an input/output distribution with regression function f(x) = x^3 and unit variance.

The learning machine is characterised by the following three components:

1. A class of hypothesis functions h (x, α ) = αx consisting of all the linear

models passing through the origin. The class Λ is then the set of real

numbers.


Figure 7.10: The empirical risk for the training set D_N vs. the model parameter value (x-axis). The minimum of the empirical risk is attained at α = 2.3272.

2. A quadratic loss L(y, h(x)) = (y - h(x))^2.

3. An algorithm of parametric identification based on the least-squares tech-

nique, which will be detailed later in Section 9.1.2. The empirical risk is

the quantity

$$R_{\text{emp}}(\alpha) = \frac{1}{100} \sum_{i=1}^{100} (y_i - \alpha x_i)^2 \qquad (7.5.26)$$

The empirical risk is a function of α and the training set. For the given training set D_N, the empirical risk as a function of α is plotted in Fig. 7.10.

For the dataset DN in Figure 7.9, it is possible to obtain αN by minimising the

empirical risk (7.5.26)

$$\alpha_N = \arg\min_{\alpha \in \Lambda} R_{\text{emp}}(\alpha) = \arg\min_{\alpha \in \Lambda} \frac{1}{100} \sum_{i=1}^{100} (y_i - \alpha x_i)^2 = 2.3272 \qquad (7.5.27)$$

The selected hypothesis is plotted in the input/output domain in Fig. 7.11.
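For this one-parameter class, the minimisation (7.5.27) has a closed-form solution. A minimal R sketch (reusing the X and Y generated by regr.R above):

## least-squares estimate of alpha for h(x, alpha) = alpha * x
alphaN <- sum(X * Y) / sum(X^2)
## equivalently, with a no-intercept linear model: coef(lm(Y ~ X - 1))
alphaN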

If the joint distribution (e.g. its conditional expectation and variance) were to

be known, it would also be possible to compute the risk functional (7.2.3) as

$$R(\alpha) = \frac{1}{4} \int_{-2}^{2} (x^3 - \alpha x)^2\, dx + 1 = \frac{4\alpha^2}{3} - \frac{32}{5}\alpha + \frac{71}{7} \qquad (7.5.28)$$

where the derivation of the equality is sketched in Appendix C.13. For the given

joint distribution, the quantity R(α) is plotted as a function of α in Fig. 7.12. The function takes a global minimum at α_0 = 2.4, as can be derived from the analytical expression in (7.5.28).

The computation of the quantity (7.5.22) requires however an average over all

the possible realisations of the random variable αN for datasets of N = 100 points.

Figure 7.13 shows 6 different realisations of the training set for the same conditional

distribution (7.5.25) and the corresponding 6 values of αN . Note that those six

values may be considered as 6 different realisations of the sampling distribution

(Section 5.4) of αN .

It is important to remark that both the quantities (7.2.3) and (7.5.22) may be

computed only if we know a priori the data joint distribution. Unfortunately, in

real cases this knowledge is not accessible and the goal of learning theory is to study

the problem of estimating these quantities from a finite set of data.


Figure 7.11: Training set (dotted points) and the linear hypothesis function h(·, α_N) (straight line). The quantity α_N, which represents the slope of the straight line, is the value of the model parameter α which minimises the empirical risk.

Figure 7.12: The functional risk (7.5.28) vs. the value of the model parameter α (x-axis). The minimum of the functional risk is attained at α_0 = 2.4.


Figure 7.13: Six different realisations of a training set with N = 100 points (dots) and the corresponding hypotheses (solid straight lines) chosen according to the ERM principle (7.5.27).


Monte Carlo computation of generalisation error

The script functRisk.R computes by Monte Carlo the functional risk (7.5.28) for different values of α and returns the value α_0 = 2.4 which minimises it. Note that the functional risk is computed by generating a very large number of i.i.d. test examples.

The script gener.R computes by Monte Carlo the generalisation error (7.5.23). Unlike the previous script, which considers only the predictive value of different hypotheses (with different α), this script assesses the average accuracy of the empirical risk minimisation strategy (7.5.27) for a finite number N = 100 of examples.
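Both computations can be sketched in a few lines of R. The following is only an illustration of the idea; the actual functRisk.R and gener.R scripts may differ:

## functional risk R(alpha) by Monte Carlo on a large i.i.d. test set
set.seed(0)
Nts <- 1e5
Xts <- runif(Nts, -2, 2)
Yts <- Xts^3 + rnorm(Nts)
alphas <- seq(1, 4, by = 0.01)
R <- sapply(alphas, function(a) mean((Yts - a * Xts)^2))
alphas[which.min(R)]          # close to alpha0 = 2.4

## generalisation error G_N (7.5.23): average the risk of the ERM
## hypothesis over many training sets of size N = 100
GN <- mean(replicate(1000, {
  X <- runif(100, -2, 2); Y <- X^3 + rnorm(100)
  alphaN <- sum(X * Y) / sum(X^2)    # ERM solution (7.5.27)
  mean((Yts - alphaN * Xts)^2)       # risk of the learned hypothesis
}))
GN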

7.6 Functional and empirical risk

This section reports some results from the pioneering work of Prof. Vladimir Vap-

nik [188, 186, 187] on statistical learning. He defines the learning problem as the

problem of finding the hypothesis which minimises the functional risk (7.2.3) on

the basis of a finite set of observed data and without any specific assumption about

the data distribution. For details and mathematical derivations, we refer the reader

to his books [186, 187]. Here we limit ourselves to reporting some of his most significant results. We start by rewriting the functional risk notation (7.2.3) as

$$R(\alpha) = \int L(y, h(x, \alpha))\, dF_{\langle x,y \rangle}(x, y) = \int Q(z, \alpha)\, dF_z(z), \quad \alpha \in \Lambda \qquad (7.6.29)$$
where z = ⟨x, y⟩, Q(z, α) = L(y, h(x, α)), and the probability measure F_z(·) is unknown but an i.i.d. sample z_1, ..., z_N is given. Analogously, the empirical risk may be rewritten as

$$R_{\text{emp}}(\alpha_N) = \frac{1}{N} \sum_{i=1}^{N} Q(z_i, \alpha_N)$$

Let us define with Λ* the set of all possible single-valued mappings f : X → Y and consider the quantity
$$\alpha^* = \arg\min_{\alpha \in \Lambda^*} R(\alpha)$$
where R(α*) is the absolute minimum rate of functional risk (7.2.6).

We can write the equality
$$R(\alpha_N) - R(\alpha^*) = (R(\alpha_N) - R(\alpha_0)) + (R(\alpha_0) - R(\alpha^*)) = \text{Err}_{\text{estim}}(\alpha_N) + \text{Err}_{\text{approx}}(\alpha_N)$$
where α_0 is the hypothesis with lowest risk in Λ (Equation (7.2.4)).

The first right-hand term is the estimation error while the second is the approximation error (Figure 7.14). The estimation error represents the discrepancy between the generalisation error of the best hypothesis in the class (R(α_0)) and the one learned from D_N (R(α_N)). The approximation error is non-null when the best hypothesis in the class Λ (h(·, α_0)) is different from h(·, α*).

The trade-off between approximation and estimation error is controlled by the size of Λ: when the size of Λ is large, R(α_0) is close to R(α*) but the estimation error could be large. Conversely, if the size of Λ is small, the estimation error is limited but the approximation error could be non-negligible.


Figure 7.14: Decomposition of the functional risk into estimation and approximation error.

7.6.1 Consistency of the ERM principle

Functional and empirical risk are two key quantities in statistical learning (Figure 7.15). The functional risk represents the generalisation accuracy of the hypothesis once tested with new data, while R_emp(·) measures the accuracy of the fitting to the training set. A main issue is that R_emp(·) could be a very bad estimator of the functional risk, e.g. when the class of hypotheses is too rich with respect to the size of the observed sample.

According to Vapnik it is important to characterise the relation between those two quantities, i.e. to define the (necessary and sufficient) conditions for the empirical risk R_emp(α_N) to converge for N → ∞ to the best functional risk R(α_0) in the class Λ. This is known as the problem of consistency of the Empirical Risk Minimisation (ERM) principle.

In formal terms, the ERM principle is consistent for the set of functions Q ( z, α)

and for the probability distribution Pz (z ) if the following two sequences converge

in probability to the same limit

R( αN ) P

N→∞ R(α 0 )

Remp (αN ) P

N→∞ R(α 0 )

The following lemma shows that both convergences may be studied by considering the quantity sup_{α∈Λ} |R_emp(α) - R(α)|.

Lemma 3 (Devroye, 1988).
$$R(\alpha_N) - \inf_{\alpha \in \Lambda} R(\alpha) = R(\alpha_N) - R(\alpha_0) \le 2 \sup_{\alpha \in \Lambda} |R_{\text{emp}}(\alpha) - R(\alpha)|$$
$$|R_{\text{emp}}(\alpha_N) - R(\alpha_N)| \le \sup_{\alpha \in \Lambda} |R_{\text{emp}}(\alpha) - R(\alpha)|$$

Setting an upper bound for sup_{α∈Λ} |R_emp(α) - R(α)|, we obtain an upper bound for three quantities:


Figure 7.15: Functional and empirical risk.

1. the estimation error R(α_N) - R(α_0), which returns the sub-optimality of the model chosen by the ERM principle within the class Λ;

2. |R_emp(α_N) - R(α_N)|, that is the error committed when the empirical risk is used to estimate the functional risk of the selected model;

3. |R_emp(α_N) - R(α_0)|, that is the error made when the empirical risk is used to estimate the functional risk of the best model in the class Λ.

It can be shown that bounding sup_{α∈Λ} |R_emp(α) - R(α)| is not only a sufficient but also a necessary condition for consistency of the ERM principle.

7.6.2 Key theorem of learning

Theorem 6.1 (Vapnik, Chervonenkis, 1991). Let Q(z, α), α ∈ Λ, be a set of functions that satisfy the condition
$$a \le \int Q(z, \alpha)\, dP(z) \le b$$
A necessary and sufficient condition for the ERM principle to be consistent is that the empirical risk R_emp(α) converges uniformly to the actual risk R(α) over the set Q(z, α), α ∈ Λ, that is

$$\lim_{N \to \infty} \text{Prob}\left\{ \sup_{\alpha \in \Lambda} (R(\alpha) - R_{\text{emp}}(\alpha)) > \varepsilon \right\} = 0 \quad \forall \varepsilon > 0$$

This theorem rephrases the problem of ERM consistency as a problem of uniform

convergence, which ensures that the empirical risk is a good approximation of the

functional risk over all functions of Λ (i.e. including the worst-case).

The uniform convergence is trivial and guaranteed by the Law of Large Numbers

if the set of functions Q (z, α ) contains a single element: in fact, this is nothing more

than the convergence of the average to the expectation for increasing N. For a real-valued bounded function a ≤ Q(z, α) ≤ b, by Hoeffding's inequalities (Section 5.6) we have

$$\text{Prob}\left\{ \int Q(z, \alpha)\, dP(z) - \frac{1}{N} \sum_{i=1}^{N} Q(z_i, \alpha) > \varepsilon \right\} < \exp\left( -\frac{2\varepsilon^2 N}{(b-a)^2} \right)$$


Then the probability of a deviation between empirical and functional risk converges

to zero for N → ∞. It is easy to generalise to the case where Q (z , α ) has a finite

number K of elements:

$$\text{Prob}\left\{ \sup_{1 \le k \le K} \left( \int Q(z, \alpha_k)\, dP(z) - \frac{1}{N} \sum_{i=1}^{N} Q(z_i, \alpha_k) \right) > \varepsilon \right\} < K \exp\left( -\frac{2\varepsilon^2 N}{(b-a)^2} \right) = \exp\left( \left( \frac{\ln K}{N} - \frac{2\varepsilon^2}{(b-a)^2} \right) N \right)$$

In order to obtain uniform convergence for any ε, the expression
$$\lim_{N \to \infty} \frac{\ln K}{N} = 0 \qquad (7.6.30)$$

has to be satisfied. A problem arises when the set of functions is infinite, as in machine learning where the most common classes of hypotheses are uncountable. In this case we need to generalise the classical law of large numbers to functional spaces. Consider the sequence of random variables

$$\xi_N = \sup_{\alpha \in \Lambda} (R(\alpha) - R_{\text{emp}}(\alpha)) = \sup_{\alpha \in \Lambda} \left( \int Q(z, \alpha)\, dF(z) - \frac{1}{N} \sum_{i=1}^{N} Q(z_i, \alpha) \right)$$

where the set of functions Q(z, α), α ∈ Λ, has an infinite number of elements. Unlike the finite case, the sequence ξ_N does not necessarily converge to zero. The problem of learning is then strongly related to the problem of defining which properties of the class of functions Q(z, α), α ∈ Λ, guarantee the convergence in probability of the sequence ξ_N to zero. In the following section we show some theoretical results from Vapnik about the relation between ERM consistency and the topological properties (notably the diversity) of the class of hypotheses.

7.6.2.1 Entropy of a set of functions

In what follows we limit ourselves to the binary classification setting, though similar results can be shown for regression. In this setting the functions Q(z, α), α ∈ Λ are indicator functions since they may take only 0 or 1 values. In order to characterise the diversity of the set of functions Q(z, α), α ∈ Λ, on the dataset D_N, let N^Λ(D_N) be the number of possible separations of D_N using the functions Q(z, α), α ∈ Λ. Note that N^Λ(D_N) is a random variable since D_N is a random variable.

An example of this concept is presented in Figure 7.16³ where N = 3 and the functions h(·) implement linear separators of the 2D (n = 2) input space. This class of functions is able to perform all possible (i.e. 2^N = 8) separations of the dataset. It is also said that the class Λ of functions shatters the dataset of size N = 3. In other words, a set of N points is said to be shattered by a class of hypotheses Λ if, no matter how we assign a binary label to each point, there exists a hypothesis in Λ that separates them. Note that a set of N = 4 points is not shattered by a class of linear separators.

The quantity
$$H^{\Lambda}(N) = E[\ln N^{\Lambda}(D_N)]$$
is called the entropy of the set of functions on the given data and measures the diversity of the class of hypotheses for a given number of observations.

The following theorem from Vapnik shows that this quantity is related to the

consistency of the ERM principle.

³Taken from https://datascience.stackexchange.com/questions/16140/how-to-calculate-vc-dimension/16146


Figure 7.16: Number of linear separations of a dataset of N = 3 points.

Theorem 6.2. A necessary and sufficient condition for the two-sided uniform con-

vergence of the functional risk to the empirical risk is that

$$\lim_{N \to \infty} \frac{H^{\Lambda}(N)}{N} = 0 \qquad (7.6.31)$$

In other words, the ratio of the entropy to the number of observations should

decrease to zero with increasing number of observations. Note that this condition

depends on the underlying probability distribution Fz (· ) and that the entropy plays,

for uncountable classes, the role played by the number of functions in the finite case

(compare (7.6.30) with (7.6.31)).

7.6.2.2 Distribution-independent consistency

Vapnik was also able to extend the distribution-dependent result of the previous section to a distribution-free setting.

Theorem 6.3. A necessary and sufficient condition for consistency of ERM for any probability measure is
$$\lim_{N \to \infty} \frac{G^{\Lambda}(N)}{N} = 0$$
where
$$G^{\Lambda}(N) = \ln \max_{D_N} N^{\Lambda}(D_N)$$
is the growth function.

Vapnik proved that in the pattern recognition case

$$\text{Prob}\left\{ \sup_{\alpha \in \Lambda} (R(\alpha) - R_{\text{emp}}(\alpha)) > \varepsilon \right\} \le 4 \exp\left( \left( \frac{G^{\Lambda}(2N)}{N} - \varepsilon^2 \right) N \right) \qquad (7.6.32)$$


This means that, provided that GΛ (N ) does not grow linearly in N , it is actu-

ally possible to bound the (unknown) functional risk R (αN ) on the basis of the

(observable) empirical risk Remp (αN ).

If we set the probability in (7.6.32) to δ > 0 and we solve for ε, then the following inequality holds with probability 1 - δ:
$$R(\alpha_N) \le R_{\text{emp}}(\alpha_N) + \frac{\sqrt{E}}{2} \qquad (7.6.33)$$
where the right-hand side is called the guaranteed risk and
$$E = 4\, \frac{G^{\Lambda}(2N) - \ln(\delta/4)}{N}.$$

Several other bounds have been derived for different classes of hypotheses in [186].

7.6.3 The VC dimension

Vapnik and Chervonenkis showed that either the relation G^Λ(N) = N ln 2 holds true for all N, or there exists some maximal N for which this relation is satisfied. In this case, this maximal N is called the VC (Vapnik-Chervonenkis) dimension and denoted by D. By construction, the VC dimension is the maximal number of points which can be shattered by functions in Λ.

Theorem 6.4. Any growth function either satisfies the equality
$$G^{\Lambda}(N) = N \ln 2$$
or is bounded by the inequality
$$G^{\Lambda}(N) \le D \left( \ln \frac{N}{D} + 1 \right)$$
where D is an integer such that when N = D
$$G^{\Lambda}(D) = D \ln 2, \qquad G^{\Lambda}(D+1) < (D+1) \ln 2$$

The VC dimension of a set of indicator functions Q(z, α) is infinite if the growth function is linear. It is finite and equal to D if the growth function is bounded by a logarithmic function with coefficient D.

The VC dimension quantifies the richness or capacity of a set of functions. If for any N a hypothesis function h(·, α), α ∈ Λ can shatter N points (i.e. separate them in all 2^N possible ways) then G^Λ(N) = N ln 2. In this case, the class of functions has an infinite capacity and there is no ERM convergence (the empirical risk is always zero whatever the functional risk is): no learning from data is possible⁴.

The finiteness of the VC dimension is a necessary and sufficient condition for distribution-independent consistency of ERM learning machines. The VC dimension of the set of linear functions with n + 1 model parameters is equal to D = n + 1. Note that, though for this specific class the VC dimension equals the number of free parameters, this is not necessarily true for other families of functions. For instance, it can be shown that the VC dimension of the highly wiggly set of functions
$$h(x, \alpha) = \sin(\alpha x), \quad \alpha \in \mathbb{R}$$

is infinite though it has a single parameter. At the same time, one can have sets of functions with an infinite number of parameters yet a finite VC dimension.

⁴Note that, in Popper's terminology (Section 2.6), this corresponds to a non-scientific situation where no dataset may falsify the hypothesis, or equivalently it is always possible to find a hypothesis justifying what we observe. Since the class of hypotheses is too rich, no falsification (and thus no generalisation or scientific discovery) is possible.

Generally speaking, the VC dimension of a set of functions can be either larger

than or smaller than the number of parameters. The VC dimension of the set of

functions (rather than the number of parameters) is responsible for the generalisa-

tion ability of learning machines.

Once D is defined, the relation between the empirical and the functional risk of a class of functions with finite VC dimension is made explicit by the bound (7.6.33), where the second summand is
$$E(D, N, \delta) = 4\, \frac{D \left( \ln \frac{2N}{D} + 1 \right) - \ln(\delta/4)}{N}$$

The reliability of the empirical risk as an approximation of the functional risk depends on the ratio N/D. If N/D is large (i.e. sample size much larger than the VC dimension), the E term is small and the empirical risk is a good approximation of the functional risk. In other terms, minimising the empirical risk guarantees a small value of the (expected) risk. On the contrary, if N/D is small (i.e. number of samples comparable to the VC dimension), a small empirical risk R_emp(α_N) does not guarantee a small value of the actual risk. In other terms, a small empirical risk could be an optimistic (and thus biased) estimator of the associated functional risk. In those configurations, to minimise the actual risk R(α_N) it is recommended to address both terms of the confidence interval (e.g. by considering alternative classes of hypotheses).
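As a numerical illustration, here is a minimal R sketch of the confidence term E(D, N, δ) and of the guaranteed risk (7.6.33); the values of R_emp, N, D and δ below are hypothetical:

## VC confidence term E(D, N, delta) and guaranteed risk (7.6.33)
vc.term <- function(D, N, delta) {
  4 * (D * (log(2 * N / D) + 1) - log(delta / 4)) / N
}
Remp <- 0.1                              # hypothetical empirical risk
N <- 10000; D <- 10; delta <- 0.05
Remp + sqrt(vc.term(D, N, delta)) / 2    # guaranteed risk bound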

7.7 Generalisation error

In the previous section we presented how Vapnik [185, 186, 187] formalised the

learning task as the minimisation of functional risk R ( αN ) in a situation where the

joint distribution is unknown. This section focuses on the algorithm-based criterion

GN (Equation (7.5.24)) as a measure of the generalisation error of the learning

machine.

In particular we will study how the generalisation error can be decomposed in

the regression formulation and in the classification formulation.

7.7.1 The decomposition of the generalisation error in regression

Let us focus now on the g_N measure (Equation (7.5.22)) of the generalisation error in the case of regression. In the case of a quadratic loss
$$L(y(x), h(x, \alpha)) = (y(x) - h(x, \alpha))^2 \qquad (7.7.34)$$
the quantity g_N is often referred to as the mean squared error (MSE) and its marginal (7.5.24) as the mean integrated squared error (MISE). If the regression dependency is described in the regression plus noise form (7.4.21), the conditional target density can be written as
$$p_y(y - f(x)|x) = p_y(y - E_y[y|x]\,|\,x) = p_w(w) \qquad (7.7.35)$$
where w is a noise random variable with zero mean and variance σ_w^2.

This supervised learning problem can be seen as a particular instance of the

estimation problem discussed in Chapter 5, where, for a given x , the unknown

parameter θ to be estimated is the quantity f (x ) and the estimator based on the


training set is θ̂ = h(x, α_N). The MSE quantity, defined in (5.5.14), coincides, apart from an additional term, with the term (7.5.22) since

$$\begin{aligned}
g_N(x) = E_{D_N,y}[L|x] &= E_{D_N,y}\left[(y - h(x, \alpha_N))^2\right] && (7.7.36)\text{-}(7.7.37) \\
&= E_{D_N,y}\left[(y - E_y[y|x] + E_y[y|x] - h(x, \alpha_N))^2\right] && (7.7.38) \\
&= E_{D_N,y}\left[(y - E_y[y|x])^2 + 2w(E_y[y|x] - h(x, \alpha_N)) + (E_y[y|x] - h(x, \alpha_N))^2\right] && (7.7.39)\text{-}(7.7.40) \\
&= E_y\left[(y - E_y[y|x])^2\right] + E_{D_N}\left[(h(x, \alpha_N) - E_y[y|x])^2\right] && (7.7.41) \\
&= E_y[w^2] + E_{D_N}\left[(h(x, \alpha_N) - E_y[y|x])^2\right] && (7.7.42) \\
&= \sigma_w^2 + E_{D_N}\left[(f(x) - h(x, \alpha_N))^2\right] = \sigma_w^2 + E_{D_N}[(\theta - \hat{\theta})^2] && (7.7.43) \\
&= \sigma_w^2 + \text{MSE} && (7.7.44)
\end{aligned}$$

Note that y = f(x) + w = E_y[y|x] + w, f is fixed but unknown, and that the noise term w is independent of D_N and satisfies E[w] = 0 and E[w^2] = σ_w^2.

We can then apply the bias/variance decomposition (5.5.14) to the regression problem where θ = f(x) and θ̂ = h(x, α_N):

$$\begin{aligned}
g_N(x) = E_{D_N,y}[L(x, y)] &= \sigma_w^2 + E_{D_N}\left[(h(x, \alpha_N) - E_y[y|x])^2\right] \\
&= \sigma_w^2 && \text{noise variance} \\
&\;\; + \left(E_{D_N}[h(x, \alpha_N)] - E_y[y|x]\right)^2 && \text{squared bias} \\
&\;\; + E_{D_N}\left[(h(x, \alpha_N) - E_{D_N}[h(x, \alpha_N)])^2\right] && \text{model variance} \\
&= \sigma_w^2 + B^2(x) + V(x) && (7.7.45)
\end{aligned}$$

In a regression task, the bias B(x) measures the difference in x between the average of the outputs of the hypothesis functions over the set of possible D_N and the regression function value f(x) = E_y[y|x]. The variance V(x) reflects the variability of the guessed h(x, α_N) as one varies over training sets of fixed dimension N. This quantity measures how sensitive the algorithm is to changes in the data set, regardless of the target. Thus, according to Eq. (7.5.24), by averaging (7.7.45) over X we obtain

$$\text{MISE} = G_N = \sigma_w^2 + \int_{\mathcal{X}} B^2(x)\, dF_x + \int_{\mathcal{X}} V(x)\, dF_x \qquad (7.7.46)$$

where the three terms are

1. the intrinsic noise term reflecting the target alone,

2. the integrated squared bias reflecting the target's relation with the learning

algorithm and

3. the integrated variance term reflecting the learning algorithm alone.

As the aim of a learning machine is to minimise the quantity GN and the com-

putation of (7.7.46) requires the knowledge of the joint input/output distribution,

this decomposition could appear as a useless theoretical exercise. In practical set-

tings, the designer of a learning machine does not have access to the term GN but

can only estimate it on the basis of the training set. Nevertheless, the bias/variance

decomposition is relevant in practical learning too since it provides a useful hint

about how to control the error GN . In particular, the bias term measures the lack

of representational power of the class of hypotheses. This means that to reduce the


bias term of the generalisation error we should consider classes of hypotheses with a large capacity s, or in other words hypotheses which can approximate a large number of input/output mappings. On the other hand, the variance term warns us against an excessive capacity (or complexity) s of the approximator. This means that a class of too powerful hypotheses runs the risk of being excessively sensitive to the noise affecting the training set; therefore, our class Λ_s could contain the target but it could be practically impossible to identify it on the basis of the available dataset.

In other terms, it is commonly said that a hypothesis with large bias but low variance underfits the data while a hypothesis with low bias but large variance overfits the data. In both cases, the hypothesis gives a poor representation of the target and a reasonable trade-off needs to be found.

Figure 7.17: Bias/variance/noise tradeoff in regression: a qualitative representation of the relationship between the hypothesis' bias and variance and the capacity of the class of functions. The MISE generalisation error is the sum of the three terms (squared bias, hypothesis variance and noise variance) as shown in (7.7.46). Note that the variance of the noise is supposed to be target independent and therefore constant.

A graphical illustration of the bias/variance/noise tradeoff (7.7.46) is given in Figure 7.17. The left side of the figure corresponds to an underfitting configuration where the model has too low a capacity (i.e. high bias) to capture the nonlinearity of the regression function. The right side of the figure corresponds to an overfitting configuration where the model capacity is too large (i.e. high variance), thus leading to high instability and poor generalisation. Note that Figure 7.17 requires a formal definition of the notion of capacity and that it is only a qualitative visualisation of the theoretical link between the hypothesis' properties and the capacity of the class of functions. Nevertheless it provides useful hints about the impact of the learning procedure on the final generalisation accuracy. The task of the model designer is to search for the optimal trade-off between the variance and the bias terms (ideally the capacity s* in Figure 7.17), on the basis of the available training set. Section 7.9 will discuss how this search proceeds in practice in a real setting.


Two naive predictors

Consider a regression task y = f(x) + w, where Var[w] = σ_w^2, and two naive predictors:

1. h^{(1)}(x) = 0
2. h^{(2)}(x) = (∑_{i=1}^N y_i)/N

What about their generalisation errors at x = x̄? By using (7.7.45) we obtain

1. g_N^{(1)}(x̄) = σ_w^2 + f(x̄)^2
2. g_N^{(2)}(x̄) = σ_w^2 + (f(x̄) - E[y])^2 + Var[y]/N

The script naive.R executes a Monte Carlo validation of the formulas above.
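Such a validation can be sketched as follows under the x^3 example of Section 7.5.1 (the actual naive.R script may differ); there f(x̄) = 1 at x̄ = 1, σ_w^2 = 1, E[y] = 0 and Var[y] = 71/7:

## Monte Carlo check of the two naive predictors at xbar = 1
set.seed(0)
f <- function(x) x^3
xbar <- 1; N <- 100; S <- 10000
g1 <- g2 <- numeric(S)
for (s in 1:S) {
  X <- runif(N, -2, 2); Y <- f(X) + rnorm(N)   # training set
  y.test <- f(xbar) + rnorm(1)                 # test target at x = xbar
  g1[s] <- (y.test - 0)^2                      # h1: always predict 0
  g2[s] <- (y.test - mean(Y))^2                # h2: predict the training mean
}
mean(g1)    # about sigma2 + f(xbar)^2 = 2
mean(g2)    # about sigma2 + (f(xbar) - E[y])^2 + Var[y]/N, i.e. roughly 2.1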

7.7.2 The decomposition of the generalisation error in classification

Let us consider a classification task with K output classes and a loss function L. For a given input x, we denote by ŷ the class predicted by the classifier h(x, α_N) trained with a dataset D_N. We derive the analytical expression of g_N(x), usually referred to as the mean misclassification error (MME).

$$\begin{aligned}
\text{MME}(x) &= E_{y,D_N}[L(y, h(x, \alpha_N))|x] = E_{y,D_N}[L(y, \hat{y})] && (7.7.47) \\
&= E_{y,D_N}\left[ \sum_{k,j=1}^{K} L(j,k)\, 1(\hat{y} = c_j|x)\, 1(y = c_k|x) \right] && (7.7.48) \\
&= \sum_{k,j=1}^{K} L(j,k)\, E_{D_N}[1(\hat{y} = c_j|x)]\, E_y[1(y = c_k|x)] && (7.7.49) \\
&= \sum_{k,j=1}^{K} L(j,k)\, \text{Prob}\{\hat{y} = c_j|x\}\, \text{Prob}\{y = c_k|x\} && (7.7.50)
\end{aligned}$$
where 1(·) is the indicator function which returns zero when the argument is false and one otherwise. Note that the distribution of ŷ depends on the training set D_N while the distribution of y is the distribution of a test set (independent of D_N).

For the zero-one loss function, since y and ŷ are independent, the MME expression simplifies to
$$\begin{aligned}
\text{MME}(x) &= \sum_{k,j=1}^{K} 1(c_j \neq c_k)\, \text{Prob}\{\hat{y} = c_j|x\}\, \text{Prob}\{y = c_k|x\} \\
&= 1 - \sum_{k,j=1}^{K} 1(c_j = c_k)\, \text{Prob}\{\hat{y} = c_j|x\}\, \text{Prob}\{y = c_k|x\} \\
&= 1 - \sum_{k} \text{Prob}\{\hat{y} = c_k|x\}\, \text{Prob}\{y = c_k|x\} = \text{Prob}\{y \neq \hat{y}\} \qquad (7.7.51)
\end{aligned}$$


A decomposition of a related quantity was proposed in [198]. Let us consider

the squared sum:

$$\frac{1}{2} \sum_{j=1}^{K} \left(\text{Prob}\{y = c_j\} - \text{Prob}\{\hat{y} = c_j\}\right)^2 = \frac{1}{2} \sum_{j=1}^{K} \text{Prob}\{y = c_j\}^2 + \frac{1}{2} \sum_{j=1}^{K} \text{Prob}\{\hat{y} = c_j\}^2 - \sum_{j=1}^{K} \text{Prob}\{y = c_j\}\, \text{Prob}\{\hat{y} = c_j\}$$

By adding one to both members and by using (7.7.51) we obtain a decomposition analogous to the one in (7.7.45):
$$\begin{aligned}
g_N(x) = \text{MME}(x) &= \frac{1}{2}\left(1 - \sum_{j=1}^{K} \text{Prob}\{y = c_j|x\}^2\right) && \text{("noise")} \\
&\;\; + \frac{1}{2} \sum_{j=1}^{K} \left(\text{Prob}\{y = c_j|x\} - \text{Prob}\{\hat{y} = c_j|x\}\right)^2 && \text{("squared bias")} \\
&\;\; + \frac{1}{2}\left(1 - \sum_{j=1}^{K} \text{Prob}\{\hat{y} = c_j|x\}^2\right) && \text{("variance")} \qquad (7.7.52)
\end{aligned}$$

The noise term measures the degree of uncertainty of y and consequently the degree of stochasticity of the dependence. It equals zero if and only if there exists a class c such that Prob{y = c|x} = 1. Note that this quantity depends neither on the learning algorithm nor on the training set.

The variance term measures how variable the classifier prediction ŷ = h(x, α_N) is. This quantity is zero if the predicted class is always the same regardless of the training set.

The squared bias term measures the squared difference between the y and the ŷ probability distributions on the domain Y.
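The identity behind (7.7.52) is easy to verify numerically. A minimal R sketch with hypothetical posterior distributions p (target) and phat (classifier):

## decomposition (7.7.52) for hypothetical class posteriors over K = 3 classes
p    <- c(0.7, 0.2, 0.1)              # Prob{y = c_j | x}
phat <- c(0.6, 0.3, 0.1)              # Prob{yhat = c_j | x}
noise    <- 0.5 * (1 - sum(p^2))
bias2    <- 0.5 * sum((p - phat)^2)
variance <- 0.5 * (1 - sum(phat^2))
noise + bias2 + variance              # equals MME(x) = 1 - sum(p * phat)
1 - sum(p * phat)                     # check of the identity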

7.8 The hypothesis-based vs the algorithm-based approach

In the previous sections we introduced two different ways of assessing the accuracy of a learning machine. The reader could logically raise the following question: which approach is the most adequate in practice?

Instead of providing a direct answer to such a question, we prefer to conduct a short comparison of the assumptions and limitations related to the two approaches.

The hypothesis-based approach formulates learning as the problem of finding

the hypothesis which minimises the functional risk. Vapnik reformulates this prob-

lem into the problem of consistency of a learning process based on ERM. The main

result is that it is possible to define a probabilistic distribution-free bound on the

functional risk which depends on the empirical risk and the VC dimension of the

class of hypotheses. Though this achievement is impressive from a theoretical and

scientific perspective (it was published in a Russian book in the 60s), its adop-

tion in practical settings is not always easy for several reasons: results derive from

asymptotic considerations though learning by definition deals with finite samples,

the computation of the VC dimension is explicit only for specific classes of hypoth-

esis functions and the bound, derived from worst-case analysis, is not always tight

enough for practical purposes.


The algorithm-based approach relies on the possibility of emulating the stochas-

tic process underlying the dataset by means of resampling procedures like cross-

validation or bootstrap. Note that this approach is explicitly criticised by Vapnik

and others who consider it inappropriate to reason in terms of data generation once

a single dataset is available. According to [58], "averaging over the data would be unnatural, because in a given application, one has to live with the data at hand. It would be marginally useful to know the number GN as this number would indicate the quality of an average data sequence, not your data sequence". Nevertheless, though it is hard to formally guarantee the accuracy of a resampling strategy, its general-purpose nature, simplicity and ease of implementation have been, over the years, key ingredients of its success.

Whatever the degree of realism of the assumptions made by the two approaches, it is worth making a pragmatic and historical consideration. Though the Vapnik results represent a major scientific success and underlie the design of powerful

learning machines (notably SVM), in a wider perspective it is fair to say that cross-

validation is the most common and successful workhorse of practical learning appli-

cations. This means that, though most data scientists have been eager to formalise

the consistency of their algorithms in terms of Vapnik bounds, in practice they had

recourse to intensive cross-validation tricks to make it work in the real world. Now,

more than 60 years after the first computational version of learning processes, we

have enough evidence to say that cross-validation is a major element of the machine

learning success story. This is the reason why in the following sections we will focus

on an algorithm-based approach aiming to assess (and minimise) the generalisation

error by means of a resampling strategy.

7.9 The supervised learning procedure

The goal of supervised learning is to return the hypothesis with the lowest gen-

eralisation error. Since we assume that data samples are generated in a random

way, there is no hypothesis which gives a null generalisation error. Therefore, the

generalisation error GN of the hypothesis returned by a learning machine has to

be compared to the minimal generalisation error that can be attained by the best

single-valued mapping. Let us define by Λ* the set of all possible single-valued mappings h : X → Y and consider the hypothesis
$$\alpha^* = \arg\min_{\alpha \in \Lambda^*} R(\alpha) \qquad (7.9.53)$$

where R(α) has been defined in (7.2.3).

Thus, R(α*) represents the absolute minimum rate of error obtainable by a single-valued approximator of the unknown target. To maintain a simple notation, we put G* = R(α*). For instance, in our illustrative example in Section 7.5.1, α* denotes the parameters of the cubic function and G* amounts to the unit variance of the Gaussian noise.

In theoretical terms, a relevant issue is to demonstrate that the generalisation error G_N of the model with parameters α_N learned from the dataset D_N converges to the minimum G* for N going to infinity. Unfortunately, in real learning settings, two problems must be dealt with. The first is that the error G_N cannot be computed directly but has to be estimated from data. The second is that a single class Λ might not be large enough to contain the hypothesis α*.

A common practice to handle these problems is to decompose the learning pro-

cedure in the following sequence of steps:

1. A nested sequence of classes of hypotheses
$$\Lambda_1 \subseteq \dots \subseteq \Lambda_s \subseteq \dots \subseteq \Lambda_S \qquad (7.9.54)$$


is defined so that Λ = ∪_{s=1}^S Λ_s, where s denotes the capacity of the class. This guarantees that the set of hypotheses taken into consideration will necessarily contain the best hypothesis α*.
A priori information as well as considerations related to the bias/variance dilemma can help in the design of this sequence.

2. For each class in the sequence, a hypothesis h(·, α_N^s), s = 1, ..., S, is selected by minimising the empirical risk (7.2.8). This step is defined as the parametric identification step of the learning procedure.

3. For each class in the sequence, a validation procedure returns Ĝ_N^s which estimates the generalisation error G_N^s of the hypothesis α_N^s. This step is called the validation step of the learning procedure.

4. The hypothesis h(·, α_N^{\bar{s}}) ∈ Λ_{\bar{s}} with
$$\bar{s} = \arg\min_{s} \hat{G}_N^s \qquad (7.9.55)$$
is returned as the final outcome. This final step is called the model selection step.

Figure 7.18: Bias/variance/noise tradeoff and model selection: since the generalisation error (e.g. MISE) is not accessible in practical settings, model selection is performed on the basis of an estimation (dotted line) which may induce an error (and a variability) in the selection (7.9.55) of the best capacity.

In order to accomplish the learning procedure, and specifically the selection in (7.9.55),

we need an estimation of the generalisation error (Section 7.10). However, since the

estimator of the generalisation error may be affected by an error (as any estima-

tor), this may induce an error and a variability in the model selection step (7.9.55)

(Figure 7.18).
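The four steps can be sketched in R on the example of Section 7.5.1, with polynomial classes of increasing degree and a simple holdout estimate of the generalisation error (an illustrative simplification of the validation techniques of Section 7.10):

## nested classes: polynomials of degree s = 1, ..., 5
set.seed(0)
D <- data.frame(x = runif(100, -2, 2))
D$y <- D$x^3 + rnorm(100)
itr <- 1:70                              # training / validation split
Ghat <- sapply(1:5, function(s) {
  h <- lm(y ~ poly(x, s, raw = TRUE), data = D[itr, ])  # parametric identification
  mean((D$y[-itr] - predict(h, D[-itr, ]))^2)           # validation step
})
which.min(Ghat)                          # model selection step (7.9.55)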


7.10 Validation techniques

This section discusses validation methods to estimate the generalisation error GN

from a finite set of N observations.

The empirical risk (also called apparent error) R_emp(α_N) introduced in (7.2.7) could be the most intuitive estimator of G_N. However, it is generally known that the

empirical risk is a biased (and optimistic) estimate of GN and that Remp (αN ) tends

to be smaller than GN , because the same data have been used both to construct

and to evaluate h (·, αN ). A demonstration of the biasedness of the empirical risk

for a quadratic loss function in a regression setting is available in Appendix C.14.

In Section 9.1.16 we will analytically derive the biasedness of the empirical risk in the case of linear regression models.

The study of error estimates other than the apparent error is of significant

importance if we wish to obtain results applicable to practical learning scenarios.

There are two main ways to obtain better, i.e. unbiased, estimates of GN : the first

requires some knowledge on the distribution underlying the data set, the second

makes no assumptions on the data. As we will see later, an example of the first

approach is the FPE criterion (presented in Section 9.1.16.2) while examples of the

second approach are the resampling procedures.

7.10.1 The resampling methods

Cross-validation [176] is a well-known method in sampling statistics to circumvent

the limits of the apparent error estimate. The basic idea of cross-validation is that

one builds a model from one part of the data and then uses that model to predict

the rest of the data. The dataset D_N is split l times into a training and a test subset, the first containing N_tr examples, the second containing N_ts = N - N_tr examples. Each time, N_tr examples are used by the parametric identification algorithm L to select a hypothesis α^i_{N_tr}, i = 1, ..., l, from Λ, and the remaining N_ts examples are used to estimate the error of h(·, α^i_{N_tr}) (Fig. 7.19)

$$\hat{R}_{\text{ts}}(\alpha^i_{N_{tr}}) = \sum_{j=1}^{N_{ts}} L\left( y_j, h(x_j, \alpha^i_{N_{tr}}) \right) \qquad (7.10.56)$$

The resulting average of the l errors R̂_ts(α^i_{N_tr}), i = 1, ..., l, is the cross-validation estimate
$$\hat{G}_{\text{cv}} = \frac{1}{l} \sum_{i=1}^{l} \hat{R}_{\text{ts}}(\alpha^i_{N_{tr}}) \qquad (7.10.57)$$

A common form of cross-validation is the "leave-one-out" (l-o-o). Let D^{(i)} be the training set with z_i removed, and h(x, α_{N(i)}) be the corresponding prediction rule. The l-o-o cross-validated error estimate is
$$\hat{G}_{\text{loo}} = \frac{1}{N} \sum_{i=1}^{N} L\left( y_i, h(x_i, \alpha_{N(i)}) \right) \qquad (7.10.58)$$
In this case l equals the number of training points and N_ts = 1.
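As an illustration, here is a minimal R sketch of the leave-one-out estimate (7.10.58) for the one-parameter linear hypothesis of Section 7.5.1 (reusing X, Y and N as generated by regr.R):

## leave-one-out cross-validation for h(x, alpha) = alpha * x
loo <- sapply(1:N, function(i) {
  alphaNi <- sum(X[-i] * Y[-i]) / sum(X[-i]^2)   # identification without z_i
  (Y[i] - alphaNi * X[i])^2                      # quadratic loss on the held-out point
})
Gloo <- mean(loo)
Gloo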

Bootstrap (Section 6.4) is also used to return a nonparametric estimate of GN ,

by repeatedly sampling the training cases with replacement. Since the empirical risk is a biased (optimistic) estimate of the generalisation error and bootstrap is an effective

method to assess bias (Section 6.4.3), it follows that bootstrap plays a role in a

validation strategy.

A bootstrap sample D^{(b)} is a "fake" dataset {z_{1b}, z_{2b}, ..., z_{Nb}}, b = 1, ..., B, randomly selected from the training set {z_1, z_2, ..., z_N} with replacement.


Figure 7.19: Partition of the training dataset in the i-th fold of cross-validation. The quantity N^i_tr is the number of training points while N^i_ts is the number of test points.

Efron and Tibshirani [65] proposed to use bootstrap to correct the bias (or optimism) of the empirical risk by adopting a strategy similar to Section 6.4.3. Equation (6.4.9) estimates the bias of an estimator by computing the gap between the average bootstrap estimate (6.4.8) and the sample estimation. In the case of generalisation, the sample estimation θ̂ is the empirical risk and the bootstrap estimate θ̂_{(·)} may be computed as follows

$$\hat{G}_{(\cdot)} = \frac{1}{B} \sum_{b=1}^{B} \left[ \sum_{i=1}^{N} P_{ib}\, L\left( y_i, h(x_i, \alpha^{(b)}) \right) \right] \qquad (7.10.59)$$

where P_{ib} indicates the proportion of the bootstrap sample D^{(b)}, b = 1, ..., B, containing the i-th training point z_i,
$$P_{ib} = \frac{\#_{j=1}^{N}(z_{jb} = z_i)}{N} \qquad (7.10.60)$$
and α^{(b)} is the output of the parametric identification performed on the set D^{(b)}.

The difference between the empirical risk and (7.10.59),
$$\text{Bias}_{\text{bs}} = \frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{N} \left( P_{ib} - \frac{1}{N} \right) L\left( y_i, h(x_i, \alpha^{(b)}) \right) \qquad (7.10.61)$$
is the bias correction term to be subtracted from the empirical risk to obtain a bootstrap bias-corrected estimate (6.4.10) of the generalisation error.

An alternative consists in using the holdout principle in combination with the

bootstrap one [65]. Since each bootstrap set is a resampling of the original training

set, it may happen that some of the original examples (called out-of-bag ) do not

belong to it: we can then use them to have an independent holdout set to be used


for generalisation assessment. The bootstrap estimation of the generalisation error

(also known as E0) is then

$$\hat{G}_{\text{bs}} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|B^{(i)}|} \sum_{b \in B^{(i)}} L\left( y_i, h(x_i, \alpha^{(b)}) \right) \qquad (7.10.62)$$

where B^{(i)} is the set of bootstrap samples which do not contain the i-th point and |B^{(i)}| is its size. The terms where |B^{(i)}| = 0 are discarded.
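A minimal R sketch of the out-of-bag estimate (7.10.62), again for the linear hypothesis of Section 7.5.1 (X, Y and N as generated by regr.R):

## out-of-bag (E0) bootstrap estimate of the generalisation error
B <- 200
alpha.b <- numeric(B)
inbag <- matrix(FALSE, B, N)
for (b in 1:B) {
  idx <- sample(N, replace = TRUE)                 # bootstrap sample D(b)
  inbag[b, unique(idx)] <- TRUE
  alpha.b[b] <- sum(X[idx] * Y[idx]) / sum(X[idx]^2)
}
Gbs <- mean(sapply(1:N, function(i) {
  oob <- which(!inbag[, i])                        # samples not containing z_i
  if (length(oob) == 0) return(NA)                 # discard if |B(i)| = 0
  mean((Y[i] - alpha.b[oob] * X[i])^2)
}), na.rm = TRUE)
Gbs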

7.11 Concluding remarks

The goal of a learning procedure is to return a hypothesis which is able to predict

accurately the outcome of an input/output probabilistic mapping on the basis of

past observations. In order to achieve this goal, the learning procedure has to deal

with three major difficulties.

Minimisation of the empirical risk: in a general case finding the global mini-

mum of the empirical risk as in (7.2.7) demands the resolution of a multivari-

ate and nonlinear optimisation problem for which an analytical solution may not exist. Some heuristics to address this issue are discussed in Section 8.6.

Finite number of data: in real problems, a single random realisation of the sta-

tistical process, made of a finite number of input/output pairs, is accessible to

the learning machine. This means that the hypothesis generated by a learning algorithm is a random variable as well. In theory, it would be required

to have access to the underlying process and to generate several times the

training set, in order to have a reliable assessment of the learning algorithm.

In practice, the use of repeated realisations is not viable in a real learning

problem.

The validation procedure copes with this problem by trying to assess a random

variable on the basis of a single realisation. In particular we focused on cross-

validation, a resampling method which works by simulating the stochastic

process underlying the data.

No a priori knowledge: we consider a setting where no knowledge about the

process underlying the data is available. This lack of a priori knowledge puts

no constraints on the complexity of the class of hypotheses to consider, with

the consequent risk of using an inadequate type of approximator. The model

selection deals with this issue by considering classes of hypotheses of increasing

complexity and selecting the one which behaves the best according to the

validation criteria. This strategy ensures the covering of the whole spectrum of

approximators, ranging from low bias/high variance to high bias/low variance

models, making it easier to select a good trade-off on the basis of the available data.

So far, the learning problem has been introduced and discussed for a generic class

of hypotheses, and we purposely did not distinguish between different learning

machines. The following chapter will show the parametric and the structural iden-

tification procedure as well as the validation phase for some specific learning ap-

proaches.

7.12 Exercises

1. Consider an input/output regression task where n = 1, E[y|x] = sin(πx/2) and p(y|x) = N(sin(πx/2), σ^2), σ = 0.1 and x ∼ U(-2, 2). Let N be the size of the training set and consider a quadratic loss function.
Let the class of hypotheses be h_M(x) = α_0 + ∑_{m=1}^M α_m x^m with α_j ∈ [-2, 2], j = 0, ..., M.

For N = 20 generate S = 50 replicates of the training set. For each replicate,

estimate the value of the parameters that minimise the empirical risk, compute the

empirical risk and the functional risk.

1. Plot the evolution of the distribution of the empirical risk for M = 0, 1, 2.

2. Plot the evolution of the distribution of the functional risk for M = 0, 1, 2.

Hints: to minimise the empirical risk, perform a grid search in the space of parameter values, i.e. by sweeping all the possible values of the parameters in the set [-1, -0.9, -0.8, ..., 0.8, 0.9, 1]. To compute the functional risk generate a set of N_ts = 10000 i.i.d. input/output testing examples.

Solution: See the file Exercise6.pdf in the directory gbcode/exercises of the

companion R package (Appendix F).

Chapter 8

The machine learning procedure

8.1 Introduction

Raw data is rarely of direct benefit. Its true value resides in the amount of informa-

tion that a model designer can extract from it. Modelling from data is often viewed

as an art form, mixing the insight of the expert with the information contained in the

observations. Typically, a modelling process is not a sequential process but is better

represented as a sort of loop with a lot of feedback and a lot of interactions with

the designer. Different steps are repeated several times aiming to reach, through

continuous refinements, a good model description of the phenomenon underlying

the data.

This chapter reviews the practical steps constituting the process of constructing

models for accurate prediction from data. Note that the overview is presented with

the aim of not distinguishing between the different families of approximators and

of showing which procedures are common to the majority of modelling approaches.

The following chapters will be instead devoted to the discussion of the peculiarities

of specific learning approaches.

We partition the data modelling process into two phases: a preliminary phase

which leads from the raw data to a structured training set, and a learning phase,

which leads from the training set to the final model. The preliminary phase is

made of a problem formulation step (Section 8.2) where the designer selects the

phenomenon of interest and defines the relevant input/output features, an experi-

mental design step (Section 8.3) where input/output data are collected, and a data

preprocessing step (Section 8.4) where preliminary conversion and filtering of data

is performed.

Once the numeric dataset has been formatted, the learning procedure begins.

In qualitative terms, this procedure can be described as follows. First, the designer

defines a set of models (e.g. polynomial models, neural networks) characterised by

a capacity (or complexity) index (or hyper-parameter) (e.g. degree of the polynomial, number of neurons, VC dimension) which controls the approximation power of the model. According to the capacity index, the set of models is consequently decomposed into a nested sequence of classes of models (e.g. classes of polynomials

with increasing degree). Hence, a structural identification procedure loops over the

set of classes, first by identifying a parametric model for each class (parametric

identification) and then by assessing the prediction error of the identified model on

the basis of the finite set of points (validation ). Finally, a model selection procedure

selects the final model to be used for future predictions. A common alternative to



model selection is a model combination, where a combination (e.g. averaging) of

the most promising models is used to return a meta-model, presumably with better

accuracy properties.

The problem of parametric identification is typically a problem of multivariate

optimisation [2]. Section 8.6 introduces the most common optimisation algorithms

for linear and nonlinear configurations. Structural identification is discussed in

Section 8.8, which focuses on the existing methods for model generation, model

validation and model selection. The last section concludes and summarises the whole modelling process with the support of a diagram.

8.2 Problem formulation

The problem formulation is the preliminary and somewhat the most critical step of

a learning procedure. The model designer chooses a particular application domain

(e.g. finance), a phenomenon to be studied (e.g. the credit risk of a customer)

and hypothesises the existence of an unknown dependency (e.g. between the finan-

cial situation of the customer and the default risk) which is to be estimated from

experimental data. First, the modeller specifies a set of constructs , i.e. abstract

concepts or high-level topics which are potentially relevant for the study (e.g. the

profile and the financial situation of a client). Second, a set of variables (e.g. the

client age and her salary) is defined by grounding the constructs into a measurable

form. Finally, an operationalisation, i.e. the definition of how to measure those

variables (e.g. by accessing a bank database), is proposed.

In this step, domain-specific knowledge and experience are the most crucial

requirements to come up with a meaningful problem formulation. Note that the

time spent in this phase is usually highly rewarding and can save vast amounts

of modelling time. There is often no substitute for physical intuition and human

analytical skills.

8.3 Experimental design

The most precious thing in data-driven modelling is the data itself. No matter how

powerful a learning method is, the resulting model would be ineffective if the data

are not informative enough. Hence, it is necessary to devote a great deal of attention

to the process of observing and collecting the data. In input/output modelling it is

essential that the training set be a representative sample of the phenomenon and

cover the input space adequately. To this aim, it is relevant to consider the relative

importance of the various areas of the input space. Some regions are more relevant

than others, as in the case of a dynamical system whose state has to be regulated

about some specified operating point.

The discipline of creating an optimal sampling of the input space is called exper-

imental design [72]. The study of experimental design is concerned with locating

training input data in the space of input variables so that the performance of the

modelling process is maximised. However, in some cases, the designer cannot ma-

nipulate the process of collecting data, and the modelling process has to deal with

what is available. This configuration, which is common to many real problems, is

called the observational setting [42]. Though this setting seems the most adequate

for a learning approach ("just learn from what you observe"), it is worth reminding

that most of the time, behind an observation setting, there is the strong implicit

assumption that the observations are i.i.d. samples of a stationary (i.e. invariant)

stochastic process. Now, in most realistic cases, this assumption is not valid (at

least not for a long time), and considerations of nonstationarity, drift should be

8.4. DATA PRE-PROCESSING 183

integrated in the learning process. Other problems are related to the poor causal

value of inferences made in an observational setting, e.g. in situations of sampling

bias or non-observable variables. Nevertheless, given the introductory nature of

this book, in what follows, we will limit to consider the simplest observational and

stationary setting.

8.4 Data pre-processing

Once data have been recorded, it is common practice to pre-process them. The hope

is that such treatment might make learning easier and improve the final accuracy.

Pre-processing includes a large set of actions on the observed data, and some of

them are worth being discussed:

Numerical encoding. Some interesting data for learning might not be in a nu-

meric format (e.g. text, image). Since, in what follows, we will assume that

all data are numeric, a preliminary conversion or encoding step is needed.

Given that most encoding procedures are domain-specific, we will not further

discuss them here.

Missing data treatment. In real applications, it often happens that some input

values are missing. If the quantity of data is sufficiently large, the simplest

solution is to discard the examples having missing features. When the amount

of data is too restricted or there are too many partial examples, it becomes

important to adopt some specific technique to deal with the problem. Various

heuristics [93], as well as methods based on the Expectation Maximisation

(EM) algorithm [79], have been proposed in the literature. Note that any

missing data treatment strategy makes assumptions about the process that

caused some observations to be missing (e.g. missing at random or not) [124]:

it is recommended to be aware of such assumptions before applying them (see

also Section 13.7.4 on selection bias).

Categorical variables. It may be convenient to encode categorical variables, specifically in situations where they may take a very large number of values (e.g. names of retailers in a business intelligence application). Two common ways to deal with them are: i) replace them with dummy variables encoding the different values in binary terms (e.g. $K$ bits for $K$ categories); ii) replace each category with numerical values informative about the conditional distribution of the target given such category: for instance, in regression (binary classification) we could replace a category $x =$ "black" with an estimation of $E[y | x = \text{"black"}]$ ($\text{Prob}\{y = 1 | x = \text{"black"}\}$).
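To make the two encodings concrete, here is a minimal R sketch (the data frame, the column names and the category values are illustrative assumptions, not taken from the book's scripts):

## Two encodings of a categorical variable; data are hypothetical.
set.seed(0)
D <- data.frame(colour = sample(c("black", "red", "white"), 20, replace = TRUE),
                y = rnorm(20))
## i) dummy encoding: K binary columns for the K categories
X.dummy <- model.matrix(~ colour - 1, data = D)
## ii) conditional-mean encoding: replace each category with an estimate of
##     E[y | x = category] computed on the training data
cond.means <- tapply(D$y, D$colour, mean)
x.encoded  <- cond.means[as.character(D$colour)]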

Feature selection. The feature selection problem consists in selecting a relevant

subset of input variables in order to maximise the performance of the learning

machine. This approach is useful if there are inputs that carry only little useful

information or are strongly correlated. In these situations, a dimensionality reduction improves the performance by reducing the variance of the estimator at the cost of a slight increase in the bias.

Several techniques exist for feature selection, such as conventional methods

in linear statistical analysis [59], principal component analysis [139] and the

general wrapper approach [117]. For more details, we refer the reader to

Chapter 12.

Outliers removal. Outliers are unusual data values that are not consistent with most of the observations. Commonly, outliers are due to wrong measurement procedures, storage errors and coding malfunctions. There are two common strategies to deal with outliers: the first is performed at the preprocessing stage [71] and consists in their detection and consequent removal; the second is to delay their treatment to the model identification level by adopting robust methodologies [105] that are by design insensitive to outliers.

Other common preprocessing operations are pre-filtering to remove noise effects, anti-aliasing to deal with sampled signals, variable scaling to standardise all variables to have a zero mean and a unity variance¹, compensation of nonlinearities, or the integration of some domain-specific information to reduce the distorting effects of measurable disturbances.

8.5 The dataset

The outcome of the pre-processing phase is a dataset in a tabular numeric form

where each row represents a particular observation (also called instance, example

or data point), and each column a descriptive variable (also called feature, attribute

or covariate). We denote the dataset

$$D_N = \{z_1, z_2, \ldots, z_N\}$$

where $N$ is the number of examples, $n$ is the number of features, the $i$th example is an input/output pair $z_i = \langle x_i, y_i \rangle$, $i = 1, \ldots, N$, $x_i$ is a $[n \times 1]$ input vector and $y_i$ is a scalar output.

Note that hereafter, for the sake of simplicity, we will restrict ourselves to a regression setting. We will assume the input/output data to be i.i.d. generated by the following stochastic dependency:

$$y = f(x) + w, \qquad (8.5.1)$$

where $E[w] = 0$ and $\sigma_w^2$ is the noise variance.

The noise term $w$ is supposed to lump together all the unmeasured contributions to the variability of $y$, like, for instance, missing or non-observable variables. There are two main assumptions underlying Equation (8.5.1). The first is that the noise is independent of the input and has a constant variance. This assumption, also called homoskedasticity in econometrics, is typically made in machine learning because of the primary focus on the dependency $f$ and the lack of effective methodologies to assess it a priori in nonlinear and high dimensional settings. The reader should be aware, however, that heteroskedastic configurations may have a strong impact on the final model accuracy and that some a priori output variable transformation and/or a posteriori assessment is always recommended (e.g. study of the residual distribution after fitting). The second assumption is that noise enters additively to the output. Sometimes the measurements of the inputs to the system may also be noise corrupted; in the system identification literature, this is what is called the error-in-variable configuration [9]. As far as this problem is concerned, we adopt the pragmatic approach proposed by Ljung [125], which assumes that the measured input values are the actual inputs and that their deviations from the correct values propagate through $f$ and lump into the noise $w$.

In the following, we will refer to the set of vectors $x_i$ and $y_i$ through the following matrices:

1. the input matrix $X$ of dimension $[N \times n]$ whose $i$th row is the vector $x_i^T$,

2. the output vector $Y$ of dimension $[N \times 1]$ whose $i$th component is the scalar $y_i$.

¹ This can be easily done with numeric inputs by the R command scale.


8.6 Parametric identification

Assume that a class of hypotheses $h(\cdot, \alpha)$ with $\alpha \in \Lambda$ has been fixed. The problem of parametric identification from a finite set of data consists in seeking the hypothesis whose vector of parameters $\alpha_N \in \Lambda$ minimises the loss function

$$R_{\text{emp}}(\alpha) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, h(x_i, \alpha)) \qquad (8.6.2)$$

This phase of the learning procedure requires the resolution of the optimisation

task (7.2.7). In this section, we will review some of the most common algorithms

that address the problem of parametric identification in linear and nonlinear cases.

To make the notation more readable, henceforth we will define the error function

$$J(\alpha) = R_{\text{emp}}(\alpha)$$

and we will formulate the optimisation problem (7.2.7) as

$$\alpha_N = \arg\min_{\alpha \in \Lambda} J(\alpha). \qquad (8.6.3)$$

Also, we will use the term model as a synonym for hypothesis.

8.6.1 Error functions

The choice of an optimisation algorithm is strictly dependent on the form of the error function $J(\alpha)$. The function $J(\alpha)$ is directly determined by two factors:

1. the form of the model $h(\cdot, \alpha)$ with $\alpha \in \Lambda$,

2. the loss function $L(y, h(x, \alpha))$ for a generic $x$.

As far as the cost function is concerned, there are many possible choices depending on the type of data analysis problem. In regression problems, the goal is to model the conditional distribution of the output variable conditioned on the input variable (see Section 7.4), whose mean is the value minimising the mean squared error (Equation 3.3.29). This motivates the use of a quadratic function

$$L(y, h(x, \alpha)) = (y - h(x, \alpha))^2 \qquad (8.6.4)$$

which gives to $J(\alpha)$ the form of a sum-of-squares.

For classification problems, the goal is to model the posterior probabilities of class membership, again conditioned on the input variables. Although the sum-of-squares $J$ can also be used for classification, there are more appropriate error functions to be considered [60]. The most used is cross-entropy, which derives from the adoption of the maximum-likelihood principle (Section 5.8) for supervised classification. Consider a classification problem where the output variable $y$ takes values in the set $\{c_1, \ldots, c_K\}$ and $\text{Prob}\{y = c_j | x\}$, $j = 1, \ldots, K$, is the conditional probability. Given a training dataset and a set of parametric models $\hat{P}_j(x, \alpha)$, $j = 1, \ldots, K$, of the conditional distribution, the classification problem boils down to the minimisation of the quantity

$$J(\alpha) = -\sum_{i=1}^{N} \log \hat{P}_{y_i}(x_i, \alpha) \qquad (8.6.5)$$

Note that the models $\hat{P}_j(x, \alpha)$, $j = 1, \ldots, K$, must satisfy two important constraints:

1. $\hat{P}_j(x, \alpha) > 0$,

2. $\sum_{j=1}^{K} \hat{P}_j(x, \alpha) = 1$.

In the case of a 0/1 binary classification problem, the cross-entropy is written as

$$J(\alpha) = -\sum_{i=1}^{N} \left[ y_i \log \hat{P}_1(x_i, \alpha) + (1 - y_i) \log(1 - \hat{P}_1(x_i, \alpha)) \right] \qquad (8.6.6)$$

where $\hat{P}_1(x_i, \alpha)$ is the estimation of the conditional probability of the class $y = 1$.

Since this chapter will focus mainly on regression problems, we will limit ourselves to the case of a quadratic loss function.

8.6.2 Parameter estimation

8.6.2.1 The linear least-squares method

The parametric identification of a linear model

$$h(x, \alpha) = \alpha^T x$$

is obtained by minimising the quadratic function $J(\alpha)$ by the well-known linear least-squares method. In Chapter 9 we will see the linear least-squares minimisation in detail. Here we just report that, in the case of non-singularity of the matrix $X^T X$, $J(\alpha)$ has a single global minimum in

$$\alpha_N = (X^T X)^{-1} X^T Y \qquad (8.6.7)$$

8.6.2.2 Iterative search methods

In general cases, when either the model is not linear or the cost function is not quadratic, $J(\alpha)$ can be a highly nonlinear function of the parameters $\alpha$, and there may exist many minima, all of which satisfy

$$\nabla J = 0 \qquad (8.6.8)$$

where $\nabla$ denotes the gradient of $J$ in parameter space. We will define as stationary points all the points which satisfy condition (8.6.8). They include local maxima, saddle points and minima. The minimum for which the value of the error function is the smallest is called the global minimum, while the other minima are called local minima. As a consequence of the nonlinearity of the error function $J(\alpha)$, it is in general not possible to find closed-form solutions for the minima. For more details on multivariate optimisation we refer the reader to [2].

We will consider iterative algorithms, which involve a search through the parameter space consisting of a succession of steps of the form

$$\alpha^{(\tau+1)} = \alpha^{(\tau)} + \Delta\alpha^{(\tau)} \qquad (8.6.9)$$

where $\tau$ labels the iteration step. Iterative algorithms differ in the choice of the increment $\Delta\alpha^{(\tau)}$.

In the following, we will present some gradient-based and non-gradient-based iterative algorithms. Note that each algorithm has a preferred domain of application and that it is not possible, nor fair, to recommend a single universal optimisation algorithm. We consider it much more interesting to highlight the relative advantages and limitations of the different approaches.


8.6.2.3 Gradient-based methods

In some cases, the analytic form of the error function makes it possible to evaluate the gradient of the cost function $J$ with respect to the parameters $\alpha$, increasing the rate of convergence of the iterative algorithm. Some examples of gradient-based methods are reported in the following sections. Those methods require the derivatives of the cost function with respect to the model parameters. Such a computation is not always easy for complex nonlinear mappings, but it has recently been facilitated by the appearance of automatic differentiation functionalities [17], like the ones made available by libraries like TensorFlow or PyTorch.

8.6.2.4 Gradient descent

It is the simplest of the gradient-based optimisation algorithms, also known as steepest descent. This algorithm starts with some initial guess $\alpha^{(0)}$ for the parameter vector (often chosen at random). Then, it iteratively updates the parameter vector such that, at the $\tau$th step, the estimate is updated by moving a short distance in the direction of the negative gradient evaluated in $\alpha^{(\tau)}$:

$$\Delta\alpha^{(\tau)} = -\mu \nabla J(\alpha^{(\tau)}) \qquad (8.6.10)$$

where $\mu$ is called the learning rate. The updates are repeatedly executed until convergence, i.e. when further improvements are considered to be too small to be useful.

The gradient descent method is known to be a very inefficient procedure. One drawback is the need for a suitable value of the learning rate $\mu$. In fact, a decrease of the cost function is guaranteed by (8.6.10) only for learning rates of infinitesimal size: if its value is sufficiently small, it is expected that the value of $J(\alpha^{(\tau)})$ will decrease at each successive step, eventually leading to a parameter vector at which the condition (8.6.8) is satisfied. Too small learning rates may considerably delay the convergence, while too large rates might result in numerical overflows.

Further, at most points in the parameter space, the local gradient does not point directly towards the minimum: gradient descent then needs many small steps to reach a stationary point.

Example of gradient-based univariate optimisation

Let us consider the univariate function

$$J(\alpha) = \alpha^2 - 2\alpha + 3$$

visualised in Figure 8.1. By running the script optim.R you can visualise the gradient search of the minimum of the function. Note that the gradient is obtained by computing analytically the derivative

$$J'(\alpha) = 2\alpha - 2$$

We invite the reader to assess the impact of the learning rate $\mu$ on the convergence of the minimisation process.

The function

$$J(\alpha) = \alpha^4/4 - \alpha^3/3 - \alpha^2 + 2$$

with two local minima is shown in Figure 8.2 and minimised in the script optim2.R. We invite the reader to assess the impact of the initial value $\alpha^{(0)}$ of the solution on the result of the minimisation process.
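Though the full visualisation is left to the scripts, a minimal R sketch of the gradient descent iteration (8.6.10) on the first function may be written as follows (the learning rate, the initial guess and the stopping rule are illustrative choices, not those of optim.R):

## Gradient descent on J(alpha) = alpha^2 - 2*alpha + 3, with J'(alpha) = 2*alpha - 2
J      <- function(alpha) alpha^2 - 2 * alpha + 3
grad.J <- function(alpha) 2 * alpha - 2
mu    <- 0.1     # learning rate
alpha <- 3       # initial guess alpha^(0)
for (tau in 1:1000) {
  alpha.new <- alpha - mu * grad.J(alpha)    # step (8.6.10)
  if (abs(alpha.new - alpha) < 1e-8) break   # stop when improvements are tiny
  alpha <- alpha.new
}
print(alpha)     # close to the global minimum alpha = 1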


Figure 8.1: Gradient-based minimisation of the function $J(\alpha) = \alpha^2 - 2\alpha + 3$ with a single global minimum.

Figure 8.2: Gradient-based minimisation of the function $J(\alpha) = \alpha^4/4 - \alpha^3/3 - \alpha^2 + 2$.


Figure 8.3: Contour plot: gradient-based minimisation of the bivariate function $J(\alpha) = \alpha_1^2 + \alpha_2^2 - 2\alpha_1 - 2\alpha_2 + 6$ with a single global minimum.

Example of gradient-based bivariate optimisation

Let us consider the bivariate function

$$J(\alpha) = \alpha_1^2 + \alpha_2^2 - 2\alpha_1 - 2\alpha_2 + 6$$

whose contour plot is visualised in Figure 8.3. By running the script optim2D.R you can visualise the gradient search of the minimum of the function. Note that the gradient is obtained by computing analytically the gradient vector

$$\nabla J(\alpha) = [2\alpha_1 - 2, \; 2\alpha_2 - 2]^T$$

We invite the reader to assess the impact of the learning rate $\mu$ on the convergence of the minimisation process.

The contour plot of the function

$$J(\alpha) = \frac{\alpha_1^4 + \alpha_2^4}{4} - \frac{\alpha_1^3 + \alpha_2^3}{3} - \alpha_1^2 - \alpha_2^2 + 4$$

with three local minima is shown in Figure 8.4 and minimised in the script optim2D2.R. We invite the reader to assess the impact of the initial value $\alpha^{(0)}$ of the solution on the result of the minimisation process.

As an alternative to the simplest gradient descent, there are many iterative methods in the literature, such as the momentum-based method [155], the enhanced gradient descent method [191] and the conjugate gradient techniques [157], which make implicit use of second-order derivatives of the error function.

In the following section, we present instead a class of algorithms that makes explicit use of second-order information.

8.6.2.5 The Newton method

Figure 8.4: Contour plot: gradient-based minimisation of the function $J(\alpha) = \frac{\alpha_1^4 + \alpha_2^4}{4} - \frac{\alpha_1^3 + \alpha_2^3}{3} - \alpha_1^2 - \alpha_2^2 + 4$.

Newton's method is a well-known example in the optimisation literature. It is an iterative algorithm which uses, at the $\tau$th step, a local quadratic approximation in the neighbourhood of $\alpha^{(\tau)}$:

$$\hat{J}(\alpha) = J(\alpha^{(\tau)}) + (\alpha - \alpha^{(\tau)})^T \nabla J(\alpha^{(\tau)}) + \frac{1}{2} (\alpha - \alpha^{(\tau)})^T H(\alpha^{(\tau)}) (\alpha - \alpha^{(\tau)}) \qquad (8.6.11)$$

where $H(\alpha^{(\tau)})$ is the Hessian matrix of $J(\alpha)$ computed in $\alpha^{(\tau)}$ and

$$H(\alpha) = \begin{bmatrix} \frac{\partial^2 J}{\partial \alpha_1^2} & \frac{\partial^2 J}{\partial \alpha_1 \partial \alpha_2} & \cdots & \frac{\partial^2 J}{\partial \alpha_1 \partial \alpha_p} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 J}{\partial \alpha_p \partial \alpha_1} & \frac{\partial^2 J}{\partial \alpha_p \partial \alpha_2} & \cdots & \frac{\partial^2 J}{\partial \alpha_p^2} \end{bmatrix}$$

is a $[p, p]$ square matrix of second-order partial derivatives if $\alpha \in \mathbb{R}^p$.

The minimum of (8.6.11) satisfies

$$\alpha_{\min} = \alpha^{(\tau)} - H^{-1}(\alpha^{(\tau)}) \nabla J(\alpha^{(\tau)}) \qquad (8.6.12)$$

where the vector $H^{-1}(\alpha^{(\tau)}) \nabla J(\alpha^{(\tau)})$ is denoted as the Newton direction or the Newton step and forms the basis for the iterative strategy

$$\alpha^{(\tau+1)} = \alpha^{(\tau)} - H^{-1} \nabla J(\alpha^{(\tau)}) \qquad (8.6.13)$$

There are several difficulties with such an approach, mainly related to the prohibitive computational demand. Alternative approaches, known as quasi-Newton or variable metric methods, are based on (8.6.12) but, instead of calculating the Hessian directly and then evaluating its inverse, they build up an approximation to the inverse Hessian. The two most commonly used update formulae are the Davidon-Fletcher-Powell (DFP) and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) procedures [128].

8.6.2.6 The Levenberg-Marquardt algorithm

This algorithm is designed specifically for minimising a sum-of-squares error function

$$J(\alpha) = \frac{1}{2} \sum_{i=1}^{N} L(y_i, h(x_i, \alpha)) = \frac{1}{2} \sum_{i=1}^{N} e_i^2 = \frac{1}{2} \|e\|^2 \qquad (8.6.14)$$

where $e_i$ is the error for the $i$th training case, $e$ is the $[N \times 1]$ vector of errors and $\|\cdot\|$ is the 2-norm. Let us consider an iterative step in the parameter space

$$\alpha^{(\tau)} \rightarrow \alpha^{(\tau+1)} \qquad (8.6.15)$$

If the step (8.6.15) is sufficiently small, the error vector $e$ can be expanded in a first-order Taylor series form:

$$e(\alpha^{(\tau+1)}) = e(\alpha^{(\tau)}) + E(\alpha^{(\tau+1)} - \alpha^{(\tau)}) \qquad (8.6.16)$$

where the generic element of the matrix $E$ is in the form

$$E_{ij} = \frac{\partial e_i}{\partial \alpha_j} \qquad (8.6.17)$$

and $\alpha_j$ is the $j$th element of the vector $\alpha$. The error function can then be approximated by

$$J(\alpha^{(\tau+1)}) = \frac{1}{2} \left\| e(\alpha^{(\tau)}) + E(\alpha^{(\tau+1)} - \alpha^{(\tau)}) \right\|^2 \qquad (8.6.18)$$

If we minimise with respect to $\alpha^{(\tau+1)}$, we obtain:

$$\alpha^{(\tau+1)} = \alpha^{(\tau)} - (E^T E)^{-1} E^T e(\alpha^{(\tau)}) \qquad (8.6.19)$$

where $(E^T E)^{-1} E^T$ is the pseudo-inverse of the matrix $E$. For the sum-of-squares error function (8.6.14), the elements of the Hessian take the form

$$H_{jk} = \frac{\partial^2 J}{\partial \alpha_j \partial \alpha_k} = \sum_{i=1}^{N} \left( \frac{\partial e_i}{\partial \alpha_j} \frac{\partial e_i}{\partial \alpha_k} + e_i \frac{\partial^2 e_i}{\partial \alpha_j \partial \alpha_k} \right) \qquad (8.6.20)$$

Neglecting the second term, the Hessian can be written in the form

$$H \simeq E^T E \qquad (8.6.21)$$

This relation is exact in the case of linear models, while in the case of nonlinearities it represents an approximation that holds exactly only at the global minimum of the function [25]. The update formula (8.6.19) could be used as the step of an iterative algorithm. However, the problem with such an approach is that the step returned by (8.6.19) could be too large, making the linear approximation no longer valid.

The idea of the Levenberg-Marquardt algorithm is to use the iterative step while, at the same time, trying to keep the step size small so as to guarantee the validity of the linear approximation. This is achieved by modifying the error function in the form

$$J_{lm} = \frac{1}{2} \left\| e(\alpha^{(\tau)}) + E(\alpha^{(\tau+1)} - \alpha^{(\tau)}) \right\|^2 + \lambda \left\| \alpha^{(\tau+1)} - \alpha^{(\tau)} \right\|^2 \qquad (8.6.22)$$

where $\lambda$ is a parameter that governs the step size. The minimisation of the error function (8.6.22) ensures, at the same time, the minimisation of the sum-of-squares cost and a small step size. Minimising (8.6.22) with respect to $\alpha^{(\tau+1)}$, we obtain

$$\alpha^{(\tau+1)} = \alpha^{(\tau)} - (E^T E + \lambda I)^{-1} E^T e(\alpha^{(\tau)}) \qquad (8.6.23)$$

where $I$ is the unit matrix. For very small values of $\lambda$ we have the Newton formula (8.6.19), while for large values of $\lambda$ we recover the standard gradient descent.

A common approach for setting $\lambda$ is to begin with some arbitrary low value (e.g. $\lambda = 0.1$) and at each step (8.6.23) check the change in $J$. If $J$ decreases, the new parameter is retained, $\lambda$ is decreased (e.g. by a factor of 10), and the process repeated. Otherwise, if $J$ increased after the step (8.6.23), the old parameter is restored, $\lambda$ is increased, and a new step performed. The procedure is iterated until a decrease in $J$ is obtained [25].
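A minimal R sketch of this λ-adaptation loop for a simple nonlinear model $h(x, \alpha) = \alpha_1 e^{\alpha_2 x}$ may look as follows (the model, the dataset and the λ schedule are illustrative assumptions, not taken from the book's scripts):

## Levenberg-Marquardt steps (8.6.23) for h(x, alpha) = a1 * exp(a2 * x)
set.seed(0)
x <- seq(0, 2, length.out = 50)
y <- 2 * exp(0.5 * x) + rnorm(50, sd = 0.1)
h <- function(a, x) a[1] * exp(a[2] * x)
J <- function(a) 0.5 * sum((y - h(a, x))^2)
alpha <- c(1, 1); lambda <- 0.1
for (tau in 1:50) {
  e <- y - h(alpha, x)                                # error vector
  E <- cbind(-exp(alpha[2] * x),                      # E[i,j] = d e_i / d alpha_j
             -alpha[1] * x * exp(alpha[2] * x))
  step <- solve(t(E) %*% E + lambda * diag(2), t(E) %*% e)
  alpha.new <- alpha - as.vector(step)                # update (8.6.23)
  if (J(alpha.new) < J(alpha)) {                      # J decreased: accept, shrink lambda
    alpha <- alpha.new; lambda <- lambda / 10
  } else {                                            # J increased: reject, grow lambda
    lambda <- lambda * 10
  }
}
print(alpha)    # close to the generating parameters (2, 0.5)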


8.6.3 Online gradient-based algorithms

The algorithms above are called batch since they compute the gradient of the quantity (8.6.2) over the entire training set. In the case of very large datasets or sequential settings, this procedure is not recommended since it requires the storage of the entire dataset. For this reason, online modifications of the batch algorithms have been proposed in the literature. The idea consists of replacing the gradient computed on the entire training set with a gradient computed on the basis of a single data point:

$$\alpha^{(\tau+1)} = \alpha^{(\tau)} - \mu^{(\tau)} \nabla_\alpha J^{(\tau)}(\alpha^{(\tau)}) \qquad (8.6.24)$$

where

$$J^{(\tau)}(\alpha^{(\tau)}) = (y_\tau - h(x_\tau, \alpha^{(\tau)}))^2$$

and $z_\tau = \langle x_\tau, y_\tau \rangle$ is the input/output observation at the $\tau$th instant.

The underlying assumption is that the training error obtained by replacing the average with a single term will not perturb the average behaviour of the algorithm. Note also that the dynamics of $\mu^{(\tau)}$ plays an important role in the convergence.

This algorithm can be easily used in an adaptive online setting where no training set needs to be stored, and observations are processed immediately to improve performance. A linear version of the iteration (8.6.24) is the Recursive Least Squares regression algorithm presented in Section 9.1.20. Note also that the earliest machine learning algorithms were based on sequential gradient-based minimisation. Well-known examples are Adaline and LMS [195].
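A minimal R sketch of the online iteration (8.6.24) for a univariate linear model is given below (the data-generating process and the learning rate schedule are illustrative assumptions):

## Online gradient descent for h(x, alpha) = alpha[1] + alpha[2] * x
set.seed(0)
N <- 500
x <- rnorm(N)
y <- 1 + 2 * x + rnorm(N, sd = 0.5)
alpha <- c(0, 0)
for (tau in 1:N) {                 # one pass over the stream of observations
  mu  <- 1 / (10 + tau)            # decreasing learning rate mu^(tau)
  err <- y[tau] - (alpha[1] + alpha[2] * x[tau])
  alpha <- alpha + 2 * mu * err * c(1, x[tau])   # gradient step on a single point
}
print(alpha)    # close to the generating parameters (1, 2)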

8.6.4 Alternatives to gradient-based methods

Virtually no gradient-based method is guaranteed to find the global optimum of a complex nonlinear error function. Additionally, all descent methods are deterministic in the sense that they inevitably lead to convergence to the nearest local minimum. As a consequence, the way a deterministic method is initialised is decisive for the final result.

Further, in many practical situations, the gradient-based computation is time-consuming or extremely difficult due to the complexity of the objective function. For these reasons, a lot of derivative-free and stochastic alternatives to gradient-based methods have been explored in the literature. We will limit ourselves to citing the most common solutions:

Random search methods. They are iterative methods that are primarily used

for continuous optimisation problems. Random search methods explore the

parameter space of the error function sequentially in a random fashion to find

the global minimum. Their strength lies mainly in their simplicity, which

makes these methods easily understood and conveniently customised for spe-

cific applications. Moreover, it has been demonstrated that they converge

to the global optimum with probability one on a compact set. However, the

theoretical result of convergence to the minimum is not really important here

since the optimisation process could take a prohibitively long time.

Genetic algorithms. They are derivative-free stochastic optimisation methods

based loosely on the concepts of natural selection and evolutionary processes

[82]. Important properties are the strong parallelism and the possibility to

be applied to both continuous and discrete optimisation problems. Typically,

Genetic Algorithms (GA) encode each parameter solution into a binary bit

string (chromosome) and associate each solution with a fitness value. GAs

usually keep a set of solutions (population ) which is evolved repeatedly toward

a better overall fitness value. In each generation, the GA constructs a new


population using genetic operators such as crossover or mutation; members

with higher fitness values are more likely to survive and to participate in fu-

ture operations. After a number of generations, the population is expected to

contain members with better fitness values and to converge, under particular

conditions, to the optimum.

Simulated annealing. It is another derivative-free method suitable for continuous and discrete optimisation problems. In Simulated Annealing (SA), the value of the cost function $J(\alpha)$ to be minimised is put in analogy to the energy in a thermodynamic system at a certain temperature $T$ [115]. At high temperatures $T^{(\tau)}$, the SA technique allows function evaluations at points far away from $\alpha^{(\tau)}$, and it is likely to accept a new parameter value with a higher function value. The decision whether to accept or reject a new parameter value $\alpha^{(\tau+1)}$ is based on the value of an acceptance function, generally shaped as the Boltzmann probability distribution. At low temperatures, SA evaluates the objective function at more local points, and the likelihood of accepting a new point with a higher cost is much lower. An annealing schedule regulates how rapidly the temperature $T$ goes from high values to low values as a function of time or iteration count.

R script

The script grad.R compares four parameter identification algorithms in the case

of a univariate linear model: least-squares, random search, gradient-based and

Levenberg-Marquardt.

8.7 Regularisation

Parameter identification relies on Empirical Risk Minimisation to return an estimator in supervised learning problems. In Section 7.7, we stressed that the accuracy of such an estimator depends on the bias/variance trade-off, which is typically controlled by capacity-related hyper-parameters. There is, however, another important strategy, called regularisation, to control the bias/variance trade-off by constraining the ERM problem. The rationale consists in restricting the set of possible solutions by transforming the unconstrained problem (8.6.3) into a constrained one. An example of constrained minimisation is

$$\alpha_N = \arg\min_{\alpha \in \Lambda} \left[ J(\alpha) + \lambda \|\alpha\|^2 \right] \qquad (8.7.25)$$

where $\lambda > 0$ is the regularisation parameter. By adding the squared norm term, solutions with large values of the components of $\alpha$ are penalised unless they play a major role in the $J(\alpha)$ term. An alternative version is

$$\alpha_N = \arg\min_{\alpha \in \Lambda} \left[ J(\alpha) + \lambda S(h(\cdot, \alpha)) \right] \qquad (8.7.26)$$

where the term $S$ penalises non-smooth, wiggling hypothesis functions. An example is

$$S(\alpha) = \int (h''(x, \alpha))^2 \, dx$$

where the integral of the second derivative is a measure of lack of smoothness.

Regularisation is a well-known strategy in numerical analysis and optimisation to avoid or limit the ill-conditioning of the solution. In estimation, regularisation is an additional way to control the variance of the estimator resulting from the optimisation procedure. This is particularly effective in learning problems with a number of observations comparable to, or even smaller than, the input dimension.

Note, however, that variance reduction occurs at the cost of both an increased bias and an additional complexity of the optimisation problem. For instance, according to the nature of the regularisation and the nonlinearity of the hypothesis function, we could lose some interesting properties, like the closed form of the solution. In Chapter 12, we will show some examples of regularisation to address the curse of dimensionality problem.
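For the squared-norm penalty (8.7.25) applied to a linear model, the constrained solution keeps a closed form; a minimal R sketch (with an illustrative dataset and an arbitrary λ) is:

## Ridge-regularised least squares: minimise ||y - X alpha||^2 + lambda ||alpha||^2
set.seed(0)
N <- 30; n <- 20                          # few observations w.r.t. input dimension
X <- matrix(rnorm(N * n), N, n)
y <- X %*% c(rep(1, 5), rep(0, n - 5)) + rnorm(N)
lambda <- 1                               # regularisation parameter
alpha.ridge <- solve(t(X) %*% X + lambda * diag(n), t(X) %*% y)
## for lambda -> 0 we recover the unconstrained solution (8.6.7), if it exists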

8.8 Structural identification

Once a class of models Λ is given, the identification procedure described above returns a model $h(\cdot, \alpha_N)$ defined by the set of parameters $\alpha_N \in \Lambda$.

The choice of an appropriate class or model structure [125] is, however, the most crucial aspect for a successful modelling application. The procedure of selecting the best model structure is called structural identification. The structure of a model is made of a series of features that influence the generalisation power of the model itself. Among others, there are:

The type of the model. We can distinguish, for example, between nonlinear

and linear models, between physically based and black box representations,

between continuous-time or discrete-time systems.

The size of the model. This is related to features like the number of inputs

(regressors), the number of parameters, the degree of the polynomials in a

polynomial model, the number of neurons in a neural network, the number of

nodes in a classification tree, etc.

In general terms, structural identification requires (i) a procedure for proposing

a series of alternative model structures, (ii) a method for assessing each of these

alternatives and (iii) a technique for choosing among the available candidates.

We denote the first issue as model generation. Some techniques for obtaining

different candidates to the final model structure are presented in Section 8.8.1.

The second issue concerns the important problem of model validation , and will

be extensively dealt with in Section 8.8.2.

Once models have been generated and validated, the last step is the model selection, which will be discussed in Section 8.8.3.

It is important to remark that a selected model structure should never be ac-

cepted as a final and true description of the phenomenon. Rather, it should be

regarded as a good enough description, given the available dataset.

8.8.1 Model generation

The goal of the model generation procedure is to generate a set of candidate model structures among which the best one is to be selected. The more effective this procedure is, the easier the selection of a powerful structure at the end of the whole identification will be. Traditionally, there have been a number of popular ways to search through a large collection of model structures. Maron and Moore [131] distinguish between two main methods of model generation:

Brute force. This is the exhaustive approach. Every possible model structure is generated in order to be evaluated.

Consider, for instance, the problem of selecting the best structure in a 3-layer Feed Forward Neural Network architecture. The brute force approach consists of enumerating all the possible configurations in terms of the number of neurons.

The exhaustive algorithm runs in a time that is generally unacceptably long for complex architectures with a large number of structural parameters. The advantage, however, is that this method is guaranteed to return the best learner according to the specified assessment measure.

Search methods. These methods treat the collection of models as a continuous

and differentiable surface. They start at some point on the surface and search

for the model structure that corresponds to the minimum of the generalisation

error until some stop condition is met. This procedure is much faster than

brute force since it does not need to explore all the space. It only needs to

validate those models that are on the search path. Gradient-based and/or non-gradient-based methods can be used for the search in the model space. Besides the well-known problem related to local minima in the gradient-based case, a more serious issue derives from the structure of the model selection procedure. At every step of the search algorithm, we need to find a collection of models that are near or related to the current model. Both gradient-based and non-gradient-based techniques require some metric in the search space. This implies a notion of model distance, difficult to define in a general model selection problem. Examples of search methods in model generation are the growing and pruning techniques in Neural Networks structural identification [18].

8.8.2 Validation

The output of the model generation procedure is a set of model structures $\Lambda_s$, $s = 1, \ldots, S$. Once the parametric identification is performed on each of these model structures, we have a set of models $h(\cdot, \alpha^s_N)$ identified according to the Empirical Risk Minimisation principle.

Now, the prediction quality of each one of the model structures $\Lambda_s$, $s = 1, \ldots, S$, has to be assessed on the basis of the available data. In principle, the assessment procedure, known as model validation, could measure the goodness of a structure in many different ways: how the model relates to our a priori knowledge, how easy the model is to use, to implement or to interpret. In this book, as stated in the introduction, we will focus only on criteria of accuracy, neglecting any other criterion of quality.

In the following, we will present the most common techniques to assess a model

on the basis of a finite set of observations.

8.8.2.1 Testing

An obvious way to assess the quality of a learned model is by using a testing sequence

$$D_{ts} = (\langle x_{N+1}, y_{N+1} \rangle, \ldots, \langle x_{N+N_{ts}}, y_{N+N_{ts}} \rangle) \qquad (8.8.27)$$

that is, a sequence of i.i.d. pairs, independent of $D_N$ and distributed according to the probability distribution $P(x, y)$ defined in (7.2.2). The testing estimator is defined by the sample mean

$$\hat{R}_{ts}(\alpha^s_N) = \frac{1}{N_{ts}} \sum_{j=N+1}^{N+N_{ts}} (y_j - h(x_j, \alpha^s_N))^2 \qquad (8.8.28)$$

This estimator is clearly unbiased in the sense that

$$E_{D_{ts}}[\hat{R}_{ts}(\alpha^s_N)] = R(\alpha^s_N) \qquad (8.8.29)$$

When the number of available examples is sufficiently high, the testing technique is an effective validation technique at a low computational cost. A serious problem concerning the practical applicability of this estimate is that it requires a large, independent testing sequence. In practice, unfortunately, an additional set of input/output observations is rarely available.

8.8.2.2 Holdout

The holdout method, sometimes called test sample estimation, partitions the data $D_N$ into two mutually exclusive subsets, the training set $D_{tr}$ and the holdout or test set $D_{ts}$. It is common to assign 2/3 of the data to the training set and the remaining 1/3 to the test set. However, when the training set has a reduced number of cases, the method can present a series of shortcomings, mainly due to the strong dependence of the prediction accuracy on the repartition of the data between the training and the validation set. Assuming that the error $R(\alpha_{N_{tr}})$ decreases as more cases are inserted in the training set, the holdout method is a pessimistic estimator since only a reduced amount of data is used for training. The larger the number of points used for the test set, the higher the bias of the estimate $\alpha_{N_{tr}}$; at the same time, fewer test points imply a larger confidence interval of the estimate of the generalisation error.

8.8.2.3 Cross-validation in practice

In Chapter 7 we focused on the theoretical properties of cross-validation and boot-

strap. Here we will see some more practical details on these validation proce-

dures, commonly grouped under the name of computer-intensive statistical methods

(CISM) [101].

Consider a learning problem with a training set of size N.

In $l$-fold cross-validation the available points are randomly divided into $l$ mutually exclusive test partitions of approximately equal size. The examples not found in each test partition are independently used for selecting the hypothesis, which will be tested on the partition itself (Fig. 7.19). The average error over all the $l$ partitions is the cross-validated error rate.

A special case is the leave-one-out (l-o-o). For a given algorithm and a dataset $D_N$, a hypothesis is generated using $N - 1$ observations and tested on the single remaining one. In leave-one-out, the cross-validation is repeated $l = N$ times: each data point is used as a test case, and each time nearly all the examples are used to design a hypothesis. The error estimate is the average over the $N$ repetitions.

In the general nonlinear case, leave-one-out is computationally quite expensive. This is not true for a linear model, where the PRESS l-o-o statistic is computed as a by-product of the least-squares regression (Section 9.1.17).
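A minimal R sketch of $l$-fold cross-validation for a linear model follows (the number of folds and the data-generating process are illustrative assumptions):

## l-fold cross-validation of a linear model fitted with lm()
set.seed(0)
N <- 100; l <- 10
D <- data.frame(x = rnorm(N))
D$y <- 1 + 2 * D$x + rnorm(N, sd = 0.5)
fold <- sample(rep(1:l, length.out = N))    # random partition into l folds
e2 <- numeric(N)
for (k in 1:l) {
  mod <- lm(y ~ x, data = D[fold != k, ])   # identification on l-1 folds
  e2[fold == k] <- (D$y[fold == k] - predict(mod, D[fold == k, ]))^2
}
print(mean(e2))    # cross-validated estimate of the mean squared error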

8.8.2.4 Bootstrap in practice

Bootstrap is a resampling technique that samples the training set with replacement to return a nonparametric estimate of the desired statistic.

There are many bootstrap estimators, but two are the most commonly used in model validation: the E0 and the E632 bootstrap.

The E0 bootstrap estimator, denoted by $\hat{G}_b$ in (7.10.62), draws, by sampling with replacement from the original training set, $B$ bootstrap training sets, each consisting of $N$ cases. The cases not found in the training group form the test groups. The average of the error rate on the $B$ test groups is the E0 estimate of the generalisation error.

The rationale for the E632 technique is given by Efron [63]. He argues that, while the resubstitution error $R_{\text{emp}}$ is the error rate for patterns that are at "zero" distance from the training set, the patterns contributing to the E0 estimate can be considered as too far out from the training set. Since the asymptotic probability that a pattern will not be included in a bootstrap sample is

$$(1 - 1/N)^N \approx e^{-1} \approx 0.368,$$

the weighted average of $R_{\text{emp}}$ and E0 should involve patterns at the "right" distance from the training set in estimating the error rate:

$$\hat{G}_{E632} = 0.368 \, R_{\text{emp}} + 0.632 \, \hat{G}_b \qquad (8.8.30)$$

where the quantity $\hat{G}_b$ is defined in (7.10.62). The choice of $B$ is not critical as long as it exceeds 100. Efron [63] suggests, however, that $B$ need not be greater than 200.

There are a lot of experimental results on the comparison between cross-validation and bootstrap methods for assessing models [107], [116]. In general terms, only some guidelines can be given to the practitioner [194]:

- For training set sizes greater than 100, use cross-validation; either 10-fold cross-validation or leave-one-out is acceptable.

- For training set sizes less than 100, use leave-one-out.

- For very small training sets ($N < 50$), in addition to the leave-one-out estimator, the $\hat{G}_{E632}$ and the $\hat{G}_b$ estimates may be useful measures.
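A minimal R sketch of the E0 and E632 estimates for a linear model follows ($B$ and the data-generating process are illustrative choices):

## E0 and E632 bootstrap estimates of the generalisation error
set.seed(0)
N <- 50; B <- 200
D <- data.frame(x = rnorm(N))
D$y <- 1 + 2 * D$x + rnorm(N, sd = 0.5)
R.emp <- mean(residuals(lm(y ~ x, data = D))^2)   # resubstitution error
err0 <- numeric(B)
for (b in 1:B) {
  idx <- sample(1:N, N, replace = TRUE)           # bootstrap training set
  out <- setdiff(1:N, idx)                        # left-out cases form the test group
  mod <- lm(y ~ x, data = D[idx, ])
  err0[b] <- mean((D$y[out] - predict(mod, D[out, ]))^2)
}
G.E0   <- mean(err0)                              # E0 estimate
G.E632 <- 0.368 * R.emp + 0.632 * G.E0            # E632 estimate (8.8.30)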

8.8.2.5 Complexity based criteria

In conventional statistics, various criteria have been developed, often in the context of linear models, for assessing the generalisation performance of the learned hypothesis without the use of further validation data. Such criteria aim to understand the relationship between the generalisation performance and the training error. Generally, they take the form of a prediction error, which consists of the sum of two terms

$$\hat{G}_{PE} = R_{\text{emp}} + \text{complexity term} \qquad (8.8.31)$$

where the complexity (or capacity) term represents a penalty that grows as the number of free parameters in the model grows.

This expression quantifies the qualitative consideration that simple models return a high empirical risk with a reduced complexity term, while complex models have a low empirical risk thanks to the high number of parameters. The minimum of the criterion (8.8.31) represents a trade-off between performance on the training set and complexity. Note that the bound (7.6.33) derived from the Vapnik learning theory agrees with the relation (8.8.31) if we take the functional risk as a measure of generalisation.

Let us consider a quadratic loss function and the quantity

$$\widehat{\text{MISE}}_{\text{emp}} = R_{\text{emp}}(\alpha_N) = \min_{\alpha} \frac{\sum_{i=1}^{N} (y_i - h(x_i, \alpha))^2}{N}$$

If the input/output relation is linear and $n$ is the number of input variables, well-known examples of complexity based criteria are:

1. the Final Prediction Error (FPE) (see Section 9.1.16.2 and [6])

$$\text{FPE} = \widehat{\text{MISE}}_{\text{emp}} \, \frac{1 + p/N}{1 - p/N} \qquad (8.8.32)$$

with $p = n + 1$,

2. the Predicted Squared Error (PSE) (see Section 9.1.16.2)

$$\text{PSE} = \widehat{\text{MISE}}_{\text{emp}} + 2 \hat{\sigma}_w^2 \frac{p}{N} \qquad (8.8.33)$$

where $\hat{\sigma}_w^2$ is an estimate of the noise variance. This quantity is also known as the Mallows' $C_p$ statistic [130],

3. the Generalised Cross-Validation (GCV) [48]

$$\text{GCV} = \widehat{\text{MISE}}_{\text{emp}} \, \frac{1}{(1 - p/N)^2} \qquad (8.8.34)$$

A comparative analysis of these different measures is reported in [14].
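A minimal R sketch computing the three criteria for a linear model is given below (the dataset is illustrative; the noise variance is estimated from the residuals):

## FPE (8.8.32), PSE (8.8.33) and GCV (8.8.34) for a linear model with n inputs
set.seed(0)
N <- 100; n <- 3; p <- n + 1
X <- matrix(rnorm(N * n), N, n)
y <- drop(X %*% c(1, 2, -1)) + rnorm(N)
mod <- lm(y ~ X)
MISE.emp <- mean(residuals(mod)^2)
sigma2.w <- sum(residuals(mod)^2) / (N - p)   # estimate of the noise variance
FPE <- MISE.emp * (1 + p / N) / (1 - p / N)
PSE <- MISE.emp + 2 * sigma2.w * p / N
GCV <- MISE.emp / (1 - p / N)^2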

These estimates are computed assuming a linear model underlying the data. Moody [133] introduced the Generalised Prediction Error (GPE) as an estimate of the prediction risk for generic biased nonlinear models. The algebraic form is:

$$\text{GPE} = \widehat{\text{MISE}}_{\text{emp}} + \frac{2}{N} \operatorname{tr}(\hat{V} \hat{R}) \qquad (8.8.35)$$

where $\operatorname{tr}(\cdot)$ denotes the trace, $\hat{V}$ is a nonlinear generalisation of the estimated noise covariance matrix of the target and $\hat{R}$ is the estimated generalised influence matrix. GPE can be expressed in an equivalent form as:

$$\text{GPE} = \widehat{\text{MISE}}_{\text{emp}} + 2 \hat{\sigma}_{\text{eff}}^2 \frac{\hat{p}_{\text{eff}}}{N} \qquad (8.8.36)$$

where $\hat{p}_{\text{eff}} = \operatorname{tr}(\hat{R})$ is the estimated effective number of model parameters, and $\hat{\sigma}_{\text{eff}}^2 = \operatorname{tr}(\hat{V}\hat{R}) / \operatorname{tr}(\hat{R})$ is the estimated effective noise variance in the data. For nonlinear models, $\hat{p}_{\text{eff}}$ is generally not equal to the number of parameters (e.g. the number of weights in a neural network). When the noise in the target variables is assumed to be independent with uniform variance and the squared error loss function is used, (8.8.36) simplifies to:

$$\text{GPE} = \widehat{\text{MISE}}_{\text{emp}} + 2 \hat{\sigma}_w^2 \frac{\hat{p}_{\text{eff}}}{N} \qquad (8.8.37)$$

In the neural network literature, another well-known form of complexity-based criterion is the weight decay technique

$$U(\lambda, \alpha, D_N) = \sum_{i=1}^{N} (y_i - h(x_i, \alpha))^2 + \lambda g(\alpha) \qquad (8.8.38)$$

where the second term penalises either small, medium or large weights of the neurons, depending on the form of $g(\cdot)$. Two common examples of weight decay functions are the ridge regression form $g(\alpha) = \alpha^2$, which penalises large weights, and the Rumelhart form $g(\alpha) = \frac{\alpha^2}{\alpha_0^2 + \alpha^2}$, which penalises weights of intermediate values near $\alpha_0$.

Several roughness penalties, like

$$\int [h''(x)]^2 \, dx,$$

have been proposed too. Their aim is to penalise hypothesis functions that vary too rapidly by controlling large values of the second derivative of $h$.

Another important method for model validation is the minimum-description-length principle proposed by Rissanen [161]. This method proposes to choose the model which induces the shortest description of the available data. Rissanen and Barron [14] have each shown a qualitative similarity between this principle and the complexity-based approaches. For further details, refer to the cited works.


8.8.2.6 A comparison of validation methods

Computer-intensive statistical methods are relatively new and must be measured against more established statistical methods, such as the complexity-based criteria. In the following, we summarise some practical arguments on behalf of one or the other method. The benefits of a CISM method are:

- All the assumptions of prior knowledge on the process underlying the data are discarded.

- The validation technique replaces theoretical analysis by computation.

- Results are generally much easier to grasp for non-theorists.

- No assumption on the statistical properties of the noise is required.

- They return an estimate of the model precision and an interval of confidence.

Arguments on behalf of complexity criteria are:

- The whole dataset can be used for estimating the prediction performance, and no partitioning is required.

- Results valid for linear models remain valid to the extent that the nonlinear model can be approximated by some first-order Taylor expansion in the parameters.

Some results in the literature show the relation existing between resampling and complexity-based methods. For example, an asymptotic relation between a kind of cross-validation and Akaike's measure was derived by Stone [177], under the assumptions that the real model $\alpha$ is contained in the class of hypotheses Λ and that there is a unique minimum for the log-likelihood.

Here we will make the assumption that no a priori information about the correct structure or the quasi-linearity of the process is available. This will lead us to consider computer-intensive methods as the preferred way to validate the learning algorithms.

8.8.3 Model selection criteria

Model selection concerns the final choice of the model structure in the set that has been proposed by model generation and assessed by model validation. In real problems, this choice is typically a subjective issue and is often the result of a compromise between different factors, like the quantitative measures, the personal experience of the designer and the effort required to implement a particular model in practice.

Here we will reduce the subjectivity factors to zero, focusing only on a quantitative criterion of choice. This means that the structure selection procedure will be based only on the indices returned by the methods of Section 8.8.2. We distinguish between two possible quantitative approaches: the winner-takes-all and the combination-of-estimators approach.

8.8.3.1 The winner-takes-all approach

This approach chooses the model structure that minimises the generalisation error according to one of the criteria described in Section 8.8.2.

Consider a set of candidate model structures $\Lambda_s$, $s = 1, \ldots, S$, and an associated measure $\hat{G}(\Lambda_s)$ which quantifies the generalisation error. The winner-takes-all method simply picks the structure

$$\bar{s} = \arg\min_{s=1,\ldots,S} \hat{G}(\Lambda_s) \qquad (8.8.39)$$

that minimises the generalisation error. The model which is returned as the final outcome of the learning process is then $h(\cdot, \alpha^{\bar{s}}_N)$.

From a practitioner's perspective, it may be useful to make explicit the entire winner-takes-all procedure in terms of pseudo-code. Below you will find a compact pseudo-code detailing the structural, parametric, validation and selection steps in the case of a leave-one-out assessment (an R transcription follows the pseudo-code).

1. for $s = 1, \ldots, S$: (structural loop)

   for $j = 1, \ldots, N$:

   (a) Inner parametric identification (for l-o-o): $\alpha^s_{N-1} = \arg\min_{\alpha \in \Lambda_s} \sum_{i=1,\ldots,N,\, i \neq j} (y_i - h(x_i, \alpha))^2$

   (b) $e_j = y_j - h(x_j, \alpha^s_{N-1})$

   $\widehat{\text{MISE}}_{\text{LOO}}(s) = \frac{1}{N} \sum_{j=1}^{N} e_j^2$

2. Model selection: $\bar{s} = \arg\min_{s=1,\ldots,S} \widehat{\text{MISE}}_{\text{LOO}}(s)$

3. Final parametric identification: $\alpha^{\bar{s}}_N = \arg\min_{\alpha \in \Lambda_{\bar{s}}} \sum_{i=1}^{N} (y_i - h(x_i, \alpha))^2$

4. The output prediction model is $h(\cdot, \alpha^{\bar{s}}_N)$.
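A minimal R transcription of this pseudo-code is given below for a family of polynomial structures of increasing degree (the dataset and the candidate structures are illustrative assumptions):

## Winner-takes-all selection with leave-one-out assessment
set.seed(0)
N <- 50; S <- 4                          # S candidate polynomial degrees
D <- data.frame(x = runif(N, -2, 2))
D$y <- 1 + D$x - D$x^2 + rnorm(N, sd = 0.5)
MISE.loo <- numeric(S)
for (s in 1:S) {                         # structural loop
  e <- numeric(N)
  for (j in 1:N) {                       # leave-one-out loop
    mod  <- lm(y ~ poly(x, s), data = D[-j, ])    # inner parametric identification
    e[j] <- D$y[j] - predict(mod, D[j, , drop = FALSE])
  }
  MISE.loo[s] <- mean(e^2)
}
s.best    <- which.min(MISE.loo)         # model selection
mod.final <- lm(y ~ poly(x, s.best), data = D)   # final parametric identification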

8.8.3.2 The combination of estimators approach

The winner-takes-all approach is intuitively the approach which should work the best. However, recent results in machine learning [153] show that the performance of the final model can be improved not by choosing the model structure which is expected to predict the best but by creating a model whose output is the combination of the outputs of models having different structures.

The reason for this non-intuitive result is that, in reality, any chosen hypothesis $h(\cdot, \alpha_N)$ is only an estimate of the real target (Figure 7.18) and, like any estimate, is affected by a bias and a variance term.

Section 5.10 presented some results on the combination of estimators. The extension of these results to supervised learning is the idea which underlies the first results in combination [16] and which has led later to more enhanced forms of averaging different models.

Consider $m$ different models $h(\cdot, \alpha_j)$ and assume they are unbiased and uncorrelated. By (5.10.32), (5.10.33) and (5.10.34), the combined model is

$$h(\cdot) = \frac{\sum_{j=1}^{m} \frac{1}{\hat{v}_j} h(\cdot, \alpha_j)}{\sum_{j=1}^{m} \frac{1}{\hat{v}_j}} \qquad (8.8.40)$$

where $\hat{v}_j$ is an estimation of the variance of $h(\cdot, \alpha_j)$. This is an example of the generalised ensemble method (GEM) [153].

More advanced applications of the combination principle to supervised learning will be discussed in Chapter 11.

8.9 Partition of dataset in training, validation and test

The main challenge of machine learning consists of using a finite size dataset for i) learning several predictors, ii) assessing them, iii) selecting the most promising one and finally iv) returning it together with a reliable estimate of its generalisation error.

Section 7.10 discussed the need of avoiding correlation between training and validation examples. While the training set is used for parametric identification, a non-overlapping portion of the dataset (validation set) should be used to estimate the generalisation error of the model candidates.

The use of validation (or cross-validation) does not prevent, however, a risk of overfitting inherent to the winner-takes-all model selection. If we take the minimum generalisation error $\hat{G}(\Lambda_{\bar{s}})$ in (8.8.39) as the generalisation error of the winning model, we have an optimistic estimation again. This is known as selection bias, i.e. the bias that occurs when we make a selection in a stochastic setting, due to the fact that the expectation of the minimum is lower than the minimum of the expectations (Appendix C.11).

A nested cross-validation strategy [40] is recommended to avoid such bias. If we have enough observations (i.e. large $N$), the strategy consists in randomly partitioning (e.g. 50%, 25%, 25%) the labelled dataset into three parts: a training set, a validation set, and a test set. The test portion is supposed to be used for the unbiased assessment of the generalisation error of the model $\bar{s}$ in (8.8.39). It is important to use only this set to assess the generalisation accuracy of the chosen model. For this reason, the test set should be carefully made inaccessible to the learning process (and ideally forgotten) and considered only at the very end of the data analysis. Any other use of the test set during the analysis (e.g. before the final assessment) would "contaminate" the procedure and make it irreversibly biased.

Selection bias

A Monte Carlo illustration of the selection bias effect in a univariate regression task is proposed in the R script selectionbias.R. The script estimates the generalisation errors of a constant model ($h_1$), a linear model ($h_2$) and a third model which is nothing more than the winner-takes-all of the two in terms of leave-one-out validation. It appears that the winner-takes-all model is not better than the best of $h_1$ and $h_2$: in other terms, it has a generalisation error larger than the minimum between those of $h_1$ and $h_2$.

8.10 Evaluation of a regression model

Let us consider a test set of size $N_{ts}$ where $Y_{ts} = \{y_1, \ldots, y_{N_{ts}}\}$ is the target and $\hat{Y} = \{\hat{y}_1, \ldots, \hat{y}_{N_{ts}}\}$ is the prediction returned by the learner. The canonical way to assess a regression model by using a testing set is to measure the mean-squared-error (8.8.28) (MSE):

$$\text{MSE} = \frac{\sum_{i=1}^{N_{ts}} (y_i - \hat{y}_i)^2}{N_{ts}}$$

Let us suppose that the test of a learning algorithm returns a mean-squared-error of 0.4. Is that good or bad? Is that impressive and/or convincing? How may we have a rapid and intuitive measure of the quality of a regression model?

A recommended way is to compare the learner to a baseline, e.g. the simplest (or naive) predictor we could design. This is the rationale of the Normalised Mean-Squared-Error measure, which normalises the accuracy of the learner with respect to the accuracy of the average predictor, i.e. the simplest predictor we could learn from data. Then

$$\text{NMSE} = \frac{\sum_{i=1}^{N_{ts}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N_{ts}} (y_i - \bar{y})^2} \qquad (8.10.41)$$

where

$$\bar{y} = \frac{\sum_{i=1}^{N_{ts}} y_i}{N_{ts}} \qquad (8.10.42)$$

is the prediction returned by an average predictor. NMSE is then the ratio between the MSE of the learner and the MSE of the baseline naive predictor (8.10.42).

As for the MSE, the lower the NMSE, the better. At the same time, we should target a NMSE (significantly) lower than one if we wish to claim that the complex learning procedure is effective. NMSE values close to (yet smaller than) one are either indicators of a bad learning design or, more probably, of a high noise-to-signal ratio (e.g. a large $\sigma_w^2$ in (8.5.1)) which makes any learning effort ineffective.

Our recommendation is always to measure the NMSE of a regression model before making too enthusiastic claims about the success of the learning procedure. A very small MSE could be irrelevant if not significantly smaller than what we could obtain by a simple naive predictor.

Another common way to assess the MSE of a predictor is to normalise it with respect to the MSE of the same learning algorithm, yet trained on a randomly shuffled version of the training set. For instance, it is enough to shuffle the training target to cancel any dependency between inputs and outputs. Again, in this case, we expect that the NMSE is much lower than one. Otherwise, any claim that our prediction is better than a random one would be unfounded.
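A minimal R sketch of the NMSE computation (8.10.41) is the following, where Y.ts and Y.hat are assumed to contain the test targets and the learner's predictions:

## NMSE: MSE of the learner divided by the MSE of the average predictor
NMSE <- function(Y.ts, Y.hat) {
  sum((Y.ts - Y.hat)^2) / sum((Y.ts - mean(Y.ts))^2)
}
## values close to 0 denote an effective learner; values close to 1 denote a
## learner hardly better than the naive average predictor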

8.11 Evaluation of a binary classifier

The most popular measure of performance is the error rate or misclassification rate, i.e. the proportion of test examples misclassified by the rule. However, the misclassification error is not necessarily the most appropriate criterion in real settings since it implicitly assumes that the costs of the different types of misclassification are equal. When there are only a few or a moderate number of classes, the confusion matrix is the most complete way of summarising the classifier performance. In the following, we will focus on evaluating a binary classifier.

Suppose we use the classifier to make $N$ test classifications and that, among the values to be predicted, there are $N_P$ examples of class 1 and $N_N$ examples of class 0. The confusion matrix is

                          Negative (0)   Positive (1)
  Classified as negative  TN             FN             $\hat{N}_N$
  Classified as positive  FP             TP             $\hat{N}_P$
                          $N_N$          $N_P$          $N$

where FP is the number of False Positives and FN is the number of False Negatives. The confusion matrix contains all the relevant information to assess the generalisation capability of a binary classifier. From its values it is possible to derive a number of commonly used error rates or measures of accuracy. For instance, the misclassification error rate is

$$\text{ER} = \frac{FP + FN}{N} \qquad (8.11.43)$$


8.11.1 Balanced Error Rate

In a setting where the two classes are not balanced, the misclassification error rate (8.11.43) can lead to a too optimistic interpretation of the rate of success. For instance, if $N_P = 90$ and $N_N = 10$, a naive classifier always returning the positive class would have a misclassification ER = 0.1 since FN = 0 and FP = 10. This low value of misclassification gives a false sense of accuracy, since humans tend to associate a 50% error with random classifiers. This is true in balanced settings, while in an unbalanced setting (as the one above) this generalisation performance may be obtained with a trivial classifier making no use of the input information.

In these cases, it is preferable to adopt the balanced error rate, which is the average of the errors on each class:

$$\text{BER} = \frac{1}{2} \left( \frac{FP}{TN + FP} + \frac{FN}{FN + TP} \right)$$

Note that in the example above BER = 0.5, normalising the misclassification error rate to a value correctly interpretable by humans.

8.11.2 Specificity and sensitivity

In many research works on classification, it is common practice to assess the classifier in terms of sensitivity and specificity.

Sensitivity is a synonym for the True Positive Rate (TPR):

$$\text{SE} = \text{TPR} = \frac{TP}{TP + FN} = \frac{TP}{N_P} = \frac{N_P - FN}{N_P} = 1 - \frac{FN}{N_P}, \qquad 0 \leq \text{SE} \leq 1 \qquad (8.11.44)$$

It is a quantity to be maximised, and it increases by reducing the number of false negatives. Note that it is also often called the recall in information retrieval.

Specificity stands for the True Negative Rate (TNR):

$$\text{SP} = \text{TNR} = \frac{TN}{FP + TN} = \frac{TN}{N_N} = \frac{N_N - FP}{N_N} = 1 - \frac{FP}{N_N}, \qquad 0 \leq \text{SP} \leq 1$$

It is a quantity to be maximised, and it increases by reducing the number of false positives.

In other terms, sensitivity is the proportion of positive examples classified as positive, while specificity is the proportion of negative examples classified as negative. There exists a trade-off between these two quantities: this is the reason why both quantities have to be calculated to have a thorough assessment of the classifier accuracy. In fact, it is trivial to maximise one of these quantities to the detriment of the other.

For instance, for a naive classifier that always returns 0, we have $\hat{N}_P = 0$, $\hat{N}_N = N$, FP = 0, TN = $N_N$. This means that a naive classifier may attain maximal specificity (SP = 1) but at the cost of minimal sensitivity (SE = 0).

Analogously, in the case of a naive classifier that always returns 1, we have $\hat{N}_P = N$, $\hat{N}_N = 0$, FN = 0, TP = $N_P$, i.e. maximal sensitivity (SE = 1) but null specificity (SP = 0).

8.11.3 Additional assessment quantities

Other commonly used quantities which can be derived from the confusion matrix are:

False Positive Rate:

FPR = 1 − SP = 1 − TN/(FP + TN) = FP/(FP + TN) = FP/N_N,   0 ≤ FPR ≤ 1

It decreases by reducing the number of false positives.


False Negative Rate:

FNR = 1 − SE = 1 − TP/(TP + FN) = FN/(TP + FN) = FN/N_P,   0 ≤ FNR ≤ 1

It decreases by reducing the number of false negatives.

Positive Predictive Value: the ratio (to be maximised)

PPV = TP/(TP + FP) = TP/N̂_P,   0 ≤ PPV ≤ 1    (8.11.45)

This quantity is also called precision in information retrieval.

Negative Predictive Value: the ratio (to be maximised)

PNV = TN/(TN + FN) = TN/N̂_N,   0 ≤ PNV ≤ 1

False Discovery Rate: the ratio (to be minimised)

FDR = FP/(TP + FP) = FP/N̂_P = 1 − PPV,   0 ≤ FDR ≤ 1

8.11.4 Receiver Operating Characteristic curve

All the assessment measures discussed so far assume that the classifier returns a class for each test point. However, since most binary classifiers compute an estimation of the conditional probability, a class may be returned as outcome only once a threshold on the conditional probability has been set. In other terms, the confusion matrix, as well as its derived measures, depends on a specific threshold. The choice of a threshold is related to the Type I and Type II errors (Section 5.13) that we are ready to accept in a stochastic setting.

In order to avoid conditioning our assessment on a specific threshold, it is interesting to assess the overall accuracy for all possible thresholds. This is possible by plotting curves, like the Receiver Operating Characteristic (ROC), which plots the true positive rate (i.e. sensitivity or power) against the false positive rate (1 − specificity) for different classification thresholds. In other terms, the ROC visualises the probability of detection vs. the probability of false alarm. Different points on the curve correspond to different thresholds used in the classifier.

The ideal ROC curve would follow the left and upper axes. In practice, real-life classification rules produce ROC curves which lie between these two extremes. It can be shown that a classifier with a ROC curve following the bisector (diagonal) line would be useless: for each threshold, we would have TP/N_P = FP/N_N, i.e. the same proportion of true positives and false positives. In other terms, such a classifier would not separate the classes at all.

A common way to summarise a ROC curve is to compute the area under the curve (AUC). By measuring the AUC of different classifiers, we have a compact way to compare classifiers without setting a specific threshold.

8.11.5 Precision-recall curves

Another commonly used curve to visualise the accuracy of a binary classifier is the precision-recall (PR) curve. This curve shows the relation between precision (8.11.45) (probability that an example is positive given that it has been classified as positive) and recall (8.11.44) (probability that an example is classified as positive given that it is positive).


Figure 8.5: ROC and PR curves of a binary classifier.

Since precision depends on the a priori probability of the positive class, in largely unbalanced problems (e.g. few positive examples, as in fraud detection) the PR curve is more informative than the ROC curve.

R script: visual assessment of a binary classifier

The R script roc.R illustrates the assessment of a binary classifier for a task where x ∈ R, p(x|y = +) ∼ N(1, 1) and p(x|y = −) ∼ N(−1, 1). Suppose that the classifier categorises an example as positive if x > Th and negative if x < Th, where Th ∈ R is a threshold. Note that if Th = −∞, all the examples are classed as positive: TN = FN = 0, which implies SE = TP/N_P = 1 and FPR = FP/(FP + TN) = 1. Conversely, if Th = ∞, all the examples are classed as negative: TP = FP = 0, which implies SE = 0 and FPR = 0.

By sweeping over all possible values of Th, we obtain the ROC and the PR curves in Figure 8.5. Each point on the ROC curve, associated with a specific threshold, has abscissa FPR = FP/N_N and ordinate TPR = TP/N_P. Each point on the PR curve, associated with a specific threshold, has abscissa TPR = TP/N_P and ordinate PR = TP/(TP + FP).
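The script roc.R is not reproduced here, but the minimal sketch below performs the same threshold sweep under the stated class-conditional distributions (the sample sizes and the threshold grid are arbitrary choices):

set.seed(0)
NP <- 1000; NN <- 1000
xp <- rnorm(NP, mean = 1)    # positive class: x | y=+ ~ N(1,1)
xn <- rnorm(NN, mean = -1)   # negative class: x | y=- ~ N(-1,1)

Th <- seq(-4, 4, by = 0.01)  # threshold grid
TPR <- sapply(Th, function(t) sum(xp > t) / NP)   # sensitivity
FPR <- sapply(Th, function(t) sum(xn > t) / NN)   # 1 - specificity
PREC <- sapply(Th, function(t) {
  TP <- sum(xp > t); FP <- sum(xn > t)
  if (TP + FP == 0) NA else TP / (TP + FP)        # precision
})

plot(FPR, TPR, type = "l", main = "ROC")   # ROC curve
plot(TPR, PREC, type = "l", main = "PR")   # precision-recall curve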

Fraud detection example

Let us consider a fraud detection problem [52] with N_P = 100 frauds out of N = 2·10^6 transactions. Since one of the two classes (in this case the fraud) is extremely rare, the binary classification setting is called unbalanced [51]. Unbalanced classification settings are very common in real-world tasks (e.g. churn detection, spam detection, predictive maintenance).

Suppose we want to compare two algorithms: the first returns 100 alerts, 90 of which are frauds. Its confusion matrix is

                           Genuine (0)   Fraudulent (1)
Classified as genuine      1,999,890     10                1,999,900
Classified as fraudulent   10            90                100
                           1,999,900     100               2·10^6


The second algorithm returns a much larger number of alerts (1000), 90 of which are actual frauds. Its confusion matrix is then

                           Genuine (0)   Fraudulent (1)
Classified as genuine      1,998,990     10                1,999,000
Classified as fraudulent   910           90                1,000
                           1,999,900     100               2·10^6

Which of the two algorithms is the best? In terms of TPR and FPR we have

1. TPR = TP/N_P = 90/100 = 0.9, FPR = FP/N_N = 10/1,999,900 = 0.00000500025

2. TPR = 90/100 = 0.9, FPR = 910/1,999,900 = 0.00045502275

The FPR difference between the two algorithms is thus negligible. Nevertheless, though the recalls of the two algorithms are identical, the first algorithm is definitely better in terms of false positives (much higher precision):

1. A1: PR = TP/(TP + FP) = 90/100 = 0.9, recall = 0.9

2. A2: PR = 90/1000 = 0.09, recall = 0.9

The example shows that, in strongly unbalanced settings, the assessment of a classification algorithm may be highly sensitive to the adopted cost function.
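The figures above can be checked directly from the two confusion matrices, e.g. with the following lines:

prec <- function(TP, FP) TP / (TP + FP)   # precision (8.11.45)
rec  <- function(TP, FN) TP / (TP + FN)   # recall (8.11.44)

## Algorithm 1: 100 alerts, 90 true frauds
prec(90, 10)    # 0.9
rec(90, 10)     # 0.9
## Algorithm 2: 1000 alerts, 90 true frauds
prec(90, 910)   # 0.09
rec(90, 10)     # 0.9
## FPRs remain nearly identical despite the tenfold precision gap
10 / 1999900    # ~ 5.0e-06
910 / 1999900   # ~ 4.6e-04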

8.12 Multi-class problems

So far we have limited ourselves to binary classification tasks. However, real-world classification tasks (e.g. in bioinformatics or image recognition) are often multi-class. Some classification strategies (detailed in following chapters) may be easily adapted to the multi-class setting, like the Naive Bayes (Section 10.2.3.1) or the KNN classifier (Section 10.2.1.1).

Suppose, however, that we have a binary classification strategy that we want to use in a multi-class context. There are three main strategies to extend binary classifiers to handle multi-class tasks y ∈ {c_1, ..., c_k}:

1. One-versus-the-rest (or one-versus-all, OVA): it builds for each class c_k a binary classifier that separates this class from the rest. To predict the class of a query point q, the outputs of the k classifiers are considered. If there is a unique class label which is consistent with the k predictions, the query point is labelled with such a class. Otherwise, one of the k classes is selected randomly.

2. Pairwise (or one-versus-one, OVO): it trains a classifier for each pair of classes, requiring in total the independent learning of k(k − 1)/2 binary classifiers. To predict a query point class, the outputs of the k(k − 1)/2 classifiers are calculated and a majority vote is taken. If there is a class which receives the largest number of votes, the query point is assigned to such a class. Otherwise, each tie is broken randomly.

3. Coding: it first encodes each class by a binary vector of size d, then it trains a classifier for each component of the vector. The aggregation of the outputs of the d classifiers returns an output word, i.e. a binary vector of size d. Given a query q, the output word is compared against the codeword of each class, and the class having the smallest Hamming distance (the number of disagreements) to the output word is returned.


Suppose that we have a task with k = 8 output classes. According to the coding strategy, d = ⌈log₂ 8⌉ = 3 binary classifiers can be used to handle this problem.

       ĉ1   ĉ2   ĉ3
c1     0    0    0
c2     0    0    1
c3     0    1    0
c4     0    1    1
c5     1    0    0
c6     1    0    1
c7     1    1    0
c8     1    1    1

The table columns denote the classifiers while the rows contain the coding of the associated class. For instance, the ĉ3 classifier will i) encode the training points labelled with the classes {c2, c4, c6, c8} as ones, ii) encode all the remaining examples as zeros and iii) learn the corresponding binary classifier (a decoding sketch is given after the list below).

Note that, though each strategy requires the learning of more than a single classifier, the number of trained classifiers is not the same. Given k > 2 classes, the number of classifiers trained by each method is:

One-versus-the-rest: k binary classifiers;

Pairwise: k(k − 1)/2 binary classifiers;

Coding: ⌈log₂ k⌉ binary classifiers, where ⌈·⌉ denotes the ceiling operator.
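As announced above, here is a minimal R sketch of the decoding step of the coding strategy for the k = 8 codebook in the table (the 0/1 output word is assumed to be already produced by the d = 3 classifiers):

## codebook: row j is the codeword of class c_j (table above)
codebook <- matrix(c(0,0,0, 0,0,1, 0,1,0, 0,1,1,
                     1,0,0, 1,0,1, 1,1,0, 1,1,1),
                   ncol = 3, byrow = TRUE)

## decode an output word by minimal Hamming distance
decode <- function(word, codebook) {
  W <- matrix(word, nrow(codebook), length(word), byrow = TRUE)
  hamming <- rowSums(codebook != W)   # disagreements per class
  which.min(hamming)                  # predicted class index (first minimum on ties)
}

decode(c(0, 1, 1), codebook)   # returns 4, i.e. class c4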

8.13 Concluding remarks

The chapter presented the most important steps to learn a model on the basis of a finite set of input/output data. Though the entire procedure was globally depicted as a waterfall process (Figure 8.6), it should be kept in mind that a learning process, like any modelling effort, is better represented by a spiral model characterised by feedbacks, iterations and adjustments. An example is the identification step, composed of two nested loops, the inner one returning the parameters of a fixed structure and the outer one searching for the best configuration.

The chapter focused on the core of the learning process, which begins once the data are in a tabular format. Nevertheless, it is worth reminding that the upstream steps, sketched in Sections 8.2-8.4, play a very important role as well. However, since those steps are often domain- and task-dependent, we considered them beyond the scope of this book.

In the following chapters we will leave this general perspective and delve into the specificities of the best-known learning algorithms.

8.14 Exercises

1. Consider an input/output regression task where n = 1, E[y|x] = sin(x) and p(y|x) ∼ N(sin(x), 1). Let N = 100 be the size of the training set and consider a quadratic loss function. Let the class of hypotheses be h₃(x) = α₀ + ∑_{m=1}^{3} α_m x^m.

1. Estimate the parameters by least squares.

2. Estimate the parameters by gradient-based search and plot the evolution of the training error as a function of the number of iterations. Show in the same figure the least-squares error.


Figure 8.6: From the phenomenon to the predictive model: overview of the steps constituting the modelling process (problem formulation, experimental design, raw data, preprocessing, model generation, parametric identification, validation, model selection and assessment of the test error on a held-out test set).


3. Plot the evolution of the gradient-based parameter estimations as a function of

the number of iterations. Show in the same figure the least-squares parameters.

4. Discuss the impact of the gradient-based learning rate on the training error

minimisation.

5. Estimate the parameter by Levenberg-Marquardt search and plot the evolution

of the training error as a function of the number of iterations. Show in the

same figure the least-squares error.

6. Plot the evolution of the Levenberg-Marquardt parameter estimations as a

function of the number of iterations. Show in the same figure the least-squares

parameters.

2. Consider an input/output regression task where n = 1, E[y|x] = 3x + 2 and p(y|x) ∼ N(3x + 2, 1). Let N = 100 be the size of the training set and consider a quadratic loss function. Consider an iterative gradient-descent procedure to minimise the empirical error.

1. Show in a contour plot the evolution β̂(τ) of the estimated parameter vector for at least 3 different learning rates.

2. Compute the least-squares solution and show the convergence of the iterated procedure to the least-squares solution.

3. Let us consider the dependency where the conditional distribution of y is

y = sin(2π x₁ x₂ x₃) + w

where w ∼ N(0, σ²), x ∈ R³ has a 3D normal distribution with an identity covariance matrix, N = 100 and σ = 0.25. Consider the following families of learners:

1. constant model returning always zero,

2. constant model h(x) = β₀,

3. linear model h(x) = x^T β,

4. K nearest neighbours for K = 1, 3, 5, 7, where the distance is Euclidean.

Implement for each learner above a function

learner <- function(Xtr, Ytr, Xts) {
  ## Xtr [N,n]   input training set
  ## Ytr [N,1]   output training set
  ## Xts [Nts,n] input test set
  ## ... compute here the vector Yhat [Nts,1] of predictions ...
  return(Yhat)
}

which returns a vector [Nts, 1] of predictions for the given input test set. By using Monte Carlo simulation (S = 100 runs) and a fixed-input test set of size Nts = 1000:

compute the average squared bias of all the learners,

compute the average variance of all the learners,

check the relation between squared bias, variance, noise variance and MSE,

define what is the best learner in terms of MSE,

discuss the results.

4. The student should prove the following equality concerning the quantities defined in Section 8.11:

FPR = p/(1 − p) · (1 − PPV)/PPV · (1 − FNR)

where p = Prob{y = +}. Hint: use the Bayes theorem.


Chapter 9

Linear approaches

The previous chapters distinguished between two types of supervised learning tasks according to the type of output:

Regression, when we predict quantitative outputs, e.g. real or integer numbers. Predicting the weight of an animal on the basis of its age and height is an example of a regression problem.

Classification (or pattern recognition), where we predict qualitative or categorical outputs which take values in a finite set of classes (e.g. black, white and red) where there is no explicit ordering. Qualitative variables are also referred to as factors. Predicting the class of an email on the basis of English word frequencies is an example of a classification task.

This chapter will consider learning approaches to classification and regression

where the hypothesis functions are linear combinations of the input variables.

9.1 Linear regression

Linear regression is a very old technique in statistics and traces back to the work

of Gauss.

9.1.1 The univariate linear model

The simplest regression model is the univariate linear regression model, where the input is supposed to be a scalar variable and the stochastic dependency between input and output is described by

y = β0 + β1 x + w    (9.1.1)

where x ∈ R is the regressor (or independent) variable, y is the measured response (or dependent) variable, β0 is the intercept, β1 is the slope and w is called noise or model error. We will assume that E[w] = 0 and that its variance σ_w² is independent of the x value. The assumption of constant variance is often referred to as homoscedasticity. From (9.1.1) we obtain

Prob{y = y|x} = Prob{w = y − β0 − β1 x},    E[y|x] = f(x) = β0 + β1 x

The function f(x) = E[y|x], also known as the regression function, is a linear function of the parameters β0 and β1 (Figure 9.1).

Figure 9.1: Conditional distribution and regression function for a stochastic linear dependence.

In the following we will regard as a linear model each input/output relationship which is linear in the parameters but not necessarily in the dependent variables. This means that: i) any value of the response variable y is described by a linear combination of a series of parameters (regression slopes, intercept) and ii) no parameter appears as an exponent or is multiplied or divided by another parameter. According to this definition of linear model, then

y = β0 + β1 x is a linear model;

y = β0 + β1 x² is again a linear model: simply by making the transformation X = x², the dependency can be put in the linear form (9.1.1);

y = β0 x^{β1} can be studied as a linear model between Y = log(y) and X = log(x), since taking logarithms yields log(y) = log(β0) + β1 log(x), i.e. Y = β0′ + β1 X with β0′ = log(β0);

the relationship y = β0 + β1 β2^x is not linear since there is no way to linearise it.

9.1.2 Least-squares estimation

Suppose that N pairs of observations (x_i, y_i), i = 1, ..., N are available. Let us assume that the data are generated by the following stochastic dependency:

y_i = β0 + β1 x_i + w_i,    i = 1, ..., N    (9.1.2)

where

1. the w_i ∈ R are i.i.d. realisations of the r.v. w having mean zero and constant variance σ_w² (homoscedasticity),

2. the x_i ∈ R are non-random and observed with negligible error.

The unknown parameters (also known as regression coefficients) β0 and β1 can be estimated by the least-squares method. The method of least squares is designed to provide

1. the estimates β̂0 and β̂1 of β0 and β1, respectively,

2. the fitted values of the response y

ŷ_i = β̂0 + β̂1 x_i,    i = 1, ..., N

so that the residual sum of squares (which is N times the empirical risk)

SSE_emp = N · \widehat{MISE}_emp = ∑_{i=1}^{N} (y_i − ŷ_i)² = ∑_{i=1}^{N} (y_i − β̂0 − β̂1 x_i)²

is minimised. In other terms,

{β̂0, β̂1} = arg min_{b0,b1} ∑_{i=1}^{N} (y_i − b0 − b1 x_i)²

It can be shown that the least-squares solution is

β̂1 = S_xy / S_xx,    β̂0 = ȳ − β̂1 x̄

if S_xx ≠ 0, where

x̄ = ∑_{i=1}^{N} x_i / N,    ȳ = ∑_{i=1}^{N} y_i / N

and

S_xy = ∑_{i=1}^{N} (x_i − x̄) y_i

S_xx = ∑_{i=1}^{N} (x_i − x̄)² = ∑_{i=1}^{N} (x_i² − 2 x_i x̄ + x̄²) = ∑_{i=1}^{N} (x_i² − x_i x̄ − x_i x̄ + x̄²)
     = ∑_{i=1}^{N} (x_i − x̄) x_i + ∑_{i=1}^{N} x̄ (x̄ − x_i) = ∑_{i=1}^{N} (x_i − x̄) x_i

It is worth noting that if x̄ = 0 and ȳ = 0, then β̂0 = 0 and

S_xy = ⟨X, Y⟩,    S_xx = ⟨X, X⟩    (9.1.3)

where X and Y are the [N, 1] vectors of x and y observations, respectively, and the inner product ⟨·, ·⟩ of two vectors is defined in Appendix B.2. Also, it is possible to write down the relation between the least-squares estimate β̂1 and the sample correlation coefficient (D.0.3):

ρ̂² = β̂1 S_xy / S_yy    (9.1.4)

R script

The script lin uni.R computes and plots the least-squares solution for N = 100 observations generated according to the dependency (9.1.2) with β0 = 2 and β1 = 2.
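A minimal sketch of the computation performed by lin uni.R (the uniform input design and the unit noise variance are arbitrary choices):

set.seed(0)
N <- 100; beta0 <- 2; beta1 <- 2
x <- runif(N, -1, 1)                 # arbitrary fixed design
y <- beta0 + beta1 * x + rnorm(N)    # observations from (9.1.2)

Sxy <- sum((x - mean(x)) * y)
Sxx <- sum((x - mean(x))^2)
beta1.hat <- Sxy / Sxx               # least-squares slope
beta0.hat <- mean(y) - beta1.hat * mean(x)

c(beta0.hat, beta1.hat)              # close to (beta0, beta1)
coef(lm(y ~ x))                      # same values returned by lm()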


If the dependency underlying the data is linear, then the estimators are unbiased. We show this property for β̂1:

E_{D_N}[β̂1] = E_{D_N}[S_xy/S_xx] = ∑_{i=1}^{N} (x_i − x̄) E[y_i] / S_xx
            = (1/S_xx) ∑_{i=1}^{N} (x_i − x̄)(β0 + β1 x_i)
            = (1/S_xx) [∑_{i=1}^{N} (x_i − x̄) β0 + ∑_{i=1}^{N} (x_i − x̄) β1 x_i]
            = β1 S_xx / S_xx = β1

Note that the analytical derivation relies on the relation ∑_{i=1}^{N} (x_i − x̄) = 0 and on the fact that x is not a random variable. Also, it can be shown [139] that

Var[β̂1] = σ_w² / S_xx    (9.1.5)

E[β̂0] = β0    (9.1.6)

Var[β̂0] = σ_w² (1/N + x̄²/S_xx)    (9.1.7)

Another important result in linear regression is that the quantity

σ̂_w² = ∑_{i=1}^{N} (y_i − ŷ_i)² / (N − 2)    (9.1.8)

is an unbiased estimator of σ_w² under the (strong) assumption that the observations have been generated according to (9.1.1). The denominator is often referred to as the residual degrees of freedom, also denoted by df. The degrees of freedom can be seen as the number N of observations reduced by the number of parameters estimated (slope and intercept). The estimate of the variance σ_w² can be used in Equations (9.1.7) and (9.1.5) to derive an estimate of the variance of the intercept and slope, respectively.

9.1.3 Maximum likelihood estimation

The properties of least-squares estimators rely on the only assumption that the

w_i = y_i − β0 − β1 x_i    (9.1.9)

are i.i.d. realisations with mean zero and constant variance σ_w². Therefore, no assumption is made concerning the probability distribution of w (e.g. Gaussian or uniform). On the contrary, if we want to use the maximum likelihood approach (Section 5.8), we have to define the distribution of w. Suppose that w ∼ N(0, σ_w²). By using (9.1.9), the likelihood function can be written as

L_N(β0, β1) = ∏_{i=1}^{N} p_w(w_i) = 1/((2π)^{N/2} σ_w^N) exp(−∑_{i=1}^{N} (y_i − β0 − β1 x_i)² / (2σ_w²))    (9.1.10)

It can be shown that the estimates of β0 and β1 obtained by maximising L_N(·) under the normal assumption are identical to the ones obtained by least-squares estimation.

9.1.4 Partitioning the variability

An interesting way of assessing the quality of a linear model is to evaluate which part of the output variability the model is able to explain. We can use the following relation:

∑_{i=1}^{N} (y_i − ȳ)² = ∑_{i=1}^{N} (ŷ_i − ȳ)² + ∑_{i=1}^{N} (y_i − ŷ_i)²

i.e.

SS_Tot = SS_Mod + SS_Res

where SS_Tot (which is also N times the sample variance of y) represents the total variability of the response, SS_Mod is the variability explained by the model and SS_Res is the variability left unexplained. This partition helps to determine whether the variation explained by the regression model is real or is no more than chance variation. It will be used in the following section to perform hypothesis tests on the quantities estimated by the regression model.

9.1.5 Test of hypotheses on the regression model

Suppose that we want to answer the question whether the regressor variable x truly influences the distribution F_y(·) of the response y or, in other words, whether they are linearly dependent. We can formulate the problem as a hypothesis testing problem on the slope β1, where

H: β1 = 0,    H̄: β1 ≠ 0

If H is true, this means that the regressor variable does not influence the response (at least not through a linear relationship). Rejection of H in favour of H̄ leads to the conclusion that x significantly influences the response in a linear fashion. It can be shown that, under the assumption that w is normally distributed, if the null hypothesis H (null correlation) is true, then

SS_Mod / (SS_Res/(N − 2)) ∼ F_{1,N−2}

Large values of the F statistic (Section C.2.4) provide evidence in favour of H̄ (i.e. a linear trend exists). The test is a two-sided test. In order to perform a single-sided test, typically T-statistics are used.

9.1.5.1 The t-test

We want to test whether the value of the slope is equal to a predefined value β̄:

H: β1 = β̄,    H̄: β1 ≠ β̄

Under the assumption of normal distribution of w, the following relation holds:

β̂1 ∼ N(β1, σ_w²/S_xx)    (9.1.11)

It follows that

(β̂1 − β1) √S_xx / σ̂ ∼ T_{N−2}

where σ̂² is the estimate of the variance σ_w². This is a typical t-test applied to the regression case. Note that this statistic can also be used to test a one-sided hypothesis, e.g.

H: β1 = β̄,    H̄: β1 > β̄


9.1.6 Interval of confidence

Under the assumption of normal distribution, according to (9.1.11),

Prob{−t_{α/2,N−2} < (β̂1 − β1) √S_xx / σ̂ < t_{α/2,N−2}} = 1 − α

where t_{α/2,N−2} is the upper α/2 critical point of the T-distribution with N − 2 degrees of freedom. Equivalently, we can say that with probability 1 − α the real parameter β1 is covered by the interval described by

β̂1 ± t_{α/2,N−2} √(σ̂²/S_xx)    (9.1.12)

Note that the interval (9.1.12) may be used to test the hypothesis of input irrelevance. If the value 0 is outside the interval above, we can reject the input irrelevance hypothesis with 100(1 − α)% confidence. Similarly, from (9.1.7) we obtain that the 100(1 − α)% confidence interval of β0 is

β̂0 ± t_{α/2,N−2} σ̂ √(1/N + x̄²/S_xx)

9.1.7 Variance of the response

Let

ŷ = β̂0 + β̂1 x

be the estimator of the regression function value in x. If the linear dependence (9.1.1) holds, we have for an arbitrary x = x0

E[ŷ|x0] = E[β̂0] + E[β̂1] x0 = β0 + β1 x0 = E[y|x0]

This means that the prediction ŷ is an unbiased estimator of the value of the regression function in x0. Under the assumption of normal distribution of w, the variance of ŷ in x0 is

Var[ŷ|x0] = σ_w² (1/N + (x0 − x̄)²/S_xx)

where x̄ = ∑_{i=1}^{N} x_i / N. This quantity measures how the prediction ŷ would vary if repeated data collections from (9.1.1) and least-squares estimations were conducted.

R script

Let us consider a dataset D_N = {(x_i, y_i)}_{i=1,...,N} where

y_i = β0 + β1 x_i + w_i

with β0 and β1 known and w ∼ N(0, σ_w²) with σ_w² known. The R script bv.R may be used to:

study experimentally the bias and variance of the estimators β̂0, β̂1 and σ̂_w when data are generated according to the linear dependency (9.1.2) with β0 = 1, β1 = 1 and σ_w = 4;

compare the experimental values with the theoretical results;

study experimentally the bias and the variance of the response prediction;

compare the experimental results with the theoretical ones.
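A minimal Monte Carlo sketch of the kind of experiment performed by bv.R for the slope estimator (the fixed design and the number of runs are arbitrary choices):

set.seed(0)
N <- 50; beta0 <- 1; beta1 <- 1; sdw <- 4
x <- seq(-10, 10, length.out = N)       # fixed, non-random design
Sxx <- sum((x - mean(x))^2)

R <- 10000                              # Monte Carlo runs
b1 <- replicate(R, {
  y <- beta0 + beta1 * x + rnorm(N, sd = sdw)
  sum((x - mean(x)) * y) / Sxx          # least-squares slope estimate
})

mean(b1)       # close to beta1: unbiasedness
var(b1)        # close to the theoretical value (9.1.5)
sdw^2 / Sxx    # theoretical variance of the slope estimator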


R script

Consider the medical dataset available in the R script medical.R. This script may be used to: i) estimate the intercept and slope of the linear model fitting the dataset, ii) plot the fitted linear model, iii) estimate the variance of the estimator of the slope, iv) test the hypothesis β1 = 0, v) compute the confidence interval of β1, and compare the results with the output of the R command lm().

9.1.8 Coefficient of determination

The coefficient of determination, also known as R²,

R² = SS_Mod/SS_Tot = ∑_{i=1}^{N} (ŷ_i − ȳ)² / ∑_{i=1}^{N} (y_i − ȳ)² = 1 − SS_Res/SS_Tot

is often used as a measure of the fit of the regression line. This quantity, which satisfies the inequality 0 ≤ R² ≤ 1, represents the proportion of variation in the response data that is explained by the model. The coefficient of determination is easy to interpret and can be understood by most experimenters regardless of their training in statistics. However, it is a dangerous criterion for the comparison of candidate models, because any additional model term (e.g. a quadratic term) will decrease SS_Res and thus increase R². In other terms, R² can be made artificially high by overfitting (Section 7.7), since it is not merely the quality of fit which influences R².

9.1.9 Multiple linear dependence

Consider a linear relation between an independent vector x ∈ X ⊂ R^n and a dependent random variable y ∈ Y ⊂ R:

y = β0 + β1 x_{·1} + β2 x_{·2} + ··· + β_n x_{·n} + w    (9.1.13)

where w represents a random variable with mean zero and constant variance σ_w². Note that it is possible to establish a link between the partial regression coefficients β_i and the partial correlation terms (Section 3.8.3), showing that β_i is related to the conditional information of x_i about y once all the other terms are fixed (ceteris paribus effect) [139]. In matrix notation¹, equation (9.1.13) can be written as

y = x^T β + w    (9.1.14)

where x stands for the [p × 1] vector x = [1, x_{·1}, x_{·2}, ..., x_{·n}]^T and p = n + 1 is the total number of model parameters.

9.1.10 The multiple linear regression model

Consider N observations D_N = {(x_i, y_i) : i = 1, ..., N} generated according to the stochastic dependence (9.1.14), where x_i = [1, x_{i1}, ..., x_{in}]^T. We suppose that the following multiple linear relation holds:

Y = Xβ + W

where Y is the [N × 1] response vector, X is the [N × p] data matrix, whose jth column contains readings on the jth regressor, and β is the [p × 1] vector of parameters:

Y = [y_1, y_2, ..., y_N]^T

X = [x_1^T; x_2^T; ...; x_N^T],  with ith row x_i^T = [1, x_{i1}, x_{i2}, ..., x_{in}]

β = [β0, β1, ..., β_n]^T

W = [w_1, w_2, ..., w_N]^T

Here the w_i are assumed uncorrelated, with mean zero and constant variance σ_w² (homogeneous variance). Then Var[w_1, ..., w_N] = σ_w² I_N.

¹ We use the notation x_{·j} to denote the jth variable of the non-random vector x, while x_i = [1, x_{i1}, x_{i2}, ..., x_{in}]^T denotes the ith observation of the vector x. This extension of notation is necessary when the input is not considered a random vector. In the generic case, x_j will be used to denote the jth variable.

9.1.11 The least-squares solution

We seek the least-squares estimator β̂ such that

β̂ = arg min_b ∑_{i=1}^{N} (y_i − x_i^T b)² = arg min_b (Y − Xb)^T (Y − Xb)    (9.1.15)

where

SSE_emp = N · \widehat{MISE}_emp = (Y − Xb)^T (Y − Xb) = e^T e    (9.1.16)

is the residual sum of squares (which is N times the empirical risk (7.2.8) with quadratic loss) and

e = Y − Xb

is the [N × 1] vector of residuals. The quantity SSE_emp is a quadratic function of the p parameters. In order to minimise

(Y − Xβ̂)^T (Y − Xβ̂) = β̂^T X^T X β̂ − β̂^T X^T Y − Y^T X β̂ + Y^T Y

the vector β̂ must satisfy

∂/∂β̂ [(Y − Xβ̂)^T (Y − Xβ̂)] = −2 X^T (Y − Xβ̂) = 0    (9.1.17)

Assuming X is of full column rank, the second derivative

∂²/(∂β̂ ∂β̂^T) [(Y − Xβ̂)^T (Y − Xβ̂)] = 2 X^T X

is positive definite, and SSE_emp attains its minimum at the solution of the least-squares normal equations

(X^T X) β̂ = X^T Y

As a result,

β̂ = (X^T X)^{-1} X^T Y = X† Y    (9.1.18)

where X^T X is a symmetric [p × p] matrix (also known as the Gram matrix) and X† = (X^T X)^{-1} X^T is called the pseudo-inverse of X, since X† X = I_p. Note that the computation of β̂ represents the parametric identification step of the supervised learning procedure (Section 7.9) when the class of hypotheses is linear.
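A minimal sketch of the parametric identification step (9.1.18) on simulated data (the sizes and the true parameter values are arbitrary choices):

set.seed(0)
N <- 100; n <- 3
X <- cbind(1, matrix(rnorm(N * n), N, n))   # [N, p] data matrix, p = n + 1
beta <- c(1, -2, 0.5, 3)                    # arbitrary true parameters
Y <- X %*% beta + rnorm(N)

## solve the normal equations (X^T X) beta.hat = X^T Y
beta.hat <- solve(t(X) %*% X, t(X) %*% Y)
cbind(beta.hat, coef(lm(Y ~ X - 1)))        # identical to the lm() fit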


9.1.12 Least-squares and non full-rank configurations

A full-rank X is required to ensure that the matrix X^T X is invertible in (9.1.18). However, for numerical reasons it is recommended that X^T X be not only invertible but also well-conditioned, or equivalently not ill-conditioned [2]. An ill-conditioned matrix is an almost singular matrix: its inverse may contain very large entries and sometimes numeric overflows. This means that small changes in the data may cause large and unstable changes in the solution β̂. Such sensitivity of the solution to the dataset should evoke in the attentive reader the notion of estimator variance (Section 5.5). In fact, in the following sections we will show that the variance of least-squares estimators is related to the inverse of X^T X (e.g. Equation (9.1.19)).

But what can be done in practice if X is not full-rank (or rank-deficient) or ill-conditioned? A first numerical fix consists in computing the generalised QR decomposition (Appendix B.4)

X = QR

where Q is an orthogonal [N, p0] matrix and R is a [p0, p] upper-triangular matrix of full row rank with p0 < p. Since RR^T is invertible, the pseudo-inverse in (9.1.18) can be written as X† = R^T (RR^T)^{-1} Q^T (details in Section 2.8.1 of [2]). A second solution consists in regularising the optimisation, i.e. constraining the optimisation problem (9.1.15) by adding a term which penalises solutions β̂ with a too large norm. This leads to the ridge regression formulation, which will be discussed in Section 12.5.1.1. In more general terms, since non-invertible or ill-conditioned configurations are often due to highly correlated (multicollinear) or redundant inputs, the use of feature selection strategies (Chapter 12) before the parametric identification step may be beneficial.

9.1.13 Properties of least-squares estimators

Under the condition that the linear stochastic dependence (9.1.14) holds, it can be shown [139] that:

If E[w] = 0, then the random vector β̂ is an unbiased estimator of β.

The residual mean square estimator

σ̂_w² = (Y − Xβ̂)^T (Y − Xβ̂) / (N − p)

is an unbiased estimator of the error variance σ_w².

If the w_i are uncorrelated and have common variance, the variance-covariance matrix of β̂ is given by

Var[β̂] = σ_w² (X^T X)^{-1}    (9.1.19)

It can also be shown (Gauss-Markov theorem) that the least-squares estimator β̂ is the "best linear unbiased estimator" (BLUE), i.e. it has the lowest variance among all linear unbiased estimators.

From the results above it is possible to derive the confidence intervals of the model parameters, similarly to the univariate case discussed in Section 9.1.6.

R script

A list of the most important least-squares summary statistics is returned by the

summary of the R command lm. See for instance the script ls.R.


summary(lm(Y~X))

Call:

lm(formula = Y ~ X)

Residuals:

Min 1Q Median 3Q Max

-0.40141 -0.14760 -0.02202 0.03001 0.43490

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.09781 0.11748 9.345 6.26e-09

X 0.02196 0.01045 2.101 0.0479

(Intercept) ***

X *

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2167 on 21 degrees of freedom

Multiple R-Squared: 0.1737, Adjusted R-squared: 0.1343

F-statistic: 4.414 on 1 and 21 DF, p-value: 0.0479

9.1.14 Variance of the prediction

Since the estimator β̂ is unbiased, this is also the case for the prediction ŷ = x0^T β̂ for a generic input value x = x0. Its variance is

Var[ŷ|x0] = σ_w² x0^T (X^T X)^{-1} x0    (9.1.20)

Assuming that w is normally distributed, the 100(1 − α)% confidence bound for the regression value in x0 is given by

ŷ(x0) ± t_{α/2,N−p} σ̂_w √(x0^T (X^T X)^{-1} x0)

where t_{α/2,N−p} is the upper α/2 percent point of the t-distribution with N − p degrees of freedom, and the quantity σ̂_w √(x0^T (X^T X)^{-1} x0), obtained from (9.1.20), is the standard error of prediction for multiple regression.

R script

The R script bv mult.R validates by Monte Carlo simulation the properties of least-squares estimation mentioned in Sections 9.1.11 and 9.1.14. In order to assess the generality of the results, we invite the reader to run the script for different input sizes n, different numbers of observations N and different values of the parameter β.

9.1.15 The HAT matrix

The Hat matrix is defined as

H = X (X^T X)^{-1} X^T    (9.1.21)

It is a symmetric, idempotent [N × N] matrix that transforms the output values Y of the training set into the regression predictions Ŷ:

Ŷ = Xβ̂ = X (X^T X)^{-1} X^T Y = HY

Using the above relation, the vector of residuals can be written as

e = Y − Xβ̂ = Y − X (X^T X)^{-1} X^T Y = [I − H] Y

and the residual sum of squares as

e^T e = Y^T [I − H]² Y = Y^T [I − H] Y = Y^T P Y    (9.1.22)

where P = I − H is a [N × N] matrix called the projection matrix. If X has full rank, by commutativity of the trace operator it follows that

tr(H) = tr(X (X^T X)^{-1} X^T) = tr(X^T X (X^T X)^{-1}) = tr(I_p) = p    (9.1.23)

If we perform a QR decomposition of X (Appendix B.4), we obtain

H = X (X^T X)^{-1} X^T = QR (R^T Q^T Q R)^{-1} R^T Q^T = Q R R^{-1} (R^T)^{-1} R^T Q^T = Q Q^T    (9.1.24)

Note that, in this case, the input matrix X is replaced by the matrix Q, which contains an orthogonalised transformation of the original inputs.
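A minimal numerical check of the properties (9.1.23) and (9.1.24) (X is an arbitrary simulated full-rank matrix):

set.seed(0)
N <- 20; p <- 4
X <- cbind(1, matrix(rnorm(N * (p - 1)), N, p - 1))

H <- X %*% solve(t(X) %*% X) %*% t(X)   # Hat matrix (9.1.21)
sum(diag(H))                            # equals p, cf. (9.1.23)

Q <- qr.Q(qr(X))                        # QR decomposition of X
max(abs(H - Q %*% t(Q)))                # ~ 0, i.e. H = Q Q^T (9.1.24)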

9.1.16 Generalisation error of the linear model

Given a training dataset D_N = {(x_i, y_i) : i = 1, ..., N} and a query point x, it is possible to return a linear prediction

ŷ = h(x, α) = x^T β̂

where β̂ is returned by the least-squares estimation (9.1.18). From an estimation perspective, β̂ is a realisation, for the specific dataset D_N, of a random estimator. But which precision can we expect from ŷ = x^T β̂ if we average the prediction error over all finite-size datasets D_N that can be generated by the linear dependency (9.1.13)?

A quantitative measure of the quality of the linear predictor on the whole domain X is the Mean Integrated Squared Error (MISE) defined in (7.5.24). But how can we estimate this quantity in the linear case? Also, is the empirical risk \widehat{MISE}_emp in (9.1.16) a reliable estimate of MISE?

9.1.16.1 The expected empirical error

This section derives analytically that the empirical risk \widehat{MISE}_emp (defined in (9.1.16)) is a biased estimator of the MISE generalisation error.

Let us first compute the expectation of the residual sum of squares, which is equal to N times the empirical risk. According to (9.1.22) and Theorem 4.2, the expectation can be written as

E_{D_N}[SSE_emp] = E_{D_N}[e^T e] = E_{D_N}[Y^T P Y] = σ_w² tr(P) + E[Y^T] P E[Y]

Since tr(ABC) = tr(CAB),

tr(P) = tr(I − H) = N − tr(X (X^T X)^{-1} X^T) = N − tr(X^T X (X^T X)^{-1}) = N − tr(I_p) = N − p

and we have

E_{D_N}[e^T e] = (N − p) σ_w² + (Xβ)^T P (Xβ)    (9.1.25)
             = (N − p) σ_w² + β^T X^T (I − X (X^T X)^{-1} X^T) Xβ    (9.1.26)
             = (N − p) σ_w²    (9.1.27)

It follows that

E_{D_N}[\widehat{MISE}_emp] = E_{D_N}[SSE_emp/N] = E_{D_N}[e^T e/N] = (1 − p/N) σ_w²    (9.1.28)

is the expectation of the error made by a linear model trained on D_N to predict the value of the output in the same dataset D_N.

In order to obtain the MISE term, we derive analytically the expected sum of squared errors of a linear model trained on D_N and used to predict, for the same training inputs X, a set of outputs Y_ts distributed according to the same linear law (9.1.13) but independent of the training output Y:

E_{D_N,Y_ts}[(Y_ts − Xβ̂)^T (Y_ts − Xβ̂)]
  = E_{D_N,Y_ts}[(Y_ts − Xβ + Xβ − Xβ̂)^T (Y_ts − Xβ + Xβ − Xβ̂)]
  = E_{D_N,Y_ts}[(W_ts + Xβ − Xβ̂)^T (W_ts + Xβ − Xβ̂)]
  = N σ_w² + E_{D_N}[(Xβ − Xβ̂)^T (Xβ − Xβ̂)]

Since

Xβ − Xβ̂ = Xβ − X (X^T X)^{-1} X^T Y = Xβ − X (X^T X)^{-1} X^T (Xβ + W) = −X (X^T X)^{-1} X^T W

we obtain

N σ_w² + E_{D_N}[(Xβ − Xβ̂)^T (Xβ − Xβ̂)]
  = N σ_w² + E_{D_N}[W^T X (X^T X)^{-1} X^T X (X^T X)^{-1} X^T W]
  = N σ_w² + E_{D_N}[W^T X (X^T X)^{-1} X^T W]
  = N σ_w² + σ_w² tr(X (X^T X)^{-1} X^T)
  = N σ_w² + σ_w² tr(X^T X (X^T X)^{-1})
  = N σ_w² + σ_w² tr(I_p) = σ_w² (N + p)

By dividing the above quantity by N, we obtain

MISE = (1 + p/N) σ_w²    (9.1.29)

From (9.1.28) and (9.1.29) it follows that the empirical error \widehat{MISE}_emp is a biased estimate of MISE:

E_{D_N}[\widehat{MISE}_emp] = E_{D_N}[e^T e/N] = σ_w² (1 − p/N) ≠ MISE = σ_w² (1 + p/N)    (9.1.30)

As a consequence, if we replace \widehat{MISE}_emp with

e^T e/N + 2 σ_w² p/N    (9.1.31)

we correct the bias and obtain an unbiased estimator of the MISE generalisation error. Nevertheless, this estimator requires an estimate of the noise variance.


R script

The R script ee.R performs a Monte Carlo validation of (9.1.30).

Example

Let {y_1, ..., y_N} ← F_y be the training set. Consider the simplest linear predictor of the output variable: the average μ̂_y (i.e. p = 1). This means that

ŷ_i = ∑_{i=1}^{N} y_i / N = μ̂_y,    i = 1, ..., N

We want to show that, even for this simple estimator, the empirical error is a biased estimate of the quality of the predictor. Let μ be the mean of the r.v. y and let us write y as

y = μ + w

where E[w] = 0 and Var[w] = σ². Let {z_1, ..., z_N} ← F_y be a test set coming from the same distribution underlying D_N. Let us compute the expected empirical error and the mean integrated squared error. Since E[μ̂_y] = μ and Var[μ̂_y] = σ²/N,

N · MISE = E_{D_N,Y_ts}[∑_{i=1}^{N} (z_i − μ̂_y)²] = E_{D_N,w}[∑_{i=1}^{N} (μ + w_i − μ̂_y)²]
         = N σ² + ∑_{i=1}^{N} E_{D_N}[(μ − μ̂_y)²]
         = N σ² + N (σ²/N) = (N + 1) σ²

Instead, since σ̂_y² = ∑_{i=1}^{N} (y_i − μ̂_y)² / (N − 1) and E[σ̂_y²] = σ²,

E_{D_N}[∑_{i=1}^{N} (y_i − μ̂_y)²] = E_{D_N}[(N − 1) σ̂_y²] = (N − 1) σ² ≠ N · MISE

It follows that, even for a simple estimator like the estimator of the mean, the empirical error is a biased estimate of the accuracy (see the R file ee mean.R).

9.1.16.2 The PSE and the FPE

In the previous section we derived that \widehat{MISE}_emp is a biased estimate of MISE and that the addition of the correction term 2σ_w² p/N makes it unbiased.

Suppose we have an estimate σ̂_w² of σ_w². By replacing it in the expression (9.1.31), we obtain the so-called Predicted Square Error (PSE) criterion:

PSE = \widehat{MISE}_emp + 2 σ̂_w² p/N    (9.1.32)

In particular, if we take as estimate of σ_w² the quantity

σ̂_w² = SSE_emp / (N − p) = N/(N − p) \widehat{MISE}_emp

we obtain the so-called Final Prediction Error (FPE):

FPE = (1 + p/N)/(1 − p/N) \widehat{MISE}_emp    (9.1.33)


Figure 9.2: Estimation of the generalisation error of h_m, m = 2, ..., 7, returned by the empirical error.

The PSE and the FPE criteria allow us to replace the empirical risk with a more accurate estimate of the generalisation error of a linear model. Although their expression is easy to compute, it is worth reminding that their derivation relies on the assumption that the stochastic input/output dependence has the linear form (9.1.14).

R script

Let us consider an input/output dependence

y = f(x) + w = 1 + x + x² + x³ + w    (9.1.34)

where w ∼ N(0, 1) and x ∼ U(−1, 1). Suppose that a dataset D_N of N = 100 input/output observations is drawn from the joint distribution of (x, y). The R script fpe.R assesses the prediction accuracy of 7 different models having the form

h_m(x) = β̂0 + ∑_{j=1}^{m} β̂_j x^j    (9.1.35)

by using the empirical risk and the FPE measure. These results are compared with the generalisation error measured by

MISE_m = (1/N) ∑_{i=1}^{N} (h_m(x_i) − f(x_i))²    (9.1.36)

The empirical risk and the FPE values for m = 2, ..., 7 are plotted in Figures 9.2 and 9.3, respectively. The values MISE_m are plotted in Figure 9.4. It is evident, as confirmed by Figure 9.4, that the best model should be h₃(x), since it has the same analytical structure as f(x). However, the empirical risk is not able to detect this and returns as the best model the one with the highest complexity (m = 7). This is not the case for FPE which, by properly correcting the \widehat{MISE}_emp value, is able to select the optimal model.


Figure 9.3: Estimation of the generalisation error of h_m, m = 2, ..., 7, returned by the FPE.

Figure 9.4: Computation of the generalisation error of h_m, m = 2, ..., 7, by (9.1.36).
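A minimal sketch of the FPE-based model selection performed by fpe.R (here the polynomial fits are obtained with lm(); this is a sketch, not the original script):

set.seed(0)
N <- 100
x <- runif(N, -1, 1)
y <- 1 + x + x^2 + x^3 + rnorm(N)             # dependency (9.1.34)

fpe <- numeric(7)
for (m in 1:7) {
  fit <- lm(y ~ poly(x, m, raw = TRUE))       # model h_m of (9.1.35)
  p <- m + 1                                  # number of parameters
  mise.emp <- mean(residuals(fit)^2)          # empirical risk
  fpe[m] <- (1 + p / N) / (1 - p / N) * mise.emp   # FPE (9.1.33)
}
which.min(fpe)                                # typically 3, the correct degree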


Figure 9.5: Leave-one-out for linear models. The leave-one-out error can be computed in two equivalent ways: the slowest way (on the right), which repeats N times the training and test procedure, and the fastest way (on the left), which performs the parametric identification and the computation of the PRESS statistic only once.

9.1.17 The PRESS statistic

Section 7.10.1 introduced cross-validation to provide a reliable estimate of the generalisation error G_N. The disadvantage of this approach is that it requires the training process to be repeated l times, implying a large computational effort. However, in the linear case the PRESS (Prediction Sum of Squares) statistic [7] returns the leave-one-out cross-validation error at a reduced computational cost (Fig. 9.5). PRESS relies on a simple formula which returns the leave-one-out (l-o-o) error as a by-product of the parametric identification of β̂ in Eq. (10.1.41). Consider a training set D_N in which, for N times,

1. we set aside the ith observation (x_i, y_i), i = 1, ..., N, from the training set,

2. we use the remaining N − 1 observations to estimate the linear regression coefficients β̂_{−i},

3. we use β̂_{−i} to predict the target in x_i.

The leave-one-out residual is

e_i^{loo} = y_i − ŷ_i^{−i} = y_i − x_i^T β̂_{−i},    i = 1, ..., N    (9.1.37)

The PRESS statistic is an efficient way to compute the l-o-o residuals on the basis of the single regression performed on the whole training set. This allows a fast cross-validation without repeating the leave-one-out procedure N times. The PRESS procedure can be described as follows:

1. use the whole training set to estimate the linear regression coefficients β̂; this procedure is performed only once and returns as by-product the Hat matrix (see Section 9.1.15)

H = X (X^T X)^{-1} X^T    (9.1.38)

2. compute the residual vector e, whose ith term is e_i = y_i − x_i^T β̂,

3. use the PRESS statistic to compute e_i^{loo} as

e_i^{loo} = e_i / (1 − H_ii)    (9.1.39)


where H_ii is the ith diagonal term of the matrix H. Note that (9.1.39) is not an approximation of (9.1.37) but simply a faster way of computing the leave-one-out residual e_i^{loo}.

Let us now derive the formula of the PRESS statistic. Matrix manipulations show that

X^T X − x_i x_i^T = X_{−i}^T X_{−i}    (9.1.40)

where X_{−i}^T X_{−i} is the X^T X matrix obtained by setting the ith row aside. Using the relation (B.9.13), we have

(X_{−i}^T X_{−i})^{-1} = (X^T X − x_i x_i^T)^{-1} = (X^T X)^{-1} + (X^T X)^{-1} x_i x_i^T (X^T X)^{-1} / (1 − H_ii)    (9.1.41)

and

β̂_{−i} = (X_{−i}^T X_{−i})^{-1} X_{−i}^T Y_{−i} = [(X^T X)^{-1} + (X^T X)^{-1} x_i x_i^T (X^T X)^{-1} / (1 − H_ii)] X_{−i}^T Y_{−i}    (9.1.42)

where Y_{−i} is the target vector with the ith example set aside. From (9.1.37) and (9.1.42) we have

e_i^{loo} = y_i − x_i^T β̂_{−i}
  = y_i − x_i^T [(X^T X)^{-1} + (X^T X)^{-1} x_i x_i^T (X^T X)^{-1} / (1 − H_ii)] X_{−i}^T Y_{−i}
  = y_i − x_i^T (X^T X)^{-1} X_{−i}^T Y_{−i} − H_ii x_i^T (X^T X)^{-1} X_{−i}^T Y_{−i} / (1 − H_ii)
  = [(1 − H_ii) y_i − x_i^T (X^T X)^{-1} X_{−i}^T Y_{−i}] / (1 − H_ii)
  = [(1 − H_ii) y_i − x_i^T (X^T X)^{-1} (X^T Y − x_i y_i)] / (1 − H_ii)
  = [(1 − H_ii) y_i − ŷ_i + H_ii y_i] / (1 − H_ii)
  = (y_i − ŷ_i) / (1 − H_ii) = e_i / (1 − H_ii)    (9.1.43)

where we used X_{−i}^T Y_{−i} + x_i y_i = X^T Y and x_i^T (X^T X)^{-1} X^T Y = ŷ_i. Thus, the leave-one-out estimate of the mean integrated squared error is

Ĝ_loo = (1/N) ∑_{i=1}^{N} ((y_i − ŷ_i) / (1 − H_ii))²    (9.1.44)

Since from (9.1.23) the sum of the diagonal terms of the H matrix is p, the average value of H_ii is p/N. It follows that the PRESS may be approximated by

Ĝ_loo ≈ (1/N) ∑_{i=1}^{N} ((y_i − ŷ_i) / (1 − p/N))²

which leads us to the GCV formula (8.8.34).
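A minimal sketch checking that the PRESS residuals (9.1.39) coincide with the explicitly computed leave-one-out residuals (simulated data, arbitrary sizes):

set.seed(0)
N <- 50
X <- cbind(1, rnorm(N))
y <- X %*% c(1, 2) + rnorm(N)

H <- X %*% solve(t(X) %*% X) %*% t(X)   # Hat matrix
e <- y - H %*% y                        # ordinary residuals
e.press <- e / (1 - diag(H))            # PRESS formula (9.1.39)

## explicit leave-one-out: the slow way of Figure 9.5
e.loo <- sapply(1:N, function(i) {
  b.i <- solve(t(X[-i, ]) %*% X[-i, ], t(X[-i, ]) %*% y[-i])
  y[i] - X[i, ] %*% b.i
})
max(abs(e.press - e.loo))               # ~ 0: the two computations agree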

9.1.18 Dual linear formulation

Consider a linear regression problem with [N, n] input matrix X and [N, 1] output vector y. The conventional least-squares solution is the [p, 1] parameter vector (9.1.18), where p = n + 1. This formulation is common in conventional statistical settings, where the number of observations is supposed to be much larger than the number of variables.

However, machine learning may be confronted with high-dimensional settings where the ratio of observations to features is low: this would imply a very large value of p and risks of ill-conditioning of the numerical solution. In this case it is interesting to consider a dual formulation of the least-squares problem based on (B.9.14). In this formulation

β̂ = (X^T X)^{-1} X^T y = X^T [X (X^T X)^{-2} X^T y] = X^T α = ∑_{i=1}^{N} α_i x_i

where α = X (X^T X)^{-2} X^T y is a [N, 1] vector and x_i is the [n, 1] vector which represents the ith observation. It follows that if N ≪ p, the dual formulation has fewer parameters than the conventional one, with advantages in terms of storage requirements and numerical conditioning.
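A minimal numerical check of the equality between the primal and the dual solutions (an arbitrary setting with N > n is used so that X^T X is invertible):

set.seed(0)
N <- 10; n <- 5
X <- matrix(rnorm(N * n), N, n)
y <- rnorm(N)

beta.primal <- solve(t(X) %*% X) %*% t(X) %*% y
alpha <- X %*% solve(t(X) %*% X) %*% solve(t(X) %*% X) %*% t(X) %*% y
beta.dual <- t(X) %*% alpha         # beta = X^T alpha = sum_i alpha_i x_i
max(abs(beta.primal - beta.dual))   # ~ 0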

9.1.19 The weighted least-squares

The assumption of homogeneous variance of the noise w made in Eq. (9.1.14) is often violated in practical situations. Suppose we relax the assumption that Var(w) = σ_w² I_N, with I_N the identity matrix, and assume instead that there is a positive definite matrix V for which Var(w) = V. We may wish to consider

V = diag[σ₁², σ₂², ..., σ_N²]    (9.1.45)

in which case we are assuming uncorrelated errors with error variances that vary from observation to observation. As a result, it would seem reasonable that the estimator of β should take this into account by weighting the observations in some way that allows for the differences in the precision of the results. The function being minimised is then no longer (9.1.16) but depends on V and is given by

(y − Xβ̂)^T V^{-1} (y − Xβ̂)    (9.1.46)

The estimate of β is then

β̂ = (X^T V^{-1} X)^{-1} X^T V^{-1} y    (9.1.47)

The corresponding estimator is called the generalised least-squares estimator and has the following properties: i) it is unbiased, that is E[β̂] = β; ii) under the assumption w ∼ N(0, V), it is the minimum variance estimator among all the unbiased estimators.

9.1.20 Recursive least-squares

In many analytics tasks, data records are not statically available but have to be processed and analysed continuously rather than in batches. Examples are the data streams generated from sensors (notably IoT), financial, business intelligence or adaptive control applications. In those cases it is useful not to restart the model estimation from scratch but simply to update the model on the basis of the newly collected data. One appealing feature of least-squares estimates is that they can be updated at a lower cost than their batch counterpart.

Let us rewrite the least-squares estimator (9.1.18) for a training set of N observations as

β̂_{(N)} = (X_{(N)}^T X_{(N)})^{-1} X_{(N)}^T Y_{(N)}

where the subscript (N) denotes the number of observations used for the estimation. Suppose that a new data point (x_{N+1}, y_{N+1}) becomes available. Instead of recomputing the estimate β̂_{(N+1)} by using all the N + 1 available data, we want to derive β̂_{(N+1)} as an update of β̂_{(N)}. This problem is solved by the so-called recursive least-squares (RLS) estimation [26].

If a single new example (x_{N+1}, y_{N+1}), with x_{N+1} a [1, p] vector, is added to the training set, the X matrix acquires a new row and β̂_{(N+1)} can be written as

β̂_{(N+1)} = ([X_{(N)}; x_{N+1}]^T [X_{(N)}; x_{N+1}])^{-1} [X_{(N)}; x_{N+1}]^T [Y_{(N)}; y_{N+1}]

where [X_{(N)}; x_{N+1}] denotes the [N+1, p] matrix obtained by appending the row x_{N+1} to X_{(N)}.

By defining the [p, p] matrix

S_{(N)} = X_{(N)}^T X_{(N)}

we have

S_{(N+1)} = X_{(N+1)}^T X_{(N+1)} = [X_{(N)}^T  x_{N+1}^T] [X_{(N)}; x_{N+1}] = X_{(N)}^T X_{(N)} + x_{N+1}^T x_{N+1} = S_{(N)} + x_{N+1}^T x_{N+1}    (9.1.48)

Since

[X_{(N)}; x_{N+1}]^T [Y_{(N)}; y_{N+1}] = X_{(N)}^T Y_{(N)} + x_{N+1}^T y_{N+1}

and

S_{(N)} β̂_{(N)} = (X_{(N)}^T X_{(N)}) [(X_{(N)}^T X_{(N)})^{-1} X_{(N)}^T Y_{(N)}] = X_{(N)}^T Y_{(N)}

we obtain

S_{(N+1)} β̂_{(N+1)} = X_{(N)}^T Y_{(N)} + x_{N+1}^T y_{N+1} = S_{(N)} β̂_{(N)} + x_{N+1}^T y_{N+1}
                    = (S_{(N+1)} − x_{N+1}^T x_{N+1}) β̂_{(N)} + x_{N+1}^T y_{N+1}
                    = S_{(N+1)} β̂_{(N)} − x_{N+1}^T x_{N+1} β̂_{(N)} + x_{N+1}^T y_{N+1}

or equivalently

β̂_{(N+1)} = β̂_{(N)} + S_{(N+1)}^{-1} x_{N+1}^T (y_{N+1} − x_{N+1} β̂_{(N)})    (9.1.49)

9.1.20.1 1st Recursive formulation

From (9.1.48) and (9.1.49) we obtain the following recursive formulation:

S_{(N+1)} = S_{(N)} + x_{N+1}^T x_{N+1}
γ_{(N+1)} = S_{(N+1)}^{-1} x_{N+1}^T
e = y_{N+1} − x_{N+1} β̂_{(N)}
β̂_{(N+1)} = β̂_{(N)} + γ_{(N+1)} e

where the term β̂_{(N+1)} is expressed as a function of the old estimate β̂_{(N)} and the new observation (x_{N+1}, y_{N+1}). This formulation requires the inversion of the [p × p] matrix S_{(N+1)}. This operation is computationally expensive but, fortunately, using a matrix inversion theorem, an incremental formula for S^{-1} can be found.


9.1.20.2 2nd Recursive formulation

Once we define

V_{(N)} = S_{(N)}^{-1} = (X_{(N)}^T X_{(N)})^{-1}

we have S_{(N+1)}^{-1} = (S_{(N)} + x_{N+1}^T x_{N+1})^{-1} and

V_{(N+1)} = V_{(N)} − V_{(N)} x_{N+1}^T (1 + x_{N+1} V_{(N)} x_{N+1}^T)^{-1} x_{N+1} V_{(N)}    (9.1.50)
          = V_{(N)} − (V_{(N)} x_{N+1}^T x_{N+1} V_{(N)}) / (1 + x_{N+1} V_{(N)} x_{N+1}^T)    (9.1.51)

From (9.1.50) and (9.1.49) we obtain a second recursive formulation:

V_{(N+1)} = V_{(N)} − (V_{(N)} x_{N+1}^T x_{N+1} V_{(N)}) / (1 + x_{N+1} V_{(N)} x_{N+1}^T)
γ_{(N+1)} = V_{(N+1)} x_{N+1}^T
e = y_{N+1} − x_{N+1} β̂_{(N)}
β̂_{(N+1)} = β̂_{(N)} + γ_{(N+1)} e    (9.1.52)
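A minimal sketch of one step of the second recursive formulation (9.1.52); dividing V.new by a factor mu < 1 would give the forgetting-factor variant of Section 9.1.20.4:

## one RLS step: beta [p,1], V [p,p], x.new [1,p], y.new scalar
rls.update <- function(beta, V, x.new, y.new) {
  V.new <- V - (V %*% t(x.new) %*% x.new %*% V) /
               drop(1 + x.new %*% V %*% t(x.new))
  gamma <- V.new %*% t(x.new)
  e <- y.new - drop(x.new %*% beta)
  list(beta = beta + gamma * e, V = V.new)
}

## usage: initialise on a first batch, then update with a new point
set.seed(0)
N <- 30; X <- cbind(1, rnorm(N)); y <- X %*% c(1, 2) + rnorm(N)
V <- solve(t(X) %*% X)
beta <- V %*% t(X) %*% y
step <- rls.update(beta, V, matrix(c(1, 0.5), 1, 2), 2.1)
step$beta   # equals the batch least-squares estimate on the N+1 points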

9.1.20.3 RLS initialisation

Both recursive formulations presented above require the initialisation values β̂_{(0)} and V_{(0)}. One way to avoid choosing these initial values is to collect the first N data points, to compute β̂_{(N)} and V_{(N)} directly from

V_{(N)} = (X_{(N)}^T X_{(N)})^{-1},    β̂_{(N)} = V_{(N)} X_{(N)}^T Y_{(N)}

and to start iterating from the (N+1)th point. Otherwise, in the case of a generic initialisation β̂_{(0)} and V_{(0)}, we have the following relations:

V_{(N)} = (V_{(0)}^{-1} + X_{(N)}^T X_{(N)})^{-1},    β̂_{(N)} = V_{(N)} (X_{(N)}^T Y_{(N)} + V_{(0)}^{-1} β̂_{(0)})

A common choice is to put

V_{(0)} = aI,    a > 0

Since V_{(0)} represents the variance of the estimator, choosing a very large a is equivalent to considering the initial estimate of β as very uncertain. By setting a equal to a large number, the RLS algorithm will diverge very rapidly from the initialisation β̂_{(0)}. Therefore, we can force the RLS variance and parameters to be arbitrarily close to the ordinary least-squares values, regardless of β̂_{(0)}. In any case, in the absence of further information, the initial value β̂_{(0)} is usually set equal to a zero vector.

9.1.20.4 RLS with forgetting factor

In some adaptive configurations it can be useful not to give equal importance to all the historical data but to assign higher weights to the most recent data (and thus to forget the oldest ones). This may happen when the phenomenon underlying the data is non-stationary or when we want to approximate a nonlinear dependence by using a linear model which is local in time. Both these situations are common in adaptive control problems.


Figure 9.6: RLS fitting of a nonlinear function, where the arrival order of the data is from left to right.

RLS techniques can deal with these situations through a modification of the formulation (9.1.52) obtained by adding a forgetting factor μ < 1:

V_{(N+1)} = (1/μ) [V_{(N)} − (V_{(N)} x_{N+1}^T x_{N+1} V_{(N)}) / (1 + x_{N+1} V_{(N)} x_{N+1}^T)]
γ_{(N+1)} = V_{(N+1)} x_{N+1}^T
e = y_{N+1} − x_{N+1} β̂_{(N)}
β̂_{(N+1)} = β̂_{(N)} + γ_{(N+1)} e

Note that: (i) the smaller μ, the stronger the forgetting; (ii) for μ = 1 we recover the conventional RLS formulation.

R script

The R script lin rls.R implements the RLS fitting of a nonlinear univariate function. The simulation shows that the fit evolves as the data (x_i, y_i), i = 1, ..., N, are collected. Note that the values x_i, i = 1, ..., N, are increasingly ordered. This means that x is not random and that the oldest collected values are the ones with the lowest x_i.

The final fit for a forgetting factor μ = 0.9 is shown in Figure 9.6. Note that the linear fit concerns only the rightmost points, since the values on the left, which are also the oldest ones, are forgotten.

9.2 Linear approaches to classification

The methods presented so far deal with linear regression tasks. Those methods may be easily extended to classification once we consider that, in a binary 0/1 classification case, the conditional expectation coincides with the conditional probability:

E[y|x] = 1 · Prob{y = 1|x} + 0 · Prob{y = 0|x} = Prob{y = 1|x}    (9.2.53)

In other words, by encoding the two classes as 0/1 values and estimating the conditional expectation with regression techniques, we estimate the conditional probability as well. Such a value may be used to return the most probable class associated with a query point x.

This section will present some additional strategies to learn linear boundaries between classes. The first strategy relies on modelling the class-conditional densities and deriving from them the equation of the boundary region. The other strategies aim to learn directly the equations of the separating hyperplanes.

9.2.1 Linear discriminant analysis

Let x ∈ R^n denote a real-valued random input vector and y a categorical random output variable that takes values in the set {c₁, ..., c_K}, such that

∑_{k=1}^{K} Prob{y = c_k|x} = 1

A classifier can be represented in terms of a set of K discriminant functions g_k(x), k = 1, ..., K, such that the classifier applies the following decision rule [61]: assign a feature vector x to the class ŷ(x) = c_k if

k = arg max_j g_j(x)    (9.2.54)

Section 7.3 showed that in the case of a zero-one loss function (Equation (7.3.13)), the optimal classifier corresponds to a maximum a posteriori discriminant function g_k(x) = Prob{y = c_k|x}. This means that if we are able to define the K functions g_k(·), k = 1, ..., K, and we apply the classification rule (9.2.54) to an input x, we obtain a classifier which is equivalent to the Bayes one.

The discriminant functions divide the feature space into K decision regions D_k, where a decision region D_k is a region of the input space X where the discriminant classifier returns the class c_k for each x ∈ D_k. The regions are separated by decision boundaries, i.e. surfaces in the domain of x where ties occur among the largest discriminant functions.

Example

Consider a binary classification problem where y can take values in {c₁, c₂} and x ∈ R². Let g₁(x) = 3x₁ + x₂ + 2 and g₂(x) = 2x₁ + 2 be the two discriminant functions associated with the classes c₁ and c₂, respectively. The classifier will return the class c₁ if

3x₁ + x₂ + 2 > 2x₁ + 2  ⇔  x₁ > −x₂

The decision regions D₁ and D₂ are depicted in Figure 9.7.

We can multiply all the discriminant functions by the same positive constant or shift them by the same additive constant without influencing the decision [61]. More generally, if we replace every g_k(z) by f(g_k(z)), where f(·) is a monotonically increasing function, the resulting classification is unchanged. For example, in the case of a zero/one loss function, any of the following choices gives an identical classification result and returns a Bayes classifier:

g_k(x) = Prob{y = c_k|x} = p(x|y = c_k) P(y = c_k) / ∑_{k=1}^{K} p(x|y = c_k) P(y = c_k)    (9.2.55)

g_k(x) = p(x|y = c_k) P(y = c_k)    (9.2.56)

g_k(x) = ln p(x|y = c_k) + ln P(y = c_k)    (9.2.57)


Figure 9.7: Decision boundary and decision regions for the binary discriminant functions g₁(x) = 3x₁ + x₂ + 2 and g₂(x) = 2x₁ + 2.


9.2.1.1 Discriminant functions in the Gaussian case

Let us consider a binary classification task where the inverse conditional densities are multivariate normal (Section 3.7), i.e. p(x = x | y = c_k) ∼ N(µ_k, Σ_k), where x ∈ R^n, µ_k is a [n, 1] vector and Σ_k is a [n, n] covariance matrix. Since

p(x = x | y = c_k) = 1 / ((2π)^{n/2} √det(Σ_k)) exp( -1/2 (x - µ_k)^T Σ_k^{-1} (x - µ_k) )

from (9.2.57) we obtain

g_k(x) = ln p(x | y = c_k) + ln P(y = c_k)    (9.2.58)
       = -1/2 (x - µ_k)^T Σ_k^{-1} (x - µ_k) - n/2 ln 2π - 1/2 ln det(Σ_k) + ln P(y = c_k)    (9.2.59)

If we make no assumptions about Σ_k, the discriminant function is quadratic. Now let us consider a simpler case where all the distributions have the same diagonal covariance matrix Σ_k = σ²I, where I is the [n, n] identity matrix. It follows that

det(Σ_k) = σ^{2n},    Σ_k^{-1} = (1/σ²) I

are independent of k and can be ignored by the decision rule (9.2.54). From (9.2.58) we obtain the simpler discriminant function

g_k(x) = -‖x - µ_k‖² / (2σ²) + ln P(y = c_k)
       = -(x - µ_k)^T (x - µ_k) / (2σ²) + ln P(y = c_k)
       = -1/(2σ²) [x^T x - 2µ_k^T x + µ_k^T µ_k] + ln P(y = c_k)

However, since the quadratic term x^T x is the same for all k, this is equivalent to a linear discriminant function

g_k(x) = w_k^T x + w_{k0}    (9.2.60)

where w_k is a [n, 1] vector

w_k = (1/σ²) µ_k    (9.2.61)

and

w_{k0} = -µ_k^T µ_k / (2σ²) + ln P(y = c_k)    (9.2.62)

In the two-class problem, the decision boundary (i.e. the set of points where g_1(x) = g_2(x)) can be obtained by solving the identity

w_1^T x + w_{10} = w_2^T x + w_{20}  ⇔  (w_1 - w_2)^T x - (w_{20} - w_{10}) = 0

We obtain a hyperplane having equation

w^T (x - x_0) = 0    (9.2.63)

where

w = (µ_1 - µ_2) / σ²

and

x_0 = 1/2 (µ_1 + µ_2) - ( σ² / ‖µ_1 - µ_2‖² ) ln( Prob{y = c_1} / Prob{y = c_2} ) (µ_1 - µ_2)

This can be verified by the fact that w^T x_0 = w_{20} - w_{10}. Equation (9.2.63) defines a hyperplane through the point x_0 and orthogonal to the vector w.

9.2.1.2 Uniform prior case

If the prior probabilities P(y = c_k) are identical for the K classes, then the term ln P(y = c_k) is a constant that can be ignored. In this case, it can be shown that the optimum decision rule is a minimum-distance classifier [61]. This means that, in order to classify an input x, it measures the Euclidean distance ‖x - µ_k‖² from x to each of the K mean vectors and assigns x to the category of the nearest mean. It can be shown that in the more generic case Σ_k = Σ, the discriminant rule is based on minimising the Mahalanobis distance

ĉ(x) = arg min_k (x - µ_k)^T Σ^{-1} (x - µ_k)    (9.2.64)

R script

The R script discri.R considers a binary classification task (c_1 = red, c_2 = green) where x ∈ R² and the inverse conditional distributions of the two classes are N(µ_1, σ²I) and N(µ_2, σ²I), respectively. Suppose that the two a priori probabilities are identical, that σ = 1, µ_1 = [-1, -2]^T and µ_2 = [2, 5]^T. The positions of 100 points randomly drawn from N(µ_1, σ²I) and of 100 points drawn from N(µ_2, σ²I), together with the optimal decision boundary computed by (9.2.63), are plotted in Figure 9.8.
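The following minimal R sketch (an illustration under the same assumptions, not the book's discri.R) draws the two Gaussian classes and the boundary of Equation (9.2.63); with equal priors the logarithmic term of x_0 vanishes:

    # Two Gaussian classes with covariance sigma^2*I and the LDA boundary (9.2.63)
    set.seed(0)
    sigma <- 1
    mu1 <- c(-1, -2); mu2 <- c(2, 5)
    X1 <- cbind(rnorm(100, mu1[1], sigma), rnorm(100, mu1[2], sigma))
    X2 <- cbind(rnorm(100, mu2[1], sigma), rnorm(100, mu2[2], sigma))
    w  <- (mu1 - mu2) / sigma^2          # normal vector of the boundary
    x0 <- (mu1 + mu2) / 2                # equal priors: the log-ratio term vanishes
    plot(rbind(X1, X2), col = rep(c("red", "green"), each = 100),
         xlab = "x1", ylab = "x2")
    # w^T (x - x0) = 0  <=>  x2 = x0[2] - (w[1]/w[2]) * (x1 - x0[1])
    abline(a = x0[2] + (w[1] / w[2]) * x0[1], b = -w[1] / w[2])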

The R script discri2.R shows instead the limitations of the LDA approach

when the assumption of Gaussian unimodal class-conditional distributions is not

respected. Suppose that the two a priori probabilities are identical, but that the

class conditional distribution of the green class is a mixture of two Gaussians. The

positions of 1000 points randomly drawn from the two class-conditional distributions

together with the LDA decision boundary computed by (9.2.63) are plotted in Figure

9.9.


Figure 9.8: Binary classification problem: distribution of inputs and linear decision boundary.

Figure 9.9: Binary classification problem where one class distribution is bimodal: distribution of inputs and linear decision boundary. Since the classification task is not linearly separable, the LDA classifier performs poorly.


Figure 9.10: Several hyperplanes separating the two classes (blue and red).

9.2.1.3 LDA parameter identification

In a real setting, we do not have access to the quantities µ_k, Σ and Prob{y = c_k} needed to compute the boundary (9.2.63). Before applying the discrimination rule above, we need to estimate those quantities from the dataset D_N:

P̂rob{y = c_k} = N_k / N    (9.2.65)

µ̂_k = Σ_{i: y_i = c_k} x_i / N_k    (9.2.66)

Σ̂ = Σ_{k=1}^K Σ_{i: y_i = c_k} (x_i - µ̂_k)(x_i - µ̂_k)^T / (N - K)    (9.2.67)

where N_k is the number of observations labelled with the class c_k and (9.2.67) is also known as the pooled covariance [99].
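As an illustration of (9.2.65)-(9.2.67), here is a hedged R sketch (the function lda.estimates is ours, not part of the book's code) computing the plug-in estimates from a dataset (X, y):

    # Plug-in LDA estimates: priors (9.2.65), class means (9.2.66),
    # pooled covariance (9.2.67); y is a vector of class labels
    lda.estimates <- function(X, y) {
      classes <- sort(unique(y)); N <- nrow(X); K <- length(classes)
      prior <- sapply(classes, function(ck) sum(y == ck) / N)
      mu <- t(sapply(classes, function(ck) colMeans(X[y == ck, , drop = FALSE])))
      Sigma <- matrix(0, ncol(X), ncol(X))
      for (k in 1:K) {
        Xc <- scale(X[y == classes[k], , drop = FALSE], center = mu[k, ], scale = FALSE)
        Sigma <- Sigma + t(Xc) %*% Xc
      }
      list(prior = prior, mu = mu, Sigma = Sigma / (N - K))
    }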

9.2.2 Perceptrons

Consider a binary classification task (Figure 9.10) where the two classes are denoted by +1 and -1. The previous section presented a technique to separate input data by a linear boundary by making assumptions on the class-conditional densities and their covariances. In a generic setting, however, the problem is ill-posed and there are infinitely many possible separating hyperplanes (Figure 9.10) characterised by the equation

β_0 + x^T β = 0    (9.2.68)

If x ∈ R², this equation represents a line. In the generic case (x ∈ R^n) some properties hold for all hyperplanes. Since for any two points x_1 and x_2 lying on the hyperplane we have

(x_1 - x_2)^T β = 0

the vector normal to the hyperplane (Figure 9.11) is given by

β* = β / ‖β‖

The signed distance of a point x to the hyperplane (Figure 9.11) is called the geometric margin and is given by

β*^T (x - x_0) = (x^T β - x_0^T β) / ‖β‖ = (x^T β + β_0) / ‖β‖

Figure 9.11: Bi-dimensional space (n = 2): vector β normal to the hyperplane and distance of a point from a hyperplane.

A perceptron is a classifier that uses the sign of the linear combination h(x, β̂) = β̂_0 + β̂^T x to perform classification [98]. The class returned by a perceptron for a given input x_q is

+1 if β̂_0 + x_q^T β̂ = β̂_0 + Σ_{j=1}^n x_qj β̂_j > 0
-1 if β̂_0 + x_q^T β̂ = β̂_0 + Σ_{j=1}^n x_qj β̂_j < 0

In other terms, the decision rule is given by

h(x) = sgn(β̂_0 + x^T β̂)    (9.2.69)

For all well-classified points in the training set the following relation holds

γ_i = y_i (x_i^T β̂ + β̂_0) > 0

where the quantity γ_i is called the functional margin of the pair ⟨x_i, y_i⟩ with respect to the hyperplane (9.2.68). Misclassifications in the training set occur when

y_i = 1 but β̂_0 + β̂^T x_i < 0, or
y_i = -1 but β̂_0 + β̂^T x_i > 0,

i.e. when y_i (β̂_0 + β̂^T x_i) < 0.

The parametric identification step of a perceptron learning procedure aims at finding the values {β̂, β̂_0} that minimise the quantity

SSE_emp(β̂, β̂_0) = - Σ_{i ∈ M} y_i (x_i^T β̂ + β̂_0)

where M is the subset of misclassified points in the training set. Note that this quantity is non-negative and proportional to the distance of the misclassified points to the hyperplane. Since the gradients are

∂SSE_emp(β̂, β̂_0)/∂β̂ = - Σ_{i ∈ M} y_i x_i,    ∂SSE_emp(β̂, β̂_0)/∂β̂_0 = - Σ_{i ∈ M} y_i

a batch gradient descent minimisation procedure (Section 8.6.2.3) or the online

version (Section 8.6.3) can be adopted. This procedure is guaranteed to converge

provided there exists a hyperplane that correctly classifies the data: this configura-

tion is called linearly separable.

Although the perceptron set the foundations for much of the following research in machine learning, a number of problems with this algorithm have to be mentioned [98]:

- when the data are separable, there are many possible solutions, and which one is found depends on the initialisation of the gradient method;
- when the data are not separable, the algorithm will not converge;
- also for a separable problem, the convergence of the gradient minimisation can be very slow.

R script

The script hyperplane.R visualises the evolution of the separating hyperplane during the perceptron learning procedure. We invite the reader to run the script for different numbers of points and different data distributions (e.g. by changing the mean and the variance of the 2D Gaussians).
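For readers without access to the script, the following rough R sketch (ours, not the book's hyperplane.R) runs the online perceptron updates derived from the gradients above on a linearly separable 2D task:

    # Online perceptron: update (beta, beta0) on each misclassified point
    set.seed(1)
    N <- 50
    X <- rbind(cbind(rnorm(N, -2), rnorm(N, -2)), cbind(rnorm(N, 2), rnorm(N, 2)))
    y <- rep(c(-1, 1), each = N)
    beta <- c(0, 0); beta0 <- 0; eta <- 0.1   # eta is the learning rate
    for (epoch in 1:100) {
      for (i in sample(1:(2 * N))) {
        if (y[i] * (sum(X[i, ] * beta) + beta0) <= 0) {  # functional margin <= 0
          beta  <- beta  + eta * y[i] * X[i, ]
          beta0 <- beta0 + eta * y[i]
        }
      }
    }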

A possible solution to the separating hyperplane problem has been proposed by the

SVM technique.

9.2.3 Support vector machines

This technique relies on an optimisation approach to compute the separating hyperplane.

Let us define the geometric margin of a hyperplane with respect to a training dataset as the minimum of the geometric margins of the training points. Also, the margin of a training set is the maximum geometric margin over all hyperplanes. The hyperplane attaining such a maximum is known as the maximal margin hyperplane.

The SVM approach [186] computes the maximal margin hyperplane for a training set. In other words, the SVM optimal separating hyperplane is the one which separates the two classes by maximising the distance to the closest point from both classes. This approach provides a unique solution to the separating hyperplane problem and was shown to lead to good classification performance on real data. The search for the optimal hyperplane is modelled as the optimisation problem

max_{β, β_0} C    (9.2.70)

subject to (1/‖β‖) y_i (x_i^T β + β_0) ≥ C for i = 1, ..., N    (9.2.71)

where the constraint ensures that all the points are at least a distance C from the decision boundary defined by β and β_0. The SVM parametric identification step seeks the largest C that satisfies the constraints and the associated parameters.


Since the hyperplane (9.2.68) is equivalent to the original hyperplane where the parameters β_0 and β have been multiplied by a constant, we can set ‖β‖ = 1/C. The maximisation problem can be reformulated in a minimisation form

min_{β, β_0} (1/2) ‖β‖²    (9.2.72)

subject to y_i (x_i^T β + β_0) ≥ 1 for i = 1, ..., N    (9.2.73)

where the constraints impose a margin around the linear decision of thickness 1/‖β‖. This problem is a convex optimisation problem (Appendix) where the primal Lagrangian is

L_P(β, β_0) = (1/2) ‖β‖² - Σ_{i=1}^N α_i [y_i (x_i^T β + β_0) - 1]    (9.2.74)

and α_i ≥ 0 are the Lagrange multipliers.

Setting the derivatives with respect to β and β_0 to zero, we obtain:

β = Σ_{i=1}^N α_i y_i x_i,    0 = Σ_{i=1}^N α_i y_i    (9.2.75)

Substituting these in the primal form (9.2.74) we obtain

L_D = Σ_{i=1}^N α_i - (1/2) Σ_{i=1}^N Σ_{k=1}^N α_i α_k y_i y_k x_i^T x_k    (9.2.76)

subject to α_i ≥ 0.

The dual optimisation problem is now

max_α Σ_{i=1}^N α_i - (1/2) Σ_{i=1}^N Σ_{k=1}^N α_i α_k y_i y_k x_i^T x_k = max_α Σ_{i=1}^N α_i - (1/2) Σ_{i=1}^N Σ_{k=1}^N α_i α_k y_i y_k ⟨x_i, x_k⟩    (9.2.77)

subject to 0 = Σ_{i=1}^N α_i y_i,    (9.2.78)

α_i ≥ 0,  i = 1, ..., N    (9.2.79)

where ⟨x_i, x_k⟩ is the inner product of x_i and x_k.

Note that the problem formulation requires the computation of all the inner products ⟨x_i, x_k⟩, i = 1, ..., N, k = 1, ..., N. This boils down to the computation of the Gram matrix

G = X X^T    (9.2.80)

It can be shown that the optimal solution must satisfy the Karush-Kuhn-Tucker (KKT) condition

α_i [y_i (x_i^T β + β_0) - 1] = 0,    ∀i

The above condition means that we are in either of these two situations:

1. y_i (x_i^T β + β_0) = 1, i.e. the point is on the boundary of the margin; then α_i > 0;
2. y_i (x_i^T β + β_0) > 1, i.e. the point is not on the boundary of the margin; then α_i = 0.


The training points having an index i such that α_i > 0 are called the support vectors. Given the solution α and the β obtained from (9.2.75), the term β_0 is obtained by

β_0 = -1/2 [β^T x*(1) + β^T x*(-1)]

where we denote by x*(1) some (any) support vector belonging to the first class and by x*(-1) a support vector belonging to the second class. Now, the decision function can be written as

h(x, β, β_0) = sign[x^T β + β_0]

or, equivalently,

h(x, β, β_0) = sign[ Σ_{support vectors} y_i α_i ⟨x_i, x⟩ + β_0 ]    (9.2.81)

This is an attractive property of support vector machines: the classifier can be expressed as a function of a limited number of points of the training set, the so-called support vectors, which are on the boundaries. This means that in SVM all the points far from the class boundary do not play a major role, unlike in the linear discriminant rule where the mean and the variance of the class distributions determine the separating hyperplane (see Equation (9.2.63)). It can also be shown that in the separable case

C = 1/‖β‖ = 1/√(Σ_{i=1}^N α_i)    (9.2.82)

R script

The R script svm.R considers a binary classification problem. It generates sets of

separable data and builds a separating hyperplane by solving the problem (9.2.74).

The training points belonging to the two classes (in red and blue), the separating

hyperplane, the boundary of the margin and the support vectors (in black) are plotted for each training set (see Figure 9.12).
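As a complement, this R sketch (ours, not the book's svm.R) solves the hard-margin dual (9.2.77)-(9.2.79) with the quadprog package; the small ridge added to the quadratic term is only a numerical device to keep it positive definite:

    # Hard-margin SVM dual solved by quadratic programming
    library(quadprog)
    set.seed(2)
    N <- 40
    X <- rbind(cbind(rnorm(N / 2, -2), rnorm(N / 2, -2)),
               cbind(rnorm(N / 2,  2), rnorm(N / 2,  2)))
    y <- rep(c(-1, 1), each = N / 2)
    Q <- (y %*% t(y)) * (X %*% t(X))      # Q[i,k] = y_i y_k <x_i, x_k>
    sol <- solve.QP(Dmat = Q + 1e-8 * diag(N), dvec = rep(1, N),
                    Amat = cbind(y, diag(N)),  # 1st column: equality sum(alpha_i y_i) = 0
                    bvec = rep(0, N + 1), meq = 1)
    alpha <- sol$solution
    beta  <- colSums(alpha * y * X)       # Eq. (9.2.75)
    sv    <- which(alpha > 1e-5)          # support vectors
    beta0 <- mean(y[sv] - X[sv, , drop = FALSE] %*% beta)  # y_i (x_i^T beta + beta0) = 1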

A modification of the formulation (9.2.70) occurs when we suppose that the classes are not linearly separable. In this case the dual problem (9.2.77) is unbounded. The idea is still to maximise the margin, but by allowing some points to be misclassified. For each example ⟨x_i, y_i⟩ we define the slack variable ξ_i and we relax the constraints (9.2.71) into

(1/‖β‖) y_i (x_i^T β + β_0) ≥ C (1 - ξ_i)  for i = 1, ..., N    (9.2.83)

ξ_i ≥ 0    (9.2.84)

Σ_{i=1}^N ξ_i ≤ γ    (9.2.85)

The value ξ_i represents the proportional amount by which the quantity y_i (x_i^T β + β_0) can be lower than C, and the norm ‖ξ‖ measures how much the training set fails to have a margin C. Note that, since misclassifications occur when ξ_i > 1, the upper bound γ of Σ_{i=1}^N ξ_i represents the maximum number of allowed misclassifications in the training set.

Figure 9.12: Maximal margin hyperplane for a binary classification task with the support vectors in black.

It can be shown [98] that the maximisation (9.2.70) with the above constraints can be put in the equivalent quadratic form

max_α Σ_{i=1}^N α_i - (1/2) Σ_{i=1}^N Σ_{k=1}^N α_i α_k y_i y_k x_i^T x_k    (9.2.86)

subject to 0 = Σ_{i=1}^N α_i y_i,    (9.2.87)

0 ≤ α_i ≤ γ,  i = 1, ..., N    (9.2.88)

The decision function takes again the form (9.2.81), where β_0 is chosen so that y_i h(x_i) = 1 for any i such that 0 < α_i < γ. The geometric margin takes the value

C = ( Σ_{i=1}^N Σ_{k=1}^N α_i α_k y_i y_k x_i^T x_k )^{-1/2}    (9.2.89)

Note that the set of points for which the corresponding slack variables satisfy ξ_i > 0 are also the points for which α_i = γ.

R script

The R script svm.R solves a non separable problem by setting the boolean variable

separable to FALSE. Figure 9.13 plots: the training points belonging to the two

classes (in red and blue), the separating hyperplane, the boundary of the margin,

the support vectors (in black), the points of the red class for which the slack variable

is positive (in yellow) and the points of the blue class for which the slack variable

is positive (in green).

Figure 9.13: Maximal margin hyperplane for a non-separable binary classification task for different values of C: support vectors are in black, the slack points of the red class are in yellow and the slack points of the blue class are in green.

Once the value γ is fixed, the parametric identification in the SVM approach boils down to a quadratic optimisation problem for which a large number of methods and numerical software packages exist. The value γ plays the role of a capacity hyper-parameter which bounds the total proportional amount by which classifications fall on the wrong side of the margin. In practice, the choice of this parameter requires a structural identification loop where the parameter γ is varied through a wide range of values and assessed through a validation strategy.
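In practice such a loop can be implemented with off-the-shelf tools; the sketch below uses the e1071 package, whose cost argument is the capacity constant of the usual soft-margin formulation (closely related to, though parameterised differently from, the γ above). X and y are assumed to come from a binary classification task:

    # Cross-validated selection of the capacity hyper-parameter (sketch)
    library(e1071)
    costs <- 10^seq(-2, 2)                 # wide range of capacity values
    tuned <- tune.svm(x = X, y = factor(y), kernel = "linear", cost = costs)
    tuned$best.parameters                  # value with lowest cross-validated error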

9.3 Conclusion

In this chapter we considered input/output regression problems where the relationship between input and output is linear, and classification problems where the optimal decision boundaries are linear.

The advantages of linear models are numerous:

- the least-squares estimate β̂ can be expressed in an analytical form and can be easily calculated through matrix computation;
- the statistical properties of the estimator can be easily defined;
- recursive formulations for sequential updating are available.

Unfortunately, in real problems, it is extremely unlikely that the input and output variables are linked by a linear relation. Moreover, the form of the relationship is often unknown, and only a limited amount of observations is available. For this reason, machine learning has proposed a number of nonlinear approaches to address nonlinear tasks.

9.4 Exercises

1. Consider an input/output regression task where n = 1, E[y|x] = sin(x) and p(y|x) ∼ N(sin(x), 1). Let N = 100 be the size of the training set and consider a quadratic loss function. Let the class of hypotheses be h_M(x) = α_0 + Σ_{m=1}^M α_m x^m.

1. Estimate the parameters by least-squares.
2. Compute the error by leave-one-out and by using the PRESS statistic.
3. Plot the empirical error as a function of the degree M for M = 0, 1, ..., 7.
4. Plot the leave-one-out error as a function of the degree M for M = 0, 1, ..., 7.


2. Consider a univariate linear regression problem. Write an R script which, using Monte Carlo simulation, validates the formula (9.1.7) for at least three regression tasks differing in terms of

- parameters β_0, β_1,
- variance σ²,
- number N of observations.

3. Consider a univariate linear regression problem. Write an R script which, using Monte Carlo simulation, validates the formula (9.1.8) for at least three regression tasks differing in terms of

- parameters β_0, β_1,
- variance σ²,
- number N of observations.

4. Consider a univariate linear regression problem. Write an R script which, using Monte Carlo simulation, shows that the least-squares estimates of β_0 and β_1 minimise the quantity (9.1.10) for at least three regression tasks differing in terms of

- parameters β_0, β_1,
- variance σ²,
- number N of observations.

5. Consider a regression task with input x and output y. Suppose we observe the following training set:

    X     Y
    0.1   1
    0     0.5
   -0.3   1.2
    0.2   1
    0.4   0.5
    0.1   0
   -1     1.1

1. Fit a linear model to the dataset.
2. Trace the data and the linear regression function on graph paper.
3. Are the two variables positively or negatively correlated?

Hint: for

A = [a_11  a_12]      A^{-1} = 1/(a_11 a_22 - a_12²) [ a_22  -a_12]
    [a_12  a_22]                                      [-a_12   a_11]

Solution:

1. Once we set

X = [1   0.1]
    [1   0  ]
    [1  -0.3]
    [1   0.2]
    [1   0.4]
    [1   0.1]
    [1  -1  ]

we have

X^T X = [ 7.0  -0.50]
        [-0.5   1.31]

and

β̂ = (X^T X)^{-1} X^T Y = [ 0.725]
                          [-0.456]


2. [Figure: scatter plot of the data points in the (x, y) plane with the fitted regression line ŷ = 0.725 - 0.456x.]

3. Since β̂_1 < 0, the two variables are negatively correlated.

6. Let us consider the dependency where the conditional distribution of y is

y = 1 - x + x² - x³ + w

with w ∼ N(0, σ²) and σ = 0.5. Suppose that x ∈ R takes the values seq(-1, 1, length.out = N) (with N = 50). Consider the family of regression models

h^(m)(x) = β_0 + Σ_{j=1}^m β_j x^j

where p denotes the number of weights of the polynomial model h^(m) of degree m. Let M̂ISE_emp^(m) denote the least-squares empirical risk and MISE^(m) the mean integrated empirical risk. By using Monte Carlo simulation and for m = 0, ..., 6:

- plot E[M̂ISE_emp^(m)] as a function of p,
- plot MISE^(m) as a function of p,
- plot the difference E[M̂ISE_emp^(m)] - MISE^(m) as a function of p and compare it with the theoretical result seen during the class.

For a single observed dataset:

- plot M̂ISE_emp^(m) as a function of the number of model parameters p,
- plot PSE as a function of p,
- discuss the relation between arg min_m M̂ISE_emp^(m) and arg min_m PSE(m).

Solution: See the file Exercise2.pdf in the directory gbcode/exercises of the companion R package gbcode (Appendix F).

Chapter 10

Nonlinear approaches

This chapter will present several algorithms proposed in the machine learning literature to deal with nonlinear regression and nonlinear classification tasks. Over the years, statisticians and machine learning researchers have proposed a number of nonlinear approaches with the aim of finding approximators able to combine high generalisation with effective learning procedures. The presentation of these techniques could be organised according to several criteria and principles. In this chapter, we will focus on the distinction between global and divide-and-conquer approaches.

A family of models traditionally used in supervised learning is the family of

global models which describes the relationship between the input and the output

values as a single analytical function over the whole input domain (Fig. 10.1). In

general, this makes sense when it is reasonable to believe that a physical-like law

describes the data over the whole set of operating conditions. Examples of well-

known global parametric models in the literature are the linear models discussed in

the previous chapter, generalised linear models and neural networks which will be

presented in Section 10.1.1.

A nice property of global modelling is that, even for huge datasets, the storage of

a parametric model requires a small amount of memory. Moreover, the evaluation

of the model requires a short program that can be executed in a reduced amount

of time. These features have undoubtedly contributed to the success of the global

approach in years when most computing systems imposed severe limitations on

users.

However, for a generic global model, the parametric identification (Section 7.2)

consists of a nonlinear optimisation problem (see Equation 7.2.7) which is not an-

alytically tractable due to the numerous local minima and for which only a sub-

optimal solution can be found through a slow iterative procedure. Similarly, the

problem of selecting the best model structure in a generic nonlinear case cannot be

handled in analytical form and requires time-consuming validation procedures.

For these reasons, alternatives to global modelling techniques, such as the divide-and-conquer approach, gained popularity in the modelling community. The divide-and-conquer principle consists in attacking a complex problem by dividing it into simpler problems whose solutions can be combined to yield a solution to the original problem. This principle presents two main advantages. The first is that simpler problems can be solved with simpler estimation techniques: in statistical language, this means adopting linear techniques, well studied and developed over the years. The second is that the learning method can better adjust to the properties of the available dataset. Training data are rarely distributed uniformly in the input space. Whenever the distribution of patterns in the input space is uneven, a proper local adjustment of the learning algorithm can significantly improve the overall performance.


Figure 10.1: A global model (solid line) which fits the training set (dotted points)

for a learning problem with one input variable (x-axis) and one output variable

(y-axis).

Figure 10.2: Function estimation (model induction + model evaluation) vs. value

estimation (direct prediction from data).

We will focus on two main instances of the divide-and-conquer principle: the

modular approach, which originated in the field of system identification, and the

local modelling approach, which was first proposed in the nonparametric statistical

literature.

Modular architectures are input/output approximators composed of a number

of modules which cover different regions of the input space. This is the idea of

operating regimes which propose a partitioning of the operating range of the system

as a more effective way to solve modelling problems (Section 10.1.3).

Although these architectures are a modular combination of local models, their

learning procedure is still performed on the basis of the whole dataset. Hence,

learning in modular architectures remains a functional estimation problem, with the

advantage that the parametric identification can be made simpler by the adoption

of local linear modules. However, in terms of structural identification, the problem

is still nonlinear and requires the same procedures used for generic global models.

A second example of divide-and-conquer methods are local modelling techniques (Section 10.1.11), which turn the problem of function estimation into a problem of value estimation. The goal is not to model the whole statistical phenomenon but to return the best output for a given test input, hereafter called the query. The

motivation is simple: why should the problem of estimating the values of an un-

known function at given points of interest be solved in two stages? Global modelling

techniques first estimate the function (induction) and second estimate the values

of the function using the estimated function (deduction). In this two-stage scheme

one actually tries to solve a relatively simple problem (estimating the values of a

function at given points of interest) by first solving, as an intermediate problem, a

much more difficult one (estimating the function).

Local modelling techniques take an alternative approach, defined as transduc-

tion by Vapnik [186] (Fig. 10.2). They focus on approximating the function only


Figure 10.3: Local modelling of the input/output relationship between the input

variable x and the output variable y, on the basis of a finite set of observations

(dots). The value of the variable y for x =q is returned by a linear model (solid

line) which fits the training points in a neighbourhood of the query point (bigger

dots).

in the neighbourhood of the point to be predicted. This approach requires keeping the dataset in memory for each prediction, instead of discarding it as in the global modelling case. At the same time, local modelling requires only simple approximators, e.g. constant and/or linear, to model the dataset in a neighbourhood of the query point. An example of local linear modelling in the case of a single-input

single-output mapping is presented in Fig. 10.3.

Many names have been used in the past to label variations of the local modelling

approach: memory-based reasoning [174], case-based reasoning [121], local weighted

regression [44], nearest neighbor [47], just-in-time [49], lazy learning [5], exemplar-

based, instance based [4],... These approaches are also called nonparametric in the

literature [96, 170], since they relax the assumptions on the form of a regression

function, and let the data search for a suitable function that describes well the

available data.

In the following, we will present in detail some machine learning techniques for

nonlinear regression and classification.

10.1 Nonlinear regression

A general way of representing the unknown input/output relation in a regression

setting is the regression plus noise form (7.4.21) where f (· ) is a deterministic func-

tion and the term wrepresents the noise or random error. It is typically assumed

that w is independent of x and E [w ] = 0. Suppose that we collect a training

set {hxi , yi i:i = 1, . . . , N } with xi = [xi1 , . . . , xin ]T , generated according to the

model (7.4.21). The goal of a learning procedure is to find a model h (x ) which is

able to give a good approximation of the unknown function f (x ).

Example

Consider an input/output mapping represented by the Dopler function

f(x) = 20 √(x(1 - x)) sin(2π · 1.05 / (x + 0.05))    (10.1.1)

distorted by additive Gaussian noise w with unit variance.

248 CHAPTER 10. NONLINEAR APPROACHES

Figure 10.4: Training set obtained by sampling uniformly in the input domain of a

Dopler function distorted with Gaussian noise.

The training set is made of N = 2000 points obtained by sampling the input domain X = [0.12, 1] through a random uniform distribution (Fig. 10.4). This stochastic dependency and the related training dataset (see R script dopler.R) will be used to assess the performance of the techniques we are going to present.

10.1.1 Artificial neural networks

Artificial neural networks (ANN), aka neural nets, are parallel, distributed information processing computational models which draw their inspiration from neurons in the brain. However, one of the most important trends in recent neural computing has been to move away from a biologically inspired interpretation of neural networks towards a more rigorous and statistically founded interpretation based on results deriving from statistical pattern recognition theory.

The main class of neural network used in supervised learning for classification and regression is the feed-forward network, aka multi-layer perceptron (MLP). Feed-forward ANNs (FNNs) have been applied to a wide range of prediction tasks in such diverse fields as speech recognition, financial prediction, image compression and adaptive industrial control.

10.1.1.1 Feed-forward architecture

Feed-forward NN have a layered architecture, with each layer comprising one or

more simple processing units called artificial neurons or nodes (Figure 10.5). Each

node is connected to one or more other nodes by real-valued weights (in the following

we will refer to them as parameters) but not to nodes in the same layer. All FNN

have an input layer and an output layer. FNNs are generally implemented with an

additional node, called the bias1unit , in all layers except the output layer. This

1Note that this has nothing to do with the estimator bias concept. In neural network literature,

bias is used to denote the intercept term

10.1. NONLINEAR REGRESSION 249

Figure 10.5: Two-layer feed-forward NN

node plays the role of the intercept term β0 in linear models.

For simplicity, henceforth, we will consider only FNNs with a single output. Let

- n be the number of inputs,
- L the number of layers,
- H^(l) the number of hidden units of the l-th layer (l = 1, ..., L) of the FNN,
- w_kv^(l) the weight of the link connecting the k-th node in the (l-1)-th layer and the v-th node in the l-th layer,
- z_v^(l), v = 1, ..., H^(l), the output of the v-th hidden node of the l-th layer,
- z_0^(l) the bias for the l-th layer, l = 1, ..., L.

Let H^(0) = n and z_v^(0) = x_v, v = 0, ..., n.

For l ≥ 1 the output of the v-th (v = 1, ..., H^(l)) hidden unit of the l-th layer is obtained by first forming a weighted linear combination of the H^(l-1) outputs of the lower level

a_v^(l) = Σ_{k=1}^{H^(l-1)} w_kv^(l) z_k^(l-1) + w_0v^(l) z_0^(l-1),    v = 1, ..., H^(l)

and then by transforming the sum using an activation function to give

z_v^(l) = g^(l)(a_v^(l)),    v = 1, ..., H^(l)

The activation function g^(l)(·) is typically a nonlinear transformation like the logistic or sigmoid function

g^(l)(z) = 1 / (1 + e^{-z})    (10.1.2)


For L = 2 (i.e. a single hidden layer, or two-layer feed-forward NN), the input/output relation is given by

ŷ = h(x, α_N) = g^(2)(a_1^(2)) = g^(2)( Σ_{k=1}^H w_k1^(2) z_k + w_01^(2) z_0 )

where

z_k = g^(1)( Σ_{j=1}^n w_jk^(1) x_j + w_0k^(1) x_0 ),    k = 1, ..., H

Note that if g^(1)(·) and g^(2)(·) are linear mappings, this functional form becomes linear.

Once the number of inputs and the form of the function g(·) are given, two quantities remain to be chosen: the values of the weights w^(l), l = 1, 2, and the number of hidden nodes H. Note that the set of weights of an FNN represents the set of parameters α_N introduced in Section 7.1 when the hypothesis function h(·) is modelled by an FNN. The calibration of the weights on the basis of a training dataset represents the parametric identification procedure in neural networks. This procedure is normally carried out by a back-propagation algorithm, which will be discussed in the following section.

The number H of hidden nodes represents the complexity s in Equation (7.9.54). By increasing the value of H, we increase the class of input/output functions that can be represented by the FNN. In other terms, the choice of the number of hidden nodes affects the representation power of the FNN approximator and constitutes the structural identification procedure in FNNs (Section 10.1.1.3).

10.1.1.2 Back-propagation

Back-propagation is an algorithm which, once the number of hidden nodes H is given, estimates the weights α_N = {w^(l), l = 1, 2} on the basis of the training set D_N. It is a gradient-based algorithm which aims to minimise the non-convex cost function

SSE_emp(α_N) = Σ_{i=1}^N (y_i - ŷ_i)² = Σ_{i=1}^N (y_i - h(x_i, α_N))²

where α_N = {w^(l), l = 1, 2} is the set of weights.

The back-propagation algorithm exploits the network structure and the differentiable nature of the activation functions in order to compute the gradient recursively.

The simplest (and least effective) back-prop algorithm is an iterative gradient descent based on the iterative formula

α_N(k+1) = α_N(k) - η ∂SSE_emp(α_N(k)) / ∂α_N(k)    (10.1.3)

where α_N(k) is the weight vector at the k-th iteration and η is the learning rate, which indicates the relative size of the change in weights.

The weights are initialised with random values and are changed in a direction that will reduce the error. Some convergence criterion is used to terminate the algorithm. This method is known to be inefficient, since many steps are needed to reach a stationary point and no monotone decrease of SSE_emp is guaranteed. More effective versions of the algorithm are based on the Levenberg-Marquardt algorithm (Section 8.6.2.6). Note that this algorithm presents all the typical drawbacks of gradient-based procedures discussed in Section 8.6.4, like slow convergence, convergence to local minima and sensitivity to the weights initialisation.


Figure 10.6: Single-input single-output neural network with one hidden layer, two

hidden nodes and no bias units.

In order to better illustrate how the derivatives are computed in (10.1.3), let us consider a simple single-input (i.e. n = 1) single-output neural network with one hidden layer, two hidden nodes and no bias units (Figure 10.6). Since

a_1(x) = w_11^(2) z_1 + w_21^(2) z_2

the FNN predictor takes the form

ŷ(x) = h(x, α_N) = g(a_1(x)) = g(w_11^(2) z_1 + w_21^(2) z_2) = g(w_11^(2) g(w_11^(1) x) + w_21^(2) g(w_12^(1) x))

where α_N = [w_11^(1), w_12^(1), w_11^(2), w_21^(2)]. The backprop algorithm needs the derivatives of SSE_emp with respect to each weight w ∈ α_N. Since for each w ∈ α_N

∂SSE_emp/∂w = -2 Σ_{i=1}^N (y_i - ŷ(x_i)) ∂ŷ(x_i)/∂w

and the terms (y_i - ŷ(x_i)) are easy to compute, we focus on ∂ŷ/∂w.

As far as the weights {w_11^(2), w_21^(2)} of the hidden/output layer are concerned, we have

∂ŷ(x)/∂w_v1^(2) = (∂g/∂a_1^(2)) (∂a_1^(2)/∂w_v1^(2)) = g'(a_1^(2)(x)) z_v(x),    v = 1, 2    (10.1.4)

where

g'(z) = e^{-z} / (1 + e^{-z})²

As far as the weights {w_11^(1), w_12^(1)} of the input/hidden layer are concerned,

∂ŷ(x)/∂w_1v^(1) = (∂g/∂a_1^(2)) (∂a_1^(2)/∂z_v) (∂z_v/∂a_v^(1)) (∂a_v^(1)/∂w_1v^(1)) = g'(a_1^(2)(x)) w_v1^(2) g'(a_v^(1)(x)) x    (10.1.5)

where the term g'(a_1^(2)(x)) has already been obtained during the computation of (10.1.4). The computation of the derivatives with respect to the weights of the lower layers relies on terms which have been used in the computation of the derivatives with respect to the weights of the upper layers. In other terms, there is a sort of back-propagation of numerical terms from the upper layer to the lower layers, which justifies the name of the procedure.

Figure 10.7: Neural network fitting with s = 2 hidden nodes. The red continuous line represents the neural network estimation of the Dopler function.
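To make the computation tangible, here is a small didactic R sketch (ours, not from the book) that evaluates the gradients (10.1.4)-(10.1.5) for the toy network of Figure 10.6 and verifies them by finite differences:

    g  <- function(z) 1 / (1 + exp(-z))            # sigmoid (10.1.2)
    gp <- function(z) exp(-z) / (1 + exp(-z))^2    # its derivative g'
    # w = (w11^(1), w12^(1), w11^(2), w21^(2)); returns d yhat / d w
    grads <- function(w, x) {
      a1 <- w[1] * x; a2 <- w[2] * x               # hidden activations a_v^(1)
      z1 <- g(a1);    z2 <- g(a2)
      a  <- w[3] * z1 + w[4] * z2                  # output activation a_1^(2)
      c(gp(a) * w[3] * gp(a1) * x,                 # (10.1.5), v = 1
        gp(a) * w[4] * gp(a2) * x,                 # (10.1.5), v = 2
        gp(a) * z1, gp(a) * z2)                    # (10.1.4)
    }
    yhat <- function(w, x) { z <- g(w[1:2] * x); g(w[3] * z[1] + w[4] * z[2]) }
    set.seed(0); w <- runif(4); x <- 0.3
    num <- sapply(1:4, function(j) { e <- rep(0, 4); e[j] <- 1e-6
      (yhat(w + e, x) - yhat(w - e, x)) / 2e-6 })  # finite-difference check
    max(abs(grads(w, x) - num))                    # should be close to zero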

R example

Tensorflow [1] is a open-source library developed by Google2 which had a great

success in recent years as flexible environment for building and training of neural

network architectures. In particular this library provides automatic differentiation

functionalities to speed up the backpropagation implementation.

The script tf nn.R uses the R tensorflow package (a wrapper over the Python library) to compute the derivatives (10.1.5) and (10.1.4) for the network in Figure 10.6 and checks that the TensorFlow result coincides with the one derived analytically.

R example

The FNN learning algorithm for a single-hidden-layer architecture is implemented by the R library nnet. The script nnet.R shows the prediction accuracy for different numbers of hidden nodes (Figure 10.7 and Figure 10.8).
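A minimal sketch in the same spirit (ours, not the book's nnet.R), fitting a single-hidden-layer network to noisy Dopler data generated as in (10.1.1):

    library(nnet)
    set.seed(0)
    N <- 2000
    x <- runif(N, 0.12, 1)                 # input domain of Fig. 10.4
    f <- function(x) 20 * sqrt(x * (1 - x)) * sin(2 * pi * 1.05 / (x + 0.05))
    y <- f(x) + rnorm(N)                   # unit-variance Gaussian noise
    fit <- nnet(x = matrix(x), y = y, size = 7, linout = TRUE, maxit = 500)
    xs <- seq(0.12, 1, length.out = 500)
    plot(x, y, col = "gray")
    lines(xs, predict(fit, matrix(xs)), col = "red", lwd = 2)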



Figure 10.8: Neural network fitting with s = 7 hidden nodes. The continuous red

line represents the neural network estimation of the Dopler function.

10.1.1.3 Approximation properties

Let us consider a two-layer FNN with sigmoidal hidden units. This has proven to

be an important class of network for practical applications. It can be shown that

such networks can approximate arbitrarily well any functional (one-one or many-

one) continuous mapping from one finite-dimensional space to another, provided

the number H of hidden units is sufficiently large. Note that although this result

is remarkable, it is of no practical use. No indication is given about the number of

hidden nodes to choose for a finite number of observations and a generic nonlinear

mapping.

In practice, the choice of the number of hidden nodes requires a structural identi-

fication procedure (Section 8.8) which assesses and compares several different FNN

architectures before choosing the ones expected to be the closest to the optimum.

Cross-validation techniques or regularisation strategies based on complexity-based

criteria (Section 8.8.2.5) are commonly used for this purpose.

Example

This example presents the risk of overfitting when the structural identification of

a neural network is carried out on the basis of the empirical risk and not on less

biased estimates of the generalisation error.

Consider a dataset DN = {xi , yi } ,i = 1, . . . , N where N = 50 and

x∈ N

[0,0,0],

100

010

001

is a 3-dimensional vector. Suppose that y is linked to x by the input/output rela-

254 CHAPTER 10. NONLINEAR APPROACHES

tionship

y= x2

1+ 4 log(|x 2 |)+5 x3

where xi is the i th component of the vector x. Consider as non-linear model a single-

hidden-layer neural network (implemented by the R package nnet ) with s = 15

hidden neurons. We want to estimate the prediction accuracy on a new i.i.d dataset

of Nts = 50 examples. Let us train the neural network on the whole training set by

using the R script cv.R. The empirical prediction MISE error is

\

MISEemp = 1

N

N

X

i=1

(yi h(xi , αN ))2 = 1. 6 106

where αN is obtained by the parametric identification step. However, if we test

h(· , αN ) on the test set we obtain

\

MISEts = 1

Nts

Nts

X

i=1

(yi h (xi , αN ))2 = 22.41

This neural network is seriously overfitting the dataset. The empirical error is a

very bad estimate of the MISE.

We perform now a K-fold cross-validation in order to have a better estimate of

MISE, where K = 10. The K = 10 cross-validated estimate of MISE is

\

MISECV = 24.84

This figure is a much more reliable estimation of the prediction accuracy.

The leave-one-out estimate K =N = 50 is

\

MISEloo = 19.47

It follows that the cross-validated estimate could be used to select a more ap-

propriate number of hidden neurons.

10.1.2 From shallow to deep learning architectures

Until 2006, FNNs with more than two layers were rarely used in the literature because of poor training and large generalisation errors. The common belief was that the solutions returned by deep neural networks were worse than the ones obtained with shallower networks. This was mainly attributed to two aspects: i) gradient-based training of deep supervised FNNs gets stuck in local minima or plateaus, and ii) the higher the number of layers in a neural network, the smaller the impact of back-propagation on the first layers.

However, an incredible resurgence of the domain occurred from 2006 on, when some teams (notably the Bengio team in Montreal, the Hinton team in Toronto and the Le Cun team at Facebook; Yoshua Bengio, Geoffrey Hinton and Yann LeCun were later awarded the 2018 Turing Award, known as the Nobel Prize of computing) were able to show that some adaptations of the FNN algorithm could remedy the above-mentioned problems and lead to major accuracy improvements with respect to other learning machines. In particular, deep architectures (containing up to hundreds of layers) showed a number of advantages:

architectures (containing up to hundreds of layers) showed a number of advantages:

some highly nonlinear functions can be represented much more compactly

with deep architectures than with shallow ones,

3Yoshua Bengio, Geoffrey Hinton, and Yann LeCun were awarded the 2018 Turing Award,

known as the Nobel Prize of computing

10.1. NONLINEAR REGRESSION 255

the XOR parity function for n -bit inputs4 can be coded by a feed-forward

neural network with O (log n ) hidden layers and O (n ), neurons, while a feed-

forward neural network with only one hidden layer needs an exponential num-

ber of the same neurons to perform the same task,

DL allows automatic generation and extraction of new features in large di-

mensional tasks with spatial dependency and location invariance,

DL allows easy management of datasets where inputs and outputs are stored

in tensors (multidimensional matrices),

DL relies on the learning of successive layers of increasingly meaningful rep-

resentations of input data (layered representation learning) and is a powerful

automatic alternative to time-consuming human crafted feature engineering,

portions of DL pre-trained networks may be reused for similar, yet different,

tasks (transfer learning) or calibrated in online learning pipelines (continuous

learning),

iterative gradient optimisation (Section 8.6.3) is a very effective manner of

ingesting huge amount of data in large networks.

new activation functions and weight-initialisation schemes (e.g. layer-wise

pretraining) improve the training process.

Also, new network architectures were proposed, like auto-encoders or convolu-

tional networks. An auto-encoder is a multi-input multi-output neural network that

maps its input to itself. It has a hidden layer that describes a code used to repre-

sent the input and is composed of two parts: an encoder function and a decoder

that produces a reconstruction. They can be used for dimensionality reduction or

compression (if the number of hidden nodes is smaller than the number of inputs).

Convolutional networks are biologically inspired architectures imitating the pro-

cessing of cortical cells. They are ideal for taking into consideration local and spatial

correlation and consist of a combination of convolution, pooling and normalisation

steps applied to inputs taking the generic form of tensors. The convolution phase

applies a number of filters with shared weights to the same image. It ensures trans-

lation invariance since the weights depend on spatial separation and not on absolute

positions. Pooling is a way to take large images and shrink them down while pre-

serving the most important information in them. This step allows the creation

of new features as the combination of previous level features. The normalisation

ensures that every negative value is set to zero.

Those works had such a major impact on theoretical and applied research that nowadays deep learning is a de facto synonym of the entire machine learning domain and, more generally, of AI. This comeback has been supported by a number of headlines in the news, like the success of a deep learning solution in the ImageNet Large-Scale Visual Recognition Competition (2012), bringing down the state-of-the-art error rate from 26.1% to 15.3%, or the DL program AlphaGo, developed by the company DeepMind, beating the no. 1 human Go player. Other impressive applications of deep learning are near-human-level speech recognition, near-human-level handwriting transcription, autonomous cars (e.g. traffic sign recognition, pedestrian detection), image segmentation (e.g. face detection), analysis of particle accelerator data in physics, prediction of mutation effects in bioinformatics and machine translation (LSTM models of sequence-to-sequence relationships).



The breakthroughs of DL in the AI community have been acknowledged by the attribution of the 2018 ACM Turing Award to Bengio, Hinton and Le Cun.

The domain is so large and rich that the most honest recommendation of the

author is to refer the reader, for more details, to seminal books [85] and articles [122]

authored by the pioneers in this domain. Nevertheless, we would like to make a

number of pedagogical considerations about the role of deep learning with respect

to other learning machines:

- DL models are not faithful models of the brain;
- the astonishing success of DL has made it a privileged approach in recent years, but it should definitely not be considered a machine learning panacea;
- DL, like all machine learning techniques, relies on a number of hyper-parameters which affect its capacity, its bias/variance trade-off and the expected generalisation power. The setting of those parameters has a major impact on the generalisation power. An important factor in the recent success of deep network learning is the effective integration of computational strategies already adopted in other learning approaches, like regularisation, averaging and resampling;
- the success of DL, though fulgurant, is often restricted to some specific perceptual tasks: e.g. convolutional networks have been explicitly designed to process data that come in the form of multiple arrays (1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images);
- there is no evidence that representation learning is by default a better strategy than feature engineering: it is surely less biased but very probably more variant;
- DL is particularly successful in tasks where it is possible to collect (and label) huge amounts of examples; nevertheless, there are still a number of challenging tasks where the number of examples is typically low or scarce (e.g. bioinformatics or time series forecasting);
- the success of DL has been amplified by the advent of fast parallel graphics processing units (GPUs), tensor processing units (TPUs) and related libraries (e.g. TensorFlow, Keras, PyTorch) that are convenient to program and allow researchers to train networks 10 or 20 times faster;
- any assumption of a priori superiority of DL over other techniques for a given learning task is more often due to hype than to a scientific attitude, which should instead rely on the validation of a number of alternative strategies and the pondering of different criteria (accuracy, computational cost, energy consumption, interpretability).

Exercise

The script keras.regr.R compares a Keras [43] implementation of a DNN and a Random Forest (Section 11.4) in a very simple nonlinear regression task where a single input out of n is informative about the target. The default setting of the DNN is very disappointing in terms of NMSE accuracy (8.10.41) with respect to the Random Forest. We invite the reader to spend some time performing DNN model selection (e.g. by changing the architecture, tuning the number of layers and/or the number of nodes per layer) or increasing the number of training points to bring the DNN accuracy closer to the RF one. Is that easy? Is that fast? What is your opinion?


10.1.3 From global modelling to divide-and-conquer

Neural networks are a typical example of global modelling. Global models have

essentially two main properties. First, they make the assumption that the relation-

ship between the inputs and the output values can be described by an analytical

function over the whole input domain. Second, they solve the problem of learning

as a problem of function estimation: given a set of data, they extract the hypothesis which is expected to best approximate the whole data distribution (Chapter 7).

The divide-and-conquer paradigm originates from the idea of relaxing the global

modelling assumptions. It attacks a complex problem by dividing it into simpler

problems whose solutions can be combined to yield a solution to the original prob-

lem. This principle presents two main advantages. The first is that simpler problems can be solved with simpler estimation techniques; in statistics, this means adopting linear techniques, well studied and developed over the years. The second is that the

learning method can better adjust to the properties of the available dataset.

The divide-and-conquer idea evolved in two different paradigms: the modular

architectures and the local modelling approach.

Modular techniques replace a global model with a modular architecture where

the modules cover different parts of the input space. This is the idea of operating

regimes which assume a partitioning of the operating range of the system in order

to solve modelling and control problems [111]. The following sections will introduce

some examples of modular techniques.

10.1.4 Classification and Regression Trees

The use of tree-based classification and regression dates back to the work of Morgan

and Sonquist in 1963. Since then, methods of tree induction from samples have

been an active topic in the machine learning and the statistics community. In

machine learning the most representative methods of decision-tree induction are

the ID3 [158] and the C4 [159] algorithms. Similar techniques were introduced

in statistics by Breiman et al. [37], whose methodology is often referred to as the

CART (Classification and Regression Trees) algorithm.

A decision tree (see Fig. 10.9) partitions the input space into mutually exclusive regions, each of which is assigned a procedure to characterise its data points (see Fig. 10.10).

The nodes of a decision tree can be classified in internal nodes and terminal

nodes. An internal node is a decision-making unit that evaluates a decision function

to determine which child node to visit next. A terminal node or leaf has no child

nodes and is associated with one of the partitions of the input space. Note that

each terminal node has a unique path that leads from the root to itself.

In classification trees each terminal node contains a label that indicates the

class for the associated input region. In regression trees the terminal node con-

tains a model that specifies the input/output mapping for the corresponding input

partition.

Hereafter we will focus only on the regression case. Let m be the number of leaves and h_j(·, α_j) the input/output model associated with the j-th leaf. Once a prediction in a query point q is required, the output evaluation proceeds as follows. First, the query is presented to the root node of the decision tree; according to the associated decision function, the tree will branch to one of the root's children. The procedure is iterated recursively until a leaf is reached and an input/output model is selected. The returned output will be the value h_j(q, α_j).

Consider for example the regression tree in Fig. 10.9 and a query point q = (x_q, y_q) such that x_q < x1 and y_q > y1. The predicted output will be ŷ_q = h_2(q, α_2), where α_2 is the vector of parameters of the model localised in region R_2.

Figure 10.9: A binary decision tree.

Figure 10.10: Input space partitioning induced on the input space by the binary tree in Fig. 10.9.

When the terminal nodes contain only constant models, the input/output map-

ping results in a combination of several constant-height planes put together with

crisp boundaries. In the case of linear terms, the resulting approximator is instead

a piecewise linear model.

10.1.4.1 Learning in Regression Trees

10.1.4.2 Parameter identification

A regression tree partitions the input space into mutually exclusive regions. In terms of parametric identification, this requires a two-step procedure. First, the training dataset is partitioned into m disjoint sets D_{N_j}; second, a local model h_j(·, α_j) is fitted to each subset D_{N_j}. The nature of the local model determines the kind of procedure (linear or nonlinear) to be adopted for the parameter identification (see Section 8.6).

R implementation

A regression tree with constant local models is implemented by the R library tree. The script tree.R shows the prediction accuracy for different minimum numbers of observations per leaf (Figure 10.11 and Figure 10.12).
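A small sketch in the spirit of the book's tree.R (the data generation is ours), fitting a constant-leaf regression tree to the Dopler data with a given minimum leaf size:

    library(tree)
    set.seed(0)
    x <- runif(1000, 0.12, 1)
    y <- 20 * sqrt(x * (1 - x)) * sin(2 * pi * 1.05 / (x + 0.05)) + rnorm(1000)
    D <- data.frame(x = x, y = y)
    # at least 7 observations per leaf (minsize kept at twice mincut)
    fit <- tree(y ~ x, data = D, mincut = 7, minsize = 14)
    xs <- data.frame(x = seq(0.12, 1, length.out = 500))
    plot(x, y, col = "gray")
    lines(xs$x, predict(fit, xs), col = "red", lwd = 2)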

10.1.4.3 Structural identification

This section presents a summary of the CART procedure [37] for structural identifi-

cation in binary regression trees. In this case the structural identification procedure

addresses the problem of choosing the optimal partitioning of the input space.

To construct an appropriate decision tree, CART first grows the tree on the

basis of the training set, and then prunes the tree back based on a minimum cost-

complexity principle. This is an example of the exploratory approach to model

generation described in Section 8.8.1.

Let us see in detail the two steps of the procedure:

Tree growing. CART makes a succession of splits that partition the training data into disjoint subsets. Starting from the root node that contains the whole dataset, an exhaustive search is performed to find the split that best reduces a certain cost function.
Let us consider a certain node t and let D(t) be the corresponding subset of the original D_N. Consider the empirical error of the local model fitting the N(t) data contained in the node t:

R_emp(t) = min_{α_t} Σ_{i=1}^{N(t)} L(y_i, h_t(x_i, α_t))    (10.1.6)

For any possible split s of node t into the two children t_r and t_l, we define the quantity

∆E(s, t) = R_emp(t) - (R_emp(t_l) + R_emp(t_r)),  with N(t_r) + N(t_l) = N(t)    (10.1.7)

that represents the change in the empirical error due to a further partition of the dataset. The best split is the one that maximises the decrease ∆E:

s* = arg max_s ∆E(s, t)    (10.1.8)

Once the best split is attained, the dataset is partitioned into the two disjoint subsets of length N(t_r) and N(t_l), respectively. The same method is recursively applied to all the leaves. The procedure terminates either when the error measure associated with a node falls below a certain tolerance level, or when the error reduction ∆E resulting from further splitting does not exceed a threshold value.

Figure 10.11: Regression tree fitting with a minimum number of points per leaf equal to s = 7.

Figure 10.12: Regression tree fitting with a minimum number of points per leaf equal to s = 30.

The tree that the growing procedure yields is typically too large and presents a

serious risk of overfitting the dataset (Section 7.7). For that reason, a pruning

procedure is often adopted.

Tree pruning. Consider a fully expanded tree T_max characterised by L terminal nodes.
Let us introduce a complexity-based measure of the tree performance

R_λ(T) = R_emp(T) + λ|T|    (10.1.9)

where λ is a parameter that accounts for the tree's complexity and |T| is the number of terminal nodes of the tree T. For a fixed λ, we define as T(λ) the tree structure which minimises the quantity (10.1.9).
The parameter λ is gradually increased in order to generate a sequence of tree configurations with decreasing complexity

T_L = T_max ⊃ T_{L-1} ⊃ ··· ⊃ T_2 ⊃ T_1    (10.1.10)

where T_i has i terminal nodes. In practice, this requires a sequence of shrinking steps where at each step we select the value of λ leading from a tree to a tree of inferior complexity. Given a tree T, the next inferior tree is found by computing for each admissible subtree T_t ⊂ T the value λ_t which makes it the minimiser of (10.1.9). For a generic subtree T_t this value must satisfy

R_{λ_t}(T_t) ≤ R_{λ_t}(T)    (10.1.11)

that is

R_emp(T_t) + λ_t |T_t| ≤ R_emp(T) + λ_t |T|

which means

λ_t ≥ (R_emp(T_t) - R_emp(T)) / (|T| - |T_t|)    (10.1.12)

Hence, λ_t = (R_emp(T_t) - R_emp(T)) / (|T| - |T_t|) makes T_t the minimising tree. Therefore, we choose among all the admissible subtrees T_t the one with the smallest right-hand term in Eq. (10.1.12). This implies a minimal increase in λ toward the next minimising tree.

At the end of the shrinking process we have a sequence of candidate trees

that have to be properly assessed to perform the structural selection. As

far as validation is concerned, either a procedure of cross-validation or of

independent testing can be used. The final structure is then obtained through

one of the selection procedures described in Section 8.8.3.


Regression trees are a very easy-to-interpret representation of a nonlinear input/output mapping. However, these methods are characterised by a rough discontinuity at the decision boundaries, which might bring undesired effects to the overall generalisation. Dividing the data by partitioning the input space typically shows small estimator bias but at the cost of increased variance. This is particularly problematic in high-dimensional spaces where data become sparse. One response to the problem is the adoption of simple local models (e.g. constant or linear). These simple functions minimise the variance at the cost of an increased bias.

Another trick is to make use of soft splits, allowing data to lie simultaneously in multiple regions. This is the approach taken by BFN.

10.1.5 Basis Function Networks

Basis Function Networks (BFN) are a family of modular architectures described by a linear basis expansion, i.e. the weighted linear combination

y = Σ_{j=1}^{m} ρ_j(x) h_j    (10.1.13)

where the weights are returned by the activations of m local nonlinear basis functions ρ_j and the term h_j is the output of a generic module of the architecture. The basis (or activation) function ρ_j is a function

ρ_j : X → [0, 1]    (10.1.14)

usually designed so that its value monotonically decreases towards zero as the input point moves away from its centre c_j.

The basis function idea arose almost at the same time in different fields and led to similar approaches, often denoted by different names. Examples are the Radial Basis Functions in machine learning, the Local Model Networks in system identification and the Neuro-Fuzzy Inference Systems in fuzzy logic. These three architectures are described in the following sections.

10.1.6 Radial Basis Functions

A well-known example of basis functions are the Radial Basis Functions (RBF) [156]. Each basis function in an RBF network takes the form of a kernel

ρ_j = K(x, c_j, B_j)    (10.1.15)

where c_j is the centre of the kernel and B_j is the bandwidth. An example of kernel function is illustrated in Fig. 10.13. Other examples of kernel functions are available in Appendix E. Once we denote by η_j the set {c_j, B_j} of parameters of the basis function, we have

ρ_j = ρ_j(·, η_j)    (10.1.16)

If the basis functions ρ_j have localised receptive fields and a limited degree of overlap with their neighbours, the weights h_j in Eq. (10.1.13) can be interpreted as locally piecewise constant models, whose validity for a given input is indicated by the corresponding activation value.

Figure 10.13: A Gaussian kernel function in a two-dimensional input space.

10.1.7 Local Model Networks

Local Model Networks (LMN) were first introduced by Johansen and Foss [111]. They are a generalised form of Basis Function Network in the sense that the constant weights h_j associated with the basis functions are replaced by local models h_j(·, α_j). The typical form of a LMN is then

y = Σ_{j=1}^{m} ρ_j(x, η_j) h_j(x, α_j)    (10.1.17)

where the ρ_j are constrained to satisfy

Σ_{j=1}^{m} ρ_j(x, η_j) = 1,   ∀x ∈ X    (10.1.18)

This means that the basis functions form a partition of unity [137]. This ensures

that every point in the input space has equal weight, so that any variation in the

output over the input space is due only to the models hj .

The smooth combination provided by the LMN formalism enables the representation of complex nonlinear mappings on the basis of simpler modules. See the example in Fig. 10.14, which shows the combination in a two-dimensional input space of three local linear models whose validity regions are represented by Gaussian basis functions.

In general, the local models h_j(·, α_j) in Eq. (10.1.17) can be of any form: linear, nonlinear, physical models or black-box parametric models.

Note that, in the case of local linear models

h_j(x, α_j) = Σ_{i=1}^{n} a_{ji} x_i + b_j    (10.1.19)

where the vector of parameters of the local model is α_j = [a_{j1}, ..., a_{jn}, b_j] and x_i is the i-th term of the vector x, a LMN architecture returns one further piece of information about the input/output phenomenon: the local linear approximation h_lin of the input/output mapping about a generic point x

h_lin(x) = Σ_{j=1}^{m} ρ_j(x, η_j) ( Σ_{i=1}^{n} a_{ji} x_i + b_j )    (10.1.20)
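As an illustration, the following R sketch (with hand-chosen, purely illustrative centres, bandwidths and local linear coefficients) combines m = 3 local linear models through normalised Gaussian basis functions, as in Eq. (10.1.17):

## normalised Gaussian basis functions: a partition of unity, Eq. (10.1.18)
rho <- function(x, centres, B) {
  A <- sapply(seq_along(centres), function(j) exp(-(x - centres[j])^2 / B[j]^2))
  A / rowSums(A)
}

## illustrative local linear models hj(x) = aj * x + bj and their regions
a <- c(2, 0, -2); b <- c(0, 1, 4)
centres <- c(-1, 0, 1); B <- rep(0.5, 3)

lmn.predict <- function(x) {
  R <- rho(x, centres, B)                                   # [N x m] activations
  H <- outer(x, a) + matrix(b, length(x), 3, byrow = TRUE)  # [N x m] local outputs
  rowSums(R * H)                                            # Eq. (10.1.17)
}

x <- seq(-2, 2, by = 0.01)
plot(x, lmn.predict(x), type = "l")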

10.1.8 Neuro-Fuzzy Inference Systems

Fuzzy modelling consists of describing relationships between variables by means of

if-then rules, such as

If x is high then y is low


Figure 10.14: A Local Model Network with m = 3 local models: the nonlinear input/output approximator in (c) is obtained by combining the three local linear models in (a) according to the three basis functions in (b).


where the linguistic terms, such as high and low, are described by fuzzy sets [200]. The first part of each rule is called the antecedent, while the second part is called the consequent. Depending on the particular form of the consequent proposition, different types of rule-based fuzzy models can be distinguished [11].

Here we will focus on the fuzzy architecture for nonlinear modelling introduced by Takagi and Sugeno [178]. A Takagi-Sugeno (TS) fuzzy inference system is a set of m fuzzy if-then rules having the form:

If x_1 is A_{11} and x_2 is A_{21} ... and x_n is A_{n1} then y = h_1(x_1, x_2, ..., x_n, α_1)
...
If x_1 is A_{1m} and x_2 is A_{2m} ... and x_n is A_{nm} then y = h_m(x_1, x_2, ..., x_n, α_m)    (10.1.21)

The antecedent is defined as a fuzzy AND proposition where A_{kj} is a fuzzy set on the k-th premise variable defined by the membership function μ_{kj} : R → [0, 1]. The consequent is a function h_j(·, α_j), j = 1, ..., m, of the input vector [x_1, x_2, ..., x_n]. By means of the fuzzy sets A_{kj}, the input domain is softly partitioned into m regions where the mapping is locally approximated by the models h_j(·, α_j).

If the TS inference system uses the weighted mean criterion to combine the local representations, the model output for a generic query x is computed as

y = ( Σ_{j=1}^{m} μ_j(x) h_j(x, α_j) ) / ( Σ_{j=1}^{m} μ_j(x) )    (10.1.22)

where μ_j is the degree of fulfilment of the j-th rule, commonly obtained by

μ_j(x) = Π_{k=1}^{n} μ_{kj}(x_k)

This formulation makes a TS fuzzy system a particular example of LMN, where

ρ_j(x) = μ_j(x) / ( Σ_{j=1}^{m} μ_j(x) )    (10.1.23)

is the basis function and h_j(·, α_j) is the local model of the LMN architecture.

In a conventional fuzzy approach, the membership functions and the consequent

models are fixed by the model designer according to a priori knowledge. In many

cases, this knowledge is not available; however a set of input/output data has been

observed. Once we put the components of the fuzzy system (memberships and

consequent models) in a parametric form, the TS inference system becomes a para-

metric model which can be tuned by a learning procedure. In this case, the fuzzy

system turns into a Neuro-Fuzzy approximator [108]. For a thorough introduction

to Neuro-Fuzzy architecture see [109] and the references therein. Further work on

this subject was presented by the author in [21, 22, 29, 20].

10.1.9 Learning in Basis Function Networks

Given the strong similarities between the three instances of BFN discussed above,

our discussion on the BFN learning procedure does not distinguish between these

approaches.

The learning process in BFN is divided into structural identification (see Section 8.8) and parametric identification (see Section 8.6). The structural identification aims to find the optimal number and shape of the basis functions ρ_j(·). Once the structure of the network is defined, the parametric identification searches for the optimal set of parameters η_j of the basis functions (e.g. centre and width in the Gaussian case) and the optimal set of parameters α_j of the local models (e.g. linear coefficients in the case of local linear models).

Hence, there are two classes of parameters to be identified: the parameters of the basis functions and the parameters of the local models.

10.1.9.1 Parametric identification: basis functions

The relationship between the model output and the parameters η_j of the basis functions is typically nonlinear; hence, methods for nonlinear optimisation are commonly employed. A typical approach consists in decomposing the identification procedure into two steps: first, an initialisation step, which computes the initial location and width of the basis functions; then, a nonlinear optimisation procedure which uses the outcome η_j^(0) of the previous step as initial value.

Since the methods for nonlinear optimisation have already been discussed in

Section 8.6.2.2, here we will focus on the different initialisation techniques for Basis

Function Networks.

One method for placing the centres of the basis functions is to locate them at the interstices of some coarse lattice defined over the input space [39]. If we assume the lattice to be uniform with d divisions along each dimension, and the dimensionality of the input space to be n, a uniform lattice requires d^n basis functions. This exponential growth makes the use of such a uniform lattice impractical for high-dimensional spaces.

Moody and Darken [134] suggested a K-means clustering procedure in the input space to position the basis functions. The K-means method, described in detail in Appendix A.2, takes as input the training set and returns m groups of input vectors, each parameterised by a centre c_j and a width σ_j. This method generally requires a much smaller number of basis functions than the uniform partition; nevertheless, the basis location concerns only the part of the input space actually covered by data. The assumption underlying this method is that similar inputs should produce similar outputs and that these similar input pairs should be bundled together into clusters in the training set. This assumption is reasonable but not necessarily true in real problems. Therefore, the adoption of K-means clustering techniques for supervised learning is essentially a heuristic technique, and finding a dataset to which this technique cannot be applied satisfactorily is not uncommon.
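A minimal sketch of this K-means-based initialisation in R, using the kmeans function of the base stats package (the data, the number m of basis functions, and the width heuristic are illustrative assumptions):

set.seed(0)
X <- matrix(rnorm(200), ncol = 2)   # illustrative two-dimensional inputs
m <- 5
cl <- kmeans(X, centers = m)
centres <- cl$centers               # initial centres c_j of the basis functions
## a rough width for each basis: the average within-cluster standard deviation
widths <- sapply(1:m, function(j)
  mean(apply(X[cl$cluster == j, , drop = FALSE], 2, sd)))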

An alternative to K-means clustering for initialisation has been proposed in the Neuro-Fuzzy literature [11, 12]. The initialisation of the architecture is provided by a hyperellipsoidal fuzzy clustering procedure. This procedure clusters the data in the input/output domain, obtaining a set of hyperellipsoids which are a preliminary rough representation of the mapping. The parameters of the ellipsoids (eigenvalues) are used to initialise the parameters α_j of the consequent models, while the projection of their barycentres on the input domain determines the initial positions of the membership functions (see Fig. 10.15).

10.1.9.2 Parametric identification: local models

A common approach to the optimisation of the parameters αj of local models is the

least-squares method (see Eq. (8.6.2) and (8.6.4)).

If the local models are nonlinear, some nonlinear optimisation technique is re-

quired (Section 8.6.2.2). Such a procedure is typically computationally expensive

and does not guarantee the convergence to the global minimum.

However, in the case of local linear models (Eq. (10.1.19)), the parametric identification can take advantage of linear techniques. Assume that the local models are linear, i.e.

h_j(x, α_j) = h_j(x, β_j) = x^T β_j    (10.1.24)


Figure 10.15: The hyperellipsoidal clustering initialisation procedure for a single-

input single-output mapping. The training points (dots) are grouped in three el-

lipsoidal clusters after a procedure of fuzzy clustering in the input/output domain.

The projection of the resulting clusters in the input domain (x-axis) determines the

centre and the width of the triangular membership functions.

There are two possible variants for the parameter identification [137, 138]:

Local optimisation. The parameters of each local model are estimated independently. A weighted least-squares optimisation criterion can be defined for each local model, where the weighting factor is the current activation of the corresponding basis function. The parameters of each model h_j(·, β_j), j = 1, ..., m, are then estimated using a set of locally weighted estimation criteria

J_j(β_j) = (1/N) (y − Xβ_j)^T Q_j (y − Xβ_j)    (10.1.25)

where Q_j is a [N × N] diagonal weighting matrix whose diagonal elements are the weights ρ_j(x_1, η_j), ..., ρ_j(x_N, η_j). The weight ρ_j(x_i, η_j) represents the relevance of the i-th example of the training set in the definition of the j-th local model. The locally weighted least-squares estimate β̂_j of the local model parameter vector β_j is

β̂_j = (X^T Q_j X)^{−1} X^T Q_j y    (10.1.26)
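A minimal R sketch of this locally weighted least-squares step (the data, the basis function and its parameters below are illustrative):

## illustrative data: N points, n = 1 input plus a constant term
set.seed(0)
N <- 100
x <- runif(N, -1, 1)
X <- cbind(x, 1)                  # [N x (n+1)] regressor matrix
y <- 2 * x + 1 + rnorm(N, sd = 0.1)

## activations of the j-th basis function (Gaussian, centre 0, bandwidth 0.5)
w <- exp(-(x - 0)^2 / 0.5^2)
Qj <- diag(w)                     # [N x N] diagonal weighting matrix

## locally weighted least-squares estimate, Eq. (10.1.26)
beta.j <- solve(t(X) %*% Qj %*% X, t(X) %*% Qj %*% y)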

Global optimisation. The parameters of the local models are all estimated at the same time. If the local models are assumed to be linear in the parameters, the optimisation is a simple least-squares problem. We get the following regression model:

y = Σ_{j=1}^{m} ρ_j(x, η_j) x^T β_j = ΦΘ    (10.1.27)

where Φ is a [N × (n+1)m] matrix obtained by stacking the rows

Φ = [Φ_1; ...; Φ_N]    (10.1.28)

with

Φ_i = [ρ_1(x_i, η_1) x_i^T, ..., ρ_m(x_i, η_m) x_i^T]    (10.1.29)

and Θ is a [(n+1)m × 1] vector obtained by stacking the local parameter vectors

Θ = [β_1; ...; β_m]    (10.1.30)

The least-squares estimate Θ̂ returns the totality of the parameters of the local models.

Note that the two approaches differ both in terms of predictive accuracy and

final interpretation of the local models. While the first approach aims to obtain

local linear models hj somewhat representative of the local behaviour of the target

in the region described by ρj [138], the second approach disregards any qualita-

tive interpretation by pursuing only a global prediction accuracy of the modular

architecture.

10.1.9.3 Structural identification

The structure of a BFN is characterised by many factors: the shape of the basis functions, the number m of modules and the structure of the local models. Here, for simplicity, we will consider a structural identification procedure which deals exclusively with the number m of local models.

The structural identification procedure consists in adapting the number of modules to the complexity of the process underlying the data. According to the process described in Section 8.8, different BFN architectures with different numbers of modules are first generated, then validated and finally selected.

Analogously to Neural Networks and Regression Trees, there are two possible approaches to the generation of BFN architectures:

Forward: the number of local models increases from a minimum m_min to a maximum value m_max.

Backward: we start with a large number of models and proceed gradually by merging basis functions. The initial number must be set sufficiently high so that the nonlinearity can be captured accurately enough.

Once a set of BFN architectures has been generated, first a validation measure

is used to assess the generalisation error of the different architectures and then a

selection of the best structure is performed. An example of structural identification

of Neuro-Fuzzy Inference Systems based on cross-validation is presented in [21].

Note that BFN structural identification, unlike the parametric procedure described in Section 10.1.9.2, is a nonconvex problem and cannot take advantage of any linear validation technique. This is due to the fact that a BFN architecture, even if composed of local linear modules, behaves globally as a nonlinear approximator. The resulting learning procedure is then characterised by an iterative loop over different model structures, as illustrated in the flow chart of Fig. 10.16.

10.1.10 From modular techniques to local modelling

Modular techniques are powerful engines but still leave some problems unsolved. While these architectures have efficient parametric identification algorithms, they are inefficient in terms of structural optimisation. Even if the parametric identification takes advantage of the adoption of local linear models, the validation of the global architecture remains a nonlinear problem which can be addressed only by computationally expensive procedures.


Figure 10.16: Flow-chart of the BFN learning procedure. The learning procedure

is made of two nested loops: the inner one (made of a linear and nonlinear step) is

the parametric identification loop which minimises the empirical error J, the outer

one searches for the model structure which minimises the validation criterion.


The learning problem for modular architectures is still a problem of function estimation, formulated in terms of the minimisation of the empirical risk over the whole training set. The modular configuration makes the minimisation simpler, but in theoretical terms the problem appears to be at the same level of complexity as in a generic nonlinear estimator. Once the constraint of global optimisation is also relaxed, the divide-and-conquer idea leads to the local modelling approach.

10.1.11 Local modelling

Local modelling is a popular nonparametric technique, which combines excellent

theoretical properties with a simple and flexible learning procedure.

This section will focus on the application of local modelling to the regression

problem. The idea of local regression as a natural extension of parametric fitting

arose independently at different points in time and in different countries in the

19th century. The early literature on smoothing by local fitting focused on one

independent variable with equally spaced values. For a historical review of early

work on local regression see [46].

The modern view of smoothing by local regression has origins in the 1950s and 1960s in the kernel methods introduced in the density estimation setting. As far as regression is concerned, the first modern works on local regression were proposed by Nadaraya [140] and Watson [193].

10.1.11.1 Nadaraya-Watson estimators

Let K(x, q, B) be a nonnegative kernel function that embodies the concept of vicinity. This function depends on the query point q, where the prediction of the target value is required, and on a parameter B ∈ (0, ∞), called bandwidth, which represents the radius of the neighbourhood. The function K satisfies two conditions:

0 ≤ K(x, q, B) ≤ 1    (10.1.31)
K(q, q, B) = 1    (10.1.32)

For example, in the simplest one-dimensional case (dimension n = 1 of the input space), both the rectangular vicinity function (also called uniform kernel) (Fig. 10.17)

K(x, q, B) = 1 if ‖x − q‖ < B/2,  0 otherwise    (10.1.33)

and the soft threshold vicinity function (Fig. 10.18)

K(x, q, B) = exp( −(x − q)² / B² )    (10.1.34)

satisfy these requirements. Other examples of kernel functions are reported in Appendix E.

The Nadaraya-Watson kernel regression estimator is given by

h(q) = ( Σ_{i=1}^{N} K(x_i, q, B) y_i ) / ( Σ_{i=1}^{N} K(x_i, q, B) )    (10.1.35)

where N is the size of the training set. The idea of kernel estimation is simple. Consider the case of a rectangular kernel in one dimension (n = 1). In this case, the estimator (10.1.35) is a simple moving average with equal weights: the estimate at point q is the average of the observations y_i corresponding to the x_i's belonging to the window [q − B, q + B].

Figure 10.17: Hard-threshold kernel function.

Figure 10.18: Soft-threshold kernel function.

If B → ∞, then the estimator tends to the average h = ( Σ_{i=1}^{N} y_i ) / N, and thus for mappings f(·) which are far from constant the bias becomes large. If B is smaller than the smallest pairwise distance between the sample points x_i, then the estimator reproduces the observations: h(x_i) = y_i. In this extreme case, the bias tends to zero at the cost of a high variance. In general terms, by increasing B we increase the bias of the estimator, while by reducing B we obtain a larger variance. The optimal choice for B corresponds to an equal balance between bias and variance (Section 7.7).
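A minimal sketch of the Nadaraya-Watson estimator in R (the dataset, the Gaussian kernel of Eq. (10.1.34) and the bandwidth values are illustrative):

## illustrative one-dimensional dataset
set.seed(0)
N <- 100
x <- sort(runif(N, -2, 2))
y <- 0.9 + x^2 + rnorm(N, sd = 0.1)

## soft-threshold (Gaussian) kernel, Eq. (10.1.34)
K <- function(x, q, B) exp(-(x - q)^2 / B^2)

## Nadaraya-Watson prediction at a query point q, Eq. (10.1.35)
nw <- function(q, B) sum(K(x, q, B) * y) / sum(K(x, q, B))

nw(0.5, B = 0.3)   # small bandwidth: low bias, high variance
nw(0.5, B = 2)     # large bandwidth: smoother but more biased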

From a function approximation point of view, the Nadaraya-Watson estimator is a least-squares constant approximator. Suppose we want to approximate locally the unknown function f(·) by a constant θ. The locally weighted least-squares estimate is

θ̂ = arg min_θ Σ_{i=1}^{N} w_i (y_i − θ)² = ( Σ_{i=1}^{N} w_i y_i ) / ( Σ_{i=1}^{N} w_i )    (10.1.36)

It follows that the kernel estimator is an example of a locally weighted constant approximator with w_i = K(x_i, q, B).

The Nadaraya-Watson estimator suffers from a series of shortcomings: it has a large bias, particularly in regions where the derivative of the regression function f(x) or of the density π(x) is large. Further, it does not adapt easily to a nonuniform π(x). An example is given in Fig. 10.19, where the Nadaraya-Watson estimator is used to predict the value of the function f(x) = 0.9 + x² in q = 0.5. Since most of the observations (crosses) are on the left of q, the estimate is biased downwards.

Figure 10.19: Effect of asymmetric data distribution on the Nadaraya-Watson estimator: the plot reports in the input/output domain the function f = 0.9 + x² to be estimated, the available points (crosses), the values of the kernel function (stars), the value to be predicted in q = 0.5 (dotted horizontal line) and the value predicted by the NW estimator (solid horizontal line).

A more severe problem is the large bias which occurs when estimating at a boundary region. In Fig. 10.20 we wish to estimate the value of f(x) = 0.9 + x² at q = 0.5. Here the regression function has positive slope and hence the Nadaraya-Watson estimate has a substantial positive bias.

10.1.11.2 Higher order local regression

Once the weakness of the local constant approximation was recognised, a more general local regression appeared in the late 1970s [44, 175, 114]. Work on local regression continued throughout the 1980s and 1990s, focusing on the application of smoothing to multidimensional problems [45].

Local regression is an attractive method both from the theoretical and the practical point of view. It adapts easily to various kinds of input distributions π (e.g. random, fixed, highly clustered or nearly uniform). See in Fig. 10.21 the local regression estimation in q = 0.5 for the function f(x) = 0.9 + x² and an asymmetric data configuration. Moreover, there are almost no boundary effects: the bias at the boundary stays at the same order as in the interior, without the use of specific boundary kernels (compare Fig. 10.20 and Fig. 10.22).

Figure 10.20: Effect of a boundary on the Nadaraya-Watson estimator: the plot reports in the input/output domain the function f = 0.9 + x² to be estimated, the available points (crosses), the values of the kernel function (stars), the value to be predicted in q = 0.5 (dotted horizontal line) and the value predicted by the NW estimator (solid horizontal line).

Figure 10.21: Local linear regression in an asymmetric data configuration: the plot reports in the input/output domain the function f = 0.9 + x² to be estimated, the available points (crosses), the values of the effective kernel (stars), the local linear fitting, the value to be predicted in q = 0.5 (dotted horizontal line) and the value predicted by the local regression (solid horizontal line).

Figure 10.22: Local linear regression in a boundary configuration: the plot reports in the input/output domain the function f = 0.9 + x² to be estimated, the available points (crosses), the values of the effective kernel (stars), the local linear fitting, the value to be predicted in q = 0.5 (dotted horizontal line) and the value predicted by the local regression (solid horizontal line).

10.1.11.3 Parametric identification in local regression

Given two variables x ∈ X ⊂ R^n and y ∈ Y ⊂ R, let us consider the mapping f : R^n → R, known only through a set of N examples {⟨x_i, y_i⟩}_{i=1}^{N} obtained as follows:

y_i = f(x_i) + w_i,    (10.1.37)

where, for all i, w_i is a random variable such that E_w[w_i] = 0 and E_w[w_i w_j] = 0, j ≠ i, and such that E_w[w_i^m] = μ_m(x_i), m ≥ 2, where μ_m(·) is the unknown m-th moment (Eq. (3.3.35)) of the distribution of w_i, defined as a function of x_i. In particular, for m = 2, the last of the above-mentioned properties implies that no assumption of global constant variance (homoscedasticity) is made.

The problem of local regression can be stated as the problem of estimating the value that the regression function f(x) = E_y[y | x] takes for a specific query point q, using information pertaining only to a neighbourhood of q.

By using the Taylor expansion truncated at the order p, a generic smooth regression function f(·) can be approximated by

f(x) ≈ Σ_{j=0}^{p} ( f^(j)(q) / j! ) (x − q)^j    (10.1.38)

for x in a neighbourhood of q. Given a query point q, and under the hypothesis of local homoscedasticity of w_i, the parameter vector β̂ of a local linear approximation of f(·) in a neighbourhood of q can be obtained by solving the locally weighted regression (LWR)

β̂ = arg min_β Σ_{i=1}^{N} { (y_i − x_i^T β)² K(x_i, q, B) },    (10.1.39)

where K(·) is a kernel function, B is the bandwidth, and a constant value 1 has been appended to each vector x_i in order to consider a constant term in the regression. In matrix notation, the weighted least-squares problem (10.1.39) can be written as

β̂ = arg min_β (y − Xβ)^T W (y − Xβ)    (10.1.40)


where X denotes the [N × (n+1)] input matrix whose i-th row is x_i^T, y is a [N × 1] vector whose i-th element is y_i, and W is a [N × N] diagonal matrix whose i-th diagonal element is w_ii = √K(x_i, q, B). From least-squares theory, the solution of the above-stated weighted least-squares problem is given by the [(n+1) × 1] vector

β̂ = (X^T W^T W X)^{−1} X^T W^T W y = (Z^T Z)^{−1} Z^T v = P Z^T v,    (10.1.41)

where Z = WX, v = Wy, and the matrix X^T W^T W X = Z^T Z is assumed to be non-singular so that its inverse P = (Z^T Z)^{−1} is defined. Once the local linear polynomial approximation has been obtained, a prediction of f(q) is finally given by

ŷ_q = q^T β̂.    (10.1.42)
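A minimal sketch of this LWR computation in R (the data, the Gaussian kernel and the bandwidth are illustrative; the query vector is extended with a constant 1, as in Eq. (10.1.39)):

set.seed(0)
N <- 100
x <- runif(N, -2, 2)
y <- 0.9 + x^2 + rnorm(N, sd = 0.1)

q <- 0.5; B <- 0.5
X <- cbind(x, 1)                          # constant term appended
w <- sqrt(exp(-(x - q)^2 / B^2))          # w_ii = sqrt(K(x_i, q, B))
Z <- X * w                                # Z = W X (row-wise scaling)
v <- y * w                                # v = W y
beta.hat <- solve(t(Z) %*% Z, t(Z) %*% v) # Eq. (10.1.41)
y.q <- c(q, 1) %*% beta.hat               # Eq. (10.1.42)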

10.1.11.4 Structural identification in local regression

While the parametric identification in a local regression problem is quite simple and reduces to a weighted least-squares, there are several choices to be made in terms of model structure. The most relevant parameters in local structure identification are:

• the kernel function K,
• the order of the local polynomial,
• the bandwidth parameter,
• the distance function,
• the localised global structural parameters.

In the following sections, we will present in detail the importance of these structural parameters, and finally we will discuss the existing methods for tuning and selecting them.

10.1.11.5 The kernel function

Under the assumption that the data to be analysed are generated by a continu-

ous mapping f ( ·), we want to consider positive kernel functions K ( ·,·, B ) that are

peaked at x =q and that decay smoothly to 0 as the distance between x and q

increases. Examples of different kernel functions are reported in Appendix E.

Some considerations can be made on how relevant the kernel shape is for the final accuracy of the prediction. First, it is evident that a smooth weighting function results in a smoother estimate. On the other hand, for hard-threshold kernels (10.1.33), as q changes, available observations abruptly switch in and out of the smoothing window. Second, it is relevant to have kernel functions with nonzero values on a compact bounded support rather than simply approaching zero for |x − q| → ∞. This allows faster implementations, since points further from the query than the bandwidth can be ignored with no error.

10.1.11.6 The local polynomial order

The choice of the local polynomial degree is a bias/variance trade-off. Generally speaking, a higher degree will produce a less biased but more variable estimate than a lower degree one.

Some asymptotic results in the literature assert that a good practice in local polynomial regression is to adopt a polynomial order which differs by an odd degree from the order of the terms to be estimated [70]. In practice, this means that if the goal of local polynomial regression is to estimate the value of the function at the query point (degree zero in the Taylor expansion (10.1.38)), it is advisable to use orders of odd degree; otherwise, if the purpose is to estimate the derivatives at the query point, it is better to fit with even degrees. However, others suggest, in practical applications, not to rule out any type of degree [46].

In the previous sections, we already introduced some considerations on degree-zero fitting. This choice very rarely appears to be the best one in terms of prediction, even if it presents a strong advantage in computational terms. By using a polynomial degree greater than zero, we can typically increase the bandwidth by a large amount without introducing an intolerable bias. Despite the increased number of parameters, the final result is smoother thanks to the increased neighbourhood size.

A degree having an integer value is generally assumed to be the only possible choice for the local order. However, the accuracy of the prediction turns out to be highly sensitive to discrete changes of the degree.

A possible alternative is polynomial mixing, proposed in global parametric fitting by Mallows [129] and in local regression by Cleveland and Loader [46]. Polynomial mixings are polynomials of fractional degree p = m + c, where m is an integer and 0 < c < 1. The mixed fit is a weighted average of the local polynomial fits of degree m and m + 1, with weight 1 − c for the former and weight c for the latter:

f_p(·) = (1 − c) f_m(·) + c f_{m+1}(·)    (10.1.43)

We can choose a single mixing degree for all x, or we can use an adaptive method by letting p vary with x.

10.1.11.7 The bandwidth

A natural question is how wide the local neighbourhood should be so that the local approximation (10.1.38) holds. This is equivalent to asking how large the bandwidth parameter should be in (10.1.33). If we take a small bandwidth B, we are able to cope with a possible nonlinearity of the mapping; in other terms, we keep the modelling bias small. However, since the number of data points falling in this local neighbourhood is also small, we cannot average out the noise, and the variance of the prediction will consequently be large (Fig. 10.23).

On the other hand, if the bandwidth is too large, we could smooth the data excessively, thus introducing a large modelling bias (Fig. 10.24). In the limit case of an infinite bandwidth, for example, a local linear model turns into a global linear fitting which, by definition, cannot take into account any type of nonlinearity.

Figure 10.23: Too narrow a bandwidth: overfitting and a large prediction error e.

Figure 10.24: Too large a bandwidth: underfitting and a large prediction error e.

A vast amount of literature has been devoted to the bandwidth selection problem. Various techniques for selecting smoothing parameters have been proposed during the last decades in different setups, mainly in kernel density estimation [112] and kernel regression. There are two main strategies for bandwidth selection:

Constant bandwidth selection. The bandwidth B is independent of the training set D_N and the query point q.

Variable bandwidth selection. The bandwidth is a function B(D_N) of the dataset D_N. For a variable bandwidth, a further distinction should be made between the local and the global approach.

1. A local variable bandwidth B(D_N, q) is not only a function of the training data D_N but also changes with the query point q. An example is the nearest neighbour bandwidth selection, where the bandwidth is set to the distance between the query point and the k-th nearest point [175].

2. A global variable bandwidth is a function B(D_N) of the dataset but is the same for all the queries. However, a further distinction should be made between the point-based case, where the bandwidth B(x_i) is a function of the training point x_i, and the uniform case, where B is constant.

A constant bandwidth is easy to interpret and can be sufficient if the unknown curve is not too wiggly, i.e. has a high smoothness. Such a bandwidth, however, fails to do a good job when the unknown curve has a rather complicated structure. To capture the complexity of such a curve, a variable bandwidth is needed. A variable bandwidth allows for different degrees of smoothing, resulting in a possible reduction of the bias at peaked regions and of the variance at flat regions. Further, a variable local bandwidth can adapt to the data distribution, to different levels of noise and to changes in the smoothness of the function. Fan and Gijbels [68] argue for point-based rather than query-based local bandwidth selection, mainly for computational efficiency reasons.

10.1.11.8 The distance function

The performance of any local method depends critically on the choice of the distance function d : R^n × R^n → R. In the following, we define some distance functions for ordered inputs:

Unweighted Euclidean distance

d(x, q) = √( Σ_{j=1}^{n} (x_j − q_j)² ) = √( (x − q)^T (x − q) )    (10.1.44)

Weighted Euclidean distance

d(x, q) = √( (x − q)^T M^T M (x − q) )    (10.1.45)

The unweighted distance is a particular case of the weighted one for M diagonal with m_jj = 1.

Unweighted L_p norm (Minkowski metric)

d(x, q) = ( Σ_{j=1}^{n} |x_j − q_j|^p )^{1/p}    (10.1.46)

Weighted L_p norm. It is computed through the unweighted norm d(Mx, Mq).

It is important to remark that when an entire column of M is zero, all points along the corresponding direction get the same relevance in the distance computation. Also, notice that once the bandwidth is selected, some terms in the matrix M can be redundant parameters of the local learning procedure. The redundancy can be eliminated by requiring the determinant of M to be one or by fixing some elements of M.
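A small R sketch of these distance functions (the matrix M below is an arbitrary illustration of the weighting):

## unweighted Euclidean distance, Eq. (10.1.44)
d.euc <- function(x, q) sqrt(sum((x - q)^2))

## weighted Euclidean distance, Eq. (10.1.45)
d.weuc <- function(x, q, M) { z <- M %*% (x - q); sqrt(sum(z^2)) }

## unweighted Minkowski (L_p) distance, Eq. (10.1.46)
d.lp <- function(x, q, p) sum(abs(x - q)^p)^(1/p)

x <- c(1, 2); q <- c(0, 0)
M <- diag(c(1, 0.5))   # the second input counts half in the distance
d.euc(x, q); d.weuc(x, q, M); d.lp(x, q, p = 1)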

Atkeson et al. [10] distinguish between three ways of using distance functions:

Global distance function. The same distance is used at all parts of the input

space.

Query-based distance function. The distance measure is a function of the cur-

rent query point. Examples are in [174, 97, 76].

Point-based local distance functions. Each training point has an associated

distance metric [174]. This is typical of classification problems where each

class has an associated distance metric [3, 4].

10.1.11.9 The selection of local parameters

As seen in the previous sections, there are several parameters that affect the accu-

racy of the local prediction. Generally, they cannot be selected and/or optimised

in isolation as the accuracy depends on the whole set of structural choices. At the

same time, they do not all play the same role in the determination of the final

estimation. It is a common belief in local learning literature that the bandwidth

and the distance function are the most important parameters. The shape of the

weighting function, instead, plays a secondary role.

In the following, we will mainly focus on the existing methods for bandwidth selection. They can be classified into:

Rule-of-thumb methods. They provide a crude bandwidth selection which in some situations may prove sufficient. Examples of rules of thumb are provided in [69] and [95].

Data-driven estimation. This is a selection procedure which estimates the generalisation error directly from data. Unlike the previous approach, this method does not rely on asymptotic expressions, but estimates the values directly from the finite dataset. To this group belong methods like cross-validation, Mallows' C_p, Akaike's AIC and other extensions of methods used in classical parametric modelling.

There are several ways in which data-driven methods can be used for structural

identification. Atkeson et al. [10] distinguish between

Global tuning. The structural parameters are tuned by optimising a data driven

assessment criterion on the whole data set. An example is the General Memory

Based Learning (GMBL) described in [135].

Query-based local tuning. The structural parameters are tuned by optimising

a data driven assessment criterion query-by-query. An example is the lazy

learning algorithm proposed by the author and colleagues in [24, 31, 30].

Point-based local tuning. A different set of structural parameters is associated

with each point of the training set.

R implementation

A local linear algorithm for regression is implemented by the R library lazy [23]. The script lazy.R shows the prediction accuracy on the Doppler dataset for different numbers of neighbours (Figure 10.25 and Figure 10.26).

Figure 10.25: Locally linear fitting with a rectangular kernel and a bandwidth made of 10 neighbours.

Figure 10.26: Locally linear fitting with a rectangular kernel and a bandwidth made of 228 neighbours.

10.1.11.10 Bias/variance decomposition of the local constant model

An interesting aspect of local models is that it is easy to derive an analytical expression of the bias/variance decomposition. In the case of a constant local model, the prediction in q is

h(q, α_N) = (1/k) Σ_{i=1}^{k} y_{[i]}

computed by averaging the values of y for the k closest neighbours x_{[i]}, i = 1, ..., k, of q.

The bias/variance decomposition takes the form discussed in Equation (5.5.15), that is

MSE(q) = σ_w² + ( (1/k) Σ_{i=1}^{k} f(x_{[i]}) − f(q) )² + σ_w²/k    (10.1.47)

where σ_w² is the variance of the noise and σ_w²/k is the variance of a sample average estimator based on k points (Equation (5.5.10)). Note the behaviour of the MSE as a function of k. By increasing k (i.e. with a larger neighbourhood), the first term is invariant, the bias is likely to increase (since farther points are potentially uncorrelated with q) and the variance decreases.
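This behaviour can be checked numerically. The following R sketch (an illustrative Monte Carlo simulation, with an arbitrary target function and noise level) estimates, at a fixed query point, the error of the k-NN constant predictor for increasing k:

set.seed(0)
f <- function(x) 0.9 + x^2
q <- 0.5; sdw <- 0.1; N <- 200; R <- 500

mse <- sapply(c(1, 5, 20, 50), function(k) {
  e2 <- replicate(R, {
    x <- runif(N, -2, 2)
    y <- f(x) + rnorm(N, sd = sdw)
    idx <- order(abs(x - q))[1:k]   # k nearest neighbours of q
    (mean(y[idx]) - f(q))^2         # squared error of the constant model
  })
  mean(e2)                          # estimates bias^2 + variance, i.e. Eq. (10.1.47)
})                                  # without the irreducible sigma_w^2 term
mse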

10.2 Nonlinear classification

In Section 9.2.1, we have shown that optimal classification is possible only if the quantities Prob{y = c_k | x}, k = 1, ..., K, are known. What happens if this is not the case? Three strategies are generally used.

10.2.1 Direct estimation via regression techniques

If the classification problem has K = 2 classes and we denote them by y = 0 and y = 1, then

E[y | x] = 1 · Prob{y = 1 | x} + 0 · Prob{y = 0 | x} = Prob{y = 1 | x}

A binary classification problem can therefore be put in the form of a regression problem where the output takes values in {0, 1}. This means that, in principle, all the regression techniques presented so far could be used to solve a classification task. In practice, most of those techniques do not take into account that the outcome of a classification task should satisfy probabilistic constraints, e.g. 0 ≤ Prob{y = 1 | x} ≤ 1. This means that only some regression algorithms (e.g. local constant models) are commonly used for binary classification as well.

10.2.1.1 The nearest-neighbour classifier

The nearest-neighbour algorithm is an example of a local modelling algorithm (Section 10.1.11) for classification. Let us consider a binary {0, 1} classification task where a training set is available and the classification is required for an input vector q (query point). The classification procedure of a k-NN classifier can be summarised in the following steps (see also the R sketch below):

1. Compute the distance between the query q and the training examples according to a predefined metric.

2. Rank the observed inputs on the basis of their distance to the query.

3. Select a subset {x_{[1]}, ..., x_{[k]}} of the k ≥ 1 nearest neighbours. Each of these neighbours x_{[i]} has an associated class y_{[i]}.

4. Compute the estimation of the conditional probability of the class 1, either by constant fitting

p̂_1(q) = ( Σ_{i=1}^{k} y_{[i]} ) / k    (10.2.48)

or by linear fitting

p̂_1(q) = â q + b̂

where the parameters â and b̂ are locally fitted by least-squares regression.

5. Return the prediction, either by majority vote or according to the conditional probability.
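A minimal sketch of the k-NN conditional probability estimate (10.2.48) in R, on an illustrative one-dimensional dataset with two Normal class-conditional distributions:

set.seed(0)
N <- 400
y <- rep(c(0, 1), each = N / 2)
x <- c(rnorm(N / 2, mean = -1), rnorm(N / 2, mean = 3))

knn.prob1 <- function(q, k) {
  idx <- order(abs(x - q))[1:k]   # k nearest neighbours of the query
  mean(y[idx])                    # estimate of Prob{y = 1 | q}, Eq. (10.2.48)
}

knn.prob1(1, k = 11)                       # conditional probability estimate
ifelse(knn.prob1(1, k = 11) > 0.5, 1, 0)   # majority-vote prediction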

Figure 10.27: kNN prediction (blue line) of the conditional probability (green line) for different values of k (panels for k = 1, 11, 21, 31, 41, 51, 61, 71, 101, 201, 301, 391). Dotted lines represent the class-conditional densities.


It is evident that the hyperparameter k plays a key role in the trade-off between bias and variance. Figure 10.27 illustrates the trade-off in an n = 1 and N = 400 binary classification task where the two class-conditional distributions are Normal with means in −1 and 3, respectively. Note that by increasing k the prediction profile becomes smoother and smoother.

Figure 10.28 shows the trade-off in an n = 2 classification task. Note that, though the separating region becomes closer to the optimal one for large k, an extrapolation bias occurs in regions far from the observed examples.

It is interesting to see that the kNN classifier can be justified in terms of the Bayes theorem. Suppose that the dataset has the form D_N = {(x_1, y_1), ..., (x_N, y_N)}, where y ∈ {c_1, ..., c_K}, and that q ∈ R^n is the query point where we want to compute the a posteriori probability. Suppose that the dataset contains N_j points labelled with the class c_j, i.e.

Σ_{j=1}^{K} N_j = N

Figure 10.28: kNN class predictions in the n = 2 input space for different values of k (panels for k = 1, 5, 40). Dots represent the training points. The continuous black line is the optimal separating hyperplane.

Let us consider a region R around the input x having volume V. If the volume is small enough, we may consider the density constant over the entire region (for a discussion about the validity of this assumption in large-dimensional settings, refer to Section 12.1). It follows that the probability of observing a point within this volume is

P = ∫_R p(z) dz ≈ p(x) V   ⟹   p(x) ≈ P / V

Given a training dataset of size N, if we observe N_R examples in a region R, we can approximate P with P̂ = N_R / N and consequently obtain

p̂(x) = N_R / (N V)    (10.2.49)

Consider now the query point q and a neighbouring volume V containing k points, of which k_j ≤ k are labelled with the class c_j. From (10.2.49) we obtain the k-NN density estimate (A.1.11) of the class-conditional density

p̂(q | c_j) = k_j / (N_j V)

and of the unconditional density

p̂(q) = k / (N V)

Since the class priors can be estimated by P̂rob{y = c_j} = N_j / N, from (7.3.16) it follows that

P̂rob{y = c_j | q} = k_j / k,   j = 1, ..., K

This implies that, in a binary {0, 1} case, the computation (10.2.48) estimates the conditional probability of the class 1.

10.2.2 Direct estimation via cross-entropy

The approach consists in modelling the conditional distribution Prob{y = c_j | x}, j = 1, ..., K, with a set of models P̂_j(x, α), j = 1, ..., K, satisfying the constraints

P̂_j(x, α) > 0   and   Σ_{j=1}^{K} P̂_j(x, α) = 1.

Parametric estimation boils down to the minimisation of the cross-entropy cost function (8.6.5). Typical approaches are logistic regression and neural networks. In logistic regression for a two-class task we have

P̂_1(x, α) = exp(x^T α) / (1 + exp(x^T α)) = 1 / (1 + exp(−x^T α)),   P̂_2(x, α) = 1 / (1 + exp(x^T α))    (10.2.50)

where x and α are [p, 1] vectors. This implies

log( P̂_1(x, α) / P̂_2(x, α) ) = x^T α

where the transformation log(p / (1 − p)) is called the logit transformation and the function (10.1.2) is the logistic function. In a nonlinear classifier (e.g. a neural network)

log( P̂_1(x, α) / P̂_2(x, α) ) = h(x, α)

where h(x, α) is the output of the learner.

In the binary case (c_1 = 1, c_2 = 0), the cross-entropy function (to be minimised) becomes

J(α) = −Σ_{i=1}^{N} log P̂_{y_i}(x_i, α) = −Σ_{i=1}^{N} [ y_i log P̂_1(x_i, α) + (1 − y_i) log(1 − P̂_1(x_i, α)) ] = Σ_{i=1}^{N} [ −y_i h(x_i, α) + log(1 + exp(h(x_i, α))) ]    (10.2.51)

In the logistic regression case (linear h), the cost function is minimised by iteratively reweighted least squares. For a generic h, a gradient-based iterative approach is required.

Another formulation of the binary case is (c_1 = 1, c_2 = −1) with

P̂(y | x) = 1 / (1 + exp(−y h(x, α)))

which satisfies P̂(y = 1 | x) = 1 − P̂(y = −1 | x). In this case, the classification rule is the sign function sign[h(x)] and the cost function to minimise is

J(α) = Σ_{i=1}^{N} log(1 + exp(−y_i h(x_i, α)))    (10.2.52)

also known as the log-loss function, which is a monotone decreasing function of the terms y_i h(x_i, α) = y_i h_i, called the margins. Minimising (10.2.52) is then equivalent to minimising the number of training points for which y_i and the prediction h(x_i, α) have a different sign. The decreasing nature of the function exp(−y h(x, α)) is such that negative margins are penalised much more than positive ones. Note that this is not a property of the least-squares criterion (used in regression) (y − h(x, α))², where in some cases a positive margin may be penalised more than a negative one. This is a reason why regression techniques are not recommended in classification tasks.

10.2.3 Density estimation via the Bayes theorem

Since

Prob{y = c_k | x} = p(x | y = c_k) Prob{y = c_k} / p(x)

an estimation of p(x | y = c_k) allows an estimation of Prob{y = c_k | x}. Several techniques exist in the literature to estimate p(x | y = c_k). We will present two of them in the following sections. The first makes the assumption of conditional independence in order to simplify the estimation. The second relies on the construction of optimal separating hyperplanes to create convex regions containing sets of x points sharing the same class label.

10.2.3.1 Naive Bayes classifier

The Naive Bayes (NB) classifier has shown, in some domains, a performance comparable to that of neural networks and decision tree learning. Consider a classification problem with n inputs and a random output variable y that takes values in the set {c_1, ..., c_K}. The Bayes optimal classifier should return

c*(x) = arg max_{j=1,...,K} Prob{y = c_j | x}

We can use the Bayes theorem to rewrite this expression as

c*(x) = arg max_{j=1,...,K} ( Prob{x | y = c_j} Prob{y = c_j} / Prob{x} ) = arg max_{j=1,...,K} Prob{x | y = c_j} Prob{y = c_j}

How can we estimate these two terms on the basis of a finite set of data? It is easy to estimate each of the a priori probabilities Prob{y = c_j} simply by counting the frequency with which each target class occurs in the training set. The estimation of Prob{x | y = c_j} is much harder. The NB classifier is based on the simplifying assumption that the input values are conditionally independent given the target value (see Section 3.5.4):

Prob{x | y = c_j} = Prob{x_1, ..., x_n | y = c_j} = Π_{h=1}^{n} Prob{x_h | y = c_j}

The NB classification is then

c_NB(x) = arg max_{j=1,...,K} Prob{y = c_j} Π_{h=1}^{n} Prob{x_h | y = c_j}

If the inputs x_h are discrete variables, the estimation of Prob{x_h | y = c_j} boils down to counting the frequencies of the occurrences of the different values of x_h for a given class c_j.

Example

Obs   G1      G2      G3      G
1     P.LOW   P.HIGH  N.HIGH  P.HIGH
2     N.LOW   P.HIGH  P.HIGH  N.HIGH
3     P.LOW   P.LOW   N.LOW   P.LOW
4     P.HIGH  P.HIGH  N.HIGH  P.HIGH
5     N.LOW   P.HIGH  N.LOW   P.LOW
6     N.HIGH  N.LOW   P.LOW   N.LOW
7     P.LOW   N.LOW   N.HIGH  P.LOW
8     P.LOW   N.HIGH  N.LOW   P.LOW
9     P.HIGH  P.LOW   P.LOW   N.LOW
10    P.HIGH  P.LOW   P.LOW   P.LOW

Let us compute the NB classification for the query {G1 = N.LOW, G2 = N.HIGH, G3 = N.LOW}. Since

Prob{y = P.HIGH} = 2/10,   Prob{y = P.LOW} = 5/10
Prob{y = N.HIGH} = 1/10,   Prob{y = N.LOW} = 2/10
Prob{G1 = N.LOW | y = P.HIGH} = 0/2,   Prob{G1 = N.LOW | y = P.LOW} = 1/5
Prob{G1 = N.LOW | y = N.HIGH} = 1/1,   Prob{G1 = N.LOW | y = N.LOW} = 0/2
Prob{G2 = N.HIGH | y = P.HIGH} = 0/2,  Prob{G2 = N.HIGH | y = P.LOW} = 1/5
Prob{G2 = N.HIGH | y = N.HIGH} = 0/1,  Prob{G2 = N.HIGH | y = N.LOW} = 0/2
Prob{G3 = N.LOW | y = P.HIGH} = 0/2,   Prob{G3 = N.LOW | y = P.LOW} = 3/5
Prob{G3 = N.LOW | y = N.HIGH} = 0/1,   Prob{G3 = N.LOW | y = N.LOW} = 0/2

it follows that

c_NB(x) = arg max_{P.H, P.L, N.H, N.L} { 2/10 · 0 · 0 · 0,  5/10 · 1/5 · 1/5 · 3/5,  1/10 · 1 · 0 · 0,  2/10 · 0 · 0 · 0 } = P.LOW
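A minimal R sketch reproducing this computation (the data frame encodes the table above):

D <- data.frame(
  G1 = c("P.LOW","N.LOW","P.LOW","P.HIGH","N.LOW","N.HIGH","P.LOW","P.LOW","P.HIGH","P.HIGH"),
  G2 = c("P.HIGH","P.HIGH","P.LOW","P.HIGH","P.HIGH","N.LOW","N.LOW","N.HIGH","P.LOW","P.LOW"),
  G3 = c("N.HIGH","P.HIGH","N.LOW","N.HIGH","N.LOW","P.LOW","N.HIGH","N.LOW","P.LOW","P.LOW"),
  G  = c("P.HIGH","N.HIGH","P.LOW","P.HIGH","P.LOW","N.LOW","P.LOW","P.LOW","N.LOW","P.LOW"))

query <- c(G1 = "N.LOW", G2 = "N.HIGH", G3 = "N.LOW")

## posterior score of each class: prior times product of conditional frequencies
score <- sapply(unique(D$G), function(cl) {
  prior <- mean(D$G == cl)
  cond <- sapply(names(query), function(v) mean(D[D$G == cl, v] == query[v]))
  prior * prod(cond)
})
names(which.max(score))   # returns "P.LOW"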

The NB classifier relies on the naive (i.e. simplistic) assumption that the inputs are independent given the target class. But why is this assumption made, and when may it be considered realistic? There are essentially two reasons underlying the NB approach, one of a statistical nature and the other of a causal nature. From a statistical perspective, the conditional independence assumption largely reduces the capacity of the classifier by reducing the number of parameters (Section 4.1). This is a variance reduction argument which makes the algorithm effective in large-dimensional classification tasks. However, there are classification tasks which, by their own nature, are more compliant with the NB assumptions than others. Those are tasks where the features used to predict the class are descriptors of the phenomenon represented by the class. Think, for instance, of the classification task where a doctor predicts whether a patient has the flu by means of symptomatic information (does she cough? does he have a fever?). All those measures are correlated, but they become independent once we know the latent state. In a causal perspective (Chapter 13), NB makes the assumption that the considered input features are effects of a common variable (the target class) (Figure 13.9, left). Another example where this assumption holds is fraud detection [52, 50], where the observed features (e.g. place and amount of a transaction) are consequences of the fraudulent action and therefore informative about it.

10.2.3.2 SVM for nonlinear classification

The extension of the Support Vector (SV) approach to nonlinear classification relies on the transformation of the input variables and on the possibility of effectively adapting the SVM procedure to the transformed input space. The idea of transforming the input space by using basis functions is an intuitive manner of extending linear techniques to a nonlinear setting.

Consider, for example, an input/output regression problem where x ∈ X ⊂ R^n. Let us define m new transformed variables z_j = z_j(x), j = 1, ..., m, where z_j(·) is a predefined nonlinear transformation (e.g. z_j(x) = log x_1 + log x_2). This is equivalent to mapping the input space X into a new space, also known as the feature space, Z = {z = z(x) | x ∈ X}. Note that, if m < n, this transformation boils down to a dimensionality reduction and is an example of feature selection (Chapter 12). Let us now fit a linear model y = Σ_{j=1}^{m} β_j z_j to the training data in the new input space z ∈ R^m. By doing this, we carry out a nonlinear fitting of the data simply by using a conventional linear technique. This procedure can be adopted for every learning procedure. However, it is even more worthwhile in a SV framework. Before discussing it, we introduce the notion of dot-product kernel.

Definition 2.1 (Dot-product kernel). A dot-product kernel is a function K such that, for all x_i, x_j ∈ X,

K(x_i, x_j) = ⟨z(x_i), z(x_j)⟩    (10.2.53)

where ⟨z_1, z_2⟩ = z_1^T z_2 stands for the inner product and z(·) is the mapping from the original space to the feature space Z.


Let us suppose now that we want to perform a binary classification by SVM in a transformed space z ∈ Z. For the sake of simplicity, we will consider a separable case. The parametric identification step requires the solution of a quadratic programming problem in the space Z:

max_α Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{k=1}^{N} α_i α_k y_i y_k z_i^T z_k =    (10.2.54)
= max_α Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{k=1}^{N} α_i α_k y_i y_k ⟨z_i, z_k⟩ =    (10.2.55)
= max_α Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{k=1}^{N} α_i α_k y_i y_k K(x_i, x_k)    (10.2.56)

subject to

Σ_{i=1}^{N} α_i y_i = 0,    (10.2.57)
α_i ≥ 0,   i = 1, ..., N    (10.2.58)

What is interesting is that the resolution of this problem differs from the linear one (Equation (9.2.70)) only by the replacement of the quantities ⟨x_i, x_k⟩ with ⟨z_i, z_k⟩ = K(x_i, x_k). This means that, whatever the feature transformation z(x) and whatever the dimensionality m, the SVM computation requires only the availability of the Gram matrix in the feature space, also referred to as the kernel matrix K. Remarkably, once we know how to derive the kernel matrix, we do not even need to know the underlying feature transformation function z(x). The use of a kernel function is an attractive computational shortcut. In practice, the approach consists in defining a kernel function directly, hence implicitly defining the feature space.
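A minimal sketch of a nonlinear SVM classification in R, using the svm function of the e1071 package with a radial (Gaussian) kernel on illustrative data:

library(e1071)

set.seed(0)
N <- 200
X <- matrix(rnorm(2 * N), ncol = 2)
y <- factor(ifelse(X[, 1]^2 + X[, 2]^2 > 1, 1, -1))   # nonlinearly separable labels

## the radial kernel K(xi, xk) = exp(-gamma ||xi - xk||^2) plays the role of <zi, zk>
model <- svm(X, y, kernel = "radial", gamma = 1)
predict(model, matrix(c(0, 0), ncol = 2))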

10.3 Is there a best learner?

A vast amount of literature in machine learning has served the purpose of showing the superiority of some learning methods over others. To support this claim, qualitative considerations and tons of experimental simulations have been submitted to the scientific community. Every machine learning researcher dreams of inventing the most accurate algorithm, without realising that the attainment of such an objective would necessarily mean the end of machine learning... But is there an algorithm to be universally preferred over others in terms of prediction accuracy?

If there were a universally best learning machine, research on machine learning would be unnecessary: we would use it all the time. (Un)fortunately, the theoretical results on this subject are not encouraging [58]. For any number N of observations, there exists an input/output distribution for which the estimate of the generalisation error is arbitrarily poor. At the same time, for any learning machine L1 there exist a data distribution and another learning machine L2 such that, for all N, L2 is better than L1.

It can be shown that there is no learning algorithm which is inherently superior

to any other, or even to random guessing. The accuracy depends on the match

between the (unknown) target distribution and the (implicit or explicit) inductive

bias of the learner.

This (surprising?) result has been formalised by the No Free Lunch (NFL) theorems by D. Wolpert [197]. In his seminal work, Wolpert characterises in probabilistic terms the relation between target function, dataset and hypothesis. The main difference with respect to other research on generalisation is that he does not consider the generating process as constant (e.g. f fixed, as in the bias/variance decomposition (7.7.36)), but supposes the existence of a probability distribution p(f) over the target functions f and that a learning algorithm implements a probability distribution p(h) over the hypotheses h. For instance, p(h = h̄ | D_N) denotes the probability that a learning algorithm will return the hypothesis h̄ given the training set D_N⁶. Based on this formalism, he encodes the following assumptions in a probabilistic language:

• the target distribution is completely outside the researcher's control [197],
• the learning algorithm designer has no knowledge about f when guessing a hypothesis function.

This means that, over the input space region where we observed no training examples (off-training region), the hypothesis h is conditionally independent (Section 3.5.4) of f given the training set:

p(h | f, D_N) = p(h | D_N)

which in turn is equivalent to

p(f | h, D_N) = p(f | D_N)

In other terms, the only information about the target process that a hypothesis may take advantage of is the one contained in the training set.

He then derives the generalisation error of a learning algorithm L conditioned on a training set D_N and computed on input values which do not belong to the training set (i.e. the off-training set region) as

Σ_{x ∉ D_N} E_{f,h}[L(f, h)|D_N](x) = Σ_{x ∉ D_N} ∫_{f,h} L(h(x), f(x)) p(f, h|D_N) df dh
  = Σ_{x ∉ D_N} ∫_{f,h} L(h(x), f(x)) p(f|D_N) p(h|D_N) df dh    (10.3.59)

It follows that the generalisation error⁷ depends on the alignment (or match) between the hypothesis h returned by L and the target f, a match which is represented by the inner product p(f|D_N) p(h|D_N). Since the target is unknown, this match may only be assessed a posteriori: a priori there is no reason to consider a learning algorithm better than another. For any learning algorithm which is well aligned with the distribution p(f|D_N) in the off-training set, it is possible to find another distribution for which the match is much worse. Equation (10.3.59) is one of the several NFL results stating that there is no problem-independent reason to favour one learning algorithm L over another (not even over random guessing) if

1. we are interested only in generalisation accuracy,
2. we make no a priori assumption on the target distribution,
3. we restrict ourselves to the accuracy over the off-training set region.

⁶ Note that throughout this book we have only considered deterministic learning algorithms, for which p(h|D_N) is a Dirac function.
⁷ Note the differences between the definitions (7.7.36) and (10.3.59) of generalisation error: in (7.7.36) f is fixed and D_N is random; in (10.3.59) f is random and D_N is fixed.


x1  x2  x3 |  y | ŷ1  ŷ2  ŷ3  ŷ4
 0   0   0 |  1 |  1   1   1   1
 0   0   1 |  0 |  0   0   0   0
 0   1   0 |  1 |  1   1   1   1
 0   1   1 |  1 |  1   1   1   1
 1   0   0 |  0 |  0   0   0   0
 1   0   1 |  0 |  0   0   0   0
 1   1   0 |  ? |  0   0   1   1
 1   1   1 |  ? |  0   1   0   1

Table 10.1: Off-training set prediction binary example

The presumed overall superiority of a learning algorithm (no matter the number of publications or the h-index of its author) is apparent and depends on the specific task and the underlying data-generating process. The NFL theorems are then a modern probabilistic version of Hume's sceptical argument (Section 2.4): there is no logical evidence that the future will behave like the past. Any prediction or modelling effort demands (explicitly or implicitly) an assumption about the data-generating process, and the resulting accuracy is strictly related to the validity of such an assumption. Note that such assumptions also underlie learning procedures that seem to be general purpose and data-driven, like holdout or cross-validation: for instance, a holdout strategy makes the assumption that the relation between the training portion of the dataset and the validation one is informative about the relation between the observed dataset and future query points (in off-training regions).

A NFL example

Let us consider a classification example from [61] where we have three binary inputs and one binary target which is a deterministic function of the inputs. Let us suppose that the value of the target is known for 6 input configurations and that we want to predict the value for the 2 remaining ones (Table 10.1). Let us consider 4 classifiers which have identical behaviour on the training set yet differ in terms of their predictions for the off-training region. Which one is the best in the off-training set region? May we discriminate between them on the basis of the training behaviour, if we make no additional assumption about the input/output relationship in the off-training set region? The No Free Lunch answer is no. If we assume no a priori information about the conditional distribution, we have 4 equiprobable off-training behaviours. On average the four predictors have the same accuracy. Note that each predictor could have a justification in terms of nearest neighbours (Table 10.2). For instance, the first classifier (in black) relies on the inductive bias that the first and the third features are the most informative about the target, i.e. if we consider in the dataset the nearest neighbours of the off-training inputs we obtain two zeros as predictions. This is not the case for the fourth (red), which implicitly makes a different hypothesis (i.e. that the target depends on x2 only) and returns two ones accordingly.
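A few lines of R (an illustrative sketch, not taken from the book's scripts) make the argument explicit: the four classifiers of Table 10.1, confronted with the four equiprobable completions of the unknown targets, all attain the same average off-training accuracy.

## predictions of the 4 classifiers on the two off-training inputs (110), (111)
preds   <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1))
## the 4 equiprobable completions of the unknown targets
targets <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1))

acc <- sapply(1:4, function(j)             ## classifier j
  mean(sapply(1:4, function(t)             ## target completion t
    mean(preds[j, ] == targets[t, ]))))
acc                                        ## 0.5 0.5 0.5 0.5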

10.4 Conclusions

A large part of machine learning research in the last forty years has been devoted to the quest for the Holy Grail of generalisation. This chapter presented a number of learning algorithms and their rationale. Most of those algorithms made the history of machine learning and were undeniably responsible for the success of the discipline.

Table 10.2: Off-training set prediction binary example: nearest-neighbour interpretation of the four classifiers. The colours show which training points have been used according to a nearest-neighbour strategy to return the off-training set predictions.

When they were introduced, and every time they were used afterwards, they were shown to be competitive and often to outperform other algorithms. So, how may all this be compatible with the No Free Lunch result? First of all, the NFL does not deny that some algorithms may generalise well under some circumstances. It simply states that there is no single algorithm consistently outperforming all the others. Also, NFL results assume that the off-training set is the most pertinent measure for assessing algorithms. Last but not least, the success of statistics (and ML) is probably indicative that the prediction tasks we are commonly confronted with do not come from such a wide and uniform distribution, but that some of them are more probable than others.

Nevertheless, the NFL results may appear frustrating to a young researcher aiming to pursue a career in machine learning (and incidentally to find the Holy Grail): this is not necessarily so if we define, in a less utopian yet more scientific way, the mission of a data scientist. The mission of a data scientist should not be the promotion of a specific algorithm (or family of algorithms) but acting as a scientist through the analysis of data. This means (s)he should use his/her know-how NOT to return information about the merits of his/her preferred algorithm BUT about the nature of the data-generating distribution (s)he is dealing with. The outcome of an ML research activity (e.g. a publication) should be additional insight about the observed reality (or Nature) and not a contingent statement about the temporary superiority of an algorithm. Newton's aim was to use differential calculus to model and explain dynamics in Nature, and not to promote a fancy differential equation tool⁸. Consider also that every ML algorithm (even the least fashionable and the least performing) might return some information (e.g. about the degree of noise or nonlinearity) about the phenomenon we are observing. For instance, a poorly accurate linear model tells us a lot about the lack of validity of the (embedded) linear assumption in the observed phenomenon. In that sense, wrong models might play a relevant role as well, since they might return important information about the phenomenon under observation, notably (non)linearity, (non)stationarity, degree of stochasticity, relevance of features and nature of noise.

⁸ ... or patent it!


10.5 Exercises

1. Suppose you want to learn a classifier for detecting spam in emails. Let the binary variables x1, x2 and x3 represent the occurrence of the words "Viagra", "Lottery" and "Won", respectively, in an email.

Let the dataset of 20 emails be summarised as follows:

Document | x1 (Viagra) | x2 (Lottery) | x3 (Won) | y (Class)
E1   | 0 | 0 | 0 | NOSPAM
E2   | 0 | 1 | 1 | SPAM
E3   | 0 | 0 | 1 | NOSPAM
E4   | 0 | 1 | 1 | SPAM
E5   | 1 | 0 | 0 | SPAM
E6   | 1 | 1 | 1 | SPAM
E7   | 0 | 0 | 1 | NOSPAM
E8   | 0 | 1 | 1 | SPAM
E9   | 0 | 0 | 0 | NOSPAM
E10  | 0 | 1 | 1 | SPAM
E11  | 1 | 0 | 0 | NOSPAM
E12  | 0 | 1 | 1 | SPAM
E13  | 0 | 0 | 0 | NOSPAM
E14  | 0 | 1 | 1 | SPAM
E15  | 0 | 0 | 1 | NOSPAM
E16  | 0 | 1 | 1 | SPAM
E17  | 1 | 0 | 0 | SPAM
E18  | 1 | 1 | 1 | SPAM
E19  | 0 | 0 | 1 | NOSPAM
E20  | 0 | 1 | 1 | SPAM

where

- 0 stands for the case-insensitive absence of the word in the email,
- 1 stands for the case-insensitive presence of the word in the email.

Let y = 1 denote a spam email and y = 0 a no-spam email. The following exercises refer to this dataset.

2. Estimate, on the basis of the data of exercise 1:

- Prob{x1 = 1, x2 = 1}
- Prob{y = 0 | x2 = 1, x3 = 1}
- Prob{x1 = 0 | x2 = 1}
- Prob{x3 = 1 | y = 0, x2 = 0}
- Prob{y = 0 | x1 = 0, x2 = 0, x3 = 0}
- Prob{x1 = 0 | y = 0}
- Prob{y = 0}

Solution:

- Prob{x1 = 1, x2 = 1} = 0.1
- Prob{y = 0 | x2 = 1, x3 = 1} = 0
- Prob{x1 = 0 | x2 = 1} = 0.8
- Prob{x3 = 1 | y = 0, x2 = 0} = 0.5
- Prob{y = 0 | x1 = 0, x2 = 0, x3 = 0} = 1
- Prob{x1 = 0 | y = 0} = 0.875
- Prob{y = 0} = 0.4


3. Answer the following questions (Yes or No) on the basis of the data of exercise 1:

- Are x1 and x2 independent?
- Are x1 and y independent?
- Are the events x1 = 1 and x2 = 1 mutually exclusive?

Solution:

- Are x1 and x2 independent? NO
- Are x1 and y independent? NO
- Are the events x1 = 1 and x2 = 1 mutually exclusive? NO

4. Consider the following three emails:

M1: "Lowest Viagra, Cialis, Levitra price".

M2: "From Google Promo (GOOGLEPROMOASIA) Congratulation! Your mobile won 1 MILLION USD in the GOOGLE PROMO"

M3: "This is to inform you on the release of the EL-GORDO SWEEPSTAKE LOTTERY PROGRAM. Your name is attached to ticket number 025-11-464-992-750 with serial number 2113-05 drew the lucky numbers 13-15 which consequently won the lottery in the 3rd category."

Use a Naive Bayes classifier to compute for email M1, on the basis of the data of exercise 1:

- the input x
- Prob{y = SPAM | x} Prob{x}
- Prob{y = NOSPAM | x} Prob{x}
- the email class

Solution:

- the input x = [1, 0, 0]
- Prob{y = SPAM | x} Prob{x} = 1/180 ≈ 0.0056
- Prob{y = NOSPAM | x} Prob{x} = 1/40 = 0.025
- the email class: NOSPAM

5. Use a Naive Bayes classifier to compute for email M2, on the basis of the data of exercise 1:

- the input x
- Prob{y = SPAM | x} Prob{x}
- Prob{y = NOSPAM | x} Prob{x}
- the email class

Solution:

- the input x = [0, 0, 1]
- Prob{y = SPAM | x} Prob{x} = 1/18 ≈ 0.056
- Prob{y = NOSPAM | x} Prob{x} = 7/40 = 0.175
- the email class is NOSPAM.

6. Use a Naive Bayes classifier to compute for email M3, on the basis of the data of exercise 1:

- the input x
- Prob{y = SPAM | x} Prob{x}
- Prob{y = NOSPAM | x} Prob{x}
- the email class

Solution:

- the input x = [0, 1, 1]
- Prob{y = SPAM | x} Prob{x} = 5/18 ≈ 0.28
- Prob{y = NOSPAM | x} Prob{x} = 0
- the email class is SPAM
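The Naive Bayes computations above can be checked with a short R sketch (illustrative code; the data frame below encodes the table of exercise 1):

D <- data.frame(
  x1 = c(0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,1,0,0),
  x2 = c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1),
  x3 = c(0,1,1,1,0,1,1,1,0,1,0,1,0,1,1,1,0,1,1,1),
  y  = c(0,1,0,1,1,1,0,1,0,1,0,1,0,1,0,1,1,1,0,1))  ## y = 1 for SPAM

nb.score <- function(x, cl) {   ## P(x1|y) P(x2|y) P(x3|y) P(y)
  Dc <- D[D$y == cl, ]
  prod(sapply(1:3, function(j) mean(Dc[, j] == x[j]))) * mean(D$y == cl)
}

x.M1 <- c(1, 0, 0)              ## "Viagra" present, "Lottery"/"Won" absent
c(SPAM = nb.score(x.M1, 1), NOSPAM = nb.score(x.M1, 0))
## SPAM = 1/180 = 0.0056, NOSPAM = 1/40 = 0.025  ->  classify as NOSPAM

Replacing x.M1 with c(0, 0, 1) (email M2) or c(0, 1, 1) (email M3) reproduces the values of exercises 5 and 6.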

7. Consider a classification task with two binary inputs and one binary target y ∈ {−1, +1} where the conditional distribution is

x1  x2 | P(y = 1 | x1, x2)
 0   0 | 0.8
 0   1 | 0.1
 1   0 | 0.5
 1   1 | 1

Suppose that all the input configurations have the same probability. Let the classifier be the rule:

IF x2 = 0 THEN ŷ = −1 ELSE ŷ = 1.

Consider a test set of size N = 10000. For this classifier compute:

- the confusion matrix,
- the precision,
- the specificity (true negative rate),
- the sensitivity (true positive rate).

Solution:

- the confusion matrix:

         ŷ = −1      ŷ = 1
y = −1   TN = 1750   FP = 2250
y = 1    FN = 3250   TP = 2750

- the precision: TP/(TP+FP) = 2750/5000 = 0.55
- the specificity (true negative rate): TN/(TN+FP) = 1750/4000 = 0.4375
- the sensitivity (true positive rate): TP/(TP+FN) = 2750/6000 = 0.458
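These quantities can be checked with a short R sketch (illustrative code) computing the expected confusion matrix from the conditional distribution:

p1   <- c(`00` = 0.8, `01` = 0.1, `10` = 0.5, `11` = 1)  ## P(y = 1 | x1 x2)
n    <- rep(2500, 4)                                     ## N = 10000, uniform inputs
yhat <- c(-1, 1, -1, 1)                                  ## rule: -1 if x2 = 0

TP <- sum(n * p1       * (yhat ==  1)); FN <- sum(n * p1       * (yhat == -1))
FP <- sum(n * (1 - p1) * (yhat ==  1)); TN <- sum(n * (1 - p1) * (yhat == -1))

c(TP = TP, FP = FP, TN = TN, FN = FN)     ## 2750 2250 1750 3250
c(precision   = TP / (TP + FP),           ## 0.55
  specificity = TN / (TN + FP),           ## 0.4375
  sensitivity = TP / (TP + FN))           ## 0.458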

8. Consider a classification task with two binary inputs and one binary target y ∈ {−1, +1} where the conditional distribution is

x1  x2 | P(y = 1 | x1, x2)
 0   0 | 0.8
 0   1 | 0.1
 1   0 | 0.5
 1   1 | 1

Suppose that all the input configurations have the same probability. Let the classifier be the rule:

IF x2 = 0 THEN ŷ = −1 ELSE ŷ = 1.

Consider a test set of size N = 10000. For this classifier compute:

- the confusion matrix,
- the precision,
- the specificity (true negative rate),
- the sensitivity (true positive rate).

Solution:

- the confusion matrix:

         ŷ = −1      ŷ = 1
y = −1   TN = 1750   FP = 2250
y = 1    FN = 3250   TP = 2750

- the precision: TP/(TP+FP) = 2750/5000 = 0.55
- the specificity (true negative rate): TN/(TN+FP) = 1750/4000 = 0.4375
- the sensitivity (true positive rate): TP/(TP+FN) = 2750/6000 = 0.458

9. Consider a regression task with input x and output y and the following training set:

X     Y
 0    0.5
-0.3  1.2
 0.2  1
 0.4  0.5
 0.1  0
-1    1.1

Consider the three following models:

- constant,
- 1NN, nearest neighbour with K = 1,
- 3NN, nearest neighbour with K = 3.

Compute for the constant model:

- the vector of training errors e_i = y_i − ŷ_i,
- the vector of leave-one-out errors e_i^{−i} = y_i − ŷ_i^{−i},
- the mean-squared training error,
- the mean-squared leave-one-out error.

Compute for the 1NN model:

- the vector of training errors e_i = y_i − ŷ_i,
- the vector of leave-one-out errors e_i^{−i} = y_i − ŷ_i^{−i},
- the mean-squared training error,
- the mean-squared leave-one-out error.

Compute for the 3NN model:

- the vector of training errors e_i = y_i − ŷ_i,
- the vector of leave-one-out errors e_i^{−i} = y_i − ŷ_i^{−i},
- the mean-squared training error,
- the mean-squared leave-one-out error.

Solution: Constant model:

- the vector of training errors e_i = y_i − ŷ_i = [−0.2167, 0.4833, 0.2833, −0.2167, −0.7167, 0.3833]
- the vector of leave-one-out errors e_i^{−i} = y_i − ŷ_i^{−i} = [−0.26, 0.58, 0.34, −0.26, −0.86, 0.46]
- the mean-squared training error = 0.178
- the mean-squared leave-one-out error = 0.2564

1NN model:

- the vector of training errors e_i = y_i − ŷ_i = [0, 0, 0, 0, 0, 0]
- the vector of leave-one-out errors e_i^{−i} = y_i − ŷ_i^{−i} = [0.5, 0.7, 1, −0.5, −0.5, −0.1] or [0.5, 0.7, 1, −0.5, −1, −0.1] (according to how the distance tie for x = 0.1 is broken)
- the mean-squared training error = 0
- the mean-squared leave-one-out error = 0.375 or 0.5

3NN model:

- the vector of training errors e_i = y_i − ŷ_i = [0, 0.6333, 0.5, 0, −0.5, 0.1667]
- the vector of leave-one-out errors e_i^{−i} = y_i − ŷ_i^{−i} = [−0.2333, 0.7, 0.6667, 0, −0.6667, 0.5333]
- the mean-squared training error = 0.1548
- the mean-squared leave-one-out error = 0.2862
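The following R sketch (illustrative code) reproduces the leave-one-out computations for the constant and 1NN models; note how the distance tie at x = 0.1 is silently broken by which.min:

X <- c(0, -0.3, 0.2, 0.4, 0.1, -1); Y <- c(0.5, 1.2, 1, 0.5, 0, 1.1)
N <- length(Y)

## constant model: the LOO prediction for i is the mean of the other outputs
e.const <- sapply(1:N, function(i) Y[i] - mean(Y[-i]))
round(e.const, 2)            ## -0.26 0.58 0.34 -0.26 -0.86 0.46
mean(e.const^2)              ## 0.2564

## 1NN: the LOO prediction for i is the output of the nearest other input
e.1nn <- sapply(1:N, function(i) {
  j <- which.min(abs(X[-i] - X[i]))
  Y[i] - Y[-i][j]
})
mean(e.1nn^2)                ## 0.375 (or 0.5 with the other tie-breaking)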

10. Consider a classification task with three binary inputs and one binary target where the conditional distribution is

x1  x2  x3 | P(y = 1 | x1, x2, x3)
 0   0   0 | 0.8
 0   0   1 | 0.9
 0   1   0 | 0.5
 0   1   1 | 1
 1   0   0 | 0.8
 1   0   1 | 0.1
 1   1   0 | 0.1
 1   1   1 | 0

Suppose that all the input configurations have the same probability. Let the classifier be the rule:

IF x1 = 0 OR x2 = 0 THEN ŷ = 1 ELSE ŷ = 0.

Suppose we have a test set of size N = 10000. Considering the class 1 as the positive class, for this classifier compute:

- the confusion matrix,
- the precision,
- the specificity (true negative rate) and
- the sensitivity (true positive rate).

Solution:

- the confusion matrix. For each input configuration (1250 test points each):

x1  x2  x3 | #(y = 1) | #(ŷ = 1) |   TP |   FP |   TN |  FN
 0   0   0 |    1000  |    1250  | 1000 |  250 |    0 |   0
 0   0   1 |    1125  |    1250  | 1125 |  125 |    0 |   0
 0   1   0 |     625  |    1250  |  625 |  625 |    0 |   0
 0   1   1 |    1250  |    1250  | 1250 |    0 |    0 |   0
 1   0   0 |    1000  |    1250  | 1000 |  250 |    0 |   0
 1   0   1 |     125  |    1250  |  125 | 1125 |    0 |   0
 1   1   0 |     125  |       0  |    0 |    0 | 1125 | 125
 1   1   1 |       0  |       0  |    0 |    0 | 1250 |   0

         ŷ = 1       ŷ = 0
y = 1    TP = 5125   FN = 125
y = 0    FP = 2375   TN = 2375

- the precision = 5125/(5125+2375) = 0.68
- the specificity (true negative rate) = 2375/(2375+2375) = 0.5
- the sensitivity (true positive rate) = 5125/(5125+125) = 0.976

11. Let us consider the following classification dataset where y is the binary target.

x1    x2   y
-4    7.0  1
-3   -2.0  1
-2    5.0  0
-1    2.5  1
 1    1.0  0
 2    4.0  1
 3    6.0  0
 4    3.0  1
 5   -1.0  0
 6    8.0  0

Consider the 1st classifier: IF x1 > h THEN ŷ = 1 ELSE ŷ = 0. Trace its ROC curve (considering 1 as the positive class).

Consider the 2nd classifier: IF x2 > k THEN ŷ = 0 ELSE ŷ = 1. Trace its ROC curve (considering 1 as the positive class).

Which classifier is the best one (1st/2nd)?

Solution:

[Figure: ROC curve (TPR vs FPR) of the 1st classifier.]

[Figure: ROC curve (TPR vs FPR) of the 2nd classifier.]

Which classifier is the best one (1st/2nd)? The 2nd.

12. Consider a binary classification task and the training set

x1    x2   y
1     1    -1
2     0.5  -1
1.5   2.5  -1
3     1.5   1
2.5   3     1
4     2.5   1

Consider a linear perceptron initialised with the boundary line x2 = 2, which classifies as positive the points over the line. The student should:

1. Perform one step of gradient descent with stepsize 0.1 and compute the updated coefficients of the perceptron line with equation

β0 + β1 x1 + β2 x2 = 0

2. Trace the initial boundary line (in black), the updated boundary line (in red) and the training points.

Solution:

In the initial perceptron, β0 = −2, β1 = 0 and β2 = 1. The misclassified points are the third and the fourth (opposite label). Since

∂R/∂β = − Σ_{miscl} y_i x_i = [1.5; 2.5] − [3; 1.5] = [−1.5; 1]

and

∂R/∂β0 = − Σ_{miscl} y_i = 0

after one iteration β0 remains the same, while

[β1(t+1); β2(t+1)] = [β1(t); β2(t)] − 0.1 [−1.5; 1] = [0; 1] + [0.15; −0.1] = [0.15; 0.9]

The updated coefficients of the perceptron line are then

β0 = −2
β1 = 0.15
β2 = 0.9

13. Consider the data set in exercise 9 and fit to it a Radial Basis Function network with 2 basis functions having as parameters µ(1) = −0.5 and µ(2) = 0.5. The equation of the basis function is

ρ(x, µ) = exp(−(x − µ)²)

The student should:

1. write in matrix notation the linear system to be solved for obtaining the weights of the radial basis function,
2. compute the weights of the radial basis function.

Hint:

A = [a11, a12; a12, a22],   A⁻¹ = (1/(a11 a22 − a12²)) [a22, −a12; −a12, a11]

Solution:

1. In matrix notation, w = (XᵀX)⁻¹ XᵀY where (inputs sorted in ascending order)

X =
[0.779  0.105]
[0.961  0.527]
[0.779  0.779]
[0.698  0.852]
[0.613  0.914]
[0.445  0.990]

2. The weights of the radial basis function: w = [1.25, −0.27]
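The least-squares computation can be verified with a few lines of R (illustrative code):

x  <- c(-1, -0.3, 0, 0.1, 0.2, 0.4)       ## inputs of exercise 9 (sorted)
y  <- c(1.1, 1.2, 0.5, 0, 1, 0.5)
mu <- c(-0.5, 0.5)

rho <- function(x, mu) exp(-(x - mu)^2)
X   <- outer(x, mu, rho)                  ## 6 x 2 design matrix
w   <- solve(t(X) %*% X, t(X) %*% y)      ## w = (X'X)^{-1} X'Y
round(w, 2)                               ## 1.25 -0.27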

14. Let us consider a classification task with 3 binary inputs and one binary output. Suppose we collected the following training set:

x1  x2  x3 | y
 0   1   0 | 1
 0   0   1 | 0
 0   1   0 | 0
 0   1   1 | 0
 1   0   0 | 0
 1   0   1 | 1
 1   1   0 | 0
 0   1   1 | 0
 1   0   1 | 0
 1   0   0 | 0
 1   1   0 | 0
 0   1   1 | 0

1. Estimate the following quantities by using the frequency as estimator of probability:

- Prob{y = 1}
- Prob{y = 1 | x1 = 0}
- Prob{y = 1 | x1 = 0, x2 = 0, x3 = 0}

2. Compute the classification returned by using the Naive Bayes classifier for the input x1 = 0, x2 = 0, x3 = 0.

3. Suppose we test a classifier for this task and that we obtain a misclassification error equal to 20%. Is it more accurate than a zero classifier, i.e. a classifier returning always zero?

Solution: Let us note that N = 12.

1. P̂rob{y = 1} = 2/12 = 1/6
   P̂rob{y = 1 | x1 = 0} = 1/6
   P̂rob{y = 1 | x1 = 0, x2 = 0, x3 = 0} cannot be estimated using the frequency since there is no observation where x1 = 0, x2 = 0, x3 = 0.

2. Since

P̂rob{y = 1 | x1 = 0, x2 = 0, x3 = 0} ∝ P̂rob{x1 = 0 | y = 1} P̂rob{x2 = 0 | y = 1} P̂rob{x3 = 0 | y = 1} P̂rob{y = 1} = (0.5 · 0.5 · 0.5 · 1/6) ≈ 0.02

and

P̂rob{y = 0 | x1 = 0, x2 = 0, x3 = 0} ∝ P̂rob{x1 = 0 | y = 0} P̂rob{x2 = 0 | y = 0} P̂rob{x3 = 0 | y = 0} P̂rob{y = 0} = (5/10 · 4/10 · 5/10 · 5/6) ≈ 0.08

the NB classification is 0.

3. A zero classifier would always return the class with the highest a priori probability, that is the class 0. Its misclassification error would then be 1/6. Since 1/5 > 1/6, the classifier is less accurate than the zero classifier.

15. Let us consider a classification task with 3 binary inputs and one binary output. Suppose we collected the following training set:

x1  x2  x3 | y
 0   1   0 | 1
 0   0   1 | 0
 0   1   0 | 0
 0   1   1 | 0
 1   0   0 | 0
 1   0   1 | 1
 1   1   0 | 0
 0   1   1 | 0
 1   0   1 | 0
 1   0   0 | 0
 1   1   0 | 0
 0   1   1 | 0

1. Estimate the following quantities by using the frequency as estimator of probability:

- Prob{y = 1}
- Prob{y = 1 | x1 = 0}
- Prob{y = 1 | x1 = 0, x2 = 0, x3 = 0}

2. Compute the classification returned by using the Naive Bayes classifier for the input x1 = 0, x2 = 0, x3 = 0.

3. Suppose we test a classifier for this task and that we obtain a misclassification error equal to 20%. Is it working better than a zero classifier, i.e. a classifier ignoring the value of the inputs?

Solution: Let us note that N = 12.

1. P̂rob{y = 1} = 2/12 = 1/6
   P̂rob{y = 1 | x1 = 0} = 1/6
   P̂rob{y = 1 | x1 = 0, x2 = 0, x3 = 0} cannot be estimated using the frequency since there is no observation where x1 = 0, x2 = 0, x3 = 0.

2. Since

P̂rob{y = 1 | x1 = 0, x2 = 0, x3 = 0} ∝ P̂rob{x1 = 0 | y = 1} P̂rob{x2 = 0 | y = 1} P̂rob{x3 = 0 | y = 1} P̂rob{y = 1} = (0.5 · 0.5 · 0.5 · 1/6) ≈ 0.02

and

P̂rob{y = 0 | x1 = 0, x2 = 0, x3 = 0} ∝ P̂rob{x1 = 0 | y = 0} P̂rob{x2 = 0 | y = 0} P̂rob{x3 = 0 | y = 0} P̂rob{y = 0} = (5/10 · 4/10 · 5/10 · 5/6) ≈ 0.08

the NB classification is 0.

3. A zero classifier would always return the class with the highest a priori probability, that is the class 0. Its misclassification error would then be 1/6. Since 1/5 > 1/6, the classifier is less accurate than the zero classifier.

16. Consider a regression task with input x and output y. Suppose we observe the following training set:

X     Y
 0.1  1
 0    0.5
-0.3  1.2
 0.2  1
 0.4  0.5
 0.1  0
-1    1.1

and that the prediction model is constant. Compute an estimation of its mean integrated squared error by leave-one-out.

Solution: Since the leave-one-out error is

e_i^{−i} = y_i − (Σ_{j=1, j≠i}^N y_j)/(N − 1)

we can compute the vector of leave-one-out errors:

e_1^{−1} = 1 − 0.717 = 0.283
e_2^{−2} = 0.5 − 0.8 = −0.3
e_3^{−3} = 1.2 − 0.683 = 0.517
e_4^{−4} = 1 − 0.717 = 0.283
e_5^{−5} = 0.5 − 0.8 = −0.3
e_6^{−6} = 0 − 0.883 = −0.883
e_7^{−7} = 1.1 − 0.7 = 0.4

and then derive the MISE estimation

MISE_loo = (Σ_{i=1}^N (e_i^{−i})²)/N = 0.22

17. Consider a regression task with input x and output y. Suppose we observe the following training set:

X     Y
 0.1  1
 0    0.5
-0.3  1.2
 0.3  1
 0.4  0.5
 0.1  0
-1    1.1

and that the prediction model is a KNN (nearest neighbour) where K = 1 and the distance metric is Euclidean. Compute an estimation of its mean squared error by leave-one-out.

Solution: The leave-one-out error is

e_i^{−i} = y_i − y^(i)

where y^(i) is the value of the target associated with x^(i), the nearest neighbour of x_i among the remaining points. Once we rank the training set according to the input value:

X     Y
-1    1.1
-0.3  1.2
 0    0.5
 0.1  1
 0.1  0
 0.3  1
 0.4  0.5

we can compute the vector of leave-one-out errors:

e_1^{−1} = 1.1 − 1.2 = −0.1
e_2^{−2} = 1.2 − 0.5 = 0.7
e_3^{−3} = 0.5 − 1 = −0.5
e_4^{−4} = 1 − 0 = 1
e_5^{−5} = 0 − 1 = −1
e_6^{−6} = 1 − 0.5 = 0.5
e_7^{−7} = 0.5 − 1 = −0.5

and then derive the MISE estimation

MISE_loo = (Σ_{i=1}^N (e_i^{−i})²)/N = 0.464

18. Consider a regression task with input x and output y. Suppose we observe the following training set:

X      Y
 0.5   1
 1     1
-1     1
-0.25  1
 0     0.5
 0.1   0
 0.25  0.5

Trace the estimation of the regression function returned by a KNN (nearest neighbour) where K = 3 on the interval [−2, 1].

Solution: The resulting graph is piecewise constant and each piece has an ordinate equal to the mean of three points. Once the points are ordered according to the abscissa:

      X      Y
x1   -1     1
x2   -0.25  1
x3    0     0.5
x4    0.1   0
x5    0.25  0.5
x6    0.5   1
x7    1     1

these are the five sets of 3 points:

x1, x2, x3:  ŷ = 2.5/3    (10.5.60)
x2, x3, x4:  ŷ = 0.5      (10.5.61)
x3, x4, x5:  ŷ = 1/3      (10.5.62)
x4, x5, x6:  ŷ = 0.5      (10.5.63)
x5, x6, x7:  ŷ = 2.5/3    (10.5.64)

The transitions from x_i, x_{i+1}, x_{i+2} to x_{i+1}, x_{i+2}, x_{i+3}, i = 1, ..., 4, occur at the points x = q where q − x_i = x_{i+3} − q, i.e. q = (x_{i+3} + x_i)/2.

[Figure: piecewise-constant 3NN estimate of the regression function on [−2, 1]; x on the abscissa, y on the ordinate.]

19. Consider a supervised learning problem, a training set of size N = 50 and a neural network predictor with a single hidden layer. Suppose that we are able to compute the generalisation error for different numbers H of hidden nodes and we discover that the lowest generalisation error occurs for H = 3. Suppose now that the size of the training set increases (N = 500). For which value of H would you expect the lowest generalisation error? Equal, larger or smaller than 3? Justify your answer by reasoning on the bias/variance trade-off in graphical terms (Figure 10.29).

Solution:

According to (7.7.46), the MISE generalisation error may be decomposed as the sum of the squared bias, the model variance and the noise variance.

In Figure 10.29 we depict the first setting in black and the second one (i.e. increased training set size) in red. The relationship between the squared bias and the capacity of the model (number H) is represented by the dashed line, and the relationship between the variance and the capacity is represented by the continuous thin line. The MISE (taking its minimum in H = 3) is represented by the black thick line. Note that in the figure we do not consider the noise variance since we are comparing two models for the same regression task, so that the noise variance is in this case an irrelevant additive term.

If the training set size increases, we can expect a variance reduction. This means that the minimum of the MISE curve will move to the right. We should then expect that the optimal number of hidden nodes is H > 3. Note that additional observations have no impact on the squared bias while they contribute to reduce the variance (red thin line). From the red thick line denoting the MISE of the second setting, it appears that arg min_H MSE(H) has moved to the right.

[Figure 10.29: squared bias (dashed), variance (thin) and MSE (thick) as a function of model complexity, for the original (black) and the enlarged (red) training set.]

20. Consider a feedforward neural network with two inputs, no hidden layer and a logistic activation function. Suppose we want to use backpropagation to compute the weights w1 and w2 and that a training dataset has been collected. The student should:

1. Write the equation of the mapping between x1, x2 and ŷ.
2. Write the two iterative backpropagation equations to compute w1 and w2.

Solution:

1. ŷ = g(z) = g(w1 x1 + w2 x2), where g(z) = 1/(1 + e^{−z}) and g′(z) = e^{−z}/(1 + e^{−z})².

2. The training error is

E = (Σ_{i=1}^N (y_i − ŷ_i)²)/N

For j = 1, 2:

∂E/∂w_j = −(2/N) Σ_{i=1}^N (y_i − ŷ_i) ∂ŷ_i/∂w_j

where ∂ŷ_i/∂w_j = g′(z_i) x_{ij} and z_i = w1 x_{1i} + w2 x_{2i}.

The two backpropagation equations are then

w_j(k + 1) = w_j(k) + η (2/N) Σ_{i=1}^N (y_i − ŷ_i) g′(z_i) x_{ij},   j = 1, 2
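The update rule can be turned into a few lines of R; the data below are hypothetical, introduced only to make the sketch runnable:

set.seed(0)
N <- 50
X <- matrix(rnorm(2 * N), ncol = 2)                  ## two inputs
y <- as.numeric(X[, 1] - X[, 2] + rnorm(N, sd = .3) > 0)

g  <- function(z) 1 / (1 + exp(-z))                  ## logistic activation
gd <- function(z) exp(-z) / (1 + exp(-z))^2          ## its derivative

w <- c(0, 0); eta <- 0.5
for (k in 1:200) {
  z    <- X %*% w
  yhat <- g(z)
  ## w_j(k+1) = w_j(k) + eta (2/N) sum_i (y_i - yhat_i) g'(z_i) x_ij
  w    <- w + eta * (2 / N) * t(X) %*% ((y - yhat) * gd(z))
}
w                          ## weights minimising the empirical squared error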

21. Consider a binary classification problem and the following estimations of the conditional probability P̂rob{y = 1|x} vs. the real value of the target. Trace the precision-recall and ROC curves.

P̂rob{y = 1|x}   CLASS
0.6      1
0.5     -1
0.99     1
0.49    -1
0.1     -1
0.26    -1
0.33     1
0.15    -1
0.05    -1

Solution: Let us first order the dataset in terms of ascending score:

P̂rob{y = 1|x}   CLASS
0.05    -1
0.10    -1
0.15    -1
0.26    -1
0.33     1
0.49    -1
0.50    -1
0.60     1
0.99     1

We let the threshold range over all the values of the score. For each value of the threshold, we classify as positive the terms having a score bigger than the threshold and as negative the terms having a score lower than or equal to the threshold.

For instance, for Thr = 0.26 this is the returned classification:

P̂rob{y = 1|x}   ŷ   CLASS
0.05   -1   -1
0.10   -1   -1
0.15   -1   -1
0.26   -1   -1
0.33    1    1
0.49    1   -1
0.50    1   -1
0.60    1    1
0.99    1    1

Then we measure the quantities TP, FP, TN and FN, and FPR = FP/(TN + FP), TPR = TP/(TP + FN):

Threshold   TP  FP  TN  FN  FPR  TPR
0.05        3   5   1   0   5/6  1
0.10        3   4   2   0   2/3  1
0.15        3   3   3   0   1/2  1
0.26        3   2   4   0   1/3  1
0.33        2   2   4   1   1/3  2/3
0.49        2   1   5   1   1/6  2/3
0.50        2   0   6   1   0    2/3
0.60        1   0   6   2   0    1/3
0.99        0   0   6   3   0    0

[Figure: ROC curve (sensitivity vs FPR) obtained from the table above.]

22. Let us consider a classification task with 3 binary inputs and one binary output. Suppose we collected the following training set:

x1  x2  x3 | y
 1   1   0 | 1
 0   0   1 | 0
 0   1   0 | 0
 1   1   1 | 1
 0   0   0 | 0
 0   1   0 | 0
 0   1   1 | 0
 0   0   1 | 0
 0   0   0 | 0
 0   1   0 | 0
 1   1   1 | 1

1. Estimate the following quantities by using the frequency as estimator of probability:

- Prob{y = 1}
- Prob{y = 1 | x1 = 0}
- Prob{y = 1 | x1 = 0, x2 = 0, x3 = 0}

2. Consider a Naive Bayes classifier and compute its classifications if the same dataset is also used for testing.

3. Trace the ROC curve associated with the Naive Bayes classifier if the same dataset is also used for testing. (Hint: make the assumption that the denominator of the Bayes formula is 1 for all test points.)

Solution:

1. P̂rob{y = 1} = 3/11
   P̂rob{y = 1 | x1 = 0} = 0
   P̂rob{y = 1 | x1 = 0, x2 = 0, x3 = 0} = 0

2. Note that the values of x1 are identical to those of y. Then P̂rob{x1 = A | y = ¬A} = 0. It follows that, if we use a Naive Bayes classifier and the test dataset is equal to the training set, all the predictions will coincide with the values of x1. The training error is then zero.

3. Since all the predictions are correct, the ROC curve is equal to 1 for all FPR values.

23. Let us consider a binary classification task where the input x ∈ R² is bivariate and the categorical output variable y may take two values: 0 (associated to red) and 1 (associated to green). Suppose that the a-priori probability is p(y = 1) = 0.2 and that the inverse (or class-conditional) distributions are the bivariate Gaussian distributions p(x|y = 0) = N(µ0, Σ0) and p(x|y = 1) = N(µ1, Σ1), where

µ0 = [0, 0]ᵀ,   µ1 = [1, 1]ᵀ

and both Σ0 and Σ1 are diagonal identity matrices. The student should:

1. by using the R function rmvnorm, sample a dataset of N = 1000 input/output observations according to the conditional distribution described above,
2. visualise the dataset in a 2D graph by using the appropriate colours,
3. fit a logistic classifier to the dataset (see details below),
4. plot the evolution of the cost function J(α) during the gradient-based minimisation,
5. plot the decision boundary in the 2D graph.

Logistic regression estimates

P̂(y = 1|x) = exp(xᵀα_N)/(1 + exp(xᵀα_N)) = 1/(1 + exp(−xᵀα_N)),   P̂(y = 0|x) = 1/(1 + exp(xᵀα_N))

where

α_N = arg min_α J(α)

and

J(α) = Σ_{i=1}^N [−y_i x_iᵀα + log(1 + exp(x_iᵀα))]

Note that α is the vector [α0, α1, α2]ᵀ and that x_i = [1, x_{i1}, x_{i2}]ᵀ, i = 1, ..., N. The value of α_N has to be computed by gradient-based minimisation of the cost function J(α) by performing I = 200 iterations of the update rule

α(τ) = α(τ−1) − η dJ(α(τ−1))/dα,   τ = 1, ..., I

where α(0) = [0, 0, 0]ᵀ and η = 0.001.

Solution:

See the file Exercise5.pdf in the directory gbcode/exercises of the companion R package (Appendix F).
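A compact R sketch of the requested pipeline is given below; it is only an illustration of the update rule (using dJ/dα = Σ_i x_i (p_i − y_i), with p_i = P̂(y = 1|x_i)) and not the official solution of Exercise5.pdf:

library(mvtnorm)
set.seed(0)
N  <- 1000
y  <- rbinom(N, 1, 0.2)                                 ## p(y = 1) = 0.2
X  <- t(sapply(y, function(yi)
        rmvnorm(1, mean = if (yi) c(1, 1) else c(0, 0))))
Xd <- cbind(1, X)                                       ## x_i = [1, x_i1, x_i2]

alpha <- c(0, 0, 0); eta <- 0.001; J <- numeric(200)
for (tau in 1:200) {
  p      <- 1 / (1 + exp(-Xd %*% alpha))                ## P(y = 1 | x_i)
  alpha  <- alpha - eta * t(Xd) %*% (p - y)             ## gradient step
  J[tau] <- sum(-y * (Xd %*% alpha) + log(1 + exp(Xd %*% alpha)))
}
plot(J, type = "l", xlab = "iteration", ylab = "J")     ## decreasing cost
plot(X, col = ifelse(y == 1, "green", "red"), pch = 20)
abline(a = -alpha[1] / alpha[3], b = -alpha[2] / alpha[3])  ## decision boundary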

24. Consider a binary classification task where the input x ∈ R² is bivariate and the categorical output variable y may take two values: 0 (associated to red) and 1 (associated to green). Suppose that the a-priori probability is p(y = 1) = 0.2 and that the inverse (or class-conditional) distributions are:

- green/cross class: mixture of three Gaussians

p(x|y = 1) = Σ_{i=1}^3 w_i N(µ_{1i}, Σ)

where µ11 = [1, 1]ᵀ, µ12 = [−1, 1]ᵀ, µ13 = [3, 3]ᵀ, and w1 = 0.2, w2 = 0.3 (hence w3 = 0.5);

- red/circle class: bivariate Gaussian p(x|y = 0) = N(µ0, Σ) where µ0 = [0, 0]ᵀ.

The matrix Σ is a diagonal identity matrix. The student should:

- by using the R function rmvnorm, sample a dataset of N = 1000 input/output observations according to the conditional distributions described above,
- visualise the dataset in a 2D graph by using the appropriate colours/marks,
- plot the ROC curves of the following classifiers:
  1. linear regression coding the two classes by 0 and 1,
  2. Linear Discriminant Analysis where σ² = 1,
  3. Naive Bayes where the univariate conditional distributions are Gaussian,
  4. k Nearest Neighbours with k = 3, 5, 10.

The classifiers should be trained and tested on the same training set. Choose the best classifier on the basis of the ROC curves above. No R package should be used to implement the classifiers.

Solution:

See the file Exercise6.pdf in the directory gbcode/exercises of the companion R package (Appendix F).

Chapter 11

Model averaging approaches

All the techniques presented so far require a model selection procedure where different model structures are assessed and compared in order to attain the best representation of the data. In model selection, the winner-takes-all approach is intuitively the approach that should work best. However, recent results in machine learning show that the final accuracy can be improved not by choosing the model structure which is expected to predict the best, but by creating a model that combines the outputs of models with different structures. The reason is that every hypothesis h(·, α_N) is only an estimate of the real target and, like any estimate, is affected by a bias and a variance term. The theoretical results of Section 5.10 show that a variance reduction can be obtained by combining uncorrelated estimators. This simple idea underlies some of the most effective techniques recently proposed in machine learning. This chapter will sketch some of them.

11.1 Stacked regression

Suppose we have m distinct predictors h_j(·, α_N), j = 1, ..., m, obtained from a given training set D_N. For example, a predictor could be a linear model fit on some subset of the variables, a second one a neural network and a third one a regression tree. The idea of averaging models is to design an average estimator

Σ_{j=1}^m β_j h_j(·, α_N)

by linear combination, which is expected to be more accurate than each of the estimators taken individually.

A simple way to estimate the weights β̂_j is to perform a least-squares regression of the output y on the m inputs h_j(·, α_N). The training set for this regression is then made of the pairs {(h_i, y_i)}, i = 1, ..., N, where

y = [y1; y2; ...; yN]

H = [h_1; h_2; ...; h_N] =
[h1(x1, α_N)  h2(x1, α_N)  ...  hm(x1, α_N)]
[h1(x2, α_N)  h2(x2, α_N)  ...  hm(x2, α_N)]
[    ...           ...     ...       ...   ]
[h1(xN, α_N)  h2(xN, α_N)  ...  hm(xN, α_N)]

and h_i, i = 1, ..., N, is a vector of m terms. Once the least-squares solution β̂ is computed, the combined estimator is

h_cm(x) = Σ_{j=1}^m β̂_j h_j(x, α_N)


Despite its simplicity, the least-squares approach might produce poor results since it does not take into account the correlation existing among the h_j and induced by the fact that all of them are estimated on the same training set D_N.

Wolpert [196] presented an interesting idea, called stacked generalisation, for combining estimators without suffering from the correlation problem. This proposal was translated into statistical language by Breiman, who introduced the stacked regression principle [36].

The idea consists in estimating the m parameters β̂_j by solving the following optimisation task:

β̂ = arg min_β Σ_{i=1}^N (y_i − Σ_{j=1}^m β_j h_j^{(−i)}(x_i))²

where h_j^{(−i)}(x_i) is the leave-one-out estimate (8.8.2.3) of the jth model.

In other terms, the parameters are obtained by performing a least-squares regression of the output y on the m leave-one-out predictions h_j(·, α_N^{(−i)}). The training set for this regression is then made of the pairs {(h_i^−, y_i)}, i = 1, ..., N, where

y = [y1; y2; ...; yN]

H = [h_1^−; h_2^−; ...; h_N^−] =
[h1(x1, α_N^{(−1)})  h2(x1, α_N^{(−1)})  ...  hm(x1, α_N^{(−1)})]
[h1(x2, α_N^{(−2)})  h2(x2, α_N^{(−2)})  ...  hm(x2, α_N^{(−2)})]
[        ...                 ...         ...          ...       ]
[h1(xN, α_N^{(−N)})  h2(xN, α_N^{(−N)})  ...  hm(xN, α_N^{(−N)})]

and h_j(x_i, α_N^{(−i)}) is the predicted outcome in x_i of the jth model trained on D_N with the ith observation (x_i, y_i) set aside.

By using the cross-validated predictions h_j(x_i, α_N^{(−i)}), stacked regression avoids giving unfairly high weight to models with higher complexity. It was shown by Breiman that the performance of the stacked regressor improves when the coefficients β̂ are constrained to be non-negative. There is a close connection between stacking and winner-takes-all model selection. If we restrict the minimisation to weight vectors that have one unit weight and the rest zero, this leads to the model choice returned by the winner-takes-all strategy based on the leave-one-out error. Rather than choosing a single model, stacking combines the m models with estimated optimal weights. This will often lead to better prediction, but less interpretability than the choice of only one of the m models.
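A minimal R sketch of the procedure follows (illustrative code: the two base models, the simulated data and the crude non-negativity projection are assumptions, not Breiman's constrained least-squares algorithm):

set.seed(0)
N <- 100
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- x1 + 0.5 * x2^2 + rnorm(N, sd = 0.5)
D  <- data.frame(x1 = x1, x2 = x2, y = y)

## leave-one-out predictions of the m = 2 base models
H <- sapply(list(y ~ x1, y ~ x1 + I(x2^2)), function(form)
  sapply(1:N, function(i)
    predict(lm(form, data = D[-i, ]), newdata = D[i, , drop = FALSE])))

beta <- coef(lm(y ~ H - 1))      ## stacked weights on the LOO predictions
beta <- pmax(beta, 0)            ## crude projection to non-negative weights
beta / sum(beta)                 ## normalised combination weights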

11.2 Bagging

A learning algorithm is informally called unstable if small changes in the training data lead to significantly different models and relatively large changes in accuracy. Unstable learners can have low bias but typically have high variance. Unstable methods can have their accuracy improved by perturbing and combining, i.e. by generating multiple versions of the predictor (perturbing the training set or the learning method) and combining them. Breiman calls these techniques P&C methods.

The bagging technique is a P&C technique which aims to improve the accuracy of unstable learners by averaging over such discontinuities. The philosophy of bagging is to improve the accuracy by reducing the variance: since the generalisation error of a predictor h(·, α_N) depends on its bias and variance, we obtain an error reduction if we remove the variance term by replacing h(·, α_N) with E_{D_N}[h(·, α_N)]. In practice, since the knowledge of the sampling distribution of the predictor is not available, a non-parametric estimation is required.

Consider a dataset D_N and a learning procedure to build a hypothesis α_N from D_N. The idea of bagging, or bootstrap aggregating, is to imitate the stochastic process underlying the realisation of D_N. A set of B repeated bootstrap samples D_N^{(b)}, b = 1, ..., B, are taken from D_N. A model α_N^{(b)} is built for each D_N^{(b)}. A final predictor is built by aggregating the B models α_N^{(b)}. In the regression case, the bagging predictor is

h_bag(x) = (1/B) Σ_{b=1}^B h(x, α_N^{(b)})

Figure 11.1: Histogram of misclassification rates of resampled trees; the vertical line represents the misclassification rate of the bagging predictor.

In the classification case a majority vote is used.

R script

The R script bagging.R shows the efficacy of bagging as a remedy against overfitting.

Consider a dataset D_N = {x_i, y_i}, i = 1, ..., N, of N = 100 i.i.d. normally distributed inputs x ∼ N([0, 0, 0], I). Suppose that y is linked to x by the input/output relation

y = x1² + 4 log(|x2|) + 5 x3 + ε

where ε ∼ N(0, 0.25) represents the noise. Let us train a single-hidden-layer neural network with s = 25 hidden neurons on the training set (Section 10.1.1). The prediction accuracy on the test set (N_ts = 100) is MISE_ts = 70.86. Let us apply a bagging combination with B = 50 (R file bagging.R). The prediction accuracy on the test set of the bagging predictor is MISE_ts = 6.7. This shows that the bagging combination reduces the overfitting of the single neural network. The histogram of the MISE_ts accuracy of each bootstrap repetition is shown in Figure 11.1: the bagging predictor is much better than the average of the resampled networks.


Tests on real and simulated datasets showed that bagging can give a substantial gain in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy. On the other hand, it can slightly degrade the performance of stable procedures. There is a cross-over point between instability and stability at which bagging stops improving.

Bagging demands the repetition of B estimations of h(·, α_N^{(b)}) but avoids the use of expensive validation techniques (e.g. cross-validation). An open question, as in bootstrap, is to decide how many bootstrap replicates to carry out. In his experiments, Breiman suggests that B ≈ 50 is a reasonable figure.

Bagging is an ideal procedure for parallel computing. Each estimation of h(·, α_N^{(b)}), b = 1, ..., B, can proceed independently of the others. At the same time, bagging is a relatively easy way to improve an existing method. It simply requires adding:

1. a loop that selects the bootstrap sample and sends it to the learning machine, and
2. a back-end to perform the aggregation.

Note however that if the original learning machine has an interpretable structure (e.g. a classification tree), this is lost for the sake of increased accuracy.
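As an illustration of points 1 and 2 above, here is a minimal R sketch of bagging a regression learner (rpart trees and simulated data are used purely for illustration):

library(rpart)
set.seed(0)
N <- 200
x <- runif(N, -2, 2)
y <- sin(2 * x) + rnorm(N, sd = 0.3)
D <- data.frame(x = x, y = y)

B <- 50
models <- lapply(1:B, function(b) {
  idx <- sample(N, replace = TRUE)          ## bootstrap sample D_N^(b)
  rpart(y ~ x, data = D[idx, ])
})

## bagging prediction: average of the B models
h.bag <- function(newdata)
  rowMeans(sapply(models, predict, newdata = newdata))

xq <- data.frame(x = seq(-2, 2, by = 0.1))
plot(x, y); lines(xq$x, h.bag(xq), col = "red", lwd = 2)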

11.3 Boosting

Boosting is one of the most powerful learning ideas introduced in the last ten years. It is a general method which attempts to boost the accuracy of any given learning algorithm. It was originally designed for classification problems, but it can profitably be extended to regression as well. Boosting [75, 168] encompasses a family of methods. The focus of boosting methods is to produce a series of weak learners in order to obtain a powerful combination. A weak learner is a learner that has an accuracy only slightly better than chance.

The training set used for each member of the series is chosen based on the performance of the earlier classifier(s) in the series. Examples that are incorrectly predicted by previous classifiers in the series are chosen more often than examples that were correctly predicted. Thus boosting attempts to produce new classifiers that are better able to predict examples for which the current ensemble's performance is poor. Unlike bagging, the resampling of the training set is dependent on the performance of the earlier classifiers. The two most important types of boosting algorithms are the Ada Boost (Adaptive Boosting) algorithm (Freund and Schapire, 1997) and the Arcing algorithm (Breiman, 1996).

11.3.1 The Ada Boost algorithm

Consider a binary classification problem where the output takes values in {−1, 1}. Let D_N be the training set. A classifier is a predictor h(·) which, given an input x, produces a prediction taking one of the values {−1, 1}. A weak classifier is one whose misclassification error rate is only slightly better than random guessing.

The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of classifiers h_j(·), j = 1, ..., m. The predictions of the m weak classifiers are then combined through a weighted majority vote to produce the final prediction

h_boo(x) = sign(Σ_{j=1}^m α_j h_j(x, α_N))


The weights α_j of the different classifiers are computed by the algorithm. The idea is to give stronger influence to the more accurate classifiers in the sequence. At each step, the boosting algorithm samples N times from a distribution w on the training set which puts a weight w_i on each example (x_i, y_i), i = 1, ..., N, of D_N. Initially, the weights are all set to w_i = 1/N, so that the first step simply trains the classifier in the standard manner. For each successive iteration j = 1, ..., m the probability weights are individually modified, and the classification algorithm is re-applied to the resampled training set.

At the generic jth step, the observations that were misclassified by the classifier h_{j−1}(·) trained at the previous step have their weights w_i increased, whereas the weights are decreased for those that were classified correctly. The rationale of the approach is that, as the iterations proceed, observations that are hard to classify receive ever-increasing influence and the classifier is forced to concentrate on them. Note the presence in the algorithm of two types of weights: the weights α_j, j = 1, ..., m, that measure the importance of the classifiers, and the weights w_i, i = 1, ..., N, that measure the importance of the observations.

Weak learners are added until some desired low training error has been achieved.

This is the algorithm in detail:

1. Initialise the observation weights w_i = 1/N, i = 1, ..., N.

2. For j = 1 to m:

(a) Fit a classifier h_j(·) to the training data obtained by resampling D_N using the weights w_i.

(b) Compute the misclassification error on the training set:

MME_emp^(j) = (Σ_{i=1}^N w_i I(y_i ≠ h_j(x_i))) / (Σ_{i=1}^N w_i)

(c) Compute

α_j = log((1 − MME_emp^(j)) / MME_emp^(j))

Note that α_j > 0 if MME_emp^(j) < 1/2 (otherwise we stop or we restart) and that α_j gets larger as MME_emp^(j) gets smaller.

(d) For i = 1, ..., N set

w_i ← w_i · exp[−α_j] if correctly classified,
w_i ← w_i · exp[α_j] if incorrectly classified.

(e) The weights are normalised to ensure that the w_i represent a true distribution.

3. Output the weighted majority vote

h_boo(x) = sign(Σ_{j=1}^m α_j h_j(x, α_N))
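A compact R sketch of this loop is given below (illustrative code: rpart stumps are used as weak learners, and the sketch assumes 0 < MME_emp^(j) < 1/2 so that α_j stays finite and positive):

library(rpart)
ada.boost <- function(X, y, m) {   ## X: data.frame of inputs, y in {-1, +1}
  N <- length(y); w <- rep(1 / N, N)
  D <- data.frame(X, y = factor(y))
  models <- list(); alpha <- numeric(m)
  for (j in 1:m) {
    idx <- sample(N, replace = TRUE, prob = w)          ## step (a)
    models[[j]] <- rpart(y ~ ., data = D[idx, ],
                         control = rpart.control(maxdepth = 1))
    yhat <- as.numeric(as.character(predict(models[[j]], D, type = "class")))
    err  <- sum(w * (yhat != y)) / sum(w)               ## step (b)
    alpha[j] <- log((1 - err) / err)                    ## step (c)
    w <- w * exp(ifelse(yhat != y, alpha[j], -alpha[j]))  ## step (d)
    w <- w / sum(w)                                     ## step (e)
  }
  list(models = models, alpha = alpha)   ## vote: sign(sum_j alpha_j h_j(x))
}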

R script

The R script boosting.R tests the performance of the Ada Boost algorithm in a classification task. Consider the medical dataset Pima, obtained by a statistical survey on women of Pima Indian heritage. This dataset reports the presence of diabetes in Pima Indian women together with other clinical measures (blood pressure, insulin, age, ...). The classification task is to predict the presence of diabetes as a function of the clinical measures. We consider a training set of N = 40 and a test set of 160 points. The classifier is a simple classification tree which returns a misclassification rate MME_ts = 0.36. We use a boosting procedure with m = 15 to improve the performance of the weak classifier. The misclassification rate of the boosted classifier is MME_ts = 0.3.

Boosting has its roots in a theoretical framework for studying machine learning called the PAC learning model. Freund and Schapire proved that the empirical error of the final hypothesis h_boo is at most

Π_{j=1}^m [2 √(MME_emp^(j) (1 − MME_emp^(j)))]

They also showed how to bound the generalisation error.

11.3.2 The arcing algorithm

This algorithm was proposed by Breiman as a modification of the original Ada Boost algorithm. It is based on the idea that the success of boosting is related to the adaptive resampling property, where increasing weight is placed on those examples more frequently misclassified. ARCing stands for Adaptive Resampling and Combining. The complex updating equations of Ada Boost are replaced by much simpler formulations, and the final classifier is obtained by unweighted voting. This is the ARCing algorithm in detail:

1. Initialise the observation weights w_i = 1/N, i = 1, ..., N.

2. For j = 1 to m:

(a) Fit a classifier h_j to the training data obtained by resampling D_N using the weights w_i.

(b) Let e_i be the number of misclassifications of the ith example by the j classifiers h_1, ..., h_j.

(c) The updated weights are defined by

w_i = (1 + e_i⁴) / (Σ_{i=1}^N (1 + e_i⁴))

3. The output is obtained by unweighted voting of the m classifiers h_j.

R script

The R file arcing.R tests the performance of the ARCing algorithm in a classification task. Consider the medical dataset Breast Cancer, collected by Dr. William H. Wolberg (physician) at the University of Wisconsin Hospital in the USA. This dataset reports the class of cancer (malignant or benign) together with other properties (clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, ...). The classification task is to predict the class of breast cancer on the basis of the clinical measures. We consider a training set of size N = 400 and a test set of size 299. The classifier is a simple classification tree which returns a misclassification rate MME_ts = 0.063. We use an arcing procedure with m = 15. It gives a misclassification rate MME_ts = 0.010.


Boosting is a recent and promising technique which is simple and easy to program. Moreover, it has few parameters (e.g. the maximum number of classifiers) to tune. Boosting methods advocate a shift in the attitude of the learning-system designer: instead of trying to design a learning algorithm which should be accurate over the entire space, she can instead focus on finding weak algorithms that only need to be better than random. Furthermore, a nice property of Ada Boost is its ability to identify outliers.

11.3.3 Bagging and boosting

This section makes a short comparison of bagging and boosting techniques. First of all, in terms of the bias/variance trade-off, it is important to stress that the rationale of bagging is to reduce the variance of low-bias (and then high-variance) learners trained on identically distributed (i.d.) data, while boosting aims to sequentially reduce the bias of weak learners trained on non-i.d. data.

Like bagging, boosting avoids the cost of heavy validation procedures and, like bagging, boosting trades accuracy for interpretability. As for bagging, the main effect of boosting is to reduce variance, and it works effectively for high-variance classifiers. However, unlike bagging, boosting cannot be implemented in parallel, since it is based on a sequential procedure.

In terms of experimental accuracy, several research works (e.g. Breiman's work) show that boosting seems to outperform bagging. Also, a number of recent theoretical results show that boosting is fundamentally different from bagging [98].

Some caveats are notwithstanding worth mentioning: the actual performance of boosting on a particular problem is dependent on the data and the nature of the weak learner. Boosting can also fail to perform well given insufficient data, overly complex weak hypotheses or, definitely, too weak hypotheses.

11.4 Random Forests

Ensemble learning is efficient when it combines low-bias and independent estimators, like non-pruned decision trees.

Random Forests (RF) is an ensemble learning technique proposed by Breiman [38] which combines bagging and random feature selection by using a large number of non-pruned decision trees. The rationale of RF is to reduce the variance by decorrelating as much as possible the single trees. This is achieved in the tree-growing process through a random selection of the input variables. In a nutshell, the algorithm consists in:

1. generating by bootstrap a set of B training sets,
2. fitting to each of them a decision tree h_b(·, α_b), b = 1, ..., B, where the set of variables considered for each split (Section 10.1.4.3) is a random subset of size n′ of the original one (feature bagging),
3. storing at each split, for the corresponding split variable, the improvement of the cost function,
4. returning as the final prediction the average of the B predictions

h_rf(x) = (1/B) Σ_{b=1}^B h_b(x, α_b)

in a regression task, and the majority vote in a classification task,
5. returning for each variable an importance measure.

Suppose that the B trees in the forest are almost unbiased and have a comparable variance Var[h_b] = σ² and a mutual correlation ρ. The RF regression predictor h_rf is then almost unbiased and, from (3.10.88), its variance is

Var[h_rf] = (1 − ρ)σ²/B + ρσ²

It appears then that, by increasing the forest size B and making the trees as uncorrelated as possible, a Random Forest strategy reduces the resulting variance.

A rule of thumb consists of setting the size of the random subset to n′ = √n. The main hyperparameters of RF are the hyperparameters of the single trees (e.g. depth, maximum number of leaves), the number B of trees and the size n′ of the random feature set. Note that by reducing n′ we make the trees more decorrelated, yet we increase the bias of each single tree (and then of the RF) by constraining its number of features. In particular, a too small number n′ may be detrimental to accuracy in configurations with a very large n and a small number of informative features.

11.4.1 Why are Random Forests successful?

Random Forests are often considered among the best "off-the-shelf" learning algorithms since they do not require complex tuning to perform reasonably well on challenging tasks. There are many reasons for their success [73]: (i) they use an out-of-bag (Section 7.10.1) strategy to effectively manage the bias/variance trade-off and to assess the importance of input variables; (ii) being based on trees, they easily cope with mixtures of numeric and categorical predictor variables; (iii) they are resilient to input outliers and invariant under monotone input transformations; (iv) they embed a feature-ranking mechanism based on an importance measure related to the average cost function decrease during splitting; (v) they are fast to construct and can be made massively parallel; and (vi) there exists a number of very effective implementations (e.g. in the R package randomForest) and enhanced versions (notably gradient boosting trees).
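A minimal usage sketch of the randomForest package mentioned above (with simulated data, for illustration only) is:

library(randomForest)
set.seed(0)
N <- 500; n <- 10
X <- data.frame(matrix(rnorm(N * n), ncol = n))
y <- X[, 1] - 2 * X[, 2] + rnorm(N)        ## only two informative features

rf <- randomForest(X, y, ntree = 500, mtry = floor(sqrt(n)),
                   importance = TRUE)
rf                                         ## out-of-bag estimate of the MSE
importance(rf)                             ## feature-importance measures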

11.5 Gradient boosting trees

Gradient boosting (GB) trees are an enhanced version of averaging algorithms which rely on combining m trees according to a forward stage-wise additive strategy [98]. The strategy consists of adding one component (e.g. a tree) at a time: after m iterations, the resulting model is the sum of the m individual trees

h_m(x) = Σ_{j=1}^m T(x, α_j)

Given j − 1 trees, the jth tree is learned so as to compensate the error between the target and the current ensemble prediction h_{j−1}(x). This means that

α_j = arg min_α Σ_{i=1}^N L(y_i, h_{j−1}(x_i) + T(x_i, α))    (11.5.1)

where α_j contains the jth tree's parameters, e.g. the set of disjoint regions and the local model holding in each region. Note that, in the forward stage-wise philosophy, no adjustment of the previously added trees is considered.

It can be shown that, for a regression task with a squared error loss function L, the solution α_j corresponds to the regression tree that best predicts the residuals

r_i = y_i − h_{j−1}(x_i),   i = 1, ..., N

Gradient-based versions exist for other differentiable loss criteria and for classification tasks. Also, weighted versions of (11.5.1) exist:

(α_j, w_j) = arg min_{α,w} Σ_{i=1}^N L(y_i, h_{j−1}(x_i) + w T(x_i, α))

where the contribution w_j of each new tree is properly tuned.

A stochastic version of gradient boosting has been proposed in [77] where, at each iteration, only a subsample of the training set is used to train the new tree. Though gradient-boosting algorithms are considered among the most promising in complex learning tasks, it is recommended to remember that their accuracy depends, like for all learning algorithms, on a number of hyperparameters, notably the size of the constituent trees, the number m of iterations, the contribution w_j of each tree, the loss function and the subsample size.
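A minimal R sketch of the squared-loss case (illustrative code: regression stumps fitted to the current residuals, with a constant shrinkage factor playing the role of w_j) is:

library(rpart)
set.seed(0)
N <- 200
x <- runif(N, -2, 2)
y <- sin(2 * x) + rnorm(N, sd = 0.3)
D <- data.frame(x = x, y = y)

m <- 100; nu <- 0.1                      ## number of trees, shrinkage w_j = nu
h <- rep(mean(y), N)                     ## h_0: constant initialisation
trees <- list()
for (j in 1:m) {
  D$r <- y - h                           ## residuals r_i = y_i - h_{j-1}(x_i)
  trees[[j]] <- rpart(r ~ x, data = D,
                      control = rpart.control(maxdepth = 2))
  h <- h + nu * predict(trees[[j]], D)   ## add the new tree's contribution
}
mean((y - h)^2)                          ## training error after m iterations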

11.6 Conclusion

The averaging of ensembles of estimators relies on the counterintuitive principle that combining predictors is (most of the time) more convenient than selecting (what seems to be) the best one. This principle is (probably together with the idea of regularisation) one of the most genial and effective ideas proposed by researchers in Machine Learning¹. Most state-of-the-art learning strategies owe a considerable part of their success to the integration of the combination principle. This principle is so powerful that some authors nowadays suggest not to include combination in the assessment of learning strategies (e.g. in new publications), given the risk that the only visible beneficial effect is the one due to the combination.

The fact that this idea might appear counterintuitive sheds light on the stochastic nature of the learning problem and on the importance of taking a stochastic perspective to really grasp the problem of learning and generalising from a finite set of observations.

¹ ... and a note of distinction should here definitely be attributed to the seminal work of researchers like Jerome H. Friedman and Leo Breiman.

11.7 Exercises

1. Verify by Monte Carlo simulation the relations (5.10.31) and (5.10.30) concerning the combination of two unbiased estimators.

Hint: define an estimation task (e.g. estimate the expected value of a random variable) and choose two unbiased estimators.
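A possible starting point in R (an illustrative sketch under the hint's setting: the two unbiased estimators chosen here are the sample mean and the sample median of a Gaussian, and the relations themselves are not restated) is:

set.seed(0)
mu <- 2; N <- 50; R <- 10000
est <- t(replicate(R, {
  x <- rnorm(N, mean = mu)
  c(mean(x), median(x))          ## two unbiased estimators of mu
}))

w    <- 0.5                      ## convex combination weight
comb <- w * est[, 1] + (1 - w) * est[, 2]
c(bias.1 = mean(est[, 1]) - mu, bias.2 = mean(est[, 2]) - mu,
  bias.comb = mean(comb) - mu)   ## all ~ 0: unbiasedness is preserved
c(var.1 = var(est[, 1]), var.2 = var(est[, 2]), var.comb = var(comb))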



Chapter 12

Feature selection

In many challenging learning tasks, the number of inputs (or features) may be

extremely high: this is the case of bioinformatics [167] where the number of variables

(typically markers of biological activity at different functional levels) may go up to

hundreds of thousands. The race to high-throughput measurement techniques in

many domains allows us to easily foresee that this number could grow by several

orders of magnitude.

Using such a large number of features in learning may negatively affect general-

isation performance, especially in the presence of irrelevant or redundant features.

Nevertheless, traditional supervised learning algorithms techniques have been de-

signed for supervised tasks where the ratio between the input dimension and the

training size is small, and most inputs (or features) are informative. As a conse-

quence, their accuracy may rapidly degrade when used in tasks with few observa-

tions and a huge number of inputs.

At the same time, it is common to make the assumption that data are sparse or possess an intrinsically low-dimensional structure. This means that most input dimensions are correlated, that only a few of them contain information, or, equivalently, that most dimensions are irrelevant for the learning task.

For this reason, learning pipelines increasingly include a feature selection phase, aiming to select a small subset of informative (or relevant) features that captures most of the signal and avoids variance and instability issues during learning. In that sense, feature selection can be seen as an instance of the model selection problem, where the alternative models do not differ in terms of functional representation but in terms of the subset of inputs used.

Example

This example illustrates the impact of the number of features on the model variance

in a learning task with a comparable number of features and observations. Let us

consider a linear regression dependency

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + w$$

where $\text{Var}[w] = 0.5$, $\beta_0 = 0.5$, $\beta_1 = 0.5$, $\beta_2 = 0.5$, $\beta_3 = 0.5$, $\beta_4 = 0.5$. Suppose we collect a dataset of $N = 20$ input/output observations where the input set ($n = 8$) contains, together with the four variables $x_1, \ldots, x_4$, a set of 4 irrelevant variables $x_5, \ldots, x_8$.

Let us consider a set of linear regression models with an increasing number of

features, ranging from zero (constant model) to 8.

The script bv_linfs.R illustrates the impact of the number of features on the average bias (estimated by Monte Carlo) and the average variance (both the analytical and the Monte Carlo estimates) of the predictors. Figure 12.1 shows that the larger the number of features, the higher the prediction variance. Note that the analytical form of the variance of a linear model prediction is presented in Section 9.1.14.

Figure 12.1: Bias/variance trade-off for different numbers of features. Bias and variance are averaged over the set of N inputs.

The bias has the opposite trend, reaching zero once the 4 inputs $x_1, \ldots, x_4$ are included in the regression model. Overall, the more variables are considered, the more the bias is reduced, at the cost of an increased variance. If a variable has no predictive value (e.g. it belongs to the set $x_5, \ldots, x_8$), considering it merely increases the variance with no benefit in terms of bias reduction. In general, if the addition of a variable has a small impact on the bias, then the increase in prediction variance may exceed the benefit from the bias reduction [132]. The role of a feature selection technique should be to detect those variables and remove them from the input set.

The benefits of feature selection have been thoroughly discussed in the literature [89, 91]:

• facilitating data visualisation and data understanding,

• reducing the measurement and storage requirements,

• reducing training and utilisation times of the final model,

• defying the curse of dimensionality to improve prediction performance.

At the same time, feature selection implies additional time for learning since it

introduces an additional layer to the search in the model hypothesis space.

12.1 Curse of dimensionality

Feature selection addresses what is known in several scientific domains as the curse of dimensionality. This term, coined by R. E. Bellman, refers to all the computational problems related to large-dimensional modelling tasks.

The main issue in supervised learning is that the sparsity of data increases exponentially with the dimension $n$. This can be illustrated by several arguments.

Let us consider an $n$-dimensional space and a unit volume around a query point $x_q \in \mathbb{R}^n$ (Figure 12.2) [98]. Let $V < 1$ be the volume of a neighbourhood hypercube of edge $d$. It follows that $d^n = V$ and $d = V^{1/n}$. Figure 12.3 illustrates the link between neighbourhood volume $V$ and edge size $d$ for different values of $n$.

Figure 12.2: Locality and dimensionality of the input space for different values of $n$: unit volume (in black) around a query point (circle) containing a neighbourhood (in red) of volume $V$ and edge $d$. With $d = 1/2$: $V = 1/2$ for $n = 1$, $V = 1/4$ for $n = 2$, $V = 1/8$ for $n = 3$.

Figure 12.3: Neighbourhood volume vs. edge size for different values of $n$.

It appears that, for a given neighbourhood volume $V$, the edge length increases with $n$, while for a given edge length $d$, the neighbourhood volume decreases with $n$. For instance, if $V = 0.5$ we have $d = 0.7, 0.87, 0.98$ for $n = 2, 5, 50$; if $V = 0.1$ we have $d = 0.3, 0.63, 0.95$ for $n = 2, 5, 50$. This means that for $n = 50$ we need an edge length which is 95% of the unit length if we want to barely cover 10% of the total volume.
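The relation $d = V^{1/n}$ can be checked in a few lines of R; the values match those quoted above up to rounding.

V <- c(0.5, 0.1)
n <- c(2, 5, 50)
round(t(sapply(V, function(v) v^(1/n))), 2)  # rows: V = 0.5, 0.1; columns: n = 2, 5, 50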

Let us now assess the impact of dimensionality on the accuracy of a local learning algorithm (e.g. $k$-nearest neighbour) by considering the relation between the training set size $N$, the input dimension $n$ and the number of neighbours $k$. If the $N$ points are uniformly distributed in the unit volume around the query point, the number of neighbours $k$ in the neighbourhood $V$ amounts to roughly $k = NV$. Given the values of $N$ and $k$ (and consequently the local volume $V$), the edge $d$ of the neighbourhood increases with the dimension $n$ and converges rapidly to one (Figure 12.4). This implies that if we use a kNN (nearest neighbour) learner for two supervised learning tasks with the same $N$ but different $n$, the degree of locality of the learner (represented by the length of $d$) shrinks as $n$ grows. Analogously, if $N$ and $0 < d < 1$ are fixed, the number $k = N d^n$ of neighbours in $V$ decreases with increasing $n$. In other terms, as $n$ increases, the amount of local data goes to zero (Figure 12.5) or, equivalently, all datasets are sparse for large $n$.

Let us now consider the case where $k > 0$ and $0 < d < 1$ (degree of locality) are fixed and $N$ may be adjusted (e.g. by observing more points). Since

$$N = k/d^n$$

we need to grow the size of the training set $N$ exponentially to guarantee a constant $k$ for increasing $n$. Suppose that $k = 10$, $d = 0.1$ and $N = 100$ for $n = 1$. If we want to preserve the same number $k$ of neighbours for increasing $n$, then $N$ has to grow according to the following law:

$$N = k/d^n = \frac{10}{(1/10)^n} = 10^{n+1}$$

For instance, we need to observe $N = 10^6$ observations for $n = 5$ if we want the same degree of locality we had for $n = 1$. This implies that given two supervised learning tasks (one with $n = 1$ and the other with $n \gg 1$), the second should be trained with a number $N$ of a much higher order of magnitude (Figure 12.6) to guarantee the same degree of locality of the $n = 1$ configuration.

Another interesting result about the impact of dimensionality on data distribution is the following: given $N$ observations uniformly distributed in an $n$-dimensional unit ball centred at the origin, the median of the distance from the origin to the closest data point is $(1 - (1/2)^{1/N})^{1/n}$ (Figure 12.7).
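The closed-form expression above can be evaluated directly; the values of $n$ and $N$ below are illustrative.

## Median distance from the origin to the closest of N points uniformly
## distributed in the n-dimensional unit ball.
med.dist <- function(n, N) (1 - (1/2)^(1/N))^(1/n)
med.dist(n = 1, N = 1e6)    # essentially 0: the nearest point hugs the origin
med.dist(n = 50, N = 1e6)   # ~0.75: even with a million points, no point is close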


Figure 12.4: Neighbourhood edge size vs. dimension $n$ (for fixed $N$ and $k$).

Figure 12.5: Number of neighbours $k$ vs. dimension $n$ for fixed $N$ and $d$.


Figure 12.6: Number of training examples required to preserve the same kind of locality obtained for $n = 1$ with $k = 10$ and $d = 0.1$.

Figure 12.7: Median nearest-neighbour distance as a function of $n$ for very large $N$ ($N = 10^6$ points).


All those considerations should sound like a warning for those willing to extend local learning approaches to large-dimensional settings, where familiar notions of distance and closeness lose their meaning and relevance. Large dimensionality induces high sparseness, with a negative impact on predictive accuracy, as shown by the bias/variance decomposition in (10.1.47). For a fixed $N$ and increasing $n$, the algorithm is more and more exposed to one of two low-generalisation configurations: i) too small $k$, i.e. too few points are close to the query point (with a negative impact in terms of variance), or ii) too large $d$, implying that the nearest neighbours are not sufficiently close to the query point (with a negative impact on bias).

Though, from a bias/variance perspective, the curse of dimensionality is particularly harmful for local learning strategies, other learning strategies should not be considered immune either. A too large $n/N$ ratio implies an overparametrisation of the learned hypothesis and a consequent increase of the variance term in the generalisation error, which is hardly compensated by the related bias reduction. For this reason, the adoption of a feature selection step is more and more common in modern machine learning pipelines.

12.2 Approaches to feature selection

There are three main approaches to feature selection:

• Filter methods: they are preprocessing methods. They attempt to assess the merits of features from the data, ignoring the effects of the selected feature subset on the learning algorithm's performance. Examples are methods that select variables by ranking them through compression techniques (like PCA or clustering) or by computing their correlation with the output.

• Wrapper methods: these methods assess subsets of variables according to their usefulness to a given predictor. The method searches for a good subset using the learning algorithm itself as part of the evaluation function. The problem boils down to a problem of stochastic state-space search. Examples are the stepwise methods proposed in linear regression analysis (notably the leaps subset selection algorithm available in R [15]).

• Embedded methods: they perform variable selection as part of the learning procedure and are usually specific to given learning machines. Examples are classification trees, random forests, and methods based on regularisation techniques (e.g. lasso).

Note that, in practice, hybrid strategies combining the three approaches above are often considered as well. For instance, in the case of a huge dimensional task (e.g. $n > 1000$K as in epigenetics), it would make sense to first reduce the number of features to a more reasonable size (e.g. some thousands or hundreds of features) by filtering and then use some search approach within this smaller space.

12.3 Filter methods

Filter methods are commonly used in very large-dimensional tasks (e.g. $n > 2000$) for the following reasons: they easily scale to very high-dimensional datasets, they are quick because they are computationally simple, and they are independent of the classification algorithm. Also, since feature selection needs to be performed only once, they can be integrated into validation pipelines comparing several classifiers.

However, they are not perfect. Filter methods, by definition, ignore any interaction with the classifier and are often univariate or low-variate. The relevance of each feature is assessed separately, thereby ignoring feature dependencies. This may be detrimental in the case of complex multivariate dependencies.

Figure 12.8: The first two principal components for an $n = 2$ dimensional Gaussian distribution.

12.3.1 Principal component analysis

Principal component analysis (PCA) is one of the oldest and most popular preprocessing methods to perform dimensionality reduction. It returns a set of linear combinations of the original features so as to retain most of their variance and information. Those combinations may be used as compressed (or latent) versions of the original features and used to perform learning in a lower-dimensional space.

The method consists of projecting the data from the original orthogonal space $X$ into a lower-dimensional space $Z$, in an unsupervised manner, maximising the variance and minimising the loss due to the projection. The new space is orthogonal (like the original one) and its axes, called principal components, are specific linear combinations of the original ones.

The first principal component (i.e. the axis $z_1$ in Figure 12.8) is the axis along which the projected data have the greatest variation. Its direction $a = [a_1, \ldots, a_n] \in \mathbb{R}^n$ is obtained by maximising the variance of

$$z = a_1 x_{\cdot 1} + \cdots + a_n x_{\cdot n} = a^T x,$$

a linear combination of the original features. It can be shown that $a$ is also the eigenvector of the covariance matrix $\text{Var}[x]$ associated with the largest eigenvalue [56].

The procedure for finding the other principal components is based on the same principle of variance maximisation. The second principal component (i.e. the axis $z_2$ in Figure 12.8) is the axis, orthogonal to the first, along which the projected data have the largest variation, and so forth.


12.3.1.1 PCA: the algorithm

Consider the training input matrix $X$ of size $[N, n]$. The PCA consists of the following steps:

1. the matrix $X$ is normalised and transformed into a matrix $\tilde{X}$ such that each column $\tilde{X}[, j]$, $j = 1, \ldots, n$, has null mean and unit variance^1,

2. the Singular Value Decomposition (SVD) [83] (Appendix B.5.10) of $\tilde{X}$ is computed,
$$\tilde{X} = U D V^T$$
where $U$ is a $[N, N]$ matrix with orthonormal columns, $D$ is a $[N, n]$ rectangular diagonal matrix with diagonal singular values $d_1 \geq d_2 \geq \cdots \geq d_n \geq 0$, $d_j = \sqrt{\lambda_j}$ with $\lambda_j$ the $j$th eigenvalue of $X^T X$, and $V$ is a $[n, n]$ matrix whose orthonormal columns are the eigenvectors of $X^T X$,

3. the matrix $\tilde{X}$ is replaced by the linear transformation
$$Z = \tilde{X} V = U D \qquad (12.3.1)$$
whose columns (also called eigen-features) are linear combinations of the original features and whose variances are sorted in decreasing order,

4. a truncated version of $Z$ made of the first $h < n$ columns (associated with the $h$ largest singular values) is returned.
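A minimal R sketch of these four steps follows; the synthetic data and the choice $h = 2$ are illustrative assumptions.

## SVD-based PCA, step by step (minimal sketch).
set.seed(0)
N <- 100; n <- 5; h <- 2
X <- matrix(rnorm(N * n), N, n)
Xtilde <- scale(X)              # step 1: zero-mean, unit-variance columns
s <- svd(Xtilde)                # step 2: Xtilde = U D V^T
Z <- Xtilde %*% s$v             # step 3: Z = Xtilde V (= U D)
Zh <- Z[, 1:h]                  # step 4: keep the h leading eigen-features
lambda <- s$d^2                 # eigenvalues of Xtilde^T Xtilde
cumsum(lambda) / sum(lambda)    # proportion of variance explained (useful for choosing h)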

But how do we select a convenient number $h$ of eigen-features? In the literature, three main strategies are considered:

1. fix a threshold $\alpha$ on the proportion of variance to be explained by the principal components, e.g. choose $h$ such that
$$\frac{\lambda_1 + \cdots + \lambda_h}{\sum_{j=1}^{n} \lambda_j} \geq \alpha$$
where $\lambda_j$ is the $j$th largest eigenvalue and $\sum_{j=1}^{h} \lambda_j$ is the amount of variance retained by the first $h$ components,

2. plot the decreasing values of $\lambda_j$ as a function of $j$ (scree plot) and choose the value of $h$ corresponding to a knee in the curve,

3. select the value of $h$ as if it were a hyperparameter, e.g. by cross-validation.

The outcome of PCA is a rotated, compressed and lower-dimensional version of the original input set $\{x_1, \ldots, x_n\}$, made of $h < n$ orthogonal features $\{z_1, \ldots, z_h\}$ sorted by decreasing variance. In that sense, PCA can be considered as a linear auto-encoder where the encoding step is performed by (12.3.1) and the reconstruction of the coded data to the original space is obtained by $\tilde{X} = Z V^T$. It can also be shown [56] that PCA implements an optimal linear auto-encoder since it minimises the average reconstruction error

$$\sum_{i=1}^{N} \| x_i - V V^T x_i \|^2 \qquad (12.3.2)$$

(where $V$ here denotes the matrix restricted to the first $h$ eigenvectors), which amounts, for $h$ components, to $\sum_{j=h+1}^{n} \lambda_j / N$.

^1 An R dataframe may be easily normalised by using the R command scale.


Figure 12.9: A separable $n = 2$ dimensional binary classification task reduced to a non-separable one by PCA dimensionality reduction.

PCA works in a completely unsupervised manner since the entire algorithm is independent of the target $y$. Though this unsupervised nature reduces the risk of overfitting, in some cases it may cause a deterioration of the generalisation accuracy, since there is no reason why the principal components should be associated with $y$. For instance, in the classification example of Figure 12.9, choosing the first PCA component would reduce the accuracy of the classifier instead of increasing it. In order to account both for input variation and for correlation with the target, supervised versions of PCA exist, like principal component regression or partial least squares.

Another limitation of PCA is that it does not return a subset but a weighted average of the original features (the eigen-features). In some cases, e.g. in bioinformatics gene selection, PCA is then not recommended since it may hinder the interpretability of the resulting model.

R script

The scripts pca.R and pca3D.R illustrate the PCA decomposition in the $n = 2$ and $n = 3$ cases for Gaussian distributed data and compute the reconstruction error (12.3.2).

The script pca_uns.R illustrates the limits of PCA due to its unsupervised nature. Consider a binary classification task with $n = 2$ and a separating boundary between the two classes directed along the first component. In this case, a dimensionality reduction is rather detrimental to the final accuracy since it transforms the separable $n = 2$ problem into a non-separable $n = 1$ problem (Figure 12.9).

PCA is an example of linear dimensionality reduction. In the machine learning

literature, however, there are several examples of nonlinear versions of PCA: among

the most important we mention the kernel-based version of PCA (KPCA) and

(deep) neural auto-encoders (Section 10.1.2).


12.3.2 Clustering

Clustering, also known as unsupervised learning, is presented in Appendix A. Here

we will discuss how it plays a role in dimensionality reduction by determining groups

of features or observations with similar patterns (e.g. patterns of gene expressions

in microarray data).

The use of a clustering method for feature selection requires the definition of a distance function between variables and of a distance between clusters. The two most common methods are:

• Nearest-neighbour clustering: the number of clusters is set by the user, then each variable is assigned to a cluster at the end of an iterative procedure. Examples are Self-Organizing Maps (SOM) and K-means.

• Agglomerative clustering: a bottom-up method where clusters are initially empty and sequentially filled with variables. An example is hierarchical clustering (R command hclust), which starts by considering all the variables as belonging to separate clusters. Next, it joins pairs of similar features in the same cluster, and then it proceeds hierarchically by merging the closest pairs of clusters. The algorithm requires a measure of dissimilarity between sets of features and a linkage criterion that quantifies the set dissimilarity as a function of the pairwise distances of the set elements. The visual output of hierarchical clustering is a dendrogram, a tree diagram used to illustrate the arrangement of the clusters. Figure 12.10 illustrates the dendrogram returned by a clustering of features in a bioinformatics task. Note that the dendrogram returns different clusters of features (and a different number of clusters) at different heights. The choice of the optimal height cut is typically done by means of a cross-validation strategy [126].
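As an illustration, here is a hedged sketch of feature clustering with hclust; the use of $1 - |\rho|$ as dissimilarity between variables and the average linkage are illustrative choices.

## Agglomerative clustering of features (sketch).
set.seed(0)
X <- matrix(rnorm(50 * 8), 50, 8)
X[, 2] <- X[, 1] + rnorm(50, sd = 0.1)   # make features 1 and 2 redundant
d <- as.dist(1 - abs(cor(X)))            # feature-feature dissimilarity
hc <- hclust(d, method = "average")      # linkage criterion
plot(hc)                                 # dendrogram; cut with cutree(hc, k = ...)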

Clustering and PCA are both unsupervised dimensionality reduction techniques,

which are commonly used in several domains (notably bioinformatics). However, the

main advantage of clustering resides in the higher interpretability of the outcome.

Unlike the PCA linear weighting, the grouping of the original features is much more

informative and may return useful insights to the domain expert (e.g. about the

interaction of a group of genes in a pathology [92]).

12.3.3 Ranking methods

Unlike PCA and clustering, ranking methods are supervised filters since they take into account the relation between the inputs and the target $y$ to proceed with the selection. Ranking methods consist of three steps: i) they first assess the importance (or relevance) of each variable for the output by using a univariate measure, ii) they rank the variables in decreasing order of relevance, and iii) they select the top $k$ variables.

Relevance measures commonly used in assessing a feature are:

• the Pearson linear correlation (the larger, the more relevant);

• in the case of binary classification tasks, the p-value of hypothesis tests like the t-test or the Wilcoxon test (the lower, the more relevant);

• the mutual information (Section 3.8) (the larger, the more relevant).
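A minimal sketch of correlation-based ranking on synthetic data (where, by construction, only the first two of $n = 20$ features are relevant):

set.seed(0)
N <- 100; n <- 20
X <- matrix(rnorm(N * n), N, n)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(N, sd = 0.5)
relevance <- abs(cor(X, y))                  # univariate relevance of each feature
ranking <- order(relevance, decreasing = TRUE)
head(ranking)                                # features 1 and 2 should come first
topk <- ranking[1:5]                         # select the top k = 5 variables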

Figure 12.10: Dendrogram.

Ranking methods are fast (complexity $O(n)$), and their output is intuitive and easy to understand. At the same time, they disregard redundancies and higher-order interactions between variables. Two typical situations where ranking does not perform well are the complementary and the highly redundant configurations. In the complementary case, two input features individually carry very little information about the target, yet they are very informative when taken together (see the XOR configuration later). Because of their low univariate relevance, ranking methods will rank them low and consequently discard them. Conversely, two variables could both be highly relevant for the target but very similar (or identical). In this redundant case, both will be ranked very high and selected, despite their evident redundancy.

Feature selection in a gene expression dataset

A well-known high-dimensional classification task is gene expression classification in bioinformatics, where the variables correspond to genomic features (e.g. gene probes), the observations to patients, and the targets to biological phenotypes (e.g. cancer grade). Because of the growing capabilities of sequencing technology, the number of genomic features is typically much larger than the size of patient cohorts.

In the script featsel.R we analyse the microarray dataset from [84]. This dataset contains the genome expression of $n = 7129$ genes for $N = 72$ patients, and $V = 11$ related phenotype variables. The expression matrix $X$ and the phenotype vector $Y$ are contained in the dataset data(golub). The script studies the dependency between the gene expressions and the binary phenotype ALL.AML indicating the leukaemia type: acute lymphoblastic leukaemia (ALL) or acute myeloid leukaemia (AML). Relevant features are selected by correlation ranking and the misclassification errors are computed for different sizes of the feature set.

12.4 Wrapping methods

Wrapper methods combine a search in the space of possible feature subsets with an assessment phase relying on a learner and a validation (often cross-validation) technique. Unlike filter methods, wrappers take into consideration the interaction between features, and this in a supervised manner. Unfortunately, this implies a much higher computational cost, especially in the case of expensive training phases. Also, the dependence of the final result on the choice of the learner could be considered as a nuisance factor confounding the impact of the feature set on the final accuracy^2. In other terms, the issue is: was the feature set returned by the wrapper because it was good in general or only for that specific learner (e.g. a neural network)?

The wrapper search can be seen as a search in a space $W = \{0, 1\}^n$ where a generic vector $w \in W$ is such that

$$w[j] = \begin{cases} 0 & \text{if the input } j \text{ does NOT belong to the set of features} \\ 1 & \text{if the input } j \text{ belongs to the set of features} \end{cases}$$

Wrappers look for the optimal vector $w^* \in \{0, 1\}^n$ such that

$$w^* = \arg\min_{w \in W} \widehat{\text{MISE}}_w \qquad (12.4.3)$$

where $\widehat{\text{MISE}}_w$ is the estimate of the generalisation error of the model based on the set of variables encoded by $w$. Since in real settings the actual generalisation error is not directly observable, the computation of $\widehat{\text{MISE}}_w$ requires the definition of a learner and of a validation strategy.

Note that the number of vectors in $W$ is equal to $2^n$, that it doubles for each new feature, and that for moderately large $n$ (e.g. $n > 20$) the exhaustive search is no longer affordable. For this reason, wrappers typically rely on heuristic search strategies.

12.4.1 Wrapping search strategies

Three greedy strategies are commonly used to avoid the exponential complexity $O(2^n)$ of the exhaustive approach:

• Forward selection: the procedure starts with no variables and progressively incorporates features. The first selected input is the one that returns the lowest generalisation error. The second selected input is the one that, together with the first, has the lowest error, and so on, until no further improvement is made or the required number of features is attained. An example of forward selection is implemented in the R script fs_wrap.R (a minimal sketch is also given after this list).

• Backward selection: it works in the opposite direction of the forward approach by progressively removing features from the original feature set. The procedure starts by learning a model using all the $n$ variables and, therefore, requires at least $N > n$. Then the impact of dropping one feature at a time from the current subset is assessed. The feature which is actually removed is the one that yields the lowest generalisation error after deletion; in other terms, it is the one whose absence causes the lowest increase (or highest decrease) of the generalisation error. The procedure iterates until the desired number of features is attained.

• Stepwise selection: it combines the previous two techniques by testing, for each set of variables, first the removal of features belonging to the set, then the addition of variables not in the set.
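The sketch below illustrates forward selection with a leave-one-out assessment of a linear model; it is a hedged, self-contained toy version, not the companion fs_wrap.R script.

set.seed(0)
N <- 50; n <- 10
X <- matrix(rnorm(N * n), N, n)
y <- X[, 1] + X[, 2] + rnorm(N, sd = 0.5)

loo.mse <- function(sel) {               # leave-one-out MSE of the subset `sel`
  e <- sapply(1:N, function(i) {
    fit <- lm(y[-i] ~ X[-i, sel, drop = FALSE])
    y[i] - c(1, X[i, sel]) %*% coef(fit)
  })
  mean(e^2)
}
selected <- integer(0)
for (step in 1:3) {                      # greedily add 3 features
  candidates <- setdiff(1:n, selected)
  scores <- sapply(candidates, function(j) loo.mse(c(selected, j)))
  selected <- c(selected, candidates[which.min(scores)])
}
selected   # should start with features 1 and 2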

^2 This is the reason why a blocking-factor approach, controlling the variability due to the learning algorithm and improving the robustness of the solution, has been proposed in [28].


It can be shown that the forward and backward strategies have a $O(n^2)$ time complexity in the case of $n$ steps: since the $i$th step ($i = 0, \ldots, n-1$) requires $n - i$ assessments to select (or remove) the $(i+1)$th feature, the computational complexity for $n$ steps amounts to $\sum_{i=0}^{n-1} (n - i) = \frac{n(n+1)}{2}$.

Nevertheless, since even this complexity may not be affordable for very large $n$, it is common practice to first reduce the number of features by using a fast filter method (e.g. ranking) and then apply a wrapper strategy to the remaining features. Another trick consists of limiting the maximum size of the feature set, thus reducing the computational cost.

12.4.2 The Cover and van Campenhout theorem

The rationale of forward and backward greedy heuristics is that an optimal set of size $k$ should contain the optimal set of size $k - 1$. Though this seems intuitive, in the general case there is no reason why this relation should hold. A formal result in that sense is provided by the Cover and van Campenhout theorem [58], which contains a negative result about the aim of wrapper search techniques to find the optimal subset by local procedures.

Let us consider a learning problem and denote by $R(w)$ the lowest functional risk (7.2.6) for the subset of variables $w$. Cover and van Campenhout proved that the only generally valid (i.e. holding for all data distributions) monotonic relation linking feature-set size and generalisation is:

$$w_2 \subset w_1 \Rightarrow R(w_1) \leq R(w_2) \qquad (12.4.4)$$

i.e. by adding variables we reduce the minimal risk^3.

Given $n$ features, any ordering of the $2^n$ subsets which is consistent with the above constraint is indeed possible: for any such ordering, there exists a distribution of the data that is compatible with it. If the optimal three-variable set is $\{x_{\cdot 1}, x_{\cdot 3}, x_{\cdot 13}\}$, there is no guarantee that the best set of four variables is a superset of it (as is assumed in forward selection). According to this theorem, there exists a distribution for which the best set of 4 features could well be $\{x_{\cdot 2}, x_{\cdot 6}, x_{\cdot 16}, x_{\cdot 23}\}$, since this is not in contradiction with the constraint (12.4.4). In other words, the Cover and van Campenhout theorem states that there are data distributions for which forward/backward strategies could be arbitrarily bad.

12.5 Embedded methods

Embedded methods are typically less computationally intensive than wrapper methods but are specific to a learning machine. Well-known examples are classification trees, Random Forests (Section 11.4), Naive Bayes (Section 10.2.3.1), shrinkage methods and kernels.

^3 Note that this relation refers to the optimal model that could be learned with the input subset $w$ and that the notion of lowest functional risk takes into consideration neither the model family nor the finite-size setting. In other terms, this inequality refers only to the bias and not to the variance component of the generalisation error. So, though in theory $R(w_1) \leq R(w_2)$, in practice it could happen that $G_N(w_1) \geq G_N(w_2)$, where $G_N$ is the generalisation error of the model learned with $N$ observations.


12.5.1 Shrinkage methods

Shrinkage is a technique to improve a least-squares estimator by regularisation; it reduces the model variance by adding constraints on the values of the coefficients. In what follows, we present two shrinkage approaches that penalise least-squares solutions having a large number of coefficients different from zero. The rationale is that only those variables whose impact on the empirical risk is considerable deserve a coefficient different from zero and should appear in the fitted model. Shrinkage is an implicit (and more continuous) embedded manner of doing feature selection since only a subset of variables contributes to the final predictor.

12.5.1.1 Ridge regression

Ridge regression is an example of a shrinkage method applied to least-squares regression:

$$\hat{\beta}_r = \arg\min_b \left\{ \sum_{i=1}^{N} (y_i - x_i^T b)^2 + \lambda \sum_{j=1}^{p} b_j^2 \right\} = \arg\min_b \; (Y - Xb)^T (Y - Xb) + \lambda b^T b$$

where $\lambda > 0$ is a complexity parameter that controls the amount of shrinkage: the larger the value of $\lambda$, the greater the amount of shrinkage. Note that if $\lambda = 0$ the approach boils down to conventional unconstrained least squares.

An equivalent formulation of the ridge problem is

$$\hat{\beta}_r = \arg\min_b \sum_{i=1}^{N} (y_i - x_i^T b)^2, \qquad \text{subject to} \quad \sum_{j=1}^{p} b_j^2 \leq L$$

where there is a one-to-one correspondence between the parameters $\lambda$ and $L$ [98].

It can be shown that the ridge regression solution is

$$\hat{\beta}_r = (X^T X + \lambda I_p)^{-1} X^T Y \qquad (12.5.5)$$

where $I_p$ is the $[p, p]$ identity matrix ($p = n + 1$); it is typically recommended that the $X$ columns be normalised (zero mean and unit variance) [132]. In algebraic terms, a positive $\lambda$ ensures that the matrix to be inverted is symmetric and strictly positive definite.

If $n \gg N$ it is recommended to take advantage of the SVD decomposition (B.5.11) to avoid the inversion of a too large matrix [100]. If we set $X = U D V^T$ then we obtain from (12.5.5) and (B.9.15)

$$\hat{\beta}_r = (V D U^T U D V^T + \lambda I_p)^{-1} V D U^T Y = V (R^T R + \lambda I_N)^{-1} R^T Y$$

where $R = UD$ is a $[N, N]$ matrix and $I_N$ is the $[N, N]$ identity matrix.

In general, ridge regression is beneficial in numerical, statistical and interpretability terms. From a numerical perspective, it is able to deal with rank-deficient matrices $X$ and reduces the ill-conditioning of the matrix $X^T X$. From a statistical perspective, it reduces the variance of the least-squares solution $\hat{\beta}_r$ (Section 9.1.14) at the cost of a slight bias increase. Given the predominance of the variance term in high-dimensional tasks, ridge regression enables a reduction of the generalisation error. Last but not least, by pushing the absolute value of many coefficients towards zero, it allows the identification of a small (hence interpretable) number of input features.
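A minimal sketch of (12.5.5) in R on synthetic data; the value of $\lambda$ is an illustrative choice (it should be tuned, e.g. by cross-validation).

set.seed(0)
N <- 30; n <- 10
X <- scale(matrix(rnorm(N * n), N, n))   # normalised columns, as recommended
y <- X[, 1] + X[, 2] + rnorm(N, sd = 0.5)
lambda <- 2
Xp <- cbind(1, X)                        # add the intercept: p = n + 1 columns
p <- ncol(Xp)
beta.r <- solve(t(Xp) %*% Xp + lambda * diag(p), t(Xp) %*% y)  # ridge solution (12.5.5)
## lambda = 0 recovers ordinary least squares:
max(abs(solve(t(Xp) %*% Xp, t(Xp) %*% y) - coef(lm(y ~ X))))   # ~ 0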


12.5.1.2 Lasso

Another well-known shrinkage method is the lasso, which estimates the linear parameters by

$$\hat{\beta}_r = \arg\min_b \sum_{i=1}^{N} (y_i - x_i^T b)^2, \qquad (12.5.6)$$

subject to

$$\sum_{j=1}^{p} |b_j| \leq L \qquad (12.5.7)$$

If on one hand the 1-norm penalty of the lasso approach allows a stronger constraint on the coefficients, on the other hand it makes the solution nonlinear and demands the adoption of a quadratic programming algorithm (details in Appendix B.8). To formulate the problem (12.5.6) in the form (B.8.12) with linear constraints, we may write the $b_j$ terms as the difference of two non-negative numbers

$$b_j = b_j^+ - b_j^-, \qquad b_j^+ = \frac{|b_j| + b_j}{2}, \quad b_j^- = \frac{|b_j| - b_j}{2}$$

The function to optimise becomes

$$J(b) = b^T X^T X b - 2 Y^T X b = \begin{bmatrix} b^+ \\ b^- \end{bmatrix}^T \begin{bmatrix} X^T X & -X^T X \\ -X^T X & X^T X \end{bmatrix} \begin{bmatrix} b^+ \\ b^- \end{bmatrix} - \begin{bmatrix} 2 Y^T X & -2 Y^T X \end{bmatrix} \begin{bmatrix} b^+ \\ b^- \end{bmatrix} \qquad (12.5.8)$$

with the constraints

$$\begin{bmatrix} -1 & -1 & \cdots & -1 \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \begin{bmatrix} b^+ \\ b^- \end{bmatrix} \geq \begin{bmatrix} -L \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

where the left-hand matrix is $[2p + 1, 2p]$. The first line of the inequality is (12.5.7) since $\sum_{j=1}^{p}(b_j^+ + b_j^-) = \sum_{j=1}^{p} |b_j| \leq L$.

Note that if $L > \sum_{j=1}^{p} |\hat{\beta}_j|$ the lasso returns the common least-squares solution. The penalty factor $L$ is typically set by having recourse to cross-validation strategies.

Though the difference between ridge regression and lasso might seem negligible, the use of a 1-norm penalty instead of a 2-norm has a sensible impact on the number of final coefficients which are set to zero. Figure 12.11 visualises this in a bivariate case: $\hat{\beta}$ denotes the least-squares solution, which would be returned by both methods if $\lambda = 0$. Since $\lambda > 0$, the minimisation combines the empirical risk function (whose contour lines are the ellipsoids around $\hat{\beta}$) and the regularisation term (whose contour lines are around the origin). Note that the only difference between the two figures is the shape of the regularisation contour lines (related to the norm used). The minimisation solution is a bivariate vector which lies somewhere (depending on the $\lambda$ value) at the intersection of an empirical risk contour line and a regularisation one. The figure shows that in the lasso case this intersection tends to be closer to the axis $\beta_1 = 0$, meaning that the first estimated coefficient is set to zero. Because of the circular shape of the regularisation contours, this is much less probable in the ridge regression case.

Figure 12.11: Ridge regression vs lasso [98].

R script

The R script lasso.R implements the quadratic programming minimisation in (12.5.8) by using the R library quadprog. The script applies the lasso strategy to a regression task where the number of features $n$ is comparable to the number of observations $N$ and only a small number of features is relevant. The results show the impact of the constraint $L$ on the empirical risk and the evolution of the lasso solution moving towards one of the axes. In particular, the smaller $L$, the less importance is given to minimising $J$, the larger the empirical risk and the smaller the number of estimated parameters different from zero.
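As an alternative to the quadratic-programming route of lasso.R, here is a hedged sketch of the equivalent penalised (Lagrangian) form solved by coordinate descent with soft-thresholding, a standard lasso solver; the data and the value of lambda are illustrative assumptions.

## Lasso via coordinate descent on J(b) = sum((y - Xb)^2) + lambda * sum(|b|),
## equivalent to (12.5.6)-(12.5.7) for a corresponding L (minimal sketch).
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)   # soft-thresholding operator
set.seed(0)
N <- 40; n <- 30
X <- scale(matrix(rnorm(N * n), N, n))
y <- X[, 1] - X[, 2] + rnorm(N, sd = 0.3)
lambda <- 20
b <- rep(0, n)
for (it in 1:200) {
  for (j in 1:n) {
    r <- y - X[, -j] %*% b[-j]               # partial residual excluding feature j
    b[j] <- soft(sum(X[, j] * r), lambda / 2) / sum(X[, j]^2)
  }
}
which(b != 0)   # only a few coefficients survive the 1-norm penalty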

The shrinkage approach has been very successful in recent years, and several variants of the methods mentioned above exist in the literature: some adopt different penalty norms, some combine different norms (e.g. Elastic-net) and some combine shrinkage with greedy search (e.g. Least Angle Regression).

12.5.2 Kernel methods

Many learning algorithms, such as the perceptron, support vector machines (SVM) and PCA, process data in a linear manner through inner products (Section B.2). Those techniques are exposed to two main limitations: the linear nature of the model and the curse of dimensionality for large $n$.

Kernel methods [171] adapt those techniques by relying on the combination of two smart ideas: i) address large-dimension $n$ problems by solving a dual problem in a space of dimension $N$, and ii) generalise the notion of inner product by adopting a user-specified kernel function, i.e., a similarity function over pairs of data points.

Kernel functions operate in a high-dimensional, implicit feature space without computing the coordinates of the data in that space. This makes it possible to take advantage of highly nonlinear high-dimensional representations without actually having to work in the high-dimensional space.


Figure 12.12: Implicit transformation of the problem to a high-dimensional space.

12.5.3 Dual ridge regression

We introduced the dual formulation of the linear least-squares problem in Section 9.1.18. Consider now a ridge regression problem (Section 12.5.1.1) with parameter $\lambda \in \mathbb{R}^+$. The conventional least-squares solution is the $[n, 1]$ parameter vector

$$\hat{\beta} = (X' X + \lambda I_n)^{-1} X' y$$

where $I_n$ is the identity matrix of size $n$. Since from (B.9.15)

$$(X' X + \lambda I_n)^{-1} X' = X' (X X' + \lambda I_N)^{-1}$$

where $I_N$ is the identity matrix of size $N$, the dual formulation is

$$\hat{\beta} = X' (X X' + \lambda I_N)^{-1} y = X' \alpha$$

where

$$\alpha = (K + \lambda I_N)^{-1} y$$

is the $[N, 1]$ vector of dual variables and $K = X X'$ is the kernel or Gram $[N, N]$ matrix. Note that all the information required to compute $\alpha$ is in this matrix of inner products.

The prediction for a test $[N_{ts}, n]$ dataset $X_{ts}$ is

$$\hat{y}_{ts} = X_{ts} \hat{\beta} = X_{ts} X' \alpha = K_{ts} (K + \lambda I_N)^{-1} y$$

where $K_{ts}$ is a $[N_{ts}, N]$ matrix with $k_{j,i} = \langle x_j, x_i \rangle$, $j = 1, \ldots, N_{ts}$, $i = 1, \ldots, N$.

This derivation allows transforming an $n$-dimensional linear task into an $N$-dimensional one. This is of course very relevant if $n \gg N$. However, the model remains linear. What about nonlinear models?
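A quick R check that the primal and dual ridge solutions coincide (synthetic data and $\lambda$ are illustrative):

set.seed(0)
N <- 20; n <- 500                        # n >> N: the dual inverts a [N, N] matrix
X <- matrix(rnorm(N * n), N, n)
y <- X[, 1] - X[, 2] + rnorm(N, sd = 0.1)
lambda <- 1
beta.primal <- solve(t(X) %*% X + lambda * diag(n), t(X) %*% y)  # [n, n] inversion
alpha <- solve(X %*% t(X) + lambda * diag(N), y)                 # [N, N] inversion
beta.dual <- t(X) %*% alpha
max(abs(beta.primal - beta.dual))        # ~ 0 up to numerical precision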

12.5.4 Kernel function

Suppose we apply a nonlinear transformation $\Phi : x \in \mathbb{R}^n \rightarrow \Phi(x) \in \mathbb{R}^M$ to the inputs of the ridge regression problem discussed above (Figure 12.12). The prediction for an input $x$ would now be

$$\hat{y} = y' (K + \lambda I_N)^{-1} k$$

where

$$K_{i,j} = \langle \Phi(x_i), \Phi(x_j) \rangle, \qquad k_i = \langle \Phi(x_i), \Phi(x) \rangle$$


The rationale of kernel methods is that those inner products can be computed efficiently without explicitly computing the mapping $\Phi$, thanks to a kernel function [171]. A kernel function is a function $\kappa$ that for all $x, z \in X$ satisfies

$$\kappa(x, z) = \langle \Phi(x), \Phi(z) \rangle$$

where $\Phi$ is a mapping from $X$ to a feature space $F$. For instance,

$$\kappa(x, z) = \langle x, z \rangle^2 = \langle \Phi(x), \Phi(z) \rangle$$

where

$$\Phi : x = (x_1, x_2) \rightarrow \Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2) \in F$$

Kernels decouple the specification of the algorithm from the specification of the feature space, since they provide a way to compute dot products in some feature space without even knowing what this space and the function $\Phi$ are. For instance,

$$\kappa(x, z) = (1 + x^T z)^2$$

corresponds to a transformation to an $M = 6$ dimensional space

$$\Phi(x_1, x_2) = (1, x_1^2, x_2^2, \sqrt{2}\, x_1, \sqrt{2}\, x_2, \sqrt{2}\, x_1 x_2)$$

A Gaussian kernel $\kappa(x, z) = \exp(-\gamma \|x - z\|^2)$ corresponds to a transformation to an infinite-dimensional space.

Theoretically, a Gram matrix must be positive semi-definite (PSD). Empirically, for machine learning heuristics, choices of a function $\kappa$ that do not satisfy the PSD condition may still perform reasonably if $\kappa$ at least approximates the intuitive idea of similarity.

The general idea of transposing a low-dimensional method to a nonlinear high-dimensional setting by using a dual formulation is generally referred to as the kernel trick: given any algorithm that can be expressed solely in terms of dot products, the kernel trick allows us to construct different nonlinear versions of it.
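A minimal sketch of kernel ridge regression with a Gaussian kernel; the data, $\gamma$ and $\lambda$ are illustrative assumptions (both are hyperparameters to be tuned):

## Kernel ridge regression with a Gaussian kernel (minimal sketch).
gauss.kernel <- function(X1, X2, gamma) {
  D2 <- outer(rowSums(X1^2), rowSums(X2^2), "+") - 2 * X1 %*% t(X2)
  exp(-gamma * D2)                       # pairwise kappa(x, z) = exp(-gamma ||x - z||^2)
}
set.seed(0)
N <- 100
X <- matrix(runif(N, -3, 3), N, 1)
y <- sin(X[, 1]) + rnorm(N, sd = 0.1)    # a nonlinear dependency
lambda <- 0.1; gamma <- 1
K <- gauss.kernel(X, X, gamma)
alpha <- solve(K + lambda * diag(N), y)  # dual variables
Xts <- matrix(seq(-3, 3, length.out = 9), 9, 1)
yhat <- gauss.kernel(Xts, X, gamma) %*% alpha   # K_ts alpha: nonlinear predictions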

Kernel methods are, together with deep learning and random forests, among the most successful methods in the history of machine learning. We decided to present them in this section because of their powerful strategy for dealing with settings with high dimension and a low number of observations. Their strength can however turn into a weakness if we aim to scale the approach to very large $N$. At the same time, as for all the other methods presented in this book, their generalisation accuracy depends strictly on the adequate choice of the related hyperparameters: in the case of kernel methods, the most important ones are the regularisation term $\lambda$, the analytical form of the kernel function and its parameters.

12.6 Similarity matrix and non-numeric data

In the previous sections, we considered feature selection techniques for conventional supervised tasks where data are numeric and represented in a conventional tabular form $D_N$. What about non-conventional tasks where the training set is not a data table but a set of items? Examples of items could be music tracks, texts, images, web sites or graphs. Often, in those cases, we are not able (or confident enough) to encode each item as a numeric vector of size $n$. Nevertheless, we may be confident in defining a similarity score between pairs of items. For instance, we may use the musical genre to measure the similarity between tracks, or user access statistics to obtain the similarity between web sites.

As a result, we may encode the item set as a similarity matrix $S$ of size $[N, N]$, which becomes an alternative way of representing the dataset.


A symmetric factorisation of a symmetric $[N, N]$ matrix

$$S \approx F F^T \qquad (12.6.9)$$

is an approximation of the similarity matrix where $F$ is a $[N, K]$ matrix. The matrix $F$ may be used as an approximate $K$-dimensional numeric representation of the non-numeric item set.

Note that the positive definiteness of $S$ is a necessary and sufficient condition for having an exact factorisation, i.e. an identity in (12.6.9). This is guaranteed in the numeric case where $S$ is the covariance matrix and the pairwise similarity is computed by dot product. In the generic non-numeric case, techniques to repair the positive definiteness of $S$ may be adopted. An alternative is the use of optimisation techniques to obtain $F$ as the solution of the minimisation task

$$F = \arg\min_U \| S - U U^T \|_F^2$$

Another limitation of the factorisation approach is that it is hardly scalable to very large $N$. For such cases, sampling-based solutions have been proposed in [2].
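A hedged sketch of a rank-$K$ factorisation via eigendecomposition; clipping negative eigenvalues is one simple (illustrative) way of repairing positive definiteness.

set.seed(0)
A <- matrix(runif(6 * 6), 6, 6)
S <- (A + t(A)) / 2                          # a symmetric similarity matrix
e <- eigen(S, symmetric = TRUE)
K <- 2
lam <- pmax(e$values[1:K], 0)                # clip negative eigenvalues (repair non-PSD S)
Fmat <- e$vectors[, 1:K] %*% diag(sqrt(lam)) # [N, K] numeric representation of the items
norm(S - Fmat %*% t(Fmat), "F")              # Frobenius reconstruction error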

12.7 Averaging and feature selection

The role of averaging methods in supervised learning has been discussed in the previous chapter. Averaging may play a crucial role also in dealing with large dimensionality. Instead of choosing one particular feature selection method and accepting its outcome as the final subset, different feature selection methods can be combined using ensemble approaches. Since there is no single optimal feature selection technique, and since more than one subset of features may fit the data equally well, model combination approaches have been adopted to improve the robustness and stability of the final model. Ensemble techniques typically rely on averaging the outcome of multiple models learned with different feature subsets. A well-known technique is the random subspace method [102], also known as feature bagging, which combines a set of learners trained on random subsets of features.

12.8 Feature selection from an information-theoretic perspective

So far, we have focused on algorithmic methods that return a subset of relevant features, without giving any formal definition of relevance. In this section, we formalise the notion of feature relevance by using concepts of information theory, like entropy, mutual information and conditional information, from Sections 3.8 and 3.8.1.

12.8.1 Relevance, redundancy and interaction

This section defines in information-theoretic terms what a relevant variable is in a supervised learning task where $X$ is a set of $n$ input variables and $y$ is the target. These definitions are obtained by interpreting in information-theoretic terms the definitions made by [117].

Definition 8.1 (Strong relevance). A variable $x_i \in X$ is strongly relevant to the target $y$ if

$$I(X_{-i}; y) < I(X; y)$$

where $X_{-i}$ is the set obtained by removing the variable $x_i$ from $X$.


In other words, a variable is strongly relevant if it carries some information about $y$ that no other variable can carry. Strong relevance indicates that the feature is always necessary for an optimal subset.

Definition 8.2 (Weak relevance). A variable is weakly relevant to the target $y$ if it is not strongly relevant and

$$\exists S \subset X_{-i} : I(S; y) < I(\{x_i, S\}; y)$$

In other words, a variable is weakly relevant when there exists a certain context $S$ in which it carries information about the target. Weak relevance suggests that the feature is not always necessary but may become necessary under certain conditions.

This definition makes clear that for some variables (typically the majority) the relevance is not absolute but rather a context-based notion. In a large-variate setting, those features are the hardest to deal with, since their importance depends on the other selected ones.

Definition 8.3 (Irrelevance). A variable is irrelevant if it is neither strongly nor weakly relevant.

Irrelevance indicates that the feature is not necessary at all. This is definitely the easiest case in feature selection. Irrelevant variables should simply be discarded.

Example

Consider a learning problem where $n = 4$, $x_2 = x_3 + w_2$ and

$$y = \begin{cases} 1 + w, & x_1 + x_2 > 0 \\ 0, & \text{else} \end{cases}$$

where $w$ and $w_2$ are noise terms. Which variables are strongly relevant, weakly relevant and irrelevant?

Definition 8.4 (Markov blanket). Let us consider a set $X$ of $n$ r.v.s., a target variable $y$ and a subset $M_y \subset X$. The subset $M_y$ is said to be a Markov blanket of $y$, $y \notin M_y$, iff

$$I(y; (X - M_y) \mid M_y) = 0$$

The following theorem can be shown [183, 150]:

Theorem 8.5 (Total conditioning). If the distribution has a perfect map in a DAG (Section 4.3.2.1) then

$$x \in M_y \Leftrightarrow I(x; y \mid X_{-(x,y)}) > 0$$

This theorem proves that, under the specific assumptions about the distribution discussed in Section 4.3.2.1, the Markov blanket of a target $y$ is composed of the set of all the strongly relevant variables in $X$.

Another useful notion for reasoning about the information of a subset of variables is the notion of interaction.

Definition 8.6 (Interaction). Given three r.v.s. $x_1$, $x_2$ and $y$, we define the interaction between these three variables as

$$I(x_1; y) - I(x_1; y \mid x_2)$$


Figure 12.13: XOR separable classification task with two inputs and one binary class taking two values (stars and rounds). The two variables $x_1$ and $x_2$ are complementary: alone they bring no information, but they bring the maximal information about $y$ when considered together.

The interaction term satisfies the following relation:

$$I(x_1; y) - I(x_1; y \mid x_2) = I(x_1; x_2) - I(x_1; x_2 \mid y) = I(x_2; y) - I(x_2; y \mid x_1)$$

In what follows, we show that it is possible to decompose the joint information of two variables into the sum of the two univariate terms and the interaction. From the chain rule (3.8.82)

$$I(x_2; y \mid x_1) + I(x_1; y) = I(x_1; y \mid x_2) + I(x_2; y)$$

we have

$$I(x_2; y \mid x_1) = I(x_2; y) - I(x_1; y) + I(x_1; y \mid x_2)$$

By summing $I(x_1; y)$ to both sides, from (3.8.82) it follows that the joint information of two variables about a target $y$ can be decomposed as follows:

$$I(\{x_1, x_2\}; y) = I(x_1; y) + I(x_2; y) - \underbrace{[I(x_1; y) - I(x_1; y \mid x_2)]}_{\text{interaction}} = I(x_1; y) + I(x_2; y) - \underbrace{[I(x_1; x_2) - I(x_1; x_2 \mid y)]}_{\text{interaction}} \qquad (12.8.10)$$

What emerges is that the joint information of two variables is not necessarily equal to, greater than or smaller than the sum of the two individual information terms. All depends on the interaction term: if the interaction term is negative, the two variables are complementary, or, in other terms, they jointly bring more information than the sum of the univariate terms. This is typically the case of the XOR example illustrated in Figure 12.13 [89]: there, $I(x_1; y) = 0$ and $I(x_2; y) = 0$, but $I(\{x_1, x_2\}; y) > 0$ and maximal. When the variables are redundant, the resulting joint information is lower than the sum $I(x_1; y) + I(x_2; y)$.

Since (12.8.10) holds also when $x_1$ and/or $x_2$ are sets of variables, this result sheds an interesting light on the non-monotonic nature of feature selection [199].
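The XOR configuration can be checked empirically in R with a plug-in estimator of the mutual information (in bits) for discrete variables; the sample size is illustrative.

## Empirical mutual information (bits) between two discrete vectors.
mi <- function(a, b) {
  p <- table(a, b) / length(a)
  pa <- rowSums(p); pb <- colSums(p)
  sum(ifelse(p > 0, p * log2(p / outer(pa, pb)), 0))
}
set.seed(0)
x1 <- rbinom(1e5, 1, 0.5); x2 <- rbinom(1e5, 1, 0.5)
y <- xor(x1, x2)
mi(x1, y)              # ~ 0 bits: x1 alone is uninformative about y
mi(paste(x1, x2), y)   # ~ 1 bit: x1 and x2 jointly determine y (negative interaction)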


12.8.2 Information-theoretic filters

In terms of mutual information, the feature selection problem can be formulated as follows. Given an output target $y$ and a set of input variables $X = \{x_1, \ldots, x_n\}$, the optimal subset of $d$ variables is the solution of the optimisation problem

$$X^* = \arg\max_{X_S \subset X, |X_S| = d} I(X_S; y) \qquad (12.8.11)$$

Thanks to the chain rule (3.8.82), this maximisation task can be tackled by adopting an incremental approach (e.g. a forward approach). Let $X = \{x_i\}$, $i = 1, \ldots, n$, be the whole set of variables and $X_S$ the set of $s$ variables selected after $s$ steps. The choice of the $(s+1)$th variable $x^{(s+1)} \in X - X_S$ can be made by solving

$$x^{(s+1)} = \arg\max_{x_k \in X - X_S} I(\{X_S, x_k\}; y) \qquad (12.8.12)$$

This is known as the maximal dependency problem and requires at each step the multivariate estimation of the mutual information term $I(\{X_S, x_k\}; y)$. Such estimation is often inaccurate in large-variate settings (i.e. large $n$ and large $s$) because of ill-conditioning and high-variance issues.

In the literature, several filter approaches have been proposed to solve the optimisation (12.8.12) by approximating the multivariate term $I(\{X_S, x_k\}; y)$ with low-variate approximations. These approximations are necessarily biased, yet much less prone to variance than their multivariate counterparts.

We mention here two of the most used information-theoretic filters:

• CMIM [74]: since according to the first (chain-rule) formulation

$$\arg\max_{x_k \in X - X_S} I(\{X_S, x_k\}; y) = \arg\max_{x_k \in X - X_S} I(x_k; y \mid X_S)$$

this filter adopts the low-variate approximation

$$I(x_k; y \mid X_S) \approx \min_{x_j \in X_S} I(x_k; y \mid x_j)$$

• mRMR (minimum Redundancy Maximal Relevance) [152]: the mRMR method approximates at the $(s+1)$th step $I(\{X_S, x_k\}; y)$ with

$$I(x_k; y) - \frac{1}{s} \sum_{x_i \in X_S} I(x_i; x_k)$$

where $s$ is the number of features in $X_S$. The method implements a forward selection which selects at the $(s+1)$th step

$$x^{(s+1)} = \arg\max_{x_k \in X - X_S} \left[ I(x_k; y) - \frac{1}{s} \sum_{x_i \in X_S} I(x_i; x_k) \right]$$

that is, a variable which has both a high relevance $I(x_k; y)$ and a low average redundancy with the set $X_S$.
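As an illustration, here is a hedged sketch of the mRMR forward loop; to keep it self-contained, the mutual information terms are replaced by absolute correlations (a reasonable proxy for jointly Gaussian variables, where $I$ is a monotone function of $|\rho|$) — an assumption of this sketch, not part of the original method.

## mRMR forward selection (sketch): |correlation| stands in for mutual information.
mrmr <- function(X, y, d) {
  relevance <- abs(cor(X, y))
  S <- integer(0); remaining <- 1:ncol(X)
  for (s in 1:d) {
    score <- sapply(remaining, function(k) {
      redundancy <- if (length(S) == 0) 0 else mean(abs(cor(X[, k], X[, S])))
      relevance[k] - redundancy   # high relevance, low average redundancy
    })
    best <- remaining[which.max(score)]
    S <- c(S, best); remaining <- setdiff(remaining, best)
  }
  S
}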

12.8.3 Information-theoretic notions and generalisation

Most of this book has dealt with the generalisation error as a means to assess, compare and select prediction models, and in this chapter we presented feature selection as an instance of model selection. Nonetheless, this last part of the chapter has mainly been referring to information-theoretic notions for performing feature selection. It is then important to provide some elucidation on how information-theoretic notions relate to the generalisation error.

Given a set $X$ of input features and a target feature $y$, the quantity $I(X; y)$ is not directly observable and has to be estimated before use. Since

$$I(X; y) = H(y) - H(y \mid X),$$

maximising $I(X; y)$ in (12.8.11) is equivalent to minimising $H(y \mid X)$. The term $H(y \mid X)$ is the entropy (or uncertainty) of $y$ once the value of the input set $X$ is given. In the Normal case, this term is proportional to the conditional variance (Equation (3.5.59)). It follows that finding the set of inputs $X$ which minimises $H(y \mid X)$ boils down to finding the set of features that attains the lowest generalisation error (7.2.6).

In real-world settings, since the conditional entropy $H(y \mid X)$ is not observable, it may be approximated by the generalisation error, e.g. by the MISE in the regression case. Hopefully, the link between feature selection, generalisation error and information theory is now clear: finding the set that maximises the mutual information in (12.8.11) boils down to finding the set that minimises the estimated generalisation error (12.4.3).

12.9 Assessment of feature selection

Most of the discussed techniques aim to find the best subset of features by performing a large number of comparisons and selections. This additional search layer inevitably increases the space of possible models and the variance of the resulting one. Despite the use of validation procedures, low misclassification or low prediction errors may be found due to chance alone. As stated in [132], given a sufficiently exhaustive search, some apparent pattern can always be found, even if all predictors have come from a random number generator. This is due to the fact that, as a consequence of the search process, the set of features is dependent on the data used to train the model, thus introducing what is called selection bias^4 [132].

A bad (and dangerous) practice is using the same set of observations to select the feature set and to assess the accuracy of the classifier. Even if cross-validation is used to assess the accuracy of the classifier, this will return an overoptimistic assessment of the generalisation error (Figure 12.14). Cross-validation has to be used to assess the entire learning process, which is composed of both a feature selection and a classification step. This means that, for each fold, both feature selection and classification have to be performed before testing on the observations set aside. Keeping feature selection out of cross-validation will return an assessment which will be all the more biased the smaller the number of observations.

^4 For a formal justification of selection bias, look at Appendix C.11. A causal perspective on selection bias is given in Section 13.7.4.
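A minimal sketch of a correct pipeline, where a feature selection step is re-run inside each fold; the correlation-ranking criterion and nfeat are illustrative choices.

## Cross-validation with feature selection INSIDE each fold (minimal sketch).
cv.fs <- function(X, y, K = 10, nfeat = 5) {
  folds <- sample(rep(1:K, length.out = nrow(X)))
  errs <- sapply(1:K, function(k) {
    tr <- folds != k
    # feature selection on the training fold only (here: correlation ranking)
    sel <- order(abs(cor(X[tr, ], y[tr])), decreasing = TRUE)[1:nfeat]
    fit <- lm(y[tr] ~ X[tr, sel])
    pred <- cbind(1, X[!tr, sel, drop = FALSE]) %*% coef(fit)
    mean((y[!tr] - pred)^2)   # error on the observations set aside
  })
  mean(errs)
}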

If cross-validation cannot be carried out (e.g. because of a too small training size), then the use of external validation sets is strongly recommended.

If no additional data are available, an alternative consists of comparing the generalisation accuracy returned by cross-validation on the original data with the one obtained by re-running the learning procedure on randomised datasets. This is inspired by the method of permutation testing described in Section 6.6. The procedure consists of repeating the feature selection and the cross-validation assessment several times using a randomised dataset instead of the original one. For instance, a randomised dataset may be obtained by reshuffling the output vector, a permutation that artificially removes the dependency between the inputs and the output. After a number of repetitions with randomised datasets, we obtain the null distribution of the accuracy in the case of no dependency between inputs and output. If the accuracy associated with the original data is not significantly better than the one obtained with randomised data, we are overfitting the data. For instance, let us consider a large-variate classification task where the cross-validated misclassification error after feature selection is 5%. If we repeat the same learning procedure 100 times with randomised datasets and obtain a significant number of times (e.g. 10) a misclassification error smaller than or equal to 5%, this is a sign of potential overfitting.

Figure 12.14: Selection bias associated with feature selection: the internal leave-one-out is an overoptimistic estimator of the test generalisation error.
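The randomisation procedure can be sketched in a few lines of R; here assess() is a hypothetical user-supplied function that runs the whole pipeline (feature selection plus cross-validation) on (X, y) and returns its error estimate.

## Permutation-based sanity check of a feature-selection + CV pipeline (sketch).
perm.check <- function(X, y, assess, R = 100) {
  err.orig <- assess(X, y)
  err.null <- replicate(R, assess(X, sample(y)))  # reshuffled y: no real dependency
  # fraction of null runs with an error at least as good as the original
  p.value <- mean(err.null <= err.orig)
  list(err = err.orig, p.value = p.value)
}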

Making a robust assessment of a feature selection outcome has striking importance today, because we are more and more confronted with tasks characterised by a very large feature-to-sample ratio (e.g. in bioinformatics [8]), where a bad assessment procedure can give too optimistic (overfitted) results.

Example

The script fselbias.R illustrates the problem of selection bias in the case of intensive search during feature selection. Consider a linear input/output regression dependency with $n = 54$ inputs (of which only 4 are relevant and the others irrelevant) and a dataset of size $N = 20$. Let us perform a forward search based on internal leave-one-out. Figure 12.14 shows the evolution of the internal leave-one-out MSE and a more reliable estimation of the generalisation MSE based on an independent test set (5000 i.i.d. examples from the same input/output process). It appears that, as the feature set size increases, the internal leave-one-out error returns a very optimistic estimation of the generalisation error. Therefore, the internal leave-one-out error is unable to detect that the optimal size of the input set (i.e. the number of strongly relevant variables) is equal to four.

12.10 Conclusion

Nowadays, feature selection is an essential component of a real-world learning pipeline. This chapter discussed how the problem is typically addressed as a stochastic optimisation task in a combinatorial state space, where the assessment of each solution and the search strategy are the key elements. Most heuristic approaches rely on a monotonicity assumption, stating that the best subset of size $k$ is always contained in the best subset of size greater than $k$. The theorem in Section 12.4.2 and the notions of interaction discussed in Section 12.8.1 show that this assumption is simplistic. Variables that are almost non-informative alone may become extremely informative together, since the relevance of a feature is context-based. This is formalised by notions from graphical models (e.g. d-separation or the Markov blanket), which unveil the conditional nature of dependency (Chapter 4). Our opinion is that the best way of conceiving feature selection is not as black-box optimisation but as reasoning on the conditional structure of the distribution underlying the data. The final aim should be, as much as possible, to shed light on the context-based role of each feature. In recent years there have been many discussions about the interpretability of data-driven models, though it is not always made clear what the most valuable information for the human user is. We deem that in a large-variate task the most useful outcome should be an interpretable description of the features, returning for each of them a context-based degree of relevance. Accuracy is only a proxy of information: the real information is in the structure.

12.11 Exercises

1. Consider the dataset

x1 x2 x3 y

1 1 0 1

0 0 1 0

0 1 0 0

0 1 1 0

1 1 0 0

1 0 1 1

1 0 0 1

0 1 1 0

1 0 1 0

1 0 0 0

1 1 0 0

0 1 1 0

Rank the input features in decreasing order of relevance by using the correlation

$$\rho_{xy} = \frac{\hat{\sigma}_{xy}}{\hat{\sigma}_x \hat{\sigma}_y}$$

as measure of relevance.

Solution: Since $\rho_{x_1 y} = 0.488$, $\rho_{x_2 y} = -0.293$ and $\rho_{x_3 y} = -0.192$, the ranking by decreasing absolute correlation is x1, x2, x3.
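The ranking can be verified in a few lines of R:

d <- data.frame(x1 = c(1,0,0,0,1,1,1,0,1,1,1,0),
                x2 = c(1,0,1,1,1,0,0,1,0,0,1,1),
                x3 = c(0,1,0,1,0,1,0,1,1,0,0,1),
                y  = c(1,0,0,0,0,1,1,0,0,0,0,0))
sapply(d[1:3], cor, y = d$y)   #  0.488  -0.293  -0.192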

2. Consider a regression task with two inputs x1, x2 and output y. Suppose we observe the following training set:

X1    X2    Y
-0.2  0.1   1

0.1 0 0.5

1 -0.3 1.2

0.1 0.2 1

-0.4 0.4 0.5

0.1 0.1 0

1 -1 1.1

1. Fit a multivariate linear model with β0 = 0 to the dataset.

2. Compute the mean squared training error.

3. Suppose you use a correlation-based ranking strategy for ranking the features.

What would be the top ranked variable?

Hint:

$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{pmatrix}, \qquad A^{-1} = \frac{1}{a_{11}a_{22} - a_{12}^2} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{12} & a_{11} \end{pmatrix}$$

Solution:

1. $$X^T X = \begin{pmatrix} 2.23 & -1.45 \\ -1.45 & 1.31 \end{pmatrix}, \qquad (X^T X)^{-1} = \begin{pmatrix} 1.599 & 1.77 \\ 1.77 & 2.72 \end{pmatrix}$$

$$X^T Y = \begin{pmatrix} 2.05 \\ -0.96 \end{pmatrix}, \qquad \beta = (X^T X)^{-1} X^T Y = \begin{pmatrix} 1.58 \\ 1.016 \end{pmatrix}$$

2. $$e = Y - X\beta = \begin{pmatrix} 1.21 \\ 0.34 \\ -0.08 \\ 0.64 \\ 0.73 \\ -0.26 \\ 0.54 \end{pmatrix}$$

It follows that the mean squared training error amounts to 0.41.

3. Since

$$\rho_{X_1 Y} = \frac{\sum_{i=1}^N (X_{i1} - \mu_1)(Y_i - \mu_Y)}{\sqrt{\sum_{i=1}^N (X_{i1} - \mu_1)^2 \sum_{i=1}^N (Y_i - \mu_Y)^2}} = 0.53$$

and

$$\rho_{X_2 Y} = \frac{\sum_{i=1}^N (X_{i2} - \mu_2)(Y_i - \mu_Y)}{\sqrt{\sum_{i=1}^N (X_{i2} - \mu_2)^2 \sum_{i=1}^N (Y_i - \mu_Y)^2}} = -0.48$$

where $\mu_1 = 0.24$, $\mu_2 = -0.07$ and $\mu_Y = 0.75$, X1 is the top ranked variable (by absolute correlation).
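These computations can be double-checked in R:

X <- cbind(x1 = c(-0.2, 0.1, 1, 0.1, -0.4, 0.1, 1),
           x2 = c(0.1, 0, -0.3, 0.2, 0.4, 0.1, -1))
Y <- c(1, 0.5, 1.2, 1, 0.5, 0, 1.1)
beta <- solve(t(X) %*% X, t(X) %*% Y)   # 1.580  1.016
e <- Y - X %*% beta
mean(e^2)                               # 0.41
cor(X[, "x1"], Y); cor(X[, "x2"], Y)    # 0.53  -0.48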

3. The .Rdata file bonus4.Rdata in the directory gbcode/exercises of the companion R package contains a regression dataset with N = 200 observations, n = 50 input features (in the matrix X) and one target variable (vector Y). Knowing that there are 3 strongly relevant variables and 2 weakly relevant variables, the student has to define and implement a strategy to find them. No existing feature-selection code may be used; however, the student may use libraries implementing supervised learning algorithms.

The student code should

• return the position of the 3 strongly relevant variables and of the 2 weakly relevant variables,

• discuss what strategy could have been used if the number of strongly and weakly relevant variables was not known in advance.

Solution:

See the file Exercise4.pdf in the directory gbcode/exercises of the companion R

package (Appendix F).


Chapter 13

From prediction to causal

knowledge

We live in an uncertain and large-dimensional world where it is getting easier and easier to observe and collect large amounts of data but not necessarily to control their generation. This setting is called observational (Section 8.3) and refers to situations

in which we cannot interfere or intervene in the process of generating and capturing

the data. In this setting, machine learning is a powerful tool to detect dependencies

(e.g. correlations) between variables (Chapter 12). Figure 13.1 shows two highly

dependent variables: in such observational setting we may easily predict one variable

given the other, but what can we deduce about their causal relation, i.e. what will

happen to one variable once we manipulate the other? (Un)fortunately not much, as revealed by the labels of the variables [190]¹.

An accurate predictive model, built from observational data, might tell us very

little about what would happen in case of input manipulation. Most machine learn-

ing algorithms are not conceived to estimate causal effects but to predict observed

outcomes. Now, many of the most crucial questions in science are not merely pre-

dictive but causal. For example, what is the efficacy of a given drug in a given

population? What fraction of deaths from a given disease could have been avoided

by a given treatment or policy? What was the cause of death of a given individual

in a specific incident? What is the impact of education on criminality?

Answering those questions, and then taking the correct decision, requires being

able to make predictions under manipulations or potential experiments (and not

only observations). Is this possible if, for practical or ethical reasons, the experi-

ments or manipulations cannot be done? Have observational data-driven approaches

¹You can find several other funny examples of spurious correlations on the website http://tylervigen.com/spurious-correlations.

Figure 13.1: A significant correlation between two causally unrelated variables.


Figure 13.2: Pearl's causal ladder [149]

any hope of untangling causality from association? This will be the topic of this

chapter.

13.1 About the notion of cause

The notion and importance of causal reasoning have long been debated by philosophers:

Men are never satisfied until they know the why of a thing (Aristotle).

I would rather discover one cause than gain the kingdom of Persia (Democri-

tus).

A thing cannot occur without a cause that produces it (Laplace).

Felix, qui potuit rerum cognoscere causas (Virgil).

At the same time, for many philosophers and statisticians, causation is a suspi-

cious metaphysical concept that should be avoided when discussing science and/or doing statistics. This suspicion derives from the Humean empiricist tradition (Sec-

tion 2.4). Empiricism always tried to interpret science without having recourse to

unobservable and hidden entities. In that perspective, causation is a hidden and

unobservable connection between things.

The philosopher John Stuart Mill (1843) was one of the first to make explicit

the properties of causal relationships. According to his definition, event A can be

defined to be a cause of event B if

1. A is repeatedly associated with B (concomitant variation);

2. A has to be present each time the effect B occurs (necessity);

3. B occurs regularly when A is introduced (sufficiency).


Although this definition is valuable from a historical perspective, it has a number

of limitations: it refers only to categorical aspects (no intensity), it assumes deter-

minism (no variability), and it relates to a univariate setting only (no context).

Also, pretending that logical induction can formalise causality is wishful thinking. Consider the following example from [147]: in logic, if "A ⇒ B" and "B ⇒ C", then "A ⇒ C". But "if the sprinkler is on then the ground is wet" and "if the ground is wet then it rained" do not imply that "if the sprinkler is on then it rained".

Now, interesting tasks are multivariate and stochastic. For this reason, we will

explore a more advanced formalism of causality (proposed by J. Pearl²[146]) which

relies on (and extends) probabilistic reasoning. We will show that such formalism

i) eliminates any aspect of vagueness from the definition of causality, ii) makes

this notion falsifiable and iii) pinpoints the crucial role of causal modelling in data

understanding and decision making.

The debate on the scientific role of causality is related to the debate about

the aim of science: is it more about prediction or explanation? This book has

largely covered the issue of prediction without having recourse to explanation or

interpretability aspects. In the following section, we will show that uncovering

mechanisms from data is a way of explaining something and returning human in-

terpretable and valuable information. As visualised in the J. Pearl causal ladder

(Figure 13.2) [149], causal information can be considered as the ultimate (and prob-

ably the most precious) outcome of knowledge discovery from data.

13.2 Causality and dependencies

A dependence between an input x and an output y does not always imply

existence of a causal relationship between them. For instance, dependence may

occur when a latent phenomenon (confounder) causes two effects. In this case, a

statistically significant, yet non-causal, relation between the effects is established. In the example of Figure 13.1, the latent confounder is time: both (causally independent) variables change in the same manner over time and thus happen to be statistically correlated.

In general, it would be erroneous and fallacious to deduce causality from the existence of a statistical dependency alone: "correlation does not imply causation", as statisticians are used to saying. To better illustrate this notion, let us consider the

following example.

The Caca-Cola study

The Caca-Cola marketing department aims to show that, contrary to what most parents fear, the famous refreshing drink is so healthy that it improves its drinkers' sports performance. To support this idea, the department funds a statistical study on the relationship between the amount of Caca-Cola litres drunk per day and the time (in seconds) a drinker takes to run the 100 metres. Here is the dataset collected by the statisticians:

²He was awarded the 2011 Turing Award, often referred to as the Nobel Prize of computing.

Figure 13.3: Sports performance improves with the amount of Caca-Cola drunk. Is this a real causal effect?

Liters per day Seconds

1.00 11.9

1.09 12.8

0.20 14.3

0.18 15.3

1.02 12.2

1.04 12.5

1.06 12.6

0.00 16.3

1.08 12.7

0.18 17.7

0.50 14.0

0.17 17.6

The dataset is illustrated by Figure 13.3, which plots the performance (in seconds) as a function of the number of litres drunk per day.

The Caca-Cola marketing department is excited: Caca-Cola seems to have magnificent effects on sprinter performance, as illustrated by the significant correlation between the number of litres and the running time in Figure 13.3. The CEO of the company triumphantly extrapolates, on the basis of a sophisticated machine learning tool, that any human being fed with more than 3 litres per day could easily beat the world record. In front of such enthusiasm, the League of Parents is skeptical and asks for a simple elucidation: did the Caca-Cola statisticians record the age of the sprinters? Under growing public pressure, Caca-Cola is forced to publish the complete dataset.


Figure 13.4: Performance deteriorates as Caca-Cola drinking increases for young

athletes.

Age Liters per day Seconds

17 1.00 11.9

19 1.09 12.8

49 0.20 14.3

59 0.18 15.3

21 1.02 12.2

19 1.04 12.5

17 1.06 12.6

62 0.00 16.3

21 1.08 12.7

61 0.18 17.7

30 0.50 14.0

65 0.17 17.6

At last, truth can triumph! The Caca-Cola statisticians had hidden the real cause of good (or bad) performance: the age of the athletes! Since youngsters tend to drink more Caca-Cola as well as having better sports performance, the alleged causal relationship between litres drunk and performance was fallacious. On the contrary, a more detailed analysis of a homogeneous group of young athletes shows that Caca-Cola tends to deteriorate performance. Figure 13.4 plots the performance (in seconds) as a function of litres drunk per day exclusively for the subgroup of young (less than 30 years old) athletes. Note that the Caca-Cola marketing department was not wrong in claiming the existence of a significant relationship between Caca-Cola and performance. However, they were definitely wrong when they claimed the existence of a cause-effect relationship between these two variables.

13.2.1 Simpson's paradox

The Caca-Cola example is a typical instance of Simpson's paradox: an association

between a pair of variables can consistently be inverted in each subpopulation of a

population when the population is partitioned, and conversely, associations in each

subpopulation can be inverted when data are aggregated.


Note that there is nothing paradoxical in Simpson's paradox from the standpoint of arithmetic: it is simply due to the close connections between proportions, percentages and probabilities. It is in fact possible to find eight integers such that a/b < A/B and c/d < C/D but

$$(a + c)/(b + d) > (A + C)/(B + D)$$

For instance, 1/5 < 2/8 and 6/8 < 4/5, yet 7/13 > 6/13.

Let us now consider the outcome of a medical experiment where a treatment T has been administered to patients and a binary outcome Y has been recorded. Here G = 0 stands for females, G = 1 for males, Y = 1 stands for recovery and T = 1 for treatment administration. Suppose that the distribution of the treatment among genders is represented by the table below:

       T=0   T=1
G=0     5     8    P(T=1|G=0) = 8/13
G=1     8     5    P(T=1|G=1) = 5/13

with P(G=1|T=0) = 8/13 and P(G=1|T=1) = 5/13.

It follows that the distribution of recovery conditional on gender and treatment is given by:

       T=0                       T=1
G=0    P(Y=1|T=0,G=0) = 1/5      P(Y=1|T=1,G=0) = 2/8     P(Y=1|G=0) = 3/13
G=1    P(Y=1|T=0,G=1) = 6/8      P(Y=1|T=1,G=1) = 4/5     P(Y=1|G=1) = 10/13

with P(Y=1|T=0) = 7/13 and P(Y=1|T=1) = 6/13.

We obtain then the following probabilistic inequalities:

P(Y=1|T=0, G=0) = 1/5 < P(Y=1|T=1, G=0) = 2/8
P(Y=1|T=0, G=1) = 6/8 < P(Y=1|T=1, G=1) = 4/5
P(Y=1|T=0) = 7/13 > P(Y=1|T=1) = 6/13

If we interpret the probabilistic relationships in causal terms, it appears that the treatment is effective for both the female subpopulation (P(Y=1|T=1, G=0) > P(Y=1|T=0, G=0)) and the male subpopulation, yet this is not the case for the entire population (P(Y=1|T=0) > P(Y=1|T=1)).
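The reversal can be verified numerically from the counts in the two tables above:

## n[g, t]: number of patients; r[g, t]: number of recoveries
n <- matrix(c(5, 8, 8, 5), 2, byrow = TRUE, dimnames = list(G = 0:1, T = 0:1))
r <- matrix(c(1, 2, 6, 4), 2, byrow = TRUE, dimnames = list(G = 0:1, T = 0:1))
r / n                    # within each gender, T=1 recovers more often
colSums(r) / colSums(n)  # aggregated: 7/13 for T=0 vs 6/13 for T=1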

Though all this is perfectly compliant with arithmetic (and statistics) it poses

problems from a decision-making perspective: should we base a decision on the

statistics from the aggregate population or from the partitioned subpopulations?

Does it make sense to have a treatment which is effective both for men and women

but not for the union of them? In probabilistic terms, should we trust the marginal

or the conditional distribution? The answer of Pearl is that the answer cannot be

found in probability but has to be found in causal reasoning. This means that we

need a piece of missing information to answer the question and this information is

in the causal graph. In other words, translating a probabilistic relationship into a

causal relationship is the "original sin" in the Simpson paradox.

Figures 13.5 and 13.6 show two alternative causal graphs and, for the moment,

it is sufficient for the reader to know that a direct arrow from one node to another represents a direct causal action. Knowing which causal graph corresponds to our medical setting will allow us to solve the paradox [148].

Looking at both figures, it appears evident that in our case the correct causal graph is Figure 13.5: it is not conceivable that a treatment causes a person's gender (as in Figure 13.6), while it is possible that both a treatment and its efficacy depend on the gender. This means that in our case the treatment is beneficial and that the meaningful probabilistic relations are the gender-conditional ones, rather than the unconditional P(Y=1|T=0) > P(Y=1|T=1).

We will show afterwards that, for a graph like Figure 13.5, the causal effect of the treatment T on Y is correctly assessed by conditioning on gender.


Figure 13.5: Causal graph where G is a direct cause of both T and Y.

Figure 13.6: Causal graph where T is a direct cause of G.

The paradox derives from the fact that both gender and treatment have an impact on recovery and that gender has an impact on the treatment as well: females recover much less (in probabilistic terms, P(Y=1|G=0) < P(Y=1|G=1)) and take the drug more often (P(T=1|G=0) > P(T=1|G=1)). Using the non-conditioned relation would be erroneous and counterintuitive.

Now let us keep all the observed numbers fixed but change the semantics of the variables. Suppose that G stands for some beneficial mechanism, P(Y=1|G=1) > P(Y=1|G=0), whose functioning is affected by the treatment, P(G=1|T=1) < P(G=1|T=0). Which causal graph do you think is more adequate now? Would you condition on G or not to measure the causal effect of T on Y?

13.3 Causal vs associational knowledge

Human common sense knowledge is about how things work in the world. Such

knowledge is (i) effective since it gives humans the ability to intervene in the world

and change it and (ii) causal because it is about the mechanisms that lead from causes to effects. Here mechanism stands for an input/output relationship where

the choice of inputs determines the outputs, but the reverse does not hold.

The goal of many sciences is to formalise such common sense and understand the

mechanism by which variables came to take on the values they have, and to predict

what the values of those variables would be if the naturally occurring mechanisms


were subject to manipulations. It follows that the questions that motivate most

studies in the health, social and behavioural sciences are not associational but causal

in nature.

Causal analysis aims to infer not only beliefs or probabilities under static con-

ditions but also the dynamics of beliefs under changing conditions, for example,

changes induced by treatments or external interventions. However, according to

Pearl [146], associational studies have historically attracted more interest than causal studies for the

following reasons:

• Associational assumptions, even untested, are testable in principle, given sufficiently large samples and sufficiently fine measurements. Causal assumptions, in contrast, cannot be verified even in principle unless one resorts to experimental control.

• Associational assumptions can be expressed in the familiar language of probability calculus and thus assumed an aura of scholarship and scientific respectability. Causal assumptions, until recently, were deprived of that status and were suspected of informal, anecdotal or metaphysical thinking.

In order to address the lack of an adequate probabilistic formalism to represent

the notion of manipulation, Pearl introduced the do() operator, which allows us to

distinguish between the conventional observational notion of statistical dependency

(quantified in terms of conditional probability) and the interventional notion of

causal dependency.

Let us first recall that a (discrete) random variable y is said to be dependent (Section 3.5.2) on a variable x if the distribution of y is different from the marginal one when we observe the value x = x:

$$\text{Prob}\{y \mid x = x\} \neq \text{Prob}\{y\}$$

The property of dependency is symmetric, i.e. if y is dependent on x, then x is dependent on y as well:

$$\text{Prob}\{x \mid y = y\} \neq \text{Prob}\{x\}$$

The concept of causality describes a process where the control (not simply the

observation) of one event changes the likelihood of the occurrence of another event.

Definition 3.1 (Cause). A variable x is a cause of a variable y if the distribution of y is different from the marginal one when we set the value x = x:

$$\text{Prob}\{y \mid do(x = x)\} \neq \text{Prob}\{y\}$$

In other terms, x is a cause of y if we can change y by manipulating x but not the other way round. Unlike dependency, causality is asymmetric, i.e.

$$\text{Prob}\{x \mid do(y = y)\} = \text{Prob}\{x\}$$

It is important then to summarise the differences between conditioning on an intervention and conditioning on an observation.

Intervention is formalised by the do() operator and corresponds to setting (or, better, manipulating) a variable to a specific value. This manipulation may change the probabilistic model and the nature of the dependencies between variables (e.g. the stochastic dependency between a cause and an effect is lost once we manipulate the effect).

Conditioning is formalised by the conditional probability operator and corresponds to observing a specific value of a variable. By observing, we change nothing in the causal mechanism: we simply narrow our focus to a subset of observations. In plain words, the world does not change; our perception does.


13.4 The two main problems in causality

The literature on causality addresses two main problems: (i) the estimation of causal

effects and (ii) the causal discovery from observational data. In the first problem,

the causal mechanism is known or guessed, but the goal is to assess causal effects

from an observational or experimental study. For instance, in a medical study, we

could be confronted with two groups of units (exposed/non-exposed), and we would

like to quantify the causal effect of the exposure (smoking versus not smoking). In

the second problem, we have no (or very limited) information about the causal

mechanism, and the goal is to reconstruct the causal mechanism from data. For

instance, in bioinformatics, this problem is encountered every time we want to infer

a transcriptional network from observed genomics data.

In what follows, we will show how the first problem has been addressed both by

the potential outcomes and the graphical modelling approaches. Then we will dis-

cuss the issue of causal discovery, notably thanks to structure learning in graphical

models.

13.5 Causality and potential outcomes

Causality is about estimating the effect of an action (or manipulation, treatment,

intervention), applied to a unit (e.g. a person, a physical object, a firm), over a

specific outcome. Although a unit has been (at a particular point in time) exposed

to a particular action (treatment/regime), in principle, it could have been exposed

to an alternative action at the very same point in time. This concept has been

formalised by Rubin [164] with the notion of potential outcome, denoting the random

distribution of an outcome variable which can be observed only if the action is

performed. In a binary treatment, for each realised action, there is a potential

outcome associated with the alternative action which could not be observed. The

effect of an action is then related to both potential outcomes (what happened and

what could have happened).

Let y(x) be read as the outcome y under treatment x. If x can take two values, the potential outcomes y(1) and y(0) are two outcome variables (distinct from y) which denote the distribution of the outcome if the (binary) action x had (or had not) been performed:

$$y^{(0)} \sim \text{Prob}\{y \mid do(x = 0)\}, \qquad y^{(1)} \sim \text{Prob}\{y \mid do(x = 1)\}$$

Note that the definition of those outcomes is irrespective of which action actually occurred. The link between y (the observation), y(0) and y(1) is

$$y = \begin{cases} y^{(1)} & \text{if } x = 1 \\ y^{(0)} & \text{if } x = 0 \end{cases}$$

Since in reality we can only observe either y(1) or y(0), the observed yi satisfies the following relation:

$$y_i = x_i y_i^{(1)} + (1 - x_i) y_i^{(0)} = y_i^{(0)} + (y_i^{(1)} - y_i^{(0)}) x_i. \qquad (13.5.1)$$

Potential outcomes and observations

Let us consider the notion of potential outcome in the context of the following

observed dataset:

unit   treatment x   observed y   y(0)   y(1)
 1          1            y1         ?      y1
 2          0            y2         y2     ?
 …          …            …          …      …
 i          1            yi         ?      yi
 …          …            …          …      …
 N          0            yN         yN     ?

For each unit in the population, the observed outcome is called the factual

outcome yF , while the other is called the counterfactual yCF . Note that sometimes

the factual outcome is drawn from y (0) and sometimes from y (1), according to the

performed action. This definition makes explicit that the observed random variable

yis only a lumped version of the potential outcomes.

Beyond treatment and outcome, the third major actor of the causal inference

problem are covariates , denoted by z . For instance, in a medical setting, they could

be measures related to individual medical history or family background information.

The key characteristic of these covariates is that they are a priori known to be

unaffected by the treatment assignment. We will see later that covariates may help

to improve the accuracy of the causal effect estimation by defining configurations

or groups of interest which might play a role in the assignment mechanism.

13.5.1 Causal effect

The causal effect is defined as the comparison of potential outcomes, for the same

unit, at the same moment in time. For a given unit (or individual) i and a binary treatment, the unit-level causal effect (or individualised treatment effect) is

$$y_i^{(1)} - y_i^{(0)}.$$

The average causal effect over the population is

$$\tau = E[y^{(1)}] - E[y^{(0)}]. \qquad (13.5.2)$$

The definition of causal effect depends on the potential outcome but not on the

actual observed outcome: nevertheless, in practice, if we wish to estimate it, we will

have to use observed outcomes over multiple units and different times.

For the sake of estimation it is then crucial to know (or at least make assumptions

about) why some actions were made and others not: this is known as the assignment

mechanism and quantifies the statistical dependency between treatment, outcome

and covariates.

Example

Let us consider a population of patients that is heterogeneous in terms of age. Suppose that the older the patient, the higher the death risk y. Let us give a treatment (e.g. a drug or hospitalisation) to the oldest patients only and compute the average risk of treated patients vs. untreated ones. Since the a priori risk of old people was already high, after the treatment their average risk (though decreased thanks to the treatment) remains higher than the risk of non-treated persons. In other terms, though over the entire population

$$E[y^{(1)}] < E[y^{(0)}]$$

Figure 13.7: Observed vs. causal effect. Left: y(0) distribution (untreated population). Right: y(1) distribution (treated population). Black dots denote the set of observed patients: half of them are young and untreated (left) and half of them are aged and treated (right).

i.e. the individual causal effect is constant and negative (a risk reduction), we observe (Figure 13.7)

$$E[y \mid x = 1] > E[y \mid x = 0]$$

i.e. a positive treatment effect (a risk increase).

It follows that by looking only at the observed data, we would draw the wrong conclusion that the treatment is harmful. This is analogous to considering a surgeon with many deceased patients less talented than a surgeon with few deceased patients: if we ignore the severity of the patients' conditions, the different death rates could be misleading, since the surgeon with the higher death rate could be the one dealing with all the difficult cases precisely because (s)he is the more talented.
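A small R simulation with illustrative numbers (risk linear in age, a constant causal risk reduction of 0.2, treatment given only above age 60) reproduces the phenomenon:

set.seed(0)
N <- 1e5
age <- runif(N, 20, 80)
y0 <- 0.01 * age + rnorm(N, sd = 0.1)   # risk if untreated
y1 <- y0 - 0.2                          # constant negative causal effect
x <- as.numeric(age > 60)               # confounded assignment: treat the oldest
y <- x * y1 + (1 - x) * y0              # factual (observed) outcome
mean(y1) - mean(y0)                     # true average causal effect: -0.2
mean(y[x == 1]) - mean(y[x == 0])       # observed effect: about +0.1 (misleading)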

13.5.2 Estimation of causal effect

We can never measure a causal effect (for instance (13.5.2)) directly since the coun-

terfactual outcome is missing by definition; such a missing-data problem is called the

fundamental problem of causal inference.

In the absence of counterfactual data, can observed data be used to estimate

causal treatment effects for a given covariate? This is the aim of the causal analysis

which targets the quantity

E[y(1) | x = 1] E[y (0) |x = 1] = E[y (1) y (0) |x = 1]

representing the average difference between the risk of the treated and what would

have happened to them (counterfactual) had they not been treated.


It can be shown that

$$\underbrace{E[y \mid x = 1] - E[y \mid x = 0]}_{\text{observed}} = E[y^{(1)} \mid x = 1] - E[y^{(0)} \mid x = 0] = \underbrace{E[y^{(1)} \mid x = 1] - E[y^{(0)} \mid x = 1]}_{\text{avg causal effect on the treated}} + \underbrace{E[y^{(0)} \mid x = 1] - E[y^{(0)} \mid x = 0]}_{\text{selection bias}} \qquad (13.5.3)$$

The selection bias term shows that the observed effect may be biased if the distribution of the potential outcomes is not independent of the assignment strategy, i.e. $(y^{(0)}, y^{(1)}) \not\perp x$. This is also known as a non-exchangeable configuration. We invite the reader to retrieve the terms of (13.5.3) in Figure 13.7.

In order to draw valid causal inferences, we need to make assumptions about the assignment mechanism, that is, the distribution

$$P(x = 1 \mid z = z, y^{(0)}, y^{(1)}) \qquad (13.5.4)$$

describing the probability that a unit receives the treatment given the covariates and the potential outcomes of the problem.

13.5.3 Assignment mechanisms assumptions

An assignment mechanism is:

• individualistic if the treatment of a unit does not depend on the other units. For instance, sequential assignments are not allowed in this case;

• probabilistic if the probability of assignment (13.5.4) is strictly between zero and one: each unit has a non-null probability of being assigned either treatment (level 0 or 1);

• unconfounded if it is conditionally independent of the potential outcomes, i.e. $P(x = 1 \mid z = z, y^{(0)}, y^{(1)}) = P(x = 1 \mid z = z) = e(z)$.

An individualistic, probabilistic and unconfounded assignment is also referred to as a strongly ignorable treatment assignment. A randomised experiment is an assignment mechanism that has a known probabilistic functional form controlled by the researcher. In an observational study, the functional form of the assignment mechanism is unknown.

13.5.4 About unconfoundness

The main obstacle to unbiased causal reasoning is confounding, which occurs when the assignment of the treatment is not independent of the potential outcomes, e.g. because of omitted or non-observable variables related to both the treatment and the outcome. Unconfoundness (also known as ignorability) states that the potential outcomes are independent of the observed treatment conditional on the confounding covariates:

$$(y^{(0)}, y^{(1)}) \perp x \mid z. \qquad (13.5.5)$$

In plain words, for individuals with the same z, knowing the treatment brings no

information about y(0) or y(1) or, equivalently, the treatment does not depend on the causal type. For instance, among individuals of the same age, the treatment is not preferentially given to the patients who are expected to react better, and knowing that a patient got the treatment provides no information about the odds of success.


Note that the lack of dependence between the treatment and the outcome does not imply that the assignment is ignorable:

$$y \perp x \;\not\Rightarrow\; (y^{(0)}, y^{(1)}) \perp x$$

At the same time, an ignorable treatment with a treatment effect different from zero does not imply independence between the treatment and the outcome:

$$(y^{(0)}, y^{(1)}) \perp x \;\not\Rightarrow\; y \perp x$$

The correct interpretation of unconfoundness is that for a given covariate z,

knowing which treatment has been given to a unit (e.g. observing x) does not

provide any additional information on the distribution p (y |do (x )). This is not the

case in the example of Figure 13.7, where by knowing the treatment we know the age, and hence the death risk, of the patient.

Conversely, we have a non-ignorable assignment mechanism if the treatment is assigned on the basis of unmeasured characteristics of the unit. Naively, we could then expect that, to ensure the plausibility of such an assumption, we

should condition on as much pre-treatment information as possible. However, one

can never prove that the treatment assignment process in an observational study is

ignorable since it is always possible that the choice of treatment depends on relevant

yet latent information.

13.5.5 Randomised designs

A randomised experiment is a probabilistic assignment mechanism whose form is

known and controlled by the researcher. The aim of randomisation is to make the

treatment independent of its other causes, thereby destroying possible confounding

settings, due either to observed or unobserved variables. In the case of a binary treatment, a random assignment process guarantees that the resulting treatment and control groups are, on average, identical on any characteristic (other than the treatment) influencing the outcome.

Though it is the "gold standard" procedure for causal reasoning, in practice there are many concerns about its adoption, ranging from ethical issues to the representativeness of the generated population with respect to the observed population.

The most common types of randomised designs are:

• Bernoulli trials: the treatment is assigned by tossing a fair coin for each unit. Though this is the most intuitive design, it is exposed to a high risk of unhelpful assignments.

• Completely randomised experiments: classical randomised experiments where each unit has the same assignment probability (i.e. the number of treated units is fixed), and this probability may be different from 0.5. In this design, the risk is that the treatment assignment neglects covariate effects.

• Stratified (or conditionally) randomised experiments: they partition the population of units into blocks (strata), e.g. according to covariates (for instance, age).

• Paired randomised experiments: two units (chosen on the basis of covariates) per block; one is treated and the other is not.


13.5.5.1 Estimation of the treatment effect

Suppose we want to estimate the average treatment effect (13.5.2) in a completely randomised experiment where the treatment x is binary. In a population of N units, this is given by

$$\tau = \frac{\sum_{i=1}^N (y_i^{(1)} - y_i^{(0)})}{N}$$

Unfortunately, this quantity cannot be computed directly since, for each i, one of the two terms is missing. Let us then consider an estimator

$$\hat{\tau} = \frac{\sum_{i=1}^N (\hat{y}_{i1} - \hat{y}_{i0})}{N} \qquad (13.5.6)$$

Such an estimator can be implemented if the terms $\hat{y}$ are observable (i.e. dependent on the observations $y_i$) and is unbiased if

$$E[\hat{\tau}] = \tau \qquad (13.5.7)$$

Since from (13.5.1) $y_i = x_i y_i^{(1)} + (1 - x_i) y_i^{(0)}$, we put

$$\hat{y}_{i1} = w_i x_i y_i^{(1)}, \qquad \hat{y}_{i0} = w_i (1 - x_i) y_i^{(0)} \qquad (13.5.8)$$

In order to guarantee the unbiasedness (13.5.7), the following constraints have to be satisfied:

$$E\Big[\sum_{i=1}^N \hat{y}_{i1}\Big] = \sum_{i=1}^N y_i^{(1)}, \qquad E\Big[\sum_{i=1}^N \hat{y}_{i0}\Big] = \sum_{i=1}^N y_i^{(0)}$$

Note that the first constraint, because of the unconfoundness (13.5.5) and of (13.5.8), boils down to

$$E\Big[\sum_{i=1}^N \hat{y}_{i1}\Big] = E\Big[\sum_{i=1}^N w_i x_i y_i^{(1)}\Big] = \sum_{i=1}^N w_i E[x_i] y_i^{(1)} = \sum_{i=1}^N y_i^{(1)}$$

and analogously for the second.

Since the treatment x is binary, $E[x_i] = P(x_i = 1)$. To ensure (13.5.7) we must then weigh each treated example with a weight equal to $1/P(x_i = 1)$ and each untreated example with $1/P(x_i = 0)$. Since in a completely randomised experiment $N_1 + N_0 = N$, we have

$$P(x_i = 1) = N_1/N, \qquad P(x_i = 0) = N_0/N$$

and from (13.5.6) and (13.5.8) we obtain

$$\hat{\tau} = \frac{\sum_{i=1}^N (\hat{y}_{i1} - \hat{y}_{i0})}{N} = \frac{\sum_{i=1}^N \big( x_i y_i/(N_1/N) - (1 - x_i) y_i/(N_0/N) \big)}{N} = \frac{\sum_{j=1}^{N_1} y_{j1}}{N_1} - \frac{\sum_{j=1}^{N_0} y_{j0}}{N_0}$$

where $y_{j1}$ ($y_{j0}$) denote the observed outcomes of the treated (untreated) units. Note that this boils down to increasing the weight of the observed treated cases by a factor equal to $N/N_1$. In general, the lower the probability of an observation, the higher its weight should be in the estimator of the treatment effect.

This derivation shows that in a randomised case, it is possible to obtain an

unbiased estimation of the causal effect by a proper weighting of the observations

and without access to counterfactual information.
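A minimal R sketch (with an assumed true effect τ = 0.5) confirms that this weighted estimator coincides with the difference of the group means:

set.seed(0)
N <- 1000; N1 <- 300
y0 <- rnorm(N, 1); y1 <- y0 + 0.5               # potential outcomes, tau = 0.5
x <- sample(c(rep(1, N1), rep(0, N - N1)))      # completely randomised assignment
y <- x * y1 + (1 - x) * y0                      # observed outcomes, as in (13.5.1)
tau.w <- sum(x * y / (N1 / N) - (1 - x) * y / ((N - N1) / N)) / N
c(tau.w, mean(y[x == 1]) - mean(y[x == 0]))     # identical values, close to 0.5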


13.5.5.2 Stratified (or conditionally) randomised experiments

This class of randomised experiments is particularly important since it can be used as a template to deal with observational studies. In a stratified randomi-

sation setting, the population is first partitioned into blocks or strata so that the

units within each block are similar in terms of covariates expected to be predictive of

potential outcomes. The simplest case is when we stratify the population according

to gender (two blocks) to estimate the effect of a medical treatment.

Within each block, a completely randomised experiment is performed, and the

relative size of treatment groups is the same over strata. Analogously to the ran-

domised case, the estimation of the treatment effect is obtained by the weighted

average of the treatment effect over each stratum, with a weight proportional to the

population in the stratum.

13.5.6 Observational study

Unlike a randomisation experiment, in an observational study, the functional form

of the assignment mechanism is (at least partially) unknown. This means that there is a risk of systematic outcome differences between the treatment groups which are not due to treatment exposure (Figure 13.7). The differences between the groups might be due to confounding variables (e.g. age, genetic susceptibility to cancer) rather than to the treatments themselves.

The rationale of the potential outcome approach is to make the observational setting conform as much as possible to the experimental one in order to make causal inference possible. Since an investigator cannot assign the treatment in nonexperimental studies, (s)he must rely on the remaining degree of freedom: the selection of subjects. It is then necessary to assume that the observational study is a conditionally randomised experiment, or, equivalently, that for a given set of covariates the unconfoundness hypothesis (13.5.5) (i.e. assignment independent of potential outcomes given covariates) holds. Without such an assumption (and in the absence of alternatives), we would have no idea of how to use observational values for causal inference.

If unconfoundness holds, within a subpopulation defined by the same covariates

(e.g. females or males), the difference in the distributions of the observed outcomes

(i.e. treated and untreated) is an unbiased estimator of the treatment effect since

both treated and control units are assumed to be random samples from this subpop-

ulation. This makes irrelevant the fact that we do not know a priori the assignment

mechanism.

Unconfoundness is a formal version of the principle stating that you should

compare "like with like". As Rubin said: "it would make little sense to compare

disease rates in well-educated non-smokers and poorly educated smokers".

Note, however, that such an assumption is unfortunately hardly testable in practice. No information in the observed data can tell us whether it holds

or not. At the same time, the validity of a causal estimation based on observational

data relies on the hypothesis that we are conditioning on the right covariates, i.e.

the ones that make treatment and potential outcomes independent. Once condi-

tioned on those covariates, it is expected that the remaining degree of variability

related to the treatment choice (though not random like in randomisation studies)

will be unrelated to potential outcomes.

In practice, the choice of those variables is not easy, notably when the number of confounding variables is large or when most of them are continuous. The literature abounds with generic recommendations (e.g. variables that are affected by the treatment, like intermediate outcomes, should not be included), but no formal procedure is available.


13.5.7 Strategies for estimation in observational studies

If the unconfoundness assumption is granted, the causal effect estimation may pro-

ceed according to four strategies [163]

1. Model-based imputation: it imputes the missing potential outcomes by build-

ing and using a model to predict what would have happened to a unit had

been exposed to the alternative treatment.

2. Weighting

3. Blocking

4. Matching methods: they impose the same distribution of the control and case

population with respect to some confounding factors.

The first method requires a model of the potential outcomes, the other three can

be implemented before seeing any outcome data. We will briefly discuss here only

matching which refers to a variety of procedures that restrict and reorganise the

original sample for statistical analysis. Matching is a way of discarding observations

so that the remaining data show good balance and overlap. The simplest form is

one-to-one matching where the data points are divided into pairs: each pair contains

both a treatment and a control unit, and the two units are as similar as possible

on the pre-treatment variables. For instance, if the ith unit is associated with zi ,

we match it to a unit whose set of covariates zj is as close as possible to zi. The difference of the observed outcomes associated with those matched units is expected to be an accurate estimate of the causal effect. In multivariate settings, the similarity is typically measured by metrics like the Mahalanobis distance (9.2.64). Note, however, that, though the notion of similarity is intuitive, its adoption in a high-variate case is non-trivial, as discussed in Section 12.1.

In order to avoid the issue of specifying which variables to use for matching,

a well-known alternative is the adoption of propensity score modelling [86]. This

approach relies on classification techniques (e.g. logistic regression) to estimate

from data the propensity score Prob{x = 1 | z}, which is a compact way to summarise high-variate covariates in a single value. Note that the target of a propensity-score classification model is not the outcome y but the treatment value x. However, the aim of this approach is not to predict the treatment but to define a metric over the observations in the space of covariates: observations with similar estimated propensity scores have similar profiles in terms of the covariates z and should then be candidates for matching.

13.6 From potential outcomes to graphical models

The potential outcomes approach played a major role in the history of causality in science. Its main merit was to define under which circumstances it is possible to transpose methods from the randomisation setting to the observational one. However, the opacity of the notion of unconfoundness in the observational setting is probably the most relevant Achilles' heel of this approach, since no procedural way exists to assess the validity of such a condition, in particular in large-variate problems.

According to J. Pearl [146], most investigators have difficulty in understanding what ignorability means and tend to assume either that it is automatically satisfied or that it is the more likely to be satisfied the larger the number of covariates. This led to the naive (and somewhat dangerous) illusion that adding more covariates causes no harm. The lack of a formal algorithmic way to define the conditioning variables needed to assess causal effects is the main reason to introduce Pearl's work on causal graphical methods.


We introduced in Chapter 4 the notion of graphical models, notably DAGs,

stressing their efficient representation of large variate distributions and the bijec-

tive mapping between graphical properties (d-separation) and probabilistic prop-

erties (conditional independence). What is also relevant here is their role from

a causal perspective. DAGs are a formalism that encodes and visualises the link

between (conditional) statistical associations and causal mechanisms. As such, it

allows the prediction of the impact on the probability distribution of manipulations

and interventions. Last but not least, it provides visual support to recognise and

avoid common mistakes in causal reasoning. Overall, DAGs are a very convenient

formalism to represent the following properties of causal relationships:

1. Transitivity: if event A causes event B and event B causes event C, then it

must also be true that A causes C.

2. Locality : if A causes C only through the effect of an intermediate B, then

the causal influence is blocked once the event B is kept fixed.

3. Irreflexivity: an event cannot cause itself.

4. Asymmetry : if A is a cause of B (i.e. one sees changes of B when changing

A), then B cannot be a cause of A (i.e. one cannot expect changes of A when

changing B). Note that this does not exclude temporal feedback loops.

13.7 Causal Bayesian network

A Causal Bayesian Network (CBN) is a graphical model where the notion of edge

has a specific causal meaning and is then semantically richer than in conventional

Bayesian Networks where it merely represents a probabilistic dependence.

Definition 7.1 (Causal BN). A BN is causal if, for each edge xi → xj, the variable xi is one of the direct causes of xj (Definition 3.1).

Unlike graphical models which are used to encode an (in)dependence structure,

causal graphical models support a stronger interpretation, i.e. the manipulation of

xi induces a change of distribution of xj , whatever the value of all other variables.

This implies that, thanks to the notion of d-separation (Section 4.3.1), it is possible

to associate testable conditional independence patterns to specific causal patterns.

For instance, the pattern $x_1 \perp x_2$, $x_1 \not\perp x_2 \mid y$ is associated with the common-effect causal pattern in Figure 13.8, where conditioning on a collider (or on one of its descendants) makes the two causes conditionally dependent. Other interesting patterns are encoded in the DAGs of Figure 13.9. On the left we have a variable y which is a common cause of x1 and x2, where $x_1 \not\perp x_2$, $x_1 \perp x_2 \mid y$. On the right we have a causal chain pattern where $x_2 \not\perp y$, $x_2 \perp y \mid x_1$.

Causal patterns: examples

We may visualise those dependency patterns and their sensitivity to conditioning

by running the scripts collid.R and chain.R. For instance, the leftmost plot in

Figure 13.10 represents the pattern in Figure 13.8: variables x1 and x2 are inde-

pendent (black dots) but they become dependent when we restrict the dataset by

conditioning (red dots).


Figure 13.8: Common effect causal pattern.

Figure 13.9: Left: common cause pattern. Right: causal chain pattern.

(Panels of Figure 13.10: COLLIDER (y = x1 + x2), FORK (x1 = y + e1, x2 = y³ + e2) and CHAIN (y = x1 + e1, x2 = 2y + e2), each conditioned on y = 0.)

Figure 13.10: Visualisation of three causal patterns in terms of dependency (black

dots) and conditional dependency (red dots).
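In the spirit of collid.R (a sketch, not the original script), the collider pattern can be reproduced in a few lines of R:

set.seed(0)
x1 <- rnorm(2000); x2 <- rnorm(2000)   # two independent causes
y <- x1 + x2                           # common effect (collider)
cor(x1, x2)                            # close to 0: marginal independence
sel <- abs(y) < 0.1                    # conditioning on y ~ 0
cor(x1[sel], x2[sel])                  # strongly negative: induced dependency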


Figure 13.11: Causal Bayesian Network.

13.7.1 Causal networks and Structural Causal Models

It may be shown that for every DAG characterised by a distribution P, there exists a set of equations, called a Structural Causal Model (SCM), that generates a distribution identical to P.

Definition 7.2 (Structural causal model). An SCM consists of n equations of the form

$$x_i = f_i(\pi_i, w_i), \qquad i = 1, \dots, n \qquad (13.7.9)$$

where $\pi_i$ stands for the set of variables judged to be the immediate causes of $x_i$ and $w_i$ represents the noise due to omitted factors.

A structural model is a set of n equations fi . If the noise terms wi are jointly

independent, the model is called Markovian. This implies that no noise term (un-

observed variable) influences more than one observed variable. This assumption is

also known as causal sufficiency, which means that we observe all the relevant variables. Note that the symbol = in (13.7.9) should be read as an assignment symbol (like := in programming languages) rather than as an algebraic equality.

The SCM corresponding to the CBN of Figure 13.11 is

x = wx
y = wy
z = fz(x, y, wz)
k = fk(z, wk)
v = fv(y, wv)
w = fw(v, ww)

We see that each edge in the DAG is associated with a function in the SCM.
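A possible instantiation of this SCM as a generative R sampler (the functions fz, fk, fv, fw and the noise distributions are illustrative choices):

N <- 1000
wx <- rnorm(N); wy <- rnorm(N); wz <- rnorm(N)   # jointly independent noises
wk <- rnorm(N); wv <- rnorm(N); ww <- rnorm(N)   # (Markovian model)
x <- wx                                          # exogenous
y <- wy                                          # exogenous
z <- x + y + wz                                  # an arbitrary f_z
k <- 2 * z + wk                                  # an arbitrary f_k
v <- -y + wv                                     # an arbitrary f_v
w <- v^2 + ww                                    # an arbitrary f_w

Note how each line is an assignment executed in topological order, mirroring the reading of = as := discussed above.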

13.7.2 Pre and post-intervention distributions

The essence of causal reasoning by CBNs is that the observational (pre-interventional)

and the post-interventional distributions are not necessarily the same, yet they are

somewhat related, and their relation is made explicit by the causal graph. A causal

graph is not only a model of the (in)dependences in an observational setting, but

also indicates how those (in)dependences change as a consequence of experiments

and interventions.



Figure 13.12: DAG surgery: on the left (right), the BN associated with the pre- (post-) intervention distribution.

Consider for instance two different causal configurations: x1 → y → x2 and x1 ← y → x2. Suppose we are interested in the post-intervention distribution of y once we act on (i.e. manipulate) x1. The post-intervention distribution in the first configuration is given by the conditional distribution p(y | x1, x2); the post-intervention distribution in the second configuration is the marginal density p(y | x2). In the second configuration, intervening on x1 removes the probabilistic dependency between x1 and y (i.e. after the intervention, the value of x1 provides no more information about y). At the same time, the observation of x2 is still informative about y.

If we consider instead the post-intervention distribution of x2 under an intervention on y, it is the same (and equal to p(x2 | y)) in both causal configurations.

It is important then in a general setting to understand how to move from pre-

intervention (observational) data to a post-intervention setting. The major advan-

tage of DAGs is that organising causal knowledge in a graphical manner allows

predicting the effect of external interventions with a minimum of extra information.

Interventions can be modelled by removing links or equivalently adding conditional

independence relationships. This makes explicit the difference between observing

and doing : observing does not change the causal structure, while a do() action

induces a "surgery" change in the topology of the DAG, or equivalently it modifies

a set of functions in the SCM (Figure 13.12).

In particular, we call atomic intervention (or manipulation) the intervention where a variable xi is forced to take a given value xi (denoted by do(xi = xi)). The

atomic intervention pulls xi out of the SCM functional mechanism xi = fi (. . . ) and

places it under the influence of a new mechanism that sets xi = xi , while keeping

all the other mechanisms unperturbed. This amounts to removing the equation

xi =fi (. . . ) in the corresponding SCM and replacing it with the equation xi =xi .

In plain words, the effect of manipulations is to disconnect the manipulated variables

from their natural causes.

13.7.3 Causal effect estimation and identification

The post-intervention distribution is required to estimate the causal (treatment) effect

$$P(y = y \mid do(x = x')) - P(y = y \mid do(x = x''))$$


Unfortunately, the post-intervention distribution is not observable. So it is essential to try to answer the following question: "Can the controlled (post-intervention) distribution be estimated from data governed by the pre-intervention distribution?".

This is the problem of identification.

A fundamental theorem in causal analysis states that such identification is fea-

sible whenever the model is Markovian, i.e. the graph is acyclic (i.e., containing no

directed cycles), and all the error terms are jointly independent.

Theorem 7.3 (Adjustment for direct causes [146]). Let $\pi_i$ be the set of direct causes of $x_i$ and let y be any other variable disjoint from $\pi_i \cup x_i$. The post-intervention distribution of y is

$$P(y = y \mid do(x_i)) = \sum_{\pi_i} P(y = y \mid x_i, \pi_i) P(\pi_i) \qquad (13.7.10)$$

where $P(y = y \mid x_i, \pi_i)$ and $P(\pi_i)$ are pre-intervention distributions.

The essence of this theorem is that it transforms a causal statement (left-hand

term of (13.7.10)), i.e. a statement that cannot be directly estimated from observa-

tions, into a probabilistic statement (right-hand term), i.e. a statement that can be

estimated from observational data. The post-intervention conditional distribution is thus obtained by conditioning $P(y = y \mid x_i)$ further on the parents of $x_i$ and then averaging the results, with weights given by the prior probability of $\pi_i$. In the case of multiple interventions, the following general rule holds:

Theorem 7.4 (Truncated factorisation). For any Markovian model, the distribution generated by an intervention do(X_S = X_0) on a set X_S of endogenous variables is given by the truncated factorisation

$$P(x_1, x_2, \dots, x_k \mid do(X_S = X_0)) = \prod_{i \,\mid\, i \notin S} P(x_i \mid \pi_i)\Big|_{X_S = X_0}$$

where $\pi_i$ are the parents of $x_i$ and $P(x_i \mid \pi_i)$ are pre-intervention distributions. In other words, we remove from the factorisation all the factors associated with the intervened variables (the members of $X_S$).

Example

Let us consider the problem of computing the post-intervention distribution in two

different (yet related) trivariate causal configurations. In Figure 13.13 we have a

causal configuration3 where z is confounding the effect of xon the outcome y.

The pre-intervention distribution P(associated to the left-side of Figure 13.13)

is

P(y = y|x= x) = X

z

P(y = y|x= x, z= z) P(z = z|x= x) (13.7.11)

The post-intervention distribution P0 (associated to the right-side of Figure 13.13)

and obtained by removing all the edges pointing towards xis

P(y = y|do( x)) = P0 (y = y|x= x) =

=X

z

P0 (y = y|x= x, z= z ) P0 (z = z|x= x) = X

z

P0 (y = y|x= x, z= z ) P0 (z = z)

=X

z

P(y = y|x= x, z= z ) P(z = z)6 = P(y = y|x= x) (13.7.12)

³Check the analogy with Figure 13.5.


Figure 13.13: Confounding configuration.


Figure 13.14: Intermediary configuration.

This is in agreement with Theorem 7.3 and shows the difference between the conditional distribution P(y = y | x = x) and the causal intervention P(y = y | do(x)).
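A numeric sketch of the adjustment (13.7.12) on an assumed binary SCM (z → x, z → y, x → y; the parameter values are arbitrary):

set.seed(0)
N <- 1e5
z <- rbinom(N, 1, 0.5)                          # confounder
x <- rbinom(N, 1, ifelse(z == 1, 0.8, 0.2))     # z -> x
y <- rbinom(N, 1, 0.2 + 0.3 * x + 0.4 * z)      # x -> y and z -> y
mean(y[x == 1])                                 # naive P(y=1|x=1): about 0.82
## adjustment over z as in (13.7.12): about 0.7, the true P(y=1|do(x=1))
sum(sapply(0:1, function(v) mean(y[x == 1 & z == v]) * mean(z == v)))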

Let us now consider the causal configuration⁴ where z plays the role of intermediary between x and y (Figure 13.14). Since the pre- and post-intervention DAGs coincide, the pre- and post-intervention distributions do as well:

$$P(y = y \mid do(x)) = P(y = y \mid x = x)$$

This means that, in order to measure the causal effect of x on y in the configuration of Figure 13.14, conditioning is not required. Note that the difference between those two configurations is (i) only due to the different underlying causal structures and (ii) could not be detected by relying on conditional independence tests at the observational level (see the Simpson paradox in Section 13.2.1).

13.7.3.1 Backdoor criterion

We discussed in Section 13.6 that the unconfoundness assumption, essential for estimating causal effects in an observational study, is non-testable in the potential outcomes approach. Theorem 7.3 shows that we indeed need a causal graph to perform causal reasoning. But what about the most general case, e.g. non-observable direct parents (Figure 13.15)? How can we select a set of variables (also called a sufficient set) such that, by conditioning on it, we can estimate the causal effect with no bias? The backdoor criterion [146] is a graphical criterion which allows us to find such a sufficient set.

Definition 7.5. A set of variables Z is admissible (or "sufficient") for adjustment

if two conditions hold:

⁴Check the analogy with Figure 13.6.

Figure 13.15: Unobserved parent configuration: z is a latent variable. [146]

Figure 13.16: What is the causal effect of x on y?

1. No element of Z is a descendant of the treatment x.

2. The elements of Z block (Definition 3.5) all "backdoor" paths from x to y,

namely all paths that end with an arrow pointing to x.

Given a sufficient set Z, the average causal effect of x (boolean) on y (boolean) is

$$\text{Prob}\{y = 1 \mid do(x = 1)\} - \text{Prob}\{y = 1 \mid do(x = 0)\} = \sum_{Z} \big[\text{Prob}\{y = 1 \mid x = 1, Z\} - \text{Prob}\{y = 1 \mid x = 0, Z\}\big] P(Z)$$

In plain words, the sufficient set takes the role of the parents in (13.7.10). By

conditioning on this set, the observational distribution may be used to estimate

causal effects.

Exercise

Compute the backdoor set for measuring the causal effect of x on y, and of w on y, in the causal structure of Figure 13.16.

The rationale of the backdoor criterion is that the total association between

treatment and effect is a composition of the true causal effect and the non-causal

association that is generated by the backdoor paths. The backdoor paths in the

diagram carry spurious associations from x to y , while the paths along the arrows

from x to y carry causative associations. The causal effect of x on the outcome y is

identifiable if all spurious paths are blocked, and no new spurious paths are created.

Blocking the backdoor paths (by conditioning on Z) ensures that the observed

association between x and y is the causal effect of x on y . This criterion can be

applied systematically to diagrams of any size and shape, avoiding the ambiguity

related to unconfoundness seen in the potential-response framework.


Figure 13.17: Application of rule R2.

13.7.3.2 Beyond sufficient set: do-calculus

Adjusting for sufficient covariates is only one of many methods that permit us to estimate causal effects in nonexperimental studies. Pearl (1995a) has presented examples in which such a set of variables does not exist (or is not observable) and where the causal effect can nevertheless be estimated consistently. This led to the frontdoor criterion (intermediary variable)⁵. All those results were masterfully regrouped by J. Pearl in a set of rules which constitute the foundations of the calculus of intervention.

A causal effect of x on y is identifiable when the expression containing the do(x = x) operator can be transformed into an expression containing only conventional probabilistic statements. Let G_{\bar{x}} (G_{\underline{x}}) be the graph obtained by deleting from G all arrows pointing to (emerging from) x:

1. P(y|do(x), z, w) = P(y|do(x), w) if (y ⫫ z | w, x) in G_{\bar{x}}

2. P(y|do(x), do(z), w) = P(y|do(x), z, w) if (y ⫫ z | w, x) in G_{\bar{x}\underline{z}}

3. P(y|do(x), do(z), w) = P(y|do(x), w) if (y ⫫ z | w, x) in G_{\bar{x}\overline{z(w)}}, where z(w) is the subset of z nodes that are not ancestors of any w node in G_{\bar{x}}

Rule R1 formalises the notion of surgery, while rule R2 formalises the notion of backdoor, since it may be used to transform an action into an observation. An example of the adoption of rule R2 is in Figure 13.17, where P(y|do(z), w) = P(y|z, w) since (y ⫫ z | w) holds in G_{\underline{z}} (as shown by d-separation in the lower graph G_{\underline{z}}). R3 may be used to remove a do() operator, e.g.

P(y|do(z)) = P(y)

if (y ⫫ z) holds in G_{\overline{z}}, i.e. a cause y is not affected by the manipulation of its descendant z.

13.7.4 Selection bias

⁵For more details on an ingenious interpretation of the smoke/cancer causal relation in terms of the front-door criterion, we refer to [146].

Figure 13.18: Selection bias: conditioning on a descendant of the outcome. [163]

Figure 13.19: Selection bias: spurious association due to selection. [163]

Another important insight of CBN on the risks of associative data analysis is related to the notion of selection bias. Selection bias refers to any association created as a result of the process by which individuals are selected into the analysis. In more

general terms, it refers to biases that arise from conditioning on a common effect of

two variables, one of which is either the treatment or a cause of treatment, and the

other is either the outcome or a cause of the outcome. Selection bias in observational

studies may be due to the design (e.g. enrolment criteria) or unintended. A typical

case of bias is due to censoring (informative censoring) or to the restriction of the study to volunteers (self-selection bias). Note also that the risk of

selection bias is not limited to the observational setting. In fact, randomisation

protects against confounding but not against selection bias when the selection occurs

after the randomisation.

Below we present several examples of selection bias, taken from [163], which illustrate the risk of spurious associations due to implicit conditioning. In Figure 13.18 we see an example of selection bias due to conditioning on a descendant (e.g. obtained by thresholding) of the outcome y. The relation between x and y is unconfounded, but the choice of limiting the training set to outcome values smaller than a threshold may induce selection bias.

Figure 13.19 represents a causal graph where x stands for college education, i for the presence of impaired memory (a precursor of Alzheimer's) and y denotes the Alzheimer pathology. s is a variable denoting which examples were included in the study (0 stands for not selected). Suppose we pool two datasets: persons with college education (x = 1) and persons with impaired memory (i = 1). The variable s is then the logical OR of x and i. In the resulting dataset, the patients with x = 0 will necessarily have i = 1 and then y = 1. As a consequence of the related selection bias, a negative association between education and Alzheimer will appear, in spite of their independence made visible by d-separation.
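A minimal Python simulation of this example (the probabilities below are hypothetical and only serve the illustration) makes the spurious association appear: x and y are independent in the whole population, but a negative association emerges once the analysis is restricted to the selected units s = x OR i:

import numpy as np

rng = np.random.default_rng(0)
N = 100000

# x (college education) and i (impaired memory) are independent;
# in this toy model i is the only cause of the outcome y
x = rng.random(N) < 0.3
i = rng.random(N) < 0.2
y = rng.random(N) < np.where(i, 0.8, 0.05)

s = x | i   # selection: college education OR impaired memory

def contrast(x, y):
    # difference of outcome rates: P(y=1|x=1) - P(y=1|x=0)
    return y[x].mean() - y[~x].mean()

print(f"whole population: {contrast(x, y):+.3f}")       # close to 0
print(f"selected (s = 1): {contrast(x[s], y[s]):+.3f}")  # spuriously negative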

The examples above illustrate two major benefits of a causal representation of

the mechanism underlying data: DAGs visualise how causal relationships translate

into associations and provide a formal tool to detect biased situations (confounding,

selection bias). All this however requires the capability of drawing (or inferring) a

diagram that adequately describes contextually plausible associations.


Another famous example of selection bias occurred in relation to the Challenger Space Shuttle disaster in 1986 [94]. The night before the launch, an emergency meeting took place to understand whether it was dangerous to proceed with the launch given the exceptionally low temperature. The committee concluded that no relationship existed between temperature and past accidents. Unfortunately, a posteriori it emerged that such a conclusion was due to the fact that only data related to past accidents (and therefore biased) were taken into consideration. If data related to launches with no damage had been included in the analysis as well, the relation would have been easily detected and the tragedy probably avoided⁶.

13.8 Counterfactual

A counterfactual question that now you, patient reader of this book, could legitimately ask yourself is "How much happier would I have been if I had decided NOT to read such a book?" The reasoning needed to answer such a question should rely on

1. your current state, due to the choice of reading this book,

2. the causal model linking the decision and your happiness state (e.g. represented in Figure 13.20).

Counterfactual reasoning is probably one of the most important characteristics of human intelligence and refers to the human ability to imagine unseen scenarios and predict unobserved situations. This is the reason why Pearl [149] put counterfactuals at the uppermost level of the reasoning skills about causality (Figure 13.2). While do-operators P(y = y|do(x = 1)) model the average consequence of an intervention (do(x = 1)) on a population, counterfactuals P(y(x=1) = 1|e) model the impact of an unseen intervention (do(x = 1)) on a segment of the population defined by some evidence e. Counterfactual reasoning combines the observed evidence with the effect of an intervention in three steps:

1. Abduction: use the evidence e to infer information about the pre-intervention state (e.g. your motivation this morning or the quality of this book),

2. Intervention: modify the structural model to take the action into account,

3. Prediction: combine the pre-intervention distribution with the action to predict what would have happened.

In order to link the factual and the counterfactual worlds, Pearl proposes the adoption of a twin network merging the distribution of the observational world and the distribution of the world manipulated according to the counterfactual action (e.g., in the case of our reader, deciding not to open this book). Such a twin network (Figure 13.21) is a handy way to use the DAG formalism to model and answer counterfactual questions.
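The three steps can be made concrete on a deliberately simple structural model. The following Python sketch (a toy illustration under strong assumptions: the mechanism is known, linear and with additive noise; all numbers are hypothetical) applies abduction, intervention and prediction to the reader example:

# Toy structural model: happiness = a * decision + motivation
# decision = 1 means "read the book"; the coefficient a is assumed known
a = 2.0

# Factual evidence: the reader chose decision = 1 and reports happiness = 1.5
x_obs, y_obs = 1.0, 1.5

# 1. Abduction: recover the exogenous term consistent with the evidence
w = y_obs - a * x_obs          # motivation = -0.5

# 2. Intervention: replace the decision with the counterfactual do(decision = 0)
x_cf = 0.0

# 3. Prediction: propagate the abducted noise through the modified model
y_cf = a * x_cf + w

print(f"factual happiness: {y_obs}, counterfactual happiness: {y_cf}")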

Counterfactual example

Consider the talented doctor House, often urged to operate on desperate cases, and the success rate of his operations. Let x = 1 denote the fact of being operated on by doctor House, and y = 1 the death of the patient. Consider the three following differences:

1. P(y = 1|x = 1) − P(y = 1|x = 0)

2. P(y = 1|do(x = 1)) − P(y = 1|do(x = 0)) = P(y(1) = 1) − P(y(0) = 1)

3. P(y(1) = 1|x = 0) − P(y(0) = 1|x = 0)

⁶https://bookdown.org/egarpor/PM-UC3M/glm-challenger.html

Figure 13.20: Causal model of the reader's satisfaction (nodes: Motivation, Quality of lecture, Decision, Happiness this morning, Happiness now).

Figure 13.21: Twin counterfactual model of the reader's satisfaction: the factual network is merged with its counterfactual copy where Decision = 0.

The first quantity is positive, not because doctor House is a bad doctor, but because he is used to dealing with desperate situations. The second difference is negative since being operated on by Dr. House reduces the death risk. The third quantity is counterfactual and states what would have happened to a patient who was not operated on by Dr. House, had he or she indeed been operated on. This quantity should be more negative than the second one since the observation x = 0 provides additional evidence that the case was not desperate.

Counterfactual reasoning in law

It is interesting to see how law directives (e.g. the ones against discrimination) are often written in counterfactual language: The central question in any employment-discrimination case is whether the employer would have taken the same action had the employee been of a different race (age, sex, religion, national origin, etc.) and everything else had been the same. In counterfactual notation, this boils down to estimating the quantity

P(y(x=1) = 1|x = 0, y = 0)

where y denotes the hiring (y = 0 meaning that the person was not hired) and x the perceived race. The quantity above returns the probability that the refused person (observation y = 0) with race x = 0 would have been hired had the perceived race been different.

13.9 Causal structure identification

The estimation of causal effects requires the availability of a causal graph. Even

when experimental interventions are possible, performing the large number of experiments that would be required to discover causal relationships between tens or

thousands of variables is not practical. Causal discovery aims to return plausible

explanations for observable associations in terms of DAGs. The rationale is that

some causal relationships can be tested without doing experimentation. This means

that some causal dependencies can be inferred from non-temporal statistical data

if one makes certain simplifying assumptions about the underlying process of data

generation.

Causal Markov assumption (Definition 3.1): the causal interpretation is that all dependencies (or associations) between variables are due to causal relations. Note that this is an oversimplification since dependencies between variables could be generated in non-causal ways as well (see selection bias in Section 13.7.4).

Faithfulness: all independencies found in the distribution are due to d-separations in the graph or, equivalently, a d-connection implies a dependency (Section 4.3.2.1). In causal terms, this means that all causal pathways should induce a dependency, though this is not always the case (e.g. two causal pathways could cancel each other out).

Stability: the set of independencies of the associated distribution depends only on the structure of the graph and not on the parametrisation. Unstable independencies (i.e. independencies that disappear with a parameter change) are unlikely to occur in the data, so all the independencies are structural.


Causal sufficiency: a set of variables X is causally sufficient if, for every pair of variables x_1, x_2 ∈ X, every common direct cause of x_1 and x_2 is also a member of X. If this is not the case, there are latent variables.

There are two main families of causal structure identification algorithms: score-based and constraint-based. Score-based algorithms search, within a set of candidate structures, for the one which optimises some cost function. Commonly used score functions are the z-score of a hypothesis test (e.g. under the assumption of Gaussian linear dependencies), the maximum likelihood or an information-theoretic score (e.g. the BIC score). Such algorithms transform the problem of structure identification into a problem of optimisation. Though a number of state-of-the-art algorithms may be used to address such optimisation, the large size of the search space makes this approach unfeasible even for a moderate number of variables. Constraint-based algorithms are more commonly used and are discussed in the following sections.

13.9.1 Constraint-based approaches

We have extensively discussed in Chapter 4 the relation between topological proper-

ties and conditional independence in DAGs. The rationale of the constraint-based

approach is to use conditional independence relations as constraints during the

learning of the DAG structure from data. The idea is to derive from data a number

of testable implications (notably by estimating conditional independence) and use

them to disambiguate causal configurations (e.g. directionality) as much as possible.

The resulting algorithm iteratively looks for a DAG compliant with the statistical constraints and consists of two main steps:

1. Skeleton discovery compliant with conditional independence patterns.

2. Orientation based on v-structures and acyclic constraints.

The final goal is to discover the class of Markov equivalent DAGs (Definition 3.10)

which is consistent with the available dataset.

13.9.1.1 Normal conditional independence test

In order to speed up the conditional independence tests, constraint-based algorithms often assume a Normal distribution. Consider a multivariate normal vector X such that x_i, x_j ∈ X, X_S ⊂ X, and let s be the dimension of X_S. Then

ρ_{x_i x_j | X_S} = 0 ⇔ I(x_i, x_j | X_S) = 0

The sample partial correlation ρ̂_{x_i x_j | X_S} can be computed by regression, by inversion of the covariance matrix or recursively (Section 3.8.3). To test the null hypothesis ρ_{x_i x_j | X_S} = 0, Fisher's z-transform

Z = (1/2) log [(1 + ρ̂_{x_i x_j | X_S}) / (1 − ρ̂_{x_i x_j | X_S})]

is typically considered. The null hypothesis is rejected with significance level α (false positive rate) if

|Z| √(N − s − 3) > Φ^{−1}(1 − α/2)

where Φ(·) is the cumulative distribution of a standard Normal variable (Section 3.4.2).
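A compact Python implementation of this test (a sketch: the partial correlation is obtained here by inverting the sample correlation matrix) could look as follows:

import numpy as np
from scipy.stats import norm

def fisher_z_test(D, i, j, S, alpha=0.05):
    # Test I(x_i, x_j | X_S) = 0 on a data matrix D of shape [N, n]
    N = D.shape[0]
    idx = [i, j] + list(S)
    P = np.linalg.inv(np.corrcoef(D[:, idx], rowvar=False))
    r = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])   # sample partial correlation
    Z = 0.5 * np.log((1 + r) / (1 - r))         # Fisher's z-transform
    stat = np.sqrt(N - len(S) - 3) * abs(Z)
    return stat <= norm.ppf(1 - alpha / 2)      # True = independence not rejected

# Toy check: x and y are dependent, but independent given z
rng = np.random.default_rng(0)
z = rng.normal(size=2000)
x = z + rng.normal(size=2000)
y = z + rng.normal(size=2000)
D = np.column_stack([x, y, z])
print(fisher_z_test(D, 0, 1, []))    # False: marginally dependent
print(fisher_z_test(D, 0, 1, [2]))   # True: conditionally independent given z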


Let H be the empty graph over X   // forward strategy
for each pair (x_i, x_j) in X do
    Search X_S such that I(x_i, x_j | X_S) = 0   // conditional independence
    if no X_S exists then
        connect (x_i, x_j) in H
    else
        S_{i,j} = X_S
    end if
end for

Figure 13.22: IC algorithm: S_{i,j} stores the set of variables that make x_i and x_j conditionally independent.

Let H be the complete undirected graph over X   // backward strategy
for each pair (x_i, x_j) in H do
    for X_S ⊆ (V \ {x_i, x_j}) do
        if I(x_i, x_j | X_S) = 0 then
            Delete edge (i, j) from H
            S_{i,j} = X_S
            break
        end if
    end for
end for

Figure 13.23: SGS algorithm.

13.9.1.2 Skeleton discovery

The first example of a constraint-based algorithm is the IC algorithm (Figure 13.22), which adopts a forward and exhaustive approach. It starts with the empty graph and, for each pair (x_i, x_j) of nodes, adds a connecting edge if no conditioning set makes them d-separated; otherwise it stores the separating set in the variable S_{i,j}. The number of tests in the worst case is bounded by

$\binom{n}{2} 2^{n-2}$

where n is the number of variables.

Given the high computational cost of this approach, some IC variants start with the complete undirected graph (backward) or limit the conditioning size. An example is the SGS algorithm (Figure 13.23), which requires, however, an exponential search. In order to bound the computational complexity, the PC (Peter and Clark) algorithm [172] (Figure 13.24) sets a maximum size L of the conditioning set. For a given L, the number of tests is bounded polynomially by

$2\binom{n}{2}\sum_{i=0}^{L-1}\binom{n-2}{i} \le n^2 (n-2)^L$

13.9.1.3 Dealing with immoralities in the skeleton

A potential immorality in the skeleton is a triplet of variables such that x_i − x_k − x_j but no edge exists between x_i and x_j (i.e. I(x_i; x_j | x_k) > 0 or, equivalently, x_k ∉ S_{i,j}). It is then possible to extend the skeleton retrieved in the previous section by marking the immoralities, orienting the edges of the associated colliders and thus obtaining a partially directed DAG (PDAG). This is the first part of the edge orientation strategy in Figure 13.25. Once the immoralities are identified, we have identified the equivalence class of the resulting PDAG.

Let H be the complete undirected graph over X
for l = 0 to L do
    for adjacent x_i, x_j in H do
        for X_S ⊆ (N(x_i) ∪ N(x_j)) \ {x_i, x_j} such that |X_S| = l do
            if I(x_i, x_j | X_S) = 0 then
                Delete edge (i, j) from H
                S_{i,j} = X_S
                break
            end if
        end for
    end for
end for

Figure 13.24: PC algorithm: N(x_i) denotes the set of neighbours of the node x_i and L the maximum conditioning set size.

Let H be the skeleton
for each non-adjacent pair (x_i, x_j) with a common neighbour x_k in H do
    if x_k ∉ S_{i,j} then
        Add arrowheads pointing to x_k (v-structure x_i → x_k ← x_j)
    end if
end for
Orient as many undirected edges as possible by avoiding 1) new v-structures and 2) directed cycles

Figure 13.25: Edge orientation strategy.

The second part of the edge orientation strategy (Figure 13.25) relies on the

following rules:

R1: Orient j − k into j → k whenever there is an arrow i → j and i, k are not adjacent.

R2: Orient i − j into i → j whenever there is a chain i → k → j.

R3: Orient i − j into i → j whenever there are two chains i − k → j and i − l → j and k, l are not adjacent.

R4: Orient i − j into i → j whenever there are two chains i − k → l and k → l → j and k, j are not adjacent.

PC example

The diagrams in Figure 13.26B-E illustrate the steps of a PC algorithm reconstructing the ground-truth DAG in Figure 13.26A [80]. Step B relies on the independence x ⫫ y. Step C is due to the relations x ⫫ w | z and y ⫫ w | z. In D, the algorithm creates a collider from the triplet x − z − y since no edge x − y exists. Step E implements the R1 orientation rule.


Figure 13.26: PC algorithm [80]. Diagram A denotes the real DAG. Diagrams B to F represent the inferred structure at different PC steps. The first step corresponds to the complete undirected graph. The last two steps are performed by edge orientation.

13.9.1.4 Limitations

Constraint-based algorithms are well-known structure identification algorithms and have been intensively used in many practical domains. Nevertheless, they suffer from some limitations:

I-equivalence classes: two graphs that are I-equivalent (Definition 3.8) cannot

be distinguished by constraint-based approaches without resorting to manip-

ulative experimentation or temporal information.

Conditional independence: the use of conditional independence tests requires

an assumption on the dependence (e.g. linear as in Section 13.9.1.1 or nonlin-

ear). In the nonlinear case, it may be particularly expensive to have recourse

to conditional tests.

Finite-sample error propagation: the sequential nature of constraint-based algorithms suffers from error propagation (false positives) which is hard to monitor and control.

Asymptotic results of correctness: if the conditional independence decisions are correct in the large sample limit, the PC algorithm is guaranteed to converge to the true Markov equivalence class, assuming i.i.d. samples and the Markov, Faithfulness and Sufficiency assumptions.

Curse of dimensionality: the exponential complexity of the algorithm (Figure 13.27) makes constraint-based algorithms inadequate in large dimensional settings (e.g. bioinformatics).

13.10 Beyond conditional independence

The importance of causal reasoning in large dimensional settings (notably bioinfor-

matics) paved the way for developing alternative strategies for performing causal

inference from data. We limit ourselves here to sketching two main approaches: the use of causal feature selection strategies and the adoption of data-driven techniques to deal with indistinguishable situations.

Figure 13.27: Constraint-based algorithm complexity: upper bound on the number of conditional independence tests (log scale) as a function of the problem dimension n, for conditioning sizes d = 5, 10, 15.

13.10.1 Causality and feature selection

Feature selection and causal inference are related by the notion of Markov blanket

(MB) [182]. A Markov Blanket (Definition 8.4) is the smallest set of strongly relevant variables [118], i.e. variables containing information about the target which cannot be obtained from any other variable (Definition 8.1). A MB of a target contains

direct causes (parents), direct effects (children) and spouses (nodes that share a

child with the target). Feature selection techniques, being able to discriminate

between causes and effects, may then play a major role in causal modelling.

Tsamardinos et al. [182] proposed several algorithms to identify the Markov Blanket of a target variable. Pellet and Elisseeff [151] proposed an algorithm in two steps: the first builds an approximate structure of the causal graph using a feature selection algorithm, and the second improves it by local adjustments and orientation.

Most existing algorithms of causal feature selection [88] decompose feature selection and causality and rely on conditional independence tests to orient arcs and detect causal relationships. Nevertheless, not many approaches address large feature-to-sample ratio settings. The author proposed a causal filter in [33] which integrates a relevance component and a causal component into the cost function and addresses the issues of large feature-to-sample ratio settings. The algorithm, called mIMR, is a causal extension of the mRMR algorithm (Section 12.8.2) and was successfully applied to bioinformatics applications [32].

13.10.2 Beyond observational equivalence

Approaches like Bayesian networks or mIMR rely on notions of (conditional) independence and faithfulness to detect causal patterns in the data. They cannot

deal with indistinguishable (or equivalent) configurations (Section 4.3.3) like the

two-variables setting and the completely connected triplet configuration.

However, in what follows, we will see that indistinguishability does not prevent


Figure 13.28: An example of Markov Blanket in a Causal Bayesian Network. [90]

the existence of statistical algorithms able to reduce the uncertainty about the causal pattern. Note that this is a recent result, in contrast with the common belief that it is impossible to learn a causal graph with two variables (as stated for instance by Wasserman in [192]). Indeed this impossibility is limited to the Gaussian case where the two variables are linked by a linear relation: in this case the joint Gaussian distribution is fully determined by the mean and covariance matrix and there is no way to reconstruct directionality from the distribution. However, asymmetry in the dependence structure is a fundamental property that distinguishes causation from association. It follows that, under some specific constraints (e.g. non-Gaussian noise or nonlinear relationships), some asymmetric features of the distribution (beyond (in)dependence relations) might be informative about the causal structure.

13.10.2.1 Learning directionality in bivariate associations

A bivariate association between two variables x and y is necessarily due to one of the following reasons:

1. a causal influence going from x to y,

2. a causal influence going from y to x,

3. a (possibly unobserved) common cause (confounder) of x and y (note that time can also play the role of a confounder),

4. a (possibly unobserved) common effect of x and y (inducing selection bias),

or any of their combinations.

In recent years several approaches addressed the two-variable setting (e.g. ANM [104] and IGCI [54]) by using asymmetric statistical properties (e.g. due to non-Gaussian noise or nonlinear mappings) to detect causal patterns. An additive noise model (ANM) dependence between two variables x and y satisfies the relation

y = f(x) + w,    w ⫫ x,

where the noise term w is independent of the input x.

Figure 13.29: Illustration of asymmetric effects due to the existence of a causal dependency x → y. [104]

Consider a bivariate training

set related to two random variables x and y and suppose that either x → y or y → x holds. The algorithm proposed in [104] is based on the idea that if x → y is additive, the independence between cause and noise does not hold for y → x. The algorithm steps are:

1. Regress both y on x and x on y.

2. Compute the residuals w_y and w_x for the two regressions.

3. Compute two independence tests (e.g. HSIC): w_y ⫫ x and w_x ⫫ y.

4. Compare the two independence test statistics Ĉ_{x→y} and Ĉ_{y→x} to determine whether x → y is more probable than y → x or vice-versa.
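The following Python sketch illustrates the four steps on a simulated nonlinear pair (the polynomial regressor, the biased HSIC estimator and the fixed kernel width are illustrative choices, not those of [104]):

import numpy as np

def rbf_gram(v, sigma=1.0):
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(a, b):
    # Biased HSIC estimate: small values suggest independence
    n = len(a)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf_gram(a) @ H @ rbf_gram(b) @ H) / (n - 1) ** 2

def poly_fit(a, b, deg=3):
    # Simple polynomial regression used as plug-in regressor
    c = np.polyfit(a, b, deg)
    return lambda v: np.polyval(c, v)

def anm_direction(x, y):
    w_y = y - poly_fit(x, y)(x)     # steps 1-2: residual of y regressed on x
    w_x = x - poly_fit(y, x)(y)     # steps 1-2: residual of x regressed on y
    # steps 3-4: compare the two independence scores
    return "x -> y" if hsic(w_y, x) < hsic(w_x, y) else "y -> x"

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 300)
y = x ** 3 + rng.uniform(-1, 1, 300)   # nonlinear ANM with x -> y
print(anm_direction(x, y))             # expected: x -> y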

A further important step in this direction has been made by the ChaLearn cause-effect pair challenge [87], where participants had to learn from data the answer to questions like "Does altitude cause temperature or vice-versa?"⁷. Hundreds of pairs

of real variables with known causal relationships from several domains (chemistry,

climatology, ecology, economy, engineering, epidemiology, genomics, medicine) were

made available to competitors for training their models. Real data were intermixed

with controls (pairs of independent variables and pairs of variables that are depen-

dent but not causally related) and semi-artificial cause-effect pairs (real variables

mixed in various ways to produce a given outcome). The good accuracy obtained

by several competitors showed that learning strategies can address with success

(or at least significantly better than random) indistinguishable configurations. This

competition opened the way to a recent research direction which poses causal infer-

ence as the problem of learning to classify probability distributions [154]. In those

approaches causal inference is typically based on two steps:

1. a featurisation of the observed dataset,

⁷See the YouTube video "CauseEffectPairs" by Isabelle Guyon at https://www.youtube.com/watch?v=pgoZ5lmRRvE.


2. the training of a binary classifier to distinguish between causal directions.

Existing approaches differ mainly in the featurisation step: for instance, [127] pro-

posed an approach based on kernel mean embeddings.

The author proposed instead a machine learning approach based on mutual in-

formation in [34, 35], called D2C. Given two variables, the D2C approach infers from

a number of asymmetric statistical features of the n -variate distribution the proba-

bility of the existence of a directed causal link. Causal inference is then addressed

as a supervised learning task where: i) inputs are asymmetric features describing

the probabilistic dependency and ii) output is a class denoting the existence of the

causal link.

Once sufficient training data are made available, conventional feature selection

algorithms and classifiers can be used to return a prediction. The rationale of those

approaches is that, though "correlation does not imply causation", it happens that

"causation creates dependence" (if faithfulness holds). In other terms, causality

leaves footprints (e.g. asymmetric descriptors) in the statistical distribution that

can be reused to reduce the uncertainty about the existence (or directionality) of a

causal relationship.

13.11 Concluding remarks

A final chapter on causal inference in an introductory machine learning book could

appear as an unnecessary burden for an already exhausted reader. In fact, the

author deems that conventional machine learning books tend to create in the reader

a feeling of overconfidence about the power of predictive models. On the contrary,

in many crucial real-life applications, the real question at stake is causal (e.g. will

the lockdown policy have an impact on epidemics?) and not associative. Only an

understanding of causal relationships can support a reliable prediction of how a

system will behave once subject to intervention.

It is then important to stress the difference between the notion of probability conditional on observation (Prob{y|x = x}) and probability conditional on manipulation (Prob{y|do(x = x)}). Those two quantities are different, and confusing them may have disastrous consequences (e.g. overconfidence) in terms of decision making. Though machine learning provides many solutions to the algorithmic estimation of Prob{y|x = x}, the practitioner should be encouraged to think about whether Prob{y|x = x} is the quantity (s)he is really interested in. For instance, though a trader would be more than happy to improve the estimation of Prob{y|x = x} (where x stands for today's stock price and y for tomorrow's), this quantity would be of little use for an economist aiming to predict the impact of a Tobin-like tax on the markets⁸.

Notions like the Simpson paradox, confounding, selection bias, latent variables and counterfactuals should sound like a "forewarned is forearmed" message to machine learning practitioners (and over-optimistic deep-learning evangelists :-). Accurate prediction (whatever the depth of your network) implies neither accurate understanding nor good decision making. So, the old data mining adage "We are drowning in data and starving for knowledge" should instead read "We are drowning in associations and starving for causality".

⁸For a thorough analysis of the difference between predictive modelling and causal predictive modelling, we refer the reader to [173].

Chapter 14

Conclusions

We have come to the end, almost. We will take a few words to remind you that machine learning is not perfect and to cover a few ethical considerations. Then we will conclude with some take-home messages and final recommendations.

14.1 About ML limitations

From the dawn of the AI discipline, machine learning has been considered a key

component of autonomous intelligent agents. In recent years, though a full-fledged

artificial intelligence does not seem within reach yet, machine learning found great

success thanks to its data-driven and assumption-free nature.

This book insisted on the fact that no modelling effort may be completely assumption-free. Assumptions (though often implicit and hard-coded in the algorithms) are everywhere in the learning process, from problem formulation to data collection, model generation and assessment! When such assumptions happen to match reality, the resulting method is successful; if that is not the case, the result may be disappointing (see the NFL theorem).

Another misjudgment about machine learning is to consider it as a reliable proxy

of human learning. Machine learning owes its success to the generic and effective way

of transforming a learning problem into a (stochastic) optimisation one. Machines

do not think or learn like us: if this is the key to their success, this makes them

fragile, too. A large part of human rational decision making and understanding

cannot be reduced to the optimisation of a cost function.

This has been put into evidence by a recent trend taking a critical attitude

about the machine learning approach (e.g. the i.i.d. assumption) and its limits.

For instance, research on adversarial learning aims to show that the limited training

and validation set may induce very optimistic expectations about the generalisation

of learners. Recent research showed that automatic learners, which appear to be

accurate emulators of human knowledge (e.g. in terms of classification accuracy),

may be easily fooled once required to work in specific situations. A well-known

example (Figure 14.1) shows that deep learning classifiers, able to reach an almost

100% accuracy rate in recognising animal images, may return pitiful predictions,

when confronted with properly tweaked inputs [67]. Though this seems anecdotal,

such vulnerabilities in learning machines could be very dangerous in safety-critical

settings (e.g. self-driving cars).

Figure 14.1: Bad generalisation in front of adversarial examples.

Another interesting research direction is the study of the robustness of automatically learned models in settings which are not identical to the one used for training, e.g. because of nonstationarity and concept drift. This is particularly critical in health problems where models returning high-quality predictions for a specific cohort (e.g. in a given hospital) miserably fail when tested on different patients (e.g.

from another hospital). How to transfer learned models to close settings is then

another hot topic in recent learning research. In this context, causal interpretation

of data generation could play an important role in reducing the risk of drift and

increasing the model stability.

14.2 A bit of ethics

Last but not least, a word of ethics should not hurt in a book for computer scientists. "Data-driven" does not necessarily mean "objective". Machine learning models predict what they have been trained to predict and their forecasts are only as good as the data used for their training. In that sense, machine learning can reinforce human prejudices if trained on biased data sets derived from human decisions. Feeding learners with biased data can have dangerous consequences [141]. In 2016 the Twitter chatbot Tay began uttering racist statements after a single day of interaction. The

predictive justice software COMPAS, deciding whether a suspect should be incar-

cerated before trial or not, has been accused of being racially biased by an NGO.

In 2015, Google Photos identified two African American people as "gorillas".

Every ML practitioner should be aware that even models developed with the

best of intentions may exhibit discriminatory biases, perpetuate inequality, or per-

form less well for historically disadvantaged groups¹. Recent efforts in modelling and introducing fairness in machine learning (most of the time based on causal considerations) are then more than welcome.

¹See also the presentation https://tinyurl.com/y4ld3ohz

At the same time, the problem goes beyond the scientific and technical realm and involves human responsibility. Automating a task is a responsible decision-making act which implies encoding (implicitly or explicitly) ethical priorities in an autonomous agent. A self-driving car that decides to brake (or not to brake) is somewhat trading the cost of a human life against the cost of an over-conservative action. The choice of entrusting machines with tasks that could have an impact on human security or human sensibility should never exempt humans (from the programmer to the decision maker) from the legal and moral responsibility for probable errors.

To conclude, the ethical dilemma of ML may be summarized by the contrapo-

sition of the two citations at the beginning of this book: on the one hand, any

machine learning endeavour harbours the ambition (or the illusion) of catching the essence of reality with numbers or quantitative tools. On the other hand, "not

everything that counts" (notably ethics) can be counted or easily translated into

numerical terms.

14.3 Take-home notions

Quoting Einstein, "the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience". This sentence probably captures the primary take-home concept in machine learning: trade-off, notably the trade-off between bias and variance, underfitting and overfitting, parametric and nonparametric, false positive and false negative, type I and type II error, ... (please add).

Other bulk notions the author would like you to remember (or revise) are:

information theory: it is a powerful language to talk about stochastic dependence,

estimators: do not forget they are random variables with their own sampling distribution, and, even if very (very) good, they may be wrong sometimes²,

conditional probability: supervised learning is all about estimating it,

conditional (in)dependence and its non-monotonicity: mastering the complexity (and beauty) of high dimensionality goes that way.

14.4 Recommendations

We would like then to end this manuscript not by selling you a unique and superior way of proceeding when faced with data, but by proposing some golden rules for anyone who would like to venture into the world of statistical modelling and data analysis:

However complex your learning algorithm is (adaptive, deep or preferred by GAFA), do not forget it is an estimator and, as such, it makes assumptions (often implicitly). Each approach has its own assumptions! Be aware of them before using one.

Simpler things first! According to Wasserman [192], using fancy tools like

neural nets,...without understanding basic statistics is like doing brain surgery

before knowing how to use a band-aid.

Reality is probably almost always nonlinear but a massive amount of (theo-

retical and algorithmic) results exists only for linear methods.

Expert knowledge MATTERS... But data too :-)

It is better to be confident with a number of alternative techniques (preferably

linear and nonlinear) and use them in parallel on the same task.

Resampling and combining are at the forefront of the data analysis techniques.

Do not forget to test them when you have a data analysis problem.

Do not be religious about learning/modelling techniques. The best learning

algorithm does NOT exist.

Statistical dependency does not imply causality, though it may shed some light on it.

²...even the divine Roberto Baggio missed a penalty in the 1994 FIFA World Cup Final against Brazil :-(

and the best motto for a machine learner:

Once you stop learning, you start dying (Albert Einstein).

Appendix A

Unsupervised learning

A.1 Probability density estimation

Probability density estimation is the problem of inferring a probability density function p_z, given a finite number of data points {z_1, z_2, ..., z_N} drawn from that density function. We distinguish three alternative approaches to density estimation:

Parametric. This approach assumes a parametric model of the unknown probability density. The parameters are estimated by fitting the parametric function to the observed dataset. This approach has been extensively discussed in Chapter 5.

Nonparametric. This approach does not assume any a priori form of the density model. The form of the density is entirely determined by the data and the number of parameters grows with the size of the dataset.

Semi-parametric. In this approach the number of parameters is not fixed a priori but is independent of the size of the dataset.

A.1.1 Nonparametric density estimation

The term nonparametric is used to describe probability density functions whose functional

form is not specified in advance, but is dependent on data [162, 144].

Let us consider a random variable z with probability density p_z(z) and a region R defined on the z space. The probability that a value z drawn according to p_z(z) falls inside R is

P_R = Prob{z ∈ R} = ∫_R p_z(z) dz    (A.1.1)

Let us define with k the random variable representing the number of points falling within R after we have drawn N points from p_z(z) independently. From (C.1.1) we have that its probability distribution is

p_k(k) = N!/(k!(N − k)!) P_R^k (1 − P_R)^(N−k)    (A.1.2)

Moreover, the random variable k/N satisfies

E[k/N] = P_R    (A.1.3)

and

Var[k/N] = E[(k/N − P_R)²] = P_R(1 − P_R)/N    (A.1.4)

Since, according to (A.1.4), the variance of k/N converges to zero as N → ∞, it is reasonable to expect that the fraction k/N returns a good estimate of the probability P_R:

P_R ≅ k/N    (A.1.5)


At the same time, if we assume that p_z(z) is continuous and does not vary appreciably over R, we can approximate P_R with

P_R = ∫_R p(z) dz ≅ p(z) V    (A.1.6)

with V the volume of R. From (A.1.5) and (A.1.6) it follows that, for values of z inside R,

p(z) ≅ k/(NV)    (A.1.7)

In order for (A.1.5) to hold, a large R is required. This implies a sharply peaked p_z(z). In order for (A.1.6) to hold, a small R is required. This ensures that p_z(z) is approximately constant in R. These are two clashing requirements. We deduce that it is necessary to find an optimal trade-off for R in order to guarantee a reliable estimation of p_z(z). This issue is common to all nonparametric approaches to density estimation. In particular, we will introduce two of them:

Kernel-based. This approach fixes R and searches for the optimal number of points k.

k-Nearest Neighbor (k-NN). This approach fixes the value of k and searches for the optimal R.

The two approaches are discussed in detail in the following sections.

A.1.1.1 Kernel-based methods

Consider a random vector z of dimension [n × 1] and suppose we take a hypercube region R with sides of length B centered on the point z. The volume of R is

V = B^n

Let us now define a kernel function (or Parzen window) K(u) as

K(u) = { 1 if |u_j| < 1/2, j = 1, ..., n; 0 otherwise }    (A.1.8)

where u_j is the jth component of the vector u. It follows that the quantity

K((z − z_i)/B)

is equal to unity if z_i is inside the hypercube centered at z with side B.

Therefore, given a set of N points, the number of points falling inside R is given by

k = Σ_{i=1}^{N} K((z − z_i)/B)    (A.1.9)

From (A.1.7) and (A.1.9) it is possible to define the kernel-based estimate of the probability density for the kernel (A.1.8) as

p̂(z) = (1/(N B^n)) Σ_{i=1}^{N} K((z − z_i)/B)    (A.1.10)

Note that the estimate (A.1.10) is discontinuous over the z-space. In order to smooth it, we may choose alternative kernel functions, such as the Gaussian kernel. The kernel-based method is a traditional approach to density estimation. However, its two most relevant shortcomings are that:

1. it returns a biased estimator [25],

2. it requires the memorisation of the whole set of observations. As a consequence, the estimation is very slow when the number of observations is high.
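A minimal Python sketch of the estimate (A.1.10) with the hypercube kernel (A.1.8) (the bandwidth B and the test point are illustrative choices):

import numpy as np

def parzen_estimate(z, data, B):
    # Kernel-based density estimate (A.1.10) with the hypercube kernel (A.1.8)
    N, n = data.shape
    inside = np.all(np.abs((z - data) / B) < 0.5, axis=1)  # K((z - z_i)/B)
    return inside.sum() / (N * B ** n)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(10000, 1))    # N standard normal observations
print(parzen_estimate(np.array([0.0]), data, B=0.5))  # ~ 0.399 = N(0,1) density at 0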


A.1.1.2 k-Nearest Neighbors methods

Consider a hypersphere R centered at a point z, and let us grow it until it contains k points. Using Eq. (A.1.7) we can derive the k-Nearest Neighbor (k-NN) density estimate

p̂_z(z) = k/(NV)    (A.1.11)

where k is the value of a parameter fixed a priori, N is the number of available observations and V is the volume of the hypersphere. Like kernel-based methods, k-NN is a state-of-the-art technique in density estimation. However, it features two main shortcomings:

1. the quantity (A.1.11) is not properly a probability density, since its integral over the whole z space is not equal to one but diverges,

2. as in the kernel method, it requires the storage of the whole dataset.

A.1.2 Semi-parametric density estimation

In semi-parametric techniques the size of the model does not grow with the size of the data

but with the complexity of the problem. As a consequence, the procedure for defining the

structure of the model is more complex than in the approaches previously seen.

A.1.2.1 Mixture models

The unknown density function is represented as a linear superposition of m basis functions. The distribution is called a mixture model and has the form

p_z(z) = Σ_{j=1}^{m} p(z|j) π(j)    (A.1.12)

where m is a parameter of the model and typically m ≪ N. The coefficients π(j) are called mixing coefficients and satisfy the following constraints:

Σ_{j=1}^{m} π(j) = 1,   0 ≤ π(j) ≤ 1    (A.1.13)

The quantity π(j) is typically interpreted as the prior probability that a data point is generated by the jth component of the mixture. According to Bayes' theorem, the corresponding posterior probability is

p(j|z) = p(z|j) π(j)/p(z) = p(z|j) π(j) / Σ_{j=1}^{m} p(z|j) π(j)    (A.1.14)

Given a data point z , the quantity p (j| z ) represents the probability that the component j

had been responsible for generating z.

An important property of mixture models is that they can approximate any continuous

density with arbitrary accuracy provided the model has a sufficient number of components

and provided the parameters of the model are tuned correctly.

Let us consider a Gaussian mixture model with components p(z|j) ~ N(µ_j, σ_j²) and suppose that a set of N observations is available. Once the number of basis functions is fixed, the parameters to be estimated from data are the mixing coefficients π(j) and the terms µ_j and σ_j.

The procedure of maximum likelihood estimation of a mixture model is not simple, due to the existence of local minima and singular solutions. Standard nonlinear optimisation techniques can be employed, once the gradients of the log-likelihood with respect to the parameters are given. However, there exist algorithms which avoid the complexity of a nonlinear estimation procedure. One of them is the EM algorithm, introduced in the following section.


A.1.2.2 The EM algorithm

The expectation-maximisation or EM algorithm [57] is a simple and practical method for estimating the mixture parameters while avoiding complex nonlinear optimisation algorithms.

The assumption of the EM algorithm is that the available dataset is incomplete. This

incompleteness can either be due to some missing measurements or because some imaginary

data are introduced to simplify the mathematical form of the likelihood function.

The second situation is assumed to hold in the case of mixture models. The goal of

the EM algorithm is then to maximize the likelihood of the parameters of a mixture model

assuming that some data is missing in the available dataset.

The algorithm has an iterative form in which each iteration consists of two steps: an

expectation calculation (E step) and a maximisation (the M step). It has been shown in the literature that the EM iterations converge to a local maximum of the likelihood of the incomplete data.

Assume that there exists a statistical model of our dataset DN and that it is parametrized

by a real vector θ. Assume also that further data, denoted by Ξ, exist but are not observ-

able. The quantity ∆N is used to denote the whole dataset, containing both the observed

and unobserved data, and is usually referred to as the complete data.

Let us denote by l_comp(θ) the log likelihood of the parameter θ given the complete data.

This is a random variable because the values of Ξ are not known. Hence, it is possible for a

given value θ (τ) of the parameter vector to compute the expected value of lcomp (θ (τ) ). This

gives a deterministic function of the current value of the parameter, denoted by Q (θ (τ) ),

that can be considered as an approximation to the real value of l , called the incomplete

likelihood. The maximisation step is expected to find the parameter value θ^(τ+1) which maximizes Q. The EM procedure in detail is the following:

1. Make an initial estimate θ^(0) of the parameter vector.

2. The log likelihood l_comp(θ^(τ)) of the parameters θ^(τ) with respect to the complete data ∆_N is calculated. This is a random function of the unknown dataset Ξ.

3. The E-step: the expectation Q(θ^(τ)) of l_comp(θ^(τ)) is calculated.

4. The M-step: a new estimate of the parameters is found by the maximisation

θ^(τ+1) = arg max_θ Q(θ)    (A.1.15)

The theoretical justification comes from the following result proved in [57]: for a sequence θ^(τ) generated by the EM algorithm, it is always true that, for the incomplete likelihood,

l(θ^(τ+1)) ≥ l(θ^(τ))    (A.1.16)

Hence the EM algorithm is guaranteed to converge to a local maximum of the incomplete likelihood.

A.1.2.3 The EM algorithm for the mixture model

In the mixture model estimation problem, determining the parameters (i.e. the mixing coefficients and the parameters of the density p(z|j) in Eq. (A.1.12)) would be straightforward if we knew which component j was responsible for generating each data point in the dataset. We therefore consider a hypothetical complete dataset in which each data point is labeled by the component which generated it. Thus, for each point z_i we introduce m indicator random variables ζ_ij, j = 1, ..., m, such that

ζ_ij = { 1 if z_i is generated by the jth basis; 0 otherwise }    (A.1.17)

Let ∆_N be the extension of the dataset D_N, i.e. the complete dataset including the unobservable ζ_ij. The probability distribution for each (z_i, ζ_ij) is either zero or p(z_i|j). If we let ζ_i represent the set {ζ_i1, ζ_i2, ..., ζ_im} then

p_{ζ_i}(ζ_i) = π(j_0) where j_0 is such that ζ_{ij_0} = 1    (A.1.18)


so

p(z_i, ζ_i) = p(ζ_i) p(z_i|j_0) = π(j_0) p(z_i|j_0) = Π_{j=1}^{m} [π(j) p(z_i|j)]^{ζ_ij}    (A.1.19)

Thus the complete log likelihood is given by

l_comp(θ) = ln L_comp(θ) = ln Π_{i=1}^{N} Π_{j=1}^{m} [π(j) p(z_i|j)]^{ζ_ij}    (A.1.20)

= Σ_{i=1}^{N} ln Π_{j=1}^{m} [π(j) p(z_i|j)]^{ζ_ij}    (A.1.21)

= Σ_{i=1}^{N} Σ_{j=1}^{m} ζ_ij {ln π(j) + ln p(z_i|j)}    (A.1.22)

where the vector θ includes the mixing coefficients and the parameters of the density p(z|j) in Eq. (A.1.12). By introducing the terms ζ_ij, the logarithm can be brought inside the summation. The cost of this algebraic simplification is that we do not know the values of the ζ_ij for the training data. At this point the EM algorithm can be used. For a value θ^(τ) of the parameters, the E-step is carried out:

Q(θ^(τ)) = E[l_comp(θ^(τ))] = E[Σ_{i=1}^{N} Σ_{j=1}^{m} ζ_ij {ln π(j) + ln p(z_i|j)}]    (A.1.23)

= Σ_{i=1}^{N} Σ_{j=1}^{m} E[ζ_ij] {ln π(j) + ln p(z_i|j)}    (A.1.24)

Since

E[ζ_ij] = P(ζ_ij = 1|z_i) = p(z_i|ζ_ij) P(ζ_ij)/p(z_i) = p(z_i|j) π(j)/p(z_i) = p(j|z_i)    (A.1.25)

from Eq. (A.1.14) and (A.1.18), we have

Q(θ^(τ)) = Σ_{i=1}^{N} Σ_{j=1}^{m} p(j|z_i) {ln π(j) + ln p(z_i|j)}    (A.1.26)

The M-step maximizes Q with respect to the whole set of parameters θ, but it is known that this can be done individually for each parameter if we consider a Gaussian mixture model

p(z|j) = (1/(2πσ_j²)^{n/2}) exp(−(z − µ_j)²/(2σ_j²))    (A.1.27)

In this case we have:

Q(θ^(τ)) = Σ_{i=1}^{N} Σ_{j=1}^{m} p(j|z_i) {ln π(j) + ln p(z_i|j)}    (A.1.28)

= Σ_{i=1}^{N} Σ_{j=1}^{m} p(j|z_i) [ln π(j) − n ln σ_j^(τ) − (z_i − µ_j^(τ))²/(2(σ_j^(τ))²)] + constant    (A.1.29)

We can now perform the maximisation (A.1.15). For the parameters µ_j and σ_j the maximisation is straightforward:

µ_j^(τ+1) = Σ_{i=1}^{N} p(j|z_i) z_i / Σ_{i=1}^{N} p(j|z_i)    (A.1.30)

(σ_j^(τ+1))² = (1/n) Σ_{i=1}^{N} p(j|z_i) (z_i − µ_j^(τ+1))² / Σ_{i=1}^{N} p(j|z_i)    (A.1.31)


For the mixing parameters the procedure is more complex [25] and returns:

π(j)^(τ+1) = (1/N) Σ_{i=1}^{N} p(j|z_i)    (A.1.32)

where p(j|z_i) is computed as in (A.1.25).
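The following Python sketch implements the E-step (A.1.25) and the M-step updates (A.1.30)-(A.1.32) for a univariate Gaussian mixture (the initialisation and the fixed number of iterations are simplistic, illustrative choices):

import numpy as np

def em_gaussian_mixture(z, m=2, iters=100):
    N = len(z)
    rng = np.random.default_rng(0)
    mu = rng.choice(z, m)            # initial means picked among the data
    sigma = np.full(m, z.std())
    pi = np.full(m, 1.0 / m)
    for _ in range(iters):
        # E-step: responsibilities p(j | z_i) as in (A.1.25)
        dens = np.exp(-(z[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
               / np.sqrt(2 * np.pi * sigma ** 2)
        post = pi * dens
        post /= post.sum(axis=1, keepdims=True)
        # M-step: parameter updates as in (A.1.30)-(A.1.32)
        Nj = post.sum(axis=0)
        mu = (post * z[:, None]).sum(axis=0) / Nj
        sigma = np.sqrt((post * (z[:, None] - mu) ** 2).sum(axis=0) / Nj)
        pi = Nj / N
    return pi, mu, sigma

rng = np.random.default_rng(1)
z = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])
print(em_gaussian_mixture(z))   # mixing weights close to (0.3, 0.7)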

A.2 K-means clustering

The K-means algorithm partitions a collection of N vectors x_i, i = 1, ..., N, into K groups G_k, k = 1, ..., K, and finds a cluster center in each group such that a cost function of a dissimilarity (or distance) measure is minimized. When the Euclidean distance is chosen as the dissimilarity measure between a vector x in the kth group and the corresponding cluster center c_k, the cost function can be defined by

J = Σ_{k=1}^{K} J_k = Σ_{k=1}^{K} Σ_{x ∈ G_k} d(x, c_k)    (A.2.33)

where J_k is the cost function within group k and d is a generic distance function

d(x, c_k) = (x − c_k)^T M (x − c_k)    (A.2.34)

where M is the distance matrix. The partitioned groups are typically defined by a [K × N] membership matrix U, where the element u_ki is 1 if the ith data point x_i belongs to group k, and 0 otherwise. The matrix U satisfies the following conditions:

Σ_{k=1}^{K} u_ki = 1,  i = 1, ..., N

Σ_{k=1}^{K} Σ_{i=1}^{N} u_ki = N    (A.2.35)

Once the cluster centers c_k are fixed, the terms u_ki which minimize Eq. (A.2.33) are:

u_ki = { 1 if d(x_i, c_k) ≤ d(x_i, c_j) for each j ≠ k; 0 otherwise }    (A.2.36)

This means that x_i belongs to group k if c_k is the closest center among all centers. Once the terms u_ki are fixed, the optimal center c_k that minimizes Eq. (A.2.33) is the mean of all vectors in the kth group:

c_k = (1/|G_k|) Σ_{x ∈ G_k} x    (A.2.37)

where |G_k| is the size of G_k.

The K-means algorithm iteratively determines the cluster centers c_k and the membership matrix U using the following procedure:

1. Initialize the cluster centers ck , typically by randomly selecting K points among all

data points.

2. Evaluate the membership matrix U through Eq. (A.2.36).

3. Compute the cost (A.2.33). If it is below a certain tolerance value or if the improve-

ment is not significant, stop and return the centers and the groups.

4. Update the cluster centers according to Eq. (A.2.37). Go to step 2.

Some final remarks should be made on the K-means algorithm. Like many other clustering algorithms, this technique is iterative and no guarantee of convergence to an optimal solution exists. Also, the final performance is quite sensitive to the initial position of the cluster centers and to the number K of clusters, typically fixed a priori by the designer.
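A minimal Python sketch of the procedure above with the Euclidean distance (i.e. M equal to the identity in (A.2.34)); being a sketch, it does not guard against empty groups:

import numpy as np

def kmeans(X, K, iters=100):
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), K, replace=False)]   # step 1: random init
    for _ in range(iters):
        # step 2: assign each point to its closest center, as in (A.2.36)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # step 4: recompute each center as the group mean, as in (A.2.37)
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):           # step 3: stop if stable
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
centers, labels = kmeans(X, K=2)
print(centers)   # two centers close to (0, 0) and (5, 5)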

Appendix B

Linear algebra notions

Linear algebra, the science of vector spaces, plays a major role in machine learning, where data are represented in vectorial form. Though the readers are supposed to have attended numerical analysis classes, here we recall some basic notions of linear algebra. For a more extensive presentation of linear algebra and its links with machine learning, we refer the reader to recent references like [2, 56].

B.1 Rank of a matrix

Let us consider a [N, n] matrix X. Many definitions exist for the rank of a matrix: here we limit ourselves to considering the rank of X as the maximal number of linearly independent columns of X. Since the rank of a [N, n] matrix is at most min{N, n}, a matrix is full-rank if its rank is min{N, n}. A matrix which is not full-rank is also called rank-deficient.

B.2 Inner product

In linear algebra, the dot product, also known as the scalar or inner product, is an operation which takes two vectors over the real numbers R and returns a real-valued scalar quantity. It is the standard inner product of the orthonormal Euclidean space. The dot product of two [n, 1] vectors x = [x_1, x_2, ..., x_n]^T and y = [y_1, y_2, ..., y_n]^T is defined as

⟨x, y⟩ = Σ_{j=1}^{n} x_j y_j = x^T y    (B.2.1)

The dot product underlies the definition of the following quantities:

the Euclidean norm of a vector x:

‖x‖ = √⟨x, x⟩,    (B.2.2)

also known as the L2 norm,

the Euclidean distance of two [n, 1] vectors x_1 and x_2:

‖x_1 − x_2‖ = √⟨x_1 − x_2, x_1 − x_2⟩,    (B.2.3)

the angle ω between two vectors x_1 and x_2, which satisfies the relation

−1 ≤ ⟨x_1, x_2⟩ / (‖x_1‖ ‖x_2‖) = cos(ω) ≤ 1,    (B.2.4)

the projection of a vector x_1 onto a direction x_2:

π_{x_2}(x_1) = (⟨x_1, x_2⟩/‖x_2‖²) x_2 = (x_2 x_2^T/‖x_2‖²) x_1    (B.2.5)

where the [n, n] matrix P = x_2 x_2^T/‖x_2‖² is called the projection matrix.


In more qualitative terms, the notion of inner product allows the introduction of a similarity score between vectors. In this sense, the least similar vectors are two orthogonal vectors, i.e. two vectors x and y such that ⟨x, y⟩ = 0 and ω = π/2. Note also that the following relation holds:

x x^T y = ⟨x, y⟩ x

B.3 Diagonalisation

A [N, N] matrix X is diagonalisable if there exists an invertible matrix P such that

X = P D P^{−1}.    (B.3.6)

A symmetric matrix can always be diagonalised and the diagonal entries of D are its eigenvalues.

B.4 QR decomposition

Let us consider a [N, n] matrix X with N ≥ n and n linearly independent columns. By Gram-Schmidt orthogonalisation [2] it is possible to write

X = QR    (B.4.7)

where Q is a [N, n] matrix with n orthonormal columns q_j (i.e. q_j^T q_j = 1 and q_j^T q_k = 0 if j ≠ k) and R is a [n, n] upper-triangular matrix. Since Q^T Q = I_n, the pseudo-inverse of X can be written as

X^+ = (X^T X)^{−1} X^T = (R^T Q^T Q R)^{−1} R^T Q^T = R^{−1} Q^T    (B.4.8)

If X is rank-deficient (i.e. only n′ < n < N columns of X are linearly independent) it is possible to perform the generalised QR decomposition

X = QR

where Q is [N, n′] and R is a [n′, n] rectangular upper-triangular matrix with n′ < n. Since R is of full row rank, the matrix R R^T is invertible and the pseudo-inverse of X can be written as

X^+ = R^T (R R^T)^{−1} Q^T    (B.4.9)

also known as the Moore-Penrose pseudo-inverse.

B.5 Singular Value Decomposition

Let us consider a [N, n] matrix X with N ≥ n: such a matrix can always be factorised into the product of three matrices

X = U D V^T    (B.5.10)

where U is a [N, N] matrix with orthonormal columns (i.e. U^T U = I_N), D is a [N, n] diagonal matrix whose diagonal entries d_ii ≥ 0 are called the singular values and V is a [n, n] matrix with orthonormal columns.

It can be shown that the N columns of U (also called the left singular vectors) are the N eigenvectors of the [N, N] symmetric matrix X X^T and the n columns of V (also called the right singular vectors) are the n eigenvectors of the [n, n] symmetric matrix X^T X. The non-zero singular values are the square roots of the non-zero eigenvalues of X^T X and of the non-zero eigenvalues of X X^T. This is made evident by the link between SVD and the diagonalisation of X^T X:

X^T X = (U D V^T)^T (U D V^T) = V D^T U^T U D V^T = V D^T D V^T


The SVD of a matrix X of rank r can also be written as:

X = Σ_{j=1}^{r} d_jj u_j v_j^T

where u_j is the j-th column of U and v_j is the j-th column of V.

If in the decomposition above we stop at the order r' < r, we obtain a low-rank approximation of X:

X' = Σ_{j=1}^{r'} d_jj u_j v_j^T.

Another common SVD decomposition is the economy (or reduced) SVD:

X = U D V^T    (B.5.11)

where k = min{N, n}, U is [N, k] with orthonormal columns, D is a square [k, k] matrix and V is a [n, k] matrix with orthonormal columns.

SVD plays an important role in determining the ill-conditioning of a square matrix, i.e. how close the matrix is to being singular. The condition number of a matrix is the ratio of its largest singular value to its smallest singular value. The larger this number (which is ≥ 1), the larger the ill-conditioning of the matrix.

Note also that if X is a symmetric matrix, the SVD decomposition returns the diagonalisation (B.3.6).

B.6 Chain rules of differential calculus

Let J be the scalar function of α ∈ R:

J(α) = f(g(h(α)))

where f, g, h : R → R are scalar functions. Then the univariate chain rule is

dJ/dα = (dJ/df) (df/dg) (dg/dh) (dh/dα)

Let us consider the function J : R → R

J = f(g_1(α), g_2(α), ..., g_n(α))

between α ∈ R and the scalar J, where g_j : R → R, j = 1, ..., n. Then the multivariate chain rule returns the scalar gradient

dJ/dα = Σ_{j=1}^{n} (∂J/∂g_j) (dg_j/dα)

Let J : R^n → R^m be the mapping between an input vector α ∈ R^n and an output vector of size m. The associated Jacobian matrix is the [m, n] matrix

∇_α J = [∂J(α)/∂α_1, ∂J(α)/∂α_2, ..., ∂J(α)/∂α_n] =

    [ ∂J_1(α)/∂α_1   ∂J_1(α)/∂α_2   ...   ∂J_1(α)/∂α_n ]
    [      ...             ...       ...        ...      ]
    [ ∂J_m(α)/∂α_1   ∂J_m(α)/∂α_2   ...   ∂J_m(α)/∂α_n ]

In the most generic case, suppose that J : R^n → R^m, α ∈ R^n, and

J = F_k(F_{k−1}(... F_1(α)))

where F_i : R^{n_i} → R^{n_{i+1}}, n_1 = n and n_{k+1} = m. Then the vectored chain rule [2] is

∂J/∂α = (∂F_k/∂F_{k−1}) (∂F_{k−1}/∂F_{k−2}) ... (∂F_1/∂α)

where the factors have sizes [m, n], [m, n_k], [n_k, n_{k−1}], ..., [n_2, n], respectively.
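A minimal R sketch checking the univariate chain rule by finite differences (the three functions below are arbitrary illustrative choices):

f <- function(u) sin(u); g <- function(u) u^2; h <- function(u) exp(u)
J <- function(a) f(g(h(a)))
a <- 0.3
analytic <- cos(g(h(a))) * 2 * h(a) * exp(a)   # (dJ/df)(df/dg)(dg/dh)(dh/da)
eps <- 1e-6
numeric <- (J(a + eps) - J(a - eps)) / (2 * eps)
c(analytic, numeric)                           # the two values coincide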


B.7 Quadratic norm

Consider the quadratic norm

J(x) = ||Ax + b||²

where J : R^n → R, A is a [N, n] matrix, x is a [n, 1] vector and b is a [N, 1] vector. It can be written in the matrix form

J(x) = x^T A^T A x + 2 b^T A x + b^T b

The first derivative of J with respect to x is the [n, 1] vector

∂J(x)/∂x = 2 A^T (Ax + b)

and the second derivative is the [n, n] matrix

∂²J(x)/∂x∂x^T = 2 A^T A
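A minimal R sketch verifying the gradient 2 A^T (Ax + b) by finite differences (sizes and values are illustrative):

set.seed(0)
A <- matrix(rnorm(12), 4, 3); b <- rnorm(4); x <- rnorm(3)
J <- function(x) sum((A %*% x + b)^2)          # quadratic norm
analytic <- 2 * t(A) %*% (A %*% x + b)
eps <- 1e-6
numeric <- sapply(1:3, function(j) {
  e <- rep(0, 3); e[j] <- eps
  (J(x + e) - J(x - e)) / (2 * eps)
})
cbind(analytic, numeric)                       # the two columns coincide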

B.8 Quadratic programming

Quadratic programming is the resolution procedure for continuous optimisation problems with a quadratic objective function, for instance

b* = arg min_b J(b),   J(b) = b^T D b

where b is a [n, 1] vector and D is a [n, n] matrix. For instance, if n = 2 and D is the identity matrix, J(b) = b_1² + b_2² has a single global minimum in [0, 0]. If the solution is subject to no constraints, the problem is called unconstrained. If D is a positive (negative) semidefinite matrix, the function J is convex (concave).

In machine learning the most common quadratic programming task is strictly convex since it derives from the least-squares formulation, where D is a positive definite matrix. The general form of an unconstrained strictly convex quadratic objective function is

b* = arg min_b (b^T D b − d^T b + k) = arg min_b (b^T D b − d^T b)    (B.8.12)

where D is positive definite, d is a [n, 1] vector and k is a scalar (which has no impact on the minimisation problem).

The constrained version has a set of linear inequality constraints of the form

A^T b ≥ b_0

where A is a [n, c] matrix defining the c constraints under which we want to minimise the function J, and b_0 is a [c, 1] vector.

The R package quadprog provides the implementation solve.QP of a method to solve a strictly convex constrained quadratic programming task.
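A minimal sketch of solve.QP, assuming the quadprog package is installed; note that solve.QP minimises (1/2) b^T D b − d^T b subject to A^T b ≥ b_0 (the numbers below are illustrative):

library(quadprog)
D <- 2 * diag(2)               # quadratic term (the factor 2 compensates the 1/2)
d <- c(1, 1)                   # linear term
A <- diag(2)                   # constraints b_1 >= 0 and b_2 >= 0
b0 <- c(0, 0)
sol <- solve.QP(Dmat = D, dvec = d, Amat = A, bvec = b0)
sol$solution                   # (0.5, 0.5): the unconstrained optimum, feasible here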

B.9 The matrix inversion formula

Let us consider the four matrices F, G, H and K and the matrix F + GHK. Assume that the inverses of the matrices F, H and (F + GHK) exist. Then

(F + GHK)^{−1} = F^{−1} − F^{−1} G (H^{−1} + K F^{−1} G)^{−1} K F^{−1}    (B.9.13)

Consider the case where F is a [n × n] square nonsingular matrix, G = z where z is a [n × 1] vector, K = z^T and H = 1. Then the formula simplifies to

(F + z z^T)^{−1} = F^{−1} − (F^{−1} z z^T F^{−1}) / (1 + z^T F^{−1} z)

where the denominator of the right-hand term is a scalar.


If X and Z are two [N, p] matrices, from (B.9.13) one can derive the push-through identity [2]

X^T (I_N + Z X^T)^{−1} = (I_p + X^T Z)^{−1} X^T    (B.9.14)

Then for any [N, p] matrix X and scalar λ > 0

X^T (λ I_N + X X^T)^{−1} = (λ I_p + X^T X)^{−1} X^T    (B.9.15)
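A minimal R sketch checking the rank-one update formula numerically (the matrix and vector are random illustrative choices):

set.seed(0)
Fm <- crossprod(matrix(rnorm(16), 4, 4)) + diag(4)   # a nonsingular matrix
z <- rnorm(4)
Fi <- solve(Fm)
lhs <- solve(Fm + z %*% t(z))
rhs <- Fi - (Fi %*% z %*% t(z) %*% Fi) / drop(1 + t(z) %*% Fi %*% z)
max(abs(lhs - rhs))                                  # numerically zero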


Appendix C

Probability and statistical notions

C.1 Common univariate discrete probability functions

C.1.1 The Bernoulli trial

A Bernoulli trial is a random experiment with two possible outcomes, often called "success" and "failure". The probability of success is denoted by p and the probability of failure by (1 − p). A Bernoulli random variable z is a binary discrete r.v. associated with the Bernoulli trial. It takes z = 0 with probability (1 − p) and z = 1 with probability p. The probability function of z can be written in the form

Prob{z = z} = P_z(z) = p^z (1 − p)^{1−z},  z = 0, 1

Note that E[z] = p and Var[z] = p(1 − p).

C.1.2 The Binomial probability function

A binomial random variable represents the number of successes z in a fixed number N of independent Bernoulli trials with the same probability p of success for each trial. A typical example is the number z of heads in N tosses of a coin. The probability function of z ∼ Bin(N, p) is given by

Prob{z = z} = P_z(z) = (N choose z) p^z (1 − p)^{N−z},  z = 0, 1, ..., N    (C.1.1)

The mean of the probability function is µ = Np. Note that:

- the Bernoulli probability function is a special case (N = 1) of the binomial function,

- for small p, the probability of having at least 1 success in N trials is proportional to N, as long as Np is small,

- if z_1 ∼ Bin(N_1, p) and z_2 ∼ Bin(N_2, p) are independent then z_1 + z_2 ∼ Bin(N_1 + N_2, p).

The binomial distribution then returns the probability of z successes out of N draws with replacement. The probability of z successes out of N draws without replacement from a population of size P that contains k terms associated to success is returned by the hypergeometric distribution:

Prob{z = z} = (k choose z) (P−k choose N−z) / (P choose N).    (C.1.2)
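In R the two probability functions are available as dbinom and dhyper; a minimal sketch (in dhyper, the arguments m, n and k correspond to k, P − k and N of (C.1.2)):

dbinom(3, size = 10, prob = 0.2)     # Prob{z = 3} for z ~ Bin(10, 0.2)
choose(10, 3) * 0.2^3 * 0.8^7        # same value, directly from (C.1.1)
dhyper(3, m = 5, n = 15, k = 10)     # 3 successes in N = 10 draws without
                                     # replacement, with P = 20 and k = 5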


C.2 Common univariate continuous distributions

C.2.1 Uniform distribution

A random variable z is said to be uniformly distributed on the interval (a, b) (written as z ∼ U(a, b)) if its probability density function is given by

p(z) = 1/(b − a) if a < z < b,  0 otherwise

It can be shown that the skewness of a continuous random variable which is uniformly distributed is equal to 0.

Exercise
Show that the variance of U(a, b) is equal to (b − a)²/12.

C.2.2 The chi-squared distribution

It describes the distribution of squared normal r.v.s. An r.v. z has a χ²_N distribution if

z = x_1² + ··· + x_N²

where N ∈ N and x_1, x_2, ..., x_N are i.i.d. standard normal random variables N(0, 1). The distribution is called a chi-squared distribution with N degrees of freedom. Note also that:

- the probability distribution is a gamma distribution with parameters (N/2, 1/2),

- E[z] = N and Var[z] = 2N.

The χ²_N density and distribution function for N = 10 are plotted in Figure C.1 (R script chisq.R in the package gbcode).

C.2.3 Student's t-distribution

It describes the distribution of the ratio of normal and χ-squared r.v.s. If x ∼ N(0, 1) and y ∼ χ²_N are independent then the Student's t-distribution with N degrees of freedom is the distribution of the r.v.

z = x / √(y/N)    (C.2.3)

We denote this with z ∼ T_N. Note that E[z] = 0 and Var[z] = N/(N − 2) if N > 2.

The Student density and distribution function for N = 10 are plotted in Figure C.2 by means of the script stu.R in the package gbcode.

Figure C.1: χ²_N probability distribution (N = 10)

Figure C.2: Student probability distribution (N = 10)

Figure C.3: F probability distribution (N = 10)

C.2.4 F-distribution

It describes the distribution of the ratio of χ-squared r.v.s. Let x ∼ χ²_M and y ∼ χ²_N be two independent r.v.s. An r.v. z has a F-distribution with M and N degrees of freedom (written as z ∼ F_{M,N}) if

z = (x/M) / (y/N)    (C.2.4)

Note that if z ∼ F_{M,N} then 1/z ∼ F_{N,M}, while if z ∼ T_N then z² ∼ F_{1,N}. The F-density and distribution function are plotted in Figure C.3 by means of the script f.R in the package gbcode.
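The relations among these distributions can be checked with the R quantile functions; for instance, the square of a Student quantile is an F quantile, a minimal sketch:

N <- 10
qt(0.975, df = N)^2            # squared two-sided Student quantile of T_N
qf(0.95, df1 = 1, df2 = N)     # equals the corresponding F_{1,N} quantile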

C.3 Common statistical hypothesis tests

C.3.1 χ²-test: single sample and two-sided

Consider a random sample from N(µ, σ²) with µ known. Let

H: σ² = σ_0²;  H̄: σ² ≠ σ_0²

Let ŜS = Σ_i (z_i − µ)². From Section 5.7 it follows that if H is true then ŜS/σ_0² ∼ χ²_N (Section C.2.2).

The level α χ²-test rejects H if ŜS/σ_0² < a_1 or ŜS/σ_0² > a_2, where

Prob{ŜS/σ_0² < a_1} + Prob{ŜS/σ_0² > a_2} = α

A slight modification is necessary if µ is unknown. In this case you must replace µ with µ̂ in the quantity ŜS and use a χ²_{N−1} distribution.

C.3.2 t-test: two samples, two-sided

Consider two r.v.s x ∼ N(µ_1, σ²) and y ∼ N(µ_2, σ²) with the same variance. Let D_N^x ← x and D_M^y ← y be two independent sets of samples of size N and M, respectively. We want to test H: µ_1 = µ_2 against H̄: µ_1 ≠ µ_2. Let

µ̂_x = Σ_{i=1}^{N} x_i / N,  ŜS_x = Σ_{i=1}^{N} (x_i − µ̂_x)²,  µ̂_y = Σ_{i=1}^{M} y_i / M,  ŜS_y = Σ_{i=1}^{M} (y_i − µ̂_y)²

It can be shown that if H is true then the statistic

t(D_N) = (µ̂_x − µ̂_y) / √[ (1/M + 1/N) (ŜS_x + ŜS_y) / (M + N − 2) ]  ∼  T_{M+N−2}

It follows that the test of size α rejects H if

|t(D_N)| > t_{α/2, M+N−2}
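This is the test computed by the R function t.test when the pooled-variance option is used; a minimal sketch on simulated data:

set.seed(0)
x <- rnorm(20, mean = 0); y <- rnorm(30, mean = 0.5)
t.test(x, y, var.equal = TRUE)    # pooled two-sample t statistic, df = M + N - 2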

C.3.3 F-test: two samples, two-sided

Consider a random sample {x_1, ..., x_M} ← x ∼ N(µ_1, σ_1²) and a random sample {y_1, ..., y_N} ← y ∼ N(µ_2, σ_2²) with µ_1 and µ_2 unknown. Suppose we want to test

H: σ_1² = σ_2²;  H̄: σ_1² ≠ σ_2²

Let us consider the statistic

f = σ̂_1² / σ̂_2² = [ŜS_1/(M − 1)] / [ŜS_2/(N − 1)] ∼ [σ_1² χ²_{M−1}/(M − 1)] / [σ_2² χ²_{N−1}/(N − 1)] = (σ_1²/σ_2²) F_{M−1,N−1}

It can be shown that if H is true, the ratio f has a F-distribution F_{M−1,N−1} (Section C.2.4). The F-test rejects H if the ratio f is large, i.e. f > F_{α,M−1,N−1}, where

Prob{f > F_{α,M−1,N−1}} = α

if f ∼ F_{M−1,N−1}.
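The corresponding R function is var.test; a minimal sketch on simulated data:

set.seed(0)
x <- rnorm(25, sd = 1); y <- rnorm(30, sd = 2)
var.test(x, y)     # F statistic: ratio of the two sample variances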

C.4 Transformation of random variables and vectors

Theorem 4.1 (Jensen's inequality). Let x be a continuous r.v. and f a convex function. Then E[f(x)] ≥ f(E[x]), while if f is concave then E[f(x)] ≤ f(E[x]).

Given a [n × 1] constant vector a and a random vector z of dimension [n × 1] with expected value E[z] = µ and covariance matrix Var[z] = Σ, then

E[a^T z] = a^T µ,  Var[a^T z] = a^T Σ a

Also, if z ∼ N(µ, Σ) then a^T z ∼ N(a^T µ, a^T Σ a).

Given a [n × n] constant matrix A and a random vector z of dimension [n × 1] with expected value E[z] = µ and covariance matrix Var[z] = Σ, then

E[Az] = Aµ,  Var[Az] = A Σ A^T

R script

The relation above may be used to sample a [n, 1] random vector x with covariance Var[x] = Σ, starting from the sampling of a [n, 1] random vector z with Var[z] = I_n. If we factorise Σ = A A^T, then Var[x] = Var[Az] = A I_n A^T = Σ. In the script chol2cor.R, we first define the symmetric matrix Σ, then we sample the vector z N times into the dataset D_N and multiply D_N by A: it is possible to verify numerically that this is equivalent to sampling N times a vector x with covariance Σ. A minimal sketch of the same idea is given below.
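Here the 2 × 2 covariance matrix is an illustrative choice; since chol() returns the upper triangular factor, its transpose plays the role of A:

set.seed(0)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
A <- t(chol(Sigma))                       # Sigma = A %*% t(A)
Z <- matrix(rnorm(2 * 10000), 10000, 2)   # rows: samples of z with Var[z] = I
X <- Z %*% t(A)                           # each row is x = A z
cov(X)                                    # close to Sigma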

Theorem 4.2. Given a random vector z of dimension [n × 1] with expected value E[z] = µ and covariance matrix Var[z] = σ² I, for a generic matrix A of dimension [n × n] the following relation holds

E[z^T A z] = σ² tr(A) + µ^T A µ    (C.4.5)

where tr(A) is the trace of matrix A.


C.5 Correlation and covariance matrices

Given n r.v.s z_1, ..., z_n, the correlation matrix C is a symmetric positive-semidefinite [n, n] matrix whose (i, j) entry is the correlation coefficient ρ(z_i, z_j) (Equation (3.6.68)).

The following relation exists between the covariance matrix Σ of the n variables and the correlation matrix:

C = (diag(Σ))^{−1/2} Σ (diag(Σ))^{−1/2},   Σ = diag(σ_1, ..., σ_n) C diag(σ_1, ..., σ_n)

where σ_j denotes the standard deviation of z_j.

By using the formula above, the script corcov.R shows how it is possible to generate a set of examples with predefined pairwise correlation ρ̄.

C.6 Convergence of random variables

Let {z_N}, N = 1, 2, ..., be a sequence of random variables and let z be another random variable. Let F_N(·) denote the distribution function of z_N and F_z the distribution of z. We introduce the following definitions:

Definition 6.1 (Convergence in probability). We say that

lim_{N→∞} z_N = z in probability    (C.6.6)

and we note z_N →^P z if for each ε > 0

lim_{N→∞} P{|z_N − z| ≥ ε} = 0    (C.6.7)

Definition 6.2 (Convergence with probability one). We say that

lim_{N→∞} z_N = z with probability one (or almost surely)    (C.6.8)

and we note z_N →^{a.s.} z if

P{ω : lim_{N→∞} z_N(ω) = z(ω)} = 1    (C.6.9)

Definition 6.3 (Convergence in L_p). For a fixed number p ≥ 1 we say that

lim_{N→∞} z_N = z in L_p    (C.6.10)

if

lim_{N→∞} E[|z_N − z|^p] = 0    (C.6.11)

The following theorems hold:

Theorem 6.4. Convergence in L_p implies convergence in probability.

Theorem 6.5. Convergence with probability one implies convergence in probability.

Note however that convergence in probability does not imply convergence in L_2.

Definition 6.6 (Convergence in distribution). The sequence z_N converges in distribution to z, and we note z_N →^D z, if

lim_{N→∞} F_N(z) = F(z)    (C.6.12)

for all z for which F is continuous.

It can be shown that

Theorem 6.7. Convergence in probability implies convergence in distribution.

Note however that convergence in distribution does not imply convergence in probability.

As a summary:

z_N →^{a.s.} z  implies  z_N →^P z  implies  z_N →^D z


C.6.1 Example

Let z ∼ U(1, 2) and θ = 0. Consider the two estimators (stochastic processes) θ̂_N^(1) and θ̂_N^(2) for N → ∞ where

θ̂_N^(1) = exp(−zN),   θ̂_N^(2) = exp(−N) with probability 1 − 1/N, and θ̂_N^(2) = 1 with probability 1/N

For the first estimator, since z ∈ (1, 2), all the trajectories converge to θ (strongly consistent). For the second process, the trajectory which does not converge has a probability decreasing to zero for N → ∞ (weakly consistent).

C.7 The central limit theorem

Theorem 7.1. Assume that z_1, z_2, ..., z_N are i.i.d. random variables, discrete or continuous, each having the same probability distribution with finite mean µ and finite variance σ². As N → ∞, the standardised random variable

(z̄ − µ) √N / σ

which is identical to

(S_N − N µ) / (√N σ)

converges in distribution (Definition 6.6) to a r.v. having the standardised normal distribution N(0, 1).

This theorem, which holds regardless of the common distribution of the z_i, justifies the importance of the normal distribution, since many r.v.s of interest are either sums or averages. Think, for example, of the commute time of the example in Section 3.2, which can be considered as the combined effect of several causes.

An illustration of the theorem by simulation is obtained by running the R script central.R.
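A quick Monte Carlo illustration in the spirit of central.R (the numbers of samples and replicates below are arbitrary): averages of uniform r.v.s are standardised and compared with the normal density.

set.seed(0)
N <- 1000; nrep <- 5000
zbar <- replicate(nrep, mean(runif(N)))        # sample averages
std <- (zbar - 0.5) * sqrt(N) / sqrt(1 / 12)   # standardised: mu = 1/2, var = 1/12
hist(std, freq = FALSE, breaks = 50)
curve(dnorm(x), add = TRUE)                    # histogram approaches N(0, 1)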

C.8 The Chebyshev inequality

Let z be a generic random variable, discrete or continuous, having mean µ and variance σ². The Chebyshev inequality states that for any positive constant d

Prob{|z − µ| ≥ d} ≤ σ²/d²    (C.8.13)

An illustration of the Chebyshev inequality by simulation can be found in the R script cheby.R.

Note that if we put z equal to the quantity in (3.10.89), then from (3.10.90) and (C.8.13) we find

Prob{|z̄ − µ| ≥ d} ≤ σ²/(N d²)    (C.8.14)

i.e. the weak law of large numbers (Section 3.1.5). This law states that the average of a large sample converges in probability to the mean of the distribution.
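A quick Monte Carlo check in the spirit of cheby.R (the Gaussian choice is illustrative; the bound holds for any distribution with finite variance):

set.seed(0)
z <- rnorm(1e5)           # mu = 0, sigma = 1
d <- 2
mean(abs(z) >= d)         # empirical probability, about 0.046
1 / d^2                   # Chebyshev bound sigma^2/d^2 = 0.25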

C.9 Empirical distribution properties

Let (5.2.2) be the empirical distribution of z obtained from a dataset D_N. Note that, D_N being a random vector, the function F̂_z(·) is random, too. The following two properties (unbiasedness and consistency) are valid:


Theorem 9.1. For any fixed z,

E_{D_N}[F̂_z(z)] = F_z(z)    (C.9.15)

Var[F̂_z(z)] = F_z(z)(1 − F_z(z)) / N    (C.9.16)

Theorem 9.2 (Glivenko-Cantelli theorem).

sup_{−∞<z<∞} |F̂_z(z) − F_z(z)| → 0 almost surely as N → ∞    (C.9.17)

where the definition of almost sure convergence is in Appendix (Def. 6.2).

The two theoretical results can be simulated by running the R scripts cumdis2.R and cumdis1.R, respectively.
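Both properties can also be visualised with the built-in ecdf function of R; a minimal sketch:

set.seed(0)
z <- rnorm(200)
Fhat <- ecdf(z)                      # empirical distribution function
zs <- seq(-3, 3, by = 0.01)
max(abs(Fhat(zs) - pnorm(zs)))       # sup deviation, shrinking as N grows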

C.10 Useful relations

Some relations:

E[(z − µ)²] = σ² = E[z² − 2µz + µ²] = E[z²] − 2µE[z] + µ² = E[z²] − 2µµ + µ² = E[z²] − µ²

For N = 2,

E[(z_1 + z_2)²] = E[z_1²] + E[z_2²] + 2E[z_1 z_2] = 2E[z²] + 2µ² = 4µ² + 2σ²

For N = 3,

E[(z_1 + z_2 + z_3)²] = E[z_1²] + E[z_2²] + E[z_3²] + 2E[z_1 z_2] + 2E[z_1 z_3] + 2E[z_2 z_3] = 3E[z²] + 6µ² = 9µ² + 3σ²

In general, for N i.i.d. z_i, E[(z_1 + z_2 + ··· + z_N)²] = N²µ² + Nσ².

C.11 Minimum of expectation vs. expectation of minimum

Theorem 11.1. Let us consider M random variables z_m, m = 1, ..., M. Then

E[min_m z_m] ≤ min_m E[z_m]

Proof. For each m define the r.v. x_m = z_m − min_m z_m. Now E[x_m] ≥ 0 since z_m ≥ min_m z_m. Then E[z_m] − E[min_m z_m] ≥ 0. It follows that

∀m,  E[min_m z_m] ≤ E[z_m]

and then

E[min_m z_m] ≤ min_m E[z_m]

The difference min_m E[z_m] − E[min_m z_m] then quantifies the selection bias that occurs in a selection (e.g. by minimisation) process that relies on observed data in a random setting. The Monte Carlo sketch below illustrates the inequality.
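Here M standard normal variables have identical means, so min_m E[z_m] = 0, while E[min_m z_m] is clearly negative (sizes are illustrative):

set.seed(0)
M <- 5; nrep <- 10000
Z <- matrix(rnorm(M * nrep), nrep, M)   # each row: one draw of the M variables
mean(apply(Z, 1, min))                  # E[min_m z_m], about -1.16
min(colMeans(Z))                        # min_m E[z_m], about 0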


C.12 Taylor expansion of a function

Let J(·) be a function with p-dimensional argument α = [α_1, ..., α_p]. The Taylor expansion of the function J(·) about ᾱ can be written as follows:

J(α) = J(ᾱ) + Σ_{j=1}^{p} (α_j − ᾱ_j) ∂J(α)/∂α_j |_{α=ᾱ} + (1/2) Σ_{i=1}^{p} Σ_{j=1}^{p} (α_i − ᾱ_i)(α_j − ᾱ_j) ∂²J(α)/∂α_i∂α_j |_{α=ᾱ} + ...

which can be written in vector form as follows:

J(α) ≈ J(ᾱ) + (α − ᾱ)^T ∇J(ᾱ) + (1/2) (α − ᾱ)^T H(ᾱ) (α − ᾱ)

where ∇J(α) is the gradient vector and H(α) = [H_ij] is the [p, p] Hessian square matrix of all second-order derivatives

H_ij = ∂²J(α)/∂α_i∂α_j

C.13 Proof of Eq. (7.5.28)

Since the cost is quadratic, the input uniform density is π(x) = 1/4 on the interval [−2, 2], the regression function is x³ and the noise is i.i.d. with unit variance, from Eq. (7.2.3) we obtain

R(α) = ∫_{X,Y} C(y, αx) p_f(y|x) π(x) dy dx    (C.13.18)
     = ∫_{x=−2}^{2} ∫_Y (y − αx)² p_f(y|x) π(x) dx dy    (C.13.19)
     = ∫_{x=−2}^{2} ∫_W (x³ + w − αx)² p_w(w) (1/4) dx dw    (C.13.20)
     = (1/4) [ ∫_W p_w(w) dw ∫_{x=−2}^{2} (x³ − αx)² dx    (C.13.21)
       + ∫_{x=−2}^{2} dx ∫_W w² p_w(w) dw + ∫_W ∫_{x=−2}^{2} 2w (x³ − αx) p_w(w) dw dx ]    (C.13.22)
     = (1/4) [ ∫_{−2}^{2} (x³ − αx)² dx + 4σ_w² ]    (C.13.23)
     = (1/4) ∫_{−2}^{2} (x³ − αx)² dx + σ_w²    (C.13.24)

where the cross term vanishes since E[w] = 0.

C.14 Biasedness of the quadratic empirical risk

Consider a regression framework where y = f(x) + w, with E[w] = 0 and Var[w] = σ_w², and h_N an estimation of f obtained by minimising the empirical risk on a dataset D_N ← y. According to the derivation in [64], let us consider the quantity

g_N(x) = E_{D_N,y}[(y − h(x, α(D_N)))²] = E_{D_N}[E_y[(y − h_N)²]]

where h_N stands for h(x, α(D_N)). Since

(y − h_N)² = (y − f + f − h_N)² = (y − f)² + (f − h_N)² + 2(y − f)(f − h_N)

we obtain

(y − f)² + (f − h_N)² = (y − h_N)² + 2(y − f)(h_N − f)


Note that since E_y[y] = f, for a given h_N

E_y[(y − h_N)²] = E_y[(y − f)² + (f − h_N)² + 2(y − f)(f − h_N)] = E_y[(y − f)² + (f − h_N)²]

Since E_y[(y − f)²] = E_{D_N}[(y − f)²] and

(y − h)² = (y − f + f − h)² = (y − f)² + (f − h)² − 2(y − f)(h − f)

it follows that

(y − f)² + (f − h)² = (y − h)² + 2(y − f)(h − f)

and

g_N(x) = E_{D_N}[E_y[(y − h_N)²]] = E_{D_N}[E_y[(y − f)²] + (f − h_N)²] = E_{D_N}[(y − f)² + (f − h_N)²] = E_{D_N}[(y − h_N)² + 2(y − f)(h_N − f)]

By averaging the quantity g_N(x) over the X domain we obtain

MISE = E_{D_N,y,x}[(y − h(x, α(D_N)))²] = E_{D_N,x}[(y − h_N)²] + 2 E_{D_N,x}[(y − f)(h_N − f)] = E_{D_N}[M̂ISE_emp] + 2 Cov[h_N, y]

where Cov[h_N, y] = E_{D_N,x}[(y − f)(h_N − f)] and M̂ISE_emp is the quantity (7.2.7) for a quadratic error loss. This means we have to add a covariance penalty term to the apparent error M̂ISE_emp in order to have an unbiased estimate of MISE.

Suppose that h_N is a linear estimator, i.e.

h_N = S y

where S is known as the smoother matrix. Note that in least-squares regression S is the Hat matrix H = X(X^T X)^{−1} X^T. In the linear case, since H^T = H,

N Cov[h_N, y] = E_{D_N}[(Y − F)^T (HY − F)]
             = E_{D_N}[Y^T H Y − Y^T F − F^T H Y + F^T F]
             = σ² tr(H) + F^T H F − F^T F − F^T H F + F^T F
             = σ² tr(H) = σ² tr((X^T X)^{−1} X^T X) = σ² p

where tr(H) is the trace of the matrix H, Y is a random vector of size [N, 1], F is the vector of the N regression function values f(x_i), and we used (C.4.5) together with E[Y] = F. It follows that Cov[h_N, y] = σ² p/N and then the C_p formula (8.8.33). Note that the trace of H is also known as the effective number of parameters.
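A minimal R sketch of the last identity: the trace of the hat matrix equals the number p of parameters (sizes are illustrative):

set.seed(0)
N <- 50; p <- 3
X <- matrix(rnorm(N * p), N, p)
H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
sum(diag(H))                            # equals p = 3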

Appendix D

Plug-in estimators

This appendix contains the expression of the plug-in estimators of some interesting parameters:

- Skewness of a random variable z: given a dataset D_N = {z_1, ..., z_N}, the plug-in estimate of the skewness (3.3.36) is

  γ̂ = Σ_{i=1}^{N} (z_i − µ̂)³ / (N σ̂³)    (D.0.1)

  where µ̂ and σ̂ are defined in (5.3.4) and (5.3.5), respectively.

- Kurtosis of a random variable z: given a dataset D_N = {z_1, ..., z_N}, the plug-in estimate of the kurtosis (3.3.37) is

  κ̂ = Σ_{i=1}^{N} (z_i − µ̂)⁴ / (N σ̂⁴)    (D.0.2)

  where µ̂ and σ̂ are defined in (5.3.4) and (5.3.5), respectively.

- Correlation of two random variables x and y: given a dataset D_N = {⟨x_1, y_1⟩, ..., ⟨x_N, y_N⟩} where x_i ∈ R, y_i ∈ R, the plug-in estimate of the correlation (3.6.68) is

  ρ̂ = Σ_{i=1}^{N} (x_i − µ̂_x)(y_i − µ̂_y) / (N σ̂_x σ̂_y)    (D.0.3)

  where µ̂_x (µ̂_y) and σ̂_x² (σ̂_y²) denote the sample mean and sample variance of x (y).

- Covariance matrix of a n-dimensional random vector z: given a dataset D_N = {z_1, ..., z_N} where z_i = [z_i1, ..., z_in]^T is a [n, 1] vector, the plug-in estimator of the covariance matrix (3.7.72) is the [n, n] matrix

  Σ̂ = Σ_{i=1}^{N} (z_i − µ̂)(z_i − µ̂)^T / (N − 1)    (D.0.4)

  whose jk entry is

  Σ̂_jk = Σ_{i=1}^{N} (z_ij − µ̂_j)(z_ik − µ̂_k) / (N − 1)

  and µ̂ is the [n, 1] vector

  µ̂ = Σ_{i=1}^{N} z_i / N

  and

  µ̂_j = Σ_{i=1}^{N} z_ij / N.

  Note that (D.0.4) can also be written in matrix form as

  Σ̂ = (Z − 1_N µ̂^T)^T (Z − 1_N µ̂^T) / (N − 1)

  where Z is a [N, n] matrix whose i-th row is z_i^T and 1_N is a [N, 1] vector of ones.


- Correlation matrix of a n-dimensional random vector z: the correlation matrix is a symmetric [n, n] matrix whose jk entry is the correlation between the scalar random variables z_j and z_k. Given a dataset D_N = {z_1, ..., z_N} where z_i = [z_i1, ..., z_in]^T is a [n, 1] vector, the plug-in estimator can be written as the covariance¹

  P̂ = Z̃^T Z̃ / N

  of the scaled matrix

  Z̃ = C Z D^{−1}

  where

  C = I_N − 1_N 1_N^T / N

  is the [N, N] centering matrix, I_N is the identity matrix, 1_N is a [N, 1] vector of ones and

  D = diag(σ̂_1, ..., σ̂_n)

  is a diagonal [n, n] scaling matrix where σ̂_j² is the sample variance of z_j.

  The diagonal entries of P̂ are all 1. The jk entry (j ≠ k) of the matrix P̂ can also be obtained by applying (D.0.3) to the j-th and k-th columns of Z. A numerical check is sketched below.

¹ See also http://users.stat.umn.edu/~helwig/notes/datamat-Notes.pdf
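A minimal R sketch of the centering/scaling construction; the result coincides with the built-in cor() since the 1/N factors cancel in the ratio:

set.seed(0)
N <- 1000; n <- 3
Z <- matrix(rnorm(N * n), N, n)
C <- diag(N) - matrix(1, N, N) / N        # centering matrix
D <- diag(apply(Z, 2, function(z) sqrt(mean((z - mean(z))^2))))
Zt <- C %*% Z %*% solve(D)                # scaled matrix
P.hat <- t(Zt) %*% Zt / N                 # plug-in correlation matrix
max(abs(P.hat - cor(Z)))                  # numerically zero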

Appendix E

Kernel functions

A kernel function K is a nonnegative function

K : R^n × R^n × R^+ → R^+

where the first argument is a n-dimensional input, the second argument is typically called the center and the third argument is called width or bandwidth. Once a distance function between the input and the center

d : R^n × R^n → R^+    (E.0.1)

is defined, the kernel function can be expressed as a function

K : R^+ × R^+ → R^+    (E.0.2)

of the distance d and the bandwidth parameter. The maximum value of a kernel function is located at zero distance, and the function decays smoothly as the distance increases. Here are some examples of kernel functions:

- Inverse distance:

  K(d, B) = 1 / (d/B)^p    (E.0.3)

  This function goes to infinity as the distance approaches zero.

- Corrected inverse distance:

  K(d, B) = 1 / (1 + (d/B)^p)    (E.0.4)

- Gaussian kernel:

  K(d, B) = exp(−d²/B²)    (E.0.5)

- Exponential kernel:

  K(d, B) = exp(−d/B)    (E.0.6)

- Quadratic or Epanechnikov kernel:

  K(d, B) = 1 − (d/B)² if |d| < B, 0 otherwise    (E.0.7)

- Tricube kernel:

  K(d, B) = (1 − (d/B)³)³ if |d| < B, 0 otherwise    (E.0.8)

- Uniform kernel:

  K(d, B) = 1 if |d| < B, 0 otherwise    (E.0.9)

- Triangular kernel:

  K(d, B) = 1 − |d/B| if |d| < B, 0 otherwise    (E.0.10)
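A minimal R sketch of two of these kernels as functions of the distance (the bandwidth B = 1 is an illustrative choice):

gauss.kernel <- function(d, B) exp(-d^2 / B^2)                       # (E.0.5)
epan.kernel  <- function(d, B) ifelse(abs(d) < B, 1 - (d / B)^2, 0)  # (E.0.7)
d <- seq(0, 2, by = 0.01)
plot(d, gauss.kernel(d, 1), type = "l", ylab = "K(d, B)")
lines(d, epan.kernel(d, 1), lty = 2)    # vanishes at d = B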


Appendix F

Companion R package

Several scripts are used in the main text to illustrate statistical and machine learning

notions. All the scripts have been implemented in R and are contained in the R package

gbcode.

To install the R package gbcode containing all the scripts mentioned in the text you

should run the following R instructions in the R console.

> library(devtools)

> install_github("gbonte/gbcode")

> require(gbcode)

Once installed, all the scripts will be available in the root directory of the package. In

order to retrieve the directory containing the gbcode package you should type

> system.file(package = "gbcode")

To change the directory to the one containing scripts

> setwd(find.package("gbcode"))

If you wish to run a script mentioned in the main text (e.g. the script freq.R) without

changing the local directory you should run

> source(system.file("scripts","freq.R",package = "gbcode"))

If you wish to edit a script mentioned in the main text (e.g. the script freq.R) without

changing the local directory you should run

> edit(file=system.file("scripts","freq.R",package = "gbcode"))

If you wish to execute a Shiny dashboard (e.g. leastsquares.R) you should run

> library(shiny)

> source(system.file("shiny","leastsquares.R",package = "gbcode"))


Appendix G

Companion R Shiny dashboards

Several Shiny dashboards are used in the main text to illustrate statistical and machine

learning notions. All the Shiny dashboards are contained in the directory shiny of the R

package gbcode and require the installation of the library shiny. To run a Shiny dashboard

(e.g. condpro.R ) you should first move to their directory by

> setwd(paste(find.package("gbcode"),"shiny",sep="/"))

and then run

> runApp("condpro.R")

The Shiny dashboards are also active under Shinyapps¹. To run a Shiny dashboard named NAME.R go to https://gbonte.shinyapps.io/NAME. For instance, to run the Shiny dashboard condpro.R go to:

https://gbonte.shinyapps.io/condpro

G.1 List of Shiny dashboards

mcarlo.R: visualisation by means of Monte Carlo simulation of:

1. transformation of a r.v.

2. result of an operation on two r.v.s

3. central limit theorem

4. result of a linear combination of two independent r.v.s

condpro.R: visualisation of conditional probability vs. marginal probability in the bivariate Gaussian case and in the regression function case.

estimation.R: visualisation of different problems of estimation:

1. estimation of mean and variance of a univariate normal r.v.: bias/variance

visualisation

2. estimation of mean and variance of a univariate uniform r.v.: bias/variance

visualisation

3. estimation of confidence interval of the mean of a univariate normal r.v.

4. maximum-likelihood estimation of the mean of a univariate normal r.v.: visual-

isation of the log-likelihood function together with the value of the maximum-

likelihood estimator

¹ https://www.shinyapps.io


5. maximum-likelihood estimation of the mean and the variance of a univariate

normal r.v.: visualisation of the bivariate log-likelihood function together with

the value of the maximum-likelihood estimator

6. estimation of mean and covariance of a bivariate normal r.v.: bias/variance

visualisation

7. least-squares estimation of the parameters of a linear target function: visu-

alisation of bias/variance of the predicted conditional expectation and of the

parameter estimators

8. least-squares estimation of the parameters of a nonlinear target function: vi-

sualisation of bias/variance of the predicted conditional expectation

bootstrap.R: study of the accuracy of the bootstrap estimation of the sampling

distribution, estimator variance and estimator bias. The dashboard considers the

case of sample average for which it is known that bias is null and variance is inversely

proportional to N (Section 5.5.3). The dashboard shows that the bootstrap returns

an accurate estimation of bias and variance of sample average.

leastsquares.R: visualisation of the minimisation of the empirical risk with gradient-

based iteration. 3 dashboards:

1. linear least-squares: visualisation of the convex empirical risk function and

position of the estimation as gradient-based iteration proceeds

2. NNet least-squares: visualisation of the estimated regression function (single

layer, 3 hidden nodes NNET) and the associated empirical risk as gradient-

based iteration proceeds

3. KNN cross-validation: illustration of the points used for test as cross-validation

proceeds in the case of a KNN regressor (variable number of neighbours).

regression.R: visualisation of the model selection trade-off in regression by show-

ing the impact of different kinds of hyper-parameters (degree of polynomial model,

number of neighbours in locally constant and locally linear fitting, number of trees

in Random Forest) on the bias, variance and generalisation error.

classif.R: visualisation of different classification notions in 4 dashboards:

1. Univariate: visualise the relation between posterior probability and class con-

ditional densities in a univariate binary classification task

2. Linear discriminant: visualise the relation between bivariate class conditional

densities and linear discriminant

3. Perceptron: visualise the evolution of the perceptron hyperplane during the gradient-based minimisation of the misclassification cost, together with the SVM hyperplane

4. Assessment: visualise the relation between ROC curve, PR curve, confusion

matrix and classifier threshold in a univariate binary classification task.

classif2.R: visualisation of direct and inverse conditional distributions in the unimodal and bimodal case.

Bibliography

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Charu C. Aggarwal. Linear Algebra and Optimization for Machine Learning - A

Textbook. Springer, 2020.

[3] D. W. Aha. Incremental, instance-based learning of independent and graded concept

descriptions. In Sixth International Machine Learning Workshop , pages 387–391, San

Mateo, CA, 1989. Morgan Kaufmann.

[4] D. W. Aha. A Study of Instance-Based Algorithms for Supervised Learning Tasks:

Mathematical, Empirical and Psychological Observations. PhD thesis, University of

California, Irvine, Department of Information and Computer Science, 1990.

[5] D. W. Aha. Editorial of special issue on lazy learning. Artificial Intelligence Review,

11(1–5):1–6, 1997.

[6] H. Akaike. Fitting autoregressive models for prediction. Annals of the Institute of Statistical Mathematics, 21:243–247, 1969.

[7] D. M. Allen. The relationship between variable and data augmentation and a method

of prediction. Technometrics, 16:125–127, 1974.

[8] Christophe Ambroise and Geoffrey J. McLachlan. Selection bias in gene extraction

on the basis of microarray gene-expression data. PNAS , 99(10):6562–6566, 2002.

[9] B. D. O. Anderson and M. Deistler. Identifiability in dynamic errors-in-variables

models. Journal of Time Series Analysis , 5:1–13, 1984.

[10] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial

Intelligence Review, 11(1–5):11–73, 1997.

[11] R. Babuska. Fuzzy Modeling and Identification. PhD thesis, Technische Universiteit

Delft, 1996.

[12] R. Babuska and H. B. Verbruggen. Fuzzy set methods for local modelling and

identification. In R. Murray-Smith and T. A. Johansen, editors, Multiple Model

Approaches to Modeling and Control, pages 75–100. Taylor and Francis, 1997.

[13] D. Barber. Bayesian reasoning and machine learning. Cambridge University Press,

2012.

[14] A. R. Barron. Predicted squared error: a criterion for automatic model selection. In

S. J. Farlow, editor, Self-Organizing Methods in Modeling, volume 54, pages 87–103,

New York, 1984. Marcel Dekker.

[15] Thomas Lumley based on Fortran code by Alan Miller. leaps: Regression Subset

Selection, 2020. R package version 3.1.


[16] W. G. Baxt. Improving the accuracy of an artificial neural network using multiple

differently trained networks. Neural Computation , 4:772–780, 1992.

[17] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey, 2015. arXiv:1502.05767.

[18] M. G. Bello. Enhanced training algorithms, and integrated training/architecture se-

lection for multilayer perceptron networks. IEEE Transactions on Neural Networks,

3(6):864–875, 1992.

[19] H. N. Bensusan. Automatic bias learning: an inquiry into the inductive basis of

induction. PhD thesis, University of Sussex, 1999.

[20] H. Bersini and G. Bontempi. Fuzzy models viewed as multi-expert networks. In

IFSA '97 (7th International Fuzzy Systems Association World Congress, Prague),

pages 354–359, Prague, 1997. Academia.

[21] H. Bersini and G. Bontempi. Now comes the time to defuzzify the neuro-fuzzy

models. Fuzzy Sets and Systems , 90(2):161–170, 1997.

[22] H. Bersini, G. Bontempi, and C. Decaestecker. Comparing RBF and fuzzy inference systems on theoretical and practical basis. In F. Fogelman-Soulié and P. Gallinari, editors, ICANN '95, International Conference on Artificial Neural Networks, pages 169–174, 1995.

[23] M. Birattari and G. Bontempi. The lazy package for r. lazy learning for local regres-

sion. Technical Report 38, IRIDIA ULB, 2003.

[24] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least-

squares algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS 11,

pages 375–381, Cambridge, 1999. MIT Press.

[25] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK, 1995.

[26] S. Bittanti. Model Identification and Data Analysis. Wiley, 2019.

[27] Joseph K. Blitzstein and Jessica Hwang. Introduction to Probability, Second Edition. Chapman and Hall/CRC, 2019.

[28] G. Bontempi. A blocking strategy to improve gene selection for classification of gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4(2):293–300, 2007.

[29] G. Bontempi and H. Bersini. Identification of a sensor model with hybrid neuro-fuzzy

methods. In A. B. Bulsari and S. Kallio, editors, Neural Networks in Engineering

systems (Proceedings of the 1997 International Conference on Engineering Applica-

tions of Neural Networks (EANN '97), Stockolm, Sweden), pages 325–328, 1997.

[30] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for modeling and control

design. International Journal of Control , 72(7/8):643–658, 1999.

[31] G. Bontempi, M. Birattari, and H. Bersini. A model selection approach for local

learning. Artificial Intelligence Communications , 121(1), 2000.

[32] G. Bontempi, B. Haibe-Kains, C. Desmedt, C. Sotiriou, and J. Quackenbush.

Multiple-input multiple-output causal strategies for gene selection. BMC bioinfor-

matics, 12(1):458, 2011.

[33] G. Bontempi and P.E. Meyer. Causal filter selection in microarray data. In Proceed-

ing of the ICML2010 conference, 2010.

[34] G. Bontempi, C. Olsen, and M. Flauder. D2C: Predicting Causal Direction from

Dependency Features, 2014. R package version 1.1.

[35] Gianluca Bontempi and Maxime Flauder. From dependency to causality: A machine

learning approach. Journal of Machine Learning Research, 16:2437–2457, 2015.

[36] L. Breiman. Stacked regressions. Machine Learning, 24(1):49–64, 1996.

[37] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and

Regression Trees. Wadsworth International Group, Belmont, CA, 1984.


[38] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[39] D. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive

networks. Complex Systems , 2:321–355, 1988.

[40] Gavin C. Cawley. Over-fitting in model selection and its avoidance. In Jaakko Hollmén, Frank Klawonn, and Allan Tucker, editors, IDA, volume 7619 of Lecture Notes in Computer Science, page 1. Springer, 2012.

[41] A. Chalmers. What is this thing called science? (new and extended) . Open University

Press, 2012.

[42] V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and Methods.

Wiley, New York, 1998.

[43] F. Chollet and J.J. Allaire. Deep Learning with R. Manning, 2018.

[44] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots.

Journal of the American Statistical Association, 74:829–836, 1979.

[45] W. S. Cleveland and S. J. Devlin. Locally weighted regression: an approach to

regression analysis by local fitting. Journal of American Statistical Association,

83:596–610, 1988.

[46] W. S. Cleveland and C. Loader. Smoothing by local regression: Principles and

methods. Computational Statistics , 11, 1995.

[47] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

[48] P. Craven and G. Wahba. Smoothing noisy data with spline functions: Estimat-

ing the correct degree of smoothing by the method of generalized cross-validation.

Numer. Math., 31:377–403, 1979.

[49] G. Cybenko. Just-in-time learning and estimation. In S. Bittanti and G. Picci,

editors, Identification, Adaptation, Learning. The Science of Learning Models from

data, NATO ASI Series, pages 423–434. Springer, 1996.

[50] Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca

Bontempi. Credit card fraud detection: a realistic modeling and a novel learning

strategy. IEEE transactions on neural networks and learning systems, 29(8):3784–

3797, 2017.

[51] Andrea Dal Pozzolo, Olivier Caelen, and Gianluca Bontempi. When is undersam-

pling effective in unbalanced classification tasks? In Joint European Conference on

Machine Learning and Knowledge Discovery in Databases, pages 200–215. Springer,

Cham, 2015.

[52] Andrea Dal Pozzolo, Olivier Caelen, Yann-Ael Le Borgne, Serge Waterschoot, and

Gianluca Bontempi. Learned lessons in credit card fraud detection from a practi-

tioner perspective. Expert systems with applications, 41(10):4915–4928, 2014.

[53] Peter Dalgaard. Introductory statistics with R. Springer, 2002.

[54] P. Daniusis, D. Janzing, J. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Schölkopf. Inferring deterministic causal relations. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI-2010), pages 143–150, 2010.

[55] A. Dean and D. Voss. Design and Analysis of Experiments. Springer Verlag, New

York, NY, USA, 1999.

[56] Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for

Machine Learning. Cambridge University Press, 2020.

[57] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete

data via the em algorithm. Journal of the Royal Statistical Society, B, 39(1):1–38,

1977.

[58] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer Verlag, 1996.

[59] N. R. Draper and H. Smith. Applied Regression Analysis. John Wiley and Sons,

New York, 1981.


[60] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1976.

[61] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification. John Wiley and

sons, 2001.

[62] B. Efron. Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1):1–26, 1979.

[63] B. Efron. The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, 1982. Monograph 38.

[64] B. Efron. The estimation of prediction error: Covariance penalties and cross-validation. Journal of the American Statistical Association, 99(467):619–642, 2004.

[65] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall,

New York, NY, 1993.

[66] B. Efron and R. J. Tibshirani. Cross-validation and the bootstrap: estimating the

error rate of a prediction rule. Technical report, Stanford University, 1995.

[67] Gamaleldin F. Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alexey Kurakin, Ian J. Goodfellow, and Jascha Sohl-Dickstein. Adversarial examples that fool both computer vision and time-limited humans. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, NeurIPS, pages 3914–3924, 2018.

[68] J. Fan and I. Gijbels. Variable bandwidth and local linear regression smoothers. The

Annals of Statistics, 20(4):2008–2036, 1992.

[69] J. Fan and I. Gijbels. Adaptive order polynomial fitting: bandwidth robustification

and bias reduction. J. Comp. Graph. Statist. , 4:213–227, 1995.

[70] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman

and Hall, 1996.

[71] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting

useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34,

November 1996.

[72] V. Fedorov. Theory of Optimal Experiments. Academic Press, 1972.

[73] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1):3133–3181, January 2014.

[74] F. Fleuret. Fast binary feature selection with conditional mutual information. Jour-

nal of Machine Learning Research, 5:1531–1555, 2004.

[75] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learn-

ing and an application to boosting. Journal of Computer and System Sciences,

55(1):119–139, 1997.

[76] J. H. Friedman. Flexible metric nearest neighbor classification. Technical report,

Stanford University, 1994.

[77] Jerome H. Friedman. Stochastic gradient boosting. Comput. Stat. Data Anal.,

38(4):367–378, February 2002.

[78] A. Gelman. Bayesian Data Analysis . Chapman and Hall, 2004.

[79] Z. Ghahramani and M. I. Jordan. Supervised learning from incomplete data via an

em approach. In J. D. Cowan, G. T. Tesauro, and J. Alspector, editors, Advances in

Neural Information Processing Systems, volume 6, pages 120–127, San Mateo, CA,

1994. Morgan Kaufmann.

[80] Clark Glymour, Kun Zhang, and Peter Spirtes. Review of causal discovery methods

based on graphical models. Frontiers in Genetics, 10:524, 2019.

[81] P. Godfrey-Smith. Theory and reality: an introduction to the philosophy of science.

The University of Chicago Press, 2003.

[82] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning.

Addison-Wesley, Reading, MA, 1989.


[83] G.H. Golub and C.F. Van Loan. Matrix computations . Johns Hopkins University

Press, 1996.

[84] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, and M. Gaasenbeek. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.

[85] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press,

2016. http://www.deeplearningbook.org.

[86] S. Guo and M.W. Fraser. Propensity Score Analysis: Statistical Methods and Appli-

cations. SAGE, 2014.

[87] I. Guyon. Results and analysis of the 2013 ChaLearn cause-effect pair challenge.

JMLR Workshop and Conference Proceedings, 2014.

[88] I. Guyon, C. Aliferis, and A. Elisseeff. Computational Methods of Feature Selection,

chapter Causal Feature Selection, pages 63–86. Chapman and Hall, 2007.

[89] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal

of Machine Learning Research, 3:1157–1182, 2003.

[90] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal

of Machine Learning Research, 3:1157–1182, 2003.

[91] Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lotfi A. Zadeh. Feature Ex-

traction: Foundations and Applications. Springer-Verlag New York, Inc., 2006.

[92] B. Haibe-Kains, C. Desmedt, S. Loi, M. Delorenzi, C. Sotiriou, and G. Bontempi.

Computational Intelligence in Clinical Oncology: Lessons Learned from an Analysis

of a Clinical Study, pages 237–268. Springer Berlin Heidelberg, Berlin, Heidelberg,

2008.

[93] D. J. Hand. Discrimination and classification. John Wiley, New York, 1981.

[94] D.J. Hand. Statistics: a very short introduction, volume 196. Oxford University

Press, USA, 2008.

[95] W. Hardle and J. S. Marron. Fast and simple scatterplot smoothing. Comp. Statist.

Data Anal., 20:1–17, 1995.

[96] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall,

London, UK, 1990.

[97] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification.

IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–615,

1996.

[98] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning.

Springer, 2001.

[99] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning (2nd

edition). Springer, 2009.

[100] Trevor Hastie and Robert Tibshirani. Efficient quadratic regularization for expres-

sion arrays. Biostatistics (Oxford, England), 5(3):329–40, Jul 2004.

[101] J. S. U. Hjorth. Computer Intensive Statistical Methods. Chapman and Hall, 1994.

[102] Tin Kam Ho. The random subspace method for constructing decision forests. IEEE

Trans. Pattern Anal. Mach. Intell., 20(8):832–844, 1998.

[103] W. Hoeffding. Probability inequalities for sums of bounded random variables. Jour-

nal of American Statistical Association, 58:13–30, 1963.

[104] P.O. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, pages 689–696, 2009.

[105] P. J. Huber. Robust Statistics . Wiley, New York, 1981.

[106] P. Hurley. A Concise Introduction to Logic. CENGAGE Learning Custom Publish-

ing, 2011.


[107] A. K. Jain, R. C. Dubes, and C. Chen. Bootstrap techniques for error estimation.

IEEE Transactions on Pattern Analysis and Machine Intelligence, 5:628–633, 1987.

[108] J.-S. R. Jang. Anfis: Adaptive-network-based fuzzy inference systems. IEEE Trans-

actions on Fuzzy Systems, 23(3):665–685, 1993.

[109] J. S. R. Jang, C. T. Sun, and E. Mizutani. Neuro-Fuzzy and Soft Computing . Matlab

Curriculum Series. Prentice Hall, 1997.

[110] E. T. Jaynes. Probability theory: the logic of science. Cambridge University Press, 2003.

[111] T. A. Johansen and B. A. Foss. Constructing narmax models using armax models.

International Journal of Control, 58:1125–1153, 1993.

[112] M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection

for density estimation. Journal of American Statistical Association, 90, 1995.

[113] M. I. Jordan and T. J. Sejnowski, editors. Graphical models: foundations of neural

computation. The MIT Press, 2001.

[114] V. Y. Katkovnik. Linear and nonlinear methods of nonparametric regression analysis.

Soviet Automatic Control, 5:25–34, 1979.

[115] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing.

Science, 220(4598):671–680, 1983.

[116] R. Kohavi. A study of cross-validation and bootstrap for accuracy estima-

tion and model selection. In Proceedings of IJCAI-95, 1995. available at

http://robotics.stanford.edu/users/ronnyk/ronnyk-bib.html.

[117] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelli-

gence, 97(1-2):273–324, 1997.

[118] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelli-

gence, 97(1-2):273–324, 1997.

[119] D. Koller and N. Friedman. Probabilistic graphical models. The MIT Press, 2009.

[120] A. N. Kolmogorov. Foundations of Probability. Berlin, 1933.

[121] J. Kolodner. Case-Based Reasoning. Morgan Kaufmann, 1993.

[122] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature,

521(7553):436–444, 2015.

[123] Raoul LePage and Lynne Billard. Exploring the limits of bootstrap. John Wiley &

Sons, 1992.

[124] R. J.A. Little and D. B. Rubin. Statistical analysis with missing data. Wiley, 2002.

[125] L. Ljung. System identification: Theory for the User . Prentice-Hall, Englewood

Cliffs, NJ, 1987.

[126] Sherene Loi, Benjamin Haibe-Kains, Christine Desmedt, Pratyaksha Wirapati, Françoise Lallemand, Andrew M Tutt, Cheryl Gillet, Paul Ellis, Kenneth Ryder, James F Reid, Gianluca Bontempi, et al. Predicting prognosis using molecular profiling in estrogen receptor-positive breast cancer treated with tamoxifen. BMC genomics, 9(1):1–12, 2008.

[127] D. Lopez-Paz. From Dependence to Causation. PhD thesis, Cambridge University,

2016.

[128] D. G. Luenberger. Linear and Nonlinear Programming. Addison Wesley, Reading,

MA, 1984.

[129] C. Mallows. Discussion of a paper of beaton and tukey. Technometrics, 16:187–188,

1974.

[130] C. L. Mallows. Some comments on C_p. Technometrics, 15:661, 1973.

[131] O. Maron and A. Moore. The racing algorithm: Model selection for lazy learners.

Artificial Intelligence Review, 11(1–5):193–225, 1997.

[132] A. Miller. Subset Selection in Regression (2nd ed.). Chapman and Hall, 2002.


[133] J. Moody. The effective number of parameters: An analysis of generalization and

regularization in nonlinear learning systems. In J. Moody, Hanson, and Lippmann,

editors, Advances in Neural Information Processing Systems, volume 4, pages 847–

854, Palo Alto, 1992. Morgan Kaufmann.

[134] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing

units. Neural Computation , 1(2):281–294, 1989.

[135] A. W. Moore, D. J. Hill, and M. P. Johnson. An empirical investigation of brute force

to choose features, smoothers and function approximators. In S. Janson, S. Judd, and

T. Petsche, editors, Computational Learning Theory and Natural Learning Systems,

volume 3. MIT Press, Cambridge, MA, 1992.

[136] K. P. Murphy. An introduction to graphical models. Technical report, 2001.

[137] R. Murray-Smith. A local model network approach to nonlinear modelling. PhD

thesis, Department of Computer Science, University of Strathclyde, Strathclyde,

UK, 1994.

[138] R. Murray-Smith and T. A. Johansen. Local learning in local model networks.

In R. Murray-Smith and T. A. Johansen, editors, Multiple Model Approaches to

Modeling and Control, chapter 7, pages 185–210. Taylor and Francis, 1997.

[139] R. H. Myers. Classical and Modern Regression with Applications. PWS-KENT

Publishing Company, Boston, MA, second edition, 1994.

[140] E. Nadaraya. On estimating regression. Theory of Prob. and Appl. , 9:141–142, 1964.

[141] Cathy O'Neil. Weapons of Math Destruction. Crown, New York, 2016.

[142] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw Hill,

1991.

[143] Simon Parsons and Anthony Hunter. A Review of Uncertainty Handling Formalisms,

pages 8–37. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.

[144] E. Parzen. On estimation of a probability density function and mode. Annals of

Mathematical Statistics, 33:1065–1076, 1962.

[145] Y. Pawitan. In all likelihood: statistical modelling and inference using likelihood.

Oxford Science, 2001.

[146] J. Pearl. Causality. Cambridge University Press, 2000.

[147] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, San Francisco, Calif., 2009.

[148] Judea Pearl. Comment: Understanding Simpson's paradox. The American Statistician, 68:8–13, 2014.

[149] Judea Pearl and Dana Mackenzie. The Book of Why. Basic Books, New York, 2018.

[150] Jean-Philippe Pellet and André Elisseeff. Using Markov blankets for causal structure learning. J. Mach. Learn. Res., 9:1295–1342, 2008.

[151] J.P. Pellet and A. Elisseeff. Using Markov blankets for causal structure learning. Journal of Machine Learning Research, 9:1295–1342, 2008.

[152] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information:

criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions

on Pattern Analysis and Machine Intelligence, 27, 2005.

[153] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for

hybrid neural networks. In R. J. Mammone, editor, Artificial Neural Networks for

Speech and Vision, pages 126–142. Chapman and Hall, 1993.

[154] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2017.

[155] D. Plaut, S. Nowlan, and G. E. Hinton. Experiments on learning by back propaga-

tion. Technical Report CMU-CS-86-126, Department of Computer Science, Carnegie

Mellon University, Pittsburgh, PA, 1986.


[156] M. J. D. Powell. Algorithms for Approximation , chapter Radial Basis Functions

for multivariable interpolation: a review, pages 143–167. Clarendon Press, Oxford,

1987.

[157] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical

Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1992.

Second ed.

[158] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.

[159] J. R. Quinlan. Simplifying decision trees. International Journal of Man-Machine Studies, 27:221–234, 1987.

[160] R Development Core Team. R: A language and environment for statistical computing.

R Foundation for Statistical Computing, Vienna, Austria, 2004. ISBN 3-900051-07-0.

[161] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.

[162] M. Rosenblatt. Remarks on some nonparametric estimates of a density function.

Annals of Mathematical Statistics, 27:832–837, 1956.

[163] K. J. Rothman, S. Greenland, and Timothy L. Lash. Modern Epidemiology . Lippin-

cott Williams & Wilkins, Philadelphia, PA, 3rd edition, 2008.

[164] D. B. Rubin. Inference and missing data (with discussion). Biometrika , 63:581–592,

1976.

[165] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(9):533–536, 1986.

[166] S. Russell and Peter Norvig. Artificial Intelligence: a modern approach. Pearson, 2016.

[167] Y. Saeys, I. Inza, and P. Larranaga. A review of feature selection techniques in

bioinformatics. Bioinformatics , 23:2507–2517, 2007.

[168] R. E. Schapire. Nonlinear Estimation and Classification, chapter The boosting approach to machine learning: An overview. Springer, 2003.

[169] L. Schneps and C. Colmez. Math on Trial: How Numbers Get Used and Abused in

the Courtroom. EBL ebooks online. Basic Books, 2013.

[170] D. W. Scott. Multivariate density estimation . Wiley, New York, 1992.

[171] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[172] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2000.

[173] P. Spirtes and K. Zhang. Causal discovery and inference: concepts and recent methodological advances. Applied Informatics, 3, 2016.

[174] C. Stanfill and D. Waltz. Toward memory-based reasoning. Communications of the ACM, 29(12):1213–1228, 1986.

[175] C. Stone. Consistent nonparametric regression. The Annals of Statistics, 5:595–645, 1977.

[176] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36(1):111–147, 1974.

[177] M. Stone. An asymptotic equivalence of choice of models by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B, 39:44–47, 1977.

[178] T. Takagi and M. Sugeno. Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man, and Cybernetics, 15(1):116–132, 1985.

[179] M. Taniguchi and V. Tresp. Averaging regularized estimators. Neural Computation, 9, 1997.

[180] H. Tijms. Understanding Probability. Cambridge University Press, 2004.

[181] V. Tresp. Handbook for Neural Network Signal Processing, chapter Committee machines. CRC Press, 2001.


[182] I. Tsamardinos and C. Aliferis. Towards principled feature selection: Relevancy, filters and wrappers. In Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics, 2003.

[183] I. Tsamardinos and C. F. Aliferis. Towards principled feature selection: Relevancy, filters and wrappers. In Ninth International Workshop on Artificial Intelligence and Statistics, AISTATS, 2003.

[184] B. van Fraassen. The Scientific Image. Oxford University Press, 1980.

[185] V. N. Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, volume 4, Denver, CO, 1992.

[186] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, NY, 1995.

[187] V. N. Vapnik. Statistical Learning Theory. Springer, 1998.

[188] V. N. Vapnik and A. J. Chervonenkis. The necessary and sufficient conditions for consistency of the method of empirical risk minimization. Pattern Recognition and Image Analysis, 1(3):284–305, 1991.

[189] W. N. Venables and D. M. Smith. An Introduction to R. Network Theory, 2002.

[190] Tyler Vigen. Spurious Correlations. Hachette Books, 2015.

[191] T. P. Vogl, J. K. Mangis, A. K. Rigler, W. T. Zink, and D. L. Alkon. Accelerating the convergence of the back-propagation method. Biological Cybernetics, 59:257–263, 1988.

[192] L. Wasserman. All of Statistics. Springer, 2004.

[193] G. Watson. Smooth regression analysis. Sankhya, Series A, 26:359–372, 1964.

[194] S. M. Weiss and C. A. Kulikowski. Computer Systems That Learn. Morgan Kaufmann, San Mateo, CA, 1991.

[195] B. Widrow and M. E. Hoff. Adaptive switching circuits. In WESCON Convention Record Part IV, 1960.

[196] D. H. Wolpert. Stacked generalization. Technical Report LA-UR-90-3460, Los Alamos, NM, 1990.

[197] D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8:1341–1390, 1996.

[198] D. H. Wolpert and R. Kohavi. Bias plus variance decomposition for zero-one loss functions. In Proceedings of the 13th International Conference on Machine Learning, pages 275–283, 1996.

[199] Zenglin Xu, Rong Jin, Jieping Ye, Michael R. Lyu, and Irwin King. Non-monotonic feature selection. In Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman, editors, ICML, volume 382 of ACM International Conference Proceeding Series, pages 1145–1152. ACM, 2009.

[200] L. A. Zadeh. Fuzzy sets. Information and Control, 8:338–353, 1965.
