# 1 The Receiver Operating Characteristics (ROC) curve

source

```r
my.auc = function(pred, target) {
  #### too slow. use auc_roc
  # pred   is the predictor (score)
  # target is the true 0/1 vector
  # the return value is an object holding auc, the area under the ROC curve,
  # together with the (x, y) coordinates of the curve
  target = target[order(pred)]
  n    = length(target)
  totP = sum(target)
  totN = n - totP
  FN   = cumsum(target)          # positives classified "0" at each cut
  TN   = (1:n) - FN              # negatives classified "0" at each cut
  FP   = totN - TN
  TP   = n - totN - FN           # = totP - FN
  x    = c(1, FP / (TN + FP))    # false positive rate (1 - specificity)
  y    = c(1, TP / (FN + TP))    # true positive rate (sensitivity)
  # trapezoidal rule; note x and y have length n + 1
  auc  = -0.5 * sum(diff(x) * (y[-1] + y[-(n + 1)]))
  obj  = list(auc = auc, x = x, y = y)
  class(obj) = "auc"
  return(obj)
}
```
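As a cross-check of the trapezoidal computation, the AUC can also be obtained from the Mann–Whitney rank statistic; a minimal self-contained sketch on the toy data used later (the helper name `auc_rank` is ours, not part of the code above):

```r
# Rank-based AUC: the probability that a randomly chosen positive
# receives a higher score than a randomly chosen negative.
auc_rank = function(pred, target) {
  r  = rank(pred)                 # ranks of all scores together
  nP = sum(target == 1)
  nN = sum(target == 0)
  (sum(r[target == 1]) - nP * (nP + 1) / 2) / (nP * nN)
}

y   = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
eta = c(-5, -4, -3.5, -2, -1.5, -1.1, 0.5, 1, 3, 4)
auc_rank(eta, y)   # 0.88: 22 of the 25 positive/negative pairs are correctly ordered
```

Both routes give the same number because the area under the empirical ROC curve equals the proportion of correctly ordered pairs.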
```r
plot.auc = function(obj) {
  with(obj, plot(x, y, type = "l", col = 2, lwd = 3,
                 xaxs = "i", yaxs = "i",
                 xlab = "False Positive Rate",
                 ylab = "True Positive Rate",
                 main = paste("ROC Curve. AUC=", round(obj$auc, 3))))
  xp = c(obj$x, 0, 1, 1)
  yp = c(obj$y, 0, 0, 1)
  polygon(xp, yp, density = NA, angle = 45, border = 2, col = "yellow")
  abline(a = 0, b = 1)
}
```

## 1.1 The problem of a binary classification

Suppose we obtained a continuous score from a composition of different explanatory variables, which should help us decide whether a unit must be classified as 0 or 1. For example, we could have obtained it as the linear predictor of a logistic regression. From an estimated linear predictor $$\hat{\eta}_i$$ (whatever the variables used to obtain it), we can compute an estimate of the probability of the $$i$$-th unit being "1" (or TRUE), according to the inverse logit transform of the estimated predictor: $\hat{p}_i=\widehat{Prob}(Y_i=1)=\frac{\exp(\hat{\eta}_i)}{1+\exp(\hat{\eta}_i)}$ which maps a real variable from $$\mathbb{R}$$ to $$[0,1]$$. Of course low values of $$\hat{\eta}_i$$ (that is, negative values with large absolute value) will give estimated probabilities near zero, while large positive values will give values of $$\hat{p}_i$$ near one.

This could be enough for our logistic regression model estimation (which possibly has been fitted with glm in R, and possibly chosen after a model selection through AIC or some sort of cross validation), but it is certainly not enough if we must decide something, that is:

• Is a patient likely to be sick?
• Is an online transaction honest, or is it a fraudulent one?
• Will the team "myfavouriteteam" win a soccer match?

## 1.2 Choosing a threshold

In other terms, I want to use the values of $$\hat{p}_i$$ to classify the units $$U_i, i=1,2,...,n$$ into one group (0 for low values) or the other (1 for large values). A first simple rule could be: classify the unit as group "1" (or TRUE) if $$\hat{p}_i\ge\frac{1}{2}$$. However this can be misleading in many cases.
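As a side note, R's built-in `plogis()` computes exactly this inverse logit, so the explicit formula can be cross-checked against it (a minimal sketch with arbitrary example values):

```r
eta = c(-3, -0.5, 0, 0.5, 3)            # some example linear-predictor values
p_formula = exp(eta) / (1 + exp(eta))   # the inverse logit written out
p_builtin = plogis(eta)                 # built-in logistic CDF: the same map
all.equal(p_formula, p_builtin)         # TRUE
```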
Suppose this very simple (artificial) situation:

source

```r
y   = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
eta = c(-5, -4, -3, -2, -1.5, -1, 0.5, 1, 3, 4)
p   = exp(eta) / (1 + exp(eta))
plot(p, y)
abline(v = 0.5)
```

plot

source

```r
print(round(p, 3))
```

output

```
 [1] 0.007 0.018 0.047 0.119 0.182 0.269 0.622 0.731 0.953 0.982
```

A threshold of 0.5 is too crude: we would classify the seventh unit in the wrong group, while, given the good separation between groups, a threshold $$p_t$$ in the interval $$p_7 < p_t \le p_8$$, that is (0.622, 0.731], will do the job perfectly, without classification errors. Any other threshold will not be efficient, because it will cause at least one misclassified unit (as for $$p_t=0.5$$).

Of course reality will not be so comfortable (and maybe also boring…): look at another, more realistic example (again on artificial data, only to fix ideas).

source

```r
y   = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
eta = c(-5, -4, -3.5, -2, -1.5, -1.1, 0.5, 1, 3, 4)
p   = exp(eta) / (1 + exp(eta))
plot(p, y)
abline(v = 0.5)
```

plot

Now the situation, even if very simple, is different: there is no clear separation between the groups, and every threshold will give some misclassification error: "0" classified as "1" (false positive, or FP, type I error) or "1" classified as "0" (false negative, or FN, type II error). The first case is when we say: "Stop this transaction: it is a fraud!" and afterwards we discover that it was honest; the second case is the opposite: "Approve this transaction and pay it", and later we discover that it was a fraud. Or we say that someone is sick while she/he is fine. Of course the two kinds of error will have different consequences, or, if we want to quantify with vile money, different costs. In the previous example, if we want to be safe from type II errors, but can bear some type I errors, we could be conservative and fix a threshold greater than $$p_3$$ but less than or equal to $$p_4$$, to be sure that all positive units will be correctly classified, but at the cost of two type I errors.
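We can make the "every threshold errs" claim concrete by counting both error types when the threshold sits just above each sorted value of $$p$$ (a small sketch reusing the same artificial data):

```r
y   = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
eta = c(-5, -4, -3.5, -2, -1.5, -1.1, 0.5, 1, 3, 4)
p   = exp(eta) / (1 + exp(eta))       # already sorted, since eta is increasing

# A threshold just above p_i classifies units 1..i as "0"
errs = sapply(seq_along(p), function(i) {
  pred = as.numeric(p > p[i])
  c(FP = sum(pred == 1 & y == 0),     # type I errors
    FN = sum(pred == 0 & y == 1))     # type II errors
})
colnames(errs) = paste0("p", seq_along(p))
errs            # every column contains at least one error
colSums(errs)   # total errors: the minimum (2) occurs just above p3, p5 or p7
```

Together with the all-positive extreme (threshold below $$p_1$$), these columns are exactly the distinct error configurations discussed in the next section.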
If we moved the threshold to the right we would have errors of both types. At the other extreme, if we choose a threshold greater than $$p_7$$ but less than or equal to $$p_8$$, we would have no type I errors but two type II errors…

## 1.3 Some general considerations

In a general real-life situation, things will of course be more complicated. First of all, as far as the effect of a threshold is concerned, it is the same thing to work on the $$\hat{p}$$ scale, directly on the $$\hat{\eta}$$ scale, or on any other monotone transformation of them. From the last example we see that any threshold less than $$p_3$$ or greater than $$p_8$$ is not efficient. Furthermore, any threshold between consecutive values $$p_i, p_{i+1}$$ will give the same error rates, so that we will have five different configurations of errors. With $$n$$ units we cannot have more than $$n+1$$ distinct configurations.

## 1.4 A more complicated situation

source

```r
l8 = glm(isFraud ~ card4*card6 + log(TransactionAmt) + I((log(TransactionAmt))^2)
         + sin(pi*hour/12) + cos(pi*hour/12)
         + sin(pi*hour/12)*cos(pi*hour/12)
         + C1*C2 + D1*D10*D11*D15,
         maxit = 100, data = a, family = binomial)
summary(l8)
```

output

```
Call:
glm(formula = isFraud ~ card4 * card6 + log(TransactionAmt) +
    I((log(TransactionAmt))^2) + sin(pi * hour/12) + cos(pi * hour/12) +
    sin(pi * hour/12) * cos(pi * hour/12) + C1 * C2 + D1 * D10 * D11 * D15,
    family = binomial, data = a, maxit = 100)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7118  -0.2205  -0.1769  -0.1374   3.8181

Coefficients:
                                      Estimate Std. Error z value Pr(>|z|)
(Intercept)                         -5.995e+00  2.790e-01 -21.488  < 2e-16 ***
card4mastercard                     -8.303e-01  8.157e-02 -10.180  < 2e-16 ***
card4visa                           -6.120e-01  7.677e-02  -7.973 1.55e-15 ***
card6debit                          -1.353e+00  5.105e-01  -2.651  0.00803 **
log(TransactionAmt)                  1.197e+00  1.060e-01  11.298  < 2e-16 ***
I((log(TransactionAmt))^2)          -8.664e-02  1.015e-02  -8.535  < 2e-16 ***
sin(pi * hour/12)                    2.148e-01  2.484e-02   8.647  < 2e-16 ***
cos(pi * hour/12)                   -1.934e-01  2.202e-02  -8.784  < 2e-16 ***
C1                                  -6.419e-02  5.592e-03 -11.479  < 2e-16 ***
C2                                   7.093e-02  5.534e-03  12.817  < 2e-16 ***
D1                                  -3.207e-03  5.293e-04  -6.058 1.37e-09 ***
D10                                  1.127e-03  2.237e-04   5.037 4.73e-07 ***
D11                                  7.314e-04  2.501e-04   2.924  0.00346 **
D15                                 -4.117e-04  2.246e-04  -1.833  0.06685 .
card4mastercard:card6debit           5.688e-01  5.131e-01   1.109  0.26764
card4visa:card6debit                 7.448e-01  5.117e-01   1.455  0.14557
sin(pi * hour/12):cos(pi * hour/12) -3.779e-01  4.367e-02  -8.655  < 2e-16 ***
C1:C2                               -3.079e-05  1.794e-05  -1.716  0.08611 .
D1:D10                              -1.527e-06  2.111e-06  -0.723  0.46944
D1:D11                              -3.252e-06  2.389e-06  -1.361  0.17359
D10:D11                             -7.783e-06  1.524e-06  -5.108 3.26e-07 ***
D1:D15                               6.857e-06  1.223e-06   5.608 2.05e-08 ***
D10:D15                             -5.627e-07  7.014e-07  -0.802  0.42238
D11:D15                             -4.758e-06  8.915e-07  -5.337 9.46e-08 ***
D1:D10:D11                           1.099e-08  8.021e-09   1.371  0.17049
D1:D10:D15                          -3.029e-09  4.194e-09  -0.722  0.47023
D1:D11:D15                           6.776e-09  4.618e-09   1.467  0.14231
D10:D11:D15                          1.415e-08  3.159e-09   4.480 7.46e-06 ***
D1:D10:D11:D15                      -2.321e-11  1.459e-11  -1.591  0.11157
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 59702  on 308895  degrees of freedom
Residual deviance: 56698  on 308867  degrees of freedom
AIC: 56756

Number of Fisher Scoring iterations: 8
```

source

```r
AIC(l8)
```

output

```
[1] 56755.86
```

source

```r
with(a, plot(predict(l8), isFraud, main = "Scale of the linear predictor"))
```

plot

source

```r
p = exp(predict(l8)) / (1 + exp(predict(l8)))
plot(p, a$isFraud, main = "Scale of p")
```
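The fraud data frame `a` is not reproduced here, but the relation between the two scales just plotted can be sketched on simulated data (everything below is hypothetical, in miniature, not taken from the fraud data):

```r
set.seed(1)
n = 1000
x = rnorm(n)                         # a single simulated covariate
y = rbinom(n, 1, plogis(-2 + x))     # a rare-ish binary outcome, like isFraud

fit = glm(y ~ x, family = binomial)  # the same kind of fit as l8, in miniature
eta_hat = predict(fit)                     # linear-predictor (eta) scale
p_hat   = predict(fit, type = "response")  # probability (p) scale
range(eta_hat)   # unbounded real values
range(p_hat)     # squeezed into (0, 1) by the inverse logit
```

Plotting `y` against either scale gives the same ordering of the units, which is all that matters for the threshold analysis that follows.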

plot

Indeed, from this messy data it is not easy to see how well the frauds can be discriminated.

## 1.5 The ROC curve

The ROC curve is built as a parametric curve: for each possible value of the threshold, the True Positive Rate (the sensitivity of the test) is plotted against the False Positive Rate (1 − specificity of the test).

The bisector line corresponds to classifying units by random selection.
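The parametric construction can be traced by hand on the toy data of section 1.2: sweep the threshold over the observed scores and record one (FPR, TPR) pair per threshold (a minimal sketch):

```r
y = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
p = plogis(c(-5, -4, -3.5, -2, -1.5, -1.1, 0.5, 1, 3, 4))

# One ROC point per threshold; -Inf supplies the (1, 1) corner
roc = t(sapply(c(-Inf, sort(p)), function(thr) {
  pred = as.numeric(p > thr)
  c(FPR = sum(pred == 1 & y == 0) / sum(y == 0),   # 1 - specificity
    TPR = sum(pred == 1 & y == 1) / sum(y == 1))   # sensitivity
}))
roc   # descends from (1, 1) to (0, 0) as the threshold rises
```

Joining these points (and no others: thresholds between consecutive scores add nothing new) gives exactly the staircase that `my.auc` computes.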

source

```r
auc = my.auc(predict(l8), a$isFraud)
plot(auc)
```

plot

In the plot, the ROC curve is the red one (built on the fraud data); the yellow area is the so-called "Area Under the Curve". An AUC near one denotes a very good test, while an area near 0.5 denotes an almost random test. Formally, the AUC is the probability that two units randomly chosen from the two groups will be correctly ordered by the predictor. In our example, 0.689 is the probability that if we draw two observations, one fraud and one non-fraud, the first has a value of the predictor greater than the second one.

## 1.6 A more complex model

source

```r
l9 = glm(isFraud ~ card4*card6 + log(TransactionAmt) + I((log(TransactionAmt))^2)
         + as.factor(hour)
         + C1*C2 + D1*D10*D11*D15,
         maxit = 100, data = a, family = binomial)
summary(l9)
```

output

```
Call:
glm(formula = isFraud ~ card4 * card6 + log(TransactionAmt) +
    I((log(TransactionAmt))^2) + as.factor(hour) + C1 * C2 +
    D1 * D10 * D11 * D15, family = binomial, data = a, maxit = 100)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7733  -0.2198  -0.1757  -0.1367   3.8264

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                -6.262e+00  2.832e-01 -22.111  < 2e-16 ***
card4mastercard            -8.286e-01  8.167e-02 -10.145  < 2e-16 ***
card4visa                  -6.117e-01  7.687e-02  -7.958 1.75e-15 ***
card6debit                 -1.359e+00  5.107e-01  -2.661  0.00780 **
log(TransactionAmt)         1.191e+00  1.060e-01  11.238  < 2e-16 ***
I((log(TransactionAmt))^2) -8.618e-02  1.016e-02  -8.486  < 2e-16 ***
as.factor(hour)1           -2.873e-02  8.282e-02  -0.347  0.72872
as.factor(hour)2            2.456e-02  8.879e-02   0.277  0.78209
as.factor(hour)3            9.523e-02  9.562e-02   0.996  0.31930
as.factor(hour)4            2.129e-01  1.114e-01   1.912  0.05587 .
as.factor(hour)5            7.809e-01  1.156e-01   6.756 1.41e-11 ***
as.factor(hour)6            7.895e-01  1.387e-01   5.691 1.26e-08 ***
as.factor(hour)7            1.053e+00  1.446e-01   7.285 3.23e-13 ***
as.factor(hour)8            8.913e-01  1.738e-01   5.129 2.92e-07 ***
as.factor(hour)9            1.072e+00  1.508e-01   7.109 1.17e-12 ***
as.factor(hour)10           6.777e-01  1.486e-01   4.561 5.08e-06 ***
as.factor(hour)11           4.670e-01  1.197e-01   3.902 9.53e-05 ***
as.factor(hour)12           4.160e-01  9.526e-02   4.368 1.26e-05 ***
as.factor(hour)13           7.182e-02  8.999e-02   0.798  0.42480
as.factor(hour)14           4.375e-02  8.288e-02   0.528  0.59756
as.factor(hour)15           8.874e-03  7.984e-02   0.111  0.91150
as.factor(hour)16           1.125e-01  7.487e-02   1.502  0.13301
as.factor(hour)17           1.343e-01  7.332e-02   1.831  0.06707 .
as.factor(hour)18           2.017e-01  7.198e-02   2.802  0.00508 **
as.factor(hour)19           2.163e-01  7.166e-02   3.019  0.00254 **
as.factor(hour)20           1.402e-01  7.314e-02   1.916  0.05531 .
as.factor(hour)21           9.835e-02  7.358e-02   1.337  0.18134
as.factor(hour)22           3.080e-02  7.446e-02   0.414  0.67912
as.factor(hour)23           2.300e-01  7.258e-02   3.170  0.00153 **
C1                         -6.397e-02  5.588e-03 -11.447  < 2e-16 ***
C2                          7.073e-02  5.531e-03  12.788  < 2e-16 ***
D1                         -3.201e-03  5.291e-04  -6.050 1.44e-09 ***
D10                         1.137e-03  2.239e-04   5.079 3.80e-07 ***
D11                         7.472e-04  2.501e-04   2.987  0.00281 **
D15                        -4.180e-04  2.246e-04  -1.861  0.06279 .
card4mastercard:card6debit  5.728e-01  5.133e-01   1.116  0.26446
card4visa:card6debit        7.511e-01  5.119e-01   1.467  0.14232
C1:C2                      -3.095e-05  1.795e-05  -1.724  0.08473 .
D1:D10                     -1.613e-06  2.108e-06  -0.765  0.44433
D1:D11                     -3.266e-06  2.388e-06  -1.368  0.17137
D10:D11                    -7.853e-06  1.524e-06  -5.152 2.58e-07 ***
D1:D15                      6.879e-06  1.224e-06   5.618 1.93e-08 ***
D10:D15                    -5.653e-07  7.021e-07  -0.805  0.42067
D11:D15                    -4.761e-06  8.910e-07  -5.343 9.13e-08 ***
D1:D10:D11                  1.118e-08  8.004e-09   1.397  0.16231
D1:D10:D15                 -2.919e-09  4.194e-09  -0.696  0.48641
D1:D11:D15                  6.762e-09  4.618e-09   1.464  0.14317
D10:D11:D15                 1.423e-08  3.160e-09   4.502 6.72e-06 ***
D1:D10:D11:D15             -2.341e-11  1.457e-11  -1.607  0.10810
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 59702  on 308895  degrees of freedom
Residual deviance: 56601  on 308847  degrees of freedom
AIC: 56699

Number of Fisher Scoring iterations: 8
```

source

```r
AIC(l9)
```

output

```
[1] 56698.68
```

source

```r
auc = my.auc(predict(l9), a$isFraud)
plot(auc)
```

plot
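The probabilistic reading of the AUC from section 1.5 — the chance that a random "1" outscores a random "0" — can be verified by simulation (simulated scores and hypothetical names, since the fraud data `a` is not included here):

```r
set.seed(42)
n = 5000
y = rbinom(n, 1, 0.3)                       # group labels
s = rnorm(n, mean = ifelse(y == 1, 1, 0))   # scores: positives shifted upward

# AUC via the Mann-Whitney rank formula
r   = rank(s); nP = sum(y); nN = n - nP
auc = (sum(r[y == 1]) - nP * (nP + 1) / 2) / (nP * nN)

# Monte Carlo estimate of P(score of a random "1" > score of a random "0")
i1 = sample(which(y == 1), 1e5, replace = TRUE)
i0 = sample(which(y == 0), 1e5, replace = TRUE)
c(auc = auc, mc = mean(s[i1] > s[i0]))      # the two numbers nearly coincide
```

Up to simulation noise, the Monte Carlo pair-ordering frequency reproduces the AUC, which is exactly the interpretation given for the 0.689 obtained on the fraud data.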