Boosted Gaussian Bayes Classifier and Its Application in Bank Credit Scoring

With the explosion of computer science in the last decade, the management of data banks and networks represents a large part of tomorrow's problems. One of them is the development of the best possible classification method in order to exploit databases. In classification problems, a representative and successful probabilistic method is the Naïve Bayes classifier. However, the effectiveness of Naïve Bayes still needs to be improved. Indeed, Naïve Bayes ignores misclassified instances instead of using them to become an adaptive algorithm. Different works have proposed using Boosting to improve the Gaussian Naïve Bayes algorithm by combining the Naïve Bayes classifier with the Adaboost method. Despite these works, the Boosted Gaussian Naïve Bayes algorithm is still neglected in the resolution of classification problems. One of the reasons could be the complexity of implementing the algorithm compared to a standard Gaussian Naïve Bayes. In this paper, we present one approach to a suitable solution, with a pseudo-algorithm that uses Boosting and Gaussian Naïve Bayes principles while having the lowest possible complexity.


INTRODUCTION
In machine learning and statistics, classification is one of the most important tools for analyzing and classifying large amounts of data. Classification is the problem of identifying to which of several categories a new observation belongs.
Elkan C (1997) [1] and, after him, Ridgeway G, Madigan D, Richardson T, O'Kane J (1998) [2] presented the advantages of Boosting methods and the interest of using them in classification problems. Many researchers have studied classification problems in order to improve the quality and efficiency of classification. An example would be assigning a given email to the "spam" or "non-spam" class, or assigning a given Iris flower to one of its species.

In the case of classification, Bayes's theorem can be interpreted with the following approach:
- Let D be a training set of samples carrying class information. We consider m classes w_1, w_2, ..., w_m, and let X = (X_1, X_2, ..., X_n), with x = (x_1, x_2, ..., x_n) a specific sample of n values over n attributes.
- Given a sample x, the probability P(w_i|x) is the posterior probability that x belongs to the class w_i. The classifier assigns x to the class w_i having the largest P(w_i|x).
- According to Bayes's theorem, Eq. (2):

P(w_i|X) = P(X|w_i) P(w_i) / P(X)

Since P(X) is the same for all classes and the prior probability P(w_i) is assumed the same for each class, we only need to compute P(X|w_i).
- In order to reduce the computational cost, the most common approach is to estimate P(X|w_i) instead of calculating it exactly. By assuming that the attributes are conditionally independent given the class, it can be computed by Eq. (3):

P(X|w_i) = ∏_{k=1}^{n} P(x_k|w_i)
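As a minimal sketch of this independence assumption, consider a toy discrete case; the class priors and per-attribute likelihoods below are illustrative placeholders, not values from the paper:

```python
# Toy Naive Bayes scoring under the conditional-independence assumption (Eq. 3).
# Two classes, two binary attributes; all parameters are hypothetical.
priors = {"w1": 0.5, "w2": 0.5}
# likelihoods[c][k] = P(x_k = 1 | class c)
likelihoods = {"w1": [0.8, 0.3], "w2": [0.2, 0.7]}

def posterior_scores(x, priors, likelihoods):
    """Score P(w_i) * prod_k P(x_k | w_i) for each class."""
    scores = {}
    for cls, prior in priors.items():
        p = prior
        for xk, theta in zip(x, likelihoods[cls]):
            p *= theta if xk == 1 else (1.0 - theta)
        scores[cls] = p
    return scores

scores = posterior_scores([1, 0], priors, likelihoods)
predicted = max(scores, key=scores.get)  # class with the largest score
```

The classifier then simply picks the class with the largest score, exactly as described above.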

Gaussian Naïve Bayes classifier
The Gaussian Naïve Bayes classifier is the special case of Naïve Bayes for continuous attributes.
Let there be k classes w_1, w_2, ..., w_k with prior probabilities q_i, i = 1, ..., k, and let X = (X_1, X_2, ..., X_n) be the n-dimensional data sample.
In the continuous case, P(x|w_i) is given by the class-conditional density f_i(x), Eq. (4), with q_i the prior probability of class w_i. In practice we assume that each density function follows a Gaussian distribution; we then only need to calculate the mean μ_i and variance σ_i² to obtain the density, Eq. (5):

f_i(x) = (1 / √(2π σ_i²)) exp(−(x − μ_i)² / (2σ_i²))
Here the mean μ_i and variance σ_i² are computed from the training samples of each class w_i.
For example, in the case of two classes, a new observation x is predicted to belong to class w_1 if q_1 f_1(x) > q_2 f_2(x).
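This decision rule can be sketched in a few lines of Python; the priors, means, and variances below are hypothetical placeholders, not estimates from any dataset in the paper:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate normal density, as in Eq. (5)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Illustrative class parameters: (prior, mean, variance) for w1 and w2.
q1, mu1, s1 = 0.5, 0.0, 1.0
q2, mu2, s2 = 0.5, 3.0, 1.0

def predict(x):
    # Assign to w1 iff q1 * f1(x) > q2 * f2(x)
    return "w1" if q1 * gaussian_pdf(x, mu1, s1) > q2 * gaussian_pdf(x, mu2, s2) else "w2"
```

With equal priors, the rule reduces to choosing the class whose Gaussian density is larger at x.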

The Adaboost Algorithm
Adaboost is an algorithm that combines weak classifiers and inaccurate rules to obtain a highly accurate prediction rule. At every iteration, the algorithm focuses on the mistakes made by the inaccurate rules by adding a notion of sample weight [13]. The combination of all the rules then makes a more precise one, as illustrated in Fig. 1. At each iteration, the weighted error ε_t is computed; this error is used to calculate a parameter α_t, which represents the contribution of each hypothesis h_t to the final prediction.
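A minimal sketch of one such iteration, assuming labels encoded as ±1 and the standard Adaboost reweighting scheme (α_t = ½ ln((1 − ε_t)/ε_t)):

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One Adaboost step: weighted error, hypothesis weight alpha, reweighting.

    Labels are in {-1, +1}; `weights` is a normalized distribution.
    """
    # Weighted error of the current weak hypothesis
    eps = sum(w for w, yt, yp in zip(weights, y_true, y_pred) if yt != yp)
    # Contribution of this hypothesis to the final prediction
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Increase weights of misclassified samples, decrease the rest
    new_w = [w * math.exp(-alpha * yt * yp)
             for w, yt, yp in zip(weights, y_true, y_pred)]
    z = sum(new_w)  # normalization constant
    return alpha, [w / z for w in new_w]
```

After the update, misclassified samples carry more weight, so the next weak classifier is forced to focus on them.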

The principle of the Boosted Gaussian Naïve Bayes can be translated into the pseudo-algorithm of Fig. 2.

In bank credit operations, the important question is how to determine the repayment ability and creditworthiness of a customer. Lenders use a credit scoring system, i.e. a numerical system, to measure how likely it is that a borrower will make payments on the money he or she borrows, and to decide whether to extend or deny credit. Lenders use machine learning algorithms to determine how much risk a particular borrower places on them if they decide to lend to that person. Therefore, a study on assessing the ability to repay bank debt is necessary.
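Combining the two previous ingredients, the boosted classifier can be sketched as follows for a single continuous attribute; this is an illustrative sketch under our own assumptions (labels encoded as ±1, a weighted Gaussian Naïve Bayes as the weak learner), not the exact pseudo-algorithm of Fig. 2:

```python
import math

def weighted_gnb_fit(X, y, w):
    """Fit a 1-D Gaussian NB weak learner with sample weights; labels in {-1, +1}."""
    params = {}
    for cls in (-1, 1):
        wc = [wi for wi, yi in zip(w, y) if yi == cls]
        xc = [xi for xi, yi in zip(X, y) if yi == cls]
        sw = sum(wc)
        mu = sum(wi * xi for wi, xi in zip(wc, xc)) / sw
        var = sum(wi * (xi - mu) ** 2 for wi, xi in zip(wc, xc)) / sw
        params[cls] = (sw, mu, max(var, 1e-9))  # (weighted prior, mean, variance)
    return params

def gnb_predict(params, x):
    def log_score(cls):
        q, mu, var = params[cls]
        return math.log(q) - 0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
    return 1 if log_score(1) > log_score(-1) else -1

def boosted_gnb(X, y, T=10):
    """Train T boosting rounds; return the weighted-vote predictor."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(T):
        params = weighted_gnb_fit(X, y, w)
        preds = [gnb_predict(params, x) for x in X]
        eps = sum(wi for wi, yi, pi in zip(w, y, preds) if yi != pi)
        if eps == 0 or eps >= 0.5:  # perfect or useless weak learner: stop
            ensemble.append((1.0, params))
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, params))
        w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, preds)]
        z = sum(w)
        w = [wi / z for wi in w]

    def predict(x):
        s = sum(a * gnb_predict(p, x) for a, p in ensemble)
        return 1 if s >= 0 else -1

    return predict
```

Each round refits the Gaussian parameters on the reweighted sample, so misclassified instances progressively shift the class means and variances instead of being ignored.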

Example 1 Simulated data
We test the algorithm on a sample of simulated data in order to obtain a training model. We generate 100 random samples according to the following formulas, Eqs. (8)-(9).
We then create a model using the Boosted Gaussian Naïve Bayes algorithm. Red points represent instances belonging to class w_1, green points to class w_2, and blue points are the misclassified ones, Fig. 3. After the first classification we can see that 5 samples were misclassified, Fig. 3. We calculate the error, update the weights and perform another classification. The algorithm now focuses on the weak samples, Fig. 4. After 10 iterations, we combine the classifiers, Fig. 5, and obtain the final model (see Table 1).
The Boosted algorithm provides better results: it is two times more precise than the Gaussian Naïve Bayes classifier. However, the computational time needed to create the model is 3.5 times greater.
We run the training process 10 times; the error is calculated by averaging the errors over the 10 runs, and we randomly divide our dataset 10 times into 70% training and 30% test sets in order to obtain reliable results (see Table 2).
The data were collected in Vinh Long province. We have three independent variables in our sample (see Table 3). The results are reported in Table 4.