An R Code for Implementing Non-hierarchical Algorithm for Clustering of Probability Density Functions

Abstract. This paper presents, for the first time, a code implementing a non-hierarchical algorithm for clustering probability density functions in one dimension in the R environment. The code consists of two primary steps: executing the main clustering algorithm and evaluating the clustering quality. The code is validated on one simulated data set and two applications. The numerical results obtained are highly compatible with those of the MATLAB software, particularly regarding computational time. Notably, the code mainly serves an educational purpose and aims to extend the availability of the algorithm across several environments, so as to offer multiple choices to those interested in clustering.


INTRODUCTION
R is a free, open-source implementation of the S programming language and computing environment.
It was first developed in 1996 by professors Ross Ihaka and Robert Gentleman of the University of Auckland in New Zealand [1]. From time to time, upgraded versions are released in order to enhance its user library; more detail can be found at https://cran.r-project.org/. Due to the aforementioned features, R does not require any purchase fee, unlike commercial software such as MATLAB or STATA. Moreover, it also supports graphical presentation of data sets, built-in mechanisms for organizing data, and numerous available packages for users [2].
Therefore, it is easy to use even without strong programming skills, and it has attracted a lot of attention from experts in different fields, especially in statistics. To the best of our knowledge, R is currently one of the most well-known statistical software environments around the world, and more and more packages are being built to increase the diversity of the R library [3]. Needless to say, R is an ideal environment for those who want to contribute their work to other users for various purposes, education and research in particular.
Clustering is usually understood as grouping objects such that objects in the same cluster are as similar as possible while differing from objects in other clusters. These days, it is popular as an unsupervised recognition technique for exploring the internal structure of data [4]. As a result, more and more researchers are devoted to this field. Concerning R packages related to clustering, several works can be mentioned. For instance, in 2006, Chris Fraley and Adrian E. Raftery enhanced the "MCLUST" package for normal mixture modeling and model-based clustering. Simultaneously, Ryota Suzuki and Hidetoshi Shimodaira contributed an R package called "pvclust" to measure the uncertainty in hierarchical clustering analysis using bootstrap resampling methods [5]. In 2007, Lokesh Kumar and Matthias Futschik offered a package called "Mfuzz" for fuzzy clustering of microarray data, aiming at overcoming drawbacks of hard clustering in the gene area [6]. Then, in 2011, another clustering-related package called "clValid" was released by Guy Brock et al. in order to evaluate clustering results, concerning internal, stability and biological measures, respectively [7]. Besides, the work of Daniel Müllner in 2013 on a package called "fastcluster", based on C++, boosted the performance of current clustering algorithms in both R and Python [8]. Further, in 2015, Tal Galili contributed a package to visualize, adjust and compare trees produced by hierarchical methods [9]. Hence, it can be noticed that the clustering field has gained more and more attention from programmers year after year.

Estimating probability density functions

Before clustering, each data object is represented by a PDF estimated from the data by the kernel method. The kernel formula is presented as follows:

$$\hat{f}(x_1, x_2, \ldots, x_d) = \frac{1}{n\,h_1 h_2 \cdots h_d} \sum_{i=1}^{n} \prod_{j=1}^{d} K_j\!\left(\frac{x_j - X_{ij}}{h_j}\right), \qquad (1)$$

where $h_j$ is the bandwidth parameter for the $j$th variable and $K_j(\cdot)$ is a kernel function of the $j$th variable, which is usually Gaussian, Epanechnikov, Biweight, etc.
The bandwidth parameter and the type of kernel function play important roles in the estimation. Many methods have been proposed for selecting the bandwidth parameter, but no universally optimal selection has been found. In this paper, the bandwidth parameter is chosen based on the rule of Scott [11], $h = 1.06\,\hat{\sigma}\,n^{-1/5}$ in one dimension, and the kernel function is the Gaussian. Note that only the estimation of PDFs in one dimension is considered in this paper.
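As a quick illustration (not part of the appendix code), this estimation step can be carried out in base R, whose bw.nrd implements Scott's rule with a Gaussian kernel; the sample data below are an assumption for demonstration only.

```r
# Estimate a one-dimensional PDF with a Gaussian kernel and Scott's
# bandwidth; bw.nrd() is base R's implementation of Scott's rule.
set.seed(1)
obs <- rnorm(200, mean = 5, sd = 2)   # illustrative sample, not the paper's data
f_hat <- density(obs, bw = bw.nrd(obs), kernel = "gaussian", n = 512)
plot(f_hat, main = "Kernel density estimate")
```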
The clustering problem for probability density functions is defined as follows.

Definition 1. Let $F = \{f_1(x), f_2(x), \ldots, f_n(x)\}$, $n > 2$, be a set of PDFs and let $k$ be the number of clusters given in advance. The crisp clustering problem requires separating these PDFs into $k$ non-empty, disjoint clusters, each PDF belonging to exactly one cluster, according to the value of a chosen similarity measure.

Evaluating similarity between PDFs
In the clustering problem, how to determine the similarity between PDFs is extremely crucial since it partly decides the clustering result [12]. Although numerous similarity measures have been proposed for discrete elements, only a restricted number exist for PDFs. To the best of our knowledge, the $L_1$ distance and the cluster width are among the most popular. Thus, in this paper, we select the cluster width to measure the similarity between PDFs. Its definition is briefly expressed as follows [13].
Let $g$, $(g_1, g_2, \ldots, g_n)$ and $(f_1, f_2, \ldots, f_m)$ be probability density functions. We define the cluster width between a PDF and a set of PDFs as

$$w\left[g, \{f_1, \ldots, f_m\}\right] = \int \max\{g(x), f_1(x), \ldots, f_m(x)\}\,dx - 1, \qquad (2)$$

and the cluster width between two sets of PDFs (two clusters) as

$$w\left[\{g_1, \ldots, g_n\}, \{f_1, \ldots, f_m\}\right] = \int \max\{g_1(x), \ldots, g_n(x), f_1(x), \ldots, f_m(x)\}\,dx - 1.$$
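For concreteness, the cluster width can be approximated on a common evaluation grid. The following helper is a minimal sketch under that assumption (PDFs stored as rows of a matrix of density values); it is not the appendix listing.

```r
# Cluster width of a set of PDFs evaluated on a common grid x.
# pdf_matrix: one PDF per row, one grid point per column.
# The integral of the pointwise maximum is approximated by the
# trapezoidal rule, and 1 is subtracted as in Eq. (2).
cluster_width <- function(pdf_matrix, x) {
  fmax <- apply(pdf_matrix, 2, max)                  # pointwise maximum over PDFs
  sum(diff(x) * (head(fmax, -1) + tail(fmax, -1)) / 2) - 1
}
```

For two PDFs, this quantity equals half of their $L_1$ distance, which is why it behaves as a similarity measure.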

SF index: an internal validity measure
After a partition is deduced from a clustering algorithm, one needs to evaluate the goodness of the partition. Normally, internal validity measure indexes serve that purpose. As the internal measure index, the SF index stated in [14] is employed to assess the internal structure of the clusters. In detail, this measure is defined below:
$$\mathrm{SF} = \frac{\sum_{j=1}^{k} \frac{1}{n_j} \sum_{f_i \in C_j} \left\|f_i - f_{v_j}\right\|_1}{\min_{i \neq j} \left\|f_{v_i} - f_{v_j}\right\|_1}, \qquad (4)$$

where $\|f_{v_i} - f_{v_j}\|_1$ is the $L_1$ distance between the representing PDFs of the clusters $C_i$ and $C_j$; $f_{v_j} = \frac{1}{n_j} \sum_{f_i \in C_j} f_i$ is the representing probability density function of the cluster $C_j$, with $n_j$ being the number of PDFs in the cluster $C_j$.
From Eq. (4), we see that SF considers not only the compactness among PDFs within one cluster but also the separation between clusters. Therefore, it is reasonable to employ SF as the internal measure index in this paper. The smaller the value of SF, the more valid the resulting partition [14].
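Under the compactness/separation form of Eq. (4) given above, the SF index can be sketched in R as follows; the helper names and the matrix layout are assumptions carried over from the previous sketch, not the appendix listing.

```r
# L1 distance between two PDFs on grid x (trapezoidal rule).
l1_dist <- function(fa, fb, x) {
  d <- abs(fa - fb)
  sum(diff(x) * (head(d, -1) + tail(d, -1)) / 2)
}

# SF index of a partition: within-cluster compactness divided by the
# minimum separation between representing PDFs (smaller is better).
# f: one PDF per row; labels: cluster label of each PDF, in 1..k.
sf_index <- function(f, x, labels) {
  k <- max(labels)
  reps <- t(sapply(seq_len(k), function(j)
    colMeans(f[labels == j, , drop = FALSE])))       # representing PDFs f_vj
  compact <- sum(sapply(seq_len(k), function(j) {
    members <- which(labels == j)
    mean(sapply(members, function(i) l1_dist(f[i, ], reps[j, ], x)))
  }))
  sep <- min(apply(combn(k, 2), 2, function(p)
    l1_dist(reps[p[1], ], reps[p[2], ], x)))         # closest pair of reps
  compact / sep
}
```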

Extracting image features
In this section, we briefly present how to cluster image objects based on their features.
Initially, we read the color features of the image pixels into R. From the pixel distribution on the grayscale or RGB scale, we can construct a one-dimensional or multi-dimensional PDF representing each image. These PDFs then serve as the input for the employed algorithm. All processing steps, such as reading an image and representing it in a one-dimensional or multi-dimensional space, are performed in R with available packages.
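The paper does not name the packages used for this step; assuming the "jpeg" package and a standard luma conversion, the feature-extraction step might look like the following sketch (the file name is hypothetical).

```r
# Read an image and estimate a one-dimensional PDF of its gray levels
# (the "jpeg" package and the file name are assumptions).
library(jpeg)
img  <- readJPEG("traffic_frame.jpg")                 # hypothetical file
gray <- 0.299 * img[, , 1] + 0.587 * img[, , 2] + 0.114 * img[, , 3]
g    <- as.vector(gray)
# Gaussian kernel, Scott's bandwidth, common grid on [0, 1]
f_img <- density(g, bw = bw.nrd(g), kernel = "gaussian",
                 from = 0, to = 1, n = 512)
```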

NON-HIERARCHICAL ALGORITHM
The non-hierarchical algorithm, firstly proposed by T. V. Van and Pham-Gia [13], presents a new approach for clustering PDFs with a prior number of clusters. The original idea is inspired by the well-known k-means algorithm for clustering discrete elements. That is, from a known number of clusters, an arbitrary initial partition is created to assign each probability density function (PDF) to a cluster. Each PDF is then reallocated to the cluster for which the cluster width is minimal. Denoting by $C_i^{(t)}$, $i = \overline{1,k}$, the clusters at the $t$th iteration, the non-hierarchical algorithm proceeds as follows.
Step 1. Partition the $n$ PDFs into $k$ random clusters, obtaining the initial clusters $C_1^{(1)}, C_2^{(1)}, \ldots, C_k^{(1)}$ at the first iteration.

Step 2. For a specific PDF $f_j$, let $w\left(C_i^{(1)}, f_j\right)$ be the cluster width of the set obtained by assigning $f_j$ to cluster $C_i^{(1)}$, $i = \overline{1,k}$, computed according to Eq. (2). Considering $f_j \in C_h^{(1)}$, there are two possible cases:
- If $w\left(C_h^{(1)}, f_j\right) = \min_i w\left(C_i^{(1)}, f_j\right)$, then $f_j$ remains in its current cluster.
- Otherwise, letting $C_s^{(1)}$ be the cluster attaining the minimum, $f_j$ moves to the cluster $C_s$. Simultaneously, the old cluster $C_h$ detaches $f_j$ to form the new cluster $C_h^{(2)}$ in the next step. Now the new partition at the second iteration is formed: $C_1^{(2)}, C_2^{(2)}, \ldots, C_k^{(2)}$, where $f_j$ has been moved to the cluster with the minimum cluster width.

Step 3. Repeat Step 2 until every PDF belongs to the cluster whose cluster width with it is minimal. Figure 1 is a diagram demonstrating this process.

IMPLEMENTATION

Regarding the inputs of the code, $k$ is the number of clusters known in advance; usually, the value of $k$ ranges from 2 to the square root of the number of PDFs; $x$ is the vector of values used for evaluating each $f$ on a defined domain.
The non-hierarchical algorithm starts by creating and printing the initial partition matrix (lines 7-9). Herein, we build a subroutine called "createU" to perform this task. After the partition is established, the next step assigns the input PDFs to the clusters to which they belong (lines 12-17). Subsequently, the cluster width between each PDF and every cluster is computed (lines 20-28). This step calculates the similarity level of each PDF to each cluster, serving as the reference for later changes. Before entering the while loop that updates the clusters, some variables are created first. For example, m and d are used, respectively, to count the number of elements and to record the minimum cluster width whenever an observed PDF does not attain the minimum cluster width with the cluster it belongs to. Unew is employed to update the initial partition matrix. Then the initial partition matrix U is checked: if U satisfies the condition, the algorithm stops and outputs U as the final partition (lines 49-51). Otherwise, the update of matrix U starts from line 52. The while loop updating the partition matrix stops when m equals 0, meaning that every PDF satisfies the condition that the cluster width between it and the cluster it belongs to is minimal.
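For orientation, the following is a minimal sketch of how such a main loop can be organized. It is not the appendix listing, and it relies on the helper cluster_width() sketched earlier and on createU() sketched in the next subsection.

```r
# A compact sketch of the main loop (assumed form, not the appendix code).
# f: n x m matrix of PDF values on grid x; k: number of clusters.
nonhier_cluster <- function(f, x, k, max_iter = 100) {
  n <- nrow(f)
  U <- createU(k, n)                      # random initial k x n partition matrix
  for (iter in seq_len(max_iter)) {
    m <- 0                                # number of PDFs moved in this sweep
    for (j in seq_len(n)) {
      h <- which(U[, j] == 1)             # cluster f_j currently belongs to
      # cluster width of each cluster after assigning f_j to it
      w <- sapply(seq_len(k), function(i) {
        members <- unique(c(which(U[i, ] == 1), j))
        cluster_width(f[members, , drop = FALSE], x)
      })
      s <- which.min(w)
      if (s != h) {                       # reallocate f_j to the closest cluster
        U[h, j] <- 0
        U[s, j] <- 1
        m <- m + 1
      }
    }
    if (m == 0) break                     # every PDF already attains its minimum width
  }
  U
}
```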

Creating initial partition matrix
The initial partition matrix is created by a subroutine called "createU" (lines 151-173), which randomly generates a partition before the non-hierarchical algorithm is performed. This function is invoked by a single command whose arguments are k, the number of clusters, and npdf, the total number of PDFs; a minimal sketch is given after this subsection.

Regarding the evaluation protocol, for real data the nominal partition is not known, so we assess only the computational time and the SF index, without comparing against other software. Moreover, for each data set, the run is repeated 30 independent times to assure stability and accuracy; the final numerical results are the averages over these 30 runs. The details of each numerical result are presented later.
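As a rough illustration of what this subroutine does, consider the sketch below; the function name and arguments follow the description above, while the body is an assumption rather than the appendix listing (lines 151-173).

```r
# A minimal sketch of "createU" (assumed body). It returns a k x npdf
# binary matrix with exactly one 1 per column and no empty cluster.
createU <- function(k, npdf) {
  repeat {
    labels <- sample(seq_len(k), npdf, replace = TRUE)  # random cluster labels
    if (length(unique(labels)) == k) break              # reject empty clusters
  }
  U <- matrix(0, nrow = k, ncol = npdf)
  U[cbind(labels, seq_len(npdf))] <- 1
  U
}

U <- createU(3, 7)   # e.g., a random 3-cluster partition of seven PDFs
```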

Simulated data
For this case, we consider a benchmark data set in order to validate the accuracy of the code: seven univariate normal probability density functions (PDFs) firstly proposed in [13]. The structure of this data is well separated with little overlap, which reduces the complexity of the clustering task. Therefore, it is suitable for testing the performance of a novel algorithm, or of an algorithm recoded in another programming environment like R. The details of the estimated parameters can be found in [13]. From these parameters, seven PDFs are estimated and demonstrated in Fig. 2; the numerical result is shown in Table 1.

Application to clustering student satisfaction levels

For the second example, we cluster data on the satisfaction levels of students at Ton Duc Thang University, where each of the sixteen faculties is represented by a PDF. All numerical results corresponding to the suggested numbers of clusters are listed in Table 2. From Table 2, the case k = 2 has the smallest SF index, so we divide the sixteen faculties into two clusters, one of which consists of $f_8$, $f_9$, $f_{10}$, $f_{12}$, $f_{13}$, $f_{14}$, $f_{15}$.
Based on the clustering result and the comments from the students on the lecturers during the survey, the main discussions are listed with the clustering figures below. Overall, they point out that each faculty needs to take measures to improve its weak points and to promote the strengths of its lecturers for better teaching, so that students become more and more satisfied.

Application to clustering traffic images
In this example, we demonstrate the image-object procedure described in the implementation section. Specifically, 121 real images of size 1920 × 1080 pixels are extracted from a short video on Nguyen Huu Tho Street at night (Fig. 6). The numerical result is shown in Table 3, and the estimated PDFs and clustering result are shown in Figs. 7 and 8.
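Putting the earlier sketches together, a hypothetical end-to-end run for this example could look as follows; the file names are assumptions, and nonhier_cluster() and the common grid follow the previous sketches.

```r
# Hypothetical batch pipeline for the 121 traffic frames.
files <- sprintf("frame_%03d.jpg", 1:121)      # assumed file names
xgrid <- seq(0, 1, length.out = 512)
F <- t(sapply(files, function(fn) {
  img <- jpeg::readJPEG(fn)
  g   <- as.vector(0.299 * img[, , 1] + 0.587 * img[, , 2] + 0.114 * img[, , 3])
  density(g, bw = bw.nrd(g), kernel = "gaussian",
          from = 0, to = 1, n = 512)$y         # PDF values on the common grid
}))
U <- nonhier_cluster(F, xgrid, k = 2)          # one row of F per image PDF
```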

CONCLUSION
This paper provides an R code of the non-hierarchical algorithm for the clustering problem of probability density functions. It aims to extend the reach of this algorithm as well as to offer more options to those interested in clustering PDFs. Thanks to the advantages of R, the results show that the code in R is superior to that in MATLAB regarding computational time. Furthermore, the source code is shown in full in the Appendix, to which one can refer for more detail.
Nevertheless, the code provided in this paper handles only the one-dimensional case. Therefore, support for two or more dimensions should be added in order to extend the code and to develop further applications of the non-hierarchical method. This would be a promising direction for our future work.
Fig. 1: Diagram demonstrating the non-hierarchical clustering process.
In this section, we employ the code written in R on one simulated data set and two applications. The first is a set of seven normally distributed PDFs separated into 3 clusters; the numerical result for this data set is compared with that in MATLAB, especially regarding the computational time. Then, two applications are executed: one on data about the satisfaction levels of students at Ton Duc Thang University, the other on traffic images.

Fig. 2. It is clear that both R and MATLAB produce accurate results compared with the nominal partition, with the lowest SF of 0.050. Nevertheless, the code written in R runs faster than in MATLAB, taking 0.016 seconds on average instead of 0.038 seconds. Therefore, writing the code in R not only provides a new environment for users to experience but also boosts the performance of the non-hierarchical algorithm in this case.

Fig. 3: The clustering result of seven PDFs in example 1 by the non-hierarchical algorithm in R.

Fig. 4. One can see that the PDFs overlap considerably, so the algorithm can hardly cluster all PDFs correctly. Furthermore, since the number of clusters is undetermined, several numbers are suggested for the survey, including 2, 3 and 4 clusters. The most appropriate result is selected in terms of the SF index: the smaller the SF value, the better the partition, and the number of clusters is chosen accordingly.
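This selection procedure can be expressed with the earlier sketches as follows, assuming a matrix F holding the sixteen estimated PDFs (one per row) on a common grid xgrid; helper names follow the previous sketches.

```r
# Survey k = 2, 3, 4 and keep the partition with the smallest SF index.
candidate_k <- 2:4
sf_values <- sapply(candidate_k, function(k) {
  U <- nonhier_cluster(F, xgrid, k)
  labels <- apply(U, 2, which.max)            # cluster label of each PDF
  sf_index(F, xgrid, labels)
})
best_k <- candidate_k[which.min(sf_values)]   # smaller SF means a better partition
```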
- Cluster 1 (C_1) includes the faculties of Electrical and Electronics Engineering, Labor Relations and Trade Unions, and Pharmacy, whose average satisfaction scores range from 3 to below 5 (that is, from slight dissatisfaction to moderate satisfaction), lower than the rest of the faculties. This is because students expressed more dissatisfied opinions about these faculty members, concerning teaching methods, teaching materials, student interaction, enthusiasm and activity, than about the rest of the faculties.
- Cluster 2 (C_2), consisting of the remaining faculties, has average satisfaction scores from 5 to 6 (that is, from satisfaction to high satisfaction), higher than Cluster 1. Although faculty members belonging to C_2 received some negative comments, most comments are positive, such as good and fast-paced teaching methods and enthusiastic lectures. As a result, students give a fairly high level of satisfaction to this group (C_2).

Fig. 5: The clustering result of sixteen PDFs in example 2 by the non-hierarchical algorithm in R.

Fig. 6: Some traffic images extracted from a short video on Nguyen Huu Tho Street at night.

Fig. 7: 121 PDFs estimated from the 121 traffic images on Nguyen Huu Tho Street at night in example 3, before clustering.

Fig. 8: The clustering result of 121 PDFs in example 3 by the non-hierarchical algorithm in R.

Table 1. Comparison of the performance of the non-hierarchical algorithm for seven normal PDFs in R and MATLAB software.

Table 2. The results for the three suggested values of k.

Table 3. As can be seen from the table, the case k = 2 has the smallest SF index.