Finite mixtures of normal distributions in the study of the error in altimetry

The modelling of the altimetric error is proposed by means of the mixture of normal distributions. This alternative allows to avoid the problems of lack of normality of the altimetric error and that have been indicated numerous times. The conceptual bases of the mixture of distributions are presented and its application is demonstrated with an applied example. In the example, the altimetric errors existing between a DEM with 5x5 m resolution and another DEM with 2x2 m resolution are modelled, which is considered as a reference. The application demonstrates the feasibility and power of analysis of the proposal made.


Introduction
Digital Elevation Models (DEM) are topographic data that following a model (e.g., contour lines, point clouds, meshes, triangle networks, etc.) digitally represent the elevations (elevations or altimetry) of the bare terrain. DEMs have application in numerous branches of science and engineering and are used mainly for the calculation of height, slope, orientation and delimitation of basins (Ariza-López et al., 2018b). So that, DEM are considered as base data, and in this way the United Nations (UN-GGIM, 2019) and the European Union (https://inspire.ec.europa.eu/Themes/Data-Specifications/2892) consider them as a theme in their lists of relevant geospatial data layers. In the geomatic field, the quality of DEMs is usually understood as the altimetric positional accuracy of point data. Given a data product (PRO) and a reference of higher accuracy (REF), the altimetric discrepancies, or errors, (d = ZPRO-ZREF) are analyzed in a given set of points. There are very numerous approaches and methods developed to evaluate this positional accuracy (Mesa-Mingorance and Ariza-López, 2020). The best way to evaluate or control positional accuracy is by applying standardized methods, for example the new ASPRS standard (ASPRS, 2015), but there are many others (see a current guide of the most outstanding in Ariza-López et al., 2019). Until now, these standards are based on the assumption of the normality of errors (e.g., ASCE 1983, FGDC 1998, ASPRS 2015 which allows to the application of a parametric model: the normal distribution where the mean (μ) and the standard distribution (σ) are the position and scale parameters of the distribution. However, many studies (Zandbergen 2008(Zandbergen , 2011Maune, 2007) indicate that positional errors are not normally distributed. Regarding the work with methods based on the assumption of normality of the data, the non-normality of these can have several consequences depending on the degree of non-normality and the robustness of the applied method. In this case, non-normality violates a basic assumption of the method, and this violation is important from a strict perspective. The normal distribution is a suitable distribution to represent real-valued random variables. Therefore, fully adequate to describe the altimetric discrepancies d. However, the abundance of references indicating the nonnormality of altimetric-discrepancy data leads us to look for alternative statistical approaches, for example approaches are the use of robust statistics (Höhle and Höhle, 2009), the use of tolerances based on observed distributions (Ariza-López and Rodríguez-Avi, 2018a). In this work, a new way is explored, which consists in assuming that the altimetric discrepancies do not really come from a single normal distribution and that, on the contrary, they are the result of the mixture of several normal distributions. This way is very powerful, and interesting, since it consists of decomposing the observederror density function into a composition of a certain number of normal functions such that they adequately approximate it, that is, we work with a tool equivalent to what in analysis of signals consists of decomposing a signal by means of series of sine/cosine functions (Fourier transform). The underlying idea is that the observed variable really comes from a mixture of data from distributions that follow the same model (the normal), but with different parameters (means and standard deviations). In this way, the probability of an observed value comes from the mixture of the probabilities that it comes from each of the distributions that make up the mixture. The first works date back to 1894 when Pearson worked with the mixture of two normal distributions with the same variance and has been developed by multiple researchers (a detailed review can be seen in McLachlan-Peel, 2000;McLachlan et al., 2019, or Huang et al., 2017 and some examples of recent applications of mixtures in different fields can be seen, for instance, in Zhao et al., 2021or Li et al., 2021. To the best of our knowledge, this approach has never been previously applied to the case of errors in DEM. Thus, the objective of this paper is to propose the use of this wellknown general-statistical approach for analyzing and describing the altimetric positional accuracy, and that can be applied to any type of error data from DEM. This document is organized in the following way; section 2 presents a basic conceptual approach to the mixture of normal distributions. In section 3 an application method is proposed and in section 4 the proposed method is applied step by step to the case of data from two DEM with different spatial resolutions (2x2 m and 5x5 m). Section 5 presents the discussion and finally, some general conclusions are included. (ICA):

Mixture of Normal distributions
The assumption of error's normality in measures appears from the same origin of the normal distribution. Indeed, Gauss obtained it by studying astronomical errors. In fact, error's normality implies that there are not any external cause of errors, but pure chance. Nevertheless, in many cases, a distribution of measure errors, even when the underlying normality is adequate, can be overall non-normally distributed. This occurs when errors come from normal distribution but with different parameters. In this case, the mixture of data originating from different normal distributions cannot be adequately modelled by a single normal distribution. In this paper we propose an approach of this problem about the distribution of errors based on the use of Gaussian finite mixture models. In this context, we try to determine, through their parameters, which are the normal distributions that are mixed in the observed data set. In a theoretical point of view, we assume that the vector of observed errors = ( 1 , … , ) is a random sample that come from a mixture of > 1 arbitrary distributions of probability. Then, the density function of each is given by Where = ( , ) = � 1 , … , , 1 , … , � is the vector of parameters in such a way that 1 + ⋯ + = 1, with > 0 ∀ and ( 1 , … , ) is the vector of parameters of each mixing distribution that comes from any absolutely continuous probability distribution family, ℱ. In our case we consider that ℱ = { (•| , )} is the family of density functions ( , ), ( , ) ∈ ℝ × ℝ + . In consequence, we need to estimate the vector of dimension 3 : = � 1 , … , , ( 1 , 1 �, … , ( , )) In order to estimate (2) we utilize the EM algorithm (Dempster et al, 1977), that provide an iterative solution of the calculus of Maximum Likelihood estimators (MLE) in problems with missing values. The use of the EM algorithm is suggested not only for evidently incomplete data (missing values, truncated distributions, censored or grouped distributions), but also for statistical models where the absence of data is not so evident (McLachlan -Krishnan, 2008, McLachlan et al, 2019 as occurs with distributions obtained as mixtures. This algorithm uses, in an iterative way, the operator: where ∈ Θ, ( ) is the value obtained at iteration and the expectation refers to the distribution of ( | ) of given for the value ( ) of the parameter. Each iteration has two steps: (i) E-step where Q�θ�θ (t) � is computed and (ii) M-step where these values are used to maximize the likelihood of the mixing distribution and obtain the updated estimates ( +1) . Once parameters have been estimated, and by the Bayes theorem, we proceed to make a probabilistic grouping to assign each value of the original set (or of the whole population), to the corresponding normal distribution to which has more pertaining probability, according to the posterior probabilities: where ∈ ℝ, is the number of mixing distributions and � is the posterior probability that belongs to the group with density function . In this way, given an observed value, it is assigned to the corresponding normal distribution where this probability is maximum. In addition, we can calculate probabilities in the final mixed models, adding all probabilities for each point in the g obtained models: where � are obtained in (4).

Description of data.
To show an example of an application on real data, a study area around Allo (Navarra, Spain) is used by means of the following digital elevation models: • DEM02. It is a gridded DEM (2x2 meter resolution), it was generated in 2017 and its primary data source is a Lidar survey (second coverage of the PNOA-LiDAR project https://pnoa.ign.es/estado-del-proyectolidar/segunda-cobertura). It is considered as the reference in this example. • DEM05. It is a gridded DEM (5x5 meter resolution), it was generated in 2012 and its primary data source is a Lidar survey (first coverage of the PNOA-LiDAR project https://pnoa.ign.es/estado-del-proyectolidar/primera-cobertura). Both DEM data sets come from the Instituto Geográfico Nacional (IGN, Spain, www.igne.es) and are freely available. In relation to the study area, Figure 1 shows a Advances in Cartography and GIScience of the International Cartographic Association, 3, 2021. 30th International Cartographic Conference (ICC 2021), 14-18 December 2021, Florence, Italy. This contribution underwent double-blind peer review based on the full paper. https://doi.org/10.5194/ica-adv-3-13-2021 | © Author(s) 2021. CC BY 4.0 License general vision. The area is 504 km2, and it has a varied relief, but not abrupt, with valleys of different widths, and areas with different degrees of undulation. The elevation is in the interval 316-1046 m, mean value of elevation is 468 m and the standard deviation of the elevation is 92.8 m. The subtraction of both models (DEM05 -DEM02) allows to obtain the altimetric discrepancy model. Assuming a global normal distribution for these discrepancies the relevant parameters values are: μ = -0.017 m, and σ = 0.280 m. Several experts took samples manually in the altimetric discrepancies model. These zones represent different altimetry discrepancy environments between DEM05 and DEM02.   Figure 2 shows the quantile-quantile plot and the expected normal distribution (blue line). In the three figures we observe the presence of several local modes, and a significate deviation from normality. That suggests the possibility of a mixture of normal distribution as an underlying data model.

Method: the mixture model selection
The first step is to determine , the number of mixing normal distributions that compound the global mixture, to reproduce the observed form of errors. To deal with this problem we apply some information criteria to decide the best model (see for instance Cameron-Trivedi, 2013;Burnham-Anderson, 2003). Concretely we calculate the Akaike Information Criteria ( ), defined as: where ℒ is the value of the log-likelihood obtained through the estimation procedure, is the sample size and in the number of parameters, in this case, 3 . We have to take into account that this measure, related to the Kulblak Leibler distance is not a contrast about the model goodness of fit, but only make a comparison between models, in the way that the best model is which the obtained value is minimum. Table 3 shows AIC criteria when comes from 2 to 9, and we observe that the minimum value is accomplish for = 8. All calculations have been carried out using the package mixtools of R (R, 2021, Benaglia et al., 2008), that provide an estimation of the parameter vector given in (2). According to Table 3     We can see that there is a small set of data that are extremely dispersed, and only 27% of the data correspond to errors with a mean of practically 0. It is also worth noting high values of the standard deviations and that the probabilities of belonging are very distributed among the groups (an exception to the last, which are the most positively biased). Figure 5 graphically represents the fit of the 8 normal distributions (colored lines) and the fit to the true observed density, which is the dotted line. Once the number of mixing distribution is determined and parameters are estimated, the next step is about the behaviour of the resulting mixed distribution. For this, and using (4) we analyse all sampling points to the distribution to which it is most likely to belong. In this case, Table 5 shows the number of sampling points that can be assigned to each group, where the groups are those that appear in Table 4.  In this way, and using (5), we can obtain the mixed theoretical distribution, and we compare it with the observed empirical distribution. Figures 6 and 7 show a superposition of the histogram of the empirical distribution and the theoretical density curve (in orange) obtained from the estimated normal distributions. In Figure 7 case, only values from -1 to 1 are drawn because the high range of errors. Additionally, in this Figure   Figure 7: Observed Histogram and expected density of altimetric errors (from -1 to 1). In green, normal density considering that all data come to a single normal.

Results and analysis
Starting from the specified model, we can obtain probabilities using it and compare it with the empirical distribution and the one obtained assuming a normal (−0.017, 0.280). Table 6 shows several probabilities obtained in these three ways: • In the column labeled "Empirical model", values are the relative frequency of points that verify the row condition, • In the column labeled "Mixture model", probabilities are obtained using (4). That is to say, is the weighted (by the value of � ) sum of the eight normal distribution functions • In the column labeled "One Normal", probabilities are directly obtained from a Normal  41   2132  7557  1732  0  50  995  2391  2975  0  60  189  25  20 1234  61  86  575  302  0  62  6  4420  2865  0  63  13  2607  393  0  64  1146  1043  13  0  65  690 20445  5142  0  66  326  484  12  0  70  287  1053  1631  0  Table 8: Number of sampling points by type that belongs to groups 1-8 (continuation).
Starting from this contingency table we are interested in studying if both variables are related. In this case we can apply the Pearson's 2 test for independence in contingency tables, where the null hypothesis is that the variables have no relationship between them. In this case, the value 2 = 412294.4 that under the null hypothesis follows a 2 distribution with 98 degrees of freedom (d.o.f) so the p-value is 0. Similarly, the likelihood ratio test produces a 247898.4 statistic, which follows the same 2 distribution with 98 d.o.f and its p-value is also 0. In fact, if we calculate the Pearson contingency coefficient is: The proportion of points of each type of terrain that belong to each group are given in Tables 9 and 10. The most frequent model for each terrain type are highlighted in bold. It is noteworthy that 40% of the elements of land type 60 belong to distribution 8, which is the one with the highest average, which is called "Water". Similarly, in distributions 3, 5 and 7 there are no groups for which they are dominant. On the other hand, having the mixture model allows it to be applied to all the positions of the DEM and to obtain a spatial representation of the probability of belonging to each altimetric discrepancy value to each of the normal distributions that make up the mixture. This is what is presented in Figures 8 and 9 for groups N1 and N8, respectively. As can be seen, Figure 8 presents greater probabilities in the northern half, which is consistent with the terrain types where N1 participates. The distribution shown in Figure 9 is consistent with those areas, with a reduced dimension, where the highest values of altimetric discrepancy occur.

Discussion
The hypothesis of normality underlying measurement errors presupposes the fact that these errors are random, independent and have no other external cause that influences them (pure chance). Nevertheless, in practice, in many cases the normality hypothesis is not fulfilled, as Advances in Cartography and GIScience of the International Cartographic Association, 3, 2021. 30th International Cartographic Conference (ICC 2021), 14-18 December 2021, Florence, Italy. This contribution underwent double-blind peer review based on the full paper. https://doi.org/10.5194/ica-adv-3-13-2021 | © Author(s) 2021. CC BY 4.0 License is the example here shown (see Fig.4). This non-normality can be attributed to multiple causes. One of these causes, that is very common, is the fact that even though measurement errors are normally distributed, they come from a mixture of several normal distributions with different parameters. In this case the overall population, obtained as an addition of data from these different normal distributions, is not, itself, a normal distribution. In the case that has been presented, it has been considered that the mixture of normal distributions come from different topographic and land cover situations, but it is also true that the parameters of the data collection by the LiDAR sensor (e.g., height, angle of incidence, humidity of the ground, etc.), influence altimetric errors. The great discussion that can be raised about the previous approach is whether to work with or without prior information. If additional information is available on the points, the groups obtained by assigning each point to the mixing normal distribution to which it is most likely to belong can be related with that additional information. In our case, this information is qualitative -the type of terrain-, but it can be of any other type. This helps to better understand the cause of the error and identify, for example, those areas most prone to extreme errors. All of this information can be useful when designing more precise quality control procedures. If this information is not available, the method is also capable of determining the number of mixing normal distributions and their parameters; however, in this case it may be more complex to find an obvious physical meaning for the categories that are made up. The examples of probabilities presented in Table 6 clearly indicate the goodness of fit to the observed data. In addition, it is also evident that the option of adjusting a single global normal is a bad approximation for this case. Tables 8 and 9 present results that show that there are types of terrain that are better defined than others, understanding as better defined that they present an explanation based on few groups, that is, a high proportion of cases in those few groups (e.g., terrain types 0, 61, 63). There are also other cases where this is not the case (e.g., type 30). It is interesting to note that group 8, which is responsible for the largest outliers in the general normal model, is concentrated in a single type of terrain (type 60). These results show us that, perhaps, the terrain types classification being considered, or the samples taken from them, are not the most appropriate, but it is not a problem to understand that the method of mixtures works properly. The parameter estimation method also requires a comment. We have used the EM method to estimate the parameters that correspond to each of the distributions that make up the mixture. The EM is a well-known algorithm that is available in many computerized calculation tools and, as shown, its use is relatively straightforward. This algorithm is for general use in optimization and is included in packages suitable for determining the parameters of normal mixtures, as is the case of the mixtools package of R (R, 2021, Benaglia et al., 2008), that has been used in this work. If this is possible, as in the example presented above, we can reproduce the empirical data distribution function through a theoretical model, which, once obtained, can be extended to the entire population. In this way, we go from having a model that cannot be explained by a single normal, and that can be considered as nonparametric, to a parametric model with multiple parameters. This provides a new way of analysis since the methods based on a single normal can be applied to models based on mixtures of normal distributions. Of course, the last will be more complex, but not intractable.

Conclusions
An application of the finite mixture analysis techniques of distributions to the case of altimetric discrepancies in a DEM is presented. The theoretical basis and the tools for its application already exist, so it is not risky or expensive to apply them to this field. A method for this type of analysis was developed, and a practical example has been carried out based on real data, that shows how to carry out this application, as well as the results obtained. We consider that this technique for the analysis of discrepancies in altimetry allows us to apply the conventional positional quality assessment methods to this mixture model, which opens a line that extends its possibilities and avoids the limitations of non-normality that arise in many studies of error in altimetry. So that, looking ahead, an interesting research line is to analyze how the positional accuracy assessment standards can be applied when a characterization of the errors through mixtures is available. This procedure may also be applied in many other cases involving continuous data where normality can be assumed in advance.