Dynamic Location Referencing: Probability-Based Decision System

Location Referencing is a well-known methodology for transferring geoobjects from one digital map to another and is typically used to share traffic information. Dynamic methods play a major role here, as they are designed to transfer Location References between different maps where no common database and/or common structure is available. The key issue in dynamic Location Referencing is to find the geoobject in the target map that corresponds to the geoobject in the source map. So far, nearly all methods implement a deterministic algorithm for this task. Since geodata as well as the matching procedure for geodata are subject to uncertainty, uncertainty-based algorithms are a natural subject of research. This paper presents a probability-based decision system by formulating the decision algorithm and decision functions and evaluating them. The evaluation uses real traffic information and is benchmarked against a deterministic decision system.


Introduction
For location-based services (LBS), digital maps are of crucial importance and are often provided in different versions. In the context of transferring geoobjects (locations) between two digital maps, the term Location Reference (LR) was introduced. Following various definitions (Hackeloeer, 2016; Lv et al., 2008; Schützle, 2016; Wevers, 2000), a Location Reference can be understood as a map-independent description or identification of a geoobject based on a digital map. The description itself is made up of concrete attributes, which are derived from the well-known attribute categories for general geodata (de Lange, 2013; Devogele et al., 1996; Hagedorn, 2007; Lutz et al., 2009; Ma et al., 2010; Streit, 1996) and can be summarized and grouped as follows (Hindenberger and Schwieger, 2018):
• Geometrical attributes like coordinates/bearing;
• Topological attributes like node valence/degree;
• Syntactical (or toponymical) attributes like the name of an object;
• Semantical (or thematical) attributes like the physical appearance of an object or its hierarchy in the road network.
Like geoobjects in general (de Lange, 2013), an LR can, in the planar case, consist of points, lines and areas (Schützle, 2016) and typically covers use cases such as traffic or road information (Hackeloeer, 2016; Wevers and Hendriks, 2006). The umbrella term Location Referencing summarizes all methods and techniques for transferring an LR (Schützle, 2016) and represents a concrete variant of general georeferencing (Hackeloeer, 2016). A specific method is typically called a Location Referencing Method (LRM) and is used in a Location Referencing System (LRS) (Kenley and Harfield, 2018). The process (Location Referencing Process) within the Location Referencing System runs in one direction without any iteration, and the sender (source) and receiver (target) systems are typically distributed and separated. Svensk and Wikström (2013) and Schützle (2016) split the process into three steps: first, encoding the LR for a given geoobject on the sender side (i.e. the LR is generated in a defined interface format); second, the transfer itself between the sender and receiver systems; and third, decoding on the receiver side, which primarily means running the algorithm that finds the correct or most likely geoobject for the transferred LR in the receiver system. In the past, nearly all methods used deterministic algorithms to achieve this objective. Generally, geodata have some uncertainty (Goodchild, 1992; Glemser, 2000). Additional uncertainty arises from linking/integrating geodata from different sources (Gahegan and Ehlers, 2000; Gösseln and Sester, 2004; Samal et al., 2004; Stigmar, 2005).
Advances in Cartography and GIScience of the International Cartographic Association, 2, 2019. 15th International Conference on Location Based Services, 11-13 November 2019, Vienna, Austria. This contribution underwent double-blind peer review based on the full paper | https://doi.org/10.5194/ica-adv-2-6-2019 | © Authors 2019. CC BY 4.0 License
Hence, it is appropriate to handle such uncertainty with concepts like probability theory. However, such a method has been used in only one known LRM (TPEG2-ULR with a Markov chain, published in Schramm et al. (2012)), which motivates further research. Based on previous work published in Hindenberger and Schwieger (2018), this paper presents a probability-based decision system for LRS and is structured as follows: Section 2 gives the mathematical principles in the context of the approach. Sections 3 and 4 describe the specific decision functions and the underlying probability distributions. This is followed by the experimental evaluations in Section 5, and the paper closes with a conclusion in Section 6.

Mathematical principles
In this approach, the values are treated as random variables. This section therefore introduces the relevant mathematical principles and methods, which are used later in Sections 3 and 4.

Probability theory
Probability is a measure of the occurrence of a random variable (Eckey et al., 2005). Random variables are either discrete or continuous (Bosch, 1986), as shown in Table 1. In this paper, the exponential and binomial distributions are particularly relevant (compare Section 4.2). Specifically, the exponential distribution is defined as (Bronstein et al., 1993):

$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0, \ \lambda > 0 \qquad (1)$$

Furthermore, the binomial distribution is given by (Bronstein et al., 1993):

$$P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k} \qquad (2)$$

where p = probability of a single event, n = number of trials, k = number of successes in n trials.
If the distribution type is unknown, the distribution and its parameters need to be estimated from a random sample (Table 2) and then checked by a goodness-of-fit test (e.g. the Kolmogoroff-Smirnoff-Test) to identify the probability distribution.
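The two distributions and the parameter estimation can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and since Table 2 is not reproduced here, the estimators shown (reciprocal of the sample mean for the exponential rate, relative frequency for the single-event probability) are the standard maximum-likelihood estimators and only assumed to match the paper's Equations 3 and 4.

```python
import math
from typing import Sequence


def exponential_pdf(x: float, lam: float) -> float:
    """Density of the exponential distribution: lam * exp(-lam * x) for x >= 0."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def estimate_rate(sample: Sequence[float]) -> float:
    """Assumed estimator for the exponential rate: 1 / sample mean."""
    return len(sample) / sum(sample)


def estimate_probability(successes: int, trials: int) -> float:
    """Assumed estimator for a single-event probability: relative frequency."""
    return successes / trials
```

For example, `estimate_probability(95, 100)` turns an observed relative frequency into the probability estimate 0.95.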
In a Kolmogoroff-Smirnoff-Test, the test statistic is compared against a critical value at level 1−α, which needs to be reduced if parameters are estimated from the sample. For an exponential probability distribution with estimated parameter, significance level α = 0.05 and random sample size n > 30, this critical value is given as (Hartung et al., 2009):

Deterministic and probability-based approaches
Mathematical decision concepts can be divided into deterministic and uncertainty-based approaches. In the deterministic case, the values are uniquely determined, whereas in the case of uncertainty, the values are based on random variables (stochastic, probability-based) or are fuzzy (Rommelfanger and Eickemeier, 2002).
Typically, deterministic values are described by analytical functions, which are combined into a final power (or cost) function. This function is maximized (or minimized) to obtain the final decision value (Dinkelbach, 1969). Stochastic values are handled by the methods of probability theory. Multiple individual probabilities are also combined. Here, a distinction has to be made between stochastically independent and stochastically dependent probabilities. For independent probabilities, the single probabilities $P_1, \dots, P_n$ are multiplied (Equation 6), whereas for dependent probabilities of events $A_1, \dots, A_n$ the "multiplication rule for conditional probabilities" needs to be used (Hartung et al., 2009):

$$P = P_1 \cdot P_2 \cdots P_n = \prod_{i=1}^{n} P_i \qquad (6)$$

$$P(A_1 \cap \dots \cap A_n) = P(A_1) \, P(A_2 \mid A_1) \cdots P(A_n \mid A_1 \cap \dots \cap A_{n-1})$$

Similar to the deterministic approach, the probability-based approach is an optimization task, which means the maximum (or minimum) value is used to make the final decision (Dinkelbach, 1969; Kromphardt et al., 1962).
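The two combination rules can be illustrated with a short Python sketch. The function names are hypothetical and the numbers in the usage note are made up for illustration only. Both rules reduce to a product; the difference lies in what the factors mean: unconditional probabilities in the independent case, conditional probabilities in the dependent case.

```python
import math
from typing import Sequence


def _check(probabilities: Sequence[float]) -> None:
    """Reject values outside the closed interval [0, 1]."""
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")


def combine_independent(probabilities: Sequence[float]) -> float:
    """Joint probability of stochastically independent events: product of P_i."""
    _check(probabilities)
    return math.prod(probabilities)


def combine_dependent(first: float, conditionals: Sequence[float]) -> float:
    """Multiplication rule for conditional probabilities.

    first           = P(A_1)
    conditionals[i] = P(A_{i+2} | A_1, ..., A_{i+1})
    """
    _check([first, *conditionals])
    return first * math.prod(conditionals)
```

With illustrative values, `combine_independent([0.9, 0.8, 0.5])` gives 0.36, and `combine_dependent(0.5, [0.4])` gives 0.2 for P(A) = 0.5 and P(B | A) = 0.4.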

Definition of criteria
Similarity measurements are well-known and widely used to integrate and match geoobjects, i.e. to identify the same geoobject represented in different data sets (Bruns and Egenhofer, 1996; Gösseln and Sester, 2004; Li and Goodchild, 2012; Rodriguez and Egenhofer, 2004; Volz and Walter, 2004), and play a major role in many applications such as spatial decision systems (Schwering, 2008). Following Hindenberger and Schwieger (2018), Table 3 shows the selected criteria (similarity measurements) with the focus on linear geoobjects (e.g. roads) described by the topological elements of nodes and edges. As described in Hindenberger and Schwieger (2018), the criteria listed result from former works in this context, in which the criteria have been used and/or named as relevant (e.g. Schützle, 2016).

Decision functions
In this paper, two groups of decision functions have been specified and formulated:
• Probability-based main function with the full set of criteria as the primary methodological outcome;
• Probability-based and deterministic functions for comparative calculation.
The second group is needed because absolute performance strongly depends on the composition of the specific test data and on the definition of the performance indicator. It is therefore necessary to perform a comparative calculation using a well-known deterministic approach. For this, the OpenLR decision function (OpenLR demo implementation, described in Schützle (2016)) has been selected and adapted accordingly. It uses a reduced set of criteria consisting of node distances, bearing differences and the comparison of FRC and FOW.

Probability-based functions
Using the full set of criteria, the probability-based main function is defined by Equation 7, which follows Equation 6 because the criteria are stochastically independent:

$$P = P_{ND} \cdot P_{BD} \cdot P_{NV} \cdot P_{SN} \cdot P_{FRC} \cdot P_{FOW} \cdot P_{SC} \cdot P_{OW} \qquad (7)$$

where the subscripts denote the criteria (Table 4, variable names for the probability-based decision function): ND = node distance, BD = bearing difference, NV = node valences, SN = street name, FRC = functional road class, FOW = form of way, SC = speed category, OW = one-way attribute.
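For the comparative calculation and the reduced set of criteria, the probability-based decision function is adapted accordingly by Equation 8:

$$P = P_{ND} \cdot P_{BD} \cdot P_{FRC} \cdot P_{FOW} \qquad (8)$$

where the factors are the probabilities for node distance (ND), bearing difference (BD), functional road class (FRC) and form of way (FOW). The individual probability values are given by the probability distributions in Section 4.2.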

Deterministic function
The OpenLR deterministic approach is specified as a deterministic power function given by Equation 9. It combines a rating factor for node distance, a rating factor for bearing differences, a rating factor for FRC and a rating factor for FOW, together with two weighting factors (default = 3).
In detail, the rating factor for node distance is calculated by Equation 10 (Euclidean distance, detailed in Hindenberger and Schwieger (2018)). The rating factor for bearing differences is specified by rating Table 5 using the bearing difference (minimum absolute difference between the azimuths of the two paired edges, Hindenberger and Schwieger (2018)).

… random samples in an appropriate way as well as the characteristics of the probability distributions. The first classification reflects whether the criteria are considered to be map-supplier independent or not, resulting in a mixed or a dependent (specific) random sample. The second classification reflects the characteristics of the data (random variables, compare Section 2.1), i.e. whether they are continuous or discrete, which defines the type of probability distribution. Both additional classifications are shown in Table 6.

Geometrical evaluations
Here, a mixed random sample (100 corresponding edges) consisting of several supplier combinations, with the same size for every subset, has been aggregated (Table 7). This gives a sample size of n = 200 for node distances and n = 100 for bearing differences. As detailed by the authors, node distances have been calculated as Euclidean distances and bearing differences as the minimum absolute difference between the azimuths of the two paired edges. As a result (Figure 2 and Figure 3), both random samples are assumed to be exponentially distributed with parameter λ (estimated by Equation 4), which is tested by the Kolmogoroff-Smirnoff-Test (KST). Processing the KST using Equation 5 gives result (11) for node distances and result (12) for bearing differences. Both test statistics fall below the critical value at level 1−α, i.e. the hypothesis that the random samples are exponentially distributed with the given λ could not be rejected:

node distances: test statistic = 0.072 < critical value (1−α = 0.95) = 1.08 (11)
bearing differences: test statistic = 0.102 < critical value (1−α = 0.95) = 1.08 (12)
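The goodness-of-fit check described above can be sketched in Python. This is a minimal illustration under stated assumptions: the function name is hypothetical, the rate is estimated as the reciprocal of the sample mean, and the statistic computed is the plain Kolmogoroff-Smirnoff distance between the empirical and the fitted exponential distribution function. The critical value from Hartung et al. (2009) and any scaling applied in the paper's Equation 5 are not reproduced here.

```python
import math
from typing import Sequence, Tuple


def ks_statistic_exponential(sample: Sequence[float]) -> Tuple[float, float]:
    """Return (estimated rate, KS distance) for a fitted exponential distribution.

    The KS distance is the largest absolute gap between the empirical
    distribution function and F(x) = 1 - exp(-lam * x).
    """
    xs = sorted(sample)
    n = len(xs)
    if n == 0 or sum(xs) <= 0:
        raise ValueError("sample must contain positive values")
    lam = n / sum(xs)  # assumed estimator: 1 / sample mean
    distance = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = 1.0 - math.exp(-lam * x)
        # The empirical distribution function jumps from (i - 1)/n to i/n at x,
        # so both sides of the jump have to be compared with the fitted curve.
        distance = max(distance, i / n - cdf, cdf - (i - 1) / n)
    return lam, distance
```

The returned distance would then be compared with the critical value for an exponential distribution with estimated parameter. If it stays below that value, the hypothesis of an exponential distribution is not rejected.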

Topological evaluations
Again, a mixed random sample with the same size for every subset has been set up to analyse the distribution of the node valence within a given node pair (Table 8). Here, the mixed random sample needs to be split according to the available nodes with the necessary node degrees. From a practical point of view, for a given node valence in the sender system, the corresponding node valence in the receiver system has been evaluated. This gives the results in Table 9. Table 9 shows a nearly 1:1 matching for the common node valences 1, 3 and 4. In contrast, the node valences 5 and 6 are more widely distributed. As explained in detail by Hindenberger and Schwieger (2018), the general multinomial distribution over all corresponding values for a specific given value can be simplified to a binomial distribution. So, the probability $\hat{p}$ for a specific combination "given node valence - corresponding node valence" can be derived from the relative frequency by Equation 3.
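Deriving these probabilities from relative frequencies can be sketched as follows. The function name and the valence pairs in the usage note are hypothetical and do not reproduce Table 9. The sketch only assumes that each observation is a pair (node valence in the sender system, node valence in the receiver system).

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def valence_probabilities(
    pairs: Iterable[Tuple[int, int]],
) -> Dict[int, Dict[int, float]]:
    """Estimate P(corresponding valence | given valence) by relative frequency."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for given, corresponding in pairs:
        counts[given][corresponding] += 1
    return {
        given: {
            corresponding: k / sum(counter.values())
            for corresponding, k in counter.items()
        }
        for given, counter in counts.items()
    }
```

With the made-up observations `[(3, 3), (3, 3), (3, 3), (3, 4), (5, 4), (5, 5)]`, a given valence of 3 maps to 3 with probability 0.75 and to 4 with probability 0.25. The same pattern applies to the supplier-dependent semantical attributes, where one table per mapping direction is needed.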

Syntactical evaluations
As described in Section 3.1, this is realized as a comparison of street names, i.e. two street names are compared to find out how similar they are. Various methods and concepts for similarity measurement can be applied to such a string comparison. Good overviews are given in Naumann (2013) and Deza and Deza (2016). A preliminary analysis has therefore been performed to find the best-fitting similarity measure. The analysis has been done on normalized strings, i.e. a given string is converted to a string with only upper-case letters and no street numbers, spaces, special (country-specific) characters, abbreviations or other symbols. This minimizes user-specific differences that result merely from different ways of naming a street or from typical spelling/typing errors. As an outcome, the well-known and use-case-independent Levenshtein distance in the Damerau-Levenshtein variant, expressed as a normalized similarity score (Equation 13, following Serva and Petroni, 2008; Naumann, 2013), has been chosen as a compromise. Basically, the Levenshtein distance calculates the minimal number of operations necessary to transform one string into another. Every operation (insert, delete, substitute) is counted with cost = 1, which results in a sum of costs (Levenshtein, 1966; Zhang, 2009). In the Damerau-Levenshtein variant, transpositions are also counted with cost = 1 (Damerau, 1964):

$$s = 1 - \frac{d}{\max(l_1, l_2)} \qquad (13)$$

where d = number of operations (edit distance), max(l_1, l_2) = maximum length of the two strings.
As a result, the similarity measurements give a relative frequency of 0.95 for a similarity score of 100% and consequently 0.05 for a score below 100% (based on the same mixed random sample aggregated for the geometrical evaluations). Inserted into Equation 3, these frequencies also represent the probabilities.
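The street name comparison can be sketched in Python as follows. Several points are assumptions rather than the authors' method: the function names are hypothetical, the normalization is simplified (upper case, letters only, no handling of abbreviations or country-specific characters), and the distance is the restricted "optimal string alignment" form of the Damerau-Levenshtein distance, which counts adjacent transpositions with cost 1 but does not edit a substring more than once.

```python
def normalize(name: str) -> str:
    """Simplified normalization: upper case, letters only."""
    return "".join(ch for ch in name.upper() if ch.isalpha())


def osa_distance(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity score: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(a, b) / longest
```

A score of 1.0 corresponds to the 100% case above. A single transposition in a four-letter name, as in `similarity("ABCD", "ABDC")`, gives 0.75.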

Semantical evaluations
In Section 4.1, the semantical criteria have been introduced as supplier-dependent, which generally means that the distributions have to be worked out separately for every combination of map suppliers. It must be noted that this mapping is unidirectional, since the underlying semantical attributes specified in Section 3.1 are usually supplier-specific in terms of availability, definition, scalability, actuality and interpretation. This paper focuses on the combination of TomTom with HERE, for which the distributions have been analysed. As in Section 4.2.3 for the topological evaluations, the distributions have been analysed such that a specific criterion value is given in one system and the corresponding value is taken from the other system. The sample size has been set to n = 30, since experience shows that even a smaller sample size gives sufficient results. As an example, the results for the Functional Road Class (FRC) are shown in Table 10 and Table 11. In TomTom, this attribute is defined directly as Functional Road Class (FRC) in the interval [0;8] with decreasing importance. In HERE, there is a similar attribute named Functional Class (FC), defined in the interval [1;5], also with decreasing importance. As Table 10 and Table 11 show, the mapping distribution differs by direction: in some cases (e.g. TomTom to HERE for 1 to 2), there is nearly a 1:1 matching, whereas in other cases (e.g. HERE to TomTom for 3 to 1..5), the results are split. The other semantical criteria named above have been evaluated in an analogous way (Hindenberger, 2017).

Decision algorithm and processing
To evaluate the performance of the probability-based approach, a system has been set up that shows the absolute performance as well as the relative performance compared to a deterministic algorithm. It implements the decision functions formulated in Section 3.2 using the following decision algorithm.
Step 1 Create a set of elements in the sender system: edges with starting points or endpoints representing the Location Reference.
Step 2 For every element in this set: collect an individual set of potential candidates in the receiver system.
Step 3 For every element of this individual set of potential candidates: process the specific decision function.
Step 4 The candidate with the maximum value in this individual set is assumed to be the element in the receiver system that corresponds to the specific element in the sender system.
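The four steps can be sketched in Python as follows. The names `decode`, `find_candidates` and `decision_value` are hypothetical placeholders: how candidates are collected in the receiver system and which decision function is evaluated (for example Equation 7, 8 or 9) is left to the caller.

```python
from typing import Callable, Dict, Hashable, Iterable, Optional


def decode(
    sender_elements: Iterable[Hashable],
    find_candidates: Callable[[Hashable], Iterable[Hashable]],
    decision_value: Callable[[Hashable, Hashable], float],
) -> Dict[Hashable, Optional[Hashable]]:
    """Assign to every sender element the receiver candidate with the maximum value."""
    matches: Dict[Hashable, Optional[Hashable]] = {}
    for element in sender_elements:                  # Step 1: elements to transfer
        candidates = list(find_candidates(element))  # Step 2: candidate set
        if not candidates:
            matches[element] = None                  # nothing to decide on
            continue
        # Steps 3 and 4: evaluate the decision function for each candidate
        # and keep the one with the maximum value.
        matches[element] = max(
            candidates, key=lambda candidate: decision_value(element, candidate)
        )
    return matches
```

Passing the decision function as a parameter keeps the loop identical for the probability-based and the deterministic variant, so both can be run on the same candidate sets for the comparative calculation.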
The basic test concept consists of selecting an edge in the sender system, specifying a reference edge (representing the same object) in the target system and performing the decoding process in the receiver system. The list of edges to be transferred results from real live traffic situations in Stuttgart provided by Google Maps.
In practice, the starting or ending edge for the given traffic situation has been identified in the sender system, and the corresponding edge (reference edge) in the receiver system has been matched manually. This has been done for the transfer from TomTom to HERE as well as from HERE to TomTom, which results in two lists (test data) covering the same real live traffic situations. This makes it possible to compare the performance depending on the mapping direction. As a performance indicator, the hit rate h, i.e. the ratio of the number of correctly identified edges to all transferred edges, has been used (Demir, 2002):

$$h = \frac{\text{number of correctly identified edges}}{\text{number of transferred edges}}$$

In this context, correctly identified means that the identified edge in the receiver system is the same as the specified reference edge. For processing the evaluations, a module in the QGIS (v 2.18.13) processing toolbox has been developed which automatically performs the decoding process.
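The hit rate can be computed with a few lines of Python. This is a sketch with hypothetical names: both arguments are assumed to be mappings from a sender edge identifier to a receiver edge identifier, one produced by the decoding process and one holding the manually matched reference edges.

```python
from typing import Hashable, Mapping, Optional


def hit_rate(
    identified: Mapping[Hashable, Optional[Hashable]],
    reference: Mapping[Hashable, Hashable],
) -> float:
    """Share of transferred edges whose identified edge equals the reference edge."""
    if not reference:
        raise ValueError("reference must not be empty")
    hits = sum(
        1 for edge, expected in reference.items() if identified.get(edge) == expected
    )
    return hits / len(reference)
```

An edge for which no candidate was found counts as a miss, because `identified.get(edge)` then differs from the reference edge.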

Results
As mentioned before, two aspects of performance have been evaluated. First, the absolute performance of the probability-based main function using the full set of criteria as defined in Equation 7. Processing this for a list of n = 100 edges, the hit rates in Table 13 have been obtained. Second, the relative performance of the functions for comparative calculation (Equations 8 and 9) using the reduced set of criteria. With the same list of edges as above, the hit rates are as shown in Table 14 and Table 15.

Analysis and assessment of the results
Generally, the overall picture of the results shows an adequate performance of the probability-based approach, with hit rates sufficient for practical use. For the probability-based main function, hit rates of up to 90 % have been obtained, with no significant difference in performance between the two mapping directions. In detail, the list of correctly identified edges differs slightly between the two directions. Misses cannot be attributed to one specific criterion but usually result from the combination of different criteria.
In terms of relative performance, the probability-based algorithm performs significantly better than the deterministic one, on average by 12 percentage points. Again, no specific criterion accounts for the slight differences between the two directions. It is noticeable that the difference in hit rate between the full set of criteria (Table 13) and the reduced one (Table 14 and Table 15) is not significant. As more in-depth analyses have shown, this is due to the predominant impact of the geometrical criteria (covered by both sets) on the overall result. From this it can be concluded that no significant improvement may be expected from adding further criteria.

Conclusion
In this paper, a probability-based approach for a decision algorithm within a Location Referencing Method has been presented. After an overview of terms and methodology in the context of Location Referencing and of some mathematical principles, the specific decision functions used by the algorithm have been formulated and defined. For this, a set of criteria based on similarity measurements for geometrical, topological, syntactical and semantical attributes has been specified, and the corresponding probability distributions have been estimated. The performance of these functions has been validated and confirmed by experimental evaluations based on real live traffic situations covering two mapping directions. In conclusion, the probability-based algorithm delivers an adequate performance (up to 90% hit rate) and offers an improvement of 12 percentage points compared to the previous deterministic algorithm. Since the evaluation has been conducted for urban and urban hinterland scenarios, the number of use cases could be extended by other scenarios such as rural roads or motorways. Furthermore, elaborating other approaches to handle uncertainty would be beneficial, so a fuzzy-based approach is in progress.