EPages

EPages

ePages is an e-commerce software that allows merchants to create and run online shops in the cloud. The number of shops based on ePages is currently 140,000 worldwide. ePages software is regularly updated due to its Software-as-a-Service model. An investor in the company is United Internet, with a 25% stake. ePages focuses upon distributing its products mainly through hosting providers. ePages is headquartered in Hamburg, with additional offices Barcelona, Jena, and Bilbao. == History == The name ePages was used for the first time for software in 1997 to market "Intershop ePages". In 2002, the product line then called Intershop 4 was taken over by ePages GmbH and renamed to ePages. == Features == Depending on the ePages product and packages offered by hosting providers, merchants can sell up to an unlimited number of items. Users can offer their products and services in 15 languages and with all currencies. With ePages, merchants can use web marketing tools; e.g. newsletters, coupons or social media plug-ins for social commerce.

Integrated test facility

An integrated test facility (ITF) creates a fictitious entity in a database to process test transactions simultaneously with live input. ITF can be used to incorporate test transactions into a normal production run of a system. Its advantage is that periodic testing does not require separate test processes. However, careful planning is necessary, and test data must be isolated from production data. Moreover, ITF validates the correct operation of a transaction in an application, but it does not ensure that a system is being operated correctly. Integrated test facility is considered a useful audit tool during an IT audit because it uses the same programs to compare processing using independently calculated data. This involves setting up dummy entities on an application system and processing test or production data against the entity as a means of verifying processing accuracy.

Mean squared error

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the true value. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive (and not zero) is because of randomness or because the estimator does not account for information that could produce a more accurate estimate. In machine learning, specifically empirical risk minimization, MSE may refer to the empirical risk (the average loss on an observed data set), as an estimate of the true MSE (the true risk: the average loss on the actual population distribution). The MSE is a measure of the quality of an estimator. As it is derived from the square of Euclidean distance, it is always a positive value that decreases as the error approaches zero. The MSE is the second moment (about the origin) of the error, and thus incorporates both the variance of the estimator (how widely spread the estimates are from one data sample to another) and its bias (how far off the average estimated value is from the true value). For an unbiased estimator, the MSE is the variance of the estimator. Like the variance, MSE has the same units of measurement as the square of the quantity being estimated. In an analogy to standard deviation, taking the square root of MSE yields the root-mean-square error or root-mean-square deviation (RMSE or RMSD), which has the same units as the quantity being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known as the standard error. == Definition and basic properties == The MSE either assesses the quality of a predictor (i.e., a function mapping arbitrary inputs to a sample of values of some random variable), or of an estimator (i.e., a mathematical function mapping a sample of data to an estimate of a parameter of the population from which the data is sampled). In the context of prediction, understanding the prediction interval can also be useful as it provides a range within which a future observation will fall, with a certain probability. The definition of an MSE differs according to whether one is describing a predictor or an estimator. === Predictor === If a vector of n {\displaystyle n} predictions is generated from a sample of n {\displaystyle n} data points on all variables, and Y {\displaystyle Y} is the vector of observed values of the variable being predicted, with Y ^ {\displaystyle {\hat {Y}}} being the predicted values (e.g. as from a least-squares fit), then the within-sample MSE of the predictor is computed as MSE = 1 n ∑ i = 1 n ( Y i − Y i ^ ) 2 {\displaystyle \operatorname {MSE} ={\frac {1}{n}}\sum _{i=1}^{n}\left(Y_{i}-{\hat {Y_{i}}}\right)^{2}} In other words, the MSE is the mean ( 1 n ∑ i = 1 n ) {\textstyle \left({\frac {1}{n}}\sum _{i=1}^{n}\right)} of the squares of the errors ( Y i − Y i ^ ) 2 {\textstyle \left(Y_{i}-{\hat {Y_{i}}}\right)^{2}} . This is an easily computable quantity for a particular sample (and hence is sample-dependent). In matrix notation, MSE = 1 n ∑ i = 1 n ( e i ) 2 = 1 n e T e {\displaystyle \operatorname {MSE} ={\frac {1}{n}}\sum _{i=1}^{n}(e_{i})^{2}={\frac {1}{n}}\mathbf {e} ^{\mathsf {T}}\mathbf {e} } where e i {\displaystyle e_{i}} is Y i − Y i ^ {\displaystyle Y_{i}-{\hat {Y_{i}}}} and e {\displaystyle \mathbf {e} } is a n × 1 {\displaystyle n\times 1} column vector. The MSE can also be computed on q data points that were not used in estimating the model, either because they were held back for this purpose, or because these data have been newly obtained. Within this process, known as cross-validation, the MSE is often called the test MSE, and is computed as MSE = 1 q ∑ i = n + 1 n + q ( Y i − Y i ^ ) 2 {\displaystyle \operatorname {MSE} ={\frac {1}{q}}\sum _{i=n+1}^{n+q}\left(Y_{i}-{\hat {Y_{i}}}\right)^{2}} === Estimator === The MSE of an estimator θ ^ {\displaystyle {\hat {\theta }}} with respect to an unknown parameter θ {\displaystyle \theta } is defined as MSE ⁡ ( θ ^ ) = E θ ⁡ [ ( θ ^ − θ ) 2 ] . {\displaystyle \operatorname {MSE} ({\hat {\theta }})=\operatorname {E} _{\theta }\left[({\hat {\theta }}-\theta )^{2}\right].} This definition depends on the unknown parameter, therefore the MSE is a priori property of an estimator. The MSE could be a function of unknown parameters, in which case any estimator of the MSE based on estimates of these parameters would be a function of the data (and thus a random variable). If the estimator θ ^ {\displaystyle {\hat {\theta }}} is derived as a sample statistic and is used to estimate some population parameter, then the expectation is with respect to the sampling distribution of the sample statistic. The MSE can be written as the sum of the variance of the estimator and the squared bias of the estimator, providing a useful way to calculate the MSE and implying that in the case of unbiased estimators, the MSE and variance are equivalent. MSE ⁡ ( θ ^ ) = Var θ ⁡ ( θ ^ ) + Bias ⁡ ( θ ^ , θ ) 2 . {\displaystyle \operatorname {MSE} ({\hat {\theta }})=\operatorname {Var} _{\theta }({\hat {\theta }})+\operatorname {Bias} ({\hat {\theta }},\theta )^{2}.} ==== Proof of variance and bias relationship ==== MSE ⁡ ( θ ^ ) = E θ ⁡ [ ( θ ^ − θ ) 2 ] = E θ ⁡ [ ( θ ^ − E θ ⁡ [ θ ^ ] + E θ ⁡ [ θ ^ ] − θ ) 2 ] = E θ ⁡ [ ( θ ^ − E θ ⁡ [ θ ^ ] ) 2 + 2 ( θ ^ − E θ ⁡ [ θ ^ ] ) ( E θ ⁡ [ θ ^ ] − θ ) + ( E θ ⁡ [ θ ^ ] − θ ) 2 ] = E θ ⁡ [ ( θ ^ − E θ ⁡ [ θ ^ ] ) 2 ] + E θ ⁡ [ 2 ( θ ^ − E θ ⁡ [ θ ^ ] ) ( E θ ⁡ [ θ ^ ] − θ ) ] + E θ ⁡ [ ( E θ ⁡ [ θ ^ ] − θ ) 2 ] = E θ ⁡ [ ( θ ^ − E θ ⁡ [ θ ^ ] ) 2 ] + 2 ( E θ ⁡ [ θ ^ ] − θ ) E θ ⁡ [ θ ^ − E θ ⁡ [ θ ^ ] ] + ( E θ ⁡ [ θ ^ ] − θ ) 2 E θ ⁡ [ θ ^ ] − θ = constant = E θ ⁡ [ ( θ ^ − E θ ⁡ [ θ ^ ] ) 2 ] + 2 ( E θ ⁡ [ θ ^ ] − θ ) ( E θ ⁡ [ θ ^ ] − E θ ⁡ [ θ ^ ] ) + ( E θ ⁡ [ θ ^ ] − θ ) 2 E θ ⁡ [ θ ^ ] = constant = E θ ⁡ [ ( θ ^ − E θ ⁡ [ θ ^ ] ) 2 ] + ( E θ ⁡ [ θ ^ ] − θ ) 2 = Var θ ⁡ ( θ ^ ) + Bias θ ⁡ ( θ ^ , θ ) 2 {\displaystyle {\begin{aligned}\operatorname {MSE} ({\hat {\theta }})&=\operatorname {E} _{\theta }\left[({\hat {\theta }}-\theta )^{2}\right]\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]+\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}\right]\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}+2\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)+\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}\right]\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}\right]+\operatorname {E} _{\theta }\left[2\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)\right]+\operatorname {E} _{\theta }\left[\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}\right]\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}\right]+2\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)\operatorname {E} _{\theta }\left[{\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right]+\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}&&\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta ={\text{constant}}\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}\right]+2\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)+\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}&&\operatorname {E} _{\theta }[{\hat {\theta }}]={\text{constant}}\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}\right]+\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}\\&=\operatorname {Var} _{\theta }({\hat {\theta }})+\operatorname {Bias} _{\theta }({\hat {\theta }},\theta )^{2}\end{aligned}}} An even shorter proof can be achieved using the well-known formula that for a random variable X {\textstyle X} , E ( X 2 ) = Var ⁡ ( X ) + ( E ( X ) ) 2 {\textstyle \mathbb {E} (X^{2})=\operatorname {Var} (X)+(\mathbb {E} (X))^{2}} . By substituting X {\textstyle X} with, θ ^ − θ {\textstyle {\hat {\theta }}-\theta } , we have MSE ⁡ ( θ ^ ) = E [ ( θ ^ − θ ) 2 ] = Var ⁡ ( θ ^ − θ ) + ( E [ θ ^ − θ ] ) 2 = Var ⁡ ( θ ^ ) + Bias 2 ⁡ ( θ ^ , θ ) {\displaystyle {\begin{aligned}\operatorname {MSE} ({\hat {\theta }})&=\mathbb {E} [({\hat {\theta }}-\theta )^{2}]\\&=\operator

Mating pool

Mating pool is a concept used in evolutionary algorithms and means a population of parents for the next population. The mating pool is formed by candidate solutions that the selection operators deem to have the highest fitness in the current population. Solutions that are included in the mating pool are referred to as parents. Individual solutions can be repeatedly included in the mating pool, with individuals of higher fitness values having a higher chance of being included multiple times. Crossover operators are then applied to the parents, resulting in recombination of genes recognized as superior. Lastly, random changes in the genes are introduced through mutation operators, increasing the genetic variation in the gene pool. Those two operators improve the chance of creating new, superior solutions. A new generation of solutions is thereby created, the children, who will constitute the next population. Depending on the selection method, the total number of parents in the mating pool can be different to the size of the initial population, resulting in a new population that’s smaller. To continue the algorithm with an equally sized population, random individuals from the old populations can be chosen and added to the new population. At this point, the fitness value of the new solutions is evaluated. If the termination conditions are fulfilled, processes come to an end. Otherwise, they are repeated. The repetition of the steps result in candidate solutions that evolve towards the most optimal solution over time. The genes will become increasingly uniform towards the most optimal gene, a process called convergence. If 95% of the population share the same version of a gene, the gene has converged. When all the individual fitness values have reached the value of the best individual, i.e. all the genes have converged, population convergence is achieved. == Mating pool creation == Several methods can be applied to create a mating pool. All of these processes involve the selective breeding of a particular number of individuals within a population. There are multiple criteria that can be employed to determine which individuals make it into the mating pool and which are left behind. The selection methods can be split into three general types: fitness proportionate selection, ordinal based selection and threshold based selection. === Fitness proportionate selection === In the case of fitness proportionate selection, random individuals are selected to enter the pool. However, the ones with a higher level of fitness are more likely to be picked and therefore have a greater chance of passing on their features to the next generation. One of the techniques used in this type of parental selection is the roulette wheel selection. This approach divides a hypothetical circular wheel into different slots, the size of which is equal to the fitness values of each potential candidate. Afterwards, the wheel is rotated and a fixed point determines which individual gets picked. The greater the fitness value of an individual, the higher the probability of being chosen as a parent by the random spin of the wheel. Alternatively, stochastic universal sampling can be implemented. This selection method is also based on the rotation of a spinning wheel. However, in this case there is more than one fixed point and as a result all of the mating pool members will be selected simultaneously. === Ordinal based selection === The ordinal based selection methods include the tournament and ranking selection. Tournament selection involves the random selection of individuals of a population and the subsequent comparison of their fitness levels. The winners of these “tournaments” are the ones with the highest values and will be put into the mating pool as parents. In ranking selection all the individuals are sorted based on their fitness values. Then, the selection of the parents is made according to the rank of the candidates. Every individual has a chance of being chosen, but higher ranked ones are favored === Threshold based selection === The last type of selection method is referred to as the threshold based method. This includes the truncation selection method, which sorts individuals based on their phenotypic values on a specific trait and later selects the proportion of them that are within a certain threshold as parents.

IDistance

In pattern recognition, iDistance is an indexing and query processing technique for k-nearest neighbor queries on point data in multi-dimensional metric spaces. The kNN query is one of the hardest problems on multi-dimensional data, especially when the dimensionality of the data is high. iDistance is designed to process kNN queries in high-dimensional spaces efficiently and performs extremely well for skewed data distributions, which usually occur in real-life data sets. iDistance employs a two-phase search strategy involving an initial filtering of candidate regions and a subsequent refinement of results, an approach aligned with the Filter and Refine Principle (FRP). This means that the index first prunes the search space to eliminate unlikely candidates, then verifies the true nearest neighbors in a refinement step, following the general FRP paradigm used in database search algorithms. The iDistance index can also be augmented with machine learning models to learn data distributions for improved searching and storage of multi-dimensional data. == Indexing == Building the iDistance index has two steps: A number of reference points in the data space are chosen. There are various ways of choosing reference points. Using cluster centers as reference points is the most efficient way. The data points are partitioned into Voronoi cells based on well-chosen reference points. The distance between a data point and its closest reference point is calculated. This distance plus a scaling value is called the point's iDistance. By this means, points in a multi-dimensional space are mapped to one-dimensional values, and then a B+-tree can be adopted to index the points using the iDistance as the key. The figure on the right shows an example where three reference points (O1, O2, O3) are chosen. The data points are then mapped to a one-dimensional space and indexed in a B+-tree. Various extensions have been proposed to make the selection of reference points for effective query performance, including employing machine learning to learn the identification of reference points. == Query processing == To process a kNN query, the query is mapped to a number of one-dimensional range queries, which can be processed efficiently on a B+-tree. In the above figure, the query Q is mapped to a value in the B+-tree while the kNN search ``sphere" is mapped to a range in the B+-tree. The search sphere expands gradually until the k NNs are found. This corresponds to gradually expanding range searches in the B+-tree. The iDistance technique can be viewed as a way of accelerating the sequential scan. Instead of scanning records from the beginning to the end of the data file, the iDistance starts the scan from spots where the nearest neighbors can be obtained early with a very high probability. == Applications == The iDistance has been used in many applications including Image retrieval Video indexing Similarity search in P2P systems Mobile computing Recommender system == Historical background == The iDistance was first proposed by Cui Yu, Beng Chin Ooi, Kian-Lee Tan and H. V. Jagadish in 2001. Later, together with Rui Zhang, they improved the technique and performed a more comprehensive study on it in 2005.

Solomonoff's theory of inductive inference

Solomonoff's theory of inductive inference proves that, under its common sense assumptions (axioms), the best possible scientific model is the shortest algorithm that generates the empirical data under consideration. In addition to the choice of data, other assumptions are that, to avoid the post-hoc fallacy, the programming language must be chosen prior to the data and that the environment being observed is generated by an unknown algorithm. This is also called a theory of induction. Due to its basis in the dynamical (state-space model) character of Algorithmic Information Theory, it encompasses statistical as well as dynamical information criteria for model selection. It was introduced by Ray Solomonoff, based on probability theory and theoretical computer science. In essence, Solomonoff's induction derives the posterior probability of any computable theory, given a sequence of observed data. This posterior probability is derived from Bayes' rule and some universal prior, that is, a prior that assigns a positive probability to any computable theory. Solomonoff proved that this induction is incomputable (or more precisely, lower semi-computable), but noted that "this incomputability is of a very benign kind", and that it "in no way inhibits its use for practical prediction" (as it can be approximated from below more accurately with more computational resources). It is only "incomputable" in the benign sense that no scientific consensus is able to prove that the best current scientific theory is the best of all possible theories. However, Solomonoff's theory does provide an objective criterion for deciding among the current scientific theories explaining a given set of observations. Solomonoff's induction naturally formalizes Occam's razor by assigning larger prior credences to theories that require a shorter algorithmic description. == Origin == === Philosophical === The theory is based in philosophical foundations, and was founded by Ray Solomonoff around 1960. It is a mathematically formalized combination of Occam's razor and the Principle of Multiple Explanations. All computable theories which perfectly describe previous observations are used to calculate the probability of the next observation, with more weight put on the shorter computable theories. Marcus Hutter's universal artificial intelligence builds upon this to calculate the expected value of an action. === Principle === Solomonoff's induction has been argued to be the computational formalization of pure Bayesianism. To understand, recall that Bayesianism derives the posterior probability P [ T | D ] {\displaystyle \mathbb {P} [T|D]} of a theory T {\displaystyle T} given data D {\displaystyle D} by applying Bayes rule, which yields P [ T | D ] = P [ D | T ] P [ T ] P [ D | T ] P [ T ] + ∑ A ≠ T P [ D | A ] P [ A ] {\displaystyle \mathbb {P} [T|D]={\frac {\mathbb {P} [D|T]\mathbb {P} [T]}{\mathbb {P} [D|T]\mathbb {P} [T]+\sum _{A\neq T}\mathbb {P} [D|A]\mathbb {P} [A]}}} where theories A {\displaystyle A} are alternatives to theory T {\displaystyle T} . For this equation to make sense, the quantities P [ D | T ] {\displaystyle \mathbb {P} [D|T]} and P [ D | A ] {\displaystyle \mathbb {P} [D|A]} must be well-defined for all theories T {\displaystyle T} and A {\displaystyle A} . In other words, any theory must define a probability distribution over observable data D {\displaystyle D} . Solomonoff's induction essentially boils down to demanding that all such probability distributions be computable. Interestingly, the set of computable probability distributions is a subset of the set of all programs, which is countable. Similarly, the sets of observable data considered by Solomonoff were finite. Without loss of generality, we can thus consider that any observable data is a finite bit string. As a result, Solomonoff's induction can be defined by only invoking discrete probability distributions. Solomonoff's induction then allows to make probabilistic predictions of future data F {\displaystyle F} , by simply obeying the laws of probability. Namely, we have P [ F | D ] = E T [ P [ F | T , D ] ] = ∑ T P [ F | T , D ] P [ T | D ] {\displaystyle \mathbb {P} [F|D]=\mathbb {E} _{T}[\mathbb {P} [F|T,D]]=\sum _{T}\mathbb {P} [F|T,D]\mathbb {P} [T|D]} . This quantity can be interpreted as the average predictions P [ F | T , D ] {\displaystyle \mathbb {P} [F|T,D]} of all theories T {\displaystyle T} given past data D {\displaystyle D} , weighted by their posterior credences P [ T | D ] {\displaystyle \mathbb {P} [T|D]} . === Mathematical === The proof of the "razor" is based on the known mathematical properties of a probability distribution over a countable set. These properties are relevant because the infinite set of all programs is a denumerable set. The sum S of the probabilities of all programs must be exactly equal to one (as per the definition of probability) thus the probabilities must roughly decrease as we enumerate the infinite set of all programs, otherwise S will be strictly greater than one. To be more precise, for every ϵ {\displaystyle \epsilon } > 0, there is some length l such that the probability of all programs longer than l is at most ϵ {\displaystyle \epsilon } . This does not, however, preclude very long programs from having very high probability. Fundamental ingredients of the theory are the concepts of algorithmic probability and Kolmogorov complexity. The universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion. == Mathematical guarantees == === Solomonoff's completeness === The remarkable property of Solomonoff's induction is its completeness. In essence, the completeness theorem guarantees that the expected cumulative errors made by the predictions based on Solomonoff's induction are upper-bounded by the Kolmogorov complexity of the (stochastic) data generating process. The errors can be measured using the Kullback–Leibler divergence or the square of the difference between the induction's prediction and the probability assigned by the (stochastic) data generating process. === Solomonoff's uncomputability === Unfortunately, Solomonoff also proved that Solomonoff's induction is uncomputable. In fact, he showed that computability and completeness are mutually exclusive: any complete theory must be uncomputable. The proof of this is derived from a game between the induction and the environment. Essentially, any computable induction can be tricked by a computable environment, by choosing the computable environment that negates the computable induction's prediction. This fact can be regarded as an instance of the no free lunch theorem. == Modern applications == === Artificial intelligence === Though Solomonoff's inductive inference is not computable, several AIXI-derived algorithms approximate it in order to make it run on a modern computer. The more computing power they are given, the closer their predictions are to the predictions of inductive inference (their mathematical limit is Solomonoff's inductive inference). Another direction of inductive inference is based on E. Mark Gold's model of learning in the limit from 1967 and has developed since then more and more models of learning. The general scenario is the following: Given a class S of computable functions, is there a learner (that is, recursive functional) which for any input of the form (f(0),f(1),...,f(n)) outputs a hypothesis (an index e with respect to a previously agreed on acceptable numbering of all computable functions; the indexed function may be required consistent with the given values of f). A learner M learns a function f if almost all its hypotheses are the same index e, which generates the function f; M learns S if M learns every f in S. Basic results are that all recursively enumerable classes of functions are learnable while the class REC of all computable functions is not learnable. Many related models have been considered and also the learning of classes of recursively enumerable sets from positive data is a topic studied from Gold's pioneering paper in 1967 onwards. A far reaching extension of the Gold’s approach is developed by Schmidhuber's theory of generalized Kolmogorov complexities, which are kinds of super-recursive algorithms.

Large margin nearest neighbor

Large margin nearest neighbor (LMNN) classification is a statistical machine learning algorithm for metric learning. It learns a pseudometric designed for k-nearest neighbor classification. The algorithm is based on semidefinite programming, a sub-class of convex optimization. The goal of supervised learning (more specifically classification) is to learn a decision rule that can categorize data instances into pre-defined classes. The k-nearest neighbor rule assumes a training data set of labeled instances (i.e. the classes are known). It classifies a new data instance with the class obtained from the majority vote of the k closest (labeled) training instances. Closeness is measured with a pre-defined metric. Large margin nearest neighbors is an algorithm that learns this global (pseudo-)metric in a supervised fashion to improve the classification accuracy of the k-nearest neighbor rule. == Setup == The main intuition behind LMNN is to learn a pseudometric under which all data instances in the training set are surrounded by at least k instances that share the same class label. If this is achieved, the leave-one-out error (a special case of cross validation) is minimized. Let the training data consist of a data set D = { ( x → 1 , y 1 ) , … , ( x → n , y n ) } ⊂ R d × C {\displaystyle D=\{({\vec {x}}_{1},y_{1}),\dots ,({\vec {x}}_{n},y_{n})\}\subset R^{d}\times C} , where the set of possible class categories is C = { 1 , … , c } {\displaystyle C=\{1,\dots ,c\}} . The algorithm learns a pseudometric of the type d ( x → i , x → j ) = ( x → i − x → j ) ⊤ M ( x → i − x → j ) {\displaystyle d({\vec {x}}_{i},{\vec {x}}_{j})=({\vec {x}}_{i}-{\vec {x}}_{j})^{\top }\mathbf {M} ({\vec {x}}_{i}-{\vec {x}}_{j})} . For d ( ⋅ , ⋅ ) {\displaystyle d(\cdot ,\cdot )} to be well defined, the matrix M {\displaystyle \mathbf {M} } needs to be positive semi-definite. The Euclidean metric is a special case, where M {\displaystyle \mathbf {M} } is the identity matrix. This generalization is often (falsely) referred to as Mahalanobis metric. Figure 1 illustrates the effect of the metric under varying M {\displaystyle \mathbf {M} } . The two circles show the set of points with equal distance to the center x → i {\displaystyle {\vec {x}}_{i}} . In the Euclidean case this set is a circle, whereas under the modified (Mahalanobis) metric it becomes an ellipsoid. The algorithm distinguishes between two types of special data points: target neighbors and impostors. === Target neighbors === Target neighbors are selected before learning. Each instance x → i {\displaystyle {\vec {x}}_{i}} has exactly k {\displaystyle k} different target neighbors within D {\displaystyle D} , which all share the same class label y i {\displaystyle y_{i}} . The target neighbors are the data points that should become nearest neighbors under the learned metric. Let us denote the set of target neighbors for a data point x → i {\displaystyle {\vec {x}}_{i}} as N i {\displaystyle N_{i}} . === Impostors === An impostor of a data point x → i {\displaystyle {\vec {x}}_{i}} is another data point x → j {\displaystyle {\vec {x}}_{j}} with a different class label (i.e. y i ≠ y j {\displaystyle y_{i}\neq y_{j}} ) which is one of the nearest neighbors of x → i {\displaystyle {\vec {x}}_{i}} . During learning the algorithm tries to minimize the number of impostors for all data instances in the training set. == Algorithm == Large margin nearest neighbors optimizes the matrix M {\displaystyle \mathbf {M} } with the help of semidefinite programming. The objective is twofold: For every data point x → i {\displaystyle {\vec {x}}_{i}} , the target neighbors should be close and the impostors should be far away. Figure 1 shows the effect of such an optimization on an illustrative example. The learned metric causes the input vector x → i {\displaystyle {\vec {x}}_{i}} to be surrounded by training instances of the same class. If it was a test point, it would be classified correctly under the k = 3 {\displaystyle k=3} nearest neighbor rule. The first optimization goal is achieved by minimizing the average distance between instances and their target neighbors ∑ i , j ∈ N i d ( x → i , x → j ) {\displaystyle \sum _{i,j\in N_{i}}d({\vec {x}}_{i},{\vec {x}}_{j})} . The second goal is achieved by penalizing distances to impostors x → l {\displaystyle {\vec {x}}_{l}} that are less than one unit further away than target neighbors x → j {\displaystyle {\vec {x}}_{j}} (and therefore pushing them out of the local neighborhood of x → i {\displaystyle {\vec {x}}_{i}} ). The resulting value to be minimized can be stated as: ∑ i , j ∈ N i , l , y l ≠ y i [ d ( x → i , x → j ) + 1 − d ( x → i , x → l ) ] + {\displaystyle \sum _{i,j\in N_{i},l,y_{l}\neq y_{i}}[d({\vec {x}}_{i},{\vec {x}}_{j})+1-d({\vec {x}}_{i},{\vec {x}}_{l})]_{+}} With a hinge loss function [ ⋅ ] + = max ( ⋅ , 0 ) {\textstyle [\cdot ]_{+}=\max(\cdot ,0)} , which ensures that impostor proximity is not penalized when outside the margin. The margin of exactly one unit fixes the scale of the matrix M {\displaystyle M} . Any alternative choice c > 0 {\displaystyle c>0} would result in a rescaling of M {\displaystyle M} by a factor of 1 / c {\displaystyle 1/c} . The final optimization problem becomes: min M ∑ i , j ∈ N i d ( x → i , x → j ) + λ ∑ i , j , l ξ i j l {\displaystyle \min _{\mathbf {M} }\sum _{i,j\in N_{i}}d({\vec {x}}_{i},{\vec {x}}_{j})+\lambda \sum _{i,j,l}\xi _{ijl}} ∀ i , j ∈ N i , l , y l ≠ y i {\displaystyle \forall _{i,j\in N_{i},l,y_{l}\neq y_{i}}} d ( x → i , x → j ) + 1 − d ( x → i , x → l ) ≤ ξ i j l {\displaystyle d({\vec {x}}_{i},{\vec {x}}_{j})+1-d({\vec {x}}_{i},{\vec {x}}_{l})\leq \xi _{ijl}} ξ i j l ≥ 0 {\displaystyle \xi _{ijl}\geq 0} M ⪰ 0 {\displaystyle \mathbf {M} \succeq 0} The hyperparameter λ > 0 {\textstyle \lambda >0} is some positive constant (typically set through cross-validation). Here the variables ξ i j l {\displaystyle \xi _{ijl}} (together with two types of constraints) replace the term in the cost function. They play a role similar to slack variables to absorb the extent of violations of the impostor constraints. The last constraint ensures that M {\displaystyle \mathbf {M} } is positive semi-definite. The optimization problem is an instance of semidefinite programming (SDP). Although SDPs tend to suffer from high computational complexity, this particular SDP instance can be solved very efficiently due to the underlying geometric properties of the problem. In particular, most impostor constraints are naturally satisfied and do not need to be enforced during runtime (i.e. the set of variables ξ i j l {\displaystyle \xi _{ijl}} is sparse). A particularly well suited solver technique is the working set method, which keeps a small set of constraints that are actively enforced and monitors the remaining (likely satisfied) constraints only occasionally to ensure correctness. == Extensions and efficient solvers == LMNN was extended to multiple local metrics in the 2008 paper. This extension significantly improves the classification error, but involves a more expensive optimization problem. In their 2009 publication in the Journal of Machine Learning Research, Weinberger and Saul derive an efficient solver for the semi-definite program. It can learn a metric for the MNIST handwritten digit data set in several hours, involving billions of pairwise constraints. An open source Matlab implementation is freely available at the authors web page. Kumal et al. extended the algorithm to incorporate local invariances to multivariate polynomial transformations and improved regularization.