Overview
ABSTRACT
This article introduces the notion of unsupervised statistical machine learning, then describes the techniques currently available to perform statistical learning from unlabeled data: partitioning (or clustering), dimensionality reduction, density estimation and finally generative models. It covers the oldest classical algorithms (principal component analysis, k-means) as well as the most recent techniques using deep learning (word representations, autoregressive models, auto-encoders, generative adversarial networks).
AUTHOR
Bruno SAUVALLE: Chief Mining Engineer, Centre de Robotique, MINES ParisTech, Paris, France
INTRODUCTION
The aim of this article is to present methods and techniques for unsupervised statistical learning, i.e. using data that has not been labeled beforehand.
The notion of unsupervised statistical learning may seem difficult to grasp when compared with that of supervised statistical learning, which simply consists of learning a function f such that y = f(x) from a very large number of example pairs (x_i, y_i), where x_i is the input data and y_i is the output result, or label.
However, obtaining a labeled database is difficult and costly, as human intervention is generally required to obtain the labels y_i corresponding to the available data x_i. The creation of the ImageNet database, which currently contains over 14 million images and is the source of the spectacular successes observed in image analysis in recent years, required many years and the intervention of several tens of thousands of "annotators" tasked with viewing images downloaded from the Internet and identifying the objects or animals present in them.
At the same time, the ever-decreasing cost of capturing, communicating, storing and processing data is naturally leading to the availability of much larger databases, whose exhaustive analysis by humans is clearly impossible.
In this context, unsupervised learning is currently being developed along two lines.
A first way of exploiting a data set statistically without human intervention is to try to learn the distribution of these data. By way of example, language models are programs often based on neural networks which, for a given language, seek to assign a probability, or likelihood value, to each sentence or group of sentences proposed to them. Among other things, this makes it possible to optimize speech recognition or translation software by avoiding proposing sentences that would be considered too unlikely in the language and context considered, for example if they are grammatically incorrect. The data used to build these language models are text corpora freely available on the Internet, and therefore require no particular annotation effort.
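To make the idea of assigning a likelihood value to a sentence concrete, here is a minimal sketch in Python. It is an assumption-laden illustration and not the article's method: it uses a count-based bigram model with add-one smoothing instead of a neural network, and the three-sentence corpus and the function names train_bigram_model and sentence_log_prob are hypothetical.

```python
import math
from collections import Counter

def train_bigram_model(corpus):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (previous word, word) pairs
    return contexts, bigrams, vocab

def sentence_log_prob(sentence, model):
    """Log-probability of a sentence under the bigram model (add-one smoothing)."""
    contexts, bigrams, vocab = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    log_p = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        # Smoothed estimate of P(word | prev); unseen pairs get a small nonzero mass.
        log_p += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab)))
    return log_p

# Hypothetical unlabeled corpus: raw text only, no annotation needed.
corpus = [
    "the cat sleeps on the mat",
    "the dog sleeps on the sofa",
    "a cat chases the dog",
]
model = train_bigram_model(corpus)

grammatical = sentence_log_prob("the cat sleeps on the sofa", model)
scrambled = sentence_log_prob("sofa the on sleeps cat the", model)
print(grammatical > scrambled)  # the well-formed word order scores higher
```

A speech recognition or translation system could use such scores to rank candidate sentences, discarding those the model considers too unlikely.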
A second way of exploiting a large dataset is to use it to build a representation of this type of data, optimized for one or more classes of use. If the aim is simply to visualize data in the form of vectors comprising a large number of coordinates, a reduction in dimensionality to two or three dimensions would seem to be the obvious choice. If you are...
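The reduction to two dimensions mentioned above can be sketched with principal component analysis, one of the classical algorithms named in the abstract. The following is a minimal pure-Python illustration under stated assumptions: the data set is hypothetical, the helper names are invented for this example, and the eigenvectors are obtained by power iteration with deflation, which is only one of several possible ways to compute them.

```python
import math
import random

def center(data):
    """Subtract the per-coordinate mean from every point."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]

def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenvector(matrix, iterations=500, seed=0):
    """Dominant eigenpair of a symmetric matrix by power iteration."""
    d = len(matrix)
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * mv[i] for i in range(d))
    return eigenvalue, v

def pca_project(data, n_components=2):
    """Project the data onto its first principal components."""
    centered = center(data)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        eigenvalue, v = top_eigenvector(cov)
        components.append(v)
        # Deflation: remove the direction just found before searching for the next one.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]

# Hypothetical four-dimensional points to be visualized in two dimensions.
data = [
    [1.0, 1.1, 0.5, 0.01],
    [2.0, 1.9, -0.5, 0.02],
    [3.0, 3.2, 0.4, 0.00],
    [4.0, 3.9, -0.6, 0.01],
    [5.0, 5.1, 0.6, 0.02],
    [6.0, 5.8, -0.4, 0.00],
]
projected = pca_project(data, n_components=2)
for point in projected:
    print([round(x, 3) for x in point])
```

Each input point is mapped to a pair of coordinates that can be plotted directly; the first coordinate carries the largest share of the variance in the data.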
KEYWORDS
clustering | dimensionality reduction | generative model
This article is included in
Software technologies and System architectures
Unsupervised statistical learning
Bibliography
Software tools
For computations that do not involve deep learning and that deal with data volumes that do not require distributed computing, the two reference software tools are scikit-learn and R.
The Spark MLlib library adapts the main machine learning algorithms (excluding deep learning) to a distributed environment, enabling the processing of very large volumes of data.
Deep...
Events
Annual conferences:
International Conference on Learning Representations (https://iclr.cc/)
Conference on Neural Information Processing Systems (https://nips.cc/)
Conference on Computer Vision and...