Quizzed article | REF: H5012 V1

Unsupervised statistical machine learning

Author: Bruno SAUVALLE

Publication date: January 10, 2020, Review date: January 18, 2021

ABSTRACT

This article introduces the notion of unsupervised statistical machine learning, then describes the techniques currently available to perform statistical learning from unlabeled data: partitioning (or clustering), dimensionality reduction, density estimation and finally generative models. It covers the oldest classical algorithms (principal component analysis, k-means) as well as the most recent techniques using deep learning (word representations, autoregressive models, auto-encoders, generative adversarial networks).

AUTHOR

Bruno SAUVALLE: Chief Mining Engineer - Centre de Robotique, MINES ParisTech, Paris, France

INTRODUCTION

The aim of this article is to present methods and techniques for unsupervised statistical learning, i.e. using data that has not been labeled beforehand.

The notion of unsupervised statistical learning may seem difficult to grasp when compared with that of supervised statistical learning, which simply consists of learning a function f such that y = f(x) from a very large number of example pairs (x_i, y_i), where x_i is the input data and y_i is the output result, or label.

However, obtaining a labeled database is difficult and costly, as human intervention is generally required to obtain the labels y_i corresponding to the available data x_i. The creation of the ImageNet database, which currently contains over 14 million images and underlies the spectacular successes observed in image analysis in recent years, required many years and the work of several tens of thousands of "annotators" tasked with viewing images downloaded from the Internet and identifying the objects or animals present in them.

At the same time, the ever-decreasing cost of capturing, communicating, storing and processing data is making much larger databases available, whose exhaustive analysis by humans is clearly impossible.

In this context, unsupervised learning is currently being developed along two lines.

A first way of exploiting a data set statistically without human intervention is to try to learn the distribution of these data. By way of example, language models are programs often based on neural networks which, for a given language, seek to assign a probability, or likelihood value, to each sentence or group of sentences proposed to them. Among other things, this makes it possible to optimize speech recognition or translation software by avoiding proposing sentences that would be considered too unlikely in the language and context considered, for example if they are grammatically incorrect. The data used to build these language models are text corpora freely available on the Internet, and therefore require no particular annotation effort.
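To make the idea of assigning a likelihood to a sentence concrete, here is a minimal sketch of a bigram language model trained on unlabeled text. It is an illustration only: the article refers to models that are often based on neural networks, whereas this sketch swaps in simple bigram counts with add-one smoothing. The toy corpus and the function names `train_bigram_model` and `sentence_log_prob` are assumptions introduced for the example.

```python
import math
from collections import Counter


def train_bigram_model(corpus):
    """Count bigrams and their left contexts over raw, unlabeled sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in corpus:
        # Sentence boundary markers let the model score how sentences start and end.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def sentence_log_prob(sentence, model):
    """Return the log-probability of a sentence under the bigram model."""
    context_counts, bigram_counts, vocab = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    vocab_size = len(vocab)
    log_p = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one (Laplace) smoothing keeps unseen bigrams from having zero probability.
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        log_p += math.log(p)
    return log_p


# Hypothetical toy corpus standing in for freely available text.
corpus = [
    "the cat sits on the mat",
    "the dog sits on the rug",
    "a cat chases the dog",
]
model = train_bigram_model(corpus)

fluent = sentence_log_prob("the cat sits on the rug", model)
scrambled = sentence_log_prob("rug the on sits cat the", model)
print(fluent > scrambled)  # the well-formed word order scores higher
```

No labels are needed at any point: the only training signal is the text itself, which is what makes this an unsupervised use of the data.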

A second way of exploiting a large dataset is to use it to build a representation of this type of data, optimized for one or more classes of use. If the aim is simply to visualize data in the form of vectors comprising a large number of coordinates, a reduction in dimensionality to two or three dimensions would seem to be the obvious choice. If you are...
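As an illustration of reducing dimensionality for visualization, the following minimal sketch projects 10-dimensional vectors onto two dimensions with principal component analysis, one of the classical algorithms named in the abstract. It assumes NumPy is available; the synthetic data and the function name `pca_project` are hypothetical and serve only the example.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto their first principal components."""
    # PCA operates on centered data.
    X_centered = X - X.mean(axis=0)
    # The rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of the total variance captured by each retained component.
    explained = singular_values[:n_components] ** 2 / np.sum(singular_values ** 2)
    return X_centered @ components.T, explained


# Synthetic data: 200 points in 10 dimensions that lie close to a 2D plane.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

Z, explained = pca_project(X, n_components=2)
print(Z.shape)          # two coordinates per point, ready for a scatter plot
print(explained.sum())  # share of the variance kept by the projection
```

Because the synthetic points were generated from two latent coordinates, two components retain almost all of the variance. On real data the retained share is usually lower and is worth checking before interpreting the plot.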

KEYWORDS

clustering | dimensionality reduction | generative model

This article is included in

Control and systems engineering

Ongoing reading
Unsupervised statistical learning

Different types of learning

Software tools

For computations that do not involve deep learning and whose data volumes do not require distributed computing, the two reference software tools are scikit-learn and R.

The Spark MLlib library adapts the main machine learning algorithms (excluding deep learning) to a distributed environment, enabling the processing of very large volumes of data.

Deep...