Stochastic Deep Networks

Gwendoline de Bie
DMA, ENS
gwendoline.de.bie@ens.fr

Gabriel Peyré
CNRS, DMA, ENS
gabriel.peyre@ens.fr

Marco Cuturi
Google Brain, CREST ENSAE
[email protected]

February 21, 2019

Abstract

Machine learning is increasingly targeting areas where input data cannot be accurately described by a single vector, but can be modeled instead using the more flexible concept of random vectors, namely probability measures or, more simply, point clouds of varying cardinality. Using deep architectures on measures poses, however, many challenging issues. Indeed, deep architectures are originally designed to handle fixed-length vectors, or, using recursive mechanisms, ordered sequences thereof. In sharp contrast, measures describe a varying number of weighted observations with no particular order. We propose in this work a deep framework designed to handle crucial aspects of measures, namely permutation invariances, variations in weights and cardinality. Architectures derived from this pipeline can (i) map measures to measures, using the concept of push-forward operators; (ii) bridge the gap between measures and Euclidean spaces, through integration steps. This makes it possible to design discriminative networks (to classify or reduce the dimensionality of input measures), generative architectures (to synthesize measures) and recurrent pipelines (to predict measure dynamics). We provide a theoretical analysis of these building blocks, review our architectures' approximation abilities and robustness w.r.t. perturbations, and try them on various discriminative and generative tasks.

1 Introduction

Deep networks can now handle increasingly complex structured data types, starting historically from images (Krizhevsky et al., 2012) and speech (Hinton et al., 2012) to deal now with shapes (Wu et al., 2015a), sounds (Lee et al., 2009), texts (LeCun et al., 1998) or graphs (Henaff et al., 2015). In each of these applications, deep networks rely on the composition of several elementary functions, whose tensorized operations stream well on GPUs, and whose computational graphs can be easily and automatically differentiated through back-propagation. Initially designed for vectorial features, their extension to sequences of vectors using recurrent mechanisms (Hochreiter and Schmidhuber, 1997) had an enormous impact.

Our goal is to devise neural architectures that can handle probability distributions under any of their usual forms: as discrete measures supported on (possibly weighted) point clouds, or densities one can sample from. Such probability distributions are challenging to handle using recurrent networks because no order between observations can be used to treat them recursively (although some adjustments can be made, as discussed in Vinyals et al. 2016) and because, in the discrete case, their size may vary across observations. There is, however, a strong incentive to define neural architectures that can handle distributions as inputs or outputs. This is particularly evident in computer vision, where the naive representation of complex 3D objects as vectors in spatial grids is often too costly memory-wise, leads to a loss in detail, destroys topology and is blind to relevant invariances such as shape deformations. These issues were successfully tackled in a string of papers well adapted to such 3D settings (Qi et al., 2016, 2017; Fan et al., 2016). In other cases, ranging from physics (Godin et al., 2007), biology (Grover et al., 2011), ecology (Tereshko, 2000) to census data (Guckenheimer et al., 1977), populations cannot be followed at an individual level due to experimental costs or privacy concerns. In such settings where only macroscopic states are available, densities appear as the right object to perform inference tasks.

1.1 Previous works

Specificities of Probability Distributions. Data described in point clouds or sampled i.i.d. from a density are given unordered. Therefore, architectures dealing with them are expected to be permutation invariant; they are also often expected to be equivariant to geometric transformations of input points (translations, rotations) and to capture local structures of points.

Permutation invariance or equivariance (Ravanbakhsh et al., 2016; Zaheer et al., 2017), or equivariance with respect to general groups of transformations (Gens and Domingos, 2014; Cohen and Welling, 2016; Ravanbakhsh et al., 2017), have been characterized, but without tackling the issue of locality. Pairwise interactions (Chen et al., 2014; Cheng et al., 2016; Guttenberg et al., 2016) are appealing and helpful in building permutation-equivariant layers handling local information. Other strategies consist in augmenting the training data with all permutations or finding its best ordering (Vinyals et al., 2015). (Qi et al., 2016, 2017) are closer to our work in the sense that they combine the search for local features with permutation invariance, achieved by max pooling.

(Point) Sets vs. Probability (Distributions). An important distinction should be made between point sets and point clouds, which usually stand for discrete probability measures with uniform masses. The natural topology of (point) sets is the Hausdorff distance. That distance is very different from the natural topology for probability distributions, that of convergence in law, a.k.a. the weak* topology of measures. The latter is metrized (among other metrics) by the Wasserstein (optimal transport) distance, which plays a key role in our work. This distinction between sets and probability distributions is crucial, because the architectures we propose here are designed to capture, stably and efficiently, the regularity of maps to be learned with respect to the convergence in law. Note that this is a crucial distinction between our work and that proposed in PointNet (Qi et al., 2016) and PointNet++ (Qi et al., 2017), which are designed to be smooth and efficient architectures for the Hausdorff topology of point sets. Indeed, they are not continuous for the topology of measures (because of the max-pooling step) and cannot approximate efficiently maps which are smooth (e.g. Lipschitz) for the Wasserstein distance.

Centrality of optimal transport. The Wasserstein distance plays a central role in our architectures that are able to handle measures. Optimal transport has recently gained popularity in machine learning due to fast approximations, which are typically obtained using strongly-convex regularizers such as the entropy (Cuturi, 2013). The benefits of this regularization paved the way to the use of OT in various settings (Courty et al., 2017; Rolet et al., 2016; Huang et al., 2016). Although Wasserstein metrics have long been considered for inference purposes (Bassetti et al., 2006), their introduction in deep learning architectures is fairly recent, whether it be for generative tasks (Bernton et al., 2017; Arjovsky et al., 2017; Genevay et al., 2018) or regression purposes (Frogner et al., 2015; Hashimoto et al., 2016). The purpose of our work is to provide an extension of these works, to ensure that deep architectures can be used at a granular level on measures directly. In particular, our work shares some of the goals laid out in (Hashimoto et al., 2016), which considers recurrent architectures for measures (a special case of our framework). The most salient distinction with respect to our work is that our building blocks take into account multiple interactions between samples from the distributions, while their architecture has no interaction but takes into account diffusion through the injection of random noise.

1.2 Contributions

In this paper, we design deep architectures that can (i) map measures to measures and (ii) bridge the gap between measures and Euclidean spaces. They can thus accept as input, for instance, discrete distributions supported on (weighted) point clouds with an arbitrary number of points, can generate point clouds with an arbitrary number of points (arbitrarily refined resolution) and are naturally invariant to permutations in the ordering of the support of the measure. The mathematical idealizations of these architectures are infinite-dimensional by nature, and they can be computed numerically either by sampling (Lagrangian mode) or by density discretization (Eulerian mode). The Eulerian mode resembles classical convolutional deep networks, while the Lagrangian mode, which we focus on, defines a new class of deep neural models. Our first contribution is to detail this new framework for supervised and unsupervised learning problems over probability measures, making a clear connection with the idea of iterative transformations of random vectors. These architectures are based on two simple building blocks: interaction functionals and self-tensorization. This machine learning pipeline works hand-in-hand with the use of optimal transport, both as a mathematical performance criterion (to evaluate smoothness and approximation power of these models) and as a loss functional for both supervised and unsupervised learning. Our second contribution is theoretical: we prove both quantitative Lipschitz robustness of these architectures for the topology of the convergence in law and universal approximation power. Our last contribution is a showcase of several instantiations of such deep stochastic networks for classification (mapping measures to vectorial features), generation (mapping back and forth between measures and code vectors) and prediction (mapping measures to measures, which can be integrated in a recurrent network).

1.3 Notations

We denote $X \in \mathcal{R}(\mathbb{R}^q)$ a random vector on $\mathbb{R}^q$ and $\alpha_X \in \mathcal{M}^1_+(\mathbb{R}^q)$ its law, which is a positive Radon measure with unit mass. It satisfies, for any continuous map $f \in \mathcal{C}(\mathbb{R}^q)$, $\mathbb{E}(f(X)) = \int_{\mathbb{R}^q} f(x)\,\mathrm{d}\alpha_X(x)$. Its expectation is denoted $\mathbb{E}(X) = \int_{\mathbb{R}^q} x\,\mathrm{d}\alpha_X(x) \in \mathbb{R}^q$. We denote $\mathcal{C}(\mathbb{R}^q;\mathbb{R}^r)$ the space of continuous functions from $\mathbb{R}^q$ to $\mathbb{R}^r$ and $\mathcal{C}(\mathbb{R}^q) \stackrel{\text{def.}}{=} \mathcal{C}(\mathbb{R}^q;\mathbb{R})$. In this paper, we focus on the law of a random vector, so that two vectors $X$ and $X'$ having the same law (denoted $X \sim X'$) are considered indistinguishable.

2 Stochastic Deep Architectures

In this section, we define elementary blocks, mapping random vectors to random vectors, which constitute a layer of our proposed architectures, and depict how they can be used to build deeper networks.

2.1 Elementary Blocks

Our deep architectures are defined by stacking a succession of simple elementary blocks that we now define.

Definition 1 (Elementary Block). Given a function $f : \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}^r$, its associated elementary block $T_f : \mathcal{R}(\mathbb{R}^q) \to \mathcal{R}(\mathbb{R}^r)$ is defined as
$$\forall X \in \mathcal{R}(\mathbb{R}^q), \quad T_f(X) \stackrel{\text{def.}}{=} \mathbb{E}_{X' \sim X}\big(f(X, X')\big), \tag{1}$$
where $X'$ is a random vector independent from $X$ having the same distribution.

Remark 1 (Discrete random vectors). A particular instance, which is the setting we use in our numerical experiments, is when $X$ is distributed uniformly on a set $(x_i)_{i=1}^n$ of $n$ points, i.e. when $\alpha_X = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$. In this case, $Y = T_f(X)$ is also distributed on $n$ points,
$$\alpha_Y = \frac{1}{n}\sum_{i=1}^n \delta_{y_i} \quad \text{where} \quad y_i = \frac{1}{n}\sum_{j=1}^n f(x_i, x_j).$$

This elementary operation (1) displaces the distribution of $X$ according to pairwise interactions measured through the map $f$. As is usually done in deep architectures, it is possible to localize the computation at some scale $\tau$ by imposing that $f(x, x')$ is zero for $\|x - x'\| > \tau$, which is also useful to reduce the computation time.

Remark 2 (Fully-connected case). As is customary for neural networks, the maps $f : \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}^r$ we consider for our numerical applications are affine maps composed with a pointwise non-linearity, i.e.
$$f(x, x') = (\lambda(y_k))_{k=1}^r \quad \text{where} \quad y = A \cdot [x; x'] + b \in \mathbb{R}^r,$$
where $\lambda : \mathbb{R} \to \mathbb{R}$ is a pointwise non-linearity (in our experiments, $\lambda(s) = \max(s, 0)$ is the ReLU map). The parameter is then $\theta = (A, b)$, where $A \in \mathbb{R}^{r \times 2q}$ is a matrix and $b \in \mathbb{R}^r$ is a bias.
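As an illustration of Remarks 1 and 2, the following Python sketch (our own illustration, not the implementation used in the experiments; the NumPy setup, variable names and shapes are assumptions) evaluates one fully-connected elementary block on the support of a uniform discrete measure:

import numpy as np

def relu(s):
    # lambda(s) = max(s, 0), the non-linearity of Remark 2
    return np.maximum(s, 0.0)

def elementary_block(x, A, b):
    """Apply T_f to the uniform discrete measure supported on the rows of x.

    x : (n, q) array of support points x_1, ..., x_n.
    A : (r, 2q) weight matrix, b : (r,) bias, so f(x, x') = relu(A @ [x; x'] + b).
    Returns the (n, r) support of T_f(X): y_i = (1/n) sum_j f(x_i, x_j).
    """
    n, q = x.shape
    # all ordered pairs [x_i ; x_j], shape (n, n, 2q)
    pairs = np.concatenate(
        (np.repeat(x[:, None, :], n, axis=1), np.repeat(x[None, :, :], n, axis=0)),
        axis=-1,
    )
    f_values = relu(pairs @ A.T + b)   # (n, n, r) values f(x_i, x_j)
    return f_values.mean(axis=1)       # expectation over x_j ~ alpha_X

The cost is quadratic in $n$; imposing $f(x, x') = 0$ for $\|x - x'\| > \tau$, as discussed above, sparsifies the pairwise average.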

Remark 3 (Deterministic layers). Classical "deterministic" deep architectures are recovered as special cases when $X$ is a constant vector, assuming some value $x \in \mathbb{R}^q$ with probability 1, i.e. $\alpha_X = \delta_x$. A stochastic layer can output such a deterministic vector, which is important for instance for classification scores in supervised learning (see Section 4 for an example) or latent code vectors in auto-encoders (see Section 4 for an illustration). In this case, the map $f(x, x') = g(x')$ does not depend on its first argument, so that $Y = T_f(X)$ is constant, equal to $y = \mathbb{E}_X(g(X)) = \int_{\mathbb{R}^q} g(x)\,\mathrm{d}\alpha_X(x)$. Such a layer thus computes a summary statistic vector of $X$ according to $g$.

Remark 4 (Push-Forward). In sharp contrast to the previous remark, one can consider the case $f(x, x') = h(x)$, so that $f$ only depends on its first argument. One then has $T_f(X) = h(X)$, which corresponds to the notion of push-forward of measures, denoted $\alpha_{T_f(X)} = h_\sharp \alpha_X$. For instance, for a discrete law $\alpha_X = \frac{1}{n}\sum_i \delta_{x_i}$, one has $h_\sharp \alpha_X = \frac{1}{n}\sum_i \delta_{h(x_i)}$. The support of the law of $X$ is thus deformed by $h$.

Remark 5 (Higher Order Interactions and Tensorization). Elementary blocks are generalized to handle higher-order interactions by considering $f : (\mathbb{R}^q)^N \to \mathbb{R}^r$; one then defines $T_f(X) \stackrel{\text{def.}}{=} \mathbb{E}_{X_2, \ldots, X_N}\big(f(X, X_2, \ldots, X_N)\big)$, where $(X_2, \ldots, X_N)$ are independent and identically distributed copies of $X$. An equivalent and elegant way to introduce these interactions in a deep architecture is by adding a tensorization layer, which maps $X \mapsto X_2 \otimes \cdots \otimes X_N \in \mathcal{R}((\mathbb{R}^q)^{N-1})$. Section 3 details the regularity and approximation power of these tensorization steps.

2.2 Building Stochastic Deep Architectures

These elementary blocks are stacked to construct deep architectures. A stochastic deep architecture is thus a map
$$X \in \mathcal{R}(\mathbb{R}^{q_0}) \mapsto Y = T_{f_T} \circ \cdots \circ T_{f_1}(X) \in \mathcal{R}(\mathbb{R}^{q_T}), \tag{2}$$
where $f_t : \mathbb{R}^{q_{t-1}} \times \mathbb{R}^{q_{t-1}} \to \mathbb{R}^{q_t}$. Typical instances of these architectures include:

• Predictive: this is the general case where the architecture inputs a random vector and outputs another random vector. This is useful to model, for instance, time evolution using recurrent networks, and is used in Section 4 to tackle a dynamic prediction problem.

• Discriminative: in which case $Y$ is constant, equal to a vector $y \in \mathbb{R}^{q_T}$ (i.e. $\alpha_Y = \delta_y$), which can represent either a classification score or a latent code vector. Following Remark 3, this is achieved by imposing that $f_T$ only depends on its second argument (see the sketch after this list). Section 4 shows applications of this setting to classification and variational auto-encoders (VAE).

• Generative: in which case the network should input a deterministic code vector $x_0 \in \mathbb{R}^{q_0}$ and should output a random vector $Y$. This is achieved by adding extra randomization through a fixed random vector $\bar X_0 \in \mathcal{R}(\mathbb{R}^{\bar q_0 - q_0})$ (for instance a Gaussian noise) and stacking $X_0 = (x_0, \bar X_0) \in \mathcal{R}(\mathbb{R}^{\bar q_0})$. Section 4 shows an application of this setting to VAE generative models. Note that while we focus for simplicity on VAE models, it is possible to use our architectures for GANs (Goodfellow et al., 2014) as well.
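For concreteness, here is a minimal sketch of a discriminative architecture of the form (2), reusing the elementary_block and relu helpers sketched after Remark 2; the parameter layout and the averaging head are our own assumptions, chosen so that the final block depends only on its second argument (Remark 3) and therefore outputs a deterministic score vector:

# assumes elementary_block and relu from the sketch after Remark 2
def discriminative_net(x, block_params, head_params):
    """Map a point cloud x of shape (n, q0) to a deterministic score vector.

    block_params : list of (A, b) pairs for the measure-to-measure blocks.
    head_params  : (W, c) for the last block, with f_T(x, x') = relu(W @ x' + c).
    """
    for A, b in block_params:               # stochastic layers: measure -> measure
        x = elementary_block(x, A, b)
    W, c = head_params
    return relu(x @ W.T + c).mean(axis=0)   # integration step: E_X(g(X))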

2.3 Recurrent Nets as Gradient Flows

Following the work of Hashimoto et al. (2016), in the special case $\mathbb{R}^q = \mathbb{R}^r$, one can also interpret iterated applications of such a $T_f$ (i.e. considering a recurrent deep network) as discrete optimal transport gradient flows (Santambrogio, 2015) (for the $W_2$ distance, see also the definition in (3)) in order to minimize a quadratic interaction energy
$$\mathcal{E}(\alpha) \stackrel{\text{def.}}{=} \int_{\mathbb{R}^q \times \mathbb{R}^q} F(x, x')\,\mathrm{d}\alpha(x)\,\mathrm{d}\alpha(x')$$
(we assume for ease of notation that $F$ is symmetric). Indeed, introducing a step size $\tau > 0$ and setting $f(x, x') = x + 2\tau \nabla_x F(x, x')$, one sees that the measure $\alpha_{X_\ell}$ defined by the iterates $X_{\ell+1} = T_f(X_\ell)$ of a recurrent net approximates, at time $t = \ell\tau$, the Wasserstein gradient flow $\alpha(t)$ of the energy $\mathcal{E}$. As detailed for instance in Santambrogio (2015), such a gradient flow is the solution of the PDE $\frac{\partial \alpha}{\partial t} = \mathrm{div}(\alpha \nabla(\mathcal{E}'(\alpha)))$, where $\mathcal{E}'(\alpha) = \int_{\mathbb{R}^q} F(x, \cdot)\,\mathrm{d}\alpha(x)$ is the "Euclidean" derivative of $\mathcal{E}$. The pioneering work of Hashimoto et al. (2016) only considers linear and entropy functionals of the form $\mathcal{E}(\alpha) = \int (F(x) + \log(\frac{\mathrm{d}\alpha}{\mathrm{d}x}))\,\mathrm{d}\alpha(x)$, which leads to evolutions $\alpha(t)$ governed by Fokker-Planck PDEs. Our work can thus be interpreted as extending this idea to the more general setting of interaction functionals (see Section 3 for the extension beyond pairwise interactions).
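To make the correspondence explicit, one recurrent step of this gradient-flow interpretation can be sketched as follows (the callable grad_F and all names are our illustration of the displacement $f(x, x') = x + 2\tau \nabla_x F(x, x')$ applied to a uniform discrete measure, not code from the paper):

import numpy as np

def gradient_flow_step(x, grad_F, tau):
    """One recurrent step X_{l+1} = T_f(X_l) with f(x, x') = x + 2*tau*grad_x F(x, x').

    x : (n, d) support of the current discrete measure.
    grad_F : callable (x_i, x_j) -> gradient of F with respect to its first argument.
    """
    n = x.shape[0]
    out = np.empty_like(x)
    for i in range(n):
        # y_i = (1/n) sum_j f(x_i, x_j) = x_i + 2*tau * mean_j grad_x F(x_i, x_j)
        out[i] = x[i] + 2.0 * tau * np.mean(
            [grad_F(x[i], x[j]) for j in range(n)], axis=0
        )
    return out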

3 Theoretical Analysis

In order to get some insight into these deep architectures, we now highlight some theoretical results detailing the regularity and approximation power of these functionals. This theoretical analysis relies on the Wasserstein distance, which allows us to make quantitative statements associated with the convergence in law.

3.1 Convergence in Law Topology

Wasserstein distance. In order to measure the regularity of the involved functionals, and also to define loss functions to fit these architectures (see Section 4), we consider the $p$-Wasserstein distance (for $1 \leqslant p < +\infty$) between two probability distributions $(\alpha, \beta) \in \mathcal{M}^1_+(\mathbb{R}^q)$,
$$W_p^p(\alpha, \beta) \stackrel{\text{def.}}{=} \min_{\pi_1 = \alpha,\ \pi_2 = \beta} \int_{(\mathbb{R}^q)^2} \|x - y\|^p\,\mathrm{d}\pi(x, y), \tag{3}$$
where $\pi_1, \pi_2 \in \mathcal{M}^1_+(\mathbb{R}^q)$ are the two marginals of a coupling measure $\pi$, and the minimum is taken among coupling measures $\pi \in \mathcal{M}^1_+(\mathbb{R}^q \times \mathbb{R}^q)$.

A classical result (see Santambrogio (2015)) asserts that $W_1$ is a norm, and can be conveniently computed using
$$W_1(\alpha, \beta) = W_1(\alpha - \beta) = \max_{\mathrm{Lip}(g) \leqslant 1} \int_{\mathcal{X}} g\,\mathrm{d}(\alpha - \beta),$$
where $\mathrm{Lip}(g)$ is the Lipschitz constant of a map $g : \mathcal{X} \to \mathbb{R}$ (with respect to the Euclidean norm unless otherwise stated).

With an abuse of notation, we write $W_p(X, Y)$ to denote $W_p(\alpha_X, \alpha_Y)$, but one should be careful that we are considering distances between laws of random vectors. An alternative formulation is $W_p(X, Y) = \min_{X', Y'} \mathbb{E}(\|X' - Y'\|^p)^{1/p}$, where $(X', Y')$ is a couple of random vectors such that $X'$ (resp. $Y'$) has the same law as $X$ (resp. $Y$), but of course $X'$ and $Y'$ are not necessarily independent. The Wasserstein distance metrizes the convergence in law (denoted $\rightharpoonup$) in the sense that $X_k \rightharpoonup X$ is equivalent to $W_1(X_k, X) \to 0$.

In the numerical experiments, we estimate $W_p$ using Sinkhorn's algorithm (Cuturi, 2013), which provides a smooth approximation amenable to (possibly stochastic) gradient descent optimization schemes, whether it be for generative or predictive tasks (see Section 4).
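For reference, a bare-bones version of Sinkhorn's algorithm (Cuturi, 2013) on discrete measures reads as follows (a sketch under assumed values for the regularization strength and iteration count; in practice one differentiates through these iterations to obtain gradients for learning):

import numpy as np

def sinkhorn_cost(a, b, C, eps=0.05, n_iters=200):
    """Entropic approximation of the optimal transport cost between two discrete measures.

    a : (n,) weights of alpha, b : (m,) weights of beta (both summing to 1).
    C : (n, m) ground cost, e.g. C[i, j] = ||x_i - y_j||^p for W_p^p.
    Returns <P, C> for the entropically regularized transport plan P.
    """
    K = np.exp(-C / eps)            # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):        # alternate scalings enforcing the two marginals
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return np.sum(P * C)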

Lipschitz property. A map $T : \mathcal{R}(\mathbb{R}^q) \to \mathcal{R}(\mathbb{R}^r)$ is continuous for the convergence in law (a.k.a. the weak* topology of measures) if, for any sequence $X_k \rightharpoonup X$, one has $T(X_k) \rightharpoonup T(X)$. Such a map is furthermore said to be $C$-Lipschitz for the 1-Wasserstein distance if
$$\forall\, (X, Y) \in \mathcal{R}(\mathbb{R}^q)^2, \quad W_1(T(X), T(Y)) \leqslant C\, W_1(X, Y). \tag{4}$$
Lipschitz properties enable us to analyze robustness to input perturbations, since they ensure that if the input distributions of random vectors are close enough (in the Wasserstein sense), the corresponding output laws are close too.

3.2 Regularity of Building Blocks

Elementary blocks. The following proposition, whose proof can be found in Appendix B, shows that elementary blocks are robust to input perturbations.

Proposition 1 (Lipschitzianity of elementary blocks). If, for all $x$, $f(x, \cdot)$ and $f(\cdot, x)$ are $C(f)$-Lipschitz, then $T_f$ is $2rC(f)$-Lipschitz in the sense of (4).

As a composition of Lipschitz functions defines a Lipschitz map, the architectures of the form (2) are thus Lipschitz, with a Lipschitz constant upper-bounded by $2r\sum_t C(f_t)$, where we used the notations of Proposition 1.

Tensorization. As highlighted in Remark 5, tensorization plays an important role in defining higher-order interaction blocks.

Definition 2 (Tensor product). Given $(X, Y) \in \mathcal{R}(\mathcal{X}) \times \mathcal{R}(\mathcal{Y})$, a tensor product random vector is $X \otimes Y \stackrel{\text{def.}}{=} (X', Y') \in \mathcal{R}(\mathcal{X} \times \mathcal{Y})$, where $X'$ and $Y'$ are independent and have the same laws as $X$ and $Y$. This means that $\mathrm{d}\alpha_{X \otimes Y}(x, y) = \mathrm{d}\alpha_X(x)\,\mathrm{d}\alpha_Y(y)$ is the tensor product of the measures.

Remark 6 (Tensor Product between Discrete Measures). If we consider random vectors supported on point clouds, with laws $\alpha_X = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ and $\alpha_Y = \frac{1}{m}\sum_{j=1}^m \delta_{y_j}$, then $X \otimes Y$ is a discrete random vector supported on $nm$ points, since $\alpha_{X \otimes Y} = \frac{1}{nm}\sum_{i,j} \delta_{(x_i, y_j)}$.
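A short helper (our own sketch) making this discrete tensor product explicit:

import numpy as np

def tensor_product_support(x, y):
    """Support and weights of X (x) Y for uniform discrete laws (Remark 6).

    x : (n, q) support of alpha_X, y : (m, p) support of alpha_Y.
    Returns the (n*m, q+p) support of alpha_{X (x) Y} and its uniform weights.
    """
    n, m = x.shape[0], y.shape[0]
    support = np.concatenate((np.repeat(x, m, axis=0), np.tile(y, (n, 1))), axis=1)
    weights = np.full(n * m, 1.0 / (n * m))
    return support, weights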

The following proposition shows that tensorization blocks maintain the stability property of a deep architecture.

Proposition 2 (Lipschitzness of tensorization). One has, for $(X, X', Y, Y') \in \mathcal{R}(\mathcal{X})^2 \times \mathcal{R}(\mathcal{Y})^2$,
$$W_1(X \otimes Y, X' \otimes Y') \leqslant W_1(X, X') + W_1(Y, Y').$$

Proof. One has
$$W_1(\alpha \otimes \beta, \alpha' \otimes \beta') = \max_{\mathrm{Lip}(g) \leqslant 1} \int_{\mathcal{X} \times \mathcal{Y}} g(x, y)\,\big[\mathrm{d}\alpha(x)\mathrm{d}\beta(y) - \mathrm{d}\alpha'(x)\mathrm{d}\beta'(y)\big]$$
$$= \max_{\mathrm{Lip}(g) \leqslant 1} \int_{\mathcal{X}} \int_{\mathcal{Y}} g(x, y)\,\big[\mathrm{d}\beta(y) - \mathrm{d}\beta'(y)\big]\mathrm{d}\alpha(x) + \int_{\mathcal{Y}} \int_{\mathcal{X}} g(x, y)\,\big[\mathrm{d}\alpha(x) - \mathrm{d}\alpha'(x)\big]\mathrm{d}\beta'(y),$$
hence the result.

3.3 Approximation Theorems

Universality of elementary blocks. The following theorem shows that any continuous map between random vectors can be approximated to arbitrary precision using three elementary blocks. Note that it includes through $\Lambda$ a fixed random input which operates as an "excitation block" similar to the generative VAE models studied in Section 4.2.

Theorem 1. Let $\mathcal{F} : \mathcal{R}(\mathcal{X}) \to \mathcal{R}(\mathcal{Y})$ be a continuous map for the convergence in law, where $\mathcal{X} \subset \mathbb{R}^q$ and $\mathcal{Y} \subset \mathbb{R}^r$ are compact. Then for all $\varepsilon > 0$ there exist three continuous maps $f, g, h$ such that
$$\forall X \in \mathcal{R}(\mathcal{X}), \quad W_1\big(\mathcal{F}(X),\ T_h \circ \Lambda \circ T_g \circ T_f(X)\big) \leqslant \varepsilon, \tag{5}$$
where $\Lambda : X \mapsto (X, U)$ concatenates a uniformly distributed random vector $U$.

The architecture that we use to prove this theorem is displayed on Figure 1, bottom (left). Since $f$, $g$ and $h$ are smooth maps, according to the universality theorem of neural networks (Cybenko, 1989; Leshno et al., 1993) (assuming some restriction on the non-linearity $\lambda$, namely that it is a nonconstant, bounded and continuous function), it is possible to replace each of them (at the expense of increasing $\varepsilon$) by a sequence of fully connected layers (as detailed in Remark 2). This is detailed in Section D of the appendix.

Since deterministic vectors are a special case of random vectors (see Remark 3), this result encompasses as a special case universality for vector-valued maps $\mathcal{F} : \mathcal{R}(\Omega) \to \mathbb{R}^r$ (used for instance for classification in Section 4.1), in which case only 2 elementary blocks are needed. Of course the classical universality of multi-layer perceptrons (Cybenko, 1989; Leshno et al., 1993) for vector-to-vector maps $F : \mathbb{R}^q \to \mathbb{R}^r$ is also a special case (using a single elementary block).

Proof (of Theorem 1). In the following, we denote the probability simplex as $\Sigma_n \stackrel{\text{def.}}{=} \big\{ a \in \mathbb{R}_+^n \;;\; \sum_i a_i = 1 \big\}$.

Without loss of generality, we assume $\mathcal{X} \subset [0, 1]^q$ and $\mathcal{Y} \subset [0, 1]^r$. We consider two uniform grids of $n$ and $m$ points, $(x_i)_{i=1}^n$ of $[0, 1]^q$ and $(y_j)_{j=1}^m$ of $[0, 1]^r$. On these grids, we consider the usual piecewise affine $P_1$ finite element bases $(\varphi_i)_{i=1}^n$ and $(\psi_j)_{j=1}^m$, which are continuous hat functions supported on cells $(R_i)_i$ and $(S_j)_j$, cubes of width $2/n^{1/q}$ and $2/m^{1/r}$. We define discretization operators $\mathcal{D}_{\mathcal{X}} : \alpha \in \mathcal{M}^1_+(\mathcal{X}) \mapsto (\int_{R_i} \varphi_i\,\mathrm{d}\alpha)_{i=1}^n \in \Sigma_n$ and $\mathcal{D}_{\mathcal{Y}} : \beta \in \mathcal{M}^1_+(\mathcal{Y}) \mapsto (\int_{S_j} \psi_j\,\mathrm{d}\beta)_{j=1}^m \in \Sigma_m$. We also define $\mathcal{D}^*_{\mathcal{X}} : a \in \Sigma_n \mapsto \sum_i a_i \delta_{x_i} \in \mathcal{M}^1_+(\mathcal{X})$ and $\mathcal{D}^*_{\mathcal{Y}} : b \in \Sigma_m \mapsto \sum_j b_j \delta_{y_j} \in \mathcal{M}^1_+(\mathcal{Y})$.

The map $\mathcal{F}$ induces a discrete map $G : \Sigma_n \to \Sigma_m$ defined by $G \stackrel{\text{def.}}{=} \mathcal{D}_{\mathcal{Y}} \circ \mathcal{F} \circ \mathcal{D}^*_{\mathcal{X}}$. Remark that $\mathcal{D}^*_{\mathcal{X}}$ is continuous from $\Sigma_n$ (with the usual topology on $\mathbb{R}^n$) to $\mathcal{M}^1_+(\mathcal{X})$ (with the convergence in law topology), $\mathcal{F}$ is continuous (for the convergence in law), and $\mathcal{D}_{\mathcal{Y}}$ is continuous from $\mathcal{M}^1_+(\mathcal{Y})$ (with the convergence in law topology) to $\Sigma_m$ (with the usual topology on $\mathbb{R}^m$). This shows that $G$ is continuous.

For any $b \in \Sigma_m$, Lemma 2, proved in the appendices, defines a continuous map $H$ so that, defining $U$ to be a random vector uniformly distributed on $[0, 1]^r$ (with law $\mathcal{U}$), $H(b, U)$ has law $(1 - \varepsilon)\mathcal{D}^*_{\mathcal{Y}}(b) + \varepsilon\,\mathcal{U}$.

We now have all the ingredients, and define the three continuous maps for the elementary blocks as
$$f(x, x') = (\varphi_i(x'))_{i=1}^n \in \mathbb{R}^n, \qquad g(a, a') = G(a') \in \mathbb{R}^m, \qquad h((b, u), (b', u')) = H(b, u) \in \mathcal{Y}.$$
The corresponding architecture is displayed on Figure 1, bottom. Using these maps, one needs to control the error between $\mathcal{F}$ and $\bar{\mathcal{F}} \stackrel{\text{def.}}{=} T_h \circ \Lambda \circ T_g \circ T_f = H_\sharp \circ \mathcal{D}_{\mathcal{Y}} \circ \mathcal{F} \circ \mathcal{D}^*_{\mathcal{X}} \circ \mathcal{D}_{\mathcal{X}}$, where we denoted $H_\sharp(b) \stackrel{\text{def.}}{=} H(b, \cdot)_\sharp\,\mathcal{U}$ the law of $H(b, U)$ (i.e. the push-forward of the uniform distribution $\mathcal{U}$ of $U$ by $H(b, \cdot)$).

(i) We define $\bar\alpha \stackrel{\text{def.}}{=} \mathcal{D}^*_{\mathcal{X}} \circ \mathcal{D}_{\mathcal{X}}(\alpha)$. The diameter of each cell $R_i$ is $\Delta_i = \sqrt{q}/n^{1/q}$, so that Lemma 1 in the appendices shows that $W_1(\alpha, \bar\alpha) \leqslant \sqrt{q}/n^{1/q}$. Since $\mathcal{F}$ is continuous for the convergence in law, choosing $n$ large enough ensures that $W_1(\mathcal{F}(\alpha), \mathcal{F}(\bar\alpha)) \leqslant \varepsilon$.

(ii) We define $\bar\beta \stackrel{\text{def.}}{=} \mathcal{D}^*_{\mathcal{Y}} \circ \mathcal{D}_{\mathcal{Y}} \circ \mathcal{F}(\bar\alpha)$. Similarly, using $m$ large enough ensures that $W_1(\mathcal{F}(\bar\alpha), \bar\beta) \leqslant \varepsilon$.

(iii) Lastly, let us define $\hat\beta \stackrel{\text{def.}}{=} H_\sharp \circ \mathcal{D}_{\mathcal{Y}}(\bar\beta) = \bar{\mathcal{F}}(\alpha)$. By construction of the map $H$ in Lemma 2, one has $\hat\beta = (1 - \varepsilon)\bar\beta + \varepsilon\,\mathcal{U}$, so that $W_1(\hat\beta, \bar\beta) = \varepsilon\, W_1(\bar\beta, \mathcal{U}) \leqslant C\varepsilon$ for the constant $C = 2\sqrt{r}$, since the measures are supported in a set of diameter $\sqrt{r}$.

Putting these three bounds (i), (ii) and (iii) together using the triangle inequality shows that $W_1(\mathcal{F}(\alpha), \bar{\mathcal{F}}(\alpha)) \leqslant (2 + C)\varepsilon$.

Universality of tensorization. The following theorem, whose proof can be found in Appendix E, shows that one can approximate any continuous map using a sufficiently large number of tensorization layers followed by an elementary block.

Theorem 2. Let $\mathcal{F} : \mathcal{R}(\Omega) \to \mathbb{R}$ be a continuous map for the convergence in law, where $\Omega \subset \mathbb{R}^q$ is compact. Then for all $\varepsilon > 0$, there exist $n > 0$ and a continuous function $f$ such that
$$\forall X \in \mathcal{R}(\Omega), \quad |\mathcal{F}(X) - T_f \circ \theta_n(X)| \leqslant \varepsilon, \tag{6}$$
where $\theta_n(X) = X \otimes \cdots \otimes X$ is the $n$-fold self-tensorization.

The architecture used for this theorem is displayed on the bottom (right) of Figure 1. The function $f$ appearing in (6) plays a similar role as in (5), but note that the two-layer factorizations provided by these two theorems are very different. It is an interesting avenue for future work to compare them theoretically and numerically.

4 Applications

To exemplify the use of our stochastic deep architectures, we consider classification, generation and dynamic prediction tasks. The goal is to highlight the versatility of these architectures and their ability to handle as input and/or output both probability distributions and vectors. We also illustrate the gain in maintaining stochasticity among several layers.

4.1 Classification tasks

MNIST Dataset. We perform classification on the 2-D MNIST dataset of handwritten digits. To convert an MNIST image into a 2-D point cloud, we threshold pixel values (threshold $\rho = 0.5$) and use as support of the input empirical measure the $n = 256$ pixels of highest intensity, represented as points $(x_i)_{i=1}^n \subset \mathbb{R}^2$ (if there are fewer than $n = 256$ pixels of intensity over $\rho$, we repeat input coordinates), which are remapped along each axis by mean and variance normalization. Each image is therefore turned into a sum of $n = 256$ Diracs $\frac{1}{n}\sum_i \delta_{x_i}$. Our stochastic network architecture is displayed on the top of Figure 1 and is composed of 5 elementary blocks $(T_{f_k})_{k=1}^5$ with an interleaved self-tensorization layer $X \mapsto X \otimes X$. The first elementary block $T_{f_1}$ maps measures to measures, the second one $T_{f_2}$ maps a measure to a deterministic vector (i.e. does not depend on its first coordinate, see Remark 3), and the last layers are classical vectorial fully-connected ones. We use a ReLU non-linearity $\lambda$ (see Remark 2). The weights are learnt with a weighted cross-entropy loss function over a training set of 55,000 examples and tested on a set of 10,000 examples. Initialization is performed with the Xavier method (Glorot and Bengio, 2010) and learning with the Adam optimizer (Kingma and Ba, 2014). Table 1 displays our results, compared with the PointNet (Qi et al., 2016) baseline. We observe that maintaining stochasticity among several layers is beneficial (as opposed to replacing one elementary block with a fully connected layer allocating the same amount of memory).

          input type                      error (%)
PointNet  point set                       0.78
Ours      measure (1 stochastic layer)    1.07
Ours      measure (2 stochastic layers)   0.76

Table 1: MNIST classification results.
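A sketch of the point-cloud preprocessing described above (the threshold, cloud size and normalization follow the text; the helper itself is our illustration, not the authors' code):

import numpy as np

def mnist_image_to_cloud(img, n=256, rho=0.5):
    """Turn a (28, 28) grayscale image with values in [0, 1] into the support
    of a uniform measure of n Diracs, as described in Section 4.1."""
    rows, cols = np.nonzero(img > rho)                 # pixels above the threshold
    order = np.argsort(-img[rows, cols])               # highest intensity first
    coords = np.stack((cols, rows), axis=1).astype(float)[order]
    cloud = coords[np.arange(n) % len(coords)]         # top-n pixels, repeated if needed
    # per-axis mean and variance normalization
    return (cloud - cloud.mean(axis=0)) / cloud.std(axis=0)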

Figure 1: Top and center: two examples of deep stochastic architectures applied to the MNIST dataset, top for classification purposes (Section 4.1), center for generative modeling purposes (Section 4.2). Bottom: architectures for the proofs of Theorems 1 and 2.

ModelNet40 Dataset. We evaluate our model on the ModelNet40 (Wu et al., 2015b) shape classification benchmark. The dataset contains 3-D CAD models from 40 man-made categories, split into 9,843 examples for training and 2,468 for testing. We consider $n = 1{,}024$ samples on each surface, obtained by a farthest point sampling procedure. Our classification network is similar to the one displayed on top of Figure 1, except that the layer dimensions are [3, 10, 500, 800, 400, 40]. Our results are displayed in Table 2. As previously observed in 2-D, performance is improved by maintaining stochasticity among several layers, for the same amount of allocated memory.

              input type                      accuracy (%)
3DShapeNets   volume                          77
PointNet      point set                       89.2
Ours          measure (1 stochastic layer)    82.0
Ours          measure (2 stochastic layers)   83.5

Table 2: ModelNet40 classification results.

4.2 Generative networks

We further evaluate our framework on generative tasks, with a VAE-type model (Kingma and Welling, 2013) (note that it would be possible to use our architectures for GANs (Goodfellow et al., 2014) as well). The task consists in generating outputs resembling the data distribution by decoding a random variable $z$ sampled in a latent space $\mathcal{Z}$. The model, an encoder-decoder architecture, is learnt by comparing input and output measures using the $W_2$ Wasserstein distance loss, approximated using Sinkhorn's algorithm (Cuturi, 2013; Genevay et al., 2018). Following (Kingma and Welling, 2013), a Gaussian prior is imposed on the latent variable $z$. The encoder and the decoder are two mirrored architectures composed of two elementary blocks and three fully-connected layers each. The corresponding stochastic network architecture is displayed on the bottom of Figure 1. Figure 2 displays an application on the MNIST database, where the latent variable $z \in \mathbb{R}^2$ parameterizes a 2-D manifold of generated digits. We use as input and output discrete probability measures of $n = 100$ Diracs, displayed as point clouds on the right of Figure 2.

4.3 Dynamic Prediction

The Cucker-Smale flocking model (Cucker and Smale, 2007) is a non-linear dynamical system modelling the emergence of coherent behaviors, as for instance in the evolution of a flock of birds, by solving for positions and speeds $x_i(t) \stackrel{\text{def.}}{=} (p_i(t) \in \mathbb{R}^d, v_i(t) \in \mathbb{R}^d)$ for $i = 1, \ldots, n$,
$$\dot p(t) = v(t), \quad \text{and} \quad \dot v(t) = L(p(t))\, v(t), \tag{7}$$
where $L(p) \in \mathbb{R}^{n \times n}$ is the Laplacian matrix associated to a group of points $p \in (\mathbb{R}^d)^n$,
$$L(p)_{i,j} \stackrel{\text{def.}}{=} \frac{1}{1 + \|p_i - p_j\|^m}, \qquad L(p)_{i,i} = -\sum_{j \neq i} L(p)_{i,j}.$$

Figure 2: Left: Manifold of digits generated by the VAE network displayed on the bottom of Figure 1. Right: Corresponding point cloud (displaying only a subset of the left images).

In the numerics, we set $m = 0.6$. This setting can be adapted to weighted particles $(x_i(t), \mu_i)_{i=1,\ldots,n}$, where each weight $\mu_i$ stands for a set of physical attributes impacting the dynamics (for instance, mass), which is what we consider here. This model equivalently describes the evolution of the measure $\alpha(t) = \sum_{i=1}^n \mu_i \delta_{x_i(t)}$ in the phase space $(\mathbb{R}^d)^2$, and, following Section 2.3 on the ability of our architectures to model dynamical systems involving interactions, (7) can be discretized in time, which leads to a recurrent network making use of a single elementary block $T_f$ between each time step. Indeed, our blocks allow stochasticity to be maintained across all layers, which is the natural way to follow densities of particles over time.
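As a reference for the dynamics being learned, one time step of (7) can be sketched with an explicit Euler scheme (the step size dt and the Euler discretization are our assumptions; the paper only states that (7) is discretized in time):

import numpy as np

def cucker_smale_step(p, v, dt, m=0.6):
    """One explicit Euler step of the flocking model (7): p' = v, v' = L(p) v.

    p, v : (n, d) positions and velocities of the n particles.
    """
    diff = p[:, None, :] - p[None, :, :]                   # p_i - p_j
    L = 1.0 / (1.0 + np.linalg.norm(diff, axis=-1) ** m)   # off-diagonal L(p)_{i,j}
    np.fill_diagonal(L, 0.0)
    np.fill_diagonal(L, -L.sum(axis=1))                    # L(p)_{i,i} = -sum_{j != i} L(p)_{i,j}
    return p + dt * v, v + dt * (L @ v)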

It is, however, not the purpose of this article to study such a recurrent network, and we aim at showcasing here whether deep (non-recurrent) architectures of the form (2) can accurately capture the Cucker-Smale model. More precisely, since in the evolution (7) the mean of $v(t)$ stays constant, we can assume $\sum_i v_i(t) = 0$, in which case it can be shown (Cucker and Smale, 2007) that particles ultimately reach stable positions, $(p(t), v(t)) \to (p(\infty), 0)$. We denote $\mathcal{F}(\alpha(0)) \stackrel{\text{def.}}{=} \sum_{i=1}^n \mu_i \delta_{p_i(\infty)}$ the map from some initial configuration in the phase space (described by a probability distribution $\alpha(0)$) to the limit probability distribution (described by a discrete measure supported on the positions $p_i(\infty)$). The goal is to approximate this map using our deep stochastic architectures. To showcase the flexibility of our approach, we consider a non-uniform initial measure $\alpha(0)$ and approximate its limit behavior $\mathcal{F}(\alpha(0))$ by a uniform one ($\mu_i = \frac{1}{n}$).

In our experiments, the measure $\alpha(t)$ models the dynamics of several (2 to 4) flocks of birds moving towards each other, exhibiting the limit behavior of a single stable flock. As shown in Figures 3 and 4, the positions of the initial flocks are normally distributed, centered respectively at corners of a rectangle, $(-4; 2), (-4; -2), (4; 2), (4; -2)$, with variance 1. Their velocities (displayed as arrows with lengths proportional to magnitudes in Figures 3 and 4) are uniformly chosen within the quarter disk $[0; -0.1] \times [0.1; 0]$. Their initial weights $\mu_i$ are normally distributed with mean 0.5 and standard deviation 0.1, clipped by a ReLU and normalized. Figures 3 (representing densities) and 4 (depicting the corresponding points' positions) show that, for a set of $n = 720$ particles, quite different limit behaviors are successfully retrieved by a simple network composed of five elementary blocks with layers of dimensions [2, 10, 20, 40, 60], learnt with a Wasserstein fitting criterion (Genevay et al., 2018) (computed with Sinkhorn's algorithm (Cuturi, 2013)).

[Figure 3: three rows (a)-(c) of densities; columns: initial $\alpha(0)$, target $\mathcal{F}(\alpha(0))$, predicted.]

Figure 3: Prediction of the asymptotic density of the flocking model, for various initial speed values $v(0)$ and $n = 720$ particles. E.g. for the top left cloud: (a) $v(0) = (0.050; -0.085)$; (b) $v(0) = (0.030; -0.094)$; (c) $v(0) = (0.056; -0.081)$.

Conclusion

In this paper, we have proposed a new formalism for stochastic deep architectures, which can cope in a seamless way with both probability distributions and deterministic feature vectors. The salient features of these architectures are their robustness and approximation power with respect to the convergence in law, which is crucial to tackle high-dimensional classification, regression and generation problems over the space of probability distributions.

[Figure 4: three rows (a)-(c) of particle positions; columns: initial $\alpha(0)$, target $\mathcal{F}(\alpha(0))$, predicted.]

Figure 4: Prediction of particles' positions corresponding to Figure 3. Dots' diameters are proportional to the weights $\mu_i$ (predicted ones all have the same size since $\mu_i = \frac{1}{n}$).

Acknowledgements

The work of Gwendoline de Bie was supported by a grant from Région Île-de-France. The work of Gabriel Peyré has been supported by the European Research Council (ERC project NORIA).

References

Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.

Bassetti, F., Bodini, A., and Regazzini, E. (2006). On minimum Kantorovich distance estimators. Statistics & Probability Letters, 76(12):1298–1302.

Bernton, E., Jacob, P., Gerber, M., and Robert, C. (2017). Inference in generative models using the Wasserstein distance. Working paper or preprint.

Chen, X., Cheng, X., and Mallat, S. (2014). Unsupervised deep Haar scattering on graphs. Advances in Neural Information Processing Systems, pages 1709–1717.

Cheng, X., Chen, X., and Mallat, S. (2016). Deep Haar scattering networks. Information and Inference: A Journal of the IMA, 5(2):105–133.

Cohen, T. and Welling, M. (2016). Group equivariant convolutional networks. International Conference on Machine Learning, pages 2990–2999.

Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. (2017). Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865.

Cucker, F. and Smale, S. (2007). On the mathematics of emergence. Japanese Journal of Mathematics, 2(1):197–227.

Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, pages 2292–2300.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314.

Fan, H., Su, H., and Guibas, L. (2016). A point set generation network for 3D object reconstruction from a single image. arXiv preprint arXiv:1612.00603.

Frogner, C., Zhang, C., Mobahi, H., Araya, M., and Poggio, T. A. (2015). Learning with a Wasserstein loss. Advances in Neural Information Processing Systems, pages 2053–2061.

Genevay, A., Peyré, G., and Cuturi, M. (2018). Learning generative models with Sinkhorn divergences. International Conference on Artificial Intelligence and Statistics, pages 1608–1617.

Gens, R. and Domingos, P. M. (2014). Deep symmetry networks. Advances in Neural Information Processing Systems 27, pages 2537–2545.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.

Godin, M., Bryan, A. K., Burg, T. P., Babcock, K., and Manalis, S. R. (2007). Measuring the mass, density, and size of particles and cells using a suspended microchannel resonator. Applied Physics Letters, 91(12):123121.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Grover, W. H., Bryan, A. K., Diez-Silva, M., Suresh, S., Higgins, J. M., and Manalis, S. R. (2011). Measuring single-cell density. Proceedings of the National Academy of Sciences, 108(27):10992–10996.

Guckenheimer, J., Oster, G., and Ipaktchi, A. (1977). The dynamics of density dependent population models. Journal of Mathematical Biology, 4(2):101–147.

Guttenberg, N., Virgo, N., Witkowski, O., Aoki, H., and Kanai, R. (2016). Permutation-equivariant neural networks applied to dynamics prediction. arXiv preprint arXiv:1612.04530.

Hashimoto, T., Gifford, D., and Jaakkola, T. (2016). Learning population-level diffusions with generative RNNs. International Conference on Machine Learning, pages 2417–2426.

Henaff, M., Bruna, J., and LeCun, Y. (2015). Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Kingsbury, B., and Sainath, T. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29:82–97.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Huang, G., Guo, C., Kusner, M. J., Sun, Y., Sha, F., and Weinberger, K. Q. (2016). Supervised word mover's distance. Advances in Neural Information Processing Systems, pages 4862–4870.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pages 1097–1105.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Lee, H., Pham, P., Largman, Y., and Ng, A. Y. (2009). Unsupervised feature learning for audio classification using convolutional deep belief networks. Advances in Neural Information Processing Systems 22, pages 1096–1104.

Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867.

Qi, C. R., Su, H., Mo, K., and Guibas, L. J. (2016). PointNet: Deep learning on point sets for 3D classification and segmentation. arXiv preprint arXiv:1612.00593.

Qi, C. R., Yi, L., Su, H., and Guibas, L. J. (2017). PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, pages 5105–5114.

Ravanbakhsh, S., Schneider, J., and Poczos, B. (2016). Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500.

Ravanbakhsh, S., Schneider, J., and Poczos, B. (2017). Equivariance through parameter-sharing. arXiv preprint arXiv:1702.08389.

Rolet, A., Cuturi, M., and Peyré, G. (2016). Fast dictionary learning with a smoothed Wasserstein loss. Artificial Intelligence and Statistics, pages 630–638.

Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Birkhäuser, NY.

Tereshko, V. (2000). Reaction-diffusion model of a honeybee colony's foraging behaviour. In International Conference on Parallel Problem Solving from Nature, pages 807–816. Springer.

Vinyals, O., Bengio, S., and Kudlur, M. (2015). Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391.

Vinyals, O., Bengio, S., and Kudlur, M. (2016). Order matters: Sequence to sequence for sets. In International Conference on Learning Representations (ICLR).

Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., and Xiao, J. (2015a). 3D ShapeNets: A deep representation for volumetric shapes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920.

Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., and Xiao, J. (2015b). 3D ShapeNets: A deep representation for volumetric shapes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. (2017). Deep Sets. Advances in Neural Information Processing Systems 30, pages 3391–3401.

Supplementary

A Useful Results

We detail below operators which are useful to decompose a building block $T_f$ into a push-forward and an integration. In the following, to ease the reading, we often denote $W_{\mathcal{X}}$ the $W_1$ distance on a space $\mathcal{X}$ (e.g. $\mathcal{X} = \mathbb{R}^q$).

Push-forward. The push-forward operator allows for modifications of the support while maintaining the geometry of the input measure.

Definition 3 (Push-forward). For a function $f : \mathcal{X} \to \mathcal{Y}$, we define the push-forward $f_\sharp\alpha \in \mathcal{M}(\mathcal{Y})$ of $\alpha \in \mathcal{M}(\mathcal{X})$ by $f$ as
$$\forall\, g \in \mathcal{C}(\mathcal{Y}), \quad \int_{\mathcal{Y}} g\,\mathrm{d}(f_\sharp\alpha) \stackrel{\text{def.}}{=} \int_{\mathcal{X}} g(f(x))\,\mathrm{d}\alpha(x). \tag{8}$$
Note that $f_\sharp : \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathcal{Y})$ is a linear operator.

Proposition 3 (Lipschitzness of push-forward). One has
$$W_{\mathcal{Y}}(f_\sharp\alpha, f_\sharp\beta) \leqslant \mathrm{Lip}(f)\; W_{\mathcal{X}}(\alpha, \beta), \tag{9}$$
$$W_{\mathcal{Y}}(f_\sharp\alpha, g_\sharp\alpha) \leqslant \|f - g\|_{L^1(\alpha)}, \tag{10}$$
where $\mathrm{Lip}(f)$ designates the Lipschitz constant of $f$.

Proof. For all $h : \mathcal{Y} \to \mathbb{R}$ such that $\mathrm{Lip}(h) \leqslant 1$, the map $\frac{h \circ f}{\mathrm{Lip}(f)}$ is 1-Lipschitz, therefore
$$\int_{\mathcal{X}} \frac{h \circ f}{\mathrm{Lip}(f)}\,\mathrm{d}(\alpha - \beta) \leqslant W_{\mathcal{X}}(\alpha, \beta),$$
hence inequality (9). Similarly, for all $h$ such that $\mathrm{Lip}(h) \leqslant 1$,
$$\int_{\mathcal{X}} (h \circ f - h \circ g)\,\mathrm{d}\alpha \leqslant \int_{\mathcal{X}} \|f(x) - g(x)\|_2\,\mathrm{d}\alpha(x),$$
hence inequality (10).

Integration. We now define a (partial) integration operator.

Definition 4 (Integration). For $f \in \mathcal{C}(\mathcal{Z} \times \mathcal{X}; \mathcal{Y} = \mathbb{R}^r)$ and $\alpha \in \mathcal{M}(\mathcal{X})$, we denote
$$f[\cdot, \alpha] \stackrel{\text{def.}}{=} \int_{\mathcal{X}} f(\cdot, x)\,\mathrm{d}\alpha(x) \;:\; \mathcal{Z} \to \mathcal{Y}.$$

Proposition 4 (Lipschitzness of integration). With some fixed $\zeta \in \mathcal{M}^1_+(\mathcal{Z})$, one has
$$\|f[\cdot, \alpha] - f[\cdot, \beta]\|_{L^1(\zeta)} \leqslant r\, \mathrm{Lip}_2(f)\; W_{\mathcal{X}}(\alpha, \beta),$$
where we denoted by $\mathrm{Lip}_2(f)$ a bound on the Lipschitz constant of the function $f(z, \cdot)$ for all $z$.

Proof. One has
$$\begin{aligned}
\|f[\cdot, \alpha] - f[\cdot, \beta]\|_{L^1(\zeta)} &= \int_{\mathcal{Z}} \big\| f[\cdot, \alpha](z) - f[\cdot, \beta](z) \big\|_2 \,\mathrm{d}\zeta(z) \\
&= \int_{\mathcal{Z}} \Big\| \int_{\mathcal{X}} f(z, x)\,\mathrm{d}(\alpha - \beta)(x) \Big\|_2 \mathrm{d}\zeta(z) \\
&\leqslant \int_{\mathcal{Z}} \Big\| \int_{\mathcal{X}} f(z, x)\,\mathrm{d}(\alpha - \beta)(x) \Big\|_1 \mathrm{d}\zeta(z) \\
&= \int_{\mathcal{Z}} \sum_{i=1}^r \Big| \int_{\mathcal{X}} f_i(z, x)\,\mathrm{d}(\alpha - \beta)(x) \Big| \,\mathrm{d}\zeta(z) \\
&\leqslant \sum_{i=1}^r \mathrm{Lip}_2(f_i)\; W_{\mathcal{X}}(\alpha, \beta) \;\leqslant\; r\, \mathrm{Lip}_2(f)\; W_{\mathcal{X}}(\alpha, \beta),
\end{aligned}$$
where we denoted by $\mathrm{Lip}_2(f_i)$ a bound on the Lipschitz constant of the function $f_i(z, \cdot)$ (the $i$-th component of $f$) for all $z$, since again $\frac{f_i}{\mathrm{Lip}_2(f_i)}$ is 1-Lipschitz.

Approximation by discrete measures. The following lemma shows how to control the approximation error between an arbitrary random variable and a discrete variable obtained by computing moments against localized functions on a grid.

Lemma 1. Let $(S_j)_{j=1}^N$ be a partition of a domain including $\Omega$ ($S_j \subset \mathbb{R}^d$) and let $x_j \in S_j$. Let $(\varphi_j)_{j=1}^N$ be a set of bounded functions $\varphi_j : \Omega \to \mathbb{R}$ supported on $S_j$, such that $\sum_j \varphi_j = 1$ on $\Omega$. For $\alpha \in \mathcal{M}^1_+(\Omega)$, we denote $\alpha_N \stackrel{\text{def.}}{=} \sum_{j=1}^N \alpha_j \delta_{x_j}$ with $\alpha_j \stackrel{\text{def.}}{=} \int_{S_j} \varphi_j\,\mathrm{d}\alpha$. One has, denoting $\Delta_j \stackrel{\text{def.}}{=} \max_{x \in S_j} \|x_j - x\|$,
$$W_1(\alpha_N, \alpha) \leqslant \max_{1 \leqslant j \leqslant N} \Delta_j.$$

Proof. We define $\pi \in \mathcal{M}^1_+(\Omega^2)$, a transport plan coupling the marginals $\alpha$ and $\alpha_N$, by imposing, for all $f \in \mathcal{C}(\Omega^2)$,
$$\int_{\Omega^2} f\,\mathrm{d}\pi = \sum_{j=1}^N \int_{S_j} f(x, x_j)\,\varphi_j(x)\,\mathrm{d}\alpha(x).$$
$\pi$ is indeed a transport plan, since for all $g \in \mathcal{C}(\Omega)$,
$$\int_{\Omega^2} g(x)\,\mathrm{d}\pi(x, y) = \sum_{j=1}^N \int_{S_j} g(x)\varphi_j(x)\,\mathrm{d}\alpha(x) = \sum_{j=1}^N \int_{\Omega} g(x)\varphi_j(x)\,\mathrm{d}\alpha(x) = \int_{\Omega} g(x)\Big(\sum_{j=1}^N \varphi_j(x)\Big)\mathrm{d}\alpha(x) = \int_{\Omega} g\,\mathrm{d}\alpha.$$

Also,
$$\int_{\Omega^2} g(y)\,\mathrm{d}\pi(x, y) = \sum_{j=1}^N \int_{S_j} g(x_j)\varphi_j\,\mathrm{d}\alpha = \sum_{j=1}^N g(x_j) \int_{\Omega} \varphi_j\,\mathrm{d}\alpha = \sum_{j=1}^N \alpha_j g(x_j) = \int_{\Omega} g\,\mathrm{d}\alpha_N.$$
By definition of the Wasserstein-1 distance,
$$W_1(\alpha_N, \alpha) \leqslant \int_{\Omega^2} \|x - y\|\,\mathrm{d}\pi(x, y) = \sum_{j=1}^N \int_{S_j} \varphi_j(x)\,\|x - x_j\|\,\mathrm{d}\alpha(x) \leqslant \sum_{j=1}^N \int_{S_j} \varphi_j \Delta_j\,\mathrm{d}\alpha \leqslant \Big( \sum_{i=1}^N \int_{\Omega} \varphi_i\,\mathrm{d}\alpha \Big) \max_{1 \leqslant j \leqslant N} \Delta_j = \max_{1 \leqslant j \leqslant N} \Delta_j.$$

B Proof of Proposition 1

Let us stress that the elementary block $T_f(X)$ defined in (1) only depends on the law $\alpha_X$. In the following, for a measure $\alpha_X$ we denote $T_f(\alpha_X)$ the law of $T_f(X)$. The goal is thus to show that $T_f$ is Lipschitz for the Wasserstein distance.

For a measure $\alpha \in \mathcal{M}(\mathcal{X})$ (where $\mathcal{X} = \mathbb{R}^q$), the measure $\beta = T_f(\alpha) \in \mathcal{M}(\mathcal{Y})$ (where $\mathcal{Y} = \mathbb{R}^r$) is defined via the identity, for all $g \in \mathcal{C}(\mathcal{Y})$,
$$\int_{\mathcal{Y}} g(y)\,\mathrm{d}\beta(y) = \int_{\mathcal{X}} g\Big( \int_{\mathcal{X}} f(z, x)\,\mathrm{d}\alpha(x) \Big)\mathrm{d}\alpha(z).$$

Let us first remark that an elementary block, when viewed as operating on measures, can be decomposed using the aforementioned push-forward and integration operators, since
$$T_f(\alpha) = f[\cdot, \alpha]_\sharp\, \alpha.$$

Using the fact that $W_{\mathcal{X}}$ is a norm,
$$W_{\mathcal{X}}(T_f(\alpha), T_f(\beta)) \leqslant W_{\mathcal{X}}(T_f(\alpha), f[\cdot, \beta]_\sharp\alpha) + W_{\mathcal{X}}(f[\cdot, \beta]_\sharp\alpha, T_f(\beta)) \leqslant \|f[\cdot, \alpha] - f[\cdot, \beta]\|_{L^1(\alpha)} + \mathrm{Lip}(f[\cdot, \beta])\; W_{\mathcal{X}}(\alpha, \beta),$$
where we used the Lipschitzness of the push-forward, Proposition 3. Moreover, for $(z_1, z_2) \in \mathcal{X}^2$,
$$\begin{aligned}
\|f[z_1, \beta] - f[z_2, \beta]\|_2 &\leqslant \|f[z_1, \beta] - f[z_2, \beta]\|_1 = \Big\| \int_{\mathcal{X}} (f(z_1, \cdot) - f(z_2, \cdot))\,\mathrm{d}\beta \Big\|_1 \\
&= \sum_{i=1}^r \Big| \int_{\mathcal{X}} (f_i(z_1, \cdot) - f_i(z_2, \cdot))\,\mathrm{d}\beta \Big| \leqslant \sum_{i=1}^r \int_{\mathcal{X}} |f_i(z_1, \cdot) - f_i(z_2, \cdot)|\,\mathrm{d}\beta \\
&\leqslant \sum_{i=1}^r \mathrm{Lip}_1(f_i)\, \|z_1 - z_2\|_2 \leqslant r\, \mathrm{Lip}_1(f)\, \|z_1 - z_2\|_2,
\end{aligned}$$
where we denoted by $\mathrm{Lip}_1(f_i)$ a bound on the Lipschitz constant of the function $f_i(\cdot, x)$ for all $x$. Hence $\mathrm{Lip}(f[\cdot, \beta]) \leqslant r\, \mathrm{Lip}_1(f)$. In addition, the Lipschitzness of integration, Proposition 4, yields
$$W_{\mathcal{X}}(T_f(\alpha), T_f(\beta)) \leqslant r\, \mathrm{Lip}_2(f)\; W_{\mathcal{X}}(\alpha, \beta) + r\, \mathrm{Lip}_1(f)\; W_{\mathcal{X}}(\alpha, \beta),$$
which is bounded by $2r\,C(f)\,W_{\mathcal{X}}(\alpha, \beta)$ under the assumptions of Proposition 1.

C Parametric Push Forward

An ingredient of the proof of the universality Theorem 1 is the construction of a noise-reshaping function $H$ which maps a uniform noise to another distribution parametrized by $b$.

Lemma 2. There exists a continuous map $(b, u) \in \Sigma_m \times [0, 1]^r \mapsto H(b, u)$ such that the random vector $H(b, U)$ has law $\beta \stackrel{\text{def.}}{=} (1 - \varepsilon)\mathcal{D}^*_{\mathcal{Y}}(b) + \varepsilon\,\mathcal{U}$, where $U$ has law $\mathcal{U}$ (the uniform distribution on $[0, 1]^r$).

Proof. Since both the input measure $\mathcal{U}$ and the output measure $\beta$ have densities and are supported on convex sets, one can use for the map $H(b, \cdot)$ the optimal transport map between these two distributions for the squared Euclidean cost, which is known to be a continuous function; see for instance Santambrogio (2015)[Sec. 1.7.6]. It is also possible to define a more direct transport map (which is not in general optimal), known as the Dacorogna-Moser transport; see for instance Santambrogio (2015)[Box 4.3].

D Approximation by Fully Connected Networks for Theorem 1

We now show how to approximate the continuous functions $f$ and $g$ used in Theorem 1 by neural networks. Namely, let us first detail the real-valued case (where two elementary blocks are needed), namely prove that:

for all $\varepsilon > 0$, there exist integers $N, p_1, p_2$, weight matrices $A_1 \in \mathbb{R}^{p_1 \times d}$, $A_2 \in \mathbb{R}^{p_2 \times N}$, $C_1 \in \mathbb{R}^{N \times p_1}$, $C_2 \in \mathbb{R}^{1 \times p_2}$, and biases $b_1 \in \mathbb{R}^{p_1}$, $b_2 \in \mathbb{R}^{p_2}$, i.e. two neural networks

• $g_\theta(x) = C_1 \lambda(A_1 x + b_1) : \mathbb{R}^d \to \mathbb{R}^N$,

• $f_\xi(x) = C_2 \lambda(A_2 x + b_2) : \mathbb{R}^N \to \mathbb{R}$,

such that for all $\alpha \in \mathcal{M}^1_+(\Omega)$, $|\mathcal{F}(\alpha) - f_\xi(\mathbb{E}_{X \sim \alpha}(g_\theta(X)))| < \varepsilon$.

Let $\varepsilon > 0$. By Theorem 1, $\mathcal{F}$ can be approximated arbitrarily closely (up to $\frac{\varepsilon}{3}$) by a composition of functions of the form $f(\mathbb{E}_{X \sim \alpha}(g(X)))$. By the triangle inequality, we upper-bound the difference of interest $|\mathcal{F}(\alpha) - f_\xi(\mathbb{E}_{X \sim \alpha}(g_\theta(X)))|$ by a sum of three terms:

• $|\mathcal{F}(\alpha) - f(\mathbb{E}_{X \sim \alpha}(g(X)))|$,

• $|f(\mathbb{E}_{X \sim \alpha}(g(X))) - f_\xi(\mathbb{E}_{X \sim \alpha}(g(X)))|$,

• $|f_\xi(\mathbb{E}_{X \sim \alpha}(g(X))) - f_\xi(\mathbb{E}_{X \sim \alpha}(g_\theta(X)))|$,

and bound each term by $\frac{\varepsilon}{3}$, which yields the result.

The bound on the first term directly comes from Theorem 1 and yields the constant $N$, which depends on $\varepsilon$. The bound on the second term is a direct application of the universal approximation theorem (Cybenko, 1989; Leshno et al., 1993): since $\alpha$ is a probability measure, the input values of $f$ lie in a compact subset of $\mathbb{R}^N$, as $\|\int_{\Omega} g(x)\,\mathrm{d}\alpha\|_\infty \leqslant \max_{x \in \Omega} \max_i |g_i(x)|$, hence the theorem (Cybenko, 1989; Leshno et al., 1993) is applicable as long as $\lambda$ is a nonconstant, bounded and continuous function. Let us focus on the third term. Uniform continuity of $f_\xi$ yields the existence of $\delta > 0$ such that $\|u - v\|_1 < \delta$ implies $|f_\xi(u) - f_\xi(v)| < \frac{\varepsilon}{3}$. Let us apply the universal approximation theorem: each component $g_i$ of $g$ can be approximated by a neural network $g_{\theta, i}$ up to $\frac{\delta}{N}$. Therefore,
$$\big\|\mathbb{E}_{X \sim \alpha}(g(X) - g_\theta(X))\big\|_1 \leqslant \mathbb{E}_{X \sim \alpha}\|g(X) - g_\theta(X)\|_1 \leqslant \sum_{i=1}^N \int_{\Omega} |g_i(x) - g_{\theta, i}(x)|\,\mathrm{d}\alpha(x) \leqslant N \times \frac{\delta}{N} = \delta,$$
since $\alpha$ is a probability measure. Hence the bound on the third term, with $u = \mathbb{E}_{X \sim \alpha}(g(X))$ and $v = \mathbb{E}_{X \sim \alpha}(g_\theta(X))$ in the definition of uniform continuity. We proceed similarly in the general case, and upper-bound the Wasserstein distance by a sum of four terms using the triangle inequality. The same ingredients (namely the universal approximation theorem (Cybenko, 1989; Leshno et al., 1993), since all functions $f$, $g$ and $h$ have input and output values lying in compact sets, and uniform continuity) enable us to conclude.

E Proof of Theorem 2

We denote $\mathcal{C}(\mathcal{M}^1_+(\Omega))$ the space of functions taking probability measures on a compact set $\Omega$ to $\mathbb{R}$ which are continuous for the weak-* topology. We denote the set of integrals of tensorized polynomials on $\Omega$ as
$$\mathcal{A}_\Omega \stackrel{\text{def.}}{=} \Big\{ \mathcal{F} : \mathcal{M}^1_+(\Omega) \to \mathbb{R} \;:\; \exists n \in \mathbb{N},\ \exists \varphi : \Omega^n \to \mathbb{R},\ \forall \mu \in \mathcal{M}^1_+(\Omega),\ \mathcal{F}(\mu) = \int_{\Omega^n} \varphi\,\mathrm{d}\mu^{\otimes n} \Big\}.$$

The goal is to show that $\mathcal{A}_\Omega$ is dense in $\mathcal{C}(\mathcal{M}^1_+(\Omega))$. Since $\Omega$ is compact, the Banach-Alaoglu theorem shows that $\mathcal{M}^1_+(\Omega)$ is weakly-* compact. Therefore, in order to use the Stone-Weierstrass theorem to show the density result, we need to show that $\mathcal{A}_\Omega$ is an algebra that separates points, and that, for every probability measure $\alpha$, $\mathcal{A}_\Omega$ contains a function that does not vanish at $\alpha$. For this last point, taking $n = 1$ and $\varphi = 1$ defines the function $\mathcal{F}(\alpha) = \int_{\Omega} \mathrm{d}\alpha = 1$, which does not vanish at $\alpha$ since $\alpha$ is a probability measure. Let us then show that $\mathcal{A}_\Omega$ is a subalgebra of $\mathcal{C}(\mathcal{M}^1_+(\Omega))$:

• stability under multiplication by a scalar follows from the definition of $\mathcal{A}_\Omega$;

• stability under sums: given $(\mathcal{F}_1, \mathcal{F}_2) \in \mathcal{A}_\Omega^2$ (with associated functions $(\varphi_1, \varphi_2)$ of degrees $(n_1, n_2)$), denoting $n \stackrel{\text{def.}}{=} \max(n_1, n_2)$ and $\varphi(x_1, \ldots, x_n) \stackrel{\text{def.}}{=} \varphi_1(x_1, \ldots, x_{n_1}) + \varphi_2(x_1, \ldots, x_{n_2})$ shows that $\mathcal{F}_1 + \mathcal{F}_2 = \int_{\Omega^n} \varphi\,\mathrm{d}\mu^{\otimes n}$, and hence $\mathcal{F}_1 + \mathcal{F}_2 \in \mathcal{A}_\Omega$;

• stability under products: similarly as for the sum, denoting this time $n = n_1 + n_2$ and introducing $\varphi(x_1, \ldots, x_n) = \varphi_1(x_1, \ldots, x_{n_1}) \times \varphi_2(x_{n_1+1}, \ldots, x_{n_1+n_2})$ shows that $\mathcal{F} = \mathcal{F}_1 \times \mathcal{F}_2 \in \mathcal{A}_\Omega$, using Fubini's theorem.

Lastly, we show the separation of points: if two probability measures $(\alpha, \beta)$ on $\Omega$ are different (not equal almost everywhere), there exists a set $\Omega_0 \subset \Omega$ such that $\alpha(\Omega_0) \neq \beta(\Omega_0)$; taking $n = 1$ and $\varphi = \mathbf{1}_{\Omega_0}$, we obtain a function $\mathcal{F} \in \mathcal{A}_\Omega$ such that $\mathcal{F}(\alpha) \neq \mathcal{F}(\beta)$.
