Non-linear Neurons with Human-like Apical Dendrite Activations


Mariana-Iuliana Georgescu, Member, IEEE, Radu Tudor Ionescu, Member, IEEE, Nicolae-Cătălin Ristea, and Nicu Sebe, Member, IEEE

Abstract—In order to classify linearly non-separable data, neurons are typically organized into multi-layer neural networks that are equipped with at least one hidden layer. Inspired by some recent discoveries in neuroscience, we propose a new neuron model along with a novel activation function enabling the learning of non-linear decision boundaries using a single neuron. We show that a standard neuron followed by the novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy. Furthermore, we conduct experiments on five benchmark data sets from computer vision, signal processing and natural language processing, i.e. MOROCO, UTKFace, CREMA-D, Fashion-MNIST, and Tiny ImageNet, showing that ADA and the leaky ADA functions provide superior results to Rectified Linear Units (ReLU), leaky ReLU, RBF and Swish, for various neural network architectures, e.g. one-hidden-layer or two-hidden-layer multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs) such as LeNet, VGG, ResNet and Character-level CNN. We obtain further performance improvements when we replace the standard model of the neuron with our pyramidal neuron with apical dendrite activations (PyNADA). Our code is available at: https://github.com/raduionescu/pynada.

Index Terms—Pyramidal neurons, activation function, transfer function, deep learning.


1 INTRODUCTION

The power of neural networks in classifying linearly non-separable data lies in the use of multiple (at least two) layers. We take inspiration from the recent study of Gidon et al. [1] and propose a simpler yet more effective approach: a new computational model of the neuron, termed pyramidal neuron with apical dendrite activations (PyNADA), along with a novel activation function, termed apical dendrite activation (ADA), allowing us to classify linearly non-separable data using an individual neuron.

Biological motivation. Recently, Gidon et al. [1] observed that the apical dendrites of pyramidal neurons in the human cerebral cortex have a different activation function than what was previously known from observations on rodents. The newly-discovered apical dendrite activation function produces maximal amplitudes for electrical currents close to threshold-level stimuli and dampened amplitudes for stronger electrical currents, as shown in Figure 1a. This new discovery indicates that an individual pyramidal neuron from the human cerebral cortex can classify linearly non-separable data, contrary to the conventional belief that non-linear problems require multi-layer neural networks. This is the main reason that motivated us to propose ADA and PyNADA.

Psychological motivation. Remember the first time you ate your favorite dish. Was it better than the second or the last time you ate the same dish?

• M.I. Georgescu and R.T. Ionescu are with SecurifAI and the Department of Computer Science, University of Bucharest, Romania. R.T. Ionescu is also affiliated with the Romanian Young Academy. E-mail: [email protected]

• N.C. Ristea is with the Department of Telecommunications, University Politehnica of Bucharest, Romania.

• N. Sebe is with the Department of Information Engineering and Computer Science, University of Trento, Italy.

Manuscript received April 19, 2005; revised August 26, 2015.

Fig. 1: Original [1] and proposed versions of the apical dendrite activation. The input corresponds to the horizontal axis and the output to the vertical axis. (a) Activation function observed in apical dendrites of pyramidal neurons in the human cerebral cortex. (b) Our leaky apical dendrite activation (ADA) function, which can be expressed in closed form. This output is obtained by setting l = 0.005, α = 1 and c = 1 in Equation (4).

According to Knutson et al. [2], our brains provide higher responses to novel stimuli than to known (repetitive) stimuli. This means that our brains get bored while eating the same dish over and over again, although the dish might become our favorite. If we were to model the brain response over time for a certain stimulus, we would obtain the function illustrated in Figure 1a. This is yet another reason to propose and experiment with ADA and PyNADA in a computational framework based on deep neural networks, which try to mimic the brain.



Mathematical motivation. Despite the recent significant advances brought by deep learning [3] in various application domains [4], [5], state-of-the-art deep neural networks rely on an old and simple mathematical model of the neuron introduced by Rosenblatt [6]. Minsky and Papert [7] argued that a single artificial neuron is incapable of learning non-linear functions, such as the XOR function. In order to classify linearly non-separable data, standard artificial neurons are typically organized in multi-layer neural networks that are equipped with at least one hidden layer. Contrary to the common belief, we propose an activation function (ADA) that transforms a single artificial neuron into a non-linear classifier. We also prove that the non-linear neuron can learn the XOR logical function with 100% accuracy. Hence, the ADA function can increase the computational power of individual artificial neurons.

Empirical motivation. We provide empirical evidence in favor of replacing the commonly-used (e.g. Rectified Linear Units (ReLU) [8] and leaky ReLU [9]) or the recently-proposed (e.g. Swish [10]) activation functions with the ones proposed in this work, namely ADA and leaky ADA, in various neural network architectures ranging from one-hidden-layer or two-hidden-layer multi-layer perceptrons (MLPs) to convolutional neural networks (CNNs) such as LeNet [11], VGG [12], ResNet [13] and Character-level CNN [14]. We attain accuracy improvements on several tasks: object class recognition on Fashion-MNIST [15] and Tiny ImageNet [16], gender prediction and age estimation on UTKFace [17], voice emotion recognition on CREMA-D [18] and Romanian dialect identification on MOROCO [19]. We report further accuracy improvements when the standard artificial neurons are replaced with our pyramidal neurons with apical dendrite activations.

Contribution. In summary, our contribution is threefold:

• We propose a new artificial neuron called the pyramidal neuron with apical dendrite activations (PyNADA), along with a new activation function called the apical dendrite activation (ADA).

• We demonstrate that, due to the novel apical dendrite activation, a single neuron can learn the XOR logical function.

• We show that the proposed ADA and PyNADA provide superior results compared to standard neurons based on the well-known ReLU and leaky ReLU activations, for a broad range of tasks and neural architectures. In most cases, our improvements are statistically significant.

2 RELATED WORK

2.1 Activation Functions

Since activation functions have a large impact on the performance of deep neural networks (DNNs), studying and proposing new activation functions is an interesting topic [20]. Nowadays, perhaps the most popular activation function is ReLU [8]. Formally, ReLU is defined as max(0, x), where x is the input. Because ReLU is linear on the positive side (for x > 0), its derivative is 1, so it does not saturate like sigmoid and tanh. On the negative side of the domain, ReLU is constant, so the gradient is 0. Hence, a neuron that uses ReLU as its activation function cannot update its weights via gradient-based methods on examples for which the neuron is inactive. To eliminate the problem caused by inactive neurons with ReLU activation, Maas et al. [9] introduced leaky ReLU. The leaky ReLU function is defined as:

y = ReLU_leaky(x, l) = l · min(0, x) + max(0, x), (1)

where l is a number between 0 and 1 (typically very close to 0), allowing the gradient to pass even if x < 0. While in leaky ReLU the parameter l is kept fixed, He et al. [21] proposed Parametric Rectified Linear Units (PReLU), in which the leak l is learned by back-propagation. Different from ReLU, the Exponential Linear Unit (ELU) [22] outputs negative values, while still avoiding the vanishing gradient problem on the positive side of the domain. This helps in bringing the mean unit activation down to zero, enabling faster convergence times. Another generalization of ReLU is the Maxout unit [23], which, instead of applying an element-wise function, divides the input into groups of k values and then outputs the maximum value of each group.

In contrast to most recent (ReLU, PReLU, ELU, etc.) and historically-motivated (sign, sigmoid, tanh, etc.) activation functions, we propose an activation function that transforms a single artificial neuron into a non-linear classifier. To support our statement, we prove that a neuron followed by our apical dendrite activation function can learn the XOR logical function. We note that there are other activation functions, e.g. Swish [10] and the Radial Basis Function (RBF), that generate non-linear decision boundaries. Different from Swish and RBF, our activation function is supported by recent neuroscience discoveries [1]. In addition, our experiments show that ADA and leaky ADA provide superior performance levels.

2.2 Models of Artificial Neurons

To our knowledge, the first mathematical model of the biological neuron is the perceptron [6]. Rosenblatt's perceptron was introduced along with a rule for updating the weights, which converges to a solution only if the data set is linearly separable. Although the perceptron is a simple and old model, it represents the foundation of modern DNNs. Different from Rosenblatt's perceptron, the Adaptive Linear Neuron (ADALINE) [24] updates its weights via stochastic gradient descent, back-propagating the error before applying the sign function. ADALINE has the same disadvantage as Rosenblatt's perceptron, namely that it cannot produce non-linear decision boundaries.

In contrast to Rosenblatt's perceptron and ADALINE, we propose an artificial neuron that has two input branches, the basal branch and the apical tuft. The apical tuft is particularly novel because it uses a novel activation function [1] that can solve non-linearly separable problems.

There are a few works that studied various aspects of the modeling of pyramidal neurons, e.g. segregated dendrites in the context of deep learning [25] or memorizing sequences with active dendrites and multiple integration zones [26]. Inspired by the recent discovery of Gidon et al. [1], to our knowledge, we are the first to propose a human-like artificial pyramidal neuron. Different from previous studies, the apical tuft of our pyramidal neuron is equipped with the novel apical dendrite activation suggested by Gidon et al. [1].


Furthermore, we integrate our pyramidal neuron into various deep neural architectures, showing its benefits over standard artificial neurons. We note that Gidon et al. [1] have not presented the pyramidal neuron in a computational scenario, hence we are the first to model it computationally.

3 NON-LINEAR NEURONS WITH APICAL DENDRITE ACTIVATIONS

3.1 ADA: Apical Dendrite Activation Function

The activation function illustrated in Figure 1a and deduced from the experiments conducted by Gidon et al. [1] can be formally expressed as follows:

y = 0, if x < 0;  y = exp(−x), if x ≥ 0, (2)

where x ∈ R is the input of the activation function and y is the output.

Since the function defined in Equation (2) is not useful in practice¹, we propose a closed form definition that approximates the activation function defined in Equation (2), as follows:

y = ADA(x, α, c) = max(0, x) · exp(−x · α + c), (3)

where x ∈ R is the input of the activation function, α > 0 is a parameter that controls the width of the peak, c > 0 is a constant that controls the height of the peak and y is the output. Similar to ReLU, our apical dendrite activation (ADA) is saturated on the negative side, i.e. its gradients are equal to zero for x < 0. Thus, a neural model trained with back-propagation [29] would not update the corresponding weights. We therefore propose leaky ADA, a more generic version that avoids saturation on the negative side, just as leaky ReLU. We formally extend the definition of ADA from Equation (3) to leaky ADA as follows:

y = ADA_leaky(x, α, c, l) = l · min(0, x) + ADA(x, α, c), (4)

where 0 ≤ l ≤ 1 is the leak parameter controlling the function steepness on the negative side and the other parameters are the same as in Equation (3). By setting l = 0.005, α = 1 and c = 1 in Equation (4), we obtain the activation function illustrated in Figure 1b. By comparing Figures 1a and 1b, we observe that the (leaky) ADA function defined in Equation (4) has a similar shape to the transfer function defined in Equation (2). Indeed, both functions have no activation or almost no activation when x < 0. Then, there is a high activation peak for small but positive values of x. Finally, the activation damps along the horizontal axis, as x gets larger and larger.

¹ Commonly-used deep learning frameworks such as TensorFlow [27] and PyTorch [28] do not cope well with functions containing if branches.
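To make the closed-form definition concrete, here is a minimal NumPy sketch of Equations (3) and (4); the function names and the demonstration values are our own illustration, not code taken from the authors' repository.

```python
import numpy as np

def ada(x, alpha=1.0, c=1.0):
    # Equation (3): max(0, x) * exp(-x * alpha + c); branch-free, so it maps
    # directly onto tensor operations in TensorFlow or PyTorch.
    return np.maximum(0.0, x) * np.exp(-x * alpha + c)

def leaky_ada(x, alpha=1.0, c=1.0, leak=0.005):
    # Equation (4): a small negative-side slope avoids zero gradients for x < 0,
    # analogous to leaky ReLU.
    return leak * np.minimum(0.0, x) + ada(x, alpha, c)

if __name__ == "__main__":
    xs = np.linspace(-2.0, 6.0, 9)
    # Reproduces the shape in Figure 1b: a flat (slightly negative) left side,
    # a peak just above zero, then a damped right tail.
    print(leaky_ada(xs, alpha=1.0, c=1.0, leak=0.005))
```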

Lemma 1. There exists an artificial neuron followed by the apical dendrite activation from Equation (3) which can predict the labels for the XOR logical function, by rounding its output.

Proof. Given an input data sample represented as a row vector x ∈ R^n, the output y ∈ R of an artificial neuron with ADA is obtained as follows:

y = ADA(x · w + b, α, c), (5)

where α and c are defined as in Equation (3), w is the column weight vector and b is the bias term. The following equation shows how to obtain the rounded output:

y = ⌊ADA(x · w + b, α, c)⌉, (6)

where ⌊·⌉ is the rounding function. Similarly, we can obtain the rounded outputs for an entire set of data samples represented as row vectors in an input matrix X:

Y = ⌊ADA(X · w + b, α, c)⌉. (7)

Let X and T represent the data samples and the targets corresponding to the XOR logical function, i.e.:

X = [0 0; 0 1; 1 0; 1 1],  T = [0; 1; 1; 0]. (8)

We next provide an example of weights and parameters to prove our lemma. By setting w = [5; 5], b = −4, α = 1 and c = 1 in Equation (7), we obtain the following output:

Y = ⌊ADA(X · [5; 5] − 4, 1, 1)⌉ = ⌊ADA([−4; 1; 1; 6], 1, 1)⌉ ≈ ⌊[0; 1; 1; 0.04]⌉ = [0; 1; 1; 0]. (9)

Since the output Y computed in Equation (9) is equal to the target T defined in Equation (8), it follows that Lemma 1 holds.

Our proof is intuitively explained in Figure 2. The four data points from the XOR data set are represented on a plane and the output of the neuron followed by ADA is represented on the axis perpendicular to the plane in which the XOR points reside. The output for the red points (labeled as class 0) is 0 or close to 0, while the output for the green points (labeled as class 1) is 1. Applying the rounding function ⌊·⌉ on top of the output depicted in Figure 2 is equivalent to setting a threshold equal to 0.5, labeling all points above the threshold with class 1 and all points below the threshold with class 0. This gives us the labels for the XOR logical function.

Corollary 2. There exists an artificial neuron followed by the apical dendrite activation from Equation (3) which can predict the labels for the OR logical function, by rounding its output.

Proof. We can trivially prove Corollary 2 by following the proof for Lemma 1. We just have to set α and c to different values, e.g. α = 0.4 and c = 0.5.

Corollary 3. There exists an artificial neuron followed by the apical dendrite activation from Equation (3) which can predict the labels for the AND logical function, by rounding its output.

Proof. We can trivially prove Corollary 3 by following the proof for Lemma 1. We just have to set the bias term to a different value, e.g. b = −9.


Fig. 2: The output of a neuron with apical dendrite activation, as defined in Equation (5), able to classify the XOR logical function. The output is obtained by setting the weights to w = [5; 5], the bias term to b = −4 and the parameters of the ADA function to α = 1 and c = 1. Large output values (closer to 1) correspond to the green data points labeled as class 1, while low output values (closer to 0) correspond to the red data points labeled as class 0. Best viewed in color.

From Lemma 1, Corollary 2 and Corollary 3, it follows that an artificial neuron followed by ADA has more computational power than a standard artificial neuron followed by sigmoid, ReLU or other commonly-used activation functions. More precisely, the ADA function enables individual artificial neurons to classify both linearly and non-linearly separable data.
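These claims are easy to check numerically. The short sketch below, which is our own illustration and not the authors' code, evaluates Equation (7) for the XOR, OR and AND parameter choices given above and prints the rounded outputs.

```python
import numpy as np

def ada(x, alpha=1.0, c=1.0):
    # Apical dendrite activation, Equation (3).
    return np.maximum(0.0, x) * np.exp(-x * alpha + c)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
w = np.array([5.0, 5.0])

# XOR (Lemma 1): w = [5; 5], b = -4, alpha = 1, c = 1.
print(np.round(ada(X @ w - 4.0, alpha=1.0, c=1.0)))   # [0. 1. 1. 0.]

# OR (Corollary 2): same weights and bias, alpha = 0.4, c = 0.5.
print(np.round(ada(X @ w - 4.0, alpha=0.4, c=0.5)))   # [0. 1. 1. 1.]

# AND (Corollary 3): same weights, b = -9, alpha = 1, c = 1.
print(np.round(ada(X @ w - 9.0, alpha=1.0, c=1.0)))   # [0. 0. 0. 1.]
```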

3.2 PyNADA: Pyramidal Neurons with Apical Dendrite Activations

Pyramidal neurons have two types of dendrites: apical dendrites and basal dendrites. Electrical impulses are sent to the neuron through both kinds of dendrites and the impulse is passed down the axon, if an action potential occurs. Prior to Gidon et al. [1], it was thought that apical and basal dendrites had identical activation functions. This is because experiments were usually conducted on pyramidal neurons extracted from rodents. In this context, proposing an artificial pyramidal neuron would not make much sense, because its mathematical model would be identical to a standard artificial neuron. Gidon et al. [1] observed that the apical dendrites of pyramidal neurons in the human cerebral cortex have a different (previously unknown) activation function, while the basal dendrites exhibit the well-known hard-limit transfer function. This observation calls for a new model of artificial pyramidal neurons. We therefore propose pyramidal neurons with apical dendrite activations (PyNADA).

Given an input data sample x ∈ R^n, the output y ∈ R of a PyNADA is obtained through the following equation:

y = ReLU(x · w′ + b′) + ADA(x · w′′ + b′′, α, c), (10)

where α and c are defined as in Equation (3), w′ and w′′ are column weight vectors and b′ and b′′ are bias terms. A graphical representation of PyNADA is provided in Figure 3. In the proposed model, the input x is distributed to the basal dendrites, represented by the weight vector w′ and the bias term b′, and to the apical dendrites (apical tuft), represented by the weight vector w′′ and the bias term b′′.

For practical reasons, we replace the hard-limit transfer function, suggested by Gidon et al. [1] for the basal dendrites, with the ReLU activation. This change ensures that we can optimize the weights w′ and the bias b′ through back-propagation, i.e. we have at least some non-zero gradients. Since the intensity of electrical impulses is always positive, the biological model proposed by Gidon et al. [1] is defined for positive inputs and the thresholds for the activation functions are well above 0. However, an artificial neuron can also take as input negative values, i.e. x ∈ R^n. Hence, the thresholds of the activation functions used in PyNADA are set to 0 (but the bias terms can shift these thresholds).

Fig. 3: A pyramidal neuron with apical dendrite activations (PyNADA). The input x goes through the basal dendrites followed by ReLU and through the apical tuft followed by ADA. The results are summed up and passed through the axon. Best viewed in color.
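As an illustration of Equation (10), the following PyTorch sketch implements a fully-connected PyNADA layer; the class name, the learnable α option and the default values are our assumptions and may differ from the official implementation.

```python
import torch
import torch.nn as nn

def ada(x, alpha, c):
    # Equation (3), written with tensor operations so it stays differentiable.
    return torch.clamp(x, min=0.0) * torch.exp(-x * alpha + c)

class PyNADA(nn.Module):
    """Pyramidal neuron layer: basal branch (ReLU) + apical branch (ADA), Equation (10)."""
    def __init__(self, in_features, out_features, alpha=1.0, c=0.0, learn_alpha=True):
        super().__init__()
        self.basal = nn.Linear(in_features, out_features)   # w', b'
        self.apical = nn.Linear(in_features, out_features)  # w'', b''
        self.alpha = nn.Parameter(torch.tensor(alpha), requires_grad=learn_alpha)
        self.c = c

    def forward(self, x):
        # The two branches are summed, so the activation map has the same size
        # as for a standard layer, even though the number of weights doubles.
        return torch.relu(self.basal(x)) + ada(self.apical(x), self.alpha, self.c)

layer = PyNADA(in_features=128, out_features=64)
out = layer(torch.randn(32, 128))   # -> shape (32, 64)
```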

Corollary 4. There exists a pyramidal neuron with apical dendrite activation, as defined in Equation (10), which can predict the labels for the XOR logical function, by rounding its output.

Proof. We can trivially prove Corollary 4 by following the proof for Lemma 1. For the apical tuft, we can simply set w′′ = [5; 5], b′′ = −4, α = 1 and c = 1 in Equation (10), just as in the proof for Lemma 1. The demonstration results immediately when we simply drop out the basal dendrites by setting w′ = [0; 0] and b′ = 0.

4 EXPERIMENTS

4.1 Data Sets

We conduct experiments on five data sets: MOROCO [19], UTKFace [17], CREMA-D [18], Fashion-MNIST [15] and Tiny ImageNet.

MOROCO. We conduct text classification experiments on MOROCO [19], a data set of 33,564 news articles that are written in Moldavian or Romanian. Each text sample has an average of 309 tokens, the total number of tokens being over 10 million. The main task is to discriminate between the Moldavian and the Romanian dialects. The data set comes with a training set of 21,719 text samples, a validation set of 5,921 text samples and a test set of 5,924 text samples.


UTKFace. UTKFace [17] is a large data set of 23,708 high-resolution images. The images contain faces of people of various age, gender and ethnicity. We use the unaligned cropped faces in our experiments. We randomly divide the data set into a training set of 16,595 images (70%), a validation set of 3,556 images (15%) and a test set of 3,557 images (15%). We consider two tasks on UTKFace: gender recognition (binary classification) and age estimation (regression).

CREMA-D. The CREMA-D multi-modal database [18] contains 7,442 video clips of 91 actors (48 male and 43 female) with different ethnic backgrounds. The actors were asked to convey particular emotions while producing, with different intonations, 12 particular sentences that evoke the target emotions. Six labels have been used to discriminate among different emotion categories: neutral, happy, anger, disgust, fear and sad. In our experiments, we consider only the audio modality. We split the audio samples into 70% for training, 15% for validation and 15% for testing.

Fashion-MNIST. Fashion-MNIST [15] is a recently introduced data set that shares the same structure as the more popular MNIST [11] data set, i.e. it contains 60,000 training images and 10,000 test images that belong to 10 classes of fashion items. We use a subset of 10,000 images from the training set for validation.

Tiny ImageNet. We present results with longer apical dendrites on Tiny ImageNet, a subset of ImageNet [16]. The data set provides 500 training images, 50 validation images and 50 test images for each of 200 object classes. The size of each image is 64×64 pixels.

4.2 Evaluation Setup

Evaluation metrics. For the classification tasks (object recognition, gender prediction, voice emotion recognition and dialect identification), we report the classification accuracy. For the regression task (age estimation), we report the mean absolute error (MAE).

Baselines. We consider several neural architectures ranging from shallow MLPs to deep CNNs: one-hidden-layer MLP, two-hidden-layer MLP, LeNet, VGG-9, ResNet-18, ResNet-50 and character-level CNNs with and without Squeeze-and-Excitation (SE) blocks [30]. The baseline architectures are based on ReLU, leaky ReLU, RBF or Swish. The latter two activation functions are added because they share one property with ADA, namely the capability to solve XOR when plugged into a single artificial neuron. The goal of our experiments is to study the effect of (i) replacing ReLU and leaky ReLU with ADA and leaky ADA, respectively, and (ii) replacing the standard neurons with our PyNADA. The number of weights in PyNADA is twice as high, but the size of the activation maps is the same as in standard neurons (due to the summation of the two branches in PyNADA). For a fair comparison, we included three baseline pyramidal neurons: one with ReLU on both branches (PyNReLU), one with ReLU and RBF (PyNRBF) and one with ReLU and Swish (PyNSwish). All neural models are trained using the Adam optimizer [31]. With the exception of ResNet-18 and ResNet-50, which are implemented in PyTorch [28], all other neural networks are implemented in TensorFlow [27]. Since we employ different architectures in each task, we describe them in more detail in the corresponding subsections below.

In summary, we emphasize that our experiments aim to compare models within the following three groups: (i) ReLU, RBF, Swish versus ADA; (ii) leaky ReLU versus leaky ADA; (iii) PyNReLU, PyNRBF, PyNSwish versus PyNADA and leaky PyNADA. Note that for PyNADA we consider both standard and leaky activations. In leaky PyNADA, the basal dendrites are followed by leaky ReLU and the apical dendrites are followed by leaky ADA. Significance testing is performed within each group, considering the above organization. We also emphasize that the compared models within a group have the same number of parameters and the same computational complexity, e.g. we do not aim to compare ReLU with PyNADA directly.

Generic hyperparameter tuning. We tune the basic hyperparameters, such as the learning rate, the mini-batch size and the number of epochs, for each neural architecture, using grid search on the validation set of the corresponding task. We consider learning rates between 10⁻³ and 10⁻⁵, mini-batches of 10, 32, 64 or 128 samples and numbers of epochs between 10 and 100. The other parameters of the Adam optimizer are used with default values. The parameter tuning is performed using the baseline architectures based on ReLU. Once tuned, the same parameters are used for leaky ReLU, RBF, Swish, ADA, leaky ADA, PyNReLU, PyNRBF, PyNSwish, PyNADA and leaky PyNADA. Hence, we emphasize that the basic parameters are tuned in favor of ReLU. The leak parameter for leaky ReLU and leaky ADA is set to l = 0.01 in all the experiments, without further tuning. We note that (leaky) ADA and (leaky) PyNADA have two additional hyperparameters that require tuning on the validation set. For the constant c, we consider two possible values, either 0 or 1. For the parameter α, we consider two options: (a) perform grid search in the range [0.1, 1] using a step of 0.1, or (b) learn α using gradient descent during training. As recommended in [10], we used a learnable β in Swish.

To avoid accuracy variations due to weight initialization or stochastic training, we trained each neural model in five consecutive trials, keeping the model with the highest validation performance. In the experiments, we report the performance level of the selected neural models on the held-out test set.
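One practical way to obtain the ADA-based variants of existing architectures, such as the ResNet models used in the following experiments, is to recursively replace every ReLU module with an ADA module. The PyTorch sketch below is our own illustration of this idea (the module and helper names are ours), not the authors' released code; α can be kept fixed, grid-searched, or made learnable as described above.

```python
import torch
import torch.nn as nn
from torchvision import models

class ADA(nn.Module):
    """Apical dendrite activation, Equation (3): max(0, x) * exp(-alpha * x + c)."""
    def __init__(self, alpha=0.1, c=0.0, learn_alpha=False):
        super().__init__()
        # alpha can be fixed (grid-searched in [0.1, 1]) or learned by gradient descent.
        self.alpha = nn.Parameter(torch.tensor(alpha), requires_grad=learn_alpha)
        self.c = c

    def forward(self, x):
        return torch.clamp(x, min=0.0) * torch.exp(-x * self.alpha + self.c)

def replace_relu(module, act_factory):
    # Recursively swap every nn.ReLU in an existing model for the given activation.
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, act_factory())
        else:
            replace_relu(child, act_factory)

model = models.resnet50(num_classes=2)                 # e.g. gender prediction on UTKFace
replace_relu(model, lambda: ADA(alpha=0.1, c=0.0))     # ADA variant of the ReLU baseline
```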

4.3 Results on MOROCO

Neural architectures. For dialect identification, we consider the character-level CNN models presented in [19], which follow closely the model of Zhang et al. [14]. The two CNNs share the same architecture, being composed of an embedding layer, followed by three convolutional and max-pooling blocks, two fully-connected layers with dropout 0.5 and the final softmax classification layer. The second architecture incorporates an attention mechanism in the form of Squeeze-and-Excitation (SE) blocks [30] inserted after every convolutional layer. For the SE blocks, we set the reduction ratio to 64. For both architectures, we keep the same size for the embedding (256) and the same number of convolutional filters (128) as Butnaru et al. [19]. In fact, our baseline architectures (CNN and CNN+SE) are identical to those of Butnaru et al. [19].
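For readers who want a concrete picture of this architecture, below is a rough Keras sketch assembled from the description above; it is our own reconstruction, and the alphabet size, input length, kernel sizes, pooling sizes and fully-connected layer sizes are assumptions (the exact values are given in [14], [19]). In the ADA variants, the ReLU activations would be replaced by ADA or leaky ADA.

```python
import tensorflow as tf

num_classes = 2          # Moldavian vs. Romanian
alphabet_size = 100      # assumption: size of the character vocabulary
max_len = 5000           # assumption: maximum number of characters per news article

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(alphabet_size, 256),        # embedding of size 256
    tf.keras.layers.Conv1D(128, 7, activation="relu"),    # conv + max-pooling block 1 (128 filters)
    tf.keras.layers.MaxPooling1D(3),
    tf.keras.layers.Conv1D(128, 7, activation="relu"),    # conv + max-pooling block 2
    tf.keras.layers.MaxPooling1D(3),
    tf.keras.layers.Conv1D(128, 3, activation="relu"),    # conv + max-pooling block 3
    tf.keras.layers.MaxPooling1D(3),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1024, activation="relu"),       # fully-connected layer 1
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1024, activation="relu"),       # fully-connected layer 2
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```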


TABLE 1: Dialect identification accuracy rates (in %) for two character-level neural models (CNN and CNN+SE) on MOROCO. Results are reported with various activations (ReLU, leaky ReLU, RBF, Swish, ADA, leaky ADA) and artificial neurons (standard, PyNReLU, PyNRBF, PyNSwish and PyNADA). Results that are significantly better than corresponding baselines, according to a paired McNemar's test [32], are marked with † or ‡ for the significance levels 0.05 or 0.01, respectively. Training times are measured on a computer with an Nvidia GeForce GTX 1080 GPU with 11GB of RAM. The best model within each group is highlighted in bold.

Model | Activation | Parameter | Test Accuracy | Epoch Time (s)
CNN [19] | ReLU | - | 92.79 | 27
CNN | RBF | - | 54.10 | 34
CNN | Swish | β = learnable | 86.96 | 33
CNN | ADA | α = learnable | 93.68† | 33
CNN | leaky ReLU | - | 92.90 | 30
CNN | leaky ADA | α = 0.4 | 93.55† | 37
CNN+PyNReLU | ReLU, ReLU | - | 92.79 | 52
CNN+PyNRBF | ReLU, RBF | - | 86.44 | 58
CNN+PyNSwish | ReLU, Swish | β = learnable | 91.20 | 55
CNN+PyNADA | ReLU, ADA | α = learnable | 93.61† | 52
CNN+PyNADA | leaky ReLU, leaky ADA | α = learnable | 93.53† | 57
CNN+SE [19] | ReLU | - | 92.99 | 42
CNN+SE | RBF | - | 54.10 | 48
CNN+SE | Swish | β = learnable | 87.99 | 47
CNN+SE | ADA | α = learnable | 93.99‡ | 47
CNN+SE | leaky ReLU | - | 93.06 | 45
CNN+SE | leaky ADA | α = learnable | 93.49† | 51
CNN+SE+PyNReLU | ReLU, ReLU | - | 92.97 | 52
CNN+SE+PyNRBF | ReLU, RBF | - | 86.23 | 65
CNN+SE+PyNSwish | ReLU, Swish | β = learnable | 91.39 | 61
CNN+SE+PyNADA | ReLU, ADA | α = 0.5 | 93.72† | 63
CNN+SE+PyNADA | leaky ReLU, leaky ADA | α = learnable | 93.61† | 60

Specific hyperparameter tuning. Since Butnaru et al. [19] already tuned the hyperparameters of the character-level CNNs on the MOROCO validation set, we decided to use the same parameters and skip the grid search. Hence, we set the learning rate to 5·10⁻⁴ and use mini-batches of 128 samples. Each CNN is trained for 50 epochs in 5 trials, keeping the model with the highest validation accuracy for evaluation on the test set. For (leaky) ADA and (leaky) PyNADA, we obtain optimal results with c = 1, while α is either validated or optimized during training.

Results. We present the dialect identification results on MOROCO in Table 1. First, we observe that the baseline CNN and CNN+SE models confirm the results reported by Butnaru et al. [19]. For the character-level CNN (without SE blocks), we obtain the largest improvement on the test set when we replace ReLU (92.79%) with ADA (93.68%). Furthermore, the improvements of leaky ADA, PyNADA and leaky PyNADA are all higher than 0.6%, and the differences are statistically significant. The results are somewhat consistent across the two architectures, CNN and CNN+SE. For example, for the CNN+SE model, we report the largest improvement by replacing ReLU (92.99%) with ADA (93.99%), just as for the CNN without SE blocks. Our highest absolute gain on MOROCO is 1%. Overall, the results indicate that all variants of (leaky) ADA and (leaky) PyNADA attain significantly better results than the corresponding baselines. We observe that neither character-level CNN model is able to converge using RBF as activation. For RBF, we reported the test accuracy corresponding to the last model before the gradients explode. Interestingly, PyNRBF is able to converge, probably due to the basal tuft based on ReLU, but its performance level is not adequate. The networks converge with Swish and PyNSwish, but the corresponding results are much lower than those with (leaky) ReLU or (leaky) ADA.

Running times. Since ADA needs to compute the exponential function, it is more computationally intensive than ReLU. Moreover, PyNADA has twice as many weights as a standard neuron. Hence, in addition to the accuracy rates, we also report the training time (in seconds per epoch) in Table 1. With respect to (leaky) ReLU, it seems that (leaky) ADA requires 5 to 7 additional seconds per epoch, which means that the training time increases by 15% to 20%. The other activation functions that include the exponential function, e.g. RBF and Swish, have the same disadvantage as ADA.

Meanwhile, (leaky) PyNADA seems to need about 25 to 30 extra seconds per epoch compared to (leaky) ReLU, increasing the training time by 60% to 100%. We thus conclude that the accuracy improvements brought by (leaky) ADA and (leaky) PyNADA come with a non-negligible computational cost with respect to ReLU or leaky ReLU. At the same time, RBF and Swish are slower than ReLU, while also attaining inferior accuracy levels.

As the time measurements are consistent across all benchmarks, we refrain from reporting and commenting on the running times in the subsequent experiments.

4.4 Results on UTKFace

Neural architecture. For gender prediction and age estimation, we employ the ResNet-50 architecture [13]. Residual networks use batch normalization and skip connections to propagate information over convolutional layers, avoiding the vanishing or exploding gradient problem. This enables the effective training of very deep models such as ResNet-50, which consists of 50 layers.

Specific hyperparameter tuning. All ResNet-50 variants are trained on mini-batches of 10 samples using a learning rate of 10⁻⁴.


TABLE 2: Gender prediction accuracy rates (in %) and age estimation MAEs for ResNet-50 on UTKFace. Results are reported with various activations (ReLU, leaky ReLU, RBF, Swish, ADA, leaky ADA) and artificial neurons (standard, PyNReLU, PyNRBF, PyNSwish and PyNADA). Results that are significantly better than corresponding baselines, according to a paired McNemar's test [32], are marked with ‡ for the significance level 0.01. The best model within each group is highlighted in bold.

Task | Model | Activation | Parameter | Test Accuracy
Gender prediction | ResNet-50 | ReLU | - | 88.56
Gender prediction | ResNet-50 | RBF | - | 50.12
Gender prediction | ResNet-50 | Swish | β = learnable | 87.97
Gender prediction | ResNet-50 | ADA | α = 0.1 | 88.49
Gender prediction | ResNet-50 | leaky ReLU | - | 88.91
Gender prediction | ResNet-50 | leaky ADA | α = 0.1 | 89.12
Gender prediction | ResNet-50+PyNReLU | ReLU, ReLU | - | 88.50
Gender prediction | ResNet-50+PyNRBF | ReLU, RBF | - | 90.77
Gender prediction | ResNet-50+PyNSwish | ReLU, Swish | β = learnable | 50.44
Gender prediction | ResNet-50+PyNADA | ReLU, ADA | α = 0.5 | 91.65‡
Gender prediction | ResNet-50+PyNADA | leaky ReLU, leaky ADA | α = 0.5 | 91.53‡

Task | Model | Activation | Parameter | Test MAE
Age estimation | ResNet-50 | ReLU | - | 6.39
Age estimation | ResNet-50 | RBF | - | 14.76
Age estimation | ResNet-50 | Swish | β = learnable | 6.59
Age estimation | ResNet-50 | ADA | α = learnable | 5.91‡
Age estimation | ResNet-50 | leaky ReLU | - | 6.01
Age estimation | ResNet-50 | leaky ADA | α = learnable | 5.88‡
Age estimation | ResNet-50+PyNReLU | ReLU, ReLU | - | 7.35
Age estimation | ResNet-50+PyNRBF | ReLU, RBF | - | 6.61
Age estimation | ResNet-50+PyNSwish | ReLU, Swish | β = learnable | 15.12
Age estimation | ResNet-50+PyNADA | ReLU, ADA | α = 0.5 | 5.74‡
Age estimation | ResNet-50+PyNADA | leaky ReLU, leaky ADA | α = 0.5 | 5.77‡

TABLE 3: Voice emotion recognition accuracy rates (in %) for ResNet-18 on CREMA-D. Results are reported with various activations (ReLU, leaky ReLU, RBF, Swish, ADA, leaky ADA) and artificial neurons (standard, PyNReLU, PyNRBF, PyNSwish, PyNADA). Results that are significantly better than corresponding baselines, according to a paired McNemar's test [32], are marked with ‡ for the significance level 0.01. As reference, the state-of-the-art methods are also included. The best model within each group is highlighted in bold.

Model | Activation | Parameter | Test Accuracy
Shukla et al. [33] | ReLU | - | 55.01
He et al. [34] | leaky ReLU | - | 58.71
ResNet-18 | ReLU | - | 62.28
ResNet-18 | RBF | - | 57.66
ResNet-18 | Swish | β = learnable | 16.66
ResNet-18 | ADA | α = 0.1 | 64.53‡
ResNet-18 | leaky ReLU | - | 63.13
ResNet-18 | leaky ADA | α = 0.1 | 65.37‡
ResNet-18+PyNReLU | ReLU, ReLU | - | 63.39
ResNet-18+PyNRBF | ReLU, RBF | - | 58.72
ResNet-18+PyNSwish | ReLU, Swish | β = learnable | 16.66
ResNet-18+PyNADA | ReLU, ADA | α = 0.1 | 65.15‡
ResNet-18+PyNADA | leaky ReLU, leaky ADA | α = 0.1 | 65.10‡

The models are trained for 15 epochs on the gender prediction task, and for 100 epochs on the age estimation task. For (leaky) ADA and (leaky) PyNADA, we either validate or learn the parameter α, while setting the parameter c = 0.

Results. We present the gender prediction and age estimation results in Table 2. In the gender prediction task, we notice that ADA yields slightly lower results than ReLU, while leaky ADA attains slightly better results than leaky ReLU. Both RBF and Swish provide lower results than ReLU and ADA. Nevertheless, we observe significant improvements with PyNADA over PyNReLU. Our highest absolute gain (3.15%) in the gender prediction task is obtained when ResNet-50 is equipped with PyNADA (91.65%) instead of the baseline PyNReLU on both branches (88.50%). In the age estimation task, we notice that all versions of (leaky) ADA and (leaky) PyNADA surpass the corresponding baselines by significant margins. With an MAE of 5.74 on the test set, our PyNADA attains the best results in age estimation. With respect to the PyNReLU baseline, PyNADA reduces the average error by 1.61 years and the difference is statistically significant.

4.5 Results on CREMA-D

Neural architecture. For speech emotion recognition, we employ the ResNet-18 architecture [13]. We modified the number of input channels of ResNet-18 from 3 to 2, in order to feed the network with the Short Time Fourier Transform of the raw audio signals, where the real and the imaginary parts are considered as separate input channels.
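One plausible way to build this two-channel input and adapt a torchvision ResNet-18 is sketched below; the STFT window size and hop length, the sampling rate and the use of torchvision are our assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

def audio_to_two_channels(waveform, n_fft=512, hop_length=256):
    # Short Time Fourier Transform of the raw signal; the real and imaginary
    # parts become two input channels. n_fft and hop_length are illustrative.
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=torch.hann_window(n_fft), return_complex=True)
    return torch.stack([spec.real, spec.imag], dim=0)      # shape: (2, freq_bins, frames)

model = models.resnet18(num_classes=6)                      # six emotion classes in CREMA-D
model.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 3 -> 2 channels

waveform = torch.randn(16000)                               # one second of dummy audio at 16 kHz
x = audio_to_two_channels(waveform).unsqueeze(0)            # add the batch dimension
logits = model(x)
```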


TABLE 4: Object class recognition accuracy rates (in %) for four neural models (one-hidden-layer MLP, two-hidden-layer MLP, LeNet, VGG-9) on Fashion-MNIST. Results are reported with various activations (ReLU, leaky ReLU, RBF, Swish, ADA, leaky ADA) and artificial neurons (standard, PyNReLU, PyNRBF, PyNSwish and PyNADA). Results that are significantly better than corresponding baselines, according to a paired McNemar's test [32], are marked with † or ‡ for the significance levels 0.05 or 0.01, respectively. As reference, the results with one-hidden-layer and two-hidden-layer MLPs reported by Xiao et al. [15] are also included. The best model within each group is highlighted in bold.

Model | Activation | Parameter | Test Accuracy
one-hidden-layer MLP [15] | ReLU | - | 87.10
one-hidden-layer MLP [15] | tanh | - | 86.80
one-hidden-layer MLP | ReLU | - | 88.88
one-hidden-layer MLP | RBF | - | 88.38
one-hidden-layer MLP | Swish | β = learnable | 88.85
one-hidden-layer MLP | ADA | α = 0.3 | 88.98
one-hidden-layer MLP | leaky ReLU | - | 88.40
one-hidden-layer MLP | leaky ADA | α = 0.3 | 88.97‡
one-hidden-layer MLP+PyNReLU | ReLU, ReLU | - | 89.00
one-hidden-layer MLP+PyNRBF | ReLU, RBF | - | 88.89
one-hidden-layer MLP+PyNSwish | ReLU, Swish | β = learnable | 89.03
one-hidden-layer MLP+PyNADA | ReLU, ADA | α = learnable | 89.45‡
one-hidden-layer MLP+PyNADA | leaky ReLU, leaky ADA | α = learnable | 89.34†
two-hidden-layer MLP [15] | ReLU | - | 87.00
two-hidden-layer MLP [15] | tanh | - | 86.30
two-hidden-layer MLP | ReLU | - | 88.71
two-hidden-layer MLP | RBF | - | 87.84
two-hidden-layer MLP | Swish | β = learnable | 88.92
two-hidden-layer MLP | ADA | α = learnable | 88.99
two-hidden-layer MLP | leaky ReLU | - | 88.18
two-hidden-layer MLP | leaky ADA | α = 0.1 | 88.93‡
two-hidden-layer MLP+PyNReLU | ReLU, ReLU | - | 88.98
two-hidden-layer MLP+PyNRBF | ReLU, RBF | - | 88.43
two-hidden-layer MLP+PyNSwish | ReLU, Swish | β = learnable | 88.88
two-hidden-layer MLP+PyNADA | ReLU, ADA | α = learnable | 89.40‡
two-hidden-layer MLP+PyNADA | leaky ReLU, leaky ADA | α = learnable | 89.42‡
LeNet | ReLU | - | 90.84
LeNet | RBF | - | 88.94
LeNet | Swish | β = learnable | 90.47
LeNet | ADA | α = 0.3 | 91.49†
LeNet | leaky ReLU | - | 90.88
LeNet | leaky ADA | α = 0.9 | 91.38†
LeNet+PyNReLU | ReLU, ReLU | - | 90.87
LeNet+PyNRBF | ReLU, RBF | - | 90.43
LeNet+PyNSwish | ReLU, Swish | β = learnable | 91.04
LeNet+PyNADA | ReLU, ADA | α = 0.3 | 91.27
LeNet+PyNADA | leaky ReLU, leaky ADA | α = learnable | 91.60‡
VGG-9 | ReLU | - | 93.39
VGG-9 | RBF | - | 10.00
VGG-9 | Swish | β = learnable | 91.96
VGG-9 | ADA | α = 1.0 | 93.84‡
VGG-9 | leaky ReLU | - | 93.26
VGG-9 | leaky ADA | α = 0.3 | 93.63†
VGG-9+PyNReLU | ReLU, ReLU | - | 93.49
VGG-9+PyNRBF | ReLU, RBF | - | 10.00
VGG-9+PyNSwish | ReLU, Swish | β = learnable | 93.70
VGG-9+PyNADA | ReLU, ADA | α = 1.0 | 93.73
VGG-9+PyNADA | leaky ReLU, leaky ADA | α = 1.0 | 93.70

Specific hyperparameter tuning. All ResNet-18 variants are trained on mini-batches of 16 samples using a learning rate of 5·10⁻⁴. The models are trained for 70 epochs. For (leaky) ADA and (leaky) PyNADA, we either validate or learn the parameter α, while setting the parameter c = 0. Moreover, we applied training data augmentation using time shifting, thus obtaining additional training samples that improve the robustness of our deep neural network to data variation.
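The paper does not specify the exact augmentation procedure, so the snippet below is only a plausible sketch of time-shift augmentation on raw audio, based on a random circular shift; the shift range and the sampling rate are our assumptions.

```python
import numpy as np

def time_shift(waveform, max_shift=1600):
    # Randomly shift the raw audio signal in time (here by up to 1,600 samples,
    # i.e. 0.1 s at 16 kHz); an illustrative choice, not the paper's exact setting.
    shift = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(waveform, shift)

augmented = [time_shift(np.random.randn(16000)) for _ in range(4)]  # dummy waveforms
```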

Results. We present the speech emotion recognition results in Table 3. We observe that ADA and leaky ADA attain superior results compared with the other activation functions, when conventional neurons are employed in ResNet-18. The RBF activation function offers significantly lower results with respect to the most commonly-used activation function, ReLU, while the Swish function fails to converge for both standard and pyramidal neurons. We generally observe that the ResNet-18 based on pyramidal neurons outperforms the ResNet-18 based on conventional neurons, regardless of the activation function. This highlights the effectiveness of the pyramidal design. Our highest absolute gain (2.24%) is obtained for leaky ADA in comparison with leaky ReLU. Moreover, the performance gains brought by (leaky) ADA and (leaky) PyNADA are statistically significant with respect to the corresponding baselines.


TABLE 5: Object class recognition accuracy rates (in %) for ResNet-18 on Tiny ImageNet. Results are reported with two artificial neurons, PyNReLU and PyNADA, respectively. Results significantly better than the baseline, according to a paired McNemar's test [32], are marked with ‡ for the significance level 0.01. The best model is highlighted in bold.

Model | Activation | Parameter | Test Accuracy
ResNet-18+PyNReLU | ReLU, ReLU | - | 51.54
ResNet-18+PyNRBF | ReLU, RBF | - | 51.99
ResNet-18+PyNSwish | ReLU, Swish | β = learnable | 52.93
ResNet-18+PyNADA | ReLU, ADA | α = 0.1 | 53.86‡

We notice that our accuracy rates for the audio modality on CREMA-D surpass the state-of-the-art accuracy levels reported in [33], [34]. Given that we report significant improvements over baselines that are already superior to the state of the art, we consider that our results on CREMA-D are remarkable.

4.6 Results on Fashion-MNIST

Neural architectures. For the Fashion-MNIST data set, we consider two MLPs and two CNNs (LeNet, VGG-9). We kept the same design choices as in the original paper [12] for VGG-9, but for LeNet [11], we replaced the average-pooling layer with max-pooling. The first MLP architecture (MLP-1) is composed of one hidden layer with 100 units and one output layer with 10 units (the number of classes). The second MLP architecture (MLP-2) has two hidden layers with 100 units and 10 units, respectively, followed by the output layer with another 10 units. The considered MLP architectures are similar to those attaining better results among the MLP architectures evaluated by Xiao et al. [15].

Specific hyperparameter tuning. We trained LeNet for 30 epochs, using a learning rate of 10⁻³ for the first 15 epochs and 10⁻⁴ for the last 15 epochs. We trained MLP-1 and MLP-2 in the same manner as LeNet. However, VGG-9 was trained for 100 epochs, starting with a learning rate of 10⁻⁴ in the first 50 epochs, decreasing it to 10⁻⁵ in the last 50 epochs. We trained all models on mini-batches of 64 images. In all the experiments with (leaky) ADA or (leaky) PyNADA, we either validated or learned α and we set the parameter c = 0.

Results. We present the Fashion-MNIST results with various neural architectures and activation functions in Table 4. Our baseline MLP architectures obtain better accuracy rates than those of Xiao et al. [15], e.g. the difference obtained for MLP-1 with ReLU is 1.78% (we report 88.88%, while Xiao et al. [15] report 87.10%). We notice that leaky ReLU obtains slightly lower accuracy rates than ReLU, the only exception being the result with LeNet. Nonetheless, for both MLP and CNN architectures, the accuracy of ADA on the test set is superior (by up to 0.5%) compared to the accuracy of ReLU. We obtain statistically significant improvements for the replacement of ReLU with ADA, from 90.84% to 91.34% using LeNet and from 93.39% to 93.84% using VGG-9, respectively. Interestingly, we also noticed that the value of the cross-entropy loss is always lower (on both validation and test sets) when we use ADA instead of ReLU.

We observe that RBF converges only for small networks (MLP-1, MLP-2 or LeNet). When it converges, the results are below the ReLU and leaky ReLU baselines. For VGG-9, the accuracy of RBF is equal to the random choice baseline. Swish attains better results than ReLU for MLP-1 and MLP-2, but below ADA and leaky ADA. For LeNet and VGG-9, Swish surpasses only RBF (all other activation functions are better). When we employ our PyNADA, the accuracy rates improve for all the architectures. The largest improvement of leaky PyNADA on the test set (with respect to the baseline PyNReLU) is 0.73%, obtained with LeNet. The reported difference is statistically significant.

4.7 Results on Tiny ImageNet

We note that, in order to follow exactly the connectivity of pyramidal neurons, the apical dendrites should be longer, i.e. not connected to neurons in the immediately preceding layer i−1, but to neurons in layers i−2, i−3 or perhaps even further apart. With this design change, the apical dendrites act as a kind of skip-connection. We hereby show some results in this direction, although this aspect can be studied considering various architecture configurations in future work. Our goal is to prove that the biological design of pyramidal neurons is viable for artificial neural networks as well.

Neural architecture. For object recognition on Tiny ImageNet, we consider a ResNet-18 architecture [13] without standard skip-connections. Instead of standard skip-connections, we use PyNADA with longer apical dendrites, closely modeling the biological pyramidal neurons. Using longer apical dendrites introduces new parameters (e.g., where to connect the apical dendrites) and constraints (the pyramidal design does not apply to every network architecture). The ResNet-18 model allows us to connect the apical dendrites to layer i−3 instead of the immediately preceding layer i−1. As a first baseline for this experiment, we use PyNReLU with equally-long apical dendrites, but with ReLU instead of ADA for the apical tuft. The ResNet-18 with PyNReLU is similar to ResNet-18 with standard skip-connections, the difference being that the skip-connections have learnable weights and ReLU activations. We additionally consider two more baselines with similar design: PyNRBF and PyNSwish.

Specific hyperparameter tuning. The ResNet-18 models with PyNReLU, PyNRBF, PyNSwish and PyNADA are each trained for 120 epochs using a learning rate of 10⁻³ and mini-batches of 200 samples. For PyNADA, we obtain optimal results with c = 0 and α = 0.1 (obtained through validation). For Swish, the parameter β is learnable.

Results. We present the object recognition results on Tiny ImageNet in Table 5. First of all, we note that, without pre-training on ImageNet, the accuracy rates on Tiny ImageNet reported in the literature are typically around 50% or 60%. The baseline PyNReLU attains an accuracy of 51.54%. PyNADA brings a statistically significant improvement of 2.32% over the baseline.


The accuracy rates reached by PyNRBF and PyNSwish are between the accuracy rates of PyNReLU and PyNADA. This experiment shows that it is useful to consider longer apical dendrites in conjunction with ADA, as observed in biology.
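To make the longer apical dendrites concrete, the sketch below shows a convolutional PyNADA-style block whose apical branch reads the features of an earlier layer (i−3), acting like a learnable skip-connection followed by ADA; the layer shapes, the 1×1 apical convolution and the stride used to match spatial sizes are our assumptions, not the exact ResNet-18 modification used in the paper.

```python
import torch
import torch.nn as nn

def ada(x, alpha=0.1, c=0.0):
    # Apical dendrite activation, Equation (3).
    return torch.clamp(x, min=0.0) * torch.exp(-x * alpha + c)

class LongApicalPyNADA(nn.Module):
    """Basal branch reads layer i-1; apical branch reads layer i-3 (skip-like)."""
    def __init__(self, basal_channels, apical_channels, out_channels, apical_stride=2):
        super().__init__()
        self.basal = nn.Conv2d(basal_channels, out_channels, kernel_size=3, padding=1)
        # The apical convolution may need a stride to match the spatial size of layer i-1.
        self.apical = nn.Conv2d(apical_channels, out_channels, kernel_size=1, stride=apical_stride)

    def forward(self, x_prev, x_skip):
        # x_prev: features from layer i-1; x_skip: features from layer i-3.
        return torch.relu(self.basal(x_prev)) + ada(self.apical(x_skip))

block = LongApicalPyNADA(basal_channels=128, apical_channels=64, out_channels=128)
x_prev = torch.randn(1, 128, 16, 16)   # layer i-1 features
x_skip = torch.randn(1, 64, 32, 32)    # layer i-3 features (larger spatial size)
out = block(x_prev, x_skip)            # -> (1, 128, 16, 16)
```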

5 CONCLUSION AND FUTURE WORK

In this paper, we proposed a biologically-inspired activation function and a new model of artificial neuron. The novel apical dendrite activation function (i) enables individual artificial neurons to solve non-linearly separable problems such as the XOR logical function and (ii) brings significant performance improvements for a broad range of neural architectures and tasks. Indeed, we observed consistent performance improvements over the most popular activation function, ReLU, across five benchmark data sets. The proposed ADA also outperformed other activations, RBF and Swish, that enable individual artificial neurons to solve XOR. Even though RBF and Swish share with ADA the capability of solving XOR, the accuracy rates of RBF and Swish across the five evaluation benchmarks are inconsistent, in some cases even failing to converge. Our pyramidal neural design is another way to further boost the performance. Importantly, we observed the largest performance improvements when we used ADA instead of ReLU, RBF or Swish for the apical tuft of the pyramidal neurons. In conclusion, we believe that the biologically-inspired ADA and PyNADA are useful additions to the set of deep learning tools. We will release our code as open source subject to a favorable decision.

Future work. Our research also opens a few directions for future work. Since the activation damps along the positive side of the domain, we believe it is worth investigating if ADA is more robust to out-of-distribution or adversarial examples. As the gradient saturates on the positive side, other directions of study are to inject noise into ADA to avoid saturation [35] or to employ alternative optimization methods (that do not rely on gradients) in conjunction with ADA.

ACKNOWLEDGMENTS

This work was supported by a grant of the Romanian Ministry of Education and Research, CNCS - UEFISCDI, project number PN-III-P1-1.1-TE-2019-0235, within PNCDI III. The article has also benefited from the support of the Romanian Young Academy, which is funded by Stiftung Mercator and the Alexander von Humboldt Foundation for the period 2020-2022.

REFERENCES

[1] A. Gidon, T. A. Zolnik, P. Fidzinski, F. Bolduan, A. Papoutsi, P. Poirazi, M. Holtkamp, I. Vida, and M. E. Larkum, "Dendritic action potentials and computation in human layer 2/3 cortical neurons," Science, vol. 367, no. 6473, pp. 83–87, 2020.

[2] B. Knutson and J. C. Cooper, "The Lure of the Unknown," Neuron, vol. 51, no. 3, pp. 280–282, 2006.

[3] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[4] D. Xu, Y. Yan, E. Ricci, and N. Sebe, "Detecting anomalous events in videos by learning deep representations of appearance and motion," Computer Vision and Image Understanding, vol. 156, pp. 117–127, 2017.

[5] X. Wang, L. Gao, J. Song, X. Zhen, N. Sebe, and H. T. Shen, "Deep appearance and motion learning for egocentric activity recognition," Neurocomputing, vol. 275, pp. 438–447, 2018.

[6] F. Rosenblatt, "The Perceptron: a probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, no. 6, p. 386, 1958.

[7] M. Minsky and S. A. Papert, Perceptrons: An introduction to computational geometry. MIT Press, 1969.

[8] V. Nair and G. E. Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines," in Proceedings of ICML, 2010, pp. 807–814.

[9] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proceedings of WDLASL, 2013.

[10] P. Ramachandran, B. Zoph, and Q. V. Le, "Searching for Activation Functions," in Proceedings of ICLR Workshops, 2018.

[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[12] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in Proceedings of ICLR, 2014.

[13] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proceedings of CVPR, 2016, pp. 770–778.

[14] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Proceedings of NIPS, 2015, pp. 649–657.

[15] H. Xiao, K. Rasul, and R. Vollgraf, "Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms," arXiv preprint arXiv:1708.07747, 2017.

[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[17] Z. Zhang, Y. Song, and H. Qi, "Age Progression/Regression by Conditional Adversarial Autoencoder," in Proceedings of CVPR, 2017, pp. 5810–5818.

[18] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, "CREMA-D: Crowd-sourced emotional multimodal actors dataset," IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014.

[19] A. M. Butnaru and R. T. Ionescu, "MOROCO: The Moldavian and Romanian Dialectal Corpus," in Proceedings of ACL, 2019, pp. 688–698.

[20] S. Hayou, A. Doucet, and J. Rousseau, "On the Impact of the Activation Function on Deep Neural Networks Training," in Proceedings of ICML, 2019, pp. 2672–2680.

[21] K. He, X. Zhang, S. Ren, and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," in Proceedings of ICCV, 2015, pp. 1026–1034.

[22] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)," in Proceedings of ICLR, 2016.

[23] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout Networks," in Proceedings of ICML, 2013, pp. 1319–1327.

[24] B. Widrow, "An Adaptive 'Adaline' Neuron Using Chemical 'Memistors'," Stanford Electronics Laboratories, Tech. Rep. 1553-2, 1960.

[25] J. Guerguiev, T. P. Lillicrap, and B. A. Richards, "Towards deep learning with segregated dendrites," eLife, vol. 6, p. e22901, 2017.

[26] J. Hawkins and S. Ahmad, "Why Neurons Have Thousands of Synapses, a Theory of Sequence Memory in Neocortex," Frontiers in Neural Circuits, vol. 10, p. 23, 2016.

[27] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: A system for large-scale machine learning," in Proceedings of OSDI, 2016, pp. 265–283.

[28] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An Imperative Style, High-Performance Deep Learning Library," in Proceedings of NeurIPS, 2019, pp. 8024–8035.


[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.

[30] J. Hu, L. Shen, and G. Sun, "Squeeze-and-Excitation Networks," in Proceedings of CVPR, 2018, pp. 7132–7141.

[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of ICLR, 2015.

[32] T. G. Dietterich, "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms," Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.

[33] A. Shukla, K. Vougioukas, P. Ma, S. Petridis, and M. Pantic, "Visually guided self-supervised learning of speech representations," in Proceedings of ICASSP, 2020, pp. 6299–6303.

[34] G. He, X. Liu, F. Fan, and J. You, "Image2Audio: Facilitating Semi-supervised Audio Emotion Recognition with Facial Expression Image," in Proceedings of CVPR Workshops, 2020, pp. 912–913.

[35] C. Gulcehre, M. Moczulski, M. Denil, and Y. Bengio, "Noisy Activation Functions," in Proceedings of ICML, 2016, pp. 3059–3068.