
Network Representation Learning: Consolidation and Renewed Bearing

Saket Gurukar*1, Priyesh Vijayan*2,4, Aakash Srinivasan*2, Goonmeet Bajaj1, Chen Cai1, Moniba Keymanesh1, Saravana Kumar1, Pranav Maneriker1, Anasua Mitra3, Vedang Patel1, Balaraman Ravindran2,4, and Srinivasan Parthasarathy1

1Computer Science and Engineering, The Ohio State University

2Department of Computer Science and Engineering, IIT Madras

3Department of Computer Science and Engineering, IIT Guwahati

4Robert Bosch Centre for Data Sciences and AI, IIT Madras

ABSTRACT

Graphs are a natural abstraction for many problems where nodes represent entities and edges represent a relationship across entities. The abstraction can be explicit (e.g., transportation networks, social networks, foreign key relationships) or implicit (e.g., nearest neighbor problems). An important area of research that has emerged over the last decade is the use of graphs as a vehicle for non-linear dimensionality reduction in a manner akin to previous efforts based on manifold learning, with uses for downstream database processing (e.g., entity resolution, link prediction, outlier analysis), machine learning, and visualization. In this systematic yet comprehensive experimental survey, we benchmark several popular network representation learning methods operating on two key tasks: link prediction and node classification.

We examine the performance of 12 unsupervised embedding methods on 15 datasets. To the best of our knowledge, the scale of our study – both in terms of the number of methods and the number of datasets – is the largest to date. Our benchmarking study, as far as possible, uses the original code provided by the original authors.

Our results reveal several key insights about work-to-date in this space. First, we find that certain baseline methods (task-specific heuristics, as well as classic manifold methods) that have often been dismissed or are not considered by previous efforts can compete on certain types of datasets if they are tuned appropriately. Second, we find that recent methods based on matrix factorization offer a small but relatively consistent advantage over alternative methods (e.g., random-walk based methods) from a qualitative standpoint.

*Equal contribution. The remaining authors are listed in alphabetical order of their last names.

Specifically, we find that MNMF, a community preserving embedding method, is the most competitive method for the link prediction task, while NetMF is the most competitive baseline for node classification. Third, no single method completely outperforms the other embedding methods on both the node classification and link prediction tasks. We also present several drill-down analyses that reveal settings under which certain algorithms perform well (e.g., the role of neighborhood context on performance; dataset characteristics that influence performance) – guiding the end-user.

1. INTRODUCTION

Graphs are effective in multiple disparate domains for modeling, querying, and mining relational data. Examples abound, ranging from the use of nearest neighbor graphs in database systems [18, 63] and machine learning [9, 43] to the analysis of biological networks [24, 7], and from social network analysis [22, 62] to the analysis of transportation networks [13]. ML-enhanced data structures and algorithms, such as learned indexes [31], have recently shown promising results in database systems. An active area in ML research – network representation learning – has potential in multiple applications related to downstream database processing tasks such as outlier analysis [63, 34], entity resolution [15, 19], link prediction [35, 47], and visualization [17, 50]. However, a plethora of new network representation learning methods has been proposed recently [11, 25]. Given the wide range of methods proposed, it is often tough for a practitioner to determine or understand which of these methods they should consider adopting for a particular task on a particular dataset. Part of the challenge is the lack of a standard evaluation benchmark and a thorough, independent understanding of the strengths and weaknesses of each method for the particular task at hand. The challenges are daunting and can be summarized as follows:

Lack of Standard Assessment Protocol: First, there is no standard for evaluating the quality of generated embeddings. The efficacy of embedding methods is often evaluated based on downstream machine learning tasks. As a result, the superiority of one embedding method over another hinges on its performance in a downstream machine learning task. With the lack of a standard evaluation protocol for



these downstream tasks, the results reported by different research articles are often inconsistent. As a specific example, Node2vec [21] reports the node classification performance of DeepWalk on the Blogcatalog dataset – for multi-label classification with a train-test split of 50:50 – as 21.1% Macro-F1, whereas the DeepWalk paper [47] reports DeepWalk's performance as 27.3% Macro-F1.

Tuning Comparative Strawman: Second, a new method almost always compares its performance against a subset of other methods, and on a subset of the tasks and datasets previously evaluated. In many cases, while great care is taken to tune the new method (via careful hyper-parameter tuning), the same care is often not taken when evaluating baselines. For example, in our experiments on Blogcatalog, we find that with a train-test split of 50:50, the Laplacian Eigenmaps method [6] without grid search achieves a Macro-F1 score of 3.9% (similar to what was reported in [20, 21]). However, after tuning the hyper-parameters of logistic regression, we find that the Laplacian Eigenmaps method achieves a Macro-F1 of 29.2%. Importantly, while logistic regression is commonly used to evaluate the quality of node embeddings in such methods, a grid search over logistic regression parameters is rarely conducted or reported. Additionally, reported results are rarely averaged over multiple shuffles to reduce any bias or patterns in the training data^1. In short, a lack of consistency in evaluation inhibits our understanding of the scientific advances in this arena, discussed next.

Standard Benchmark: Third, there is no agreed-upon list of datasets that are used consistently in the literature. Each new embedding method is evaluated on selected datasets with a suitable node classification/link prediction setup. For instance, some methods report node classification performance for the baselines with a train-test split of 10:90, while others report the same with a train-test split of 50:50. As a result, the comparison across embedding methods is often unclear. Additionally, there are no clear guidelines on whether a proposed embedding methodology favors a certain type of dataset characteristic (e.g., sparsity).

Task Specific Baselines: Fourth, for many tasks such as node classification and link prediction, there is a rich pre-existing literature [8, 36] focused on such tasks (that does not explicitly rely on node embedding methodology as a preprocessing step). Few, if any, prior efforts in network representation learning consider such baselines – such methods typically compare performance on downstream ML tasks only against other node embedding methods. In our experiments, we find that a curated feature vector based on heuristics can achieve a similarly competitive AUROC score on many of the datasets for the link prediction task.

To summarize, there is a clear and pressing need for a comprehensive and careful benchmarking of such methods, which is the focus of this study. To address the aforementioned issues in the network embedding literature, we perform an experimental study of 12 promising network embedding methods on 15 diverse datasets. The selected embedding methods are unsupervised techniques for generating the node embeddings of a graph. Our goal is to perform a uniform, principled comparison of these methods on a variety of different datasets and across two key tasks – link prediction and node classification.

^1 This is our observation based on the evaluation scripts publicly shared by multiple authors.

Specific findings of our experimental survey that we wish to highlight are that:

1. For the link prediction task, we find that MNMF [59], a community preserving embedding method, offers a compelling advantage when it scales to a particular dataset. Other, more scalable alternatives such as Verse and LINE also perform well on most datasets. The heuristic approach we present for link prediction competes exceptionally well on all types of datasets surveyed for this task.

2. For the node classification task, NetMF [49], when it scales to the particular dataset, offers a small but consistent performance advantage. We find that for the node classification task, the task-specific heuristic methodology we compare with works well on datasets with fewer labels – in such scenarios, it competes well with a majority of the methods surveyed, whereas some recently proposed methods fare much worse.

3. We also drill down to study the impact of context embeddings on the link prediction and node classification tasks (and find some methods impervious to the use of context, while for others it helps significantly). We also examine two common ways in which link prediction strategies are evaluated (explicitly through a classifier, or implicitly through vector dot-product ranking). We find that there is a clear separation in performance between these alternative strategies.

2. NOTATIONS AND DEFINITIONS

We denote the input graph as G = (V, E), where V and E denote the set of nodes and edges of the graph G. The notation used in this work is listed in Table 1. In this study, we consider both directed and undirected graphs, along with weighted as well as unweighted graphs. We evaluate the embedding methods on non-attributed, homogeneous graphs.

Definition 2.1. Network Embedding: Given a graph G = (V, E) and an embedding dimension d, where d ≪ |V|, the goal of a network embedding method is to learn a d-dimensional representation of the graph G such that similarity in graph space approximates closeness in the d-dimensional space.

3. NETWORK EMBEDDING METHODS

In this section, we give a summary of the network embedding methods evaluated in our work. For each model, along with its description, we also provide additional experimental details for reproducibility.

Symbol    Meaning
G         Input graph
V         Nodes
E         Edges
n         Number of nodes, |V|
A         Adjacency matrix, A ∈ R^{n×n}
D         Degree matrix of the graph: D_{i,i} = Σ_j A_{i,j} and D_{i,j} = 0 for i ≠ j
I         Identity matrix
Φ(u)      Node embedding of node u
ψ(u)      Context embedding of node u
U, V      Node and context embedding matrices of size R^{n×d}
vol(G)    Sum of the weights of all edges
S         Graph similarity matrix
σ(x)      A non-linear function such as the sigmoid function
λ         Number of negative samples
P         P = D^{-1}A
L_rw      L_rw = I − D^{-1}A
L_sym     L_sym = I − D^{-1/2} A D^{-1/2}

Table 1: The notation symbols used in this study and the terms associated with those symbols.

1. Laplacian Eigenmaps [6]: Laplacian Eigenmaps generates a d-dimensional embedding of the graph using the smallest d eigenvectors of the Laplacian matrix L = D − A:

   \min_U \; \mathrm{trace}(U^T L U) \quad \text{subject to} \quad U^T D U = I

   where U ∈ R^{|V|×d} is the generated embedding matrix. The above objective reduces to a simple minimization of the L2 distance between adjacent nodes, \sum_{i,j} \|u(i) - u(j)\|^2 A_{ij}. Laplacian Eigenmaps thus levers first-order information for generating the embeddings.

Reproducibility notes: We search over the following hyper-parameters: Embedding dimension = [64, 128, 256]. On the datasets with >1M nodes, Laplacian Eigenmaps did not scale for embedding dimensions 128 and 256.
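To make the procedure concrete, here is a minimal sketch (ours, not the authors' code) of Laplacian Eigenmaps using SciPy; it assumes a connected, undirected graph with no zero-degree nodes:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def laplacian_eigenmaps(A, d=128):
    """Embed a graph with sparse adjacency matrix A into R^d by solving
    the generalized eigenproblem L u = lambda D u, with L = D - A."""
    A = sp.csr_matrix(A).astype(float)
    deg = np.asarray(A.sum(axis=1)).ravel()
    D = sp.diags(deg)
    L = D - A
    # d+1 smallest eigenpairs; the first is the trivial constant vector
    vals, vecs = eigsh(L, k=d + 1, M=D, which='SM')
    order = np.argsort(vals)
    return vecs[:, order[1:]]   # drop the trivial eigenvector
```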

2. DeepWalk [47]: DeepWalk is a random walk based network embedding method which uses truncated random walks, levering local information from the generated walks to learn similar latent representations for nearby nodes. DeepWalk draws inspiration from the Skip-gram model of Word2vec [39], treating random walks as sequences and optimizing the following objective function:

   \min_{\Phi} \; -\log \Pr\big(\{v_{i-w}, \ldots, v_{i+w}\} \setminus v_i \mid \Phi(v_i)\big) \quad (1)

   where v_i is the target node and \{v_{i-w}, \ldots, v_{i+w}\} \setminus v_i are the context nodes. \Phi(v_i) denotes the embedding of node v_i. Since the objective function is expensive to compute for large graphs, it is approximated using hierarchical softmax [40].

Reproducibility notes: We search over the following hyper-parameters: Walk length = [5, 20, 40], Number of walks = [20, 40, 80], Window size = [2, 4, 10], Embedding dimension = [64, 128, 256]. In the case of directed graphs, we observe lower performance on the node classification and link prediction tasks; in order to have a fair comparison with other methods, we treat directed graphs as undirected for DeepWalk.
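A minimal DeepWalk sketch (ours, assuming gensim ≥ 4 and NetworkX; not the authors' release) makes the walk-then-skip-gram pipeline explicit:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def deepwalk(G, num_walks=80, walk_length=40, window=10, dim=128):
    """Truncated random walks fed to a skip-gram model with
    hierarchical softmax (hs=1), as in DeepWalk."""
    walks = []
    nodes = list(G.nodes())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(random.choice(nbrs))
            walks.append([str(v) for v in walk])
    model = Word2Vec(walks, vector_size=dim, window=window,
                     min_count=0, sg=1, hs=1, workers=4)
    return {v: model.wv[str(v)] for v in G.nodes()}
```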

3. Node2Vec [21]: Node2Vec is a biased random walk based network embedding method which allows the random walk to be more flexible in exploring graph neighborhoods. The flexibility of the random walk is achieved by interpolating between breadth-first and depth-first traversal. The objective function is again based on the Skip-gram model of Word2vec [39], and since it is expensive to compute for large graphs, it is approximated by negative sampling [39].

Reproducibility notes: We search over the following hyper-parameters: Walk length = [10, 20, 40], Number of walks = 80, Window size = 10, p and q = [0.25, 1, 2, 4], Embedding dimension = [64, 128, 256]. In the case of directed graphs, we observe lower performance on the node classification and link prediction tasks; in order to have a fair comparison with other methods, we treat directed graphs as undirected for Node2Vec.
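The p/q bias can be sketched as a single walk step; a full implementation precomputes alias tables for efficiency, which this sketch of ours omits:

```python
import random

def node2vec_step(G, prev, cur, p=1.0, q=2.0):
    """Sample the next node of a node2vec walk. Unnormalized weights:
    1/p to return to `prev`, 1 for neighbors of `prev` (BFS-like),
    1/q to move further away (DFS-like)."""
    nbrs = list(G.neighbors(cur))
    weights = []
    for nxt in nbrs:
        if nxt == prev:
            weights.append(1.0 / p)
        elif G.has_edge(nxt, prev):
            weights.append(1.0)
        else:
            weights.append(1.0 / q)
    return random.choices(nbrs, weights=weights, k=1)[0]
```

The sampled walks are then fed to the same skip-gram machinery as DeepWalk, with negative sampling in place of hierarchical softmax.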

4. GraRep [12]: GraRep is a matrix factorization based network embedding method which captures global structural information of the graph while learning node embeddings. The authors observe that existing Skip-gram based models project all k-step relational information into a common subspace, and argue the importance of preserving different k-step relational information in separate subspaces. The loss function to preserve the k-step relationship between nodes u and v is:

   L_k(u, v) = A^k_{u,v} \cdot \log \sigma\big(\Phi(u) \cdot \Phi(v)\big) + \frac{\lambda}{|V|} \sum_{v' \in V, v' \neq v} A^k_{u,v'} \cdot \log \sigma\big(-\Phi(u) \cdot \Phi(v')\big) \quad (2)

   where v' refers to a negative node at the k-th step for node u (see Table 1 for additional notation, e.g., λ). The above loss function in closed form yields a log-transformed, probabilistic adjacency matrix, which is factorized with SVD to generate each k-step representation. The final node representation is the concatenation of all the k-step representations.

Reproducibility notes: We search over the following hyper-parameters: k from 1 to 6, Embedding dimension = [64, 128, 256]. On the datasets with >2M edges, due to scalability issues, we searched over k from 1 to 2 and Embedding dimension = [64, 128].
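A dense sketch of GraRep for small graphs (ours; the column normalization and truncation of negative entries follow the paper's closed form, and the exact constants are our reading of Eq. (2)):

```python
import numpy as np

def grarep(A, K=4, d=128, lam=1.0):
    """For each step k: build the column-normalized, log-shifted k-step
    transition matrix, truncate negatives to 0, factorize with SVD,
    and concatenate the K per-step representations (d divisible by K)."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)     # 1-step transition matrix
    Pk, reps = np.eye(n), []
    for _ in range(K):
        Pk = Pk @ P
        X = np.log(np.maximum(Pk / Pk.sum(axis=0, keepdims=True), 1e-12)) \
            - np.log(lam / n)
        X = np.maximum(X, 0)                 # keep only positive entries
        U, S, _ = np.linalg.svd(X, full_matrices=False)
        reps.append(U[:, :d // K] * np.sqrt(S[:d // K]))
    return np.hstack(reps)
```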

5. NetMF [49]: NetMF is a matrix factorization based network embedding method. NetMF presents theoretical proofs of the claim that Skip-gram models with negative sampling implicitly approximate and factorize matrices constructed from graph Laplacians. The matrix factorized by NetMF for a small context window T is given by (see Table 1 for notation):

   \log\left( \mathrm{vol}(G) \left( \frac{1}{T} \sum_{r=1}^{T} (D^{-1}A)^r \right) D^{-1} \right) - \log b \quad (3)

   where vol(G) refers to the sum of all edge weights and b corresponds to the number of negative samples in the Skip-gram model. NetMF factorizes this closed-form DeepWalk matrix with SVD in order to generate the node embeddings, and provides two algorithms: one for small context windows and one for large context windows.

Reproducibility notes: We search over the following hyper-parameters: T = [1, 10], Negative samples λ = [1, 2, 3], Rank H for the large context window = [128, 256, 512], Embedding dimension = [64, 128, 256].
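A dense sketch of the small-window variant (ours, directly following Eq. (3), with the element-wise truncated logarithm log max(M, 1) used by NetMF):

```python
import numpy as np
import scipy.sparse as sp

def netmf_small(A, T=10, b=1, d=128):
    """Build the closed-form DeepWalk matrix of Eq. (3), apply the
    element-wise truncated log, and factorize it with SVD."""
    A = sp.csr_matrix(A).astype(float)
    vol = A.sum()
    deg = np.asarray(A.sum(axis=1)).ravel()
    P = (sp.diags(1.0 / deg) @ A).toarray()      # D^{-1} A
    S, Pr = np.zeros_like(P), np.eye(P.shape[0])
    for _ in range(T):
        Pr = Pr @ P
        S += Pr
    M = (vol / (b * T)) * S @ np.diag(1.0 / deg)
    M = np.log(np.maximum(M, 1.0))               # truncated logarithm
    U, Sig, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * np.sqrt(Sig[:d])
```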


6. M-NMF [59]: M-NMF is a matrix factorization based network embedding method which generates node embeddings that preserve microscopic information in the form of first-order and second-order proximities among nodes, while also preserving mesoscopic information in the form of community structure. The objective function of M-NMF is:

   O = \min_{U, V, C, H \ge 0} \; \|S - VU^T\|_F^2 + \alpha \|H - UC^T\|_F^2 - \beta \, \mathrm{tr}(H^T B H) + \zeta \|HH^T - I\|_F^2 \quad (4)

   where H is the binary community membership matrix, C holds the latent representations of the communities, and B is the modularity matrix obtained from the adjacency matrix A (see Table 1 for the rest of the notation). Overall, M-NMF discovers communities through modularity constraints. The node embeddings generated with the help of microscopic information and the community embeddings are then jointly optimized by assuming a consensus relationship between node and community embeddings.

Reproducibility notes: We search over the following hyper-parameters: α = [0.1, 1.0, 10.0], β = [0.1, 1.0, 10.0], Embedding dimension = [64, 128, 256].

7. HOPE [44]: HOPE is a matrix factorization based network embedding method which generates node embeddings that preserve the asymmetric transitivity of nodes in directed graphs. If there exist directed edges from node u to w and from w to v, then – by the asymmetric transitivity property – an edge from u to v is more likely to form than an edge from v to u. The objective function of HOPE is:

   \min \; \|S - U^s (U^t)^T\|_F^2 \quad (5)

   where U^s and U^t are the source and target embeddings. In order to preserve the asymmetric transitivity of nodes, the proximity matrix S is constructed using a similarity metric which respects the directionality of edges. The node embeddings are generated by factorizing the proximity matrix with generalized SVD [46].

Reproducibility notes: We search over the following hyper-parameters: Decay parameter β = 0.5/α, where α is the spectral radius of the graph; Embedding dimension = [64, 128, 256].
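A dense sketch with Katz proximity (ours; HOPE itself uses a generalized SVD that never materializes S, which this sketch does for clarity):

```python
import numpy as np

def hope_katz(A, d=128):
    """Factorize the Katz proximity matrix S = (I - beta*A)^{-1} (beta*A)
    into source and target embeddings via truncated SVD."""
    n = A.shape[0]
    beta = 0.5 / np.max(np.abs(np.linalg.eigvals(A)))  # 0.5 / spectral radius
    S = np.linalg.inv(np.eye(n) - beta * A) @ (beta * A)
    U, Sig, Vt = np.linalg.svd(S, full_matrices=False)
    Us = U[:, :d] * np.sqrt(Sig[:d])     # source embeddings
    Ut = Vt[:d].T * np.sqrt(Sig[:d])     # target embeddings
    return Us, Ut
```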

8. LINE [54]: LINE is an optimization-based network embedding method which optimizes an objective function that preserves both first- and second-order proximity among nodes in the embedding space. The objective function for first-order proximity is given as:

   O_1 = -\sum_{(u,v) \in E} A_{u,v} \log \sigma\big(\Phi(u) \cdot \Phi(v)\big) \quad (6)

   The objective function to preserve the second-order proximity is given as:

   O_2 = -\sum_{(u,v) \in E} A_{u,v} \log \frac{\exp(\Phi(u) \cdot \psi(v))}{\sum_{v' \in V} \exp(\Phi(u) \cdot \psi(v'))} \quad (7)

   where ψ(u) represents the context embedding of node u (see Table 1 for the rest of the notation). The first-order proximity corresponds to local proximity between nodes based on the presence of edges in the graph, while the second-order proximity corresponds to global proximity between nodes based on the shared neighborhoods of those nodes. Since the objective function is expensive to compute for large graphs, it is approximated by negative sampling [39].

Reproducibility notes: We search over the following hyper-parameters: Number of samples = 10 billion, Embedding dimension = [64, 128, 256]. In the case of directed graphs, as suggested by the authors of LINE, we evaluate only second-order proximity.

9. Verse [55]: Verse is an optimization-based network embedding method which minimizes the Kullback-Leibler (KL) divergence from a given similarity distribution in graph space to the similarity distribution in embedding space (E). The objective function is given as follows:

   \sum_{v \in V} \mathrm{KL}\big( \mathrm{sim}_G(v, \cdot) \,\|\, \mathrm{sim}_E(v, \cdot) \big) \quad (8)

   The similarity distribution in graph space can be constructed with the help of Personalized PageRank [45], SimRank [27], or the adjacency matrix [55]. Since the objective function is expensive to compute for large graphs, it is approximated by Noise Contrastive Estimation [23].

Reproducibility notes: We search over the following hyper-parameters: PageRank damping factor α = [0.7, 0.85, 0.9], Negative samples = [3, 10], Embedding dimension = [64, 128, 256].

10. SDNE [58]: SDNE is a deep autoencoder based network embedding method which optimizes an objective function that preserves both first- and second-order proximity among nodes in the embedding space. The objective function of SDNE is given below:

   L_{joint} = \alpha L_{1st} + L_{2nd} + \nu L_{reg} \quad (9)

   where L_{1st} and L_{2nd} are loss functions that preserve the first-order and second-order proximities respectively, while L_{reg} is a regularization term. The authors propose a semi-supervised deep model to minimize this objective. The model consists of two components: supervised and unsupervised. The supervised component attempts to preserve the first-order proximity, while the unsupervised component attempts to preserve the second-order proximity by minimizing the reconstruction loss of nodes.

Reproducibility notes: We search over the following hyper-parameters: α = [1e-05, 0.2], Penalty coefficient β = [5, 10], Embedding dimension = [64, 128, 256].

11. VAG [28]: VAG is a graph autoencoder based network embedding method which minimizes the reconstruction loss of the adjacency matrix. The reconstruction matrix is generated as Â = σ(ZZ^T), where Z is the matrix of node embeddings generated with Graph Convolutional Networks (GCN) [29] as Z = GCN(X, A), with X the node feature matrix (see Table 1 for additional notation). In the case of unattributed graphs, the node feature matrix is the identity matrix.


Reproducibility notes: We search over the following hyper-parameters: Epochs = [50, 100], Embedding dimension = [64, 128, 256].

12. Watch Your Step [2]: Watch Your Step (WYS) addresses the hyper-parameter sensitivity of random walk based embedding methods. WYS resolves this sensitivity with an attention mechanism on the expected random walk matrix. The attention mechanism guides the random walk to focus on the short- or long-term dependencies pertinent to the input graph. The objective function of WYS is given as:

   \min_{L,R,q} \; \beta \|q\|_2^2 - \big\| \mathbb{E}[D;q] \circ \log(\sigma(LR^T)) - \mathbf{1}[A=0] \circ \log(1 - \sigma(LR^T)) \big\|_1 \quad (10)

   where q is the attention parameter vector, L and R are the node embeddings, and \mathbb{E}[D;q] is the expectation over the random walk matrix (see Table 1 for the rest of the notation).

Reproducibility notes: We search over the following hyper-parameters: Learning rate = [0.05, 0.1, 0.2, 0.5, 1.0], Number of hops = 5, Embedding dimension = [64, 128, 256].

4. DATASETS

We select datasets from multiple domains; Table 2 describes their empirical properties. The selected datasets support both multi-label and multi-class classification. Directed as well as undirected datasets were selected in order to evaluate the embedding methods on the link prediction task effectively. Further, datasets with and without edge weights are also included, thereby providing us with a comprehensive set of possibilities for evaluating the methods. We summarize the datasets below:

• Web: The WebKB datasets^2 [14] consist of classified webpages (nodes) and hyperlinks between them (edges). Here, labels are the categories of the webpages.

• Medical: The PPI dataset [10] represents a subgraph of protein interactions in humans. Labels represent biological states corresponding to hallmark gene sets.

• Natural Language: The Wikipedia dataset [38] is a dump of Wikipedia with nodes as words, edges corresponding to the co-occurrence matrix, and labels corresponding to Part-of-Speech (POS) tags.

• Social: The Blogcatalog and Flickr datasets [60] represent social networks; both represent bloggers and their friendships. The YouTube dataset [21] represents users and their friendships. Labels for Blogcatalog, Flickr, and YouTube correspond to the groups to which each user belongs. The Epinions dataset [51] represents user-annotated trust relationships, where users annotate which other users they trust; these are used to determine the reviews shown to a user.

• Citation: The DBLP, CoCit, and Pubmed datasets represent citation networks. DBLP (Co-Author) represents a subset of papers in DBLP^3 from closely related fields. CoCit (Microsoft) [1] corresponds to a co-citation subgraph of the Microsoft Academic Graph. Finally, Pubmed corresponds to a subset of diabetes-related publications on Pubmed^4. Labels in DBLP correspond to the sub-field of the paper; in CoCit, they correspond to the conference of the paper; and in Pubmed, to the types of diabetes.

Dataset              #Nodes     #Edges     #Labels  C/L  D  W
WebKB (Texas)        186        464        4        C    F  T
WebKB (Cornell)      195        478        5        C    F  T
WebKB (Washington)   230        596        5        C    F  T
WebKB (Wisconsin)    265        724        5        C    F  T
PPI                  3,890      38,739     50       L    F  F
Wikipedia            4,777      92,517     40       L    F  T
Blogcatalog          10,312     333,983    39       L    F  F
DBLP (Co-Author)     18,721     122,245    3        C    F  T
CoCit (Microsoft)    44,034     195,361    15       C    F  F
Wiki-Vote            7,115      103,689    -        -    T  F
Pubmed               19,717     44,338     3        C    T  F
p2p-Gnutella         62,586     147,892    -        -    T  F
Flickr               80,513     5,899,882  195      L    F  F
Epinions             75,879     508,837    -        -    T  F
YouTube              1,134,890  2,987,624  47       C    F  F

Table 2: Dataset properties. “C” refers to multi-class and “L” to multi-label classification; “D” refers to directed and “W” to weighted.

• Digital: The p2p-Gnutella dataset [52] represents connections between hosts on a peer-to-peer file sharing network. This dataset has no node labels.

• Voting: The Wiki-Vote dataset [32] is constructed from voting data in multiple elections for Wikipedia administratorship. Users are nodes, and a (directed) edge (i, j) represents a vote from user i to user j. This dataset also has no node labels.

^2 http://linqs.cs.umd.edu/projects/projects/lbc/
^3 https://dblp.uni-trier.de/
^4 https://www.ncbi.nlm.nih.gov/pubmed/

5. EXPERIMENTAL SETUP

In this section, we elaborate on the experimental setup for the link prediction and node classification tasks employed to evaluate the quality of the embeddings generated by the different methods. We present two heuristic baselines, one for each task, and define the metrics used for comparing the embedding methods.

5.1 Link Prediction

Prediction of ties is an essential task in multiple domains where relational information is costly to obtain, such as drug-target interactions [16] and protein-protein interactions [4], or where the environment is partially observable. The problem of predicting a tie/link between two nodes i and j is often evaluated in one of two ways. The first is to treat the problem as a binary classification problem. The second is to use the dot product on the embedding space as a scoring function to evaluate the strength of the tie.

The edge features for binary classification consist of the node embeddings of nodes i and j, aggregated with a binary function. In our study, we experimented with three binary functions on node embeddings: Concatenation, Hadamard, and L2 distance. We used logistic regression as our base classifier for predicting the link. The parameters of the logistic regression are tuned using GridSearchCV with 5-fold cross validation and 'roc_auc' as the scoring metric. We evaluate link prediction performance with two metrics: Area Under the Receiver Operating Characteristic curve (AUROC) and Area Under the Precision-Recall curve (AUPR). An alternative evaluation strategy is to predict the presence of link (i, j) based on the dot product of the node embeddings of nodes i and j. We study the impact of both evaluation strategies in Section 6.1.
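The two strategies and the three binary functions can be sketched as follows; reading the L2 operator as the element-wise squared difference is our assumption:

```python
import numpy as np

def edge_features(phi_i, phi_j, op="hadamard"):
    """Aggregate two node embeddings into one edge feature vector
    for the binary classifier."""
    if op == "hadamard":
        return phi_i * phi_j
    if op == "concat":
        return np.concatenate([phi_i, phi_j])
    if op == "l2":
        return (phi_i - phi_j) ** 2
    raise ValueError(op)

def dot_score(phi_i, phi_j):
    """Alternative strategy: score the candidate edge directly."""
    return float(phi_i @ phi_j)
```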

Construction of the train and test sets: The method of constructing the train and test sets for the link prediction task is crucial for comparing embedding methods. The train and test splits consist of 80% and 20% of the edges, respectively, and are constructed in the following order:

1. Self-loops are removed.

2. We randomly select 20% of all edges as positive test edges and add them to the test set.

3. Positive test edges are removed from the graph. We find the largest weakly connected component formed by the non-removed edges. The edges of this connected component form the positive train edges.

4. We sample negative edges from the largest weakly connected component and add them to both the training set and the test set. The number of negative edges is equal to the number of positive edges in both the training and test sets.

5. For directed graphs, we form “directed negative test edges” which satisfy the following constraint: (j, i) ∉ E but (i, j) ∈ E, where E refers to the edges of the largest weakly connected component. We add these directed negative test edges (j, i) to our test set. The number of directed negative test edges is around 10% of the negative test edges in the test set.

6. Nodes present in the test set, but not present in the training set, are deleted from the test set.

In the case of large datasets (>5M edges), we reduce our training set: we consider 10% of randomly selected positive and negative train edges for learning the binary classifier. The learned model is evaluated on the test set. The above steps are repeated for 5 folds of 80:20 train:test splits, and we report the average AUROC and AUPR scores across the 5 folds. A sketch of this split procedure is given below.
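A minimal sketch of the procedure with NetworkX (ours; it covers steps 1-4 and 6, omitting the directed-negative-edge step 5 and the large-dataset subsampling, and checks candidate negatives against the training graph only):

```python
import random
import networkx as nx

def link_prediction_split(G, test_frac=0.2, seed=0):
    """Undirected train/test edge split as described above."""
    rng = random.Random(seed)
    G = nx.Graph(G)
    G.remove_edges_from(list(nx.selfloop_edges(G)))        # step 1
    edges = list(G.edges())
    rng.shuffle(edges)
    n_test = int(test_frac * len(edges))
    pos_test = edges[:n_test]                              # step 2
    G.remove_edges_from(pos_test)                          # step 3
    cc = max(nx.connected_components(G), key=len)
    pos_train = [(u, v) for u, v in G.edges() if u in cc and v in cc]
    nodes = list(cc)

    def negatives(k):                                      # step 4
        neg = set()
        while len(neg) < k:
            u, v = rng.sample(nodes, 2)
            if not G.has_edge(u, v):
                neg.add((u, v))
        return list(neg)

    neg_train, neg_test = negatives(len(pos_train)), negatives(n_test)
    pos_test = [(u, v) for u, v in pos_test if u in cc and v in cc]  # step 6
    return pos_train, neg_train, pos_test, neg_test
```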

5.2 Node classification

In the network embedding literature, node classification is the most popular way of comparing the quality of embeddings generated by different embedding methods. The generated node embeddings are treated as node features, and node labels are treated as ground truth. The classification task performed in our experiments is either multi-label or multi-class classification; the task performed on each dataset is listed in Table 2. We select logistic regression as our classifier. The hyper-parameters of the logistic regression are tuned using GridSearchCV with 5-fold cross validation and 'f1_micro' as the scoring metric. We split the dataset with a 50:50 train-test split. The learned model is evaluated on the test set, and we report results averaged over 10 shuffles of the train-test sets. The model does not have access to test instances while training.

We note that a majority of the efforts in the literature do not tune the hyper-parameters of logistic regression. The default hyper-parameters are not always the best hyper-parameters for logistic regression. For instance, with the default hyper-parameters of the LR classifier, the Macro-F1 performance of Laplacian Eigenmaps on the Blogcatalog dataset is 3.9% for a train-test split of 50:50; tuning the hyper-parameters significantly improves the Macro-F1 score, to 29.2%. A sketch of this tuning step follows.
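The tuning step itself is a one-liner with scikit-learn; the data and the C grid below are hypothetical stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X = np.random.randn(200, 128)               # stand-in node embeddings
y = np.random.randint(0, 3, size=200)       # stand-in labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.5,
                                          random_state=0)

# 5-fold grid search over the regularization strength, scored by f1_micro
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
                    cv=5, scoring="f1_micro")
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```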

The choice of a “linear” classifier to evaluate the quality of embeddings is not a hard constraint in the node classification task. In this work, we also test the idea of leveraging a “non-linear” classifier for node classification, using the EigenPro [37] classifier for the same task. On large datasets, EigenPro provides a significant performance boost over state-of-the-art kernel methods, with faster convergence rates [37]. In our experiments, we see a benefit to this approach of up to 15% improvement in Micro-F1 scores with the non-linear classifier compared to the linear classifier.

5.3 Heuristics

Next, we present heuristic baselines for both the link prediction and node classification tasks. The purpose of defining heuristic baselines is to assess the difficulty of performing a particular task on a particular dataset, and to compare the performance of sophisticated network embedding methods against simple heuristics.

5.3.1 Link Prediction Heuristics

In the link prediction literature, there exist multiple similarity-based metrics [36] which can predict a score for link formation between two nodes; examples include the Jaccard Index [26, 57] and Adamic Adar [3]. These similarity-based metrics often base their predictions on the neighborhood overlap between the nodes. We combine the similarity-based metrics to form a curated feature vector of an edge [53]; the binary classifier in the link prediction task is then trained on the generated edge features. Our selected similarity-based metrics are Common Neighbors (CN), Adamic Adar (AA) [3], Jaccard Index (JA) [26], Resource Allocation Index (RA) [64], and Preferential Attachment Index (PA) [5]. The metrics CN, JA, and PA capture first-order proximity between nodes, while AA and RA capture second-order proximity. We found this heuristic-based model to be highly competitive with the embedding methods on multiple datasets; a sketch follows.
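A sketch of the curated edge feature vector using NetworkX's built-in indices (undirected graph assumed):

```python
import networkx as nx

def heuristic_edge_features(G, u, v):
    """Five similarity scores for a candidate edge (u, v)."""
    cn = len(list(nx.common_neighbors(G, u, v)))             # CN
    aa = next(nx.adamic_adar_index(G, [(u, v)]))[2]          # AA
    ja = next(nx.jaccard_coefficient(G, [(u, v)]))[2]        # JA
    ra = next(nx.resource_allocation_index(G, [(u, v)]))[2]  # RA
    pa = next(nx.preferential_attachment(G, [(u, v)]))[2]    # PA
    return [cn, aa, ja, ra, pa]
```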

5.3.2 Node Classification Heuristics

Nodes in the graph can be characterized/represented by their properties. We combine the node properties to form a feature vector/embedding of a node; the classifier in the node classification task is then trained on the generated node features. The node properties capture information such as a node's neighborhood, its influence on other nodes, and its structural role. We select the following node properties: Degree, PageRank [45], Clustering Coefficient, Hub and Authority scores [30], Average Neighbor Degree, and Eccentricity [42]. We treat the graph as undirected while computing the node properties. As the magnitude of each node property varies from the others, we perform column-wise normalization with the RobustScaler available in Scikit-learn (a sketch follows). We will show in Section 6.2 that this node classification heuristics baseline is competitive with most of the embedding methods on datasets with fewer labels.
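A sketch of the node feature construction (ours; eccentricity assumes a connected graph):

```python
import numpy as np
import networkx as nx
from sklearn.preprocessing import RobustScaler

def heuristic_node_features(G):
    """Stack the selected node properties column-wise and normalize
    each column with RobustScaler."""
    G = nx.Graph(G)                        # treat the graph as undirected
    nodes = list(G.nodes())
    pr = nx.pagerank(G)
    cc = nx.clustering(G)
    hubs, auth = nx.hits(G)
    avg_nbr = nx.average_neighbor_degree(G)
    ecc = nx.eccentricity(G)               # requires a connected graph
    X = np.array([[G.degree(v), pr[v], cc[v], hubs[v], auth[v],
                   avg_nbr[v], ecc[v]] for v in nodes])
    return RobustScaler().fit_transform(X)
```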

5.4 Comparison Measures

In this section, we present two measures for comparing the performance of embedding methods on the downstream machine learning tasks.


Mean Rank: We compute the rank of each embedding method on each dataset based on the selected performance metric, and report the average rank of an embedding method across all datasets as its Mean Rank. Let R_{e,d} be the rank of embedding method e on dataset d, with D being the set of datasets; then the Mean Rank of embedding method e is given by

MR_e = \frac{\sum_{d \in D} R_{e,d}}{|D|} \quad (11)

Mean Penalty [56]: We define the penalty of an embedding method e on a dataset d as the difference between the best score achieved by any embedding method on dataset d and the score achieved by method e on the same dataset. The score is the selected performance metric for a particular downstream ML task. Let E be the set of embedding methods and S_{e,d} be the score achieved by embedding method e on dataset d; then the Mean Penalty is given by

MP_e = \frac{\sum_{d \in D} \left( \max_{e' \in E} S_{e',d} - S_{e,d} \right)}{|D|} \quad (12)

For a model, lower values of Mean Rank and Mean Penalty suggest better performance. We compare the embedding methods with the Mean Rank and Mean Penalty measures on the datasets where all the embedding methods complete execution. Though these measures do not account for dataset size or missing values, they are simple and intuitive. A sketch of both computations follows.
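Both measures are easy to compute from a methods-by-datasets score matrix; a minimal sketch (ties broken arbitrarily):

```python
import numpy as np

def mean_rank_and_penalty(scores):
    """scores: (n_methods x n_datasets) array of a performance metric.
    Returns Mean Rank (Eq. 11) and Mean Penalty (Eq. 12) per method."""
    ranks = (-scores).argsort(axis=0).argsort(axis=0) + 1  # rank 1 = best
    mean_rank = ranks.mean(axis=1)
    mean_penalty = (scores.max(axis=0) - scores).mean(axis=1)
    return mean_rank, mean_penalty

# e.g., 3 methods on 2 datasets
s = np.array([[0.90, 0.80], [0.70, 0.85], [0.60, 0.60]])
print(mean_rank_and_penalty(s))
```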

6. EXPERIMENTAL RESULTS

In this section, we report the performance of the network embedding methods on the link prediction and node classification tasks. We tune both the parameters of the embedding methods and the parameters of the classifiers in the link prediction and node classification tasks. Whenever possible, we rely on the authors' own implementation of each embedding method. All the methods which do not complete execution on large datasets are executed on a modern machine with 500 GB RAM and 28 cores. All the evaluation scripts are executed in the same virtual Python environment.^5

6.1 Link Prediction

The link prediction performance of the 12 embedding methods, measured in terms of AUROC and AUPR on 15 datasets, is shown in Figure 1 and Figure 2. The overall (or aggregate) performance of an embedding method on all the datasets is also shown at the end of its horizontal bar in Figure 1 and Figure 2; we represent the overall score as the sum of the scores (AUROC or AUPR) of the method on all the datasets. The Mean Rank and Mean Penalty of the embedding methods – on the datasets for which all methods run to completion on our system – are shown in Figure 3. We also provide tabulated results in Tables 3 and 4. As mentioned in Section 3, we tune the hyper-parameters of each embedding method and report the best average AUROC and AUPR scores across 5 folds. In the case of the WebKB datasets, we evaluate the methods on embedding dimensions 64 and 128. We perform the link prediction task with both normalized and unnormalized embeddings and report the best performance.

^5 The evaluation scripts and datasets are available at https://github.com/PriyeshV/NRL_Benchmark.

We make the following observations:

• Effectiveness of MNMF for Link Prediction: We observe that MNMF achieves the highest overall link prediction performance in terms of best average AUROC and AUPR scores as compared to the other methods. The competitive performance of MNMF on link prediction can be credited to the community information imbibed into the node embedding generation. The Mean Rank and Mean Penalty are lowest for MNMF, which also suggests MNMF as a competitive baseline for link prediction. MNMF achieves the first rank on 7 out of 15 datasets. The small value of Mean Penalty suggests that even when MNMF is not the top-ranked method for a particular dataset, its performance is closest to that of the top-ranked method on that dataset. However, MNMF does not completely outperform the other methods on all datasets. For instance, on the Wiki-Vote and Pubmed datasets, WYS achieves the best average AUROC scores, while on the Microsoft dataset, GraRep achieves the best average AUROC score. In Figure 1b and Figure 2b, we see that among the more scalable methods, LINE achieves the highest overall link prediction performance, followed by DeepWalk, in terms of both AUROC and AUPR scores. Note that MNMF did not scale to the datasets with ≥5M edges on a modern machine with 500 GB RAM and 28 cores. However, the scalability issue of non-negative matrix factorization based methods can be addressed by adopting modern ideas [41, 33] (outside the scope of this study).

• Performance of Heuristic Baseline: We observe that the link prediction heuristics baseline – described in Section 5.3.1 – is both efficient and effective. Its overall performance is better than that of Laplacian Eigenmaps and SDNE, and competitive with that of Node2vec, HOPE, Verse, and LINE. The Mean Penalty of the link prediction heuristics is also close to that of the other embedding methods. On the largest dataset, YouTube, the link prediction heuristics achieve an AUROC of 96.2%, which is close to the best-performing Verse with an AUROC of 97.6%. Compared to the most competitive baseline, MNMF, the heuristics baseline outperforms MNMF on the Wikipedia and Blogcatalog datasets. We also observe that the heuristics baseline is competitive against several methods on the directed datasets too, even though the chosen similarity-based metrics treat the underlying graph as undirected.

Feature study on the Heuristic Baseline: We study the importance of each individual feature in the heuristics by analyzing the impact of its removal on link prediction. The results are reported in Figure 4. The blue line on top of the columns in Figure 4 corresponds to the AUROC scores achieved with the full link prediction heuristic. We see that the removal of the preferential attachment (PA) feature results in a consistent drop in AUROC scores; this drop is statistically significant at a significance level of 0.05 with a paired t-test. The removal of the rest of the features does not result in a significant drop in downstream performance.


[Figure 1: Link prediction performance measured with AUROC. (a) Smaller datasets: all methods complete execution. (b) Larger datasets: not all methods complete execution. Figure 1a shows the best average AUROC of all methods on datasets where all methods' results are available; not all methods scale to the datasets shown in Figure 1b. If a method runs out of memory or faults on a particular dataset, its performance is shown as 0.0.]

[Figure 2: Link prediction performance measured with AUPR. (a) Smaller datasets: all methods complete execution. (b) Larger datasets: not all methods complete execution. Figure 2a shows the best average AUPR of all methods on datasets where all methods' results are available; not all methods complete execution on the datasets shown in Figure 2b. If a method runs out of memory or faults on a particular dataset, its performance is shown as 0.0.]

[Figure 3: Mean Rank and Mean Penalty for link prediction, with AUROC as the selected performance metric.]

[Figure 4: Feature study on the link prediction heuristics. For each dataset, bars show the AUROC achieved after removing one similarity feature; the line on top shows the AUROC achieved with all features.]

• Impact of Evaluation Strategy: As described in Section 5.1, the presence of a link between two nodes can be predicted with either the logistic regression classifier (treating the embeddings as features) or the dot product between the node embeddings. We compare the performance of both evaluation strategies on each embedding method over all datasets using the differences in the average AUROC scores. A positive difference implies link

prediction performance with the classifier is better than that of the dot product. The results are presented as a box-plot in Figure 5. A paired t-test suggests the positive difference is statistically significant for all methods, except Verse and WYS, at a significance level of 0.05. Hence, using a classifier rather than the dot product provides a significant predictive performance gain on the task of link prediction.

[Figure 5: Box-plot of the distribution of differences between the AUROC score computed with a classifier and the AUROC score computed with the dot product, over all datasets. The difference is statistically significant (paired t-test) for all methods, except Verse and WYS, at significance level 0.05.]

[Figure 6: Heatmap showing which binary function – Hadamard, Concatenation, or L2 – resulted in the best average AUROC score for each embedding method on each dataset.]

We also investigate the changes in the ranking of embedding methods based on overall average AUROC scores when predictions are performed with the classifier rather than the dot product. The methods were ranked based on overall average AUROC scores, considering only those datasets on which all methods complete execution. We observe that the rank of NetMF in the ranking generated with the dot product was 10, while its rank in the ranking generated with the classifier improved to 3. Since the best link prediction performance for the majority of the embedding methods was achieved with the classifier, we believe the superiority of embedding methods on the link prediction task should be asserted by leveraging the classifier.

As mentioned in Section 5.1, we lever three binary functions – Hadamard, Concatenation, and L2 – to generate the edge embedding. In Figure 6, we present which binary function achieved the best average AUROC score for each embedding method on each dataset. We see that the Hadamard function achieves the best average AUROC score most often. However, there is no single winner in terms of the choice of binary function.

[Figure 7: Box-plot of the distribution of differences between the AUROC score computed with node + context embeddings and the AUROC score computed with only node embeddings, on directed datasets. The difference is statistically significant (paired t-test) for LINE, WYS, GraRep, and MNMF at significance level 0.05.]

• Impact of context embeddings: We study the impact of context embeddings on directed datasets for the link prediction task. We consider only those embedding methods which generate both node and context embeddings. We compare the impact of using node + context embeddings over using only node embeddings with the help of differences in AUROC scores; the results are detailed in Figure 7. A positive difference implies that the use of context embeddings helps link prediction. We see that levering node + context embeddings improves the link prediction performance of LINE, HOPE, and WYS. For MNMF, the use of context embeddings does not improve link prediction performance, as in MNMF the community information – crucial for link prediction – is already incorporated into the node embeddings. In the case of GraRep, we find that the node embeddings encapsulate high-order information and, hence, levering context does not help improve performance. We find that the results of DeepWalk and Node2Vec on directed datasets are significantly lower, so in order to have a fair comparison with other embedding methods, we treated the directed datasets as undirected for DeepWalk and Node2Vec. The median of the box plot for DeepWalk and Node2Vec is close to zero due to this treatment.

• Robustness of embedding methods: In link prediction, we compute the average AUROC score and the average AUROC standard error of an embedding method over 5 folds of a selected dataset. The computed average AUROC standard error corresponds to the robustness of that embedding method on the selected dataset, as larger values of standard error correspond to larger variance in AUROC scores across the 5 folds. In Figure 8, we report the distribution of the average AUROC standard error of each embedding method over all datasets. We observed large variance in the average AUROC standard error scores on the WebKB datasets, so we show the results for the WebKB datasets in Figure 8a and the results for the other datasets in Figure 8b. Interestingly, even on the WebKB datasets, the variance in the average AUROC standard error scores is low for MNMF. From Figure 8b, we observe that the median of the box-plots of the majority of the methods is close to zero.

[Figure 8: Box-plots of the distribution of the standard error of AUROC scores computed on all the datasets for the link prediction task. (a) WebKB datasets. (b) All datasets except the WebKB datasets.]


Figure 9: The box-plot represents the distribution of the differences in AUROC scores between the 128 dimensional embedding and the 64 dimensional embedding for each method on all datasets.

From Figure 8b, we observe that the median of the box-plots of the majority of the methods is closer to zero. The fold-level summary is sketched below.
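A minimal sketch of the per-dataset summary behind Figure 8, assuming the standard error of the mean over the 5 folds:

```python
import numpy as np

def auroc_mean_and_stderr(fold_scores):
    """Mean and standard error of per-fold AUROC scores; a larger
    standard error indicates a less robust method on that dataset."""
    scores = np.asarray(fold_scores, dtype=float)
    return scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))

# five hypothetical fold-level AUROC scores for one method on one dataset
mean_auc, stderr = auroc_mean_and_stderr([0.91, 0.89, 0.92, 0.90, 0.88])
```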

• Impact of embedding dimension: We study the impact of the embedding size for all the embedding methods on link prediction. Specifically, we compare the performance of 64 dimensional embeddings with 128 dimensional embeddings. The improvement – quantified in terms of the performance difference – obtained with the 128 dimensional embedding over the 64 dimensional embedding is reported in Figure 9. The box-plot represents the distribution of differences in AUROC scores between the 128 dimensional embedding and the 64 dimensional embedding for each method on all datasets. In link prediction, we observe a statistically significant improvement at significance level 0.05 with the 128 dimensional embedding for Laplacian Eigenmaps, GraRep, HOPE, NetMF and MNMF. The significance test is sketched below.
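The significance test pairs the two scores per dataset. A sketch with hypothetical numbers (the actual per-dataset scores are in our result tables):

```python
from scipy.stats import ttest_rel

# per-dataset AUROC for one method; the two lists are aligned by dataset
auroc_128 = [0.91, 0.84, 0.88, 0.95, 0.79]  # hypothetical values
auroc_64  = [0.89, 0.83, 0.86, 0.93, 0.78]

t_stat, p_value = ttest_rel(auroc_128, auroc_64)  # paired t-test
print("significant at 0.05:", p_value < 0.05)
```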

• Impact of embedding normalization: We study the impact of L2 normalization of the embeddings on the link prediction performance. The comparison results are shown in Figure 10, where Figure 10a and Figure 10b show the comparison results when link prediction is performed through a classifier and through the dot-product, respectively. The box plot represents the distribution of differences in AUROC between normalized and unnormalized embeddings on the link prediction task. A positive difference implies that L2 normalization results in better downstream performance. When link prediction is performed through a classifier, the negative difference is statistically significant for VERSE and GraRep at significance level 0.05 with a paired t-test. However, surprisingly, the difference in performance with respect to normalization of embeddings is not statistically significant for the rest of the methods. When link prediction is performed through the dot-product, normalization of the embeddings results in a statistically significant improvement for Node2Vec and VERSE, while not normalizing the embeddings results in a statistically significant improvement for HOPE, NetMF and WYS. A sketch of the dot-product evaluation appears below.
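A sketch of the dot-product evaluation with optional L2 normalization; the function and argument names are illustrative. With L2-normalized rows the dot product reduces to cosine similarity, which is why normalization can change the ranking of candidate edges:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import normalize

def dot_product_auroc(emb, pos_edges, neg_edges, l2_normalize=True):
    """AUROC of edge scores computed as dot products of endpoint embeddings."""
    if l2_normalize:
        emb = normalize(emb)  # row-wise L2 normalization
    edges = list(pos_edges) + list(neg_edges)
    scores = [float(emb[u] @ emb[v]) for u, v in edges]
    labels = [1] * len(pos_edges) + [0] * len(neg_edges)
    return roc_auc_score(labels, scores)
```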

6.2 Node Classification

The node classification performance of 12 embedding methods, measured in terms of Micro-f1 scores on 15 datasets with a train-test split of 50:50, is reported in Figure 11 and Figure 12. The overall performance of an embedding method on all the datasets is shown at the end of the horizontal bar of each method in Figure 11 and Figure 12 and represents the sum of the method's Micro-f1 scores over the datasets. The Mean Rank and Mean Penalty of the embedding methods – on the datasets for which all methods run to completion on our system – are shown in Figure 13. We also report the Mean Rank and Mean Penalty of the embedding methods – on datasets with few labels – in Figure 14. We tune the hyper-parameters of each embedding method – mentioned in section 3 – and report the best Micro-f1 score. In the case of the WebKB datasets, we evaluate the methods on the generated embeddings of dimensions 64 and 128. We perform the node classification with both normalized and unnormalized embeddings and report the best performance. We also provide tabulated results in Tables 5 and 6. We make the following observations.

• Effectiveness of NetMF for node classification: We observe that NetMF achieves the highest overall node classification performance in terms of best Micro-f1 scores using both linear and non-linear classifiers.


(a) Prediction of link through classifier

(b) Prediction of link through dot-product

Figure 10: The box-plot represents the distribution of the differences in AUROC scores between L2 normalized embeddings and unnormalized embeddings.

(a) Datasets with few labels: W-Texas, W-Cornell, W-Washington, W-Wisconsin, DBLP (Co-Author), Pubmed

(b) Datasets with more labels: PPI, Wikipedia, Blogcatalog, CoCit (microsoft), Flickr, Youtube

Figure 11: The node classification performance measured with Micro-f1 on a train-test split of 50:50 with Logistic Regression. For each method, the number at the end of the bar represents the summation of the Micro-f1 values across the datasets.

From Figure 13, we see that the Mean Rank and Mean Penalty are lowest for NetMF, which suggests that NetMF is the strongest overall method for node classification. NetMF achieves a low Mean Rank, suggesting that NetMF is among the top-ranked methods on the evaluated datasets. The smallest value of Mean Penalty suggests that even when NetMF is not the top-ranked method for a particular dataset, NetMF's performance is closest to that of the top-ranked method for that dataset (we sketch the computation of these summary statistics below). However, it does not entirely outperform the other methods on all the datasets. LINE, DeepWalk, and Node2Vec are also competitive baselines for the task of node classification, as their overall performance is closest to that of NetMF. The performance of GraRep on datasets with more labels is comparable with the other methods when we exclude the Flickr dataset. However, the reported results for GraRep on the Flickr dataset are with embeddings of dimension 64: embedding dimensions 128 and 256 for GraRep resulted in memory errors on a modern machine with 500GB RAM and 28 cores. Note that NetMF did not scale to the YouTube dataset. While scalability is currently outside the scope of our study, the scalability of such methods is under active development (we refer the interested reader elsewhere [48, 33]).
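A sketch of how these summary statistics can be computed, under the assumption (consistent with the discussion above) that a method's penalty on a dataset is its gap to the best score achieved on that dataset; ties are not handled specially here:

```python
import numpy as np

def mean_rank_and_penalty(scores):
    """scores: 2-D array with rows = methods and columns = datasets.
    Rank 1 is the best method on a dataset; the penalty of a method
    on a dataset is the gap to that dataset's best score."""
    scores = np.asarray(scores, dtype=float)
    ranks = (-scores).argsort(axis=0).argsort(axis=0) + 1  # per-column ranks
    mean_rank = ranks.mean(axis=1)
    mean_penalty = (scores.max(axis=0) - scores).mean(axis=1)
    return mean_rank, mean_penalty
```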

• Laplacian Eigenmaps Performance: We observe that the Laplacian Eigenmaps method achieves competitive Micro-f1 scores on several datasets. For instance, on the Blogcatalog dataset with 39 labels, Laplacian Eigenmaps achieves the best Micro-f1 score of 42.1%, while on the Pubmed dataset, Laplacian Eigenmaps outperforms all other embedding methods with 81.7% Micro-f1. With a non-linear classifier, Laplacian Eigenmaps achieves the second-best performance on the PPI dataset with 23.8% Micro-f1. We observe from Figure 13 that the Mean Penalty of Laplacian Eigenmaps is also close to that of other embedding methods, namely VERSE, MNMF, and VAG. On the PPI and Flickr datasets, the Micro-f1 of the Laplacian Eigenmaps baseline is close to the best Micro-f1. The observed results for Laplacian Eigenmaps on the evaluated datasets are better than the previously reported results [20, 21] for both node classification and link prediction. This improvement in the performance of Laplacian Eigenmaps is due to hyper-parameter tuning of the logistic regression classifier, sketched below.
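A minimal sketch of the classifier tuning that lifts Laplacian Eigenmaps; the exact grid we searched may differ, and the arrays here are random stand-ins for real embeddings and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X = np.random.rand(200, 64)        # stand-in for node embeddings
y = np.random.randint(0, 3, 200)   # stand-in for node labels

clf = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},  # regularization strength
    scoring="f1_micro",
    cv=5,
)
clf.fit(X, y)
print(clf.best_params_)
```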

• Node Classification Heuristic: We observe from Figure 11a and Figure 12a that the node classification heuristic baseline is competitive against other embedding methods on datasets with fewer labels (up to 5 labels), as its overall score is better than that of many of the methods.


(a) Datasets with few labels: W-Texas, W-Cornell, W-Washington, W-Wisconsin, DBLP (Co-Author), Pubmed

(b) Datasets with more labels: PPI, Wikipedia, Blogcatalog, CoCit (microsoft), Flickr, YouTube

Figure 12: The node classification performance measured with Micro-f1 on a train-test split of 50:50 with a non-linear classifier.


Figure 13: Mean Rank and Mean Penalty – on all datasets where all methods finish execution – for node classification with Micro-f1 as the performance metric and Logistic Regression as the classifier.

This observation can also be verified from Figure 14, as both the Mean Rank and Mean Penalty of the node classification heuristics baseline are better than those of many of the methods. However, as the number of labels in a dataset increases (>5 labels), we observe that the Micro-f1 scores of the node heuristics baseline decrease drastically. The decrease in overall performance reflects that the node heuristic features lack the discriminative power to classify multiple labels.

Feature study on the Heuristics baseline: We study the importance of the individual features in the node classification heuristic by analyzing the impact of removing each feature on the node classification performance. The results for the node classification heuristic with logistic regression and EigenPro are reported in Figure 15a and Figure 15b, respectively. The blue line on top of the columns in the figures corresponds to the Micro-f1 scores achieved with the full set of proposed heuristic features. Removing an individual feature from the node classification heuristic did not result in a significant drop in downstream performance. However, we see that on most of the datasets the heuristic's classification performance with both Logistic Regression and EigenPro is better with all features than with any individual feature removed. A sketch of the feature computation follows.
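The seven features match the ablation labels in Figure 15. A sketch of how they can be assembled with networkx; this is a reconstruction, not the original feature-extraction code, and it assumes a connected graph so that eccentricity is defined:

```python
import networkx as nx
import numpy as np

def heuristic_features(G):
    """Per-node feature matrix: degree, PageRank score, clustering
    coefficient, hub score, eccentricity, average neighbor degree,
    and authority score."""
    pagerank = nx.pagerank(G)
    clust = nx.clustering(G)
    hubs, auths = nx.hits(G)
    ecc = nx.eccentricity(G)            # requires a connected graph
    avg_nbr_deg = nx.average_neighbor_degree(G)
    return np.array([
        [G.degree(n), pagerank[n], clust[n], hubs[n],
         ecc[n], avg_nbr_deg[n], auths[n]]
        for n in G.nodes()
    ])

X = heuristic_features(nx.karate_club_graph())  # shape (34, 7)
```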

• Context embeddings can improve performance: We see from Figure 16 that leveraging both the node and context embeddings of Skip-gram based models results in a significant improvement (up to 25%) for most of the methods.


Figure 14: The Mean Rank and Mean Penalty – on datasets with few labels – for node classification with Micro-f1 as the performance metric and Logistic Regression as the classifier.

On the Pubmed dataset, we observe that the node classification performance of embedding methods like LINE (2nd order), HOPE, and WYS was significantly lower than that of the other methods. The Micro-f1 scores of the embedding methods are shown in Figure 16. We found that the Pubmed dataset consists of around 80% sink nodes. As a result, when embedding methods based on the Skip-gram model generate the node embeddings, the sink nodes are always treated as "context" nodes and never as "source" nodes. Hence, the node embeddings of sink nodes are of lower quality. In order to have a fair comparison, we concatenate both the node and context embeddings of the methods (whenever possible) and evaluate the performance on the concatenated embeddings. The sink-node check is sketched below.
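The sink-node diagnostic is easy to reproduce: a sink in a directed graph is a node with no outgoing edges, so it never starts a random walk and never acts as a "source" in Skip-gram style training. A minimal sketch:

```python
import networkx as nx

def sink_fraction(G: nx.DiGraph) -> float:
    """Fraction of nodes with out-degree zero ('sink' nodes)."""
    sinks = sum(1 for n in G.nodes() if G.out_degree(n) == 0)
    return sinks / G.number_of_nodes()

G = nx.DiGraph([(1, 2), (3, 2), (4, 2)])  # toy graph: node 2 is a sink
print(sink_fraction(G))                   # 0.25
```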

• Impact of nonlinear classifier: We study the impact of a non-linear classifier on the node classification performance. The comparison results are shown in the box plot of Figure 17. The box plot represents the distribution of differences in Micro-f1 scores computed with the non-linear (EigenPro [37]) and linear (Logistic Regression) classifiers. A positive difference implies that the results with the non-linear classifier are better than those with the linear classifier. For VERSE, we see a 15% absolute increase with the use of the non-linear classifier on the PubMed dataset.


(a) Linear classifier: Logistic Regression

(b) Nonlinear classifier: EigenPro

Figure 15: Feature study of the node classification heuristic. Each ablation removes one feature (Degree, PageRankScore, NodeClustCf, HubScore, NodeEcc, AverageNeighborDegree, AuthScore) and is compared against the full feature set.


Figure 16: Node classification on a directed dataset (PubMed) with/without concatenation of node embeddings and context embeddings (128 dimensions).

The positive difference is statistically significant (paired t-test) for DeepWalk, VERSE, SDNE, GraRep and MNMF at significance level 0.05. It is worth pointing out that on the smaller datasets this gain is less evident, while on the larger datasets (more training data) the benefits of using a non-linear classifier are much clearer. A comparison sketch appears below.
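Since EigenPro's interface varies across releases, the sketch below uses an RBF-kernel SVM as a stand-in non-linear classifier to illustrate the linear vs non-linear comparison on one dataset; the data are random stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X = np.random.rand(300, 64)        # stand-in node embeddings
y = np.random.randint(0, 3, 300)   # stand-in labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.5, random_state=0)

for name, clf in [("linear", LogisticRegression(max_iter=1000)),
                  ("non-linear", SVC(kernel="rbf"))]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(name, f1_score(y_te, pred, average="micro"))
```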

• Impact of embedding dimension: We study the impact of the embedding size for all the embedding methods on the node classification task. Specifically, we compare the performance of 64 dimensional embeddings with 128 dimensional embeddings. The improvement – quantified in terms of the performance difference – obtained with the 128 dimensional embedding over the 64 dimensional embedding is reported in Figure 18. The box-plot represents the distribution of differences in Micro-f1 scores between the 128 dimensional embedding and the 64 dimensional embedding for each method on all datasets. In node classification with the linear classifier, none of the evaluated methods obtained a statistically significant difference at significance level 0.05, while with the non-linear classifier, HOPE obtained a statistically significant positive difference – at significance level 0.05 – with the 128 dimensional embedding.

• Impact of embedding normalization: We study the impact of L2 normalization of the embeddings for the node classification task. The comparison results are shown in Figure 19.


Figure 17: The box-plot represents the distribution of differences between the non-linear and linear classifier on all the datasets.

The box plot represents the distribution of differences in Micro-f1 scores between normalized and unnormalized embeddings for each embedding method on the node classification task. A positive difference implies that L2 normalization results in better downstream performance. In node classification with the linear classifier, the positive difference is statistically significant for NetMF while the negative difference is statistically significant for DeepWalk, at significance level 0.05 with a paired t-test. With the non-linear classifier, the positive difference is statistically significant for NetMF at significance level 0.05 with a paired t-test.

• Node classification performance on 10:90 train:test split: We report the node classification performance of all methods on all the evaluated datasets with a 10:90 train:test split, with the logistic regression classifier in Figure 20 and the non-linear classifier EigenPro in Figure 21. The Mean Rank and Mean Penalty of the embedding methods on the datasets for which all methods run to completion on our system are shown in Figure 22a. We also report the Mean Rank and Mean Penalty of the embedding methods on datasets with few labels in Figure 22b. The observations we reported with the 50:50 train:test split also seem to hold with the 10:90 train:test split. Specifically, we observe that NetMF is the most competitive method for node classification, while the Laplacian Eigenmaps method outperforms multiple existing methods on multiple datasets (Blogcatalog and Co-Author).


(a) Linear classifier: Logistic Regression

(b) Nonlinear classifier: EigenPro

Figure 18: The box-plot represents the distribution of the differences in Micro-f1 scores between the 128 dimensional embedding and the 64 dimensional embedding for each method on all datasets.

(a) Linear classifier: Logistic Regression

(b) Nonlinear classifier: EigenPro

Figure 19: Impact of embedding normalization on node classification performance.

Embedding methods such as DeepWalk and LINE also perform well on most datasets. A sketch of the 10:90 evaluation protocol appears below.
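For completeness, a sketch of the 10:90 evaluation protocol itself, with random stand-ins for embeddings and labels; stratification keeps the label proportions in the small training set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

emb = np.random.rand(500, 128)         # stand-in node embeddings
labels = np.random.randint(0, 5, 500)  # stand-in labels

# train_size=0.1 gives the 10:90 train:test split
X_tr, X_te, y_tr, y_te = train_test_split(
    emb, labels, train_size=0.1, stratify=labels, random_state=0)
pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
print("Micro-f1:", f1_score(y_te, pred, average="micro"))
```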

7. DISCUSSIONS AND CONCLUSIONS

Network representation learning has attracted a lot of attention in the past few years. An interested reader can refer to the surveys of network embedding methods [11, 25, 61]. These surveys focus on categorizing the embedding methods based either on an encoder-decoder framework [25] or on a novel taxonomy [11, 61], but do not provide an experimental comparison of the embedding methods. There does exist one other experimental survey of network embedding methods [20]. However, there are key differences. First, we present a systematic study on a larger set of embedding methods, including several more recent ideas, and on many more datasets (15 vs 7). Specifically, we evaluate 12 embedding methods plus 2 efficient heuristics on 15 datasets. Second, there are several key differences in terms of the results reported and reproducibility. In our work we carefully tune all hyper-parameters of each method as well as the logistic classifier (and include the details in our reproducibility notes). As a concrete example of where such careful tuning can make a difference, consider that on Blogcatalog with a train-test split of 50:50, Goyal et al. [20] achieve a Macro-f1 score of 3.9%, while by tuning the hyper-parameters of logistic regression we achieve a Macro-f1 score of 29.2%. Third, our analysis reveals several important insights on the role of context, the role of different link prediction evaluation strategies (dot product vs classifier), the impact of non-linear classifiers, and many others. All of these provide useful insights for end-users as well as guidance for future research and evaluation in network representation learning and downstream applications. Fourth, we also provide a comparison against simple but effective task-specific baseline heuristics which will serve as useful strawman methods for future work in these areas.

To conclude, we identify several issues in the current literature: the lack of a standard assessment protocol, the use of default parameters for baselines, the lack of standard benchmarks, and the neglect of task-specific baselines. Additionally, we make the following observations:

• MNMF and NetMF are the most effective baselines for the link prediction and node classification tasks, respectively.
• No single method completely outperforms the other methods on both the link prediction and node classification tasks.
• If one considers Laplacian Eigenmaps as a baseline, the classifier parameters should be tuned appropriately.


(a) Datasets with few labels: W-Texas, W-Cornell, W-Washington, W-Wisconsin, DBLP (Co-Author), Pubmed

(b) Datasets with more labels: PPI, Wikipedia, Blogcatalog, CoCit (microsoft), Flickr, YouTube

Figure 20: The node classification performance measured with Micro-f1 on a train-test split of 10:90 with Logistic Regression. For each method, the number at the end of the bar represents the summation of the Micro-f1 values across the datasets.

(a) Datasets with few labels: W-Texas, W-Cornell, W-Washington, W-Wisconsin, DBLP (Co-Author), Pubmed

(b) Datasets with more labels: PPI, Wikipedia, Blogcatalog, CoCit (microsoft), Flickr, YouTube

Figure 21: The node classification performance measured with Micro-f1 on a train-test split of 10:90 with a non-linear classifier.

(a) On all datasets where all methods finish execution

(b) On datasets with few labels

Figure 22: The Mean Rank and Mean Penalty for node classification with Micro-f1 as the performance metric, a train-test split of 10:90, and Logistic Regression as the classifier.

• The Link Prediction Heuristic we present is simple, efficient to compute, and offers competitive performance. The Node Classification Heuristic is also simple and efficient to compute and is effective on datasets with fewer labels.
• For both tasks, some methods are impervious to the use of context, whereas for other methods context helps significantly.
• When comparing embedding methods through the link prediction task, the superiority of an embedding method should be asserted by leveraging the classifier.

We hope the insights put forward in this study are helpful to the community and encourage the comparison of novel embedding methods with task-specific competitive methods and the proposed task-specific heuristics.


Table 3: Link Prediction performance measured with AUROC. The "-" represents that the method did not scale on the particular dataset.

Datasets Heuristics LapEig DeepWalk Node2Vec Verse LINE GraRep HOPE SDNE NetMF MNMF VAG WYS

W-Texas 77.7 73.1 79.7 83 81.7 78.2 78.7 78.7 82.4 80.9 96.0 78.6 83.0
W-Cornell 81.5 77.3 79.2 82 87 77.5 84.4 79.9 80.2 81.2 96.7 74.8 84.4
W-Washington 75.3 70.1 75.3 75 82.2 72.9 78.1 72.8 76.7 75.5 97.5 73.9 79.2
W-Wisconsin 79.4 71.4 80.7 78 88.2 72.5 84.3 75.5 76.9 82.3 98.9 73.9 84.5
PPI 90.9 78.2 89.1 88.3 89.6 87.8 90 88.4 89.3 87.3 96.9 87.4 91.5
Wikipedia 91.6 77.9 90.9 90.9 91.3 91.2 92.3 90.4 50 91.4 88.4 89.5 92.3
Wiki-Vote 91.5 83.5 97.4 97.6 94.9 96.6 88.4 97.8 96.6 95.5 92.2 94.3 98.2
BlogCatalog 95.2 77.4 94.3 95 97.3 95.2 96.2 95.3 95.6 95.1 94 94.8 96.0
DBLP (Co-Author) 95.6 93.3 96 95.4 97.9 94.3 97.1 89.6 50 95.9 99.4 94.1 96.8
Pubmed 87.7 89.6 89.1 89.3 96.6 92.7 77.7 90.1 88.9 89.8 94.3 93.6 97.0
CoCit (microsoft) 89.5 95.6 97.6 97.3 83.7 97.2 97.9 94.5 91.9 96.9 96.6 96.2 -
P2P 83.8 69.9 88.2 88.3 77.6 91.2 71.8 88.6 83.9 87.5 92.3 - -
Flickr 92.4 93 95.8 94.7 72.6 95.2 95.5 96.5 93 97.2 - - -
Epinions 92.2 90.9 93.3 93.4 91.9 91.6 93.7 92.7 92.7 92.8 - - -
Youtube 96.2 96 93.6 91.4 97.6 96.5 91.4 92.4 - - - - -

Table 4: Link Prediction performance measured with AUPR. The "-" represents that the method did not scale on the particular dataset.

Datasets Heuristics LapEig DeepWalk Node2Vec Verse LINE GraRep HOPE SDNE NetMF MNMF VAG WYS

W-Texas 81.8 78 81.9 85 85 82.2 82.1 82.9 85.5 83.6 96.0 80.6 85.2
W-Cornell 81.9 78.9 79.8 81 87.8 76.7 86.2 79.8 79.5 82 96.7 75.6 86.6
W-Washington 80.3 75 76.5 78 86.5 75.5 82.9 77.2 81.5 80.4 97.7 78.7 83.4
W-Wisconsin 82.3 74.8 81.6 79 90.7 76.4 87.4 78.8 80.7 85.2 98.5 78.1 86.7
PPI 91.4 80.7 90.4 89.5 90.7 88.1 90.8 89.2 90.2 87.9 96.5 88.1 92.2
Wikipedia 93 76 92.5 92.3 92.8 92.8 93.1 91.8 75 92.8 89.9 91.4 93.5
Wiki-Vote 87.9 82.1 96.9 97.2 94.8 95.3 84 96.8 96.3 93.8 87.6 94.3 97.4
BlogCatalog 95.1 77.5 94.3 94.8 97.9 95.1 96 95 95.5 94.8 93.6 94.6 96.1
DBLP (Co-Author) 96.7 93.8 96.8 96.1 98.2 95.6 97.4 90.9 75 96.7 99.2 95.2 97.3
Pubmed 85 85 81.5 82.3 96.8 90.3 74.1 91.4 90.5 86.1 90.3 95.2 96.9
CoCit (microsoft) 91.9 95.5 97.9 97.5 76.4 97.7 97.9 95.3 93.5 97.1 95.2 96.4 -
P2P 79.3 68.1 84.3 84.5 68.3 88.6 71.3 85.8 80.8 84 89.6 - -
Flickr 92.5 95 96.1 95 70.9 95.5 95.7 96.7 94 97.6 - - -
Epinions 89.2 89.5 91.7 91.9 88.6 88.8 93.0 91.5 91.6 91.8 - - -
Youtube 96.7 96.7 95 93 98.2 97 92.2 94 - - - - -

Table 5: Node Classification performance measured in terms of Micro-f1 with a train-test split of 50:50 with Logistic Regression. The "-" represents that the method did not scale on the particular dataset.

Datasets Heuristics LapEig DeepWalk Node2Vec Verse LINE GraRep HOPE SDNE NetMF MNMF VAG WYS

W-Texas 61.8 54.6 55.1 57.2 54.5 61.8 56.1 59.1 58.0 67.1 58.0 54.8 60.6
W-Cornell 42.1 30.9 40.5 34.3 35.4 44.1 40.9 41.9 48.5 48.1 36.1 40.3 41.8
W-Washington 65.0 43.3 56.0 58.4 51.5 65.3 46.7 62.8 60.5 61.1 60.4 59.0 65.3
W-Wisconsin 51.7 41.5 52.3 45.6 41.7 52.9 50.8 51.5 51.3 56.7 53.9 48.1 53.2
PPI 10.8 22.3 21.4 21.0 19.7 19.9 20.4 18.8 17.4 21.3 18.6 19.2 22.6
Wikipedia 41.9 46.3 50.0 51.4 43.8 56.3 58.8 57.9 52.4 58.4 48.1 41.1 44.4
Blogcatalog 17.1 42.1 41.5 41.7 35.5 38.6 41.3 34.4 29.5 41.7 21.6 17.1 38.9
DBLP (Co-Author) 37.3 37.1 35.9 35.6 37.2 37.0 35.7 36.0 37.4 36.6 36.2 36.2 -
Pubmed 57.8 81.7 81.5 81.1 63.0 64.4 79.1 74.7 67.7 80.0 77.1 63.5 73.6
CoCit (microsoft) 25.0 43.0 46.3 46.7 46.0 46.5 46.6 44.6 38.1 43.5 44.7 43.5 -
Flickr 19.1 34.0 35.6 35.1 30.1 33.4 10.5 28.7 30.8 34.2 - - -
Youtube 24.4 - 40.7 40.3 38.5 40.3 38.0 38.7 - - - - -

Table 6: Node Classification performance measured in terms of Macro-f1 with a train-test split of 50:50 with Logistic Regression. The "-" represents that the method did not scale on the particular dataset.

Datasets Heuristics LapEig DeepWalk Node2Vec Verse LINE GraRep HOPE SDNE NetMF MNMF VAG WYS

W-Texas 42.1 18.1 26.9 22.7 18.1 36.4 25.0 34.4 39.5 49.9 23.8 27.8 40.3
W-Cornell 22.2 21.3 28.1 23.2 22.4 25.0 27.7 28.7 13.1 32.8 23.5 20.6 26.5
W-Washington 32.2 22.2 24.3 27.3 23.1 30.6 28.7 30.2 29.1 29.7 28.6 26.3 31.1
W-Wisconsin 24.7 29.0 31.9 23.9 21.9 27.8 34.7 28.3 26.1 34.8 33.8 25.5 33.9
PPI 6.0 17.9 18.1 18.0 16.5 16.9 17.4 15.9 15.2 17.5 15.9 13.1 17.9
Wikipedia 5.5 10.4 11.9 12.9 8.2 18.2 18.3 20.1 14.1 18.4 11.0 3.8 10.1
Blogcatalog 3.1 29.2 27.3 27.9 22.1 23.6 28.9 20.8 14.8 28.8 8.2 3.1 26.3
DBLP (Co-Author) 18.1 20.1 30.0 29.4 20.6 19.2 30.5 28.6 21.1 30.0 26.6 27.8 -
Pubmed 48.9 80.2 80.1 79.8 58.0 61.3 77.6 73.0 63.3 78.4 75.4 60.6 71.6
CoCit (microsoft) 12.6 27.3 34.3 34.2 33.3 33.8 34.8 32.8 27.8 34.0 30.4 29.2 -
Flickr 1.7 20.4 21.2 20.7 17.6 18.2 0.9 11.4 14.9 20.2 - - -
Youtube 9.3 - 34.7 34.0 32.1 33.1 30.0 30.8 - - - - -


8. REFERENCES

[1] Microsoft Academic Graph - KDD cup, 2016. https://kddcup2016.azurewebsites.net/Data.

[2] S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. Alemi. Watch your step: Learning node embeddings via graph attention. In Neural Information Processing Systems, 2018.

[3] L. A. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 25(3):211–230, 2003.

[4] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. P. Xing, and T. Jaakkola. Mixed membership stochastic block models for relational data with application to protein-protein interactions. In Proceedings of the International Biometrics Society Annual Meeting, 2006.

[5] A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

[6] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

[7] A. R. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.

[8] S. Bhagat, G. Cormode, and S. Muthukrishnan. Node classification in social networks. In Social Network Data Analytics, pages 115–148. Springer, 2011.

[9] D. K. Bhattacharyya and J. K. Kalita. Network anomaly detection: A machine learning perspective. Chapman and Hall/CRC, 2013.

[10] B.-J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, R. Oughtred, D. H. Lackner, J. Bahler, V. Wood, et al. The BioGRID interaction database. Nucleic Acids Research, 36(suppl 1):D637–D640, 2007.

[11] H. Cai, V. W. Zheng, and K. C.-C. Chang. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637, 2018.

[12] S. Cao, W. Lu, and Q. Xu. GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 891–900. ACM, 2015.

[13] E. K. Cetinkaya, M. J. Alenazi, A. M. Peck, J. P. Rohrer, and J. P. Sterbenz. Multilevel resilience analysis of transportation and communication networks. Telecommunication Systems, 60(4):515–537, 2015.

[14] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In ACM SIGMOD Record, pages 307–318. ACM, 1998.

[15] W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 475–480. ACM, 2002.

[16] G. Crichton, Y. Guo, S. Pyysalo, and A. Korhonen. Neural networks for link prediction in realistic biomedical graphs: a multi-dimensional evaluation of graph embedding-based approaches. BMC Bioinformatics, 19(1):176, May 2018.

[17] R. W. Eckardt III, R. G. Wolf Jr, A. Shapiro, K. G. Rivette, and M. F. Blaxill. Method and apparatus for selecting, analyzing, and visualizing related database records as a network, Mar. 2 2010. US Patent 7,672,950.

[18] D. Eppstein, M. S. Paterson, and F. F. Yao. On nearest-neighbor graphs. Discrete & Computational Geometry, 17(3):263–282, 1997.

[19] L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment, 5(12):2018–2019, 2012.

[20] P. Goyal and E. Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.

[21] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.

[22] Y. Gu, Y. Sun, and J. Gao. The co-evolution model for social network evolving and opinion migration. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 175–184. ACM, 2017.

[23] M. Gutmann and A. Hyvarinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.

[24] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.

[25] W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.

[26] P. Jaccard. Etude comparative de la distribution florale dans une portion des alpes et des jura. Bull Soc Vaudoise Sci Nat, 37:547–579, 1901.

[27] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538–543. ACM, 2002.

[28] T. N. Kipf and M. Welling. Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning, 2016.

[29] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.

[30] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.

[31] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pages 489–504. ACM, 2018.

[32] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Signed networks in social media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1361–1370. ACM, 2010.

[33] J. Liang, S. Gurukar, and S. Parthasarathy. MILE: A multi-level framework for scalable graph embedding. arXiv preprint arXiv:1802.09612, 2018.

[34] J. Liang, P. Jacobs, J. Sun, and S. Parthasarathy. Semi-supervised embedding in attributed networks with outliers. In Proceedings of the 2018 SIAM International Conference on Data Mining, pages 153–161. SIAM, 2018.

[35] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.

[36] L. Lu and T. Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170, 2011.

[37] S. Ma and M. Belkin. Diving into the shallows: a computational perspective on large-scale shallow learning. In Advances in Neural Information Processing Systems, pages 3778–3787, 2017.

[38] M. Mahoney. Large text compression benchmark. URL: http://www.mattmahoney.net/text/text.html, 2011.

[39] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119, USA, 2013. Curran Associates Inc.

[40] A. Mnih and G. E. Hinton. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, pages 1081–1088, 2009.

[41] G. E. Moon, A. Sukumaran-Rajam, S. Parthasarathy, and P. Sadayappan. PL-NMF: Parallel locality-optimized non-negative matrix factorization. arXiv preprint arXiv:1904.07935, 2019.

[42] M. Newman. Networks: an introduction. Oxford University Press, 2010.

[43] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.

[44] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu. Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1105–1114. ACM, 2016.

[45] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.

[46] C. C. Paige and M. A. Saunders. Towards a generalized singular value decomposition. SIAM Journal on Numerical Analysis, 18(3):398–405, 1981.

[47] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.

[48] J. Qiu, Y. Dong, H. Ma, J. Li, C. Wang, and K. Wang. NetSMF: Large-scale network embedding as sparse matrix factorization. Proceedings of the 2019 World Wide Web Conference on World Wide Web, 2019.

[49] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang. Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 459–467. ACM, 2018.

[50] L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 385–394. ACM, 2017.

[51] M. Richardson, R. Agrawal, and P. Domingos. Trust management for the semantic web. In International Semantic Web Conference, pages 351–368. Springer, 2003.

[52] M. Ripeanu and I. Foster. Mapping the Gnutella network: Macroscopic properties of large-scale peer-to-peer systems. In International Workshop on Peer-to-Peer Systems, pages 85–93. Springer, 2002.

[53] A. Sinha, R. Cazabet, and R. Vaudaine. Systematic biases in link prediction: Comparing heuristic and graph embedding based methods. In L. M. Aiello, C. Cherifi, H. Cherifi, R. Lambiotte, P. Lio, and L. M. Rocha, editors, Complex Networks and Their Applications VII, pages 81–93, Cham, 2019. Springer International Publishing.

[54] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.

[55] A. Tsitsulin, D. Mottin, P. Karras, and E. Muller. VERSE: Versatile graph embeddings from similarity measures. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 539–548. International World Wide Web Conferences Steering Committee, 2018.

[56] P. Vijayan, Y. Chandak, M. M. Khapra, and B. Ravindran. Fusion graph convolutional networks. Mining and Learning with Graphs (MLG), KDD, 2018.

[57] C. Wang, V. Satuluri, and S. Parthasarathy. Local probabilistic models for link prediction. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM '07, pages 322–331, Washington, DC, USA, 2007. IEEE Computer Society.

[58] D. Wang, P. Cui, and W. Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1225–1234. ACM, 2016.

[59] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang. Community preserving network embedding. In AAAI, pages 203–209, 2017.

[60] R. Zafarani and H. Liu. Social computing data repository at ASU, 2009.

[61] D. Zhang, J. Yin, X. Zhu, and C. Zhang. Network representation learning: A survey. IEEE Transactions on Big Data, 2018.


[62] F. Zhang, W. Zhang, Y. Zhang, L. Qin, and X. Lin. OLAK: an efficient algorithm to prevent unraveling in social networks. Proceedings of the VLDB Endowment, 10(6):649–660, 2017.

[63] M. Zhao and V. Saligrama. Anomaly detection with score functions based on nearest neighbor graphs. In Advances in Neural Information Processing Systems, pages 2250–2258, 2009.

[64] T. Zhou, L. Lu, and Y.-C. Zhang. Predicting missing links via local information. The European Physical Journal B, 71(4):623–630, 2009.