Graph Neural Networks use graphs when they shouldn't

Graph Neural Networks use graphs when they shouldn't




Besides the the title being a bit exaggerated, it's been long-acknowledged in the GNN community that graph rewiring often helps learning; see [1, 2] for example. Also, at first skim, I'm surprised there's no discussion on oversmoothing/oversquashing in this paper. It seems like what they're calling "overfitting" on the graph structure could be interpreted as the GNN's node representations saturating/starving because of poor graph topology.

[1] - [2] -


Anyone working with GNNs have an experience with this? Has this affected your work?


Yes, and no. I just completed a PhD on GNNs, and in the last 4 years I have not used a single GNN without residual connection. It is quite well known that res connections allow to alleviate a lot of issues such as oversmoothing, and I expect that it would be the case here as well.


In my use case I care not one bit about the overall graph structure, but only about the inter-node interactions/relationships. But often times you will get graphs where you have 1% type A interactions, 12% type B interactions and 87% type C interactions and that makes training quite challenging. But I am not sure if you can stay that that is over-fitting the graph structure rather than simply learning the most common interactions in the dataset.


the paper is on my queue, but yeah, these kind of class imbalance & class dependency problems are why things like RGCNs and negative sampling are so popular. It's almost the typical case in cyber, fraud, supply chain, social, etc.

As others posted here, ideas like metapaths, transformer equivalence, etc are common building blocks in the GNN world because of these kinds of things. Important areas so def curious.


I've work with GNNs. I have not explicitly looked into this, but in general, the dynamics of GNNs are really interesting and there are many different lenses in which one can study the behavior of GNNs and find interesting takeaways.

- Explicit Message Passing:

- Algorithmic Reasoning:

- Diffusion:

- Spectral:

- Sparsity:

- Training Tricks:

- Interaction Networks / Physical Models:

- Expressive Power:

- "Over-Squashing":


This is a personal impressionistic opinion, so please take it as such and don’t expect citations or details. It has been empirically clear since the inception of GNN that the message passing structure and forced equivariance would make it hard to optimize GNNs for many practical applications. In a sense if you could solve graph isomorphism with these architectures, you would find a counter example for an NP-complete problem, so few people expected magic, but it was still disheartening to see the old school hashed bags of subgraphs outperform fancy GNNs on practical problems and to see that pretraining didn’t help the GNNs as much, or to see that the internal structures of these models are not quickly building up sensible hierarchies on their own. So then people started the hacking and exploring and by now the practical models are better than the early hacks but the theoretical optimization for how to deal with graphs in subdomains of extreme interest (say drug discovery, biology, or materials design) has not yet finished. By comparison autoregressive language models are super mature and generally applicable, as are transformers. These mixups in the early GNN led people to spend a lot of effort reformulating problems to match established architectures better or complement the GNN parts with hacks.


When you say hashed bags of sub graphs, are you generalizing bag of words, here?


I haven't worked on GNNs specifically, but I’ve done academic research on other architectures, and I used to perform a comprehensive literature review. Hope it gives some useful context.


The paper “Graph Neural Networks Tend to Overfit the Graph Structure” provides a significant contribution to the field of Graph Neural Networks (GNNs) by highlighting the issue of overfitting to the graph structure. This overfitting phenomenon can lead to reduced performance, especially when the graph structure is non-informative or irrelevant to the task at hand. To address this issue, the authors propose a graph-editing method called R-COV, which aims to reduce the coefficient of variation (COV) of a given graph, making it more similar to regular graphs.

This research aligns with other studies in the field that have also addressed the issue of overfitting in GNNs. For instance, the paper “Overfitting Mechanism and Avoidance in Deep Neural Networks” discusses the problem of overfitting in deep neural networks and proposes a consensus-based classification algorithm to address this issue. This algorithm identifies consistently classified samples and rejects samples that are classified randomly, which can help avoid overfitting even with a small number of training samples.

Another relevant study is “Neighborhood Random Walk Graph Sampling for Regularized Bayesian Graph Convolutional Neural Networks,” which proposes a Bayesian Graph Convolutional Neural Network (BGCN) that accounts for uncertainty in the graph structure and reduces overfitting. The BGCN model utilizes a neighborhood random walk-based graph sampling method to sample nodes from the observed graph, improving sampling efficiency and diversifying connections between nodes.

The paper “Training Graph Neural Networks by Graphon Estimation” also proposes a method for training GNNs that mitigates the issues of overfitting and over-smoothing. This method involves resampling the adjacency matrix of the graph from a graphon estimate obtained from the underlying network data. By resampling the adjacency matrix, the method provides data augmentation and introduces randomness to the GNN training process, which helps alleviate overfitting and over-smoothing.

Finally, the paper “Eigen-GNN: A Graph Structure Preserving Plug-in for GNNs” proposes the Eigen-GNN module to enhance GNNs’ ability to preserve graph structures by integrating the eigenspace of graph structures with GNNs. This approach significantly improves the preservation of graph structures and consistently outperforms existing GNNs in structure-driven tasks.


I'm curious how this relates to MDL-tied regularization. I don't see any explicit reference to the L2 norm here, and I'm not too smart about graph neural networks.

But it seems like overfitting is almost necessary unless some kind of obscure variational technique is used? They did have one paper where they used a progressively compressed GNN to isolate an approximation for a cosmological equation smaller and more accurate than had been previously achieved. I'll update here if I find it.

Edit: Ah, here we are: I believe there was some very good followup work posted on it this year that I saw some movement during the 3-5 months that I was really active on Twitter (brief time that it was), seems like an extremely promising area, and one I should learn if I truly do care about the Kolmogorov complexity of a problem.


Isn't the natural solution to just use attention layers instead of graph convolution layers then? Then there the attention mechanism will end up learning the useful graph structure via the attention weights?


It’s more or less the same thing. You could consider transformer a GNN.


in fact everything is a graph


Well the whole point of this paper is NOT everything is a graph.


in fact everything is a graph



transformers are GNNs where all nodes are connected to all nodes (well in the decoder you have masks but you see my point).

If the problem is that the graph is not sparse enough / not a graph at all, adding more connections doesn't help.

edit: Why doesn't it help? They address this in the paper. 1. There is computational infeasiability problem. 2. The transformer decoder can't be a regular graph if you include masking.


If they don't actually help then the attention weight for that connection would tend to 0, right? Then it becomes a problem of overfitting which we have a large arsenal to combat.


> If they don't actually help then the attention weight for that connection would tend to 0, right?

Not necessarily. That is what this paper is about. From what I understand, they also consider graph attention networks.