
Abstract

The limitations:

  • GNNs need fast inference.

  • Existing works focus on accelerating specific target GNNs, such as GCN.

  • Many works rely on pre-processing, which is not suitable for real-time applications.

This paper's work:

  • Proposes FlowGNN, which is generalizable to the majority of message-passing GNNs.

In more detail:

  1. A novel and scalable dataflow architecture (speeds up message passing)
  2. Real-time GNN inference without any pre-processing
  3. Verification of the proposed architecture on an FPGA board (Xilinx Alveo U50)

Related work and motivation

The paper points to a survey: Computing Graph Neural Networks: A Survey from Algorithms to Accelerators.

Limitation

The most significant limitation is that advanced GNNs cannot be reduced to matrix multiplications (i.e., SpMM and GEMM), and edge embeddings cannot be ignored.

Edge embedding

Edge embeddings are used to represent important edge attributes, such as chemical bonds in a molecule.

This makes SpMM and GEMM inapplicable.


Without edge embeddings, the message function is $\phi (x_i^l)$, which can be optimized with GEMM and SpMM. However, with edge embeddings it becomes $\phi (x_i^l,e_{i,j}^l)$, and the dimensions of the node representation and the edge representation are not the same.
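A small C++ sketch of why this matters (the types, dimensions, and the toy $\phi$ are mine, for illustration only): without edge embeddings, every out-edge of a node carries the same precomputable row $\phi(x_j)$, so aggregation collapses into one SpMM; with edge embeddings, every edge needs its own computation.

```cpp
#include <array>
#include <vector>
#include <cstddef>

constexpr std::size_t ND = 16;  // node embedding dim (assumed)
constexpr std::size_t ED = 4;   // edge embedding dim (assumed, != ND)

struct Edge { int src, dst; std::array<float, ED> emb; };

// Without edge embeddings: the message m_{j->i} = phi(x_j) is identical for
// every out-edge of j, so aggregation is exactly SpMM(A, phi(X)).
void aggregate_spmm(const std::vector<Edge>& edges,
                    const std::vector<std::array<float, ND>>& phi_x,
                    std::vector<std::array<float, ND>>& out) {
    for (const Edge& e : edges)
        for (std::size_t k = 0; k < ND; ++k)
            out[e.dst][k] += phi_x[e.src][k];  // rows of phi(X) are reused
}

// With edge embeddings: m_{j->i} = phi(x_j, e_ij) differs on every edge
// (and x and e even have different dimensions), so phi(X) cannot be
// precomputed once and the SpMM/GEMM pattern no longer applies.
void aggregate_edge_aware(const std::vector<Edge>& edges,
                          const std::vector<std::array<float, ND>>& x,
                          std::vector<std::array<float, ND>>& out) {
    for (const Edge& e : edges) {
        std::array<float, ND> m{};               // per-edge message
        for (std::size_t k = 0; k < ND; ++k)
            m[k] = x[e.src][k] + e.emb[k % ED];  // toy phi(x_j, e_ij)
        for (std::size_t k = 0; k < ND; ++k)
            out[e.dst][k] += m[k];
    }
}
```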


Existing optimizations become invalid

The SOTA accelerator I-GCN merges nodes that share the same neighbors into one node as an optimization.


However, once edge embeddings are considered, the edge embeddings $e_{a,c},e_{a,d}$ are not the same as $e_{b,c},e_{b,d}$, so we cannot merge nodes $a$ and $b$.

Non-trivial aggregation

GEMM and SpMM optimizations rely on a fixed computation pattern. However, in advanced GNNs the aggregation coefficients are dynamic, computed at runtime.

Anisotropic GNNs

For example, GAT computes an attention coefficient from $x_i^{l}$ and $x_j^l$. Because this coefficient is dynamic, GAT cannot be expressed as static matrix multiplications for optimization.
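A sketch of a single-head GAT attention logit (dimensions and parameter names here are illustrative, not from the paper): the score is recomputed per edge from runtime features, so it cannot be baked into a static sparse matrix ahead of time.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t D = 8;  // transformed feature dim (assumed)

// Unnormalized GAT attention logit for edge (i, j):
//   e_ij = LeakyReLU(a^T [W x_i || W x_j])
// Both Wx_i and Wx_j are runtime values, so e_ij (and the softmax-normalized
// alpha_ij) change with every input graph -- they are not static coefficients.
float attention_logit(const std::array<float, D>& Wx_i,
                      const std::array<float, D>& Wx_j,
                      const std::array<float, 2 * D>& a) {
    float s = 0.0f;
    for (std::size_t k = 0; k < D; ++k) s += a[k] * Wx_i[k];
    for (std::size_t k = 0; k < D; ++k) s += a[D + k] * Wx_j[k];
    return s > 0.0f ? s : 0.2f * s;  // LeakyReLU, slope 0.2
}
```

A softmax over node $i$'s incoming edges then normalizes these logits into $\alpha_{ij}$, which scale the messages during aggregation; since $\alpha$ depends on the input, it cannot be folded into a precomputed matrix.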

Pre-processing

Pre-processing is not applicable in real-time applications.

Motivations and Innovations

Innovations:

  • Explicit message passing for generality
  • Multi-queue multi-level parallelism for performance

GENERIC ARCHITECTURE

Message Passing Mechanism

$$
\begin{aligned}
\text{MP scatter:}\quad & m_{1\to 0}^{l-1}=\phi (x_1^{l-1},e_{0,1}^{l-1}) \\
\text{MP gather:}\quad & m_0^{l}=\mathcal{A}(m_{1\to 0}^{l-1}) \\
 & m_1^{l}=\mathcal{A}(m_{0\to 1}^{l-1},m_{2\to 1}^{l-1},m_{3\to 1}^{l-1}) \\
\text{NT:}\quad & x_1^{l}=\gamma (m_1^{l},x_1^{l-1})
\end{aligned}
$$

3 phases:

  1. Message passing (gather):

    Fetch the messages $m$ from the message buffer and aggregate them: $\mathcal{A}(m_{2\to1}^l, m_{3\to1}^l, m_{4\to1}^l)$.

    The aggregation function $\mathcal{A}(\cdot)$ can be sum, max, mean, or standard deviation.

  2. Node transformation:

    $x_1^l=\gamma (m_1^l,x_1^{l-1})$.

    The transformation $\gamma(\cdot)$ can be a fully-connected layer or an MLP.

  3. Message passing (scatter):

    Compute $m_{1\to 2}^{l+1}=\phi (x_1^{l+1},e_{1,2}^{l+1})$ and send it to the message buffer for the next layer's gather phase (see the sketch after this list).
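A minimal C++ sketch of one layer in this explicit message-passing form. The dense data structures and the toy $\phi$, $\gamma$, and sum-$\mathcal{A}$ below are my stand-ins, not FlowGNN's actual HLS implementation.

```cpp
#include <algorithm>
#include <array>
#include <vector>
#include <cstddef>

constexpr std::size_t D = 8;            // embedding dim (assumed)
using Vec = std::array<float, D>;
struct Edge { int src, dst; Vec emb; };
struct Msg  { int dst; Vec val; };

// One GNN layer in the explicit message-passing form.
void layer(const std::vector<Edge>& edges,
           std::vector<Vec>& x,          // node embeddings, updated in place
           std::vector<Msg>& buf)        // message buffer (in: prev layer, out: next layer)
{
    // 1) MP gather: m_i = A({m_{j->i}}); here A = sum.
    std::vector<Vec> m(x.size(), Vec{});
    for (const Msg& msg : buf)
        for (std::size_t k = 0; k < D; ++k)
            m[msg.dst][k] += msg.val[k];

    // 2) Node transformation: x_i = gamma(m_i, x_i); toy gamma = ReLU(m + x),
    //    standing in for a fully-connected layer or MLP.
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t k = 0; k < D; ++k)
            x[i][k] = std::max(0.0f, m[i][k] + x[i][k]);

    // 3) MP scatter: emit m_{i->j} = phi(x_i, e_ij) into the buffer for the
    //    next layer's gather; toy phi = x + e.
    buf.clear();
    for (const Edge& e : edges) {
        Msg msg{e.dst, {}};
        for (std::size_t k = 0; k < D; ++k)
            msg.val[k] = x[e.src][k] + e.emb[k];
        buf.push_back(msg);
    }
}
```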

Baseline Dataflow Architecture


The baseline uses ping-pong buffers: while one buffer is read by the current layer's gather stage, the other is written by the scatter stage, and the two swap roles at each layer boundary.

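A plain-C++ sketch of the ping-pong idea (the real design uses on-chip buffers; the `float` payload and `std::vector` storage here are placeholders):

```cpp
#include <array>
#include <vector>

// Two message buffers that swap roles every layer: while the current layer's
// gather stage reads buf[read_sel], the scatter stage writes buf[1 - read_sel].
struct PingPong {
    std::array<std::vector<float>, 2> buf;
    int read_sel = 0;

    std::vector<float>&       write_buf() { return buf[1 - read_sel]; }
    const std::vector<float>& read_buf()  { return buf[read_sel]; }

    // Called at a layer boundary: the writer's output becomes the next
    // reader's input, so gather and scatter never touch the same buffer.
    void swap_roles() { read_sel = 1 - read_sel; }
};
```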

Proposed FlowGNN Architecture

image-20231022144445441