
Graph-Informed Transformers for Neural Network Inference Latency Prediction

Asad Khan

CUCAI 2025 Proceedings

Published 2025/03/26

Abstract

Deep learning applications such as real-time object detection in autonomous vehicles, interactive voice assistants, and high-frequency trading systems often require strict adherence to inference latency constraints defined by service-level objectives. Ensuring that neural network inference times meet these constraints before deployment presents a significant challenge to developers. In this paper, we introduce a transformer-based approach to predicting neural network inference latency at the pre-deployment stage. Our method utilizes a diverse synthetic dataset of feedforward neural networks, characterized at the operation level. These networks are represented as graphs, where node attributes encode the operation type and weight count, and edges define the topology of the network. By treating each node as a token, the transformer leverages multi-head attention to capture structural and attribute relationships that strongly correlate with inference latency. Experimental results demonstrate that the proposed transformer model substantially outperforms a linear regression baseline in predicting neural network inference latency across a wide variety of architectures and configurations. Ultimately, our transformer-based solution facilitates the development of latency-sensitive deep learning systems by enabling more reliable and efficient architectural optimization prior to deployment.
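The abstract does not specify the model implementation, so the following is a minimal PyTorch sketch of the core idea: each operation node becomes one token whose embedding combines an operation-type embedding with its weight count, and a transformer encoder regresses a single latency value. All names (LatencyTransformer, op_embed, weight_proj), hyperparameters, and the use of an optional attention mask to carry edge topology are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class LatencyTransformer(nn.Module):
    """Hypothetical sketch: one token per network operation (node)."""

    def __init__(self, num_op_types=8, d_model=64, nhead=4, num_layers=3):
        super().__init__()
        # Token embedding combines an operation-type embedding with a
        # projection of the node's (log-scaled) scalar weight count.
        self.op_embed = nn.Embedding(num_op_types, d_model)
        self.weight_proj = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 1)  # regression head -> latency

    def forward(self, op_ids, weight_counts, attn_mask=None):
        # op_ids: (batch, nodes) int64; weight_counts: (batch, nodes) float32.
        # attn_mask could restrict attention to graph edges; how topology
        # enters the model is an assumption, not stated in the abstract.
        x = self.op_embed(op_ids) \
            + self.weight_proj(torch.log1p(weight_counts).unsqueeze(-1))
        x = self.encoder(x, mask=attn_mask)
        return self.head(x.mean(dim=1)).squeeze(-1)  # mean-pool tokens

# Toy usage: two random 5-node graphs with per-node weight counts.
model = LatencyTransformer()
op_ids = torch.randint(0, 8, (2, 5))
weights = torch.rand(2, 5) * 1e4
print(model(op_ids, weights).shape)  # torch.Size([2])
```

Mean-pooling the tokens before the regression head is one simple design choice; a learned summary token or masked pooling over padded nodes would work equally well under the same token-per-node framing.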