ONNX: Definition & Meaning — AI Wiki

An open format for representing machine learning models that enables interoperability between frameworks. A model trained in PyTorch can be exported to ONNX and then run using ONNX Runtime, TensorRT, or other inference engines optimized for specific hardware. ONNX acts as a common language between the training world (PyTorch, TensorFlow) and the deployment world (optimized runtimes).

Why it matters

ONNX solves a real production problem: you train in PyTorch (the research standard) but deploy on hardware that runs better with a different runtime. Converting to ONNX lets you use optimized inference engines without rewriting your model. It's especially important for edge deployment where you need maximum performance on limited hardware.

Deep Dive

ONNX defines a computation graph format: nodes represent operations (matrix multiply, convolution, attention), edges represent tensors flowing between operations. The graph includes all the information needed to run the model: architecture, weights, input/output shapes, and operator definitions. ONNX Runtime (Microsoft) is the most popular runtime, supporting CPU, GPU, and specialized accelerators.

When to Use ONNX

ONNX is most useful when: (1) you need to deploy on non-NVIDIA hardware (Intel, AMD, ARM, mobile) where PyTorch CUDA isn't available, (2) you need maximum inference speed and ONNX Runtime's optimizations outperform PyTorch, or (3) you're integrating a model into a non-Python application (ONNX Runtime has C++, C#, Java, and JavaScript bindings). For standard GPU inference with large LLMs, specialized serving frameworks (vLLM, TGI) typically outperform ONNX.

Limitations

Not all PyTorch operations convert cleanly to ONNX, especially custom operators and dynamic architectures. Complex models may require manual intervention to export correctly. ONNX also lags behind cutting-edge architectures — new model types may not be supported until ONNX operators are added. For LLM inference specifically, the GGUF/llama.cpp ecosystem and TensorRT-LLM have become more popular than ONNX for most use cases.

ONNX

Why it matters

Deep Dive

When to Use ONNX

Limitations

Related Concepts