
ONNX

Open Neural Network Exchange
An open format for representing machine learning models, enabling interoperability between frameworks. A model trained in PyTorch can be exported to ONNX and then run with ONNX Runtime, TensorRT, or other inference engines optimized for specific hardware. ONNX serves as a common language between the training world (PyTorch, TensorFlow) and the deployment world (optimized runtimes).

Why It Matters

ONNX solves a real production problem: you train in PyTorch (the research standard) but deploy on hardware that runs better under a different runtime. Converting to ONNX lets you use an optimized inference engine without rewriting your model. This matters especially for edge deployment, where you need maximum performance on constrained hardware.

Deep Dive

ONNX defines a computation graph format: nodes represent operations (matrix multiply, convolution, attention), edges represent tensors flowing between operations. The graph includes all the information needed to run the model: architecture, weights, input/output shapes, and operator definitions. ONNX Runtime (Microsoft) is the most popular runtime, supporting CPU, GPU, and specialized accelerators.

When to Use ONNX

ONNX is most useful when: (1) you need to deploy on non-NVIDIA hardware (Intel, AMD, ARM, mobile) where PyTorch CUDA isn't available, (2) you need maximum inference speed and ONNX Runtime's optimizations outperform PyTorch, or (3) you're integrating a model into a non-Python application (ONNX Runtime has C++, C#, Java, and JavaScript bindings). For standard GPU inference with large LLMs, specialized serving frameworks (vLLM, TGI) typically outperform ONNX.

Limitations

Not all PyTorch operations convert cleanly to ONNX, especially custom operators and dynamic architectures. Complex models may require manual intervention to export correctly. ONNX also lags behind cutting-edge architectures — new model types may not be supported until ONNX operators are added. For LLM inference specifically, the GGUF/llama.cpp ecosystem and TensorRT-LLM have become more popular than ONNX for most use cases.
