Infrastructure

ONNX

Open Neural Network Exchange
An open format for representing machine learning models that enables interoperability between frameworks. A model trained in PyTorch can be exported to ONNX and then run with ONNX Runtime, TensorRT, or another inference engine optimized for specific hardware. ONNX acts as the common language between the training world (PyTorch, TensorFlow) and the deployment world (optimized runtimes).
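
As a concrete illustration, here is a minimal sketch of exporting a PyTorch model to ONNX with torch.onnx.export. The TinyClassifier module and the file name tiny_classifier.onnx are placeholders for whatever model you actually trained, not part of any real project.

```python
import torch
import torch.nn as nn

# Hypothetical model standing in for whatever was trained: 128 features in, 10 classes out.
class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()
dummy_input = torch.randn(1, 128)  # example input used to trace the graph

torch.onnx.export(
    model,
    dummy_input,
    "tiny_classifier.onnx",          # illustrative output path
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},  # allow variable batch size
)
```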

Why It Matters

ONNX solves a real production problem: you train in PyTorch (the research standard) but deploy on hardware that runs better with a different runtime. Converting to ONNX lets you use an optimized inference engine without rewriting your model. This matters most for edge deployment, where you need maximum performance from limited hardware.

Deep Dive

ONNX defines a computation graph format: nodes represent operations (matrix multiply, convolution, attention), edges represent tensors flowing between operations. The graph includes all the information needed to run the model: architecture, weights, input/output shapes, and operator definitions. ONNX Runtime (Microsoft) is the most popular runtime, supporting CPU, GPU, and specialized accelerators.
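
To make the graph structure concrete, the sketch below loads an exported file with the onnx Python package and walks its nodes and edges. The file name reuses the hypothetical export from the sketch above.

```python
import onnx

model = onnx.load("tiny_classifier.onnx")
onnx.checker.check_model(model)  # validate the graph against the ONNX operator spec

# Nodes are operations; their inputs and outputs name the tensors (edges) connecting them.
for node in model.graph.node:
    print(node.op_type, list(node.input), "->", list(node.output))

# Graph-level inputs and outputs carry the declared names, dtypes, and shapes.
for value in list(model.graph.input) + list(model.graph.output):
    print(value.name, value.type.tensor_type.shape)
```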

When to Use ONNX

ONNX is most useful when: (1) you need to deploy on non-NVIDIA hardware (Intel, AMD, ARM, mobile) where PyTorch CUDA isn't available, (2) you need maximum inference speed and ONNX Runtime's optimizations outperform PyTorch, or (3) you're integrating a model into a non-Python application (ONNX Runtime has C++, C#, Java, and JavaScript bindings). For standard GPU inference with large LLMs, specialized serving frameworks (vLLM, TGI) typically outperform ONNX.
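
Here is a minimal ONNX Runtime inference sketch, again assuming the hypothetical tiny_classifier.onnx from the export example. Which execution providers are available depends on how onnxruntime was installed, so the provider selection below is illustrative.

```python
import numpy as np
import onnxruntime as ort

# Prefer the GPU provider if this onnxruntime build exposes it, otherwise run on CPU.
providers = ["CPUExecutionProvider"]
if "CUDAExecutionProvider" in ort.get_available_providers():
    providers = ["CUDAExecutionProvider"] + providers

session = ort.InferenceSession("tiny_classifier.onnx", providers=providers)

x = np.random.randn(4, 128).astype(np.float32)  # batch of 4, matching the exported input shape
outputs = session.run(None, {"input": x})       # None = return all declared outputs
print(outputs[0].shape)                         # (4, 10) logits for the sketch model above
```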

Limitations

Not all PyTorch operations convert cleanly to ONNX, especially custom operators and dynamic architectures. Complex models may require manual intervention to export correctly. ONNX also lags behind cutting-edge architectures — new model types may not be supported until ONNX operators are added. For LLM inference specifically, the GGUF/llama.cpp ecosystem and TensorRT-LLM have become more popular than ONNX for most use cases.
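
One common way to catch conversion problems is to compare PyTorch and ONNX Runtime outputs on the same input after exporting. The sketch below uses a throwaway model and file name (parity_check.onnx) purely for illustration.

```python
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# Throwaway model and file name used only for this parity check.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
torch.onnx.export(
    model,
    torch.randn(1, 128),
    "parity_check.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
)

x = torch.randn(2, 128)
with torch.no_grad():
    torch_out = model(x).numpy()  # reference output from the original PyTorch model

session = ort.InferenceSession("parity_check.onnx", providers=["CPUExecutionProvider"])
onnx_out = session.run(None, {"input": x.numpy()})[0]

# Small numerical differences are expected; large ones usually point to an
# operator that did not convert cleanly.
np.testing.assert_allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)
print("PyTorch and ONNX Runtime outputs match within tolerance")
```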
