
ONNX

Open Neural Network Exchange
An open format for representing machine learning models that enables interoperability between frameworks. A model trained in PyTorch can be exported to ONNX and then run using ONNX Runtime, TensorRT, or other inference engines optimized for specific hardware. ONNX acts as a common language between the training world (PyTorch, TensorFlow) and the deployment world (optimized runtimes).

Why It Matters

ONNX solves a real production problem: you train in PyTorch (the research standard) but deploy on hardware that runs better with a different runtime. Converting to ONNX lets you use optimized inference engines without rewriting your model. It is especially important for edge deployment, where you need maximum performance on limited hardware.

Deep Dive

ONNX defines a computation graph format: nodes represent operations (matrix multiply, convolution, attention), edges represent tensors flowing between operations. The graph includes all the information needed to run the model: architecture, weights, input/output shapes, and operator definitions. ONNX Runtime (Microsoft) is the most popular runtime, supporting CPU, GPU, and specialized accelerators.

When to Use ONNX

ONNX is most useful when: (1) you need to deploy on non-NVIDIA hardware (Intel, AMD, ARM, mobile) where PyTorch CUDA isn't available, (2) you need maximum inference speed and ONNX Runtime's optimizations outperform PyTorch, or (3) you're integrating a model into a non-Python application (ONNX Runtime has C++, C#, Java, and JavaScript bindings). For standard GPU inference with large LLMs, specialized serving frameworks (vLLM, TGI) typically outperform ONNX.

Limitations

Not all PyTorch operations convert cleanly to ONNX, especially custom operators and dynamic architectures. Complex models may require manual intervention to export correctly. ONNX also lags behind cutting-edge architectures — new model types may not be supported until ONNX operators are added. For LLM inference specifically, the GGUF/llama.cpp ecosystem and TensorRT-LLM have become more popular than ONNX for most use cases.
