Infrastructure

ONNX

Open Neural Network Exchange
An open format for representing machine learning models that enables interoperability between frameworks. A model trained in PyTorch can be exported to ONNX and then run with ONNX Runtime, TensorRT, or other inference engines optimized for specific hardware. ONNX acts as a common language between the training world (PyTorch, TensorFlow) and the deployment world (optimized runtimes).
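
A minimal sketch of that round trip, assuming torch and onnxruntime are installed; the toy model, the "model.onnx" filename, and the tensor names are illustrative, not part of any fixed API:

# Export a small PyTorch model to ONNX, then run it with ONNX Runtime.
import torch
import torch.nn as nn
import onnxruntime as ort

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()

dummy_input = torch.randn(1, 4)  # example input used to trace the graph
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
)

# Inference no longer needs PyTorch: ONNX Runtime loads the graph directly.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
result = session.run(["output"], {"input": dummy_input.numpy()})
print(result[0].shape)  # (1, 2)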

Why It Matters

ONNX solves a real production problem: you train in PyTorch (the research standard) but deploy on hardware that runs better with a different runtime. Converting to ONNX lets you use optimized inference engines without rewriting your model. It is especially important for edge deployment, where you need maximum performance on constrained hardware.

Deep Dive

ONNX defines a computation graph format: nodes represent operations (matrix multiply, convolution, attention), edges represent tensors flowing between operations. The graph includes all the information needed to run the model: architecture, weights, input/output shapes, and operator definitions. ONNX Runtime (Microsoft) is the most popular runtime, supporting CPU, GPU, and specialized accelerators.
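
A short sketch of what that graph looks like from Python, reusing the hypothetical "model.onnx" file from the export sketch above (requires the onnx package):

import onnx

model = onnx.load("model.onnx")
onnx.checker.check_model(model)  # validates graph structure and operator use

# Nodes are operations; each one names its input and output tensors (the edges).
for node in model.graph.node:
    print(node.op_type, list(node.input), "->", list(node.output))

# Weights travel with the graph as initializers.
for init in model.graph.initializer:
    print("initializer:", init.name, list(init.dims))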

When to Use ONNX

ONNX is most useful when: (1) you need to deploy on non-NVIDIA hardware (Intel, AMD, ARM, mobile) where PyTorch CUDA isn't available, (2) you need maximum inference speed and ONNX Runtime's optimizations outperform PyTorch, or (3) you're integrating a model into a non-Python application (ONNX Runtime has C++, C#, Java, and JavaScript bindings). For standard GPU inference with large LLMs, specialized serving frameworks (vLLM, TGI) typically outperform ONNX.
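
A sketch of how ONNX Runtime targets different hardware through execution providers; the provider names below are real ONNX Runtime identifiers, but which ones are available depends on the installed package (e.g. onnxruntime-gpu for CUDA) and the machine:

import onnxruntime as ort

print(ort.get_available_providers())  # e.g. ['CPUExecutionProvider'] on a CPU-only build

# Request CUDA when the installed build supports it, otherwise fall back to CPU;
# ONNX Runtime prioritizes providers left to right.
preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
providers = [p for p in preferred if p in ort.get_available_providers()]
session = ort.InferenceSession("model.onnx", providers=providers)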

Limitations

Not all PyTorch operations convert cleanly to ONNX, especially custom operators and dynamic architectures. Complex models may require manual intervention to export correctly. ONNX also lags behind cutting-edge architectures — new model types may not be supported until ONNX operators are added. For LLM inference specifically, the GGUF/llama.cpp ecosystem and TensorRT-LLM have become more popular than ONNX for most use cases.
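
A sketch of two common export mitigations: dynamic_axes, so the exported graph accepts variable batch sizes instead of hard-coding the dummy input's shape, and opset_version pinning, since unsupported operators often trace back to opset mismatches. The toy model, filename, and opset choice are illustrative:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
dummy_input = torch.randn(1, 4)

torch.onnx.export(
    model,
    dummy_input,
    "model_dynamic.onnx",
    input_names=["input"],
    output_names=["output"],
    # Mark dimension 0 as variable so any batch size is accepted at runtime.
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    # Pin the operator set the exporter is allowed to use.
    opset_version=17,
)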
