RightNow AI released AutoKernel, an open-source framework that uses LLM agents to automatically optimize GPU kernels for PyTorch models. The system runs an autonomous loop: an agent modifies kernel code, checks correctness, benchmarks performance, then keeps improvements or reverts regressions using git commits. Each iteration takes roughly 90 seconds, yielding 300-400 optimization attempts in a 10-hour overnight run. The approach directly addresses findings from KernelBench, where even frontier LLMs matched PyTorch baseline performance in fewer than 20% of GPU kernel problems.
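The keep-or-revert loop is simple enough to sketch in a few lines of Python. Everything below is illustrative, not AutoKernel's actual API: the function names, the callback structure, and the iteration count are assumptions made for the sketch.

```python
import subprocess

def git(*args):
    """Illustrative helper: run a git command in the kernel repo."""
    subprocess.run(["git", *args], check=True)

def optimize_loop(propose_patch, is_correct, benchmark,
                  commit=lambda msg: git("commit", "-am", msg),
                  revert=lambda: git("checkout", "--", "."),
                  iterations=400):
    """Keep-or-revert loop over agent-proposed kernel edits.

    propose_patch(): the agent rewrites the kernel source (stand-in).
    is_correct():    kernel output matches a reference implementation.
    benchmark():     runtime in milliseconds (lower is better).
    """
    best_ms = benchmark()  # establish the baseline before editing
    for i in range(iterations):
        propose_patch()
        if is_correct():
            ms = benchmark()
            if ms < best_ms:  # correct and faster: keep the change
                best_ms = ms
                commit(f"iter {i}: {ms:.3f} ms")
                continue
        revert()  # incorrect or slower: roll back to the last commit
    return best_ms
```

Using git as the undo mechanism means every accepted state is a commit, so the whole optimization history is inspectable afterward with ordinary `git log`.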

This tackles one of ML engineering's most specialized bottlenecks. Writing high-performance CUDA or Triton kernels requires simultaneous expertise in memory coalescing, register pressure, tensor cores, and dozens of other interdependent parameters—skills that take years to develop and transfer poorly as GPU architectures evolve. A single optimized matmul kernel can involve 200+ lines of code. AutoKernel essentially mechanizes the expert workflow: write, test, keep or discard, repeat.

What's notable is the engineering approach rather than the underlying capability. Using git for experiment tracking and plain TSV files for results keeps the system dependency-free and inspectable. The 90-second iteration time—split between correctness checking, performance benchmarking via Triton's do_bench, and agent reasoning—suggests this could actually be practical for real workloads, not just research demos.
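The benchmarking-plus-TSV side of that pipeline can be approximated in plain Python. The timer below is a crude CPU stand-in for Triton's `do_bench` (which measures GPU kernels and reports timings in milliseconds); the file layout and column names are assumptions for illustration.

```python
import csv
import time

def bench_ms(fn, warmup=3, rep=10):
    """Crude stand-in for triton.testing.do_bench: median wall time in ms."""
    for _ in range(warmup):
        fn()  # warm caches / JIT before timing
    samples = []
    for _ in range(rep):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return sorted(samples)[len(samples) // 2]

def log_result(path, iteration, ms, correct):
    """Append one optimization attempt to a plain TSV results file."""
    with open(path, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow(
            [iteration, f"{ms:.4f}", int(correct)])
```

Plain TSV means results can be inspected or plotted with nothing more than `cut`, `sort`, or a spreadsheet, which is the dependency-free inspectability the design aims for.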

For developers, this represents a potential shift from needing specialized CUDA engineers to simply having compute budget for overnight optimization runs. The real test will be whether AutoKernel's optimizations actually beat hand-tuned kernels from experienced engineers, and whether the approach generalizes beyond the specific kernels they've tested. But automating even basic kernel optimization could democratize performance tuning for smaller teams.