The AI developer tooling landscape is vast and changes fast, so it helps to break it into layers. At the bottom you have inference engines — the software that actually runs models. vLLM, llama.cpp, TensorRT-LLM, and Ollama handle loading model weights onto GPUs (or CPUs), managing memory, batching requests, and returning outputs. If you are self-hosting models, picking the right inference engine for your hardware is one of the highest-leverage decisions you will make. vLLM dominates for multi-GPU server deployments with its PagedAttention memory management. llama.cpp is the go-to for running quantized models on consumer hardware, including laptops and even phones. The choice depends on your scale, your hardware, and whether you need features like speculative decoding or continuous batching.
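One practical consequence of this layer is that most self-hosted engines — vLLM and llama.cpp's `llama-server` among them — expose an OpenAI-compatible HTTP API, so a thin client works against any of them. A minimal sketch using only the standard library (the base URL, port, and model name are assumptions; vLLM defaults to port 8000 and `llama-server` to 8080):

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, temperature: float = 0.2) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def chat(base_url: str, model: str, prompt: str) -> str:
    """POST the request to a local inference server and return the reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]


# Example (assumes a vLLM server on the default port):
# reply = chat("http://localhost:8000", "my-model", "Hello")
```

Because the wire format is shared, swapping vLLM for llama.cpp later is a one-line change to the base URL rather than a rewrite.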
One layer up you have orchestration frameworks — LangChain, LlamaIndex, Haystack, and the Vercel AI SDK. These handle the plumbing between your application and the model: prompt templating, tool calling, retrieval-augmented generation, conversation memory, and output parsing. The truth about these frameworks is that they are most useful when your use case matches their built-in patterns and most frustrating when it does not. LangChain, for example, makes it trivial to build a RAG chatbot but can feel like fighting the framework if you need non-standard control flow. Many experienced developers end up using these frameworks to prototype, then rewriting the critical path in plain code once they understand exactly what they need. That is not a failure of the tools — it is a reasonable workflow. Prototyping speed and production control serve different goals.
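To make "rewriting the critical path in plain code" concrete: two of the plumbing jobs listed above — prompt templating and output parsing — need nothing beyond the standard library once you know exactly what your application does. A hedged sketch (the prompt wording and JSON response shape are illustrative assumptions, not any framework's convention):

```python
import json
import re
from string import Template

# Prompt templating in plain code: just string substitution, no framework.
SUMMARY_PROMPT = Template(
    "Summarize the following text in one sentence.\n"
    'Respond as JSON: {"summary": "..."}\n\n'
    "Text:\n$text"
)


def render_prompt(text: str) -> str:
    return SUMMARY_PROMPT.substitute(text=text)


def parse_summary(model_output: str) -> str:
    """Extract the JSON object from a reply that may include surrounding prose."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))["summary"]
```

Twenty lines like these are easier to debug than a chain abstraction when something goes wrong in production, which is exactly the trade-off the prototype-then-rewrite workflow exploits.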
Fine-tuning tools form their own ecosystem. Axolotl and Unsloth make it possible to fine-tune open-weights models on a single consumer GPU by using techniques like LoRA and QLoRA, which train a small number of adapter parameters instead of the full model. Hugging Face's transformers library and its Trainer API remain the foundation that most fine-tuning tools build on. On the managed side, providers like OpenAI, Google, and Together offer fine-tuning APIs where you upload your data and get back a custom model without managing any infrastructure. The decision between self-hosted fine-tuning and managed fine-tuning usually comes down to data sensitivity and iteration speed. If your training data cannot leave your network, you self-host. If you want to experiment fast and the data is not sensitive, managed APIs involve far less operational overhead.
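The arithmetic behind LoRA's savings is worth seeing once. For a weight matrix of shape (d_out × d_in), a rank-r adapter trains two small matrices, B (d_out × r) and A (r × d_in), instead of the full matrix — so the trainable count drops from d_out·d_in to r·(d_in + d_out). A quick sketch (the 4096 hidden size and rank 8 are illustrative values, not a recommendation):

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA adapter: B is (d_out x r), A is (r x d_in)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters for full fine-tuning of the same weight matrix."""
    return d_in * d_out


if __name__ == "__main__":
    d = 4096  # hidden size in the ballpark of a 7B-class model (illustrative)
    r = 8     # a commonly used LoRA rank
    lora = lora_trainable_params(d, d, r)
    full = full_params(d, d)
    print(f"LoRA: {lora:,} params vs full: {full:,} ({100 * lora / full:.2f}%)")
```

At these illustrative sizes the adapter is well under one percent of the matrix's parameters, which is why a single consumer GPU can handle the job.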
The biggest risk with AI developer tools is adopting too many of them. Every framework, library, and platform adds a dependency, an abstraction layer, and a point of failure. Teams that try to use LangChain for orchestration, Pinecone for vectors, Weights & Biases for experiment tracking, Braintrust for evaluation, and Vercel for deployment end up spending more time integrating tools than building their product. The pragmatic approach is to start with the minimum viable stack: a model API (or a local inference engine), a simple prompt, and your existing application framework. Add tools only when you hit a specific pain point — retrieval quality is poor, so you add a vector database; evaluation is ad hoc, so you add an evaluation framework; latency is too high, so you add caching. Every tool should solve a problem you have already felt, not a problem you think you might have someday.
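The caching pain point is a good illustration of how small these incremental fixes can be: before reaching for a dedicated caching product, a few lines of in-process code often suffice. A minimal sketch, assuming deterministic (temperature-zero) calls so a repeated prompt can safely return the stored answer (the class and method names are hypothetical):

```python
import hashlib
import json


class PromptCache:
    """Minimal in-process cache for deterministic model calls, keyed on model + prompt."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def _key(self, model: str, prompt: str) -> str:
        raw = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, call) -> str:
        """Return the cached answer, or invoke `call(model, prompt)` and store it."""
        key = self._key(model, prompt)
        if key not in self._store:
            self._store[key] = call(model, prompt)
        return self._store[key]
```

Only when a dict like this stops being enough — multiple processes, eviction policies, semantic matching — does adopting a real caching layer solve a problem you have actually felt.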