The simplest form of AI memory is the context window itself — the model "remembers" everything you've said in the current conversation because it's all right there in the input. Early models had 4K-token context windows (about 3,000 words), which meant conversations would "forget" earlier messages once they scrolled past that limit. Today's models have dramatically expanded this: Claude supports up to 200K tokens, Gemini 1.5 handles 1 million tokens, and some models push even further. But context window size and usable memory aren't the same thing. Research consistently shows that models struggle with information buried in the middle of very long contexts (the "lost in the middle" problem), and stuffing the context window full is expensive — you pay for every token on every API call, so a 100K-token conversation history costs real money to maintain.
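The scrolling-out-of-context behavior described above can be sketched as a simple sliding-window trim: keep only the most recent messages that fit a token budget. This is a toy illustration, not any vendor's actual implementation, and the four-characters-per-token estimate is a rough heuristic standing in for a real tokenizer.

```python
# Sliding-window context trimming: drop the oldest messages until the
# remaining conversation fits within a token budget.

def estimate_tokens(text: str) -> int:
    """Rough token estimate (~4 characters per token for English).
    A real system would use the model's actual tokenizer."""
    return max(1, len(text) // 4)

def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the newest messages whose combined size fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):        # walk newest-first
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break                          # everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))            # restore chronological order

history = ["first message " * 5, "second message " * 5, "third message " * 5]
print(trim_history(history, budget=40))
```

With a small budget, the earliest message silently disappears, which is exactly the "forgetting" users experienced with 4K-token windows.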
The distinction between short-term and long-term memory in AI mirrors the same distinction in human cognition, but the implementations are quite different. Short-term memory (also called working memory) is what the model holds during a single session — the context window, plus any scratchpad or intermediate state it maintains during a multi-step task. Long-term memory is information that persists across sessions: your name, your preferences, past projects you've discussed, decisions you've made. Most consumer AI products now offer some form of long-term memory. ChatGPT's "Memory" feature extracts key facts from conversations and stores them as text snippets that get injected into future conversations. Claude's memory works similarly, with users able to save project-level context. These systems typically use a summarization step — an AI model reads the conversation and extracts the important bits — rather than storing raw transcripts, which would quickly overwhelm the context window.
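The extract-and-inject pattern can be sketched in a few lines. Note that `extract_facts` below is a toy keyword heuristic standing in for the LLM summarization call a real product would make, and the prompt format is illustrative, not taken from any specific system.

```python
# Sketch of a long-term memory pipeline: extract key facts from a
# finished conversation, store them, and inject them into the next
# session's system prompt.

def extract_facts(transcript: list[str]) -> list[str]:
    """Toy stand-in for an LLM summarization step: keep lines where
    the user states something about themselves."""
    return [line for line in transcript if line.lower().startswith("i ")]

class MemoryStore:
    def __init__(self) -> None:
        self.facts: list[str] = []

    def update(self, transcript: list[str]) -> None:
        for fact in extract_facts(transcript):
            if fact not in self.facts:     # skip duplicates across sessions
                self.facts.append(fact)

    def build_system_prompt(self) -> str:
        """Inject stored snippets into the next conversation's prompt."""
        if not self.facts:
            return "You are a helpful assistant."
        bullets = "\n".join(f"- {f}" for f in self.facts)
        return f"You are a helpful assistant.\nKnown about the user:\n{bullets}"

store = MemoryStore()
store.update(["I work mostly in Python.", "What is a decorator?"])
print(store.build_system_prompt())
```

The key design point survives the simplification: only distilled snippets get carried forward, not the raw transcript.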
For applications that need to remember large volumes of information — an entire codebase, a company's documentation, years of customer interactions — retrieval-augmented generation (RAG) serves as a form of external memory. Instead of cramming everything into the context window, you store documents as vector embeddings in a database and retrieve only the relevant pieces when needed. This is how most enterprise AI assistants work: when you ask a question, the system searches its knowledge base, pulls the top-k relevant chunks, and feeds them to the model alongside your query. The model doesn't "remember" the full knowledge base, but it has on-demand access to it, which is functionally similar. The tradeoff is latency and relevance — vector search adds 100–500ms per query, and the quality of the response depends entirely on whether the retrieval step found the right documents.
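The retrieval step can be sketched end to end with a toy embedding. The bag-of-words vectors and cosine scoring below stand in for a real embedding model and vector database; the document strings are invented for illustration.

```python
# Minimal RAG-style retrieval sketch: "embed" documents, score them
# against a query, and return the top-k chunks to feed to the model.

import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a sparse bag-of-words vector. Real systems use
    dense vectors from a trained embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Refund policy: customers may return items within 30 days.",
    "Shipping usually takes 3 to 5 business days.",
    "Our office is closed on public holidays.",
]
print(retrieve("how do I return an item for a refund", docs, k=1))
```

The failure mode described above is visible here too: if the scoring step ranks the wrong chunk first, the model never sees the right document, no matter how capable it is.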
Memory introduces challenges that don't exist in stateless AI systems. Staleness is the most obvious: if you told Claude six months ago that you're working on a Python project, but you've since switched to Rust, that outdated memory becomes misleading. Most memory systems don't have a good mechanism for expiring or updating stored facts — they accumulate information but rarely prune it. Privacy is another minefield: if an AI remembers that you mentioned a health condition, a financial situation, or a confidential business strategy, that information now lives in a system you don't fully control. Who can access it? Can it be deleted? Does it get used to train future models? These questions are why some enterprise deployments explicitly disable memory features. Then there's the coherence problem: when a model draws on memories from many different conversations, it can produce responses that are technically informed by your history but contextually confused — mixing up details from different projects or applying outdated preferences to new situations.
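One possible mitigation for staleness is to key facts by topic with a timestamp, so a newer fact supersedes an older one and anything past a maximum age is dropped at recall time. This is a sketch of one design, not how any shipping product handles it, and the topic keys are an assumption — a real system would have to infer them.

```python
# Sketch of an expiring, self-superseding memory store: writing the
# same topic again overwrites the stale value, and recall() filters
# out facts older than max_age_seconds.

import time

class ExpiringMemory:
    def __init__(self, max_age_seconds: float) -> None:
        self.max_age = max_age_seconds
        self._facts: dict[str, tuple[str, float]] = {}  # topic -> (fact, stored_at)

    def remember(self, topic: str, fact: str) -> None:
        # A new fact for an existing topic replaces the old one.
        self._facts[topic] = (fact, time.time())

    def recall(self) -> dict[str, str]:
        """Return only facts that haven't aged out."""
        now = time.time()
        return {topic: fact
                for topic, (fact, stored_at) in self._facts.items()
                if now - stored_at <= self.max_age}

mem = ExpiringMemory(max_age_seconds=3600)
mem.remember("project_language", "User is working on a Python project.")
mem.remember("project_language", "User has switched to Rust.")  # supersedes
print(mem.recall()["project_language"])
```

Deletion for privacy falls out of the same structure: removing a topic key removes the fact, which is harder to guarantee when memories live as free-text snippets inside a prompt.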
The frontier of AI memory research is moving toward systems that don't just store and retrieve facts but actively organize and update their understanding over time. Google's Infini-attention and similar techniques aim to give transformer models a form of compressed long-term memory within the architecture itself, rather than relying on external databases. Agent memory systems — used by frameworks like AutoGPT and Claude's tool-use agents — maintain structured state across multi-step tasks, tracking what they've done, what they've learned, and what still needs to happen. And personalization is becoming more sophisticated: instead of storing flat facts ("user prefers Python"), future memory systems will build richer user models that capture communication style, expertise level, decision-making patterns, and project context. The goal is an AI that doesn't just remember what you said — it understands who you are and how to work with you, conversation after conversation.
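The structured state that agent frameworks maintain can be sketched as a small record of completed steps, learned facts, and pending work. The field names here are illustrative, not taken from AutoGPT or any other framework.

```python
# Sketch of structured agent memory for a multi-step task: the agent
# tracks what it has done, what it has learned, and what remains,
# rather than re-reading a raw transcript.

from dataclasses import dataclass, field

@dataclass
class AgentState:
    done: list[str] = field(default_factory=list)
    learned: dict[str, str] = field(default_factory=dict)
    todo: list[str] = field(default_factory=list)

    def complete(self, step: str, **facts: str) -> None:
        """Mark a step finished and record anything learned from it."""
        if step in self.todo:
            self.todo.remove(step)
        self.done.append(step)
        self.learned.update(facts)

state = AgentState(todo=["fetch page", "extract price", "write report"])
state.complete("fetch page", page_status="200 OK")
state.complete("extract price", price="$19.99")
print(state.todo)   # only the remaining work
```

Keeping state explicit like this is what lets an agent resume mid-task or report progress without replaying the entire history through the context window.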