Zubnet AI Learning Wiki › Unsupervised Learning
Training

Unsupervised Learning

A training approach in which a model finds patterns in data without being told what to look for. There are no labels and no correct answers, only raw data and a model that discovers structure on its own. Clustering, dimensionality reduction, and anomaly detection are the classic unsupervised tasks: the model groups similar data points, finds compressed representations, or flags outliers.

Why It Matters

Most real-world data is unlabeled: you might have millions of transactions, but nobody has tagged each one as "fraud" or "not fraud." Unsupervised learning can surface patterns in that raw data that manual inspection would never find. It is also the foundation of embeddings, which power semantic search, recommendation systems, and RAG.

Deep Dive

Unsupervised learning encompasses a family of techniques. Clustering algorithms like K-means group similar data points together. Autoencoders learn compressed representations by encoding data to a small bottleneck and then reconstructing it. Dimensionality reduction (PCA, t-SNE, UMAP) projects high-dimensional data into 2D or 3D for visualization. What unites them is the absence of labels — the model defines its own notion of "similar" or "important" based on the data's statistical structure.
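To make the clustering case concrete, here is a minimal pure-Python sketch of K-means (the data points and cluster count are invented for the example; production code would typically use a library such as scikit-learn):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster.
    No labels anywhere -- 'similar' is defined purely by distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for each point
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        # Update step: recompute each centroid as its cluster's mean
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids

# Toy data: two obvious groups, near (0, 0) and near (10, 10)
pts = [(0, 0), (0.5, 0.2), (0.1, 0.4),
       (10, 10), (10.2, 9.8), (9.9, 10.1)]
labels, cents = kmeans(pts, k=2)
```

Run on this toy data, the first three points end up in one cluster and the last three in the other, with the two centroids landing near the group means.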

Where LLMs Fit

LLM pre-training is often called "self-supervised" rather than truly unsupervised, because the training signal comes from the data itself (predict the next token). But the spirit is unsupervised — no human annotator labels each token. The model discovers language structure, factual knowledge, reasoning patterns, and even some world knowledge purely from the statistical patterns in text. This is why pre-training requires such massive datasets: without labels to guide it, the model needs enormous amounts of data to discover meaningful patterns on its own.
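The self-supervised signal described above can be sketched as a data-preparation step: the (context, next-token) training pairs come entirely from the raw text itself, with no annotator involved. The `next_token_pairs` helper below is a hypothetical toy, not a real pre-training pipeline:

```python
def next_token_pairs(tokens, context=3):
    """Turn a raw token sequence into (context, target) training pairs.
    The 'label' for each example is simply the next token in the data,
    which is why no human annotation is needed."""
    pairs = []
    for i in range(1, len(tokens)):
        ctx = tuple(tokens[max(0, i - context):i])
        pairs.append((ctx, tokens[i]))
    return pairs

corpus = "the cat sat on the mat".split()
pairs = next_token_pairs(corpus)
# First pair: context ('the',) with target 'cat'
```

At pre-training scale the same idea is applied to trillions of tokens, which is where the "enormous amounts of data" requirement comes from.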

Related Concepts
