Classical upscaling (bilinear, bicubic interpolation) produces smooth, blurry results because it averages neighboring pixels. AI super resolution models (ESRGAN, Real-ESRGAN, SwinIR) learn to predict what high-frequency detail (sharp edges, textures, fine patterns) should look like given the low-resolution input. They're trained on pairs of high-res images and their downscaled versions, learning the mapping from low to high resolution.
AI upscaling necessarily invents detail that isn't in the original image. A blurry face gets plausible-looking features that may not match the actual person. Text becomes readable but may contain wrong letters. This is fine for artistic enhancement but problematic for forensic applications (security footage, medical imaging) where invented detail could be mistaken for real evidence. The output looks convincing but isn't faithful.
Many image generation workflows use a two-stage approach: generate at a lower resolution (faster, cheaper) then upscale with a super resolution model. Stable Diffusion's "hires fix" does exactly this. The base generation handles composition and content; the upscaler adds fine detail and sharpness. This is more efficient than generating at high resolution directly, especially for models that are compute-intensive per pixel.