The mechanism: take the input image, encode it to latent space (via the VAE encoder), add noise in proportion to a "denoising strength" parameter (0.0 = no change, 1.0 = pure noise, equivalent to text-to-image), then denoise conditioned on the text prompt. Because the sampler only runs the fraction of denoising steps that the added noise calls for, low strengths both preserve the input and finish faster. At strength 0.3, the output closely resembles the input with subtle modifications; at strength 0.8, it's largely reimagined but keeps the basic composition.
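The noising step above can be sketched in a few lines. This is a toy illustration, not a real pipeline: the cosine noise schedule and the function name `img2img_start` are assumptions chosen for clarity, and real samplers use their own schedules.

```python
import numpy as np

def img2img_start(latent, strength, num_steps, rng):
    """Noise `latent` to the level implied by `strength`; return the noised
    latent and the number of denoising steps left to run.
    Toy sketch: assumes a cosine alpha-bar schedule, not any specific sampler."""
    assert 0.0 <= strength <= 1.0
    # Strength picks how far along the noise schedule we jump in.
    t = int(round(strength * num_steps))        # 0 = no noise, num_steps = pure noise
    alpha_bar = np.cos(0.5 * np.pi * t / num_steps) ** 2
    noise = rng.standard_normal(latent.shape)
    noised = np.sqrt(alpha_bar) * latent + np.sqrt(1.0 - alpha_bar) * noise
    # The sampler would now denoise for t steps, conditioned on the prompt.
    return noised, t

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8))
z, steps_left = img2img_start(latent, strength=0.0, num_steps=50, rng=rng)
# At strength 0.0, alpha_bar = 1: the latent is returned unchanged and
# there is nothing left to denoise.
```

At strength 1.0 the same function returns pure noise and the full step count, which is exactly the text-to-image case.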
The denoising strength is the key parameter: it controls how much the output can deviate from the input. Low strength (0.2–0.4): minor style changes, color adjustments, subtle detail additions. Medium strength (0.5–0.7): significant style transformation while preserving composition. High strength (0.8–1.0): major reimagining, only vague structural similarity to the input. Finding the right strength for your use case requires experimentation.
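One way to build intuition for those strength bands is to look at how much weight the noised starting point keeps on the original latent. The numbers below assume the cosine schedule sqrt(alpha_bar) = cos(pi * strength / 2); real samplers differ, so treat them as illustrative only.

```python
import numpy as np

def signal_weight(strength):
    """Fraction of the original latent remaining in the noised starting point,
    under an assumed cosine schedule (illustrative, sampler-dependent)."""
    return float(np.cos(0.5 * np.pi * strength))

for s in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"strength {s:.1f} -> {signal_weight(s):.0%} of the input latent survives")
```

The curve falls off steeply: roughly 95% of the input survives at strength 0.2 but only about 31% at strength 0.8, which matches the qualitative jump from "color adjustments" to "major reimagining".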
A powerful img2img workflow: draw a rough sketch (even in MS Paint), use it as the input image with medium-high denoising strength, and describe the desired output. The sketch provides spatial layout (where objects are, their relative sizes) while the AI fills in all the artistic detail. This makes AI image generation accessible to anyone who can draw a stick figure — the composition comes from you, the rendering from the AI.
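A toy experiment shows why a rough sketch still steers the result: even at strength 0.7, the noised starting latent keeps a measurable trace of the sketch's coarse layout, which the denoiser then elaborates. The schedule, shapes, and single-channel "latent" here are assumptions for illustration; this is not a real diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "sketch": dark square on a light background, as a single-channel latent.
sketch = np.full((32, 32), 1.0)
sketch[8:24, 8:24] = -1.0

strength = 0.7
w = np.cos(0.5 * np.pi * strength)  # weight kept on the sketch (assumed schedule)
start = w * sketch + np.sqrt(1 - w**2) * rng.standard_normal(sketch.shape)

# Coarse layout survives the noise: 4x4 block means of the noised latent
# still correlate strongly with the sketch's block means, while the
# pixel-level detail is mostly gone.
coarse = start.reshape(8, 4, 8, 4).mean(axis=(1, 3))
target = sketch.reshape(8, 4, 8, 4).mean(axis=(1, 3))
corr = np.corrcoef(coarse.ravel(), target.ravel())[0, 1]
print(f"block-mean correlation at strength {strength}: {corr:.2f}")
```

This is the sketch-to-image workflow in miniature: spatial layout persists through medium-high noise, so the prompt-conditioned denoiser fills in detail around your composition rather than inventing a new one.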