This text is part of my ongoing PhD work at the University of Copenhagen. Like the other texts in this series, it is a working document written from a practice-based position, and it lacks the references it will eventually need. It argues that the image was never stable — that material, digital, and human transformations have always altered images at every stage of their existence, and that generative AI is a continuation of this instability, not an interruption of it. It is about the chain of transformations that produces what I have been calling, in other texts, residue.
An image is not a stable object. This seems obvious when stated, but much contemporary debate about images — particularly about AI-generated images — proceeds as though it were not true. The arguments about copyright, about authorship, about the integrity of the photographic record, about the trustworthiness of images as evidence — all of these assume, at some level, that there is a fixed original against which copies, manipulations, and generations can be measured. The original painting. The unedited photograph. The authentic image. But the history of images is a history of material instability, and the digital is not an interruption of that history. It is a continuation of it.

What I am trying to trace here is the chain of transformations that an image undergoes from its material origin through reproduction, compression, and circulation to its absorption into a generative model — and what survives of the specific inside each stage of generalisation. In other texts I have been calling this survival residue. This text is about the conditions that produce it.

Start with the painting. A painting is made of materials that change. The pigments fade, darken, yellow, or shift hue over time. The binder cracks. The varnish oxidises. The support warps, tears, or rots. These are not accidents that happen to the painting. They are properties of the materials the painting is made from, and in many cases the artist knew they would happen. Turner used pigments that were known, even at the time, to be chemically unstable — colours that were likely to fade or shift within decades. The brilliant yellows and reds of his late work have shifted significantly since he painted them, and what we see in the gallery today is not what he saw on his easel. Whether he intended the fading as part of the work, or accepted it as a cost of the colour he wanted in the moment, the result is the same: the painting we look at is not the painting he made. It is what his painting became.

The same is true, more dramatically, of paintings that used bitumen or tar in their ground layers. Bitumen never fully dries. It continues to move, crack, and darken, pulling the paint layers above it into patterns the artist did not intend. Carl Gustav Pilo’s monumental painting The Coronation of King Gustav III of Sweden used bitumen extensively in the shadowed areas. Today, those dark sections have fractured into deep fissures, physically warping the canvas. Because the painting was also left unfinished — Pilo died before completing it — it doubly defies the concept of a fixed origin. It was never completed and it has never stopped changing. What hangs in the gallery is the painting’s ongoing material life.
This instability is not a failure of the painting. It is what paintings do. They are made of matter, and matter changes. Conservation can slow the process, can stabilise certain conditions, but it is caught in a fundamental paradox: the condition of visibility is the engine of destruction. Institutions limit the light that falls on artworks because light degrades pigments. But humans need light to see. To preserve a painting perfectly is to lock it in the dark. The moment light hits the pigment to make the image visible, the deterioration resumes. Conservation cannot return the painting to a fixed original state, because the painting never had a fixed state. It has always been in motion, always becoming something other than what it was. The “original” is a convenient fiction — useful for insurance, for attribution, for the market — but it does not describe a material reality. It describes a moment in a process, frozen by convention and treated as though it were permanent.

Even the biological act of looking at the painting defies the idea of a stable, instantaneous capture. We often imagine human vision functioning like a camera, taking in the world as a seamless, high-resolution snapshot. It does not. Unlike a photographic sensor, which records its whole rectangular frame in a single brief exposure, the eye does not take frames. It perceives in sharp detail only the tiny central fraction of its visual field that falls on the fovea. To see a whole painting, the eye must dart across the canvas in rapid, jerky movements called saccades. During these movements, visual perception is briefly suppressed; the brain edits out the blur of motion to maintain a coherent sense of reality. What we experience as a complete, stable image is actually a continuous temporal collage. The brain patches together these tiny, sequential glimpses and fills in the vast, unfocused periphery using context, expectation, and memory. Human vision is not a passive, unified recording of a fixed present. It is an active, generative construction. Long before a camera or a compression algorithm intervenes, the “original” image as perceived by the human subject is already an amalgamation of different moments in time, synthesised by a biological processor smoothing over the gaps in its own data.

Now photograph the painting. The photograph compresses a three-dimensional object — with texture, with scale, with a particular relationship to the light in the room where it hangs — into a two-dimensional image with different dimensions, different colour values, different resolution, and no surface. The photograph is not a copy of the painting. It is a translation, and like all translations it loses some things and introduces others. It loses the texture of the brushwork, the scale of the canvas, the way the painting changes as you move in front of it. It adds the flatness of the photographic surface, the colour profile of the camera sensor, the lighting conditions of the photography session, the decisions of the photographer about angle, framing, and exposure. For most people, the Mona Lisa is encountered not in the Louvre but through photographs and reproductions. Walter Benjamin understood this in 1936, though he framed it as a loss of aura rather than as a material transformation. The point is not that the reproduction is worse than the original — sometimes it is, sometimes it is not — but that the reproduction is a different object, made of different materials, with different properties.
The painting is oil on panel. The photograph is light on a sensor. The printed reproduction is ink on paper. The screen reproduction is backlit pixels. Each of these is a material object with its own instabilities: the ink fades, the paper yellows, the pixels are rendered differently on every screen, the sensor introduces its own noise and colour cast.

Now compress the photograph digitally. JPEG compression discards information the algorithm judges to be imperceptible — high-frequency detail, subtle colour gradations, fine texture. Each time the image is opened, resaved, resized, or recompressed on its way between platforms, more information is lost. The image becomes softer, blockier, more generic. This is Hito Steyerl’s poor image — degraded by circulation, compressed by the networks it passes through, carrying the marks of its passage as visible artifacts. But the degradation did not begin with the JPEG. The photograph was already a compression of the painting. The painting was already changing. The poor image is not a fallen version of a rich original. It is the latest stage in a process of material transformation that began with the first brushstroke.

The instability runs in every direction. The camera that took the photograph of the painting was already applying its own processing — white balance, noise reduction, sharpening, lens correction. A contemporary smartphone applies far more: computational photography pipelines that composite multiple exposures, suppress noise with neural networks, adjust skin tones, and sharpen details that were never in the sensor data. The photograph that enters the dataset is not a passive recording. It is an actively processed image, shaped by the aesthetic assumptions of the camera’s software before any human decision was made about it. Samsung’s moon episode — where the phone’s AI recognised the moon and enhanced it using a neural network trained on images of the moon rather than relying solely on the sensor data — is an extreme case of a general condition. Every digital photograph is partially synthetic. The boundary between the captured and the generated was blurred long before diffusion models existed.

The institutions that govern images know this, even if their rules do not fully acknowledge it. Nature photography competitions are a useful case because they take the question of image authenticity more seriously than almost any other photographic context. The World Nature Photography Awards forbid composites and the addition or removal of objects but permit “limited digital manipulations and focus stacking, providing they do not compromise the authenticity of the image.” The Nature Photographer of the Year competition uses specialised software to detect disallowed techniques but concedes that even rotating a JPEG from portrait to landscape can be flagged as manipulation. World Press Photo permits AI-powered enhancement tools “as long as these tools do not lead to significant changes to the image as a whole, introduce new information to the image, nor remove information from the image that was captured by the camera.” These rules are earnest and carefully thought out. They are also difficult to enforce cleanly, because the boundary they are trying to police — between an authentic image and a manipulated one — does not exist as a sharp line. It is a continuum, and every camera is already somewhere along it. The phone that smooths skin, suppresses noise, and composites multiple exposures has already introduced information that was not captured by the sensor in any single frame.
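To make the generational loss described above concrete: it can be reproduced in a few lines of code. The sketch below is a minimal illustration using the Pillow imaging library; the filenames and the quality setting are hypothetical choices of mine, not a description of any real pipeline. The loss is steepest in the first few re-encodings, and it compounds in circulation because images are also resized, cropped, and recompressed at different qualities as they move between platforms.

```python
# A minimal sketch of generational JPEG loss. The filenames and the
# quality setting are illustrative, not taken from any real pipeline.
from PIL import Image

img = Image.open("photograph_of_painting.png").convert("RGB")

for generation in range(20):
    # Each save re-runs the lossy quantisation step, discarding a
    # little more high-frequency detail and subtle colour gradation.
    img.save("copy.jpg", format="JPEG", quality=70)
    img = Image.open("copy.jpg").convert("RGB")

img.save("after_20_generations.jpg", format="JPEG", quality=70)
```

Comparing the first and last files makes the softening and blocking measurable as well as visible; what the loop removes first is exactly the fine texture that distinguished one image from another.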
The distinction between “capture” and “generation” is a matter of degree, not of kind. In 2024, photographer Miles Astray entered an unmanipulated photograph of a flamingo — its head hidden behind its body during preening, giving it a surreal, headless appearance — into the AI-generated category of the 1839 Color Photography Awards. Judges from The New York Times, Getty Images, and Christie’s, among others, voted it into third place in the juried award and first in the People’s Vote. When Astray revealed that the image was a real photograph, he was disqualified. The year before, Boris Eldagsen had won the Sony World Photography Awards with an AI-generated image entered as photography — the same confusion running in the opposite direction. In both cases, experts could not reliably distinguish between the captured and the generated. This is not a failure of expertise. It is evidence that the categories themselves — “real photograph” and “AI-generated image” — no longer describe materially distinct objects. They describe points on a continuum that runs from the raw sensor data (which no one ever sees, because it is processed before it reaches the screen) through varying degrees of computational enhancement to full generation from a learned distribution.

The reception of images and the reproduction of images have always been entangled. We do not encounter most artworks directly. We encounter them as reproductions — in books, on screens, in lecture slides, on postcards. Our sense of what a painting looks like is shaped by the photographs we have seen of it, which are themselves shaped by the cameras, lighting conditions, and compression algorithms that produced them. A student who studies art history primarily through screen-based reproductions is learning not from the artworks but from a chain of translations of the artworks, each translation carrying its own instabilities and its own aesthetic biases. The model that scrapes those same reproductions and ‘learns’ from them is doing something structurally similar, though at a different scale and without the student’s capacity to visit the gallery and compare.

What this suggests is that image authenticity has never been a fixed property. It has always been negotiated — between the materials and the conditions, between the artist and the medium, between the camera and its processing, between the reproduction and the original, between the viewer and the context of viewing. Technology has always been part of this negotiation, not as a neutral tool that records what is there but as a collaborator that shapes what the image becomes. The brush is a collaborator. The camera is a collaborator. The JPEG algorithm is a collaborator. The diffusion model is a collaborator. Each one introduces its own instabilities, its own biases, its own material logic into the image. The question is not whether the technology has altered the image — it always has — but whether we are honest about the alteration, and whether we attend to what the collaboration produces rather than pretending that the image arrived from nowhere, untouched.

Now scrape that compressed, processed, already partially synthetic photograph from the internet and feed it into a training dataset alongside billions of others, each carrying its own history of material transformation. The images do not enter the dataset cleanly. Behind the scraping are layers of human labour that rarely appear in accounts of how models are built.
Just as the camera sensor flattens light into pixels, and the JPEG algorithm flattens detail into blocks, the human workers in the pipeline perform their own forms of compression. Content moderators reduce the full range of human visual production into binary decisions — keep or discard. Data labellers reduce complex visual scenes into bounding boxes and text tags. Crowd workers flatten aesthetic experience into numerical ratings on a scale. Each of these is a translation, and each one loses something and adds something, just as the photograph of the painting did. The difference is that the camera’s compression is acknowledged as a technical process, while the human compression — the moderator’s decision, the labeller’s box, the rater’s score — is hidden behind the interface and presented as though it were not there.

The dream of automation is that these human hands can be removed from the process. But they persist, often in places where the user has no reason to suspect them. LinkedIn’s CardMunch service, which operated from 2010 until it was shut down in 2014, is a small but telling case. Users scanned business cards with their phones and received clean, accurate digital contacts. The service was marketed as a technological solution, but the transcription was done by human workers — multiple workers per card, comparing the software’s output against the photograph, correcting the errors, and sending the verified result back. When CardMunch closed, its successors continued using human transcription, because the automated solutions were not accurate enough. The interface was digital, but the labour remained human. What users received was the product of underpaid human attention, invisibly inserted into a process designed to look like it needed no people at all.

Even after training, generative models are not left to operate on their own. Large language models carry system prompts — pre-written instructions, invisible to the user, that sit between the model and the conversation, shaping what the model will and will not say. These are human-authored documents, updated regularly, reflecting specific corporate decisions about tone, safety, and liability. The model’s outputs are not the pure products of its training. They are the products of its training as constrained by human-written rules that the user never sees. The system presents itself as a conversation with a machine. What is actually happening is a conversation with a machine supervised by instructions written by people who are not in the room.
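The mechanics are mundane enough to sketch. In chat-style APIs the system prompt is simply the first message in the request, one the interface never displays. The example below uses the OpenAI Python SDK as one concrete case; the model name and the prompt text are invented here for illustration, and real system prompts run to thousands of words.

```python
# Schematic sketch of where a system prompt sits in a chat request.
# The prompt text is invented; real system prompts are long,
# human-authored policy documents, updated over time.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative choice of model
    messages=[
        # The user never sees this first message. It is prepended to
        # every conversation and constrains everything that follows.
        {
            "role": "system",
            "content": "You are a helpful assistant. Decline requests "
                       "of kind X. Always answer in register Y.",
        },
        # Only this part of the exchange is visible in the interface.
        {"role": "user", "content": "Describe this image to me."},
    ],
)

print(response.choices[0].message.content)
```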
The ‘gaze’ of the AI system is often described as inhuman — indifferent, totalising, extractive. But it is not inhuman. It is human all the way through, assembled from human images, filtered by human labour, shaped by human instructions, and received by human eyes. The inhumanity is not in the absence of the human but in the concealment of the human — the erasure of the hands, the eyes, the decisions, and the working conditions that the system depends on but does not acknowledge.

The imaginaries that surround these systems — the narratives of artificial intelligence, of machine creativity, of autonomous generation, and equally the narratives of existential risk and machine superintelligence — serve, whatever their merits as speculation, to obscure this material reality. Narratives of hype obscure historical continuities. The metaphor of the cloud obscures the physical infrastructure. The language of intelligence obscures the human labour. Whether the narrative is utopian or dystopian, it directs attention away from the actual machinery: the specific datasets, the specific hardware, the specific decisions about filtering and moderation, the specific human workers at every stage of the pipeline.

The instability of the image — which this text has been tracing from the painting through the photograph through the compression through the model — is hidden by the same gesture that hides the labour. The outputs are presented as clean, finished products, emerging from a system that the user is not invited to examine. The instability is there, in every output, but the interface is designed to make it invisible.

The model itself does not ‘know’ which images are photographs of paintings and which are photographs of the world. It does not ‘know’ which have been heavily compressed and which are high resolution. It does not ‘know’ which have been cropped, colour-corrected, or run through Instagram filters. It treats all of them as data points in a distribution, and it derives its categories from their statistical relationships. The instabilities of all those images — the faded pigments captured by the camera, the JPEG artifacts, the colour casts, the crops that removed context, the moderator’s decisions about what to keep, the labeller’s choices about where to draw the box — are folded into the model’s parameters alongside the content of the images. The model ‘learns’ from the instabilities as much as from the content. A painting that has darkened over three centuries is photographed under gallery lighting, compressed to JPEG, uploaded to a museum website, scraped into a dataset, and compressed again into latent space. Each stage alters the image. None of them is neutral. And the model that emerges from this process is not ‘learning’ from paintings — it is ‘learning’ from the accumulated material and human transformations that the paintings underwent on their way to becoming data.

The dataset also introduces a temporal instability of its own. It freezes a particular moment of internet culture — the images that happened to be online, in the formats they happened to be in, with the captions they happened to carry, at the time the scrape was run. Culture moves on. Political symbols shift meaning. Memes evolve and die. The model becomes a time capsule of the visual culture of the moment it was trained on, and its outputs carry that moment’s assumptions forward into a present that has already changed. Just as Pilo’s bitumen continues to crack, the model “ages” as the cultural context of its training data drifts away from the conditions of its use.

And when models begin training on data that includes outputs from earlier models — which is increasingly the case as generated images circulate on the same platforms that supply training data — a further instability enters the loop. The model ‘learns’ from its own compressed productions, amplifying its defaults and smoothing away whatever marginal specificity the previous generation retained. The instability feeds on itself.
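There is a standard toy illustration of this loop, which I sketch below on my own initiative; it is a textbook demonstration of recursive training, not a claim about any specific production system. Each ‘generation’ fits a one-dimensional Gaussian to samples drawn from the previous generation’s fit, standing in for a model trained on its predecessor’s outputs.

```python
# Toy sketch of recursive training: each generation is fitted to
# samples produced by the generation before it. All numbers are
# illustrative; the narrowing tendency is the point.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0   # stand-in for the "real" distribution of images
n = 50                 # each generation's small training set

for generation in range(1, 201):
    data = rng.normal(mu, sigma, n)       # sample from the current model
    mu, sigma = data.mean(), data.std()   # refit the model to its own outputs
    if generation % 50 == 0:
        print(f"generation {generation:3d}: sigma = {sigma:.3f}")
```

The fitted spread performs a downward-biased random walk: over enough generations the distribution narrows, and the tails, the rare and specific cases, are what the loop forgets first.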
This has consequences for what the model produces. When a diffusion model generates an image that looks like a painting, it is not reproducing the visual qualities of painting. It is reproducing the visual qualities of photographs of paintings that have been digitally compressed and uploaded to the internet. The “painterly” quality of AI-generated images is not the quality of paint on canvas. It is the quality of paint on canvas as filtered through a camera sensor, a JPEG algorithm, a web upload pipeline, and a latent space compression. The model has never ‘seen’ a painting. It has processed images of images of paintings, each layer adding its own distortions. The result is a simulacrum that refers not to the painting but to the chain of mediations the painting passed through. This is not a critique of the model. It is a description of the condition. The model cannot do otherwise, because the data it ‘learned’ from was already unstable, already transformed, already several steps removed from whatever “original” it is taken to represent.

What this means for the concept of residue is that the residue inside a generative model is not the trace of stable originals that were compressed. It is the trace of already-unstable images that were compressed further. The specificity that survives inside the model’s parameters was already a transformed specificity — the colour of the painting as captured by this camera under this light, not the colour of the painting as it was. The texture of the brushwork as rendered by this lens at this resolution, not the texture of the brushwork as felt by a hand. The residue is a trace of a trace. It refers back not to an original but to a chain of translations — material, digital, human — each one altering what it passed along.

This does not make the residue less real, just more complex. The residue inside the model is not the ghost of a stable thing that was lost. It is the accumulated deposit of a process of transformation that was already underway before the model existed — a process that includes the material life of the painting, the compression of the photograph, the processing of the camera, the degradation of the file format, the labeller’s bounding box, the moderator’s decision, and the statistical reduction of the training pipeline. Each layer contributed something to what the model absorbed, and each layer also took something away. What remains — the residue — is the sum of all these contributions and subtractions, folded together in a form that cannot be unpicked.

For the person making images with these models, this changes what it means to work with specificity. You are not trying to recover an original that was lost in compression. There was no stable original. You are working with a material that has always been in motion — that has always been becoming something other than what it was. The bitumen in Pilo’s painting is still moving. The pigments in a Turner are still fading. The JPEG of a photograph of either painting is a snapshot of an ongoing process, and the model that processed that JPEG inherited the process, not the snapshot. Working with generative models is working with this inheritance: the accumulated instability of all the images that came before, compressed into a space where their individual histories are no longer separable but their collective influence persists.
This has practical consequences for how prompting works. When I prompt a model for a specific image — a Danish hospital corridor, a particular shade of blue, a welfare system that does not exist — the model’s output reveals the shape of its instabilities. If I prompt for a hospital, I get American corridor layouts, because the English word “hospital” sits in a dense cluster formed by predominantly American images. If I prompt in Danish, the model struggles differently — the cluster is thinner, the results are less coherent, and the linguistic materiality of the latent space becomes visible as a constraint on what the model can produce. Prompting is, in this sense, a diagnostic practice. It does not fix the model’s instabilities. It maps them. Each failed output tells you something about what the model absorbed and how — which images dominated the distribution, which languages structured the labels, which aesthetic conventions were reinforced by the filtering pipeline. The instabilities of the output are not noise to be overcome. They are information about the accumulated history of the images the model processed.

This mapping also reveals that the boundaries models are asked to produce rarely match the boundaries their data supports. When nation-states invest in sovereign AI — building national models trained on curated national datasets — they are commanding the model to produce hard categories: Danish, non-Danish, citizen, foreign. But latent space does not produce hard categories. It produces gradients of statistical similarity. A generated image that looks “Danish” may be a composite of Scandinavian architecture, German street furniture, and American stock photography lighting. The sovereign dataset promises borders. The model produces gradients. The instability of the image persists even when the political will demands stability.

The instability is not something that can be removed. It is a condition of the image, and it always has been. What the generative model does is make this condition visible at scale — by compressing so many unstable images into a single system that the instabilities themselves become part of what the system produces. The extra finger, the dissolving face, the colour that belongs to no convention — these are not just the model’s ‘failures’. They are the surfacing of the accumulated instabilities of the images the model ‘learned’ from, pushed through one more layer of compression and emerging as visible artifacts in the output. The instability has always been there. The model just makes it difficult to ignore.