How to Stop the AI Drawing iPhones in a Past Era

June 3, 2025

Table of Contents

  • The fragile “truth”
  • Methods and Tests
    • Visual Style Dominance
    • Historical consistency
    • Demographics
  • Conclusion

How do AI image generators portray the past? New research shows that they drop smartphones into the 18th century, insert laptops into scenes from the 1930s, and place vacuum cleaners in 19th-century homes, raising questions about how these models imagine history.

In early 2024, the image-generation capabilities of Google’s Gemini multimodal AI model were criticized for imposing demographic fairness in historically inappropriate contexts.

Demographically unlikely German military personnel, as imagined by Google’s Gemini multimodal model in 2024. Source: Gemini AI/Google via The Guardian

This was an example of bias-correction efforts in AI models that failed to take historical context into account. In that case, the issue was addressed fairly quickly. However, diffusion-based models more generally tend to produce versions of history that muddle modern and period artifacts.

This is partly a result of entanglement, where qualities that frequently appear together in the training data become fused in the model’s output. For example, if modern objects such as smartphones often co-occur with the act of speaking and listening in a dataset, the model can learn to associate those activities with the latest devices, even when the prompt specifies a historical setting. Once these associations are baked into the model’s internal representation, it becomes difficult to separate the activity from its modern context, leading to historically inaccurate results.

A new Swiss paper examining this entanglement in the historical output of latent diffusion models observes that such frameworks, though capable of creating photorealistic people, nonetheless prefer to portray historical figures in the visual idiom of their era.

From the new paper: outputs from an LDM for the prompt ‘a photorealistic image of a person laughing with friends in [historical period]’, across a range of periods. As can be seen, the media of each era becomes entangled with the content. Source: https://arxiv.org/pdf/2505.17064

For the prompt ‘a photorealistic image of people laughing with friends in [historical era]’, one of the three models tested often ignores the negative prompt ‘monochrome’, instead applying color treatments that reflect the visual media of the era in question, for example mimicking the muted tones of celluloid film stock from the 1950s through the 1970s.

When testing the three models’ tendency to produce anachronisms (objects that do not belong to the target period, whether from its future or its past), the researchers found a common inclination to fuse timeless activities such as ‘singing’ or ‘cooking’ with modern contexts and equipment.

Activities that were perfectly feasible in earlier centuries are depicted with current or recent technologies and tools, against the spirit of the requested image.

Notably, smartphones prove especially hard to disentangle from the idiom of photography, and from many other historical contexts, because their proliferation and depiction are so well represented in influential hyperscale datasets such as Common Crawl.

In the Flux text-to-image model, communication and smartphones are closely related concepts, even when the historical context does not permit it.

To determine the scope of the problem, and to aid future research into this particular bugbear, the authors of the new paper developed a custom dataset against which to test generative systems. The work is titled Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models, and comes from two researchers at the University of Zurich. The dataset and code are publicly available.

The fragile “truth”

Some of the paper’s themes touch on culturally sensitive issues, such as the under-representation of race and gender in historical depictions. While Gemini’s imposition of racial equality onto the grossly inequitable Third Reich is an absurd and disreputable piece of historical revisionism, restoring ‘traditional’ racial representation, by reverting the cases a diffusion model has ‘updated’, would often effectively ‘redact’ under-represented groups from the picture.


Recent hit historical shows such as Bridgerton have blurred the accuracy of historical demographics in ways that could affect future training datasets, complicating efforts to align generated period imagery with traditional standards. And it is a complicated topic in any case, given that (Western) recorded history tends to favor wealth and whiteness, leaving so many ‘lesser’ stories untold.

With these tricky and ever-shifting cultural parameters in mind, let’s take a look at the researchers’ new approach.

Methods and Tests

To test how generative models interpret historical contexts, the authors created HistVis, a dataset of 30,000 images generated from 100 prompts depicting common human activities, each rendered across ten different time periods.

Samples from the HistVis dataset, which the authors have made available on Hugging Face. Source: https://huggingface.co/datasets/latentcanon/histvis

Activities such as cooking, praying, or listening to music were selected for their universality and phrased in neutral terms, to prevent the models from anchoring on a particular aesthetic. The periods covered range from the seventeenth century to the present day, with the twentieth century sampled at the level of individual decades.

The 30,000 images were generated using three widely-used open-source diffusion models: Stable Diffusion XL; Stable Diffusion 3; and Flux.1. By isolating the time period as the sole variable, the researchers created a structured basis for assessing how historical cues are visually encoded, or ignored, by these systems.
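As an illustration of this kind of prompt grid, the minimal sketch below uses the Hugging Face diffusers library; the model ID, activities, periods, and file-naming are illustrative assumptions rather than the authors’ exact configuration.

```python
# Minimal sketch: generating an activity x period prompt grid with diffusers.
# The activities, periods and sample count here are illustrative; the paper's
# HistVis dataset uses 100 activities, 10 periods and 3 models (SDXL, SD3, Flux.1).
import itertools
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

activities = ["cooking a meal", "listening to music", "praying"]       # neutral phrasing
periods = ["the 18th century", "the 1930s", "the 1950s", "the 2020s"]  # era is the only variable

for activity, period in itertools.product(activities, periods):
    prompt = f"a person {activity} in {period}"
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"histvis_{activity.replace(' ', '_')}_{period.replace(' ', '_')}.png")
```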

Visual Style Dominance

The authors first examined whether the generative models default to particular visual styles when depicting historical periods, since models often associate specific centuries with distinctive styles even when the prompt mentions no medium or aesthetic.

Predicted visual styles of images generated from the prompt ‘a person dancing with another person in [historical period]’ (left), and from the same prompt with photorealism explicitly requested and ‘monochrome photo’ set as the negative prompt (right).

To measure this trend, the authors trained a convolutional neural network (CNN) to classify each image in the HistVis dataset into one of five categories: drawing; sculpture; illustration; painting; or photography.

The classifier was based on a VGG16 model pre-trained on ImageNet and fine-tuned with 1,500 examples per class drawn from the WikiArt dataset. Since WikiArt does not distinguish between monochrome and color photographs, a separate colorfulness score was used to label low-saturation images as monochrome.
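The paper does not specify the exact colorfulness formula; the sketch below assumes the widely-used Hasler and Süsstrunk colorfulness metric, with an arbitrary threshold, as one plausible way to flag monochrome images.

```python
# Sketch of a colorfulness-based monochrome check, assuming the common
# Hasler-Suesstrunk colorfulness metric; the formula choice and threshold are
# assumptions, not details taken from the paper.
import numpy as np
from PIL import Image

def colorfulness(image_path: str) -> float:
    rgb = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    rg = r - g                      # red-green opponent channel
    yb = 0.5 * (r + g) - b          # yellow-blue opponent channel
    std_root = np.sqrt(rg.std() ** 2 + yb.std() ** 2)
    mean_root = np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
    return float(std_root + 0.3 * mean_root)

def is_monochrome(image_path: str, threshold: float = 10.0) -> bool:
    # Low colorfulness is treated as a monochrome (black-and-white) photograph.
    return colorfulness(image_path) < threshold
```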

The trained classifier was then applied to the complete dataset, and the results show that all three models impose consistent stylistic defaults per period: SDXL associates the 17th and 18th centuries with sculpture, while SD3 and Flux.1 lean toward painting; for the 20th century, SD3 prefers black-and-white photography, while SDXL often returns modern illustrations.

These preferences were found to persist despite prompt adjustments, suggesting that the models encode entrenched links between style and historical context.

Predicted visual styles of images generated for each historical period, per diffusion model, based on 1,000 samples per model.

To quantify how strongly a model links a historical period to a particular visual style, the authors introduce a metric titled Visual Style Dominance (VSD). For each model and period, VSD is defined as the proportion of outputs predicted to share the most common style.

Examples of stylistic bias across the models.

A higher score indicates that a single style dominates the output for that period, while a lower score indicates greater variability. This makes it possible to compare how closely each model adheres to particular stylistic conventions over time.
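As a concrete reading of that definition, here is a minimal sketch of the VSD calculation; the function signature and data layout are assumptions, not the authors’ released code.

```python
# Sketch of the Visual Style Dominance (VSD) calculation described above:
# for each (model, period) group, the share of images whose predicted style
# is the most common one.
from collections import Counter

def visual_style_dominance(predicted_styles: list[str]) -> float:
    """predicted_styles: classifier labels for all images in one (model, period) group."""
    counts = Counter(predicted_styles)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(predicted_styles)

# Example: 7 of 10 images classified as 'painting' gives a VSD of 0.7.
print(visual_style_dominance(["painting"] * 7 + ["photography"] * 2 + ["drawing"]))
```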


Applied to the complete HistVis dataset, the VSD metric reveals differing degrees of convergence, and makes clear that each model strongly narrows its visual interpretation of the past.

The results table above shows VSD scores across historical periods for each model. For the 17th and 18th centuries, SDXL tends heavily toward sculpture, while SD3 and Flux.1 prefer painting. By the 20th and 21st centuries, SD3 and Flux.1 shift to photography, while SDXL shows more variation but often defaults to illustration.

All three models show a strong preference for monochrome images in the early decades of the 20th century, particularly in the 1910s, 1930s and 1950s.

To test whether these patterns could be mitigated, the authors applied prompt engineering, explicitly requesting photorealism and using negative prompts to suppress monochrome output. In some cases the dominance scores decreased, with outputs shifting toward painting in the 17th and 18th centuries, for example.

However, these interventions rarely produced genuinely photorealistic images, indicating that the models’ stylistic defaults are deeply embedded.
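For reference, this is roughly what such an intervention looks like with the diffusers library; the model ID and prompt wording are illustrative assumptions, and, per the paper’s findings, results of this kind often remain stylistically anchored to the period.

```python
# Sketch of the prompt-engineering intervention described above: request
# photorealism in the positive prompt and suppress monochrome output via the
# negative prompt. Model ID and wording are illustrative.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a photorealistic image of people dancing with another person in the 1930s",
    negative_prompt="monochrome photo, black and white, grayscale",
    num_inference_steps=30,
).images[0]
image.save("dancing_1930s_photoreal.png")
```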

Historical consistency

The next line of analysis examined historical consistency: whether the generated images contained objects that did not fit the period. Rather than using a fixed list of prohibited items, the authors developed a flexible method that uses large language models (LLMs) and vision-language models (VLMs) to find elements that appear out of place in the stated historical context.

The detection method followed the same format as the HistVis dataset, in which each prompt combines a historical period with a human activity. For each prompt, GPT-4o generated a list of objects that would not have existed in the specified period. For every proposed object, GPT-4o then generated a yes-or-no question designed to check whether that object appears in the generated image.

For example, given the prompt ‘a person listening to music in the 18th century’, GPT-4o might identify modern audio devices as historically inaccurate and generate a question such as ‘Is anyone using headphones or a smartphone, which did not exist in the 18th century?’.

These questions are then passed to GPT-4o in a visual question-answering setup, in which the model reviews the image and answers yes or no to each. This pipeline allowed historically implausible content to be detected without relying on a predefined taxonomy of modern objects.
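A minimal sketch of that two-stage flow, using the OpenAI Python client, might look like the following; the prompt wording, parsing, and helper names are assumptions, not the authors’ released pipeline.

```python
# Sketch of the two-stage detection pipeline: stage 1 proposes yes/no questions
# about period-inappropriate objects, stage 2 answers them against the image.
import base64
from openai import OpenAI

client = OpenAI()

def propose_anachronism_questions(period: str, activity: str) -> list[str]:
    # Stage 1: ask GPT-4o for yes/no questions about objects absent in the period.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"For an image of a person {activity} in {period}, list yes/no "
                       f"questions, one per line, that check for objects which did not "
                       f"exist in {period}.",
        }],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

def image_answers_yes(image_path: str, question: str) -> bool:
    # Stage 2: visual question answering over the generated image.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question + " Answer only 'yes' or 'no'."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```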

Examples of generated images flagged by the two-stage detection method, showing anachronistic elements: headphones in the 18th century; a vacuum cleaner in the 19th century; a laptop in the 1930s; and a smartphone in the 1950s.

To measure how often anachronisms appear in the generated images, the authors introduced a simple scheme for scoring their frequency and severity. First, they accounted for minor linguistic differences in how GPT-4o described the same object.

For example, ‘modern audio device’ and ‘digital audio device’ were treated as equivalent. To avoid double-counting, a fuzzy matching system was used to group these surface-level variations without conflating genuinely distinct concepts.

Once all the proposed anachronisms were normalized, two metrics were calculated: frequency, which measures how often a particular object appears in images for a given period and model; and severity, which measures how reliably the object actually appears once it has been proposed.


A modern phone that was flagged ten times and appeared in all ten of the corresponding generated images receives a severity score of 1.0; if it appeared in only five, the severity score would be 0.5. These scores help to identify not only whether anachronisms occur, but how firmly they are embedded in the model’s output for each period.
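A rough sketch of that normalize-then-score step follows; the fuzzy matching here uses difflib string similarity as a stand-in for whatever matching the authors used, and the record layout and thresholds are assumptions.

```python
# Sketch of the frequency/severity scoring described above: variant object names
# are grouped, then each canonical object is scored per (model, period) group.
from collections import defaultdict
from difflib import SequenceMatcher

def normalize(name: str, known: list[str], threshold: float = 0.7) -> str:
    # Map surface variants of an object name onto a single canonical label.
    for canon in known:
        if SequenceMatcher(None, name.lower(), canon.lower()).ratio() >= threshold:
            return canon
    known.append(name)
    return name

def score(records: list[dict], n_images: int) -> dict[str, dict[str, float]]:
    """records: one entry per (image, proposed object) with a boolean 'present' flag."""
    canon: list[str] = []
    proposed = defaultdict(int)   # times the object was flagged in stage 1
    confirmed = defaultdict(int)  # times it actually appeared in the image
    for r in records:
        obj = normalize(r["object"], canon)
        proposed[obj] += 1
        confirmed[obj] += int(r["present"])
    return {
        obj: {
            "frequency": confirmed[obj] / n_images,      # how often it appears at all
            "severity": confirmed[obj] / proposed[obj],  # how reliably it appears once proposed
        }
        for obj in proposed
    }
```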

The top fifteen anachronisms for each model, plotted with frequency on the x-axis and severity on the y-axis. Circles mark elements ranking in the top fifteen by frequency, triangles by severity, and diamonds by both.

Above we see the fifteen most common anachronisms for each model, ranked by how often they appeared and how consistently they accompanied the prompt.

Clothing-related anachronisms appeared frequently but diffusely, while items such as audio devices and ironing boards appeared less often but in consistent patterns, suggesting that the models often respond to the prompted activity rather than to the requested period.

SD3 showed the highest anachronism rate, especially in images from the 19th century and 1930s, followed by Flux.1 and SDXL.

To test how well the detection method matched human judgment, the authors ran a user study on 1,800 randomly sampled images from SD3, the model with the highest anachronism rate. Each image was rated by three crowd workers; after unreliable responses were filtered out, 2,040 judgements from 234 users remained, and the automated method agreed with the majority vote in 72% of cases.

The GUI for the human evaluation study, showing the task instructions, examples of accurate and anachronistic images, and a generated output alongside the yes/no question used to identify temporal inconsistencies.

Demographics

The final analysis examined how the models portray race and gender over time. Using the HistVis dataset, the authors compared model output against baseline estimates generated by a language model. Although these estimates are not precise, they provide a rough sense of historical plausibility and help to clarify whether the models adapt their depictions to the intended period.

To assess these depictions at scale, the authors built a pipeline that compares model-generated demographics against rough expectations for each period and activity. They first applied the FairFace classifier, a ResNet34-based tool trained on more than 100,000 images, to detect gender and race in the generated output, so that the faces in each scene could be assigned gender and racial categories and tracked over time.

Examples of generated images showing demographic over-representation across different models, periods, and activities.

Low-confidence results were excluded to reduce noise, and predictions were averaged across all images linked to a particular period and activity. A second system, based on DeepFace, was run on a sample of 5,000 images to confirm the reliability of the FairFace measurements; the two classifiers showed strong agreement, supporting the consistency of the demographic measurements used in the study.

To compare the models’ output against historical plausibility, the authors asked GPT-4o to estimate the expected gender and race distribution for each activity and period. These estimates served as a rough baseline rather than ground truth. Two metrics were then used, under-representation and over-representation, measuring how far the model output deviates from the LLM’s expectations.
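Read literally, those two metrics can be sketched as the gap between the observed demographic share and the estimated baseline, as below; the dictionary format and function name are assumptions, not the paper’s code.

```python
# Sketch of the over-/under-representation measures described above, computed as
# the difference between the observed demographic share (from the face classifier)
# and the GPT-4o-estimated baseline for a given period and activity.
def representation_gaps(observed: dict[str, float],
                        expected: dict[str, float]) -> dict[str, dict[str, float]]:
    """observed/expected: demographic group -> share of faces, each summing to ~1.0."""
    gaps = {}
    for group in expected:
        diff = observed.get(group, 0.0) - expected[group]
        gaps[group] = {
            "over_representation": max(diff, 0.0),    # appears more often than estimated
            "under_representation": max(-diff, 0.0),  # appears less often than estimated
        }
    return gaps

# Example: the model shows 90% men for an activity where the baseline expects 50%.
print(representation_gaps({"male": 0.9, "female": 0.1},
                          {"male": 0.5, "female": 0.5}))
```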

The results showed clear patterns: Flux.1 often over-represented men, including in activities such as cooking, where women would historically be expected, while SD3 and SDXL showed similar skews for work, education, and religion. White faces appeared more often than expected overall, although this bias decreased for more recent periods. Some categories also showed unexpected spikes in non-white representation, suggesting that model behavior may reflect correlations in the training data rather than historical context.

Over- and under-representation of gender and race in Flux.1, shown as the absolute difference between the model’s output, across centuries and activities, and GPT-4o’s demographic estimates.

The authors conclude:

‘Our analysis revealed that [text-to-image / TTI] models rely on limited stylistic encodings rather than a nuanced understanding of historical periods. Each era is strongly linked to a particular visual style, resulting in one-dimensional depictions of history.

‘Photorealistic depictions of people, in particular, only appear from the 20th century onward, with rare exceptions in Flux.1 and SD3, suggesting that models do not flexibly adapt to historical contexts but reinforce learned associations, perpetuating the notion that realism is a contemporary trait.

‘Furthermore, frequent anachronisms suggest that historical periods are not cleanly separated in the latent spaces of these models, as modern artifacts often emerge in pre-modern settings, undermining the reliability of TTI systems in educational and cultural-heritage contexts.’

Conclusion

During the training of a diffusion model, new concepts do not settle neatly into predefined slots within the latent space. Instead, they form clusters shaped by how often they appear and by their proximity to related ideas. The result is a loosely organized structure in which each concept exists in relation to its frequency and typical context, rather than through any clean or principled separation.

This makes it difficult to isolate what counts as ‘historical’ within a large general-purpose dataset. As the new paper’s findings suggest, many periods are defined more by the look of the media used to depict them than by deeper historical detail.

This is one reason it remains difficult to produce a photorealistic, 2025-quality image of (for example) a 19th-century character: in most cases the model leans on visual tropes drawn from film and television, and when these do not match the request, there is very little else to fall back on. Bridging that gap may depend on future improvements in disentangling overlapping concepts.

First released on Monday, May 26th, 2025
