Comparing Leading Text-to-Image Generation Models for Adding Text to Images
A comparison of nine leading image generation models’ ability to render accurate text (words and phrases) within an image.
In this post, we will assess the capabilities of nine state-of-the-art text-to-image generation models from multiple providers on different hosting platforms. Specifically, we will evaluate their ability to generate accurate text (words and phrases) within images based on given prompts. The models tested include the following (in alphabetical order):
- Adobe Firefly Image 3 (via firefly.adobe.com)
- Amazon Titan Image Generator G1 v2 (via Amazon Bedrock)
- Black Forest Labs FLUX1.1 [pro] and Ultra Mode (via Replicate)
- Google Imagen 3 (via ImageFX)
- KLING AI powered by Kwai-Kolors/Kolors (via klingai.com)
- Midjourney v6.1 (via midjourney.com)
- OpenAI DALL·E 3 (via ChatGPT)
- Stability AI Stable Diffusion 3.5 Large (via stability.ai API)
- Stability AI Stable Image Ultra 1.0 v1 (via Amazon Bedrock)
Additionally, we will examine three alternative and more reliable techniques for ensuring the accuracy of text in generated images.
Testing the Models
Several tests, using different prompts and varying levels of detail, were run across all models. Examples of prompts included:
- A photograph of a smiling scientist holding a sign that reads: “Flawless AI-generated text!”
- Vegetable stand with various vegetables, including tomatoes. A black sign with white type reads: “Farm Fresh Tomatoes $2.99/lb.”
- A whimsical illustration of a friendly-looking pumpkin on a white background with a Fall motif of assorted gourds and autumn leaves. The words “Happy Halloween” are centered above the pumpkin in large dark brown letters.
- A sleek billboard towers above a bustling interstate at rush hour, cars whizzing by in a blur. Against a dynamic, abstract background, the large, bold text “Generative AI: Transforming Digital Advertising” creates instant readability for passing motorists.
Although the overall image quality and degree of apparent bias varied significantly among the models, only text generation capabilities were assessed. Models that could accurately reproduce the requested text in the prompt at least 50% of the time received a passing grade. Below are results from selected tests that exemplify the models’ capabilities. The results are presented in alphabetical order rather than ranked by quality. For each test, four representative images of average quality are included in the post.
Models
Adobe Firefly Image 3
Adobe announced its Firefly Image 3 Foundation Model in April 2024. According to the press release, Adobe Firefly Image 3 delivers stunning advancements in photorealistic quality, styling capabilities, detail, accuracy, and variety. In addition, significant advances in generation speed make the ideation and creation process more productive and efficient. The model is available for use in Adobe Photoshop (beta) and on firefly.adobe.com. Both interfaces are shown below.
🚫 In my tests, Adobe Firefly could not accurately reproduce the text requested in the prompt.
Amazon Titan Image Generator G1 v2
Amazon Titan Image Generator G1 v2 was released in August 2024 as an upgrade to the previous generation, Amazon Titan Image Generator G1 v1, released in November 2023. Version 2 added features including image conditioning, image guidance with a color palette, background removal, and subject consistency.
The Amazon Titan Image Generator G1 v2 model was tested on Amazon Bedrock, which according to AWS, is “a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.”
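For reference, here is a minimal sketch of invoking the model through the Bedrock runtime with boto3. The model ID and request fields follow the Bedrock documentation as I understand it; verify them against the current docs before relying on them:

```python
import base64
import json

import boto3

# Bedrock runtime client (assumes AWS credentials and model access are configured)
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "text": "A photograph of a smiling scientist holding a sign that reads: "
                "'Flawless AI-generated text!'"
    },
    "imageGenerationConfig": {
        "numberOfImages": 1,
        "width": 1024,
        "height": 1024,
        "cfgScale": 8.0,
    },
}

response = bedrock.invoke_model(
    modelId="amazon.titan-image-generator-v2:0",
    body=json.dumps(body),
)

# The response body contains a list of base64-encoded images
payload = json.loads(response["body"].read())
image_bytes = base64.b64decode(payload["images"][0])

with open("titan_test.png", "wb") as f:
    f.write(image_bytes)
```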
🚫 In my tests, Amazon Titan Image Generator G1 v2 could not accurately reproduce the text requested in the prompt.
Black Forest Labs FLUX1.1 [pro] and Ultra Mode
Black Forest Labs released FLUX1.1 [pro] in October 2024. According to Black Forest Labs, “FLUX1.1 [pro] provides six times faster generation than its predecessor FLUX.1 [pro] while also improving image quality, prompt adherence, and diversity. At the same time, we updated FLUX.1 [pro] to generate the same output as before, but two times faster.” The earlier FLUX.1 [pro] model was released in August 2024.
As I prepared this post, Black Forest Labs introduced FLUX1.1 [pro] Ultra and Raw Modes. According to the press release, “Today we are adding new high-resolution capabilities to FLUX1.1 [pro], extending its functionality to support 4x higher image resolutions (up to 4MP) while maintaining an impressive generation time of only 10 seconds per sample.”
Tests of Black Forest Labs FLUX1.1 [pro] and Ultra were run on Replicate. Their website states, “Replicate runs machine learning models in the cloud. We have a library of open-source models that you can run with a few lines of code. If you’re building your own machine learning models, Replicate makes it easy to deploy them at scale.”
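For those who want to reproduce the tests, a call through Replicate's Python client looks roughly like the sketch below. The model slug and input fields are based on Replicate's public model listing and may change:

```python
import replicate  # pip install replicate; reads REPLICATE_API_TOKEN from the env

# Model slug as listed on Replicate at the time of writing (assumption)
output = replicate.run(
    "black-forest-labs/flux-1.1-pro",
    input={
        "prompt": (
            "A photograph of a smiling scientist holding a sign that reads: "
            "'Flawless AI-generated text!'"
        ),
        "aspect_ratio": "1:1",
        "output_format": "png",
    },
)

# Recent versions of the client return a file-like FileOutput object; older
# versions return a URL string instead, which you would download separately
with open("flux_test.png", "wb") as f:
    f.write(output.read())
```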
✅ In my tests, Black Forest Labs FLUX1.1 [pro] could accurately reproduce the text requested in the prompt more than 50% of the time. It had the best results of all models tested.
Google Imagen 3
Google Imagen 3 was released to all US users in August 2024. According to Google, “Imagen 3 is our highest-quality text-to-image model, capable of generating images with even better detail, richer lighting, and fewer distracting artifacts than our previous models.” Tests of Google Imagen 3 were run on ImageFX, part of Google’s AI Test Kitchen, “a place where people can experience and give feedback on some of Google’s latest AI technologies.”
🚫 In my tests, Google Imagen 3 could not accurately reproduce the text requested in the prompt.
KLING AI powered by Kolors
Kolors powers Kling AI’s image generation capabilities. According to Hugging Face, “Kolors is a large-scale text-to-image generation model based on latent diffusion, developed by the Kuaishou Kolors team. Trained on billions of text-image pairs, Kolors exhibits significant advantages over both open-source and proprietary models in visual quality, complex semantic accuracy, and text rendering for both Chinese and English characters.” According to Kuaishou, Kling AI was released in July 2024.
🚫 In my tests, KLING AI powered by Kolors could not accurately reproduce the text requested in the prompt. The results were the worst of the models tested. Many responses were in Chinese, even when the prompt explicitly requested English text.
Midjourney v6.1
Midjourney v6.1 was released in July 2024. According to Midjourney, v6.1 contains several significant improvements, including more coherent images (arms, legs, hands, bodies, plants, animals, etc.), much better image quality, more precise, detailed, and correct small image features, and improved text accuracy (when drawing words via “quotations” in prompts). According to Midjourney, using the --style raw flag also helps improve text accuracy in some test cases.
🚫 ✅ In my tests, Midjourney v6.1 produced mixed results. It could not reproduce the text requested in the prompt more than 50% of the time. The output was correct in some test cases and close to the prompt in others, but it repeated words and punctuation just as often.
OpenAI DALL·E 3
OpenAI DALL·E 3 was released over one year ago, in October 2023. According to OpenAI, “DALL·E 3 represents a leap forward in our ability to generate images that exactly adhere to the text you provide. DALL·E 3 understands significantly more nuance and detail than our previous systems [DALL·E 2], allowing you to easily translate your ideas into exceptionally accurate images.”
Tests of OpenAI DALL·E 3 were run on ChatGPT. According to OpenAI, “DALL·E 3 is built natively on ChatGPT, which lets you use ChatGPT as a brainstorming partner and refiner of your prompts. Just ask ChatGPT what you want to see in anything from a simple sentence to a detailed paragraph.”
🚫 In my tests, OpenAI DALL·E 3 could not accurately reproduce the text requested in the prompt.
Stability AI Stable Diffusion 3.5 Large
The Stable Diffusion 3.5 Large model was released in October 2024. According to Stability AI, “at 8.1 billion parameters, with superior quality and prompt adherence, this base model is the most powerful in the Stable Diffusion family. This model is ideal for professional use cases at 1 megapixel resolution.” Stable Diffusion 3.5 Large was tested using the Stability AI REST API and code written in Python within a Jupyter Notebook.
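A minimal sketch of that REST call follows; the endpoint, header, and form fields reflect Stability AI's public v2beta API as I understand it, so verify them against the current documentation:

```python
import os

import requests

# Stability AI v2beta Stable Image endpoint for the SD3 family
# (assumption: current as of this writing; check the official docs)
url = "https://api.stability.ai/v2beta/stable-image/generate/sd3"

response = requests.post(
    url,
    headers={
        "authorization": f"Bearer {os.environ['STABILITY_API_KEY']}",
        "accept": "image/*",  # return raw image bytes rather than JSON
    },
    files={"none": ""},  # forces multipart/form-data encoding
    data={
        "prompt": (
            "Vegetable stand with various vegetables, including tomatoes. "
            "A black sign with white type reads: 'Farm Fresh Tomatoes $2.99/lb.'"
        ),
        "model": "sd3.5-large",
        "output_format": "png",
    },
)
response.raise_for_status()

with open("sd35_large_test.png", "wb") as f:
    f.write(response.content)
```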
✅ In my tests, Stability AI Stable Diffusion 3.5 Large could accurately reproduce the text requested in the prompt more than 50% of the time, occasionally with slight punctuation errors.
Stability AI Stable Image Ultra
According to Stability AI, the 16 billion-parameter Stable Image Ultra model, released in October 2024, “is our flagship model, blending the power of the SD3 Large with advanced workflows to deliver the highest-quality photorealistic images. This premium model is designed for industries that require unparalleled visual fidelity, such as marketing, advertising, and architecture.” Like Amazon Titan Image Generator, the Stability AI Stable Image Ultra model was also tested on Amazon Bedrock, using the Image Playground UI.
✅ In my tests, Stability AI Stable Image Ultra could accurately reproduce the text requested in the prompt more than 50% of the time. Along with Black Forest Labs FLUX1.1 [pro], it was one of the best models tested.
AI Alternatives to Generating Text
The Black Forest Labs FLUX1.1 [pro] and Stability AI Stable Image Ultra models accurately reproduce requested phrases in prompts more frequently than other models. However, users still lack control over many aspects of the images, including the exact position, size, kerning, color, and font style of the text. Several alternative and more reliable techniques exist to guarantee the accuracy of text in generated images.
Replace Generated Text
One alternative approach is to generate the image with the desired text, regardless of spelling mistakes. Subsequently, one can remove the text in Adobe Photoshop and replace it with correct text in the exact position, size, color, and style desired. However, removing and recreating text can be challenging if foreground subjects or shadows partially obscure it, or if the text appears on an irregular surface. To enhance the realism of the new text, one can rasterize the vector type and then add noise, blurring, distortion, lighting, texturing, and layer blending effects.
Below are two examples of this workflow. In each, the image was generated with Black Forest Labs FLUX1.1 [pro] Ultra (first image), the text was removed in Adobe Photoshop (second image), new vector-based text was added (third image), and finally, the text was rasterized and distorted to appear more realistic (fourth image).
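Although my workflow used Photoshop, the rasterize-and-roughen step can also be scripted. Below is a minimal Pillow sketch of the idea, assuming the flawed text has already been removed; the file paths, font, coordinates, and effect strengths are placeholders:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFilter, ImageFont

# Load the generated image after the flawed text has been removed
# (the removal itself was done in Photoshop; paths are placeholders)
image = Image.open("sign_text_removed.png").convert("RGB")

# Draw the corrected text on a transparent layer
layer = Image.new("RGBA", image.size, (0, 0, 0, 0))
draw = ImageDraw.Draw(layer)
font = ImageFont.truetype("arial.ttf", 64)  # placeholder font and size
draw.text((300, 420), "Flawless AI-generated text!", font=font, fill=(40, 30, 20, 255))

# "Rasterize" effects: a slight blur softens the hard vector edges...
layer = layer.filter(ImageFilter.GaussianBlur(radius=0.8))

# ...and a touch of noise helps the text match the photo's grain
arr = np.array(layer).astype(np.int16)
noise = np.random.normal(0, 6, arr[..., :3].shape).astype(np.int16)
arr[..., :3] = np.clip(arr[..., :3] + noise, 0, 255)
layer = Image.fromarray(arr.astype(np.uint8), "RGBA")

# Composite the roughened text back onto the photograph
result = Image.alpha_composite(image.convert("RGBA"), layer).convert("RGB")
result.save("sign_text_replaced.png")
```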
Start with a Blank Canvas
A second alternative is to generate the image without text and then add your text in the desired color, size, and font style using Adobe Photoshop. This technique is more straightforward than retouching the generated image to remove existing text. The examples were created using the Replicate API with Python from a Jupyter Notebook to call Black Forest Labs’ FLUX1.1 [pro] and Ultra.
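A call to the Ultra variant looks nearly identical to the earlier FLUX1.1 [pro] sketch; below is a minimal example, with the model slug and the raw flag based on Replicate's public listing rather than guaranteed to match my exact setup:

```python
import replicate

# Ultra variant slug on Replicate (assumption; verify against the model page)
output = replicate.run(
    "black-forest-labs/flux-1.1-pro-ultra",
    input={
        "prompt": (
            "A photograph of a smiling female scientist in a lab coat, "
            "standing in a lab, holding a white rectangular sign with no "
            "wording or other elements."
        ),
        "aspect_ratio": "1:1",
        "raw": False,  # set True for the less-processed Raw Mode aesthetic
    },
)

with open("blank_sign.png", "wb") as f:
    f.write(output.read())
```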
Below is an image generated with Black Forest Labs FLUX1.1 [pro] Ultra using the prompt: “A photograph of a smiling female scientist in a lab coat, standing in a lab, holding a white rectangular sign with no wording or other elements.” The generated image (first image) has new text added (second image), and finally, the text is distorted to appear more realistic (third image).
Below is another example that begins with a generated image containing no text, to which text was later added. The initial image was generated with Black Forest Labs FLUX1.1 [pro] Ultra using the prompt: “Vegetable stand with various vegetables, including tomatoes. A small, rectangular, blank, black sign with no text or other elements sits beside the tomatoes.”
One last example uses the prompt “A sleek billboard towers above a bustling interstate at rush hour, cars whizzing by. A colorful, dynamic, abstract background fills the billboard.” to generate the original image.
Generate Image and Text Separately
A third and final technique is to generate the image and text separately using your model of choice, then combine the two elements in post-production using Adobe Photoshop. Below is the original image from Midjourney on the left, without text, generated using the prompt: “Vegetable stand with various vegetables, including tomatoes. An empty, blank blackboard-like sign. --ar 1:1”
The white type on a black background in the center was also generated in Midjourney, using the prompt: “The phrase ‘Farm Fresh Tomatoes $2.99/lb.’ written in white chalk letters on a solid jet black background. --no tomatoes or other objects --ar 3:2 --style raw --stylize 0”
The text-only image is then easily overlaid on the first image by setting the text-only layer to the Lighten blending mode. Additional distortions can be applied to make the text look more natural in the final image.
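Lighten keeps, per channel, the brighter of the two layers' pixels, so the white chalk lettering survives while the solid black background drops out. For readers compositing in code instead of Photoshop, a minimal Pillow equivalent might look like this (file paths, sizes, and coordinates are placeholders):

```python
from PIL import Image, ImageChops

# Base image: vegetable stand with a blank blackboard-style sign
base = Image.open("tomato_stand_blank_sign.png").convert("RGB")

# Text-only image: white chalk lettering on a solid black background,
# resized to cover the sign region (placeholder dimensions)
text = Image.open("chalk_text_black_bg.png").convert("RGB").resize((600, 400))

# Paste the text layer onto a black canvas the size of the base image,
# aligned with the sign, then take the per-channel maximum (Lighten)
canvas = Image.new("RGB", base.size, (0, 0, 0))
canvas.paste(text, (220, 640))  # placeholder sign position
result = ImageChops.lighter(base, canvas)

result.save("tomato_stand_composited.png")
```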
Conclusion
In this post, we explored the capabilities of nine different state-of-the-art text-to-image generation models from various providers to generate accurate text within images from prompts. We discovered that Black Forest Labs FLUX1.1 [pro] and Stability AI’s Stable Image Ultra were more successful at accurately reproducing requested text in images compared to other models. Finally, we examined three alternative and more reliable techniques for ensuring the accuracy of text in generated images.