Synthetic Data Generation with Language Models: A Practical Guide

Rifx.Online
Programming , Data Science , Generative AI
15 Dec, 2024

In the evolving landscape of artificial intelligence, data remains the fuel that powers innovation. But what happens when acquiring real-world data becomes challenging, expensive, or even impossible?

Enter synthetic data generation — a groundbreaking technique that leverages language models to create high-quality, realistic datasets. Consider training a language model on medical records without breaching privacy laws, or developing a customer interaction model without access to private conversation logs, or designing autonomous driving systems where collecting data on rare edge cases is nearly impossible. Synthetic data bridges gaps in data availability while maintaining the realism needed for effective AI training.

Beyond addressing data shortages, synthetic data enhances AI development by balancing imbalanced datasets (e.g., in fraud detection or rare medical conditions), simulating rare events, and augmenting limited data with realistic variations. Companies can accelerate development, improve model robustness, and experiment with datasets otherwise unavailable.

While the benefits of synthetic data — such as scalability, privacy preservation, and the ability to simulate hard-to-capture scenarios — are clear, it also has limitations, including limited real-world credibility, overfitting, and bias, which require careful consideration.

In this article, we’ll explore synthetic data generation, discuss its limitations and ways to overcome them, and show you how to implement your own synthetic data generator in Python.

How to Overcome the Limitations of Synthetic Data

1. Lack of Real-World Authenticity

Synthetic data may not fully capture the nuances and variability of real-world data, leading to models that perform well in controlled environments but fail in real-world applications.

How to Overcome:

Hybrid Approach: Use synthetic data to augment real data, not replace it. A combination ensures that the model can generalize to unseen, real-world scenarios.
Validation on Real Data: Always validate models on real-world datasets, even if training is done with synthetic data, to assess performance in practical applications and to ensure robustness.

2. Overfitting and Bias

Models trained on synthetic data might overfit to the patterns in that data, which may not exist in real-world data. This can lead to poor generalization when deployed. Also, Synthetic data can inherit or amplify biases present in the models used to generate it. This can result in biased predictions.

How to Overcome:

Data Regularization: Apply data augmentation techniques and introduce noise in synthetic data to mimic the randomness and variability of real-world data.
Diverse Data Generation: Ensure diversity in the synthetic data by using multiple models and methods to generate data from different perspectives.

In addition, keep in mind that ensuring the quality and representativeness of synthetic data can be difficult and often a little experimentation with few-shot learning (FSL) and chain-of-thought (CoT) prompting in prompt engineering can go a long way. We shall illustrate these in more detail below.

Synthetic Data Generator Implementation

You can run this tutorial on the Intel® Tiber™ AI Cloud environment, which is equipped with an Intel® Xeon® CPU. This platform provides ample computing resources ensuring smooth execution of our code.

Environment Setup

Let’s begin with importing the necessary libraries. In our demo we shall use Llama 3.1 and you will need a Hugging Face token to access this model’s gated repository. You may create and access your tokens directly from your Hugging Face account. To do so, select “Access Tokens” from your settings menu and create a token with the “write” permission.

Now, you can insert your token in your Python script. (Do not share your Access Tokens with anyone; Hugging Face removes any leaked Access Tokens.)

import random
import pandas as pd
from transformers import pipeline
from huggingface_hub import login

login("your_token")

Next, go to meta-llama/Meta-Llama-3.1–8B-Instruct and read the license before providing your information and submitting the Llama 3.1 access request.

Implementation

Let’s say we want to generate synthetic customer service texts classified by the following labels

labels = ["polite", "somewhat polite", "neutral", "impolite"]

in these contexts

categories_types = {
    "travel": ["air", "train"],
    "stores": ["appliances", "toys and games"]
}

We shall randomly select labels and categories and instruct the language model to generate synthetic data based on the specified categories and labels.

Randomness will ensure data regularization; see the second challenge (Overfitting and Bias) above. Once we have selected a context category, we randomly choose a corresponding type from our dictionary as follows.

random.choices(categories_types[category])

Here’s how we go about the full implementation: we generate data in batches and our function randomly assigns labels and categories to the batch’s samples. For each sample in the batch, the sdg function:

Creates a prompt that instructs the language model to generate a synthetic customer service response based on the assigned label and category.
Uses the language model to generate a response to the prompt.
Extracts the relevant text from the generated response. You can leave the text_extraction function as an identity function for now, since its exact definition depends on factors like the prompt. It can be easily handled with regular expressions, for example.

Finally, each batch of the generated responses, along with their labels and the model used is appended to a CSV file.

def sdg(
    sample_size,
    labels,
    categories_types,
    batch_size,
    output_path="./output.csv",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
):
    """
    Generates synthetic data based on specified categories and labels.

    Args:
        sample_size (int): The number of synthetic data samples to generate.
        labels (list of str): The labels used to classify the synthetic data.
        categories_types (dict): The categories and their types for data generation and diversification.
        batch_size (int): The number of samples per batch to append to the output file.
        output_dir (str): The directory path where the output file will be saved.
        model (str): The large language model used for generating the synthetic data.
    """
    
    categories = list(categories_types.keys())

    # If sample_size is not divisible by batch_size, an extra batch is added
    num_batches = (sample_size + batch_size - 1) // batch_size

    print(f"Synthetic data will be appended to {output_path} in {num_batches} batches.")

    for batch in range(num_batches):
        # Calculate the start and end indices for the current batch
        start = batch * batch_size
        end = min(start + batch_size, sample_size)

        # Store results of the current batch
        batch_data = []

        # Assign random labels to the current batch
        batch_random_labels = random.choices(labels, k=batch_size)

        # Assign random categories to the current batch
        batch_random_categories = random.choices(categories, k=batch_size)

        for i in range(start, end):
            random_type = random.choices(
                categories_types[batch_random_categories[i - start]]
            )
            prompt = f"""I am creating synthetic OUTPUT to fine-tune
            my BERT model. The use case is customer service chatbots.
            You should generate only one OUTPUT for the classification
            LABEL: {batch_random_labels[i - start]} in CATEGORY:
            {batch_random_categories[i - start]} and TYPE
            {random_type}. 

            Examples. 
            OUTPUT: The fee you’re seeing is likely related
            to our standard account maintenance charges. I can provide
            more details if needed.

            OUTPUT: You can return it, but only if you have the
            receipt and it’s within the return window.

            OUTPUT: It's not our fault your baggage didn't make it.
            What do you expect us to do about it now?

            OUTPUT: I apologize for the trouble you’ve had with the
            heater. We can certainly look into a return or exchange.
            Please bring in your receipt, and we’ll take care of it
            for you.

            Only return one OUTPUT and not the LABEL or the CATEGORY.
            """
            messages = [
                {
                    "role": "system",
                    "content": f"You are a helpful assistant designed to generate synthetic customer service data with labels {labels} in categories {categories}.",
                },
                {"role": "user", "content": prompt},
            ]
            generator = pipeline("text-generation", model=model)
            result = generator(messages, max_new_tokens=128)[0]["generated_text"][-1][
                "content"
            ]

            result = text_extraction(result)
            batch_data.append(
                {
                    "text": result,
                    "label": batch_random_labels[i - start],
                    "model": model,
                }
            )

        # Convert the batch results to a DataFrame
        batch_df = pd.DataFrame(batch_data)

        # Append the DataFrame to the CSV file
        if batch == 0:
            # If it's the first batch, write headers
            batch_df.to_csv(output_path, mode="w", index=False)
        else:
            # For subsequent batches, append without headers
            batch_df.to_csv(output_path, mode="a", header=False, index=False)
        print(f"Saved batch number {batch + 1}/{num_batches}")

Here’s a sample output.

| text                                                                                                                                                                                       | label           | model                                 |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+---------------------------------------|
| You're still whining about your membership renewal fee? It's not like we're the ones who raised the prices, it's the board's decision. You should just deal with it and stop complaining.  | impolite        | meta-llama/Meta-Llama-3.1-8B-Instruct |
| I'm not sure why our membership fees are higher this quarter, but I can check on the pricing for our tennis courts and see if there's a way to adjust your plan to fit your budget better. | somewhat polite | meta-llama/Meta-Llama-3.1-8B-Instruct |

Further Improvements

To improve the quality of the outputs of our data generator, we could modify the prompt and diversify the model. We discuss each of these briefly.

Prompt

It’s good practice to pass explicit label descriptions to the model through the prompt. For instance, we could add the lines

polite: Text is considerate and shows respect and good manners, often including courteous phrases and a friendly tone.
somewhat polite: Text is generally respectful but lacks warmth or formality, communicating with a decent level of courtesy.
neutral: Text is straightforward and factual, without emotional undertones or specific attempts at politeness.
impolite: Text is disrespectful or rude, often blunt or dismissive, showing a lack of consideration for the recipient's feelings.

to our prompt. Additionally, we could require the language model to provide its reasoning to support the text generation for the specified label. Here is such an improved prompt.

prompt = f"""You should create synthetic data for specified labels and categories. 
            This is especially useful for developing customer service chatbots.
              
            Label descriptions:
            - polite: Text is considerate and shows respect and good manners, often including courteous phrases and a friendly tone.
            - somewhat polite: Text is generally respectful but lacks warmth or formality, communicating with a decent level of courtesy.
            - neutral: Text is straightforward and factual, without emotional undertones or specific attempts at politeness.
            - impolite: Text is disrespectful or rude, often blunt or dismissive, showing a lack of consideration for the recipient's feelings.

            Examples.

            LABEL: somewhat polite
            CATEGORY: travel
            TYPE: train
            OUTPUT: I understand your concern about your booking, and I'll check what options we have for you.
            REASONING: This text would be classified as "somewhat polite."
            The acknowledgment of the customer's concern shows a basic level of respect.
            The sentence is direct and lacks additional warmth or formality, but it communicates a willingness to help.
            The use of "I'll check" is a straightforward commitment to action without additional courteous phrases that would make it fully polite.
            
            LABEL: neutral
            CATEGORY: stores
            TYPE: appliances
            OUTPUT: Your TV will be delivered within three to five business days.
            REASONING: This text would be classified as "neutral."
            The sentence is purely informational, providing the facts about delivery time without any emotional undertones.
            There are no phrases that express politeness or rudeness; it's a straightforward statement.
            The tone is impersonal and focused solely on conveying the necessary information.
            ####################
            You should generate one OUTPUT for the classification below.
            Only return the OUTPUT and REASONING. 
            Do not return the LABEL, CATEGORY, or TYPE.

            LABEL: {batch_random_labels[i - start]}
            CATEGORY: {batch_random_categories[i - start]}
            TYPE: {random_type}
            OUTPUT:
            REASONING:
            """

Diversity

To further diversify the output data, one can pass multiple different language models to the synthetic data generator. When we used identical generators and prompts on Llama-3.1–8B-Instruct, gemma-2–9b-it, and Mixtral-8x7B-Instruct-v0.1, we observed the following percentages of duplicated data.

Llama: 0.04%
Gemma: 94.6%(Note: This model wasn’t trained with any system instructions, so you need to modify messages accordingly.)
Mixtral: 7%

Gotcha Alert In some edge cases the language model might generate the same text for different labels! For instance, when we ran the generator with Llama 3.1, the following output was generated for both neutral and somewhat polite labels.

I'm afraid the toy you're looking for is currently out of stock, but we do have a similar product that might interest you. Would you like me to check availability?

Conclusion

Synthetic data generation with language models is a powerful tool that has the potential to reshape the future of AI. Whether you’re a researcher, developer, or business leader, understanding this technology could provide a competitive edge in the evolving AI landscape.

If you’re interested in exploring how synthetic data can revolutionize your AI projects, consider diving deeper into language models, writing your custom data generators, and experimenting with existing data generation tools to unlock new possibilities.

For more AI development how-to content, visit Intel® AI Development Resources.