Overcoming LLM Challenges in Healthcare: Practical Strategies for Development in Production
An article on the most common LLM development challenges I’ve encountered, effective mitigation strategies, and a career-defining interview mistake
Introduction
I’ve always been the type to dive deep into a subject and specialize to obsession. When I graduated from my master’s in data science, the obsession I had was with computer vision; specifically, computer vision to apply towards neuroscience or mental health applications. I was set on becoming a “computer vision engineer” (but “machine learning engineer” would be okay too) in the mental health field, despite my mentors urging me to broaden my scope and get my foot in the door. I silenced my own wary voices, convinced that the right team would recognize my “expertise”.
Luckily, my theory seemed to work; I landed interviews with several mental health companies. But then came one of my biggest interview mistakes. In the final round for my top choice — a company I loved — I made an error that still makes me internally cringe when I reflect. The role was NLP-focused, working with text data, but I couldn’t help expressing my interest in imaging data. Cries in recollection. I vividly recall the interviewer’s expression transforming from one of excitement to one of concern the moment I asked about imaging data availability, as I was still drawn to computer vision. Later that day, I received a polite rejection: they loved my passion but needed someone fully committed to NLP.
Ironically, I soon joined another mental health company and shifted fully to NLP work, creating anxiety and depression symptom detectors that improved clinical care and developing recommendation systems that boosted content discoverability by 12%. Fast-forward a few years, and I’m now the NLP/LLM data scientist on my team, with 6 information extraction tasks, 5 classification tasks, and 5 conditional summarization tasks deployed across 15+ hospitals and five clients.
A couple of weeks ago, I was asked to present “LLM development 101” to my larger data team. Initially, imposter syndrome crept in — what could I share for 45 minutes on LLM development? But as I created my slides, I realized how much I had to say and grew excited about sharing the depth of knowledge I’ve learned. This excitement led to the article you’re reading right now. In this article, I’ll walk through some common challenges I’ve encountered with LLMs in production and the strategies that have helped me solve them.
1. Output Format Errors
Surprisingly, this is probably the most frequent issue I encounter. Output format reliability can vary significantly depending on the model I’m working with. For example, GPT-4 Turbo generally provides consistent JSON outputs, but GPT-4o tends to be less reliable in this regard. With GPT-4o, I’ve encountered everything from lists and strings to incomplete dictionaries when a structured JSON output was explicitly requested. If these format issues aren’t caught and the model isn’t re-run, I risk having incomplete data coverage.
Impact of Format Errors
Inconsistent output formats can have a significant impact on downstream processes. If the data structure is incorrect, it could lead to failures in subsequent processing steps, skew reporting accuracy, or even result in incomplete insights if left undetected. In high-stakes fields like healthcare, where my work applies, incomplete or mis-structured data can have real implications, making format consistency essential.
Mitigation
To handle this, I’ve implemented format-checking logic that validates the output structure. If it’s incorrect, I re-run the model until it matches the expected format. Additionally, I use logging to capture format-related errors. Re-running the model, however, comes with trade-offs, such as increased latency and higher API costs. I establish thresholds for re-running based on the criticality of the data coverage and cost limitations. If re-running isn’t feasible, I sometimes apply post-processing to “repair” the output structure, though this approach carries its own risks of introducing errors or inconsistencies.
To illustrate this approach, here’s a sample code snippet that requests patient data in JSON format with specific keys like "name", "age", and "insurance". This code demonstrates a method to verify that the model’s response includes all required fields and adheres to the expected structure. By implementing retry logic, the code aims to ensure data consistency, reducing the risks associated with format errors in critical workflows.
import json
from typing import Any, Dict, Optional, Set

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def get_llm_response(prompt: str, required_keys: Set[str], retries: int = 3) -> Optional[Dict[str, Any]]:
    """
    Calls the language model to get a response in JSON format. If the response
    is not valid JSON or lacks required keys, retries the call up to `retries` times.

    Parameters:
        prompt (str): The prompt sent to the language model.
        required_keys (Set[str]): A set of required keys that must be present in the JSON response.
        retries (int): The maximum number of retries if the output format is invalid.

    Returns:
        Optional[Dict[str, Any]]: Parsed JSON response if successful; None if retries are exhausted.
    """
    for attempt in range(retries):
        try:
            # GPT-4o is a chat model, so use the chat completions endpoint
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=100,
                temperature=0.7,
            )
            # Attempt to parse the response as JSON
            response_text = (response.choices[0].message.content or "").strip()
            parsed_response = json.loads(response_text)

            # Check that the parsed response is a dict and contains every required key
            if isinstance(parsed_response, dict) and required_keys.issubset(parsed_response.keys()):
                return parsed_response
            print(f"Attempt {attempt + 1}: Output format invalid or missing required keys, retrying...")
        except json.JSONDecodeError as e:
            print(f"Attempt {attempt + 1}: Error parsing JSON - {e}, retrying...")

    print("Max retries exceeded: unable to get valid JSON output with required keys.")
    return None
2. Hallucinations
Hallucinations happen when the model invents information that sounds plausible but isn’t actually grounded in the source. For instance, when I’m trying to pull quotes from source text, sometimes the model decides to “get creative” and produces similar-sounding but completely fabricated phrases. In fields where accuracy is crucial, like healthcare, small hallucinations can lead to large issues.
Mitigation
I address hallucinations by implementing post-processing logic to validate that, for any information extraction tasks, the context pulled matches the source text exactly. To ensure that minor variations don’t lead to missed matches, I standardize the text by stripping punctuation and converting everything to lowercase when comparing the source and retrieved text. Additionally, several other strategies help minimize hallucinations. For instance, chain-of-thought prompting, where the model explains each step of its reasoning, can produce more grounded outputs and reduce the likelihood of inaccurate output. In high-stakes applications (such as healthcare use cases), human-in-the-loop checks are important as an extra layer of review, helping catch hallucinations that automated processes might miss. Lastly, prompts that emphasize factual accuracy, such as instructing the model to “only use exact phrases from the source,” can guide the model toward more precise responses.
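Here’s a minimal sketch of that validation step. The helper names are hypothetical, and it assumes the model has returned a list of extracted quotes; the idea is simply to normalize case and punctuation on both sides, then confirm each quote appears verbatim in the source.

import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so minor formatting differences don't cause false mismatches."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def validate_extracted_quotes(quotes: list[str], source_text: str) -> dict[str, bool]:
    """Flag each extracted quote based on whether it appears verbatim in the normalized source."""
    normalized_source = normalize(source_text)
    return {quote: normalize(quote) in normalized_source for quote in quotes}

# Example: the second "quote" does not exist in the source, so it is flagged as a likely hallucination
source = "Patient reports difficulty sleeping and reduced appetite over the past two weeks."
extracted = ["difficulty sleeping", "trouble concentrating"]
print(validate_extracted_quotes(extracted, source))
# {'difficulty sleeping': True, 'trouble concentrating': False}

Anything flagged False is either re-run or routed to review rather than passed downstream.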
3. Outdated Information
Outdated information can be challenging to manage, especially in applications where accuracy and timeliness are essential. Sometimes, a model might retrieve information from older sections of a document and surface it as if it’s current. With Retrieval-Augmented Generation (RAG), this issue can become even more complex, as RAG retrieves content based solely on relevance rather than timeliness or specific document sections. The absence of section labels or timestamps means RAG may pull from parts of a document that seem relevant without discerning if they’re outdated, which risks mixing older and current information. Another challenge with using a vector database is that if we store entire documents, we can’t easily remove specific sections without clearly defined labels, making it hard to filter out irrelevant information effectively.
Mitigation
To address this, I specify “current” or “most recent” data directly in the prompt and use preprocessing steps to remove any outdated sections before passing data to the model. This extra preprocessing step keeps only the latest, most relevant information, which helps the model provide timely, accurate responses and also reduces the cost of the call, since fewer tokens are sent. By implementing these filters in advance, I can maintain consistency and relevance in the model’s outputs.
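As a rough sketch of that kind of preprocessing filter: it assumes, purely for illustration, that each document section carries a label and a timestamp; the field names, labels, and the 90-day cutoff are made up for the example.

from datetime import datetime, timedelta
from typing import Dict, List

def filter_current_sections(
    sections: List[Dict],
    max_age_days: int = 90,
    excluded_labels: tuple = ("past_medical_history", "archived"),
) -> List[Dict]:
    """Keep only recent, relevant sections before building the prompt or indexing into a vector store."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    return [
        section for section in sections
        if section["label"] not in excluded_labels and section["timestamp"] >= cutoff
    ]

# Hypothetical document sections with labels and timestamps
sections = [
    {"label": "current_medications", "timestamp": datetime.now() - timedelta(days=2), "text": "..."},
    {"label": "archived", "timestamp": datetime(2021, 3, 1), "text": "..."},
]
current_only = filter_current_sections(sections)
prompt_context = "\n\n".join(section["text"] for section in current_only)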
4. Over-Reliance and Ethics
As much as I would love for the work I do to be used and useful, my biggest fear is that users would trust the model predictions a bit too much — especially in the healthcare space, where generative AI is often producing summaries or extracting specific patient details, not just making predictions. Experts may hold differing views on certain definitions, so diversity and dialogue are important for reaching a consensus. Over-reliance on these predictions could lead care teams to limit these conversations and overlook errors they might otherwise examine more closely.
Mitigation
I prioritize educating the team on the model’s limitations, including its tendency for errors, and encourage them to see AI as a complement to human expertise. In healthcare, where nuance is critical, human-in-the-loop oversight is essential for high-impact cases, allowing experts to review AI outputs and reduce risks from over-reliance. This collaborative approach allows AI to amplify expert insights, maintaining the reliability and ethical integrity that high-stakes applications demand.
5. Rapid Model Deprecation
With the rapid pace of development in AI, model and API versions are updated frequently, and it’s common for versions to be deprecated faster than expected. If you’ve ever had a workflow break unexpectedly because a model version was retired, you’ll know how disruptive this can be. This has happened several times in the past year, requiring us to quickly re-do analyses to ensure the newer model versions still perform as expected.
Mitigation
We make it a priority to do regular check-ins to monitor model versions and stay ahead of deprecation warnings. This proactive approach lets us plan transitions in advance and avoid the last-minute scramble. While it’s a small step, it makes a significant difference in maintaining smooth operations.
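One lightweight way to support those check-ins is to pin model versions in a single place and verify them against the provider’s current model list. The sketch below is illustrative only: the pinned model IDs and task names are example assumptions, not a real configuration.

from openai import OpenAI

# Model versions pinned in one place so an upgrade is a deliberate, reviewed change
PINNED_MODELS = {
    "extraction": "gpt-4-turbo-2024-04-09",
    "summarization": "gpt-4o-2024-08-06",
}

def check_pinned_models() -> None:
    """Warn if any pinned model is no longer served by the API (e.g., retired or renamed)."""
    client = OpenAI()
    available = {model.id for model in client.models.list()}
    for task, model_id in PINNED_MODELS.items():
        if model_id not in available:
            print(f"WARNING: pinned model '{model_id}' for task '{task}' is unavailable; plan a migration.")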
6. Rate Limiting with APIs
API rate limits are a subtle but significant challenge, especially when working with high volumes of requests. Hitting a rate cap can create delays, slow down real-time workflows, or even halt entire processes. In cases where we’re processing time-sensitive data, reaching the limit can be highly disruptive, as workflows come to an unexpected stop. This is especially problematic in healthcare settings, where timing can directly impact operations and patient care.
Mitigation
To mitigate this, we’ve implemented a proactive approach by tracking API usage patterns to identify peak times and reduce non-essential calls. By staggering requests and batching calls, I can distribute the load more evenly and avoid exceeding limits. In situations where demand is high and rate limits are consistently reached, requesting additional quota from the provider can offer a practical solution. Balancing usage has been essential, and understanding our peak times and usage patterns ahead of time has proven crucial for maintaining a stable, uninterrupted workflow.
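On the retry side, a simple exponential-backoff wrapper around the API call goes a long way. The sketch below uses the openai client’s RateLimitError; the delay values are arbitrary illustrations, and in practice I’d pair it with the batching and usage tracking described above.

import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def call_with_backoff(prompt: str, max_attempts: int = 5, base_delay: float = 2.0) -> str:
    """Retry on rate-limit errors, doubling the wait each time (2s, 4s, 8s, ...)."""
    for attempt in range(max_attempts):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except RateLimitError:
            delay = base_delay * (2 ** attempt)
            print(f"Rate limit hit, sleeping {delay:.0f}s before retry {attempt + 1}/{max_attempts}...")
            time.sleep(delay)
    raise RuntimeError("Rate limit retries exhausted; consider batching or requesting a higher quota.")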
Concluding Remarks
These are just six of the common issues I’ve faced while working with LLMs. I didn’t expect to find myself here, but taking a step back to reflect, I realize how much expertise I’ve developed in this space — and I’m incredibly excited to continue to share these learnings in upcoming articles. I’d love to hear from others about the challenges they’ve encountered and the mitigation strategies or workarounds they’ve found effective, whether related to these issues or new ones entirely. I hope these insights are helpful and spark further conversation around best practices in this quickly evolving field (where model versions and API versions deprecate a little too quickly).