Exploring LLMs for ICD Coding - Part 1 | by Anand Subramanian


May 17, 2024


In this scenario, the LLM identifies both ICD Code 1 and ICD Code 2 as relevant to the medical note. The algorithm then examines the child nodes of each code. Each parent code has two child nodes representing more specific ICD codes. Starting with ICD Code 1, the LLM uses the descriptions of ICD Code 1.1 and ICD Code 1.2, together with the medical note, to determine the relevant codes. The LLM concludes that ICD Code 1.1 is relevant, while ICD Code 1.2 is not. Since ICD Code 1.1 has no further child nodes, the algorithm checks whether it is an assignable code and assigns it to the document. Next, the algorithm evaluates the child nodes of ICD Code 2. Invoking the LLM again, it determines that only ICD Code 2.1 is relevant. This is a simplified example; in reality, the ICD tree is extensive and deeper, meaning the algorithm will continue to traverse the children of each relevant node until it reaches the end of the tree or exhausts valid traversals.
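The walk-through above can be sketched on a toy tree. This is only an illustration under stated assumptions: `TOY_TREE`, `RELEVANT`, `select_relevant`, and `toy_tree_search` are hypothetical names, and the LLM's judgement is replaced by a fixed lookup so the example runs offline.

```python
from collections import deque

# Hypothetical toy tree mirroring the example: two parent codes, two children each.
TOY_TREE = {
    "ROOT": ["1", "2"],
    "1": ["1.1", "1.2"],
    "2": ["2.1", "2.2"],
}

# Stand-in for the LLM: the codes it would judge relevant to the note.
RELEVANT = {"1", "2", "1.1", "2.1"}


def select_relevant(candidates):
    """Pretend LLM call: keep only the candidates judged relevant."""
    return [code for code in candidates if code in RELEVANT]


def toy_tree_search(root="ROOT", max_prompts=50):
    """Expand relevant nodes breadth-first until the tree or the prompt budget is exhausted."""
    assigned = []
    queue = deque([root])
    prompts = 0
    while queue and prompts < max_prompts:
        node = queue.popleft()
        children = TOY_TREE.get(node, [])
        prompts += 1  # one (pretend) LLM invocation per expanded node
        for code in select_relevant(children):
            if code in TOY_TREE:
                queue.append(code)  # has children: traverse further
            else:
                assigned.append(code)  # no children: treat as assignable
    return assigned, prompts


codes, prompts_used = toy_tree_search()
print(codes, prompts_used)
```

Even on this tiny tree, three invocations are needed to assign two codes, which hints at why the number of LLM calls grows quickly on the real ICD hierarchy.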

Highlights

This method does not require any fine-tuning of the LLM; it leverages the LLM’s ability to contextually understand the medical note and dynamically identify the relevant ICD codes based on the provided descriptions.
Furthermore, the paper shows that LLMs can effectively adapt to a large output space when given relevant information in the prompt, outperforming PLM-ICD [6] on rare codes in terms of macro-average metrics.
This technique also outperforms the baseline of directly asking the LLM to predict the ICD codes for a medical note based on its parametric knowledge. This highlights the potential of integrating LLMs with tools or external knowledge for solving clinical coding tasks.

Drawbacks

The algorithm invokes the LLM at every level in the tree. This leads to a high number of LLM invocations as you traverse the tree, compounded by the vastness of the ICD tree, and results in high latency and cost when processing a single document.
As the authors also note in the paper, in order to correctly predict a relevant code, the LLM must correctly identify its parent nodes at all levels. If a mistake is made at even one level, the LLM will be unable to reach the final relevant code.
The authors were unable to evaluate their method on datasets like MIMIC-III due to limitations that prohibit the transfer of data to external services such as OpenAI’s GPT endpoints. Instead, they evaluated the method using the test set of the CodiEsp dataset [7,8], which includes 250 medical notes. The small size of this dataset means that the method’s effectiveness on larger clinical datasets is yet to be established.
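To make the second drawback concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that the LLM picks the correct branch at each level independently with the same probability; neither that assumption nor the 0.9 figure comes from the paper, and `reach_probability` is a hypothetical helper.

```python
def reach_probability(per_level_accuracy, depth):
    """Chance of following the correct path through `depth` levels,
    assuming independent decisions with equal accuracy at each level."""
    return per_level_accuracy ** depth


# Illustrative values only: 0.9 is a made-up per-level accuracy.
for depth in (1, 3, 5):
    print(depth, round(reach_probability(0.9, depth), 3))
```

Under these simplifying assumptions, the chance of reaching a deep code shrinks geometrically with depth, which is why a single early mistake is so costly.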

All code and resources related to this article are available at this link, with a mirror of the repo available in my original blog-related repository. I want to stress that my reimplementation is not exactly identical to the paper and differs in subtle ways that I have documented in the original repository. I have tried to replicate the prompts used for invoking GPT-3.5 and Llama-70B based on the details in the original paper. For translating the datasets from Spanish to English, I created my own prompt, as the details were not available in the paper.

Let’s implement the technique to better understand how it works. As mentioned, the paper uses the CodiEsp test set for its evaluation. This dataset consists of Spanish medical notes along with their ICD codes. Although the dataset includes an English translated version, the authors note that they translated the Spanish medical notes into English using GPT-3.5, which they claim provided a modest performance improvement over using the pre-translated version. We replicate this functionality and translate the notes into English.

def construct_translation_prompt(medical_note):
    """
    Construct a prompt template for translating Spanish medical notes to English.

    Args:
        medical_note (str): The medical case note.

    Returns:
        str: A structured template ready to be used as input for a language model.
    """
    translation_prompt = """You are an expert Spanish-to-English translator. You are provided with a clinical note written in Spanish.
You must translate the note into English. You must ensure that you properly translate the medical and technical terms from Spanish to English without any mistakes.
Spanish Medical Note:
{medical_note}"""
    return translation_prompt.format(medical_note=medical_note)

Now that we have the evaluation corpus ready, let’s implement the core logic for the tree-search algorithm. We define the functionality in get_icd_codes, which accepts the medical note to process, the model name, and the temperature setting. The model name must be either “gpt-3.5-turbo-0613” for GPT-3.5 or “meta-llama/Llama-2-70b-chat-hf” for Llama-2 70B Chat. This specification determines the LLM that the tree-search algorithm will invoke during its processing.

Evaluating GPT-4 is possible using the same code-base by providing the appropriate model name, but we choose to skip it as it is quite time-consuming.

def get_icd_codes(medical_note, model_name="gpt-3.5-turbo-0613", temperature=0.0):
    """
    Identifies relevant ICD-10 codes for a given medical note by querying a language model.

    This function implements the tree-search algorithm for ICD coding described in
    https://openreview.net/forum?id=mqnR8rGWkn.

    Args:
        medical_note (str): The medical note for which ICD-10 codes are to be identified.
        model_name (str): The identifier for the language model used in the API (default is 'gpt-3.5-turbo-0613').
        temperature (float): Sampling temperature passed to the model (default is 0.0).

    Returns:
        list of str: A list of confirmed ICD-10 codes that are relevant to the medical note.
    """
    assigned_codes = []
    # CHAPTER_LIST is not shown in this article; it holds the top-level (chapter) nodes of the ICD-10 tree.
    candidate_codes = [x.name for x in CHAPTER_LIST]
    parent_codes = []
    prompt_count = 0

    # Cap the number of LLM invocations per note at 50.
    while prompt_count < 50:
        # Map each candidate's description back to its code.
        code_descriptions = {}
        for x in candidate_codes:
            description, code = get_name_and_description(x, model_name)
            code_descriptions[description] = code

        prompt = build_zero_shot_prompt(medical_note, list(code_descriptions.keys()), model_name=model_name)
        lm_response = get_response(prompt, model_name, temperature=temperature, max_tokens=500)
        predicted_codes = parse_outputs(lm_response, code_descriptions, model_name=model_name)

        # Leaf codes are assigned; non-leaf codes are queued for further traversal.
        for code in predicted_codes:
            if cm.is_leaf(code["code"]):
                assigned_codes.append(code["code"])
            else:
                parent_codes.append(code)

        if len(parent_codes) > 0:
            parent_code = parent_codes.pop(0)
            candidate_codes = cm.get_children(parent_code["code"])
        else:
            break

        prompt_count += 1

    return assigned_codes

Similar to the paper, we use the simple_icd_10_cm library, which provides access to the ICD-10 tree. This allows us to traverse the tree, access the descriptions for each code, and identify valid codes. First, we get the nodes at the first level of the tree.

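import simple_icd_10_cm as cm

def get_name_and_description(code, model_name):
    """
    Retrieve the name and description of an ICD-10 code.

    Args:
        code (str): The ICD-10 code.
        model_name (str): The model identifier, used to format the description.

    Returns:
        tuple: A tuple containing the formatted description and the name of the code.
    """
    # get_full_data returns a multi-line string; here line 1 is read as the name and line 3 as the description.
    full_data = cm.get_full_data(code).split("\n")
    # format_code_descriptions is a helper from the repository and is not shown in this article.
    return format_code_descriptions(full_data[3], model_name), full_data[1]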

Inside the loop, we obtain the descriptions corresponding to each of the nodes. Now, we need to construct the prompt for the LLM based on the medical note and the code descriptions. We create the prompts for GPT-3.5 and Llama-2 based on the details provided in the paper.

prompt_template_dict = {"gpt-3.5-turbo-0613" : """[Case note]:
{note}
[Example]:
<example prompt>
Gastro-esophageal reflux disease
Enteropotosis
<response>
Gastro-esophageal reflux disease: Yes, Patient was prescribed omeprazole.
Enteropotosis: No.
[Task]:
Consider each of the following ICD-10 code descriptions and evaluate if there are any related mentions in the case note.
Follow the format in the example precisely.
{code_descriptions}""",
"meta-llama/Llama-2-70b-chat-hf": """[Case note]:
{note}
[Example]:
<code descriptions>
* Gastro-esophageal reflux disease
* Enteroptosis
* Acute Nasopharyngitis [Common Cold]
</code descriptions>
<response>
* Gastro-esophageal reflux disease: Yes, Patient was prescribed omeprazole.
* Enteroptosis: No.
* Acute Nasopharyngitis [Common Cold]: No.
</response>
[Task]:
Follow the format in the example response exactly, including the entire description before your (Yes|No) judgement, followed by a newline.
Consider each of the following ICD-10 code descriptions and evaluate if there are any related mentions in the Case note.
{code_descriptions}"""
}

We now construct the prompt based on the medical note and code descriptions. An advantage for us, in terms of prompting and coding, is that we can use the same openai library to interact with both GPT-3.5 and Llama 2, provided that Llama-2 is deployed via deepinfra, which also supports the openai format for sending requests to the LLM.

def construct_prompt_template(case_note, code_descriptions, model_name):
    """
    Construct a prompt template for evaluating ICD-10 code descriptions against a given case note.

    Args:
        case_note (str): The medical case note.
        code_descriptions (str): The ICD-10 code descriptions formatted as a single string.
        model_name (str): The model identifier, used to select the template.

    Returns:
        str: A structured template ready to be used as input for a language model.
    """
    template = prompt_template_dict[model_name]
    return template.format(note=case_note, code_descriptions=code_descriptions)

def build_zero_shot_prompt(input_note, descriptions, model_name, system_prompt=""):
    """
    Build a zero-shot classification prompt with system and user roles for a language model.

    Args:
        input_note (str): The input note or query.
        descriptions (list of str): List of ICD-10 code descriptions.
        model_name (str): The model identifier, used to select the formatting.
        system_prompt (str): Optional initial system prompt or instruction.

    Returns:
        list of dict: A structured list of dictionaries defining the role and content of each message.
    """
    # The Llama-2 template expects a bulleted list; the GPT-3.5 template expects plain lines.
    if model_name == "meta-llama/Llama-2-70b-chat-hf":
        code_descriptions = "\n".join(["* " + x for x in descriptions])
    else:
        code_descriptions = "\n".join(descriptions)
    input_prompt = construct_prompt_template(input_note, code_descriptions, model_name)
    return [{"role": "system", "content": system_prompt}, {"role": "user", "content": input_prompt}]
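To see what the resulting messages look like, here is a small self-contained sketch. It uses a deliberately shortened, hypothetical template (`TEMPLATE`) and hypothetical helpers (`format_descriptions`, `build_messages`) rather than the full prompts above, so it only illustrates how the bulleted and plain formats differ.

```python
# Shortened, hypothetical template; the real prompts also contain the [Example] section.
TEMPLATE = "[Case note]:\n{note}\n[Task]:\n{code_descriptions}"


def format_descriptions(descriptions, bulleted):
    """Join descriptions one per line, optionally prefixed with '* ' (the Llama-2 style)."""
    prefix = "* " if bulleted else ""
    return "\n".join(prefix + d for d in descriptions)


def build_messages(note, descriptions, bulleted=False, system_prompt=""):
    """Return chat-style messages: a system message followed by the user prompt."""
    user_content = TEMPLATE.format(
        note=note,
        code_descriptions=format_descriptions(descriptions, bulleted),
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    "Patient was prescribed omeprazole.",
    ["Gastro-esophageal reflux disease", "Enteroptosis"],
    bulleted=True,
)
print(messages[1]["content"])
```

Keeping the description formatting separate from the template makes it easy to match each model's expected input style without duplicating the rest of the prompt logic.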

Having constructed the prompts, we now invoke the LLM to obtain the response:

def get_response(messages, model_name, temperature=0.0, max_tokens=500):
    """
    Obtain responses from a specified model via the chat-completions API.

    Args:
        messages (list of dict): List of messages structured for API input.
        model_name (str): Identifier for the model to query.
        temperature (float): Controls randomness of response, where 0 is deterministic.
        max_tokens (int): Limit on the number of tokens in the response.

    Returns:
        str: The content of the response message from the model.
    """
    # `client` is an openai client instance created elsewhere; its setup is not shown in this article.
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens
    )
    return response.choices[0].message.content

Great, we’ve obtained the output! From the response, we now parse each code description to identify the nodes that the LLM has deemed relevant for further traversal, as well as those nodes the LLM has rejected. We break the output response into separate lines and split each line to identify the prediction of the LLM for each code description.

import re

def remove_noisy_prefix(text):
    # Remove bullet markers, then any leading number or letter followed by a dot and optional space
    cleaned_text = text.replace("* ", "").strip()
    cleaned_text = re.sub(r"^\s*\w+\.\s*", "", cleaned_text)
    return cleaned_text.strip()

def parse_outputs(output, code_description_map, model_name):
    """
    Parse model outputs to confirm ICD-10 codes based on a given description map.

    Args:
        output (str): The model output containing confirmations.
        code_description_map (dict): Mapping of descriptions to ICD-10 codes.
        model_name (str): The model identifier, used to decide whether prefixes need cleaning.

    Returns:
        list of dict: A list of confirmed codes and their descriptions.
    """
    confirmed_codes = []
    split_outputs = [x for x in output.split("\n") if x]
    for item in split_outputs:
        try:
            code_description, confirmation = item.split(":", 1)
            if model_name == "meta-llama/Llama-2-70b-chat-hf":
                code_description = remove_noisy_prefix(code_description)
            if confirmation.lower().strip().startswith("yes"):
                try:
                    code = code_description_map[code_description]
                    confirmed_codes.append({"code": code, "description": code_description})
                except Exception as e:
                    # The description returned by the LLM did not match any candidate.
                    print(str(e) + " Here")
                    continue
        except ValueError:
            # Lines without a colon cannot be split into description and judgement; skip them.
            continue
    return confirmed_codes
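As a quick sanity check of the parsing idea, the following self-contained sketch runs a simplified parser over a made-up response. The helper names (`strip_bullet`, `parse_yes_lines`), the sample response, and the placeholder codes `CODE_A` and `CODE_B` are all hypothetical and used only for illustration.

```python
import re


def strip_bullet(text):
    """Drop a leading '* ' bullet or an enumeration prefix such as '1. '."""
    text = text.replace("* ", "").strip()
    return re.sub(r"^\s*\w+\.\s*", "", text).strip()


def parse_yes_lines(output, description_to_code):
    """Return the codes whose description line is answered with 'Yes'."""
    confirmed = []
    for line in output.split("\n"):
        if ":" not in line:
            continue  # not a 'description: judgement' line
        description, verdict = line.split(":", 1)
        description = strip_bullet(description)
        if verdict.strip().lower().startswith("yes") and description in description_to_code:
            confirmed.append(description_to_code[description])
    return confirmed


# Made-up model response and placeholder codes.
sample = (
    "* Gastro-esophageal reflux disease: Yes, Patient was prescribed omeprazole.\n"
    "* Enteroptosis: No.\n"
    "Some stray text"
)
codes = {"Gastro-esophageal reflux disease": "CODE_A", "Enteroptosis": "CODE_B"}
print(parse_yes_lines(sample, codes))
```

Note that the match relies on the LLM echoing each description verbatim, which is why both prompts insist on following the example format exactly.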

Let’s look at the remainder of the loop now. So far, we have constructed the prompt, obtained the response from the LLM, and parsed the output to identify the codes deemed relevant by the LLM.

while prompt_count < 50:
    code_descriptions = {}
    for x in candidate_codes:
        description, code = get_name_and_description(x, model_name)
        code_descriptions[description] = code

    prompt = build_zero_shot_prompt(medical_note, list(code_descriptions.keys()), model_name=model_name)
    lm_response = get_response(prompt, model_name, temperature=temperature, max_tokens=500)
    predicted_codes = parse_outputs(lm_response, code_descriptions, model_name=model_name)

    for code in predicted_codes:
        if cm.is_leaf(code["code"]):
            assigned_codes.append(code["code"])
        else:
            parent_codes.append(code)

    if len(parent_codes) > 0:
        parent_code = parent_codes.pop(0)
        candidate_codes = cm.get_children(parent_code["code"])
    else:
        break

    prompt_count += 1

Now we iterate through the predicted codes and check whether each code is a “leaf” code, which essentially ensures that the code is a valid and assignable ICD code. If the predicted code is valid, we consider it a prediction by the LLM for that medical note. If not, we add it to our parent codes and obtain its child nodes to further traverse the ICD tree. We break out of the loop if there are no more parent codes to traverse.

In theory, the number of LLM invocations per medical note can be arbitrarily high, leading to increased latency if the algorithm traverses many nodes. The authors enforce a maximum of 50 prompts/LLM invocations per medical note to terminate the processing, a limit we also adopt in our implementation.

Outcomes

We can now evaluate the results of the tree-search algorithm using GPT-3.5 and Llama-2 as the LLMs. We assess the performance of the algorithm in terms of micro-average and macro-average precision, recall, and F1-score.
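For readers less familiar with the two averaging schemes, here is a minimal sketch of how they can be computed for multi-label predictions. It is an illustration, not the evaluation script from the repository: the helper names are hypothetical, and macro-F1 is taken here as the mean of per-code F1 scores, which is one common convention and may not be the one used in the paper.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_macro(gold, pred):
    """gold and pred are lists of sets of codes, one set per document."""
    labels = sorted(set().union(*gold, *pred))
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn] per code
    for g, p in zip(gold, pred):
        for label in p:
            counts[label][0 if label in g else 1] += 1
        for label in g - p:
            counts[label][2] += 1
    # Micro: pool counts over all codes, then compute the metrics once.
    micro = prf(*[sum(c[i] for c in counts.values()) for i in range(3)])
    # Macro: compute the metrics per code, then average them with equal weight.
    per_label = [prf(*c) for c in counts.values()]
    n = len(per_label) or 1
    macro = tuple(sum(x[i] for x in per_label) / n for i in range(3))
    return micro, macro


# Made-up example with two documents and three codes.
gold = [{"A", "B"}, {"A"}]
pred = [{"A"}, {"A", "C"}]
micro, macro = micro_macro(gold, pred)
print(micro, macro)
```

Because macro-averaging weights every code equally, rare codes influence it as much as frequent ones, which is why it is the more telling metric for performance on rare codes.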

While the implementation’s results are roughly in the ballpark of the scores reported in the paper, there are some noteworthy differences.

In this implementation, GPT-3.5’s micro-average metrics slightly exceed the reported figures, while the macro-average metrics fall a bit short of the reported values.
Similarly, Llama-70B’s micro-average metrics either match or slightly exceed the reported figures, but the macro-average metrics are lower than the reported values.

As mentioned earlier, this implementation differs from the paper in a few minor ways, all of which impact the final performance. Please refer to the linked repository for a more detailed discussion of how this implementation differs from the original paper.

Understanding and implementing this method was quite insightful for me in many ways. It allowed me to develop a more nuanced understanding of the strengths and weaknesses of Large Language Models (LLMs) in the clinical coding setting. Specifically, it became evident that when LLMs have dynamic access to pertinent information about the codes, they can effectively comprehend the clinical context and accurately identify the relevant codes.

It would be interesting to explore whether using LLMs as agents for clinical coding could further improve performance. Given the abundance of external knowledge sources for biomedical and clinical texts in the form of papers or knowledge graphs, LLM agents could potentially be used in workflows that analyze medical documents at a finer granularity. They could also invoke tools that allow them to refer to external knowledge on the fly, if required, to arrive at the final code.

Acknowledgement

Huge thanks to Joseph, the lead author of this paper, for clarifying my doubts regarding the evaluation of this method!

[1] https://www.who.int/standards/classifications/classification-of-diseases

[2] Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. W. H., Feng, M., Ghassemi, M., … & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Sci. Data, 3(1), 1.

[3] Agrawal, M., Hegselmann, S., Lang, H., Kim, Y., & Sontag, D. (2022). Large language models are few-shot clinical information extractors. arXiv preprint arXiv:2205.12689.

[4] Zhou, H., Li, M., Xiao, Y., Yang, H., & Zhang, R. (2023). LLM Instruction-Example Adaptive Prompting (LEAP) Framework for Clinical Relation Extraction. medRxiv: the preprint server for health sciences, 2023.12.15.23300059. https://doi.org/10.1101/2023.12.15.23300059

[5] Boyle, J. S., Kascenas, A., Lok, P., Liakata, M., & O’Neil, A. Q. (2023, October). Automated clinical coding using off-the-shelf large language models. In Deep Generative Models for Health Workshop NeurIPS 2023.

[6] Huang, C. W., Tsai, S. C., & Chen, Y. N. (2022). PLM-ICD: automatic ICD coding with pretrained language models. arXiv preprint arXiv:2207.05289.

[7] Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., & Krallinger, M. (2020). Overview of Automatic Clinical Coding: Annotations, Guidelines, and Solutions for non-English Clinical Cases at CodiEsp Track of CLEF eHealth 2020. CLEF (Working Notes), 2020.

[8] Miranda-Escalada, A., Gonzalez-Agirre, A., & Krallinger, M. (2020). CodiEsp corpus: gold standard Spanish clinical cases coded in ICD10 (CIE10) - eHealth CLEF2020 (1.4) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3837305 (CC BY 4.0)
