
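Revolutionizing Text Annotation with GenAI: Unlocking Unprecedented NLP Performance!
This article explores how prompt engineering with large language models (LLMs) offers a fast, digital approach that can outperform manual annotation
Image by the author (inspired by: https://www.dqlabs.ai/blog/what-is-data-quality-management/)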
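Let's start with a question: how do you convey the meaning of a particular word to machine learning models, which are numerically smart but lexically challenged?
Data annotation plays a vital role in developing and evaluating natural language processing (NLP) models. Supervised learning, a branch of machine learning, relies on the guidance it receives (the labels) to train on a set of data points. Annotation involves identifying and labeling text, attaching additional information that specifies the role of a word or sentence in a given context.
Annotation is widely used in:
- Medical research (the literature is full of medical terms and jargon)
- Finance and law (sentiment and other indicators)
- Customer service and chatbots (recognizing specific keywords and acting on them)
- Social media
- Retail and e-commerce
This gives a sense of how widely text annotation is used across industries to meet the demands of the AI revolution.
Traditionally this was manual work, but not anymore. Generative AI has given us a major advantage in this kind of text intelligence, and now we need to use it wisely.
In this article, I demonstrate prompts for the following purposes:
- Zero-shot POS tagging and its limitations
- Few-shot prompting for POS tagging
- Few-shot prompting for named entity recognition (NER) annotation
- Few-shot prompting for NER annotation of lowercase text (a clear advantage over traditional NER models)
- Few-shot prompting for custom annotation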
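Challenges in Annotation, and Why LLMs?
Manual or semi-automated annotation techniques, chosen according to the size of the corpus and the complexity of the task, are typically used for this highly challenging yet underrated job. Despite its wide range of applications, data annotation poses significant challenges for current machine learning models because of the complexity, subjectivity, and diversity of the data. Either technique puts the accuracy, consistency, completeness, and validity of the data at risk, which makes data-driven solutions built on that data inefficient.
LLMs not only address the labor-intensive problem of manually tagging and annotating large volumes of text, they also meet every requirement of the data quality assessment framework except uniqueness and timeliness, which are beyond the scope of this discussion but can be achieved through solution design. They meet the following requirements:
- ✅ Accuracy - the output reflects the labels described in the prompt
- ✅ Completeness - all text in the database gets tagged
- ✅ Consistency - tagging is uniform across the whole database
- ✅ Validity - the categories and syntax of the annotated text always follow the rules listed in the prompt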
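Once annotations are parsed into (token, tag) pairs, three of these checks can be made mechanical. The following is a minimal sketch rather than the author's code: `check_annotations`, the `"0"` outside label in the usage, and the strict one-tag-per-token consistency rule are assumptions for illustration. Accuracy is left out because it needs reference labels to compare against.

```python
from typing import Dict, List, Set, Tuple

Sentence = List[Tuple[str, str]]  # [(token, tag), ...]


def check_annotations(
    texts: List[str],
    annotated: List[Sentence],
    allowed_tags: Set[str],
) -> Dict[str, bool]:
    """Run simple completeness, validity, and consistency checks."""
    # Completeness: every input text has a non-empty annotated sentence.
    complete = len(texts) == len(annotated) and all(len(s) > 0 for s in annotated)

    # Validity: every tag belongs to the label set listed in the prompt.
    valid = all(tag in allowed_tags for s in annotated for _, tag in s)

    # Consistency: the same token never receives two different tags.
    # (Assumed rule: too strict for POS tagging, where words are ambiguous,
    # but a useful smoke test for entity labels.)
    seen: Dict[str, str] = {}
    consistent = True
    for sentence in annotated:
        for token, tag in sentence:
            if seen.setdefault(token.lower(), tag) != tag:
                consistent = False

    return {"completeness": complete, "validity": valid, "consistency": consistent}
```

Calling it with the input texts, the parsed annotations, and a label set such as `{"0", "PER", "ORG", "LOC"}` returns one boolean per check, which makes it easy to flag a batch for manual review.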
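LLM-Driven Text Annotation Techniques
Generative AI can provide a relatively coherent and consistent solution for text annotation, whether through zero-shot or few-shot learning. In practice, because of peculiar business requirements, we are encouraged to use few-shot rather than zero-shot learning, simply to give examples for scenarios where GPT's interpretation may be inconsistent (though not necessarily wrong).
So let's begin the annotation process with zero-shot first. Honestly, this is not the best approach, because if you need to label several datasets you might as well do a little manual annotation to show the LLM the format you need. Still, let's take it one step at a time.
My GPT setup: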
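import os

from openai import OpenAI

# Assumption: the API key is read from the OPENAI_API_KEY environment variable
# (the original snippet left api_key undefined).
api_key = os.environ["OPENAI_API_KEY"]
client = OpenAI(api_key=api_key)

def get_response(prompt, ret=False):
    """Get a response from OpenAI's Chat Completions API."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # keep the output as repeatable as possible
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    content = completion.choices[0].message.content
    print(content)
    if ret:  # optionally return the text in addition to printing it
        return content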
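Here is the list of three texts I will use for tagging, taken from the IMDB dataset (CC0 license):
text = """
1. When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.
2. Gandalf and Aragorn lead the World of Men against Sauron's army to draw his gaze from Frodo and Sam as they approach Mount Doom with the One Ring.
3. In German-occupied Poland during World War II, industrialist Oskar Schindler gradually becomes concerned for his Jewish workforce after witnessing their persecution by the Nazis.
"""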
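Zero-Shot POS Tagging
Here is a zero-shot annotation approach in which I simply ask the LLM to annotate the text without explicitly defining the output. I tried this knowing the approach's shortcomings, just to see what output I would get.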
prompt = f"""
## Annotate the following text for POS tagging to train a POS tagger.
Text: {text}
"""
get_response(prompt)
This produced the following response:
Here is the annotated text with POS tags:
- When (WRB) the (DT) menace (NN) known (VBN) as (IN) the (DT) Joker (NNP) wreaks (VBZ) havoc (NN) and (CC) chaos (NN) on (IN) the (DT) people (NNS) of (IN) Gotham (NNP), Batman (NNP) must (MD) accept (VB) one (CD) of (IN) the (DT) greatest (JJS) psychological (JJ) and (CC) physical (JJ) tests (NNS) of (IN) his (PRP$) ability (NN) to (TO) fight (VB) injustice (NN).
- Gandalf (NNP) and (CC) Aragorn (NNP) lead (VB) the (DT) World (NNP) of (IN) Men (NNP) against (IN) Sauron’s (NNP$) army (NN) to (TO) draw (VB) his (PRP$) gaze (NN) from (IN) Frodo (NNP) and (CC) Sam (NNP) as (IN) they (PRP) approach (VB) Mount (NNP) Doom (NNP) with (IN) the (DT) One (CD) Ring (NNP).
- In (IN) German-occupied (JJ) Poland (NNP) during (IN) World (NNP) War (NNP) II (NNP), industrialist (NN) Oskar (NNP) Schindler (NNP) gradually (RB) becomes (VBZ) concerned (VBN) for (IN) his (PRP$) Jewish (JJ) workforce (NN) after (IN) witnessing (VBG) their (PRP$) persecution (NN) by (IN) the (DT) Nazis (NNPS).
In this annotation:
- NNP: proper noun, singular
- NN: noun, singular or mass
- NNS: noun, plural
- JJ: adjective
- JJS: adjective, superlative
- VB: verb, base form
- VBZ: verb, third person singular present
- VBN: verb, past participle
- MD: modal
- CD: cardinal number
- IN: preposition or subordinating conjunction
- DT: determiner
- CC: coordinating conjunction
- WRB: wh-adverb
- PRP: personal pronoun
- PRP$: possessive pronoun
- RB: adverb
- VBG: verb, gerund or present participle
- NNPS: proper noun, plural
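Clearly, without examples and a defined format, this approach is not suitable for annotating text and building a dataset. I saw this happen several times across the experiments I ran. So, skipping the record of those experiments, let's move on to few-shot prompting to get accurate and consistent annotations.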
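For reference, the inline `word (TAG)` output of the zero-shot run can still be recovered with a regular expression, though nothing guarantees the model will use the same layout on the next run. A minimal sketch, assuming exactly the layout shown in that response; `parse_inline_tags` is a hypothetical helper, not part of the original experiments:

```python
import re
from typing import List, Tuple

# A token is a run of non-space characters followed by its tag in parentheses.
# Tags are uppercase letters, optionally ending in "$" (for example PRP$).
PAIR = re.compile(r"(\S+)\s+\(([A-Z]+\$?)\)")


def parse_inline_tags(line: str) -> List[Tuple[str, str]]:
    """Turn 'When (WRB) the (DT) ...' into [('When', 'WRB'), ('the', 'DT'), ...]."""
    return PAIR.findall(line)
```

Punctuation carries no tag in that output, so commas and periods drop out of the result. That is one more reason a fixed, example-driven format is easier to build a dataset from.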
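Few-Shot Prompting for POS Tagging
Here is a prompt that uses a variant of the **CARE (Context, Action, Result, Example)** prompt framework to build a well-rounded set of instructions.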
prompt = f"""
Annotate the texts for Part-Of-Speech tagging to train a Part-Of-Speech tagger.
Use Penn Treebank project's conventions of tagging.
Below is an example of the format expected for output which has two sentences POS tagged:
---
{{
[
('``', '.'),
('We', 'PRON'),
('have', 'VERB'),
('no', 'DET'),
('useful', 'ADJ'),
('information', 'NOUN'),
('on', 'ADP'),
('whether', 'ADP')
],
[
('users', 'NOUN'),
('are', 'VERB'),
('at', 'ADP'),
('risk', 'NOUN')
]
}}
---
Use these examples and generate POS tagged sentences only for the following texts in JSON format:
---
{text}
"""
get_response(prompt)
This produced the following response:
{
"sentences": [
[
("When", "ADV"),
("the", "DET"),
("menace", "NOUN"),
("known", "VERB"),
("as", "ADP"),
("the", "DET"),
("Joker", "NOUN"),
("wreaks", "VERB"),
("havoc", "NOUN"),
("and", "CONJ"),
("chaos", "NOUN"),
("on", "ADP"),
("the", "DET"),
("people", "NOUN"),
("of", "ADP"),
("Gotham", "NOUN"),
(",", "."),
("Batman", "NOUN"),
("must", "VERB"),
("accept", "VERB"),
("one", "NUM"),
("of", "ADP"),
("the", "DET"),
("greatest", "ADJ"),
("psychological", "ADJ"),
("and", "CONJ"),
("physical", "ADJ"),
("tests", "NOUN"),
("of", "ADP"),
("his", "PRON"),
("ability", "NOUN"),
("to", "PART"),
("fight", "VERB"),
("injustice", "NOUN"),
(".", ".")
],
[
("Gandalf", "NOUN"),
("and", "CONJ"),
("Aragorn", "NOUN"),
("lead", "VERB"),
("the", "DET"),
("World", "NOUN"),
("of", "ADP"),
("Men", "NOUN"),
("against", "ADP"),
("Sauron", "NOUN"),
("’", "NOUN"),
("s", "NOUN"),
("army", "NOUN"),
("to", "PART"),
("draw", "VERB"),
("his", "PRON"),
("gaze", "NOUN"),
("from", "ADP"),
("Frodo", "NOUN"),
("and", "CONJ"),
("Sam", "NOUN"),
("as", "ADP"),
("they", "PRON"),
("approach", "VERB"),
("Mount", "NOUN"),
("Doom", "NOUN"),
("with", "ADP"),
("the", "DET"),
("One", "NUM"),
("Ring", "NOUN"),
(".", ".")
],
[
("In", "ADP"),
("German-occupied", "ADJ"),
("Poland", "NOUN"),
("during", "ADP"),
("World", "NOUN"),
("War", "NOUN"),
("II", "NUM"),
(",", "."),
("industrialist", "NOUN"),
("Oskar", "NOUN"),
("Schindler", "NOUN"),
("gradually", "ADV"),
("becomes", "VERB"),
("concerned", "ADJ"),
("for", "ADP"),
("his", "PRON"),
("Jewish", "ADJ"),
("workforce", "NOUN"),
("after", "ADP"),
("witnessing", "VERB"),
("their", "PRON"),
("persecution", "NOUN"),
("by", "ADP"),
("the", "DET"),
("Nazis", "NOUN"),
(".", ".")
]
]
}
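Note that the POS response shown here is not strict JSON: the pairs are written as parenthesized tuples, which `json.loads` rejects. Because the structure happens to be a valid Python literal, one option is the standard library's `ast.literal_eval`. A minimal sketch, assuming the model returns exactly this dictionary shape (optionally wrapped in a Markdown code fence); `parse_tagged_response` is a hypothetical helper:

```python
import ast
from typing import List, Tuple

FENCE = "`" * 3  # Markdown code-fence marker


def parse_tagged_response(raw: str, key: str = "sentences") -> List[List[Tuple[str, str]]]:
    """Parse a response shaped like {"sentences": [[("When", "ADV"), ...], ...]}."""
    cleaned = raw.strip()
    # Drop a surrounding Markdown fence (with optional language tag) if present.
    if cleaned.startswith(FENCE):
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    data = ast.literal_eval(cleaned.strip())  # evaluates literals only, no code
    sentences = data[key]
    # Fail loudly if any item is not a (token, tag) pair of strings.
    for sentence in sentences:
        for pair in sentence:
            if not (len(pair) == 2 and all(isinstance(x, str) for x in pair)):
                raise ValueError(f"Malformed pair: {pair!r}")
    return [[tuple(pair) for pair in sentence] for sentence in sentences]
```

To feed it, call `get_response(prompt, ret=True)` only if your version of that helper actually returns the text. Responses that use a different top-level key can be handled through the `key` argument.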
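Few-Shot Prompting for Named Entity Recognition (NER) Annotation
Now let's try the same for named entity recognition (NER). Below is the revised prompt, tailored to tagging persons, organizations, and locations: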
prompt = f"""
Annotate the texts for Named-Entity tagging to train a Named-Entity-Recognition (NER) tagger.
Use Stanford's PERSON (PER), ORGANISATION (ORG) and LOCATION (LOC) tags.
Below is an example of the format expected for output which has two sentences NER tagged:
---
{{
[
('``', '0'),
('John', 'PER'),
('have', '0'),
('no', '0'),
('useful', '0'),
('information', '0'),
('on', '0'),
('Medium', 'ORG')
],
[
('users', '0'),
('are', '0'),
('in', '0'),
('Dublin', 'LOC')
]
}}
---
Use these examples and generate NER tagged sentences only for the following texts in JSON format:
---
{text}
"""
get_response(prompt)
This produced the following response:
{
"sentences": [
[
("When", "0"),
("the", "0"),
("menace", "0"),
("known", "0"),
("as", "0"),
("the", "0"),
("Joker", "PER"),
("wreaks", "0"),
("havoc", "0"),
("and", "0"),
("chaos", "0"),
("on", "0"),
("the", "0"),
("people", "0"),
("of", "0"),
("Gotham", "LOC"),
(",", "0"),
("Batman", "PER"),
("must", "0"),
("accept", "0"),
("one", "0"),
("of", "0"),
("the", "0"),
("greatest", "0"),
("psychological", "0"),
("and", "0"),
("physical", "0"),
("tests", "0"),
("of", "0"),
("his", "0"),
("ability", "0"),
("to", "0"),
("fight", "0"),
("injustice", "0"),
(".", "0")
],
[
("Gandalf", "PER"),
("and", "0"),
("Aragorn", "PER"),
("lead", "0"),
("the", "0"),
("World", "0"),
("of", "0"),
("Men", "0"),
("against", "0"),
("Sauron", "PER"),
("'s", "0"),
("army", "0"),
("to", "0"),
("draw", "0"),
("his", "0"),
("gaze", "0"),
("from", "0"),
("Frodo", "PER"),
("and", "0"),
("Sam", "PER"),
("as", "0"),
("they", "0"),
("approach", "0"),
("Mount", "LOC"),
("Doom", "LOC"),
("with", "0"),
("the", "0"),
("One", "0"),
("Ring", "0"),
(".", "0")
],
[
("In", "0"),
("German-occupied", "LOC"),
("Poland", "LOC"),
("during", "0"),
("World", "0"),
("War", "0"),
("II", "0"),
(",", "0"),
("industrialist", "0"),
("Oskar", "PER"),
("Schindler", "PER"),
("gradually", "0"),
("becomes", "0"),
("concerned", "0"),
("for", "0"),
("his", "0"),
("Jewish", "0"),
("workforce", "0"),
("after", "0"),
("witnessing", "0"),
("their", "0"),
("persecution", "0"),
("by", "0"),
("the", "0"),
("Nazis", "ORG"),
(".", "0")
]
]
}
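Token-level tags like these are usually merged into entity spans before training or evaluation. A minimal sketch, assuming adjacent tokens with the same non-`"0"` tag belong to one entity (the output has no B-/I- prefixes, so two neighboring entities of the same type cannot be told apart); `merge_entities` is a hypothetical helper:

```python
from itertools import groupby
from typing import List, Tuple


def merge_entities(
    sentence: List[Tuple[str, str]], outside: str = "0"
) -> List[Tuple[str, str]]:
    """Group consecutive tokens that share a tag into (entity_text, tag) spans."""
    entities = []
    # groupby yields one group per run of identical consecutive tags.
    for tag, group in groupby(sentence, key=lambda pair: pair[1]):
        if tag != outside:
            entities.append((" ".join(token for token, _ in group), tag))
    return entities
```

With this rule, a first and last name tagged `PER` on adjacent tokens become a single person entity, while untagged tokens are dropped.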
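Few-Shot Prompting for NER Annotation of Lowercase Text
Traditional NER taggers have a known shortcoming: when proper nouns are normalized to lowercase, or lowercased through human error, performance drops and entities are often mislabeled. Since LLMs are more intuitive and semantically mature, let's test them with the following prompt: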
prompt = f"""
Annotate the texts for Named-Entity tagging to train a Named-Entity-Recognition (NER) tagger.
Use Stanford's PERSON (PER), ORGANISATION (ORG) and LOCATION (LOC) tags. Correct cases of proper nouns if needed.
Below is an example of the format expected for output which has two sentences NER tagged:
---
{{
[
('``', '0'),
('john', 'PER'),
('have', '0'),
('no', '0'),
('useful', '0'),
('information', '0'),
('on', '0'),
('medium', 'ORG')
],
[
('users', '0'),
('are', '0'),
('in', '0'),
('dublin', 'LOC')
]
}}
---
Use these examples and generate NER tagged sentences only for the following texts in JSON format:
---
{text.lower()}
"""
get_response(prompt)
Here is the truncated response for the last text, in which only one tag is mismatched:
{
[
('in', '0'),
('german', '0'),
('-', '0'),
('occupied', '0'),
('poland', 'LOC'),
('during', '0'),
('world', '0'),
('war', '0'),
('ii', '0'),
(',', '0'),
('industrialist', '0'),
('oskar', 'PER'),
('schindler', 'PER'),
('gradually', '0'),
('becomes', '0'),
('concerned', '0'),
('for', '0'),
('his', '0'),
('jewish', '0'),
('workforce', '0'),
('after', '0'),
('witnessing', '0'),
('their', '0'),
('persecution', '0'),
('by', '0'),
('the', '0'),
('nazis', '0'),
('.', '0')
]
}
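This can be corrected by updating the organization example so that any similar entity is tagged as ORG. For example, I modified the example as follows: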
prompt = f"""
Annotate the texts for Named-Entity tagging to train a Named-Entity-Recognition (NER) tagger.
Use Stanford's PERSON (PER), ORGANISATION (ORG) and LOCATION (LOC) tags. Correct cases of proper nouns if needed.
Below is an example of the format expected for output which has two sentences NER tagged:
---
{{
[
('``', '0'),
('john', 'PER'),
('have', '0'),
('no', '0'),
('useful', '0'),
('information', '0'),
('on', '0'),
('irish', 'ORG')
],
[
('users', '0'),
('are', '0'),
('in', '0'),
('dublin', 'LOC')
]
}}
---
Use these examples and generate NER tagged sentences only for the following texts in JSON format:
---
{text.lower()}
"""
get_response(prompt)
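After this small tweak, it works correctly.
Finally, we come to the last example: the custom annotator exercise.
Few-Shot Prompting for Custom Annotation
Now, POS and NER taggers are the typical models for traditional NLP tasks. For annotation beyond these, such as industry-specific tagging of medical documents, we need text annotated with specific categories, which, as noted earlier, can be both laborious and error-prone.
Here the same template used earlier for POS and NER tagging is adapted for custom annotation. Let's get started!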
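A mismatch like the `nazis` tag is easy to miss by eye in a long response. A minimal sketch of a token-by-token comparison against a hand-checked reference, assuming both lists are tokenized identically; `tag_mismatches` is a hypothetical helper, and the sample pairs are taken from the end of the truncated response:

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def tag_mismatches(
    predicted: List[Pair], expected: List[Pair]
) -> List[Tuple[int, str, str, str]]:
    """Return (index, token, predicted_tag, expected_tag) for every disagreement."""
    if [t for t, _ in predicted] != [t for t, _ in expected]:
        raise ValueError("Token sequences differ; align them before comparing tags.")
    return [
        (i, token, p_tag, e_tag)
        for i, ((token, p_tag), (_, e_tag)) in enumerate(zip(predicted, expected))
        if p_tag != e_tag
    ]


predicted = [("by", "0"), ("the", "0"), ("nazis", "0"), (".", "0")]
expected = [("by", "0"), ("the", "0"), ("nazis", "ORG"), (".", "0")]
print(tag_mismatches(predicted, expected))  # [(2, 'nazis', '0', 'ORG')]
```

Running the same comparison before and after changing a few-shot example shows whether the change fixed the target label without disturbing others.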
prompt = f"""
Annotate the texts for fictional character tagging to train a fictional character tagger.
Use FICT-PER to label fictional character and FICT-LOC to label fictional location.
Below is an example of the format expected for output which has two sentences fictional character tagged:
---
{{
[
('``', '0'),
('Jack', 'FICT-PER'),
('Sparrow', 'FICT-PER'),
('stole', '0'),
('the', '0'),
('ship', '0'),
('from', '0'),
('Tortuga', '0'), # Is a real location on Earth
('to', '0'),
('go', '0'),
('Shipwreck', 'FICT-LOC'), # Is not a real location on Earth
('Island', 'FICT-LOC') # Is not a real location on Earth
],
[
('Hatter', 'FICT-PER'),
('is', '0'),
('in', '0'),
('Neverland', 'FICT-LOC')
]
}}
---
Use these examples and generate fictional character tagged sentences only for the following texts in JSON format:
---
{text}
"""
get_response(prompt)
This produced the following response:
{
"annotations": [
[
("When", "0"),
("the", "0"),
("menace", "0"),
("known", "0"),
("as", "0"),
("the", "0"),
("Joker", "FICT-PER"),
("wreaks", "0"),
("havoc", "0"),
("and", "0"),
("chaos", "0"),
("on", "0"),
("the", "0"),
("people", "0"),
("of", "0"),
("Gotham", "FICT-LOC"),
(",", "0"),
("Batman", "FICT-PER"),
("must", "0"),
("accept", "0"),
("one", "0"),
("of", "0"),
("the", "0"),
("greatest", "0"),
("psychological", "0"),
("and", "0"),
("physical", "0"),
("tests", "0"),
("of", "0"),
("his", "0"),
("ability", "0"),
("to", "0"),
("fight", "0"),
("injustice", "0"),
(".", "0")
],
[
("Gandalf", "FICT-PER"),
("and", "0"),
("Aragorn", "FICT-PER"),
("lead", "0"),
("the", "0"),
("World", "0"),
("of", "0"),
("Men", "0"),
("against", "0"),
("Sauron", "FICT-PER"),
("'s", "0"),
("army", "0"),
("to", "0"),
("draw", "0"),
("his", "0"),
("gaze", "0"),
("from", "0"),
("Frodo", "FICT-PER"),
("and", "0"),
("Sam", "FICT-PER"),
("as", "0"),
("they", "0"),
("approach", "0"),
("Mount", "FICT-LOC"),
("Doom", "FICT-LOC"),
("with", "0"),
("the", "0"),
("One", "0"),
("Ring", "0"),
(".", "0")
],
[
("In", "0"),
("German-occupied", "0"),
("Poland", "0"),
("during", "0"),
("World", "0"),
("War", "0"),
("II", "0"),
(",", "0"),
("industrialist", "0"),
("Oskar", "FICT-PER"),
("Schindler", "FICT-PER"),
("gradually", "0"),
("becomes", "0"),
("concerned", "0"),
("for", "0"),
("his", "0"),
("Jewish", "0"),
("workforce", "0"),
("after", "0"),
("witnessing", "0"),
("their", "0"),
("persecution", "0"),
("by", "0"),
("the", "0"),
("Nazis", "0"),
(".", "0")
]
]
}
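Because the POS, NER, and custom prompts differ only in the task name, the label instructions, and the examples, the template can be factored into a small builder. A minimal sketch, assuming the `---` delimiters used in the lowercase NER prompt; `build_annotation_prompt` and its parameters are hypothetical and only approximate the wording of the prompts in this article:

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def build_annotation_prompt(
    task: str, label_rules: str, examples: List[Sentence], text: str
) -> str:
    """Assemble a few-shot annotation prompt from its variable parts."""
    # Render each example sentence as a bracketed list of (token, tag) pairs.
    rendered = ",\n".join(
        "[\n" + ",\n".join(repr(pair) for pair in sentence) + "\n]"
        for sentence in examples
    )
    return (
        f"Annotate the texts for {task} tagging to train a {task} tagger.\n"
        f"{label_rules}\n"
        f"Below is an example of the format expected for output which has "
        f"{len(examples)} sentences {task} tagged:\n"
        f"---\n{{\n{rendered}\n}}\n---\n"
        f"Use these examples and generate {task} tagged sentences only for "
        f"the following texts in JSON format:\n---\n{text}\n"
    )
```

Keeping the examples as data rather than hard-coded prompt text also makes it easier to add an edge case when a label comes back wrong.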
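Clearly, as with every other prompt engineering use case, annotation needs clear definitions of the categories being tagged to ensure accuracy, examples of edge cases for completeness, and a defined syntax for the tags to ensure validity and consistency. For instance, in the response above the real person Oskar Schindler is tagged FICT-PER, an edge case the examples did not cover. Beyond that, nothing is more effective than a robust test plan built around the prompt, which may include manual verification, comparing outputs across multiple LLMs, or matching against a standard dataset previously used for annotation.
Conclusion
In this article, I showed how to use LLMs and structured prompts to speed up accurate data annotation. From understanding context to recognizing names and locations in lowercase text, this is a major advantage for specialized NLP models, which can now be trained with custom annotators under human supervision alone, reducing time and labor costs while improving efficiency and productivity.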
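One piece of such a test plan, matching the LLM's output against a standard dataset, can be sketched as per-label precision and recall at the token level. This assumes both annotations share the same tokenization; `per_label_scores` is a hypothetical helper, and in practice an established metric implementation may be preferable:

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def per_label_scores(
    predicted: List[Pair], gold: List[Pair]
) -> Dict[str, Dict[str, float]]:
    """Token-level precision and recall for each label seen in either annotation."""
    if len(predicted) != len(gold):
        raise ValueError("Annotations must cover the same tokens.")
    tp, pred_n, gold_n = Counter(), Counter(), Counter()
    for (_, p_tag), (_, g_tag) in zip(predicted, gold):
        pred_n[p_tag] += 1
        gold_n[g_tag] += 1
        if p_tag == g_tag:
            tp[p_tag] += 1
    scores = {}
    for label in sorted(set(pred_n) | set(gold_n)):
        scores[label] = {
            # Precision: of the tokens given this label, how many were right.
            "precision": tp[label] / pred_n[label] if pred_n[label] else 0.0,
            # Recall: of the tokens that should have this label, how many were found.
            "recall": tp[label] / gold_n[label] if gold_n[label] else 0.0,
        }
    return scores
```

A label with high recall but low precision, for example, suggests the prompt needs a counterexample showing what should not receive that label.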
Photo by Sarah Wolfe on Unsplash