
Evaluating LLMs for SQL Query Generation: Accuracy Insights and Best Practices for Data Engineers
After a series of LLM-based articles, I want to address an important question people keep debating about large language models — can an LLM take over the work of a data engineering team?
In this article, I'll try to answer that question.
But before diving in, let me say one thing with absolute certainty — exposing your data to a large language model is a double-edged sword!
We already know that LLM output is generally non-deterministic; that is, similar prompts given to an LLM can produce different results. Does this also hold for a structured language such as SQL?
Well, the CHASE-SQL paper tries to answer exactly that. Here is a link to the paper (well worth your time!):
CHASE-SQL Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL
Inspired by that paper, I tried to build my own LLM-based database administrator.
Evaluating LLMs on SQL Generation
Following the CHASE-SQL paper, I'm evaluating against the Spider dataset of SQLite databases.
Specifically, I'm evaluating against a single database (department_management) and running only 10 queries from the test bench.
I'm evaluating both OpenAI and Anthropic, since these two vendors see broad adoption across organizations.
But before getting into the details, here is the headline number: the results of the LLM-generated queries matched the actual query results 60% of the time.
As for the generated SQL itself, the syntax match rate was roughly 75%.
I must also state that I did not follow the CHASE-SQL paper's recipe to the letter, because most enterprises won't either when they point an LLM at their databases.
I'll cover techniques for improving SQL query accuracy, but before we get there, let's dig into the code.
Implementation
First things first: download the Spider dataset. I grabbed it from Kaggle via this link —
[Yale University's Spider 1.0 NLP Dataset — Yale Semantic Parsing and Text-to-SQL Challenge](https://www.kaggle.com/datasets/jeromeblanchet/yale-universitys-spider-10-nlp-dataset)
With the databases in a local directory inside the project, I now want to load a single database and pick a sample of 10 queries for it. Here is the code that does this —
```python
def test_on_spider(spider_path, db_path, spider_test_data):
    conn = None
    results = []
    try:
        # Load Spider dataset
        spider_data = load_spider_dataset(spider_path)
        db_id = spider_data[0]['db_id']
        db_path = db_path + db_id + '/' + db_id + ".sqlite"

        # Get actual schema for this specific database
        actual_schema = get_actual_schema(db_path)
        spider_test_data = load_spider_test_dataset(spider_test_data)

        # Filter for department_management database
        dept_management_data = [item for item in spider_test_data if item['db_id'] == 'department_management']

        # Extract questions and actual queries
        test_questions = [item['question'] for item in dept_management_data]
        test_actual_queries = [item['query'] for item in dept_management_data]

        conn = sqlite3.connect(db_path)
        cursor = conn.cursor()

        for i, (question, actual_query) in enumerate(zip(test_questions, test_actual_queries)):
            if i > 9:  # Limit to 10 questions for testing
                break
            print(f"\nQuestion {i+1}: {question}")
            print(f"Actual SQL query: {actual_query}")

            # Generate query using Chase SQL
            generated_query = divide_and_conquer(question, actual_schema)
            print(f"Generated SQL query: {generated_query}")

            # Calculate query similarity
            similarity = query_similarity(generated_query, actual_query)
            print(f"Query similarity: {similarity:.2%}")

            # Execute both queries and compare results
            try:
                cursor.execute(generated_query)
                generated_result = cursor.fetchall()
                cursor.execute(actual_query)
                actual_result = cursor.fetchall()

                # Compare results
                results_match = (generated_result == actual_result)
                results.append({
                    'question': question,
                    'generated_query': generated_query,
                    'actual_query': actual_query,
                    'results_match': results_match,
                    'query_similarity': similarity,
                    'generated_result': generated_result,
                    'actual_result': actual_result
                })
                print(f"Results match: {results_match}")
                print(f"Generated result: {generated_result}")
                print(f"Actual result: {actual_result}")
            except sqlite3.Error as e:
                print(f"Error executing SQL: {e}")
                results.append({
                    'question': question,
                    'generated_query': generated_query,
                    'actual_query': actual_query,
                    'results_match': False,
                    'query_similarity': similarity,
                    'error': str(e)
                })

        # Compute overall statistics
        total_queries = len(results)
        correct_queries = sum(1 for r in results if r['results_match'])
        accuracy = correct_queries / total_queries if total_queries > 0 else 0
        avg_similarity = sum(r['query_similarity'] for r in results) / total_queries
        print("\nOverall Results:")
        print(f"Total Queries: {total_queries}")
        print(f"Correct Queries: {correct_queries}")
        print(f"Accuracy: {accuracy:.2%}")
        print(f"Average Query Similarity: {avg_similarity:.2%}")

        # Analyze types of errors
        error_types = {}
        for r in results:
            if not r['results_match']:
                if 'error' in r:
                    error_type = r['error'].split(':')[0]
                else:
                    error_type = 'Incorrect Result'
                error_types[error_type] = error_types.get(error_type, 0) + 1
        print("\nError Types:")
        for error_type, count in error_types.items():
            print(f"{error_type}: {count} ({count/total_queries:.2%})")

        # Find the best- and worst-performing queries
        best_query = max(results, key=lambda r: r['query_similarity'])
        worst_query = min(results, key=lambda r: r['query_similarity'])
        print("\nBest-performing query:")
        print(f"Question: {best_query['question']}")
        print(f"Generated query: {best_query['generated_query']}")
        print(f"Actual query: {best_query['actual_query']}")
        print(f"Similarity: {best_query['query_similarity']:.2%}")
        print("\nWorst-performing query:")
        print(f"Question: {worst_query['question']}")
        print(f"Generated query: {worst_query['generated_query']}")
        print(f"Actual query: {worst_query['actual_query']}")
        print(f"Similarity: {worst_query['query_similarity']:.2%}")
    except Exception as e:
        print(f"Error in test_on_spider: {e}")
        import traceback
        traceback.print_exc()
    finally:
        if conn:
            conn.close()
    return results
```
This function does quite a lot, so let's take it step by step.
1 — Load the department management dataset
Load this specific database. The Spider download ships several SQLite databases, but I picked just one for the evaluation.
We start by calling the `get_actual_schema` function to extract the database schema —
```python
def get_actual_schema(db_path):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    tables = [row[0] for row in cursor.fetchall()]
    schema = {'table_names': tables, 'column_names': [], 'column_types': []}
    for table in tables:
        cursor.execute(f"PRAGMA table_info({table})")
        for col in cursor.fetchall():
            schema['column_names'].append([tables.index(table), col[1]])
            schema['column_types'].append(col[2])
    conn.close()
    return schema
```
We extract it from the actual database that has already been created.
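To illustrate the shape of the dictionary this produces, here is the same extraction logic run against a tiny throwaway in-memory database (the table here is a stand-in, not the Spider schema):

```python
import sqlite3

# Toy example: extract the same schema dict from a throwaway in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE head (head_ID INTEGER PRIMARY KEY, name TEXT, age REAL)")
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = [row[0] for row in cursor.fetchall()]
schema = {'table_names': tables, 'column_names': [], 'column_types': []}
for table in tables:
    cursor.execute(f"PRAGMA table_info({table})")
    for col in cursor.fetchall():
        # col is (cid, name, type, notnull, dflt_value, pk)
        schema['column_names'].append([tables.index(table), col[1]])
        schema['column_types'].append(col[2])
conn.close()
print(schema)
# -> {'table_names': ['head'], 'column_names': [[0, 'head_ID'], [0, 'name'], [0, 'age']], 'column_types': ['INTEGER', 'TEXT', 'REAL']}
```

Note that `column_names` and `column_types` are parallel lists, and each column carries the index of its table — the LLM prompt is rebuilt from this structure later.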
2 — Load the actual dataset
Load the actual Spider dataset, stored as JSON in the dataset directory —
```python
def load_spider_test_dataset(spider_test_data):
    with open(spider_test_data) as f:
        return json.load(f)
```
Here is a sample of the JSON —
```json
[
  {
    "db_id": "department_management",
    "query": "SELECT count(*) FROM head WHERE age > 56",
    "query_toks": ["SELECT", "count", "(", "*", ")", "FROM", "head", "WHERE", "age", ">", "56"],
    "query_toks_no_value": ["select", "count", "(", "*", ")", "from", "head", "where", "age", ">", "value"],
    "question": "How many heads of the departments are older than 56 ?",
    "question_toks": ["How", "many", "heads", "of", "the", "departments", "are", "older", "than", "56", "?"],
```
As you can see, it contains the actual query, which we use to evaluate against the expected results.
```python
# Filter for the department_management database
dept_management_data = [item for item in spider_test_data if item['db_id'] == 'department_management']

# Extract questions and actual queries
test_questions = [item['question'] for item in dept_management_data]
test_actual_queries = [item['query'] for item in dept_management_data]
```
These lines pull out the natural-language test questions and the actual queries.
3 — Run the queries with the LLM
Using the natural-language question, run a query against the test database —

```python
generated_query = divide_and_conquer(question, actual_schema)
```

This function hands the schema to the LLM and asks it to generate the SQL query. Here is the function definition —
```python
def divide_and_conquer(question, schema):
    # Create a structured representation of the schema
    schema_info = "Database Schema:\n"
    for table_id, table in enumerate(schema['table_names']):
        schema_info += f"Table: {table}\n"
        # column_names and column_types are parallel lists, so walk them together
        # (indexing column_types with a per-table position would pick up the wrong types)
        for col_index, (tbl_id, col_name) in enumerate(schema['column_names']):
            if tbl_id == table_id:
                col_type = schema['column_types'][col_index]
                schema_info += f"  - {col_name} ({col_type})\n"
        schema_info += "\n"

    # Generate the SQL query
    prompt = f"""
{schema_info}
Based on the database schema above, generate a SQL query relevant to the provided schema that answers the following question:
{question}
Important:
1. Do not make up column names. Use only the table names and associated columns provided.
2. Make sure the query is syntactically correct and complete for SQLite.
3. Include joins where necessary based on foreign-key relationships.
4. Use SQLite date functions (e.g., strftime, date, julianday) for any date calculations.
5. Generate only the query; do not add any extra text or formatting.
6. Do not modify date fields with strftime in the query; select dates as-is.
"""
    sql_query = llm_generate(prompt)
    sql_query = sql_query.replace("```sql", "")
    sql_query = sql_query.replace("```", "")
    return sql_query
```
The generated SQL query is returned to the test_on_spider function.
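The two `replace` calls in `divide_and_conquer` only strip one fencing style. A slightly more defensive cleanup (a sketch of my own, not from the original code) could be:

```python
import re

def strip_code_fences(text: str) -> str:
    """Remove Markdown code fences such as ``` or ```sql from an LLM response."""
    # Drop any fence, with or without a language tag, then trim whitespace
    return re.sub(r"```[a-zA-Z]*\n?", "", text).strip()

print(strip_code_fences("```sql\nSELECT count(*) FROM head;\n```"))
# SELECT count(*) FROM head;
```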
4 — Compare queries and results
The next few lines run both the actual query from the test JSON and the LLM-generated query against the database, then compare the results —
```python
# Calculate query similarity
similarity = query_similarity(generated_query, actual_query)
print(f"Query similarity: {similarity:.2%}")

# Execute both queries and compare results
try:
    cursor.execute(generated_query)
    generated_result = cursor.fetchall()
    cursor.execute(actual_query)
    actual_result = cursor.fetchall()

    # Compare results
    results_match = (generated_result == actual_result)
    results.append({
        'question': question,
        'generated_query': generated_query,
        'actual_query': actual_query,
        'results_match': results_match,
        'query_similarity': similarity,
        'generated_result': generated_result,
        'actual_result': actual_result
    })
    print(f"Results match: {results_match}")
    print(f"Generated result: {generated_result}")
    print(f"Actual result: {actual_result}")
```
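The `query_similarity` helper isn't shown above; a minimal implementation (an assumption on my part, using Python's `difflib` on case- and whitespace-normalized text) could look like this:

```python
import difflib

def query_similarity(generated_query: str, actual_query: str) -> float:
    """Rough textual similarity of two SQL strings, ignoring case and extra whitespace."""
    normalize = lambda q: " ".join(q.lower().split())
    matcher = difflib.SequenceMatcher(None, normalize(generated_query), normalize(actual_query))
    return matcher.ratio()

print(f"{query_similarity('SELECT count(*) FROM head', 'select COUNT(*)  from head'):.2%}")
# 100.00%
```

Character-level ratios are crude (token-level or AST-based comparison would be fairer to equivalent queries), but they are enough to rank candidates.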
Finally, we aggregate the results —
```python
# Compute overall statistics
total_queries = len(results)
correct_queries = sum(1 for r in results if r['results_match'])
accuracy = correct_queries / total_queries if total_queries > 0 else 0
avg_similarity = sum(r['query_similarity'] for r in results) / total_queries
print("\nOverall Results:")
print(f"Total Queries: {total_queries}")
print(f"Correct Queries: {correct_queries}")
print(f"Accuracy: {accuracy:.2%}")
print(f"Average Query Similarity: {avg_similarity:.2%}")

# Analyze types of errors
error_types = {}
for r in results:
    if not r['results_match']:
        if 'error' in r:
            error_type = r['error'].split(':')[0]
        else:
            error_type = 'Incorrect Result'
        error_types[error_type] = error_types.get(error_type, 0) + 1
print("\nError Types:")
for error_type, count in error_types.items():
    print(f"{error_type}: {count} ({count/total_queries:.2%})")
```
And find the best- and worst-performing queries —

```python
best_query = max(results, key=lambda r: r['query_similarity'])
worst_query = min(results, key=lambda r: r['query_similarity'])
print("\nBest-performing query:")
print(f"Question: {best_query['question']}")
print(f"Generated query: {best_query['generated_query']}")
print(f"Actual query: {best_query['actual_query']}")
print(f"Similarity: {best_query['query_similarity']:.2%}")
print("\nWorst-performing query:")
print(f"Question: {worst_query['question']}")
print(f"Generated query: {worst_query['generated_query']}")
print(f"Actual query: {worst_query['actual_query']}")
print(f"Similarity: {worst_query['query_similarity']:.2%}")
```
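One caveat before the results: comparing result sets with `==` is order-sensitive, so a correct query that returns the same rows in a different order counts as a miss. An order-insensitive check (a sketch; the runs below used strict equality) would be:

```python
def results_match_unordered(generated_result, actual_result):
    """Compare fetched rows as multisets instead of ordered lists."""
    return sorted(generated_result) == sorted(actual_result)

print(results_match_unordered([(1,), (2,)], [(2,), (1,)]))  # True
```

When the question itself specifies an ordering (e.g. "ordered by age"), strict list equality is the right check; otherwise the unordered variant avoids penalizing equivalent queries.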
Here are the overall results for Anthropic (claude-3-sonnet) —
```
Question 1: How many heads of the departments are older than 56 ?
Actual SQL query: SELECT count(*) FROM head WHERE age > 56
Generated SQL query: SELECT COUNT(*)
FROM head h
JOIN management m ON h.head_ID = m.head_ID
WHERE h.age > 56;
Query similarity: 62.50%
Results match: False
Generated result: [(3,)]
Actual result: [(5,)]

Question 2: List the name, born state and age of the heads of departments ordered by age.
Actual SQL query: SELECT name , born_state , age FROM head ORDER BY age
Generated SQL query: SELECT h.name, h.born_state, h.age
FROM head h
JOIN management m ON h.head_ID = m.head_ID
ORDER BY h.age;
Query similarity: 64.56%
Results match: False
Generated result: [('Dudley Hart', 'California', 52.0), ('Jeff Maggert', 'Delaware', 53.0), ('Franklin Langham', 'Connecticut', 67.0), ('Billy Mayfair', 'California', 69.0), ('K. J. Choi', 'Alabama', 69.0)]
Actual result: [('Pádraig Harrington', 'Connecticut', 43.0), ('Stewart Cink', 'Florida', 50.0), ('Dudley Hart', 'California', 52.0), ('Jeff Maggert', 'Delaware', 53.0), ('Nick Faldo', 'California', 56.0), ('Tiger Woods', 'Alabama', 67.0), ('Franklin Langham', 'Connecticut', 67.0), ('Sergio García', 'California', 68.0), ('K. J. Choi', 'Alabama', 69.0), ('Billy Mayfair', 'California', 69.0)]

Question 3: List the creation year, name and budget of each department.
Actual SQL query: SELECT creation , name , budget_in_billions FROM department
Generated SQL query: SELECT Creation, Name, Budget_in_Billions
FROM department;
Query similarity: 97.44%
Results match: True
Generated result: [('1789', 'State', 9.96), ('1789', 'Treasury', 11.1), ('1947', 'Defense', 439.3), ('1870', 'Justice', 23.4), ('1849', 'Interior', 10.7), ('1889', 'Agriculture', 77.6), ('1903', 'Commerce', 6.2), ('1913', 'Labor', 59.7), ('1953', 'Health and Human Services', 543.2), ('1965', 'Housing and Urban Development', 46.2), ('1966', 'Transportation', 58.0), ('1977', 'Energy', 21.5), ('1979', 'Education', 62.8), ('1989', 'Veterans Affairs', 73.2), ('2002', 'Homeland Security', 44.6)]
Actual result: [('1789', 'State', 9.96), ('1789', 'Treasury', 11.1), ('1947', 'Defense', 439.3), ('1870', 'Justice', 23.4), ('1849', 'Interior', 10.7), ('1889', 'Agriculture', 77.6), ('1903', 'Commerce', 6.2), ('1913', 'Labor', 59.7), ('1953', 'Health and Human Services', 543.2), ('1965', 'Housing and Urban Development', 46.2), ('1966', 'Transportation', 58.0), ('1977', 'Energy', 21.5), ('1979', 'Education', 62.8), ('1989', 'Veterans Affairs', 73.2), ('2002', 'Homeland Security', 44.6)]

Question 4: What are the maximum and minimum budget of the departments?
Actual SQL query: SELECT max(budget_in_billions) , min(budget_in_billions) FROM department
Generated SQL query: SELECT MAX(Budget_in_Billions) AS max_budget, MIN(Budget_in_Billions) AS min_budget
FROM department;
Query similarity: 83.72%
Results match: True
Generated result: [(543.2, 6.2)]
Actual result: [(543.2, 6.2)]

Question 5: What is the average number of employees of the departments whose rank is between 10 and 15?
Actual SQL query: SELECT avg(num_employees) FROM department WHERE ranking BETWEEN 10 AND 15
Generated SQL query: SELECT AVG(Num_Employees) AS avg_num_employees
FROM department
WHERE Ranking BETWEEN 10 AND 15;
Query similarity: 86.90%
Results match: True
Generated result: [(105468.16666666667,)]
Actual result: [(105468.16666666667,)]

Question 6: What are the names of the heads who are born outside the California state?
Actual SQL query: SELECT name FROM head WHERE born_state != 'California'
Generated SQL query: SELECT h.name
FROM head h
JOIN management m ON h.head_ID = m.head_ID
WHERE h.born_state <> 'California';
Query similarity: 65.82%
Results match: False
Generated result: [('Jeff Maggert',), ('K. J. Choi',), ('Franklin Langham',)]
Actual result: [('Tiger Woods',), ('K. J. Choi',), ('Jeff Maggert',), ('Stewart Cink',), ('Pádraig Harrington',), ('Franklin Langham',)]

Question 7: What are the distinct creation years of the departments managed by a secretary born in state 'Alabama'?
Actual SQL query: SELECT DISTINCT T1.creation FROM department AS T1 JOIN management AS T2 ON T1.department_id = T2.department_id JOIN head AS T3 ON T2.head_id = T3.head_id WHERE T3.born_state = 'Alabama'
Generated SQL query: SELECT DISTINCT strftime('%Y', Creation) AS creation_year
FROM department d
JOIN management m ON d.Department_ID = m.department_ID
JOIN head h ON m.head_ID = h.head_ID
WHERE h.born_state = 'Alabama';
Query similarity: 82.29%
Results match: False
Generated result: [('-4707',)]
Actual result: [('1903',)]

Question 8: What are the names of the states where at least 3 heads were born?
Actual SQL query: SELECT born_state FROM head GROUP BY born_state HAVING count(*) >= 3
Generated SQL query: SELECT born_state, COUNT(*) AS count
FROM head
GROUP BY born_state
HAVING COUNT(*) >= 3;
Query similarity: 87.18%
Results match: False
Generated result: [('California', 4)]
Actual result: [('California',)]

Question 9: In which year were most departments established?
Actual SQL
```
And here is the corresponding sample from the OpenAI run —

```
Question 1: How many heads of the departments are older than 56 ?
Actual SQL query: SELECT count(*) FROM head WHERE age > 56
Generated SQL query: SELECT COUNT(*)
FROM head
WHERE age > 56;
Query similarity: 98.77%
Results match: True
Generated result: [(5,)]
Actual result: [(5,)]

Question 2: List the name, born state and age of the heads of departments ordered by age.
Actual SQL query: SELECT name , born_state , age FROM head ORDER BY age
Generated SQL query: SELECT head.name, head.born_state, head.age
FROM head
JOIN management ON head.head_ID = management.head_ID
JOIN department ON management.department_ID = department.Department_ID
ORDER BY head.age;
Query similarity: 40.96%
Results match: False
Generated result: [('Dudley Hart', 'California', 52.0), ('Jeff Maggert', 'Delaware', 53.0), ('Franklin Langham', 'Connecticut', 67.0), ('Billy Mayfair', 'California', 69.0), ('K. J. Choi', 'Alabama', 69.0)]
Actual result: [('Pádraig Harrington', 'Connecticut', 43.0), ('Stewart Cink', 'Florida', 50.0), ('Dudley Hart', 'California', 52.0), ('Jeff Maggert', 'Delaware', 53.0), ('Nick Faldo', 'California', 56.0), ('Tiger Woods', 'Alabama', 67.0), ('Franklin Langham', 'Connecticut', 67.0), ('Sergio García', 'California', 68.0), ('K. J. Choi', 'Alabama', 69.0), ('Billy Mayfair', 'California', 69.0)]
```
```sql
SELECT name, born_state, age
FROM head
ORDER BY age;
```

Actual Query: SELECT name , born_state , age FROM head ORDER BY age
Similarity: 99.16%
Don'ts
- Never rely on an LLM to run mutating queries against your database unless you have time travel (as in Snowflake) or some other rollback mechanism.
- Don't use LLM-generated queries against databases with overly complex relationships. LLMs are far better at generating simple SELECT statements — counts, sums, and the like — especially against a single table.
- Don't promise 100% query accuracy. It cannot be guaranteed.
- Don't assume an LLM can solve your data engineering problems. It can assist, but it can never replace an entire data pipeline.
- Never point your LLM at sensitive, confidential, private, or proprietary data. If you absolutely must, make sure the LLM is fully local and that you have adequate security guardrails in place to prevent data leakage.
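On the first point, database-level permissions are the real safeguard, but as a minimal illustration, SQLite's authorizer hook can reject anything that isn't a read (a sketch, not production code):

```python
import sqlite3

def execute_read_only(db_path: str, query: str):
    """Run a query with an authorizer that denies anything except reads."""
    conn = sqlite3.connect(db_path)

    def authorizer(action, arg1, arg2, db_name, trigger):
        # Allow SELECT statements, column reads, and function calls; deny everything else
        if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ, sqlite3.SQLITE_FUNCTION):
            return sqlite3.SQLITE_OK
        return sqlite3.SQLITE_DENY

    conn.set_authorizer(authorizer)
    try:
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

A denied statement (e.g. a DELETE) fails at prepare time with an sqlite3 error, so a mutating query never touches the data — but real deployments should still use a read-only database role rather than trusting an in-process guard.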
Dos
- Always add logging and metrics to track the accuracy and performance of LLM queries.
- Use LLMs to add simple queries to non-critical systems.
- If you really want to automate SELECT queries with an LLM, I recommend pointing it at a GraphQL or ORM interface instead.
- Always give the LLM context from your own real queries so it can understand the domain and the character of the existing workload.
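As a minimal sketch of the first point, the generation call can be wrapped with latency logging (`timed_generate` and the stand-in generator are my own names, not from the article):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm-sql")

def timed_generate(generate_fn, question):
    """Call any query generator and log its latency and output."""
    start = time.perf_counter()
    query = generate_fn(question)
    log.info("generated in %.3fs: %s", time.perf_counter() - start, query)
    return query

# Usage with a stand-in generator in place of a real LLM call
print(timed_generate(lambda q: "SELECT count(*) FROM head", "How many heads?"))
# SELECT count(*) FROM head
```

The same wrapper is a natural place to record the similarity and result-match metrics computed earlier.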
Where LLMs shine
- Generating DDL
- Generating data catalogs and data dictionaries
- Simple single-table SELECT statements
- Summarizing SQL query results
- Benchmarking and performance-testing databases
Wrapping up
This article is not meant to be a harsh indictment of LLM-based SQL query generation.
Rather, I hope to present a more balanced and nuanced view, pointing out the limitations and challenges instead of dismissing the approach outright.
For the most part, LLMs are good at putting two and two together, and there is no harm in automating parts of a pipeline, especially the non-critical ones. In fact, you can even generate good-looking visuals from query results, which is handy for building insights over log tables, audit tables, and the like.
There are plenty of options on the market; with LLM offerings from OpenAI, Meta, Anthropic, and Google, choosing is hard, and that is exactly why evaluation matters.
By evaluating each LLM on your specific workloads, such as generating SQL, summarizing data, or interpreting domain-specific queries, you can identify the best fit for your needs. In some cases a hybrid approach, routing specific query types to the most suitable LLM, can offer the best accuracy and cost efficiency.
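A hybrid router can be as simple as keyword-based dispatch; the sketch below is purely illustrative, and the model names are placeholders, not real identifiers:

```python
# Placeholder model names; swap in real identifiers for your stack
def pick_model(question: str) -> str:
    """Route analytical-looking questions to a stronger (pricier) model."""
    analytical_markers = ("join", "distinct", "average", "most", "per ", "each")
    if any(marker in question.lower() for marker in analytical_markers):
        return "stronger-model"
    return "cheaper-model"

print(pick_model("How many heads of the departments are older than 56 ?"))    # cheaper-model
print(pick_model("What are the distinct creation years of the departments?"))  # stronger-model
```

In practice you would drive the routing rules from your own evaluation results rather than a hand-written keyword list.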