597 lines
18 KiB
Markdown
597 lines
18 KiB
Markdown
# 接口开发说明 - 文档解析接口
|
||
|
||
## 📋 接口基本信息
|
||
|
||
- **接口路径**: `/api/v1/inventory/parse-document`
|
||
- **请求方法**: `POST`
|
||
- **接口功能**: 解析上传的数据字典文档(Excel/Word/PDF),提取表结构信息
|
||
- **涉及页面**: `InventoryStep.vue` - 方案一(已有文档导入)
|
||
- **是否涉及大模型**: ❌ 否
|
||
- **工作量评估**: 5 人日
|
||
- **优先级**: 中
|
||
|
||
---
|
||
|
||
## 🎯 功能描述
|
||
|
||
该接口用于解析用户上传的数据字典文档,支持以下格式:
|
||
- **Excel**: `.xlsx`, `.xls`
|
||
- **Word**: `.doc`, `.docx`
|
||
- **PDF**: `.pdf`(可选)
|
||
|
||
从文档中提取以下信息:
|
||
- 表名(英文)
|
||
- 字段名(英文)
|
||
- 字段类型
|
||
- 字段注释/描述(中文)
|
||
|
||
---
|
||
|
||
## 🔧 技术实现方案
|
||
|
||
### 技术栈
|
||
|
||
```python
|
||
# 核心依赖
|
||
fastapi>=0.104.0 # Web 框架
|
||
pydantic>=2.0.0 # 数据验证
|
||
python-multipart>=0.0.6 # 文件上传支持
|
||
|
||
# 文档处理
|
||
pandas>=2.0.0 # 数据处理
|
||
openpyxl>=3.1.0 # Excel 处理
|
||
python-docx>=1.1.0 # Word 处理
|
||
pdfplumber>=0.10.0 # PDF 处理(可选)
|
||
```
|
||
|
||
### 实现思路
|
||
|
||
1. **文件上传**: 使用 FastAPI 的 `UploadFile` 接收文件
|
||
2. **文件类型识别**: 根据文件扩展名或 MIME 类型识别文件格式
|
||
3. **文档解析**:
|
||
- Excel: 使用 `pandas` 或 `openpyxl` 读取
|
||
- Word: 使用 `python-docx` 解析表格和文本
|
||
- PDF: 使用 `pdfplumber` 提取表格和文本
|
||
4. **表结构提取**: 识别文档中的表结构信息,提取表名、字段名、类型、注释
|
||
5. **数据验证**: 验证提取的数据格式是否正确
|
||
6. **结果返回**: 返回标准化的表结构数据
|
||
|
||
---
|
||
|
||
## 📥 请求格式
|
||
|
||
### 请求方式
|
||
|
||
**Content-Type**: `multipart/form-data` 或 `application/json`
|
||
|
||
### 方式一:文件上传(推荐)
|
||
|
||
```http
|
||
POST /api/v1/inventory/parse-document
|
||
Content-Type: multipart/form-data
|
||
|
||
file: [二进制文件]
|
||
project_id: string
|
||
file_type: excel | word | pdf (可选,自动识别)
|
||
```
|
||
|
||
### 方式二:文件路径(如果文件已上传到服务器)
|
||
|
||
```json
|
||
{
|
||
"file_path": "/path/to/document.xlsx",
|
||
"file_type": "excel | word | pdf",
|
||
"project_id": "project_001"
|
||
}
|
||
```
|
||
|
||
### 请求参数说明
|
||
|
||
| 参数名 | 类型 | 必填 | 说明 |
|
||
|--------|------|------|------|
|
||
| `file` | File | 是 | 上传的文件(方式一) |
|
||
| `file_path` | string | 是 | 文件路径(方式二) |
|
||
| `file_type` | string | 否 | 文件类型:`excel` / `word` / `pdf`,如果不传则根据文件扩展名自动识别 |
|
||
| `project_id` | string | 是 | 项目ID |
|
||
|
||
---
|
||
|
||
## 📤 响应格式
|
||
|
||
### 成功响应
|
||
|
||
```json
|
||
{
|
||
"success": true,
|
||
"code": 200,
|
||
"message": "文档解析成功",
|
||
"data": {
|
||
"tables": [
|
||
{
|
||
"raw_name": "t_user_base_01",
|
||
"display_name": "用户基础信息表",
|
||
"description": "存储用户基本信息的表",
|
||
"fields": [
|
||
{
|
||
"raw_name": "user_id",
|
||
"display_name": "用户ID",
|
||
"type": "varchar(64)",
|
||
"comment": "用户的唯一标识符",
|
||
"is_primary_key": true,
|
||
"is_nullable": false,
|
||
"default_value": null
|
||
},
|
||
{
|
||
"raw_name": "user_name",
|
||
"display_name": "用户名",
|
||
"type": "varchar(50)",
|
||
"comment": "用户登录名",
|
||
"is_primary_key": false,
|
||
"is_nullable": true,
|
||
"default_value": null
|
||
}
|
||
],
|
||
"field_count": 2
|
||
}
|
||
],
|
||
"total_tables": 10,
|
||
"total_fields": 245,
|
||
"parse_time": 1.23,
|
||
"file_info": {
|
||
"file_name": "数据字典.xlsx",
|
||
"file_size": 1024000,
|
||
"file_type": "excel"
|
||
}
|
||
}
|
||
}
|
||
```
|
||
|
||
### 失败响应
|
||
|
||
```json
|
||
{
|
||
"success": false,
|
||
"code": 400,
|
||
"message": "文件格式不支持",
|
||
"error": {
|
||
"error_code": "UNSUPPORTED_FILE_TYPE",
|
||
"error_detail": "仅支持 Excel (.xlsx, .xls), Word (.doc, .docx), PDF (.pdf) 格式"
|
||
}
|
||
}
|
||
```
|
||
|
||
### 响应字段说明
|
||
|
||
| 字段名 | 类型 | 说明 |
|
||
|--------|------|------|
|
||
| `success` | boolean | 请求是否成功 |
|
||
| `code` | integer | HTTP 状态码 |
|
||
| `message` | string | 响应消息 |
|
||
| `data` | object | 响应数据 |
|
||
| `data.tables` | array | 解析出的表列表 |
|
||
| `data.tables[].raw_name` | string | 表名(英文/原始名称) |
|
||
| `data.tables[].display_name` | string | 表显示名称(中文,如果文档中有) |
|
||
| `data.tables[].description` | string | 表描述 |
|
||
| `data.tables[].fields` | array | 字段列表 |
|
||
| `data.tables[].fields[].raw_name` | string | 字段名(英文) |
|
||
| `data.tables[].fields[].display_name` | string | 字段显示名称(中文) |
|
||
| `data.tables[].fields[].type` | string | 字段类型 |
|
||
| `data.tables[].fields[].comment` | string | 字段注释 |
|
||
| `data.total_tables` | integer | 总表数 |
|
||
| `data.total_fields` | integer | 总字段数 |
|
||
| `data.parse_time` | float | 解析耗时(秒) |
|
||
|
||
---
|
||
|
||
## 💻 代码实现示例
|
||
|
||
### FastAPI 实现
|
||
|
||
```python
|
||
from fastapi import FastAPI, UploadFile, File, Form, HTTPException
|
||
from fastapi.responses import JSONResponse
|
||
from pydantic import BaseModel
|
||
from typing import Optional, List
|
||
import pandas as pd
|
||
from docx import Document
|
||
import pdfplumber
|
||
import os
|
||
from pathlib import Path
|
||
import time
|
||
|
||
app = FastAPI()
|
||
|
||
class FieldInfo(BaseModel):
|
||
raw_name: str
|
||
display_name: Optional[str] = None
|
||
type: str
|
||
comment: Optional[str] = None
|
||
is_primary_key: bool = False
|
||
is_nullable: bool = True
|
||
default_value: Optional[str] = None
|
||
|
||
class TableInfo(BaseModel):
|
||
raw_name: str
|
||
display_name: Optional[str] = None
|
||
description: Optional[str] = None
|
||
fields: List[FieldInfo]
|
||
field_count: int
|
||
|
||
class ParseDocumentResponse(BaseModel):
|
||
success: bool
|
||
code: int
|
||
message: str
|
||
data: Optional[dict] = None
|
||
error: Optional[dict] = None
|
||
|
||
def parse_excel(file_path: str) -> List[TableInfo]:
|
||
"""解析 Excel 文件"""
|
||
tables = []
|
||
try:
|
||
# 读取 Excel 文件
|
||
df = pd.read_excel(file_path, sheet_name=None) # 读取所有 sheet
|
||
|
||
for sheet_name, df_sheet in df.items():
|
||
# 识别表结构(根据 Excel 格式约定)
|
||
# 假设第一列是字段名,第二列是类型,第三列是注释
|
||
fields = []
|
||
for _, row in df_sheet.iterrows():
|
||
if pd.notna(row.iloc[0]): # 字段名不为空
|
||
field = FieldInfo(
|
||
raw_name=str(row.iloc[0]).strip(),
|
||
display_name=str(row.iloc[2]).strip() if len(row) > 2 and pd.notna(row.iloc[2]) else None,
|
||
type=str(row.iloc[1]).strip() if len(row) > 1 and pd.notna(row.iloc[1]) else "varchar(255)",
|
||
comment=str(row.iloc[2]).strip() if len(row) > 2 and pd.notna(row.iloc[2]) else None
|
||
)
|
||
fields.append(field)
|
||
|
||
if fields:
|
||
table = TableInfo(
|
||
raw_name=sheet_name,
|
||
display_name=sheet_name,
|
||
fields=fields,
|
||
field_count=len(fields)
|
||
)
|
||
tables.append(table)
|
||
|
||
except Exception as e:
|
||
raise Exception(f"Excel 解析失败: {str(e)}")
|
||
|
||
return tables
|
||
|
||
def parse_word(file_path: str) -> List[TableInfo]:
|
||
"""解析 Word 文件"""
|
||
tables = []
|
||
try:
|
||
doc = Document(file_path)
|
||
|
||
# 遍历文档中的表格
|
||
for table_idx, table in enumerate(doc.tables):
|
||
fields = []
|
||
# 假设第一行是表头,后续行是字段信息
|
||
# 约定:第一列字段名,第二列类型,第三列注释
|
||
for row in table.rows[1:]: # 跳过表头
|
||
if len(row.cells) >= 3:
|
||
field_name = row.cells[0].text.strip()
|
||
if field_name: # 字段名不为空
|
||
field = FieldInfo(
|
||
raw_name=field_name,
|
||
display_name=row.cells[2].text.strip() if len(row.cells) > 2 and row.cells[2].text.strip() else None,
|
||
type=row.cells[1].text.strip() if len(row.cells) > 1 and row.cells[1].text.strip() else "varchar(255)",
|
||
comment=row.cells[2].text.strip() if len(row.cells) > 2 and row.cells[2].text.strip() else None
|
||
)
|
||
fields.append(field)
|
||
|
||
if fields:
|
||
table_info = TableInfo(
|
||
raw_name=f"table_{table_idx + 1}",
|
||
display_name=f"表{table_idx + 1}",
|
||
fields=fields,
|
||
field_count=len(fields)
|
||
)
|
||
tables.append(table_info)
|
||
|
||
except Exception as e:
|
||
raise Exception(f"Word 解析失败: {str(e)}")
|
||
|
||
return tables
|
||
|
||
def parse_pdf(file_path: str) -> List[TableInfo]:
|
||
"""解析 PDF 文件"""
|
||
tables = []
|
||
try:
|
||
with pdfplumber.open(file_path) as pdf:
|
||
for page_idx, page in enumerate(pdf.pages):
|
||
# 提取表格
|
||
page_tables = page.extract_tables()
|
||
for table_idx, table in enumerate(page_tables):
|
||
if table and len(table) > 1:
|
||
fields = []
|
||
# 假设第一行是表头,后续行是字段信息
|
||
for row in table[1:]:
|
||
if len(row) >= 3 and row[0]:
|
||
field = FieldInfo(
|
||
raw_name=str(row[0]).strip(),
|
||
display_name=str(row[2]).strip() if len(row) > 2 and row[2] else None,
|
||
type=str(row[1]).strip() if len(row) > 1 and row[1] else "varchar(255)",
|
||
comment=str(row[2]).strip() if len(row) > 2 and row[2] else None
|
||
)
|
||
fields.append(field)
|
||
|
||
if fields:
|
||
table_info = TableInfo(
|
||
raw_name=f"table_{page_idx + 1}_{table_idx + 1}",
|
||
display_name=f"表{page_idx + 1}-{table_idx + 1}",
|
||
fields=fields,
|
||
field_count=len(fields)
|
||
)
|
||
tables.append(table_info)
|
||
|
||
except Exception as e:
|
||
raise Exception(f"PDF 解析失败: {str(e)}")
|
||
|
||
return tables
|
||
|
||
def detect_file_type(file_name: str) -> str:
|
||
"""根据文件扩展名检测文件类型"""
|
||
ext = Path(file_name).suffix.lower()
|
||
if ext in ['.xlsx', '.xls']:
|
||
return 'excel'
|
||
elif ext in ['.docx', '.doc']:
|
||
return 'word'
|
||
elif ext == '.pdf':
|
||
return 'pdf'
|
||
else:
|
||
raise ValueError(f"不支持的文件类型: {ext}")
|
||
|
||
@app.post("/api/v1/inventory/parse-document", response_model=ParseDocumentResponse)
|
||
async def parse_document(
|
||
file: Optional[UploadFile] = File(None),
|
||
file_path: Optional[str] = Form(None),
|
||
file_type: Optional[str] = Form(None),
|
||
project_id: str = Form(...)
|
||
):
|
||
"""
|
||
文档解析接口
|
||
|
||
支持解析 Excel、Word、PDF 格式的数据字典文档,提取表结构信息
|
||
"""
|
||
start_time = time.time()
|
||
|
||
try:
|
||
# 验证参数
|
||
if not file and not file_path:
|
||
raise HTTPException(
|
||
status_code=400,
|
||
detail="必须提供文件或文件路径"
|
||
)
|
||
|
||
# 处理文件上传
|
||
if file:
|
||
# 保存上传的文件到临时目录
|
||
upload_dir = Path("/tmp/uploads")
|
||
upload_dir.mkdir(exist_ok=True)
|
||
file_path = str(upload_dir / file.filename)
|
||
|
||
with open(file_path, "wb") as f:
|
||
content = await file.read()
|
||
f.write(content)
|
||
|
||
file_name = file.filename
|
||
file_size = len(content)
|
||
|
||
# 自动检测文件类型
|
||
if not file_type:
|
||
file_type = detect_file_type(file_name)
|
||
else:
|
||
# 使用提供的文件路径
|
||
if not os.path.exists(file_path):
|
||
raise HTTPException(
|
||
status_code=404,
|
||
detail=f"文件不存在: {file_path}"
|
||
)
|
||
file_name = Path(file_path).name
|
||
file_size = os.path.getsize(file_path)
|
||
|
||
# 自动检测文件类型
|
||
if not file_type:
|
||
file_type = detect_file_type(file_name)
|
||
|
||
# 根据文件类型选择解析方法
|
||
if file_type == 'excel':
|
||
tables = parse_excel(file_path)
|
||
elif file_type == 'word':
|
||
tables = parse_word(file_path)
|
||
elif file_type == 'pdf':
|
||
tables = parse_pdf(file_path)
|
||
else:
|
||
raise HTTPException(
|
||
status_code=400,
|
||
detail=f"不支持的文件类型: {file_type}"
|
||
)
|
||
|
||
# 计算统计信息
|
||
total_fields = sum(table.field_count for table in tables)
|
||
parse_time = time.time() - start_time
|
||
|
||
# 构建响应数据
|
||
response_data = {
|
||
"tables": [table.dict() for table in tables],
|
||
"total_tables": len(tables),
|
||
"total_fields": total_fields,
|
||
"parse_time": round(parse_time, 2),
|
||
"file_info": {
|
||
"file_name": file_name,
|
||
"file_size": file_size,
|
||
"file_type": file_type
|
||
}
|
||
}
|
||
|
||
return ParseDocumentResponse(
|
||
success=True,
|
||
code=200,
|
||
message="文档解析成功",
|
||
data=response_data
|
||
)
|
||
|
||
except HTTPException:
|
||
raise
|
||
except Exception as e:
|
||
return ParseDocumentResponse(
|
||
success=False,
|
||
code=500,
|
||
message="文档解析失败",
|
||
error={
|
||
"error_code": "PARSE_ERROR",
|
||
"error_detail": str(e)
|
||
}
|
||
)
|
||
```
|
||
|
||
---
|
||
|
||
## 🧪 测试用例
|
||
|
||
### 单元测试示例
|
||
|
||
```python
|
||
import pytest
|
||
from fastapi.testclient import TestClient
|
||
from pathlib import Path
|
||
import tempfile
|
||
|
||
client = TestClient(app)
|
||
|
||
def test_parse_excel_document():
|
||
"""测试解析 Excel 文档"""
|
||
# 创建测试 Excel 文件
|
||
test_data = {
|
||
'字段名': ['user_id', 'user_name', 'email'],
|
||
'类型': ['varchar(64)', 'varchar(50)', 'varchar(100)'],
|
||
'注释': ['用户ID', '用户名', '邮箱']
|
||
}
|
||
df = pd.DataFrame(test_data)
|
||
|
||
with tempfile.NamedTemporaryFile(suffix='.xlsx', delete=False) as tmp:
|
||
df.to_excel(tmp.name, index=False)
|
||
|
||
with open(tmp.name, 'rb') as f:
|
||
response = client.post(
|
||
"/api/v1/inventory/parse-document",
|
||
files={"file": ("test.xlsx", f, "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")},
|
||
data={"project_id": "test_project"}
|
||
)
|
||
|
||
assert response.status_code == 200
|
||
data = response.json()
|
||
assert data["success"] is True
|
||
assert len(data["data"]["tables"]) > 0
|
||
assert data["data"]["total_tables"] > 0
|
||
|
||
def test_unsupported_file_type():
|
||
"""测试不支持的文件类型"""
|
||
with tempfile.NamedTemporaryFile(suffix='.txt', delete=False) as tmp:
|
||
tmp.write(b"test content")
|
||
tmp.flush()
|
||
|
||
with open(tmp.name, 'rb') as f:
|
||
response = client.post(
|
||
"/api/v1/inventory/parse-document",
|
||
files={"file": ("test.txt", f, "text/plain")},
|
||
data={"project_id": "test_project"}
|
||
)
|
||
|
||
assert response.status_code == 400
|
||
data = response.json()
|
||
assert data["success"] is False
|
||
```
|
||
|
||
---
|
||
|
||
## ⚠️ 注意事项
|
||
|
||
### 1. 文件大小限制
|
||
|
||
- **Excel**: 建议限制为 50MB
|
||
- **Word**: 建议限制为 50MB
|
||
- **PDF**: 建议限制为 50MB
|
||
|
||
在 FastAPI 中设置:
|
||
|
||
```python
|
||
from fastapi import File, UploadFile
|
||
from fastapi.exceptions import RequestEntityTooLarge
|
||
|
||
@app.exception_handler(RequestEntityTooLarge)
|
||
async def handle_upload_limit(exc):
|
||
return JSONResponse(
|
||
status_code=413,
|
||
content={
|
||
"success": False,
|
||
"message": "文件大小超过限制(最大 50MB)"
|
||
}
|
||
)
|
||
```
|
||
|
||
### 2. 文件格式约定
|
||
|
||
由于不同用户的数据字典文档格式可能不同,建议:
|
||
- **Excel**: 约定格式为第一列字段名,第二列类型,第三列注释
|
||
- **Word**: 约定使用表格格式,第一行表头,后续行字段信息
|
||
- **PDF**: 约定使用表格格式
|
||
|
||
如果格式不统一,需要增加更智能的识别逻辑。
|
||
|
||
### 3. 错误处理
|
||
|
||
- 文件读取失败:返回 400 错误
|
||
- 文件格式错误:返回 400 错误,提示正确的格式
|
||
- 解析失败:返回 500 错误,记录详细错误日志
|
||
- 文件过大:返回 413 错误
|
||
|
||
### 4. 性能优化
|
||
|
||
- 对于大文件,考虑使用异步处理
|
||
- 使用临时文件,处理完成后删除
|
||
- 考虑添加缓存机制(相同文件解析结果缓存)
|
||
|
||
### 5. 安全性
|
||
|
||
- 文件上传路径验证,防止路径遍历攻击
|
||
- 文件类型验证,防止恶意文件上传
|
||
- 文件大小限制,防止 DoS 攻击
|
||
- 临时文件及时清理
|
||
|
||
---
|
||
|
||
## 📝 开发检查清单
|
||
|
||
- [ ] 支持 Excel (.xlsx, .xls) 格式解析
|
||
- [ ] 支持 Word (.doc, .docx) 格式解析
|
||
- [ ] 支持 PDF (.pdf) 格式解析(可选)
|
||
- [ ] 文件类型自动识别
|
||
- [ ] 文件大小限制(50MB)
|
||
- [ ] 错误处理和异常捕获
|
||
- [ ] 单元测试覆盖
|
||
- [ ] 日志记录
|
||
- [ ] 临时文件清理
|
||
- [ ] API 文档生成(Swagger)
|
||
|
||
---
|
||
|
||
## 🔗 相关文档
|
||
|
||
- [接口清单表格](../Python接口清单表格.md)
|
||
- [Python技术人员工作量文档](../Python技术人员工作量文档.md)
|
||
- [FastAPI 官方文档](https://fastapi.tiangolo.com/)
|
||
- [pandas 文档](https://pandas.pydata.org/docs/)
|
||
- [python-docx 文档](https://python-docx.readthedocs.io/)
|
||
|
||
---
|
||
|
||
## 📞 联系方式
|
||
|
||
如有开发问题,请联系:
|
||
- **接口负责人**: [待填写]
|
||
- **技术顾问**: [待填写]
|