# API Development Spec - Business Table Parsing Endpoint

## 📋 Endpoint Basics

- **Path**: `/api/v1/inventory/parse-business-tables`
- **Method**: `POST`
- **Purpose**: Parse core business tables (Excel/CSV) manually exported by business staff, with batch file parsing and table-structure detection
- **Related page**: `InventoryStep.vue` - Option 3 (business key-table import)
- **Uses an LLM**: ❌ No
- **Effort estimate**: 3 person-days
- **Priority**: Medium

---

## 🎯 Feature Description

This endpoint parses core business-table files manually exported by business staff. It supports:

- **Batch upload**: multiple files in a single request
- **Formats**: Excel (.xlsx, .xls) and CSV (.csv)
- **Table-structure detection**: table structure is detected automatically from Excel sheet names or file names
- **Progress feedback**: progress reporting during batch processing

Typical scenarios:

- SaaS systems (e.g. Salesforce, Kingdee, Youzan) that do not allow a direct database connection
- Business staff manually export the core business tables
- Multiple files need to be processed in one batch

---

## 🔧 Technical Approach

### Tech Stack

```python
# Core dependencies
fastapi>=0.104.0   # Web framework
pydantic>=2.0.0    # Data validation
celery>=5.3.0      # Async tasks (optional)

# Data processing
pandas>=2.0.0      # Batch file processing
openpyxl>=3.1.0    # Excel handling
```

### Implementation Steps

1. **Batch upload**: receive multiple files
2. **File parsing**: read the files in batch with `pandas`
3. **Table detection**: derive the table name from the file name or sheet name
4. **Field detection**: read field names and types from the Excel/CSV header row
5. **Progress feedback**: use an async task or a progress callback
6. **Result aggregation**: merge the parse results from all files
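Step 4 (field detection from the header row) can be sketched with the standard library alone; the sample data below is made up for illustration, and the real implementation uses pandas:

```python
import csv
import io

# Minimal stdlib-only sketch of header-row field detection.
# The CSV content here is a made-up example.
sample = io.StringIO("order_id,amount\n001,9.9\n002,12.5\n")
reader = csv.reader(sample)
header = next(reader)          # first row -> field names
rows = list(reader)            # remaining rows -> data

fields = [{"raw_name": name.strip(), "type": "varchar(255)"} for name in header]
print([f["raw_name"] for f in fields])  # ['order_id', 'amount']
print(len(rows))                        # 2
```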

---

## 📥 Request Format

### Request Method

**Content-Type**: `multipart/form-data`

### Request Parameters

```http
POST /api/v1/inventory/parse-business-tables
Content-Type: multipart/form-data

files: [file1, file2, ...]   # multiple files
project_id: string
```

or

```json
{
  "file_paths": ["/path/to/file1.xlsx", "/path/to/file2.csv", ...],
  "project_id": "project_001"
}
```

### Parameter Reference

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `files` | File[] | Yes | List of uploaded files (form 1; multiple allowed) |
| `file_paths` | string[] | Yes | List of file paths (form 2) |
| `project_id` | string | Yes | Project ID |

---

## 📤 Response Format

### Success Response

```json
{
  "success": true,
  "code": 200,
  "message": "Business tables parsed successfully",
  "data": {
    "tables": [
      {
        "raw_name": "orders",
        "display_name": "Order transaction details",
        "description": "Parsed from file orders.xlsx",
        "source_file": "orders.xlsx",
        "fields": [
          {
            "raw_name": "order_id",
            "display_name": "Order ID",
            "type": "string",
            "comment": null,
            "inferred_type": "varchar(64)"
          }
        ],
        "field_count": 10,
        "row_count": 10000
      }
    ],
    "total_tables": 5,
    "total_fields": 150,
    "total_files": 5,
    "success_files": 5,
    "failed_files": [],
    "parse_time": 3.45,
    "file_info": {
      "processed_files": [
        {
          "file_name": "orders.xlsx",
          "file_size": 1024000,
          "tables_extracted": 1,
          "status": "success"
        }
      ]
    }
  }
}
```

### Async Task Response (when async processing is used)

```json
{
  "success": true,
  "code": 202,
  "message": "Task submitted; processing",
  "data": {
    "task_id": "task_123456",
    "total_files": 5,
    "status": "processing",
    "estimated_time": 30
  }
}
```

---

## 💻 Implementation Example

### FastAPI Implementation (Synchronous)

```python
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from typing import Optional, List
import pandas as pd
import os
from pathlib import Path
import time

app = FastAPI()

class FieldInfo(BaseModel):
    raw_name: str
    display_name: Optional[str] = None
    type: str
    comment: Optional[str] = None
    inferred_type: Optional[str] = None

class TableInfo(BaseModel):
    raw_name: str
    display_name: Optional[str] = None
    description: Optional[str] = None
    source_file: str
    fields: List[FieldInfo]
    field_count: int
    row_count: Optional[int] = None

def infer_field_type(pd_type: str) -> str:
    """Map a pandas dtype name to a database column type."""
    type_mapping = {
        'object': 'varchar(255)',
        'int64': 'bigint',
        'int32': 'int',
        'float64': 'double',
        'float32': 'float',
        'bool': 'tinyint',
        'datetime64[ns]': 'datetime',
        'date': 'date'
    }
    return type_mapping.get(str(pd_type), 'varchar(255)')

def parse_excel_file(file_path: str, file_name: str) -> List[TableInfo]:
    """Parse a single Excel file."""
    tables = []

    try:
        # Read every sheet
        excel_file = pd.ExcelFile(file_path)

        for sheet_name in excel_file.sheet_names:
            df = pd.read_excel(file_path, sheet_name=sheet_name)

            # Skip empty sheets
            if df.empty:
                continue

            # Detect fields from the header row
            fields = []
            for col in df.columns:
                # Infer the column type from the pandas dtype
                col_type = str(df[col].dtype)
                inferred_type = infer_field_type(col_type)

                fields.append(FieldInfo(
                    raw_name=str(col).strip(),
                    display_name=str(col).strip(),
                    type=inferred_type,
                    comment=None,
                    inferred_type=inferred_type
                ))

            if fields:
                # Use the sheet name as the table name; fall back to the file stem
                table_name = sheet_name.lower().replace(' ', '_').replace('-', '_')
                if not table_name:
                    table_name = Path(file_name).stem.lower().replace(' ', '_').replace('-', '_')

                tables.append(TableInfo(
                    raw_name=table_name,
                    display_name=sheet_name,
                    description=f"Parsed from sheet '{sheet_name}' of file {file_name}",
                    source_file=file_name,
                    fields=fields,
                    field_count=len(fields),
                    row_count=len(df)
                ))

    except Exception as e:
        raise RuntimeError(f"Failed to parse file {file_name}: {e}") from e

    return tables

def parse_csv_file(file_path: str, file_name: str) -> List[TableInfo]:
    """Parse a single CSV file."""
    tables = []

    try:
        # Try several encodings in turn
        encodings = ['utf-8', 'gbk', 'gb2312', 'latin-1']
        df = None

        for encoding in encodings:
            try:
                df = pd.read_csv(file_path, encoding=encoding)
                break
            except UnicodeDecodeError:
                continue

        if df is None:
            raise ValueError("Unable to decode the CSV file; check its encoding")

        if df.empty:
            return tables

        # Detect fields from the header row
        fields = []
        for col in df.columns:
            col_type = str(df[col].dtype)
            inferred_type = infer_field_type(col_type)

            fields.append(FieldInfo(
                raw_name=str(col).strip(),
                display_name=str(col).strip(),
                type=inferred_type,
                comment=None,
                inferred_type=inferred_type
            ))

        if fields:
            # Use the file stem as the table name
            table_name = Path(file_name).stem.lower().replace(' ', '_').replace('-', '_')

            tables.append(TableInfo(
                raw_name=table_name,
                display_name=Path(file_name).stem,
                description=f"Parsed from file {file_name}",
                source_file=file_name,
                fields=fields,
                field_count=len(fields),
                row_count=len(df)
            ))

    except Exception as e:
        raise RuntimeError(f"Failed to parse file {file_name}: {e}") from e

    return tables

@app.post("/api/v1/inventory/parse-business-tables")
async def parse_business_tables(
    files: List[UploadFile] = File(...),
    project_id: str = Form(...)
):
    """
    Business table parsing endpoint.

    Parses a batch of core business-table files exported by business staff.
    """
    start_time = time.time()
    upload_dir = Path("/tmp/uploads")
    upload_dir.mkdir(parents=True, exist_ok=True)

    all_tables = []
    processed_files = []
    failed_files = []

    try:
        # Handle each file independently
        for file in files:
            # Strip any client-supplied directory components to avoid path traversal
            file_name = Path(file.filename).name
            file_path = str(upload_dir / file_name)

            try:
                # Persist the upload to a temporary file
                with open(file_path, "wb") as f:
                    content = await file.read()
                    f.write(content)

                file_size = len(content)

                # Pick a parser based on the file extension
                ext = Path(file_name).suffix.lower()
                if ext in ['.xlsx', '.xls']:
                    tables = parse_excel_file(file_path, file_name)
                elif ext == '.csv':
                    tables = parse_csv_file(file_path, file_name)
                else:
                    failed_files.append({
                        "file_name": file_name,
                        "error": f"Unsupported file type: {ext}"
                    })
                    continue

                all_tables.extend(tables)
                processed_files.append({
                    "file_name": file_name,
                    "file_size": file_size,
                    "tables_extracted": len(tables),
                    "status": "success"
                })

            except Exception as e:
                failed_files.append({
                    "file_name": file_name,
                    "error": str(e)
                })
            finally:
                # Always clean up the temporary file, including the
                # unsupported-extension and failure paths
                if os.path.exists(file_path):
                    os.remove(file_path)

        # Aggregate statistics
        total_fields = sum(table.field_count for table in all_tables)
        parse_time = time.time() - start_time

        # Build the response (model_dump is the pydantic v2 replacement for .dict())
        response_data = {
            "tables": [table.model_dump() for table in all_tables],
            "total_tables": len(all_tables),
            "total_fields": total_fields,
            "total_files": len(files),
            "success_files": len(processed_files),
            "failed_files": failed_files,
            "parse_time": round(parse_time, 2),
            "file_info": {
                "processed_files": processed_files
            }
        }

        return {
            "success": True,
            "code": 200,
            "message": f"Parsed {len(processed_files)} file(s); extracted {len(all_tables)} table(s)",
            "data": response_data
        }

    except Exception as e:
        return JSONResponse(
            status_code=500,
            content={
                "success": False,
                "code": 500,
                "message": "Business table parsing failed",
                "error": {
                    "error_code": "PARSE_ERROR",
                    "error_detail": str(e)
                }
            }
        )
```

### Asynchronous Version (Celery, optional)

```python
from celery import Celery

celery_app = Celery('tasks', broker='redis://localhost:6379')

@celery_app.task
def parse_business_tables_async(file_paths: List[str], project_id: str):
    """Parse business tables asynchronously."""
    # Same parsing logic as the synchronous version above
    pass

@app.post("/api/v1/inventory/parse-business-tables-async")
async def parse_business_tables_async_endpoint(
    files: List[UploadFile] = File(...),
    project_id: str = Form(...)
):
    """Asynchronous business table parsing endpoint."""
    # Persist the uploads so the worker can read them
    file_paths = []
    for file in files:
        file_path = f"/tmp/uploads/{Path(file.filename).name}"
        with open(file_path, "wb") as f:
            content = await file.read()
            f.write(content)
        file_paths.append(file_path)

    # Submit the async task
    task = parse_business_tables_async.delay(file_paths, project_id)

    return {
        "success": True,
        "code": 202,
        "message": "Task submitted; processing",
        "data": {
            "task_id": task.id,
            "total_files": len(files),
            "status": "processing",
            "estimated_time": len(files) * 10  # rough estimate in seconds
        }
    }

@app.get("/api/v1/inventory/parse-business-tables-status/{task_id}")
async def get_parse_status(task_id: str):
    """Query the status of a parsing task."""
    task = celery_app.AsyncResult(task_id)

    if task.ready():
        return {
            "success": True,
            "code": 200,
            "data": {
                "task_id": task_id,
                "status": "completed",
                "result": task.result
            }
        }
    else:
        return {
            "success": True,
            "code": 200,
            "data": {
                "task_id": task_id,
                "status": "processing",
                "progress": task.info.get('progress', 0) if task.info else 0
            }
        }
```
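On the client side, the status endpoint is typically polled until the task completes. A minimal sketch, where `get_status` is a hypothetical callable that performs the GET request and returns the parsed JSON body:

```python
import time

def poll_until_done(get_status, task_id, interval=0.5, max_tries=100):
    """Poll the status endpoint until the task reports completion."""
    for _ in range(max_tries):
        body = get_status(task_id)
        if body["data"]["status"] == "completed":
            return body["data"]["result"]
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not complete in time")

# Usage with a fake status function that completes on the third call:
calls = {"n": 0}
def fake_status(task_id):
    calls["n"] += 1
    if calls["n"] < 3:
        return {"data": {"status": "processing"}}
    return {"data": {"status": "completed", "result": {"total_tables": 5}}}

print(poll_until_done(fake_status, "task_123456", interval=0.0))
```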

---

## ⚠️ Notes

### 1. Batch Processing Performance

- For large batches, prefer asynchronous processing
- Enforce a sensible file-size limit
- Consider parallel processing for better throughput
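Per-file parallelism can be sketched with a thread pool; `parse_one` below is a stand-in for the real Excel/CSV parser:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_one(path):
    # Placeholder for parse_excel_file / parse_csv_file; returns one "table" per file
    return [path.rsplit(".", 1)[0]]

paths = ["orders.xlsx", "users.csv", "items.xlsx"]
with ThreadPoolExecutor(max_workers=4) as pool:
    per_file = list(pool.map(parse_one, paths))  # preserves input order

all_tables = [t for tables in per_file for t in tables]
print(all_tables)  # ['orders', 'users', 'items']
```

Threads are a reasonable fit here because pandas file I/O releases the GIL for much of the work; a process pool is an alternative for CPU-bound parsing.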

### 2. Table-Name Detection

Because the files are exported manually, the derived table name may be inaccurate:

- Prefer the Excel sheet name
- Fall back to the file name
- Offer a manual-correction step (optional)
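The fallback order can be isolated into a small helper, normalizing names the same way as the implementation above:

```python
from pathlib import Path

def derive_table_name(sheet_name, file_name):
    """Prefer the sheet name; fall back to the file stem. Normalize to snake_case-ish."""
    candidate = (sheet_name or "").strip() or Path(file_name).stem
    return candidate.lower().replace(" ", "_").replace("-", "_")

print(derive_table_name("Order Detail", "export.xlsx"))  # order_detail
print(derive_table_name("", "2024-Q1 Orders.csv"))       # 2024_q1_orders
```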

### 3. Field-Type Inference

- Inference is based on pandas dtypes and may be imprecise
- The AI recognition endpoint can refine the results later
- Record the inferred type so it can be verified downstream
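One source of imprecision: pandas reads numeric-looking string codes (e.g. zero-padded order IDs) as `int64`, so the mapping would emit `bigint` for a field that is semantically a string. A stdlib-only demo of the mapping and its fallback, mirroring `infer_field_type` above:

```python
# Subset of the dtype mapping used by infer_field_type above
TYPE_MAPPING = {
    "object": "varchar(255)",
    "int64": "bigint",
    "float64": "double",
    "bool": "tinyint",
    "datetime64[ns]": "datetime",
}

def infer(dtype_name):
    # Unknown dtypes fall back to varchar(255)
    return TYPE_MAPPING.get(dtype_name, "varchar(255)")

print(infer("int64"))     # bigint
print(infer("category"))  # varchar(255)
```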

### 4. Error Handling

- A single failed file must not abort the rest of the batch
- Record detailed error information
- Return the list of failed files
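The isolation pattern used by the endpoint can be reduced to a per-file try/except; `parse_one` below is a stand-in for the real parser:

```python
def parse_many(paths, parse_one):
    """Parse each file independently; one failure cannot abort the batch."""
    tables, failed = [], []
    for path in paths:
        try:
            tables.extend(parse_one(path))
        except Exception as exc:
            failed.append({"file_name": path, "error": str(exc)})
    return tables, failed

def parse_one(path):
    if path.endswith(".bad"):
        raise ValueError("unsupported file")
    return [path]

tables, failed = parse_many(["a.xlsx", "b.bad", "c.csv"], parse_one)
print(tables)  # ['a.xlsx', 'c.csv']
print(failed)  # [{'file_name': 'b.bad', 'error': 'unsupported file'}]
```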

### 5. Resource Management

- Clean up temporary files promptly
- Cap the number of concurrent files
- Limit individual file size
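A per-file size guard could look like the following; the 10 MiB cap is an assumed value, not one taken from this spec, and should be tuned to the deployment:

```python
MAX_FILE_SIZE = 10 * 1024 * 1024  # assumed cap: 10 MiB per file

def check_file_size(file_name, size_bytes):
    """Reject a file before parsing if it exceeds the configured cap."""
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError(f"{file_name} exceeds the {MAX_FILE_SIZE}-byte limit")

check_file_size("orders.xlsx", 1024)  # small file passes silently
try:
    check_file_size("huge.xlsx", MAX_FILE_SIZE + 1)
except ValueError as exc:
    print("rejected:", exc)
```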

---

## 📝 Development Checklist

- [ ] Batch file upload
- [ ] Excel (.xlsx, .xls) support
- [ ] CSV (.csv) support
- [ ] Multi-sheet Excel support
- [ ] Automatic CSV encoding detection
- [ ] Field-type inference
- [ ] Progress feedback (async version)
- [ ] Error handling (one failed file does not affect the rest)
- [ ] Temporary-file cleanup
- [ ] Unit-test coverage

---

## 🔗 Related Documents

- [API inventory table](../Python接口清单表格.md)
- [API 1.1 - Document Parsing](./01-parse-document.md)
- [API 1.2 - SQL Result Parsing](./02-parse-sql-result.md)
- [API 1.4 - AI Data-Asset Recognition](./04-ai-analyze.md) - can further refine the recognition results