# API Development Spec - Business Table Parsing Endpoint

## 📋 Endpoint Basics

- **Path**: `/api/v1/inventory/parse-business-tables`
- **Method**: `POST`
- **Purpose**: Parse core business tables (Excel/CSV) manually exported by business staff, with batch file parsing and table-structure detection
- **Related page**: `InventoryStep.vue` - Option 3 (business key-table import)
- **Uses an LLM**: ❌ No
- **Effort estimate**: 3 person-days
- **Priority**: Medium

---

## 🎯 Feature Description

This endpoint parses core business-table files manually exported by business staff. It supports:

- **Batch upload**: multiple files in a single request
- **Formats**: Excel (.xlsx, .xls) and CSV (.csv)
- **Table-structure detection**: table structure is detected automatically from Excel sheet names or file names
- **Progress feedback**: progress reporting during batch processing

Typical scenarios:

- SaaS systems (e.g. Salesforce, Kingdee, Youzan) that do not allow a direct database connection
- Business staff manually export the core business tables
- Multiple files need to be processed in one batch

---

## 🔧 Technical Approach

### Tech Stack

```python
# Core dependencies
fastapi>=0.104.0   # Web framework
pydantic>=2.0.0    # Data validation
celery>=5.3.0      # Async tasks (optional)

# Data processing
pandas>=2.0.0      # Batch file processing
openpyxl>=3.1.0    # Excel handling
```

### Implementation Steps

1. **Batch upload**: receive multiple files
2. **File parsing**: read the files in batch with `pandas`
3. **Table detection**: derive the table name from the file name or sheet name
4. **Field detection**: read field names and types from the Excel/CSV header row
5. **Progress feedback**: use an async task or a progress callback
6. **Result aggregation**: merge the parse results from all files
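Step 4 (field detection from the header row) can be sketched with the standard library alone; the sample data below is made up for illustration, and the real implementation uses pandas:

```python
import csv
import io

# Minimal stdlib-only sketch of header-row field detection.
# The CSV content here is a made-up example.
sample = io.StringIO("order_id,amount\n001,9.9\n002,12.5\n")
reader = csv.reader(sample)
header = next(reader)          # first row -> field names
rows = list(reader)            # remaining rows -> data

fields = [{"raw_name": name.strip(), "type": "varchar(255)"} for name in header]
print([f["raw_name"] for f in fields])  # ['order_id', 'amount']
print(len(rows))                        # 2
```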

---

## 📥 Request Format

### Request Method

**Content-Type**: `multipart/form-data`

### Request Parameters

```http
POST /api/v1/inventory/parse-business-tables
Content-Type: multipart/form-data

files: [file1, file2, ...]   # multiple files
project_id: string
```

or

```json
{
  "file_paths": ["/path/to/file1.xlsx", "/path/to/file2.csv", ...],
  "project_id": "project_001"
}
```

### Parameter Reference

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `files` | File[] | Yes | List of uploaded files (form 1; multiple allowed) |
| `file_paths` | string[] | Yes | List of file paths (form 2) |
| `project_id` | string | Yes | Project ID |

---

## 📤 Response Format

### Success Response

```json
{
  "success": true,
  "code": 200,
  "message": "Business tables parsed successfully",
  "data": {
    "tables": [
      {
        "raw_name": "orders",
        "display_name": "Order transaction details",
        "description": "Parsed from file orders.xlsx",
        "source_file": "orders.xlsx",
        "fields": [
          {
            "raw_name": "order_id",
            "display_name": "Order ID",
            "type": "string",
            "comment": null,
            "inferred_type": "varchar(64)"
          }
        ],
        "field_count": 10,
        "row_count": 10000
      }
    ],
    "total_tables": 5,
    "total_fields": 150,
    "total_files": 5,
    "success_files": 5,
    "failed_files": [],
    "parse_time": 3.45,
    "file_info": {
      "processed_files": [
        {
          "file_name": "orders.xlsx",
          "file_size": 1024000,
          "tables_extracted": 1,
          "status": "success"
        }
      ]
    }
  }
}
```

### Async Task Response (when async processing is used)

```json
{
  "success": true,
  "code": 202,
  "message": "Task submitted; processing",
  "data": {
    "task_id": "task_123456",
    "total_files": 5,
    "status": "processing",
    "estimated_time": 30
  }
}
```

---

## 💻 Implementation Example

### FastAPI Implementation (Synchronous)

```python
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from typing import Optional, List
import pandas as pd
import os
from pathlib import Path
import time

app = FastAPI()

class FieldInfo(BaseModel):
    raw_name: str
    display_name: Optional[str] = None
    type: str
    comment: Optional[str] = None
    inferred_type: Optional[str] = None

class TableInfo(BaseModel):
    raw_name: str
    display_name: Optional[str] = None
    description: Optional[str] = None
    source_file: str
    fields: List[FieldInfo]
    field_count: int
    row_count: Optional[int] = None

def infer_field_type(pd_type: str) -> str:
    """Map a pandas dtype name to a database column type."""
    type_mapping = {
        'object': 'varchar(255)',
        'int64': 'bigint',
        'int32': 'int',
        'float64': 'double',
        'float32': 'float',
        'bool': 'tinyint',
        'datetime64[ns]': 'datetime',
        'date': 'date'
    }
    return type_mapping.get(str(pd_type), 'varchar(255)')

def parse_excel_file(file_path: str, file_name: str) -> List[TableInfo]:
    """Parse a single Excel file."""
    tables = []

    try:
        # Read every sheet
        excel_file = pd.ExcelFile(file_path)

        for sheet_name in excel_file.sheet_names:
            df = pd.read_excel(file_path, sheet_name=sheet_name)

            # Skip empty sheets
            if df.empty:
                continue

            # Detect fields from the header row
            fields = []
            for col in df.columns:
                # Infer the column type from the pandas dtype
                col_type = str(df[col].dtype)
                inferred_type = infer_field_type(col_type)

                fields.append(FieldInfo(
                    raw_name=str(col).strip(),
                    display_name=str(col).strip(),
                    type=inferred_type,
                    comment=None,
                    inferred_type=inferred_type
                ))

            if fields:
                # Use the sheet name as the table name; fall back to the file stem
                table_name = sheet_name.lower().replace(' ', '_').replace('-', '_')
                if not table_name:
                    table_name = Path(file_name).stem.lower().replace(' ', '_').replace('-', '_')

                tables.append(TableInfo(
                    raw_name=table_name,
                    display_name=sheet_name,
                    description=f"Parsed from sheet '{sheet_name}' of file {file_name}",
                    source_file=file_name,
                    fields=fields,
                    field_count=len(fields),
                    row_count=len(df)
                ))

    except Exception as e:
        raise RuntimeError(f"Failed to parse file {file_name}: {e}") from e

    return tables

def parse_csv_file(file_path: str, file_name: str) -> List[TableInfo]:
    """Parse a single CSV file."""
    tables = []

    try:
        # Try several encodings in turn
        encodings = ['utf-8', 'gbk', 'gb2312', 'latin-1']
        df = None

        for encoding in encodings:
            try:
                df = pd.read_csv(file_path, encoding=encoding)
                break
            except UnicodeDecodeError:
                continue

        if df is None:
            raise ValueError("Unable to decode the CSV file; check its encoding")

        if df.empty:
            return tables

        # Detect fields from the header row
        fields = []
        for col in df.columns:
            col_type = str(df[col].dtype)
            inferred_type = infer_field_type(col_type)

            fields.append(FieldInfo(
                raw_name=str(col).strip(),
                display_name=str(col).strip(),
                type=inferred_type,
                comment=None,
                inferred_type=inferred_type
            ))

        if fields:
            # Use the file stem as the table name
            table_name = Path(file_name).stem.lower().replace(' ', '_').replace('-', '_')

            tables.append(TableInfo(
                raw_name=table_name,
                display_name=Path(file_name).stem,
                description=f"Parsed from file {file_name}",
                source_file=file_name,
                fields=fields,
                field_count=len(fields),
                row_count=len(df)
            ))

    except Exception as e:
        raise RuntimeError(f"Failed to parse file {file_name}: {e}") from e

    return tables

@app.post("/api/v1/inventory/parse-business-tables")
async def parse_business_tables(
    files: List[UploadFile] = File(...),
    project_id: str = Form(...)
):
    """
    Business table parsing endpoint.

    Parses a batch of core business-table files exported by business staff.
    """
    start_time = time.time()
    upload_dir = Path("/tmp/uploads")
    upload_dir.mkdir(parents=True, exist_ok=True)

    all_tables = []
    processed_files = []
    failed_files = []

    try:
        # Handle each file independently
        for file in files:
            # Strip any client-supplied directory components to avoid path traversal
            file_name = Path(file.filename).name
            file_path = str(upload_dir / file_name)

            try:
                # Persist the upload to a temporary file
                with open(file_path, "wb") as f:
                    content = await file.read()
                    f.write(content)

                file_size = len(content)

                # Pick a parser based on the file extension
                ext = Path(file_name).suffix.lower()
                if ext in ['.xlsx', '.xls']:
                    tables = parse_excel_file(file_path, file_name)
                elif ext == '.csv':
                    tables = parse_csv_file(file_path, file_name)
                else:
                    failed_files.append({
                        "file_name": file_name,
                        "error": f"Unsupported file type: {ext}"
                    })
                    continue

                all_tables.extend(tables)
                processed_files.append({
                    "file_name": file_name,
                    "file_size": file_size,
                    "tables_extracted": len(tables),
                    "status": "success"
                })

            except Exception as e:
                failed_files.append({
                    "file_name": file_name,
                    "error": str(e)
                })
            finally:
                # Always clean up the temporary file, including the
                # unsupported-extension and failure paths
                if os.path.exists(file_path):
                    os.remove(file_path)

        # Aggregate statistics
        total_fields = sum(table.field_count for table in all_tables)
        parse_time = time.time() - start_time

        # Build the response (model_dump is the pydantic v2 replacement for .dict())
        response_data = {
            "tables": [table.model_dump() for table in all_tables],
            "total_tables": len(all_tables),
            "total_fields": total_fields,
            "total_files": len(files),
            "success_files": len(processed_files),
            "failed_files": failed_files,
            "parse_time": round(parse_time, 2),
            "file_info": {
                "processed_files": processed_files
            }
        }

        return {
            "success": True,
            "code": 200,
            "message": f"Parsed {len(processed_files)} file(s); extracted {len(all_tables)} table(s)",
            "data": response_data
        }

    except Exception as e:
        return JSONResponse(
            status_code=500,
            content={
                "success": False,
                "code": 500,
                "message": "Business table parsing failed",
                "error": {
                    "error_code": "PARSE_ERROR",
                    "error_detail": str(e)
                }
            }
        )
```

### Asynchronous Version (Celery, optional)

```python
from celery import Celery

celery_app = Celery('tasks', broker='redis://localhost:6379')

@celery_app.task
def parse_business_tables_async(file_paths: List[str], project_id: str):
    """Parse business tables asynchronously."""
    # Same parsing logic as the synchronous version above
    pass

@app.post("/api/v1/inventory/parse-business-tables-async")
async def parse_business_tables_async_endpoint(
    files: List[UploadFile] = File(...),
    project_id: str = Form(...)
):
    """Asynchronous business table parsing endpoint."""
    # Persist the uploads so the worker can read them
    file_paths = []
    for file in files:
        file_path = f"/tmp/uploads/{Path(file.filename).name}"
        with open(file_path, "wb") as f:
            content = await file.read()
            f.write(content)
        file_paths.append(file_path)

    # Submit the async task
    task = parse_business_tables_async.delay(file_paths, project_id)

    return {
        "success": True,
        "code": 202,
        "message": "Task submitted; processing",
        "data": {
            "task_id": task.id,
            "total_files": len(files),
            "status": "processing",
            "estimated_time": len(files) * 10  # rough estimate in seconds
        }
    }

@app.get("/api/v1/inventory/parse-business-tables-status/{task_id}")
async def get_parse_status(task_id: str):
    """Query the status of a parsing task."""
    task = celery_app.AsyncResult(task_id)

    if task.ready():
        return {
            "success": True,
            "code": 200,
            "data": {
                "task_id": task_id,
                "status": "completed",
                "result": task.result
            }
        }
    else:
        return {
            "success": True,
            "code": 200,
            "data": {
                "task_id": task_id,
                "status": "processing",
                "progress": task.info.get('progress', 0) if task.info else 0
            }
        }
```
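On the client side, the status endpoint is typically polled until the task completes. A minimal sketch, where `get_status` is a hypothetical callable that performs the GET request and returns the parsed JSON body:

```python
import time

def poll_until_done(get_status, task_id, interval=0.5, max_tries=100):
    """Poll the status endpoint until the task reports completion."""
    for _ in range(max_tries):
        body = get_status(task_id)
        if body["data"]["status"] == "completed":
            return body["data"]["result"]
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not complete in time")

# Usage with a fake status function that completes on the third call:
calls = {"n": 0}
def fake_status(task_id):
    calls["n"] += 1
    if calls["n"] < 3:
        return {"data": {"status": "processing"}}
    return {"data": {"status": "completed", "result": {"total_tables": 5}}}

print(poll_until_done(fake_status, "task_123456", interval=0.0))
```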

---

## ⚠️ Notes

### 1. Batch Processing Performance

- For large batches, prefer asynchronous processing
- Enforce a sensible file-size limit
- Consider parallel processing for better throughput
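Per-file parallelism can be sketched with a thread pool; `parse_one` below is a stand-in for the real Excel/CSV parser:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_one(path):
    # Placeholder for parse_excel_file / parse_csv_file; returns one "table" per file
    return [path.rsplit(".", 1)[0]]

paths = ["orders.xlsx", "users.csv", "items.xlsx"]
with ThreadPoolExecutor(max_workers=4) as pool:
    per_file = list(pool.map(parse_one, paths))  # preserves input order

all_tables = [t for tables in per_file for t in tables]
print(all_tables)  # ['orders', 'users', 'items']
```

Threads are a reasonable fit here because pandas file I/O releases the GIL for much of the work; a process pool is an alternative for CPU-bound parsing.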

### 2. Table-Name Detection

Because the files are exported manually, the derived table name may be inaccurate:

- Prefer the Excel sheet name
- Fall back to the file name
- Offer a manual-correction step (optional)
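The fallback order can be isolated into a small helper, normalizing names the same way as the implementation above:

```python
from pathlib import Path

def derive_table_name(sheet_name, file_name):
    """Prefer the sheet name; fall back to the file stem. Normalize to snake_case-ish."""
    candidate = (sheet_name or "").strip() or Path(file_name).stem
    return candidate.lower().replace(" ", "_").replace("-", "_")

print(derive_table_name("Order Detail", "export.xlsx"))  # order_detail
print(derive_table_name("", "2024-Q1 Orders.csv"))       # 2024_q1_orders
```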

### 3. Field-Type Inference

- Inference is based on pandas dtypes and may be imprecise
- The AI recognition endpoint can refine the results later
- Record the inferred type so it can be verified downstream
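One source of imprecision: pandas reads numeric-looking string codes (e.g. zero-padded order IDs) as `int64`, so the mapping would emit `bigint` for a field that is semantically a string. A stdlib-only demo of the mapping and its fallback, mirroring `infer_field_type` above:

```python
# Subset of the dtype mapping used by infer_field_type above
TYPE_MAPPING = {
    "object": "varchar(255)",
    "int64": "bigint",
    "float64": "double",
    "bool": "tinyint",
    "datetime64[ns]": "datetime",
}

def infer(dtype_name):
    # Unknown dtypes fall back to varchar(255)
    return TYPE_MAPPING.get(dtype_name, "varchar(255)")

print(infer("int64"))     # bigint
print(infer("category"))  # varchar(255)
```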

### 4. Error Handling

- A single failed file must not abort the rest of the batch
- Record detailed error information
- Return the list of failed files
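The isolation pattern used by the endpoint can be reduced to a per-file try/except; `parse_one` below is a stand-in for the real parser:

```python
def parse_many(paths, parse_one):
    """Parse each file independently; one failure cannot abort the batch."""
    tables, failed = [], []
    for path in paths:
        try:
            tables.extend(parse_one(path))
        except Exception as exc:
            failed.append({"file_name": path, "error": str(exc)})
    return tables, failed

def parse_one(path):
    if path.endswith(".bad"):
        raise ValueError("unsupported file")
    return [path]

tables, failed = parse_many(["a.xlsx", "b.bad", "c.csv"], parse_one)
print(tables)  # ['a.xlsx', 'c.csv']
print(failed)  # [{'file_name': 'b.bad', 'error': 'unsupported file'}]
```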

### 5. Resource Management

- Clean up temporary files promptly
- Cap the number of concurrent files
- Limit individual file size
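A per-file size guard could look like the following; the 10 MiB cap is an assumed value, not one taken from this spec, and should be tuned to the deployment:

```python
MAX_FILE_SIZE = 10 * 1024 * 1024  # assumed cap: 10 MiB per file

def check_file_size(file_name, size_bytes):
    """Reject a file before parsing if it exceeds the configured cap."""
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError(f"{file_name} exceeds the {MAX_FILE_SIZE}-byte limit")

check_file_size("orders.xlsx", 1024)  # small file passes silently
try:
    check_file_size("huge.xlsx", MAX_FILE_SIZE + 1)
except ValueError as exc:
    print("rejected:", exc)
```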

---

## 📝 Development Checklist

- [ ] Batch file upload
- [ ] Excel (.xlsx, .xls) support
- [ ] CSV (.csv) support
- [ ] Multi-sheet Excel support
- [ ] Automatic CSV encoding detection
- [ ] Field-type inference
- [ ] Progress feedback (async version)
- [ ] Error handling (one failed file does not affect the rest)
- [ ] Temporary-file cleanup
- [ ] Unit-test coverage

---

## 🔗 Related Documents

- [API inventory table](../Python接口清单表格.md)
- [API 1.1 - Document Parsing](./01-parse-document.md)
- [API 1.2 - SQL Result Parsing](./02-parse-sql-result.md)
- [API 1.4 - AI Data-Asset Recognition](./04-ai-analyze.md) - can further refine the recognition results