finyx_data_ai/docs/01-parse-document.md

# 接口开发说明 - 文档解析接口

## 📋 接口基本信息

- **接口路径**: `/api/v1/inventory/parse-document`
- **请求方法**: `POST`
- **接口功能**: 解析上传的数据字典文档（Excel/Word/PDF），提取表结构信息
- **涉及页面**: `InventoryStep.vue` - 方案一（已有文档导入）
- **是否涉及大模型**: ❌ 否
- **工作量评估**: 5 人日
- **优先级**: 中

---

## 🎯 功能描述

该接口用于解析用户上传的数据字典文档，支持以下格式：
- **Excel**: `.xlsx`, `.xls`
- **Word**: `.doc`, `.docx`
- **PDF**: `.pdf`（可选）

从文档中提取以下信息：
- 表名（英文）
- 字段名（英文）
- 字段类型
- 字段注释/描述（中文）

---

## 🔧 技术实现方案

### 技术栈

```python
# 核心依赖
fastapi>=0.104.0          # Web 框架
pydantic>=2.0.0           # 数据验证
python-multipart>=0.0.6   # 文件上传支持

# 文档处理
pandas>=2.0.0             # 数据处理
openpyxl>=3.1.0           # Excel 处理
python-docx>=1.1.0        # Word 处理
pdfplumber>=0.10.0        # PDF 处理（可选）
```

### 实现思路

1. **文件上传**: 使用 FastAPI 的 `UploadFile` 接收文件
2. **文件类型识别**: 根据文件扩展名或 MIME 类型识别文件格式
3. **文档解析**:
   - Excel: 使用 `pandas` 或 `openpyxl` 读取
   - Word: 使用 `python-docx` 解析表格和文本
   - PDF: 使用 `pdfplumber` 提取表格和文本
4. **表结构提取**: 识别文档中的表结构信息，提取表名、字段名、类型、注释
5. **数据验证**: 验证提取的数据格式是否正确
6. **结果返回**: 返回标准化的表结构数据

---

## 📥 请求格式

### 请求方式

**Content-Type**: `multipart/form-data` 或 `application/json`

### 方式一：文件上传（推荐）

```http
POST /api/v1/inventory/parse-document
Content-Type: multipart/form-data

file: [二进制文件]
project_id: string
file_type: excel | word | pdf (可选，自动识别)
```

### 方式二：文件路径（如果文件已上传到服务器）

```json
{
  "file_path": "/path/to/document.xlsx",
  "file_type": "excel | word | pdf",
  "project_id": "project_001"
}
```

### 请求参数说明

| 参数名 | 类型 | 必填 | 说明 |
|--------|------|------|------|
| `file` | File | 是 | 上传的文件（方式一） |
| `file_path` | string | 是 | 文件路径（方式二） |
| `file_type` | string | 否 | 文件类型：`excel` / `word` / `pdf`，如果不传则根据文件扩展名自动识别 |
| `project_id` | string | 是 | 项目ID |

---

## 📤 响应格式

### 成功响应

```json
{
  "success": true,
  "code": 200,
  "message": "文档解析成功",
  "data": {
    "tables": [
      {
        "raw_name": "t_user_base_01",
        "display_name": "用户基础信息表",
        "description": "存储用户基本信息的表",
        "fields": [
          {
            "raw_name": "user_id",
            "display_name": "用户ID",
            "type": "varchar(64)",
            "comment": "用户的唯一标识符",
            "is_primary_key": true,
            "is_nullable": false,
            "default_value": null
          },
          {
            "raw_name": "user_name",
            "display_name": "用户名",
            "type": "varchar(50)",
            "comment": "用户登录名",
            "is_primary_key": false,
            "is_nullable": true,
            "default_value": null
          }
        ],
        "field_count": 2
      }
    ],
    "total_tables": 10,
    "total_fields": 245,
    "parse_time": 1.23,
    "file_info": {
      "file_name": "数据字典.xlsx",
      "file_size": 1024000,
      "file_type": "excel"
    }
  }
}
```

### 失败响应

```json
{
  "success": false,
  "code": 400,
  "message": "文件格式不支持",
  "error": {
    "error_code": "UNSUPPORTED_FILE_TYPE",
    "error_detail": "仅支持 Excel (.xlsx, .xls), Word (.doc, .docx), PDF (.pdf) 格式"
  }
}
```

### 响应字段说明

| 字段名 | 类型 | 说明 |
|--------|------|------|
| `success` | boolean | 请求是否成功 |
| `code` | integer | HTTP 状态码 |
| `message` | string | 响应消息 |
| `data` | object | 响应数据 |
| `data.tables` | array | 解析出的表列表 |
| `data.tables[].raw_name` | string | 表名（英文/原始名称） |
| `data.tables[].display_name` | string | 表显示名称（中文，如果文档中有） |
| `data.tables[].description` | string | 表描述 |
| `data.tables[].fields` | array | 字段列表 |
| `data.tables[].fields[].raw_name` | string | 字段名（英文） |
| `data.tables[].fields[].display_name` | string | 字段显示名称（中文） |
| `data.tables[].fields[].type` | string | 字段类型 |
| `data.tables[].fields[].comment` | string | 字段注释 |
| `data.total_tables` | integer | 总表数 |
| `data.total_fields` | integer | 总字段数 |
| `data.parse_time` | float | 解析耗时（秒） |

---

## 💻 代码实现示例

### FastAPI 实现

```python
from fastapi import FastAPI, UploadFile, File, Form, HTTPException
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from typing import Optional, List
import pandas as pd
from docx import Document
import pdfplumber
import os
from pathlib import Path
import time

app = FastAPI()

class FieldInfo(BaseModel):
    raw_name: str
    display_name: Optional[str] = None
    type: str
    comment: Optional[str] = None
    is_primary_key: bool = False
    is_nullable: bool = True
    default_value: Optional[str] = None

class TableInfo(BaseModel):
    raw_name: str
    display_name: Optional[str] = None
    description: Optional[str] = None
    fields: List[FieldInfo]
    field_count: int

class ParseDocumentResponse(BaseModel):
    success: bool
    code: int
    message: str
    data: Optional[dict] = None
    error: Optional[dict] = None

def parse_excel(file_path: str) -> List[TableInfo]:
    """解析 Excel 文件"""
    tables = []
    try:
        # 读取 Excel 文件
        df = pd.read_excel(file_path, sheet_name=None)  # 读取所有 sheet

        for sheet_name, df_sheet in df.items():
            # 识别表结构（根据 Excel 格式约定）
            # 假设第一列是字段名，第二列是类型，第三列是注释
            fields = []
            for _, row in df_sheet.iterrows():
                if pd.notna(row.iloc[0]):  # 字段名不为空
                    field = FieldInfo(
                        raw_name=str(row.iloc[0]).strip(),
                        display_name=str(row.iloc[2]).strip() if len(row) > 2 and pd.notna(row.iloc[2]) else None,
                        type=str(row.iloc[1]).strip() if len(row) > 1 and pd.notna(row.iloc[1]) else "varchar(255)",
                        comment=str(row.iloc[2]).strip() if len(row) > 2 and pd.notna(row.iloc[2]) else None
                    )
                    fields.append(field)

            if fields:
                table = TableInfo(
                    raw_name=sheet_name,
                    display_name=sheet_name,
                    fields=fields,
                    field_count=len(fields)
                )
                tables.append(table)

    except Exception as e:
        raise Exception(f"Excel 解析失败: {str(e)}")

    return tables

def parse_word(file_path: str) -> List[TableInfo]:
    """解析 Word 文件"""
    tables = []
    try:
        doc = Document(file_path)

        # 遍历文档中的表格
        for table_idx, table in enumerate(doc.tables):
            fields = []
            # 假设第一行是表头，后续行是字段信息
            # 约定：第一列字段名，第二列类型，第三列注释
            for row in table.rows[1:]:  # 跳过表头
                if len(row.cells) >= 3:
                    field_name = row.cells[0].text.strip()
                    if field_name:  # 字段名不为空
                        field = FieldInfo(
                            raw_name=field_name,
                            display_name=row.cells[2].text.strip() if len(row.cells) > 2 and row.cells[2].text.strip() else None,
                            type=row.cells[1].text.strip() if len(row.cells) > 1 and row.cells[1].text.strip() else "varchar(255)",
                            comment=row.cells[2].text.strip() if len(row.cells) > 2 and row.cells[2].text.strip() else None
                        )
                        fields.append(field)

            if fields:
                table_info = TableInfo(
                    raw_name=f"table_{table_idx + 1}",
                    display_name=f"表{table_idx + 1}",
                    fields=fields,
                    field_count=len(fields)
                )
                tables.append(table_info)

    except Exception as e:
        raise Exception(f"Word 解析失败: {str(e)}")

    return tables

def parse_pdf(file_path: str) -> List[TableInfo]:
    """解析 PDF 文件"""
    tables = []
    try:
        with pdfplumber.open(file_path) as pdf:
            for page_idx, page in enumerate(pdf.pages):
                # 提取表格
                page_tables = page.extract_tables()
                for table_idx, table in enumerate(page_tables):
                    if table and len(table) > 1:
                        fields = []
                        # 假设第一行是表头，后续行是字段信息
                        for row in table[1:]:
                            if len(row) >= 3 and row[0]:
                                field = FieldInfo(
                                    raw_name=str(row[0]).strip(),
                                    display_name=str(row[2]).strip() if len(row) > 2 and row[2] else None,
                                    type=str(row[1]).strip() if len(row) > 1 and row[1] else "varchar(255)",
                                    comment=str(row[2]).strip() if len(row) > 2 and row[2] else None
                                )
                                fields.append(field)

                        if fields:
                            table_info = TableInfo(
                                raw_name=f"table_{page_idx + 1}_{table_idx + 1}",
                                display_name=f"表{page_idx + 1}-{table_idx + 1}",
                                fields=fields,
                                field_count=len(fields)
                            )
                            tables.append(table_info)

    except Exception as e:
        raise Exception(f"PDF 解析失败: {str(e)}")

    return tables

def detect_file_type(file_name: str) -> str:
    """根据文件扩展名检测文件类型"""
    ext = Path(file_name).suffix.lower()
    if ext in ['.xlsx', '.xls']:
        return 'excel'
    elif ext in ['.docx', '.doc']:
        return 'word'
    elif ext == '.pdf':
        return 'pdf'
    else:
        raise ValueError(f"不支持的文件类型: {ext}")

@app.post("/api/v1/inventory/parse-document", response_model=ParseDocumentResponse)
async def parse_document(
    file: Optional[UploadFile] = File(None),
    file_path: Optional[str] = Form(None),
    file_type: Optional[str] = Form(None),
    project_id: str = Form(...)
):
    """
    文档解析接口

    支持解析 Excel、Word、PDF 格式的数据字典文档，提取表结构信息
    """
    start_time = time.time()

    try:
        # 验证参数
        if not file and not file_path:
            raise HTTPException(
                status_code=400,
                detail="必须提供文件或文件路径"
            )

        # 处理文件上传
        if file:
            # 保存上传的文件到临时目录
            upload_dir = Path("/tmp/uploads")
            upload_dir.mkdir(exist_ok=True)
            file_path = str(upload_dir / file.filename)

            with open(file_path, "wb") as f:
                content = await file.read()
                f.write(content)

            file_name = file.filename
            file_size = len(content)

            # 自动检测文件类型
            if not file_type:
                file_type = detect_file_type(file_name)
        else:
            # 使用提供的文件路径
            if not os.path.exists(file_path):
                raise HTTPException(
                    status_code=404,
                    detail=f"文件不存在: {file_path}"
                )
            file_name = Path(file_path).name
            file_size = os.path.getsize(file_path)

            # 自动检测文件类型
            if not file_type:
                file_type = detect_file_type(file_name)

        # 根据文件类型选择解析方法
        if file_type == 'excel':
            tables = parse_excel(file_path)
        elif file_type == 'word':
            tables = parse_word(file_path)
        elif file_type == 'pdf':
            tables = parse_pdf(file_path)
        else:
            raise HTTPException(
                status_code=400,
                detail=f"不支持的文件类型: {file_type}"
            )

        # 计算统计信息
        total_fields = sum(table.field_count for table in tables)
        parse_time = time.time() - start_time

        # 构建响应数据
        response_data = {
            "tables": [table.dict() for table in tables],
            "total_tables": len(tables),
            "total_fields": total_fields,
            "parse_time": round(parse_time, 2),
            "file_info": {
                "file_name": file_name,
                "file_size": file_size,
                "file_type": file_type
            }
        }

        return ParseDocumentResponse(
            success=True,
            code=200,
            message="文档解析成功",
            data=response_data
        )

    except HTTPException:
        raise
    except Exception as e:
        return ParseDocumentResponse(
            success=False,
            code=500,
            message="文档解析失败",
            error={
                "error_code": "PARSE_ERROR",
                "error_detail": str(e)
            }
        )
```

---

## 🧪 测试用例

### 单元测试示例

```python
import pytest
from fastapi.testclient import TestClient
from pathlib import Path
import tempfile

client = TestClient(app)

def test_parse_excel_document():
    """测试解析 Excel 文档"""
    # 创建测试 Excel 文件
    test_data = {
        '字段名': ['user_id', 'user_name', 'email'],
        '类型': ['varchar(64)', 'varchar(50)', 'varchar(100)'],
        '注释': ['用户ID', '用户名', '邮箱']
    }
    df = pd.DataFrame(test_data)

    with tempfile.NamedTemporaryFile(suffix='.xlsx', delete=False) as tmp:
        df.to_excel(tmp.name, index=False)

        with open(tmp.name, 'rb') as f:
            response = client.post(
                "/api/v1/inventory/parse-document",
                files={"file": ("test.xlsx", f, "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")},
                data={"project_id": "test_project"}
            )

    assert response.status_code == 200
    data = response.json()
    assert data["success"] is True
    assert len(data["data"]["tables"]) > 0
    assert data["data"]["total_tables"] > 0

def test_unsupported_file_type():
    """测试不支持的文件类型"""
    with tempfile.NamedTemporaryFile(suffix='.txt', delete=False) as tmp:
        tmp.write(b"test content")
        tmp.flush()

        with open(tmp.name, 'rb') as f:
            response = client.post(
                "/api/v1/inventory/parse-document",
                files={"file": ("test.txt", f, "text/plain")},
                data={"project_id": "test_project"}
            )

    assert response.status_code == 400
    data = response.json()
    assert data["success"] is False
```

---

## ⚠️ 注意事项

### 1. 文件大小限制

- **Excel**: 建议限制为 50MB
- **Word**: 建议限制为 50MB
- **PDF**: 建议限制为 50MB

在 FastAPI 中设置：

```python
from fastapi import File, UploadFile
from fastapi.exceptions import RequestEntityTooLarge

@app.exception_handler(RequestEntityTooLarge)
async def handle_upload_limit(exc):
    return JSONResponse(
        status_code=413,
        content={
            "success": False,
            "message": "文件大小超过限制（最大 50MB）"
        }
    )
```

### 2. 文件格式约定

由于不同用户的数据字典文档格式可能不同，建议：
- **Excel**: 约定格式为第一列字段名，第二列类型，第三列注释
- **Word**: 约定使用表格格式，第一行表头，后续行字段信息
- **PDF**: 约定使用表格格式

如果格式不统一，需要增加更智能的识别逻辑。

### 3. 错误处理

- 文件读取失败：返回 400 错误
- 文件格式错误：返回 400 错误，提示正确的格式
- 解析失败：返回 500 错误，记录详细错误日志
- 文件过大：返回 413 错误

### 4. 性能优化

- 对于大文件，考虑使用异步处理
- 使用临时文件，处理完成后删除
- 考虑添加缓存机制（相同文件解析结果缓存）

### 5. 安全性

- 文件上传路径验证，防止路径遍历攻击
- 文件类型验证，防止恶意文件上传
- 文件大小限制，防止 DoS 攻击
- 临时文件及时清理

---

## 📝 开发检查清单

- [ ] 支持 Excel (.xlsx, .xls) 格式解析
- [ ] 支持 Word (.doc, .docx) 格式解析
- [ ] 支持 PDF (.pdf) 格式解析（可选）
- [ ] 文件类型自动识别
- [ ] 文件大小限制（50MB）
- [ ] 错误处理和异常捕获
- [ ] 单元测试覆盖
- [ ] 日志记录
- [ ] 临时文件清理
- [ ] API 文档生成（Swagger）

---

## 🔗 相关文档

- [接口清单表格](../Python接口清单表格.md)
- [Python技术人员工作量文档](../Python技术人员工作量文档.md)
- [FastAPI 官方文档](https://fastapi.tiangolo.com/)
- [pandas 文档](https://pandas.pydata.org/docs/)
- [python-docx 文档](https://python-docx.readthedocs.io/)

---

## 📞 联系方式

如有开发问题，请联系：
- **接口负责人**: [待填写]
- **技术顾问**: [待填写]