DocumentToolkit
Toolkit for parsing documents and answering questions about them.
Supported backends:
- Chunkr: https://github.com/lumina-ai-inc/chunkr
- pymupdf: https://github.com/pymupdf/PyMuPDF
- unstructured: https://github.com/Unstructured-IO/unstructured
- [ ] unify the filepath cache logic (also support audio_toolkit, image_toolkit)
DocumentToolkit
Bases: AsyncBaseToolkit
Source code in utu/tools/document_toolkit.py
tools_map
property
tools_map: dict[str, Callable]
Lazily loaded tools map, built by collecting the tools registered with @register_tool.
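The lazy-loading idea can be illustrated with a small, self-contained sketch. Everything here is a hypothetical stand-in: this `register_tool` only sets a marker attribute, and `SketchToolkit` is not the real class. The actual decorator and collection logic in utu/tools may differ.

```python
from typing import Callable, Optional


def register_tool(func: Callable) -> Callable:
    """Hypothetical stand-in decorator: mark a method as a tool."""
    func._is_tool = True
    return func


class SketchToolkit:
    """Illustrative toolkit; not the real AsyncBaseToolkit."""

    def __init__(self) -> None:
        # Nothing is collected until tools_map is first read.
        self._tools_map: Optional[dict] = None

    @register_tool
    def echo(self, text: str) -> str:
        return text

    def helper(self) -> None:
        """Not registered, so it never appears in tools_map."""

    @property
    def tools_map(self) -> dict:
        if self._tools_map is None:
            self._tools_map = {}
            # Only methods defined directly on this class are scanned here.
            for name, member in vars(type(self)).items():
                if getattr(member, "_is_tool", False):
                    self._tools_map[name] = getattr(self, name)
        return self._tools_map
```

The map is built once on first access and the same dict is returned afterwards.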
__init__
__init__(config: ToolkitConfig = None) -> None
Initialize the DocumentToolkit with the configured parser and LLM.
Source code in utu/tools/document_toolkit.py
document_parse
async
document_parse(
document_path: str,
chunk_size: int = None,
chunk_id: int = None,
) -> str
Parse a document and return the processed text.
- Supported file types: pdf, docx, pptx, xlsx, xls, ppt, doc
- If the document is too large, it is truncated to the first chunk_size characters.
- If chunk_id is passed, the returned text begins at offset chunk_id * chunk_size.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_path | str | Local path or URL to a document. | required |
| chunk_size | int | Number of characters to process at once. Defaults to 10_000. | None |
| chunk_id | int | Chunk ID to start from. Defaults to 0. | None |
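The chunk arithmetic described for document_parse can be sketched as a plain string slice. The helper name `select_chunk` is hypothetical, and ending the chunk at `start + chunk_size` is an assumption: the docstring only states where the chunk begins.

```python
from typing import Optional

DEFAULT_CHUNK_SIZE = 10_000  # default stated in the parameter table


def select_chunk(
    text: str,
    chunk_size: Optional[int] = None,
    chunk_id: Optional[int] = None,
) -> str:
    """Hypothetical helper: return one chunk of already-parsed text."""
    size = DEFAULT_CHUNK_SIZE if chunk_size is None else chunk_size
    index = 0 if chunk_id is None else chunk_id
    start = index * size  # chunk begins at chunk_id * chunk_size
    return text[start:start + size]  # assumed end of the chunk
```

With `chunk_size=4`, the text `"abcdefghij"` yields `"abcd"` for chunk 0, `"efgh"` for chunk 1, and `"ij"` for chunk 2.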
Source code in utu/tools/document_toolkit.py
document_qa
async
document_qa(
document_path: str, question: str | None = None
) -> str
Return a summary of the document's content, or answer a question about the document.
Supported file types: pdf, docx, pptx, xlsx, xls, ppt, doc
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_path | str | Local path or URL to a document. | required |
| question | str | The question to answer. If not provided, return a summary of the document. | None |
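A possible calling pattern is to request the summary first (no question) and then ask follow-up questions. The sketch below runs without the library by using `StubToolkit`, a hypothetical stand-in that only mirrors the documented document_qa signature; `summarize_then_ask` is likewise an illustrative helper, not part of the toolkit.

```python
import asyncio
from typing import Optional


class StubToolkit:
    """Hypothetical stand-in with the documented document_qa signature."""

    async def document_qa(
        self, document_path: str, question: Optional[str] = None
    ) -> str:
        if question is None:
            return f"summary of {document_path}"
        return f"answer to {question!r} from {document_path}"


async def summarize_then_ask(toolkit, document_path: str, questions: list) -> dict:
    """Fetch a summary, then answer each question in order."""
    results = {"summary": await toolkit.document_qa(document_path)}
    for question in questions:
        results[question] = await toolkit.document_qa(document_path, question)
    return results


if __name__ == "__main__":
    print(asyncio.run(summarize_then_ask(StubToolkit(), "report.pdf", ["Who wrote it?"])))
```

With a real DocumentToolkit instance in place of the stub, the same helper would apply as long as the method signature matches the one documented here.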
Source code in utu/tools/document_toolkit.py
setup_env
setup_env(env: _BaseEnv) -> None
Set up the env and workspace.
Source code in utu/tools/base.py
setup_workspace
setup_workspace(workspace_root: str = None)
Set up the workspace. Implemented inside specific toolkits.
Source code in utu/tools/base.py
build
async
build() -> None
Build/initialize the toolkit. Override in subclasses that need async initialization.
Source code in utu/tools/base.py
cleanup
async
cleanup() -> None
Cleanup toolkit resources. Override in subclasses that need cleanup.
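Since build and cleanup are paired async hooks, one reasonable way to use them is a try/finally wrapper so that cleanup runs even if a tool call fails. This is a sketch under assumptions: `LifecycleStub` and `run_with_toolkit` are hypothetical names, and the stub only records the order of calls.

```python
import asyncio


class LifecycleStub:
    """Hypothetical stand-in that records which hooks were called."""

    def __init__(self) -> None:
        self.events = []

    async def build(self) -> None:
        self.events.append("build")

    async def cleanup(self) -> None:
        self.events.append("cleanup")

    async def call_tool(self, name: str, arguments: dict) -> str:
        self.events.append(f"call:{name}")
        return "ok"


async def run_with_toolkit(toolkit, name: str, arguments: dict) -> str:
    """Build the toolkit, call one tool, and always clean up."""
    await toolkit.build()
    try:
        return await toolkit.call_tool(name, arguments)
    finally:
        await toolkit.cleanup()
```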
Source code in utu/tools/base.py
get_tools_map_func
get_tools_map_func() -> dict[str, Callable]
Get the tools map, filtered by config.activated_tools when that is not None.
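The filtering rule can be sketched as a dictionary comprehension. `filter_tools` is a hypothetical helper, and silently skipping unknown names is an assumption made for this sketch; the real method may handle unknown names differently, for example by raising.

```python
from typing import Callable, Optional


def filter_tools(tools_map: dict, activated_tools: Optional[list]) -> dict:
    """Hypothetical helper: keep only activated tools, or all if None."""
    if activated_tools is None:
        return dict(tools_map)  # no filter configured: return every tool
    # Assumption: names missing from tools_map are skipped, not reported.
    return {name: tools_map[name] for name in activated_tools if name in tools_map}
```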
Source code in utu/tools/base.py
get_tools_in_agents
get_tools_in_agents() -> list[FunctionTool]
Get tools in openai-agents format.
Source code in utu/tools/base.py
get_tools_in_openai
get_tools_in_openai() -> list[dict]
Get tools in OpenAI format.
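For orientation, a tool in the OpenAI Chat Completions function-calling format is a dict with "type": "function" and a nested "function" object holding a name, a description, and a JSON Schema for the parameters. The sketch below builds such a dict by hand for document_qa from its documented parameters; whether get_tools_in_openai emits exactly these fields is an assumption, not something the source states.

```python
import json

# Hand-written illustration; not the output of get_tools_in_openai.
document_qa_tool = {
    "type": "function",
    "function": {
        "name": "document_qa",
        "description": "Summarize a document or answer a question about it.",
        "parameters": {
            "type": "object",
            "properties": {
                "document_path": {
                    "type": "string",
                    "description": "Local path or URL to a document.",
                },
                "question": {
                    "type": "string",
                    "description": "The question to answer. If omitted, a summary is returned.",
                },
            },
            "required": ["document_path"],
        },
    },
}

if __name__ == "__main__":
    print(json.dumps(document_qa_tool, indent=2))
```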
Source code in utu/tools/base.py
get_tools_in_mcp
get_tools_in_mcp() -> list[Tool]
Get tools in MCP format.
Source code in utu/tools/base.py
call_tool
async
call_tool(name: str, arguments: dict) -> str
Call a tool by its name.
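Calling a tool by name amounts to a lookup in the tools map followed by an awaited call with keyword arguments. The free function below is a minimal sketch of that dispatch, not the actual method body; the `shout` tool and the ValueError for unknown names are assumptions made for illustration.

```python
import asyncio


async def shout(text: str) -> str:
    """Hypothetical async tool used only for this example."""
    return text.upper()


async def call_tool(tools_map: dict, name: str, arguments: dict) -> str:
    """Look up a tool by name and await it with the given arguments."""
    if name not in tools_map:
        # Assumption: an unknown tool name is reported with ValueError.
        raise ValueError(f"Tool {name!r} not found")
    return await tools_map[name](**arguments)


if __name__ == "__main__":
    print(asyncio.run(call_tool({"shout": shout}, "shout", {"text": "hi"})))
```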
Source code in utu/tools/base.py