feat: add LMDB support for multimodal resources #3938

Open · wants to merge 1 commit into base: main
10 changes: 8 additions & 2 deletions docs/source/Customization/自定义数据集.md
@@ -109,20 +109,26 @@ query-response format:

### Multimodal

For multimodal datasets, the format is the same as for the tasks above. The difference is the addition of the `images`, `videos`, and `audios` keys, which hold the URLs or paths (absolute paths are recommended) of the multimodal resources. The `<image>`, `<video>`, and `<audio>` tags mark where images, videos, and audio are inserted; ms-swift supports multiple images, videos, and audio files. These special tokens are replaced during preprocessing, see [here](https://github.com/modelscope/ms-swift/blob/main/swift/llm/template/template/qwen.py#L198). The examples below demonstrate the data format for plain text as well as for image, video, and audio data.

SWIFT supports loading multimodal resources from an LMDB database using the format `lmdb://key@path_to_lmdb`. This is an efficient way to store and access large collections of images, videos, and audio, and is especially suited to large-scale multimodal datasets during training and inference. Make sure LMDB is installed first: `pip install lmdb`.
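As a rough reference, here is a minimal sketch of how such a database might be built with the `lmdb` Python package; the image file names are hypothetical, while the keys and database path follow the `lmdb://` examples in this section:

```python
import lmdb

# Minimal sketch: pack local image files into an LMDB database so they can be
# referenced as lmdb://<key>@/path/to/animals_lmdb. File names are placeholders.
env = lmdb.open('/path/to/animals_lmdb', map_size=1 << 34)  # reserve up to 16 GiB
with env.begin(write=True) as txn:
    for key, file_path in [('rabbit_img', 'rabbit.jpg'), ('tiger_img', 'tiger.jpg')]:
        with open(file_path, 'rb') as f:
            txn.put(key.encode(), f.read())  # value is the raw file bytes
env.close()
```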

Pre-training:
```jsonl
{"messages": [{"role": "assistant", "content": "预训练的文本在这里"}]}
{"messages": [{"role": "assistant", "content": "<image>是一只小狗,<image>是一只小猫"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
{"messages": [{"role": "assistant", "content": "<image>是一只从LMDB加载的小兔子"}], "images": ["lmdb://rabbit_img@/path/to/animals_lmdb"]}
{"messages": [{"role": "assistant", "content": "<audio>描述了今天天气真不错"}], "audios": ["/xxx/x.wav"]}
{"messages": [{"role": "assistant", "content": "<image>是一个大象,<video>是一只狮子在跑步"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
{"messages": [{"role": "assistant", "content": "<video>展示了太空中的星系"}], "videos": ["lmdb://space_video@/path/to/videos_lmdb"]}
```

Fine-tuning:
```jsonl
{"messages": [{"role": "user", "content": "浙江的省会在哪?"}, {"role": "assistant", "content": "浙江的省会在杭州。"}]}
{"messages": [{"role": "user", "content": "<image><image>两张图片有什么区别"}, {"role": "assistant", "content": "前一张是小猫,后一张是小狗"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
{"messages": [{"role": "user", "content": "<image>这个动物是什么?"}, {"role": "assistant", "content": "这是一只棕色的熊猫,很罕见的物种。"}], "images": ["lmdb://panda_img@/path/to/wildlife_lmdb"]}
{"messages": [{"role": "user", "content": "<image>和<image>这两种动物有什么区别?"}, {"role": "assistant", "content": "第一张图是老虎,第二张图是狮子。"}], "images": ["lmdb://tiger_img@/path/to/animals_lmdb", "lmdb://lion_img@/path/to/animals_lmdb"]}
{"messages": [{"role": "user", "content": "<audio>语音说了什么"}, {"role": "assistant", "content": "今天天气真好呀"}], "audios": ["/xxx/x.mp3"]}
{"messages": [{"role": "system", "content": "你是个有用无害的助手"}, {"role": "user", "content": "<image>图片中是什么,<video>视频中是什么"}, {"role": "assistant", "content": "图片中是一个大象,视频中是一只小狗在草地上奔跑"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
```
7 changes: 6 additions & 1 deletion docs/source_en/Customization/Custom-dataset.md
@@ -113,22 +113,27 @@ Please refer to the [Embedding training documentation](../BestPractices/Embedding.md#dataset-format)

### Multimodal

For multimodal datasets, the format is the same as the aforementioned tasks. The difference lies in the addition of several keys: `images`, `videos`, and `audios`, which represent the URLs or paths (preferably absolute paths) of multimodal resources. The tags `<image>`, `<video>`, and `<audio>` indicate where to insert images, videos, or audio. MS-Swift supports multiple images, videos, and audio files. These special tokens will be replaced during preprocessing, as referenced [here](https://github.com/modelscope/ms-swift/blob/main/swift/llm/template/template/qwen.py#L198). The examples below demonstrate the data format for plain text, as well as formats containing image, video, and audio data.

SWIFT supports loading multimodal resources from LMDB databases using the format `lmdb://key@path_to_lmdb`. This is highly effective for storing and accessing large collections of images, videos, audio files, and other resources, especially when training or running inference on large-scale multimodal datasets. Make sure to install LMDB first: `pip install lmdb`.
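As a quick sanity check that a key referenced in a dataset actually exists, a read-only probe along these lines may help (the key and database path are the placeholder values from the examples below):

```python
import lmdb

# Read-only probe: confirm a dataset key exists in the database.
# 'rabbit_img' and '/path/to/animals_lmdb' are placeholders from the examples.
env = lmdb.open('/path/to/animals_lmdb', readonly=True, lock=False)
with env.begin(write=False) as txn:
    data = txn.get(b'rabbit_img')
print('found' if data is not None else 'missing')
env.close()
```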

Pre-training:
```jsonl
{"messages": [{"role": "assistant", "content": "Pre-trained text goes here"}]}
{"messages": [{"role": "assistant", "content": "<image>is a puppy, <image>is a kitten"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
{"messages": [{"role": "assistant", "content": "<image>is a rabbit loaded from LMDB"}], "images": ["lmdb://rabbit_img@/path/to/animals_lmdb"]}
{"messages": [{"role": "assistant", "content": "<audio>describes how nice the weather is today"}], "audios": ["/xxx/x.wav"]}
{"messages": [{"role": "assistant", "content": "<image>is an elephant, <video>is a lion running"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
{"messages": [{"role": "assistant", "content": "<video>shows galaxies in space"}], "videos": ["lmdb://space_video@/path/to/videos_lmdb"]}
```

Supervised Fine-tuning:

```jsonl
{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."}]}
{"messages": [{"role": "user", "content": "<image><image>What is the difference between the two images?"}, {"role": "assistant", "content": "The first one is a kitten, and the second one is a puppy."}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
{"messages": [{"role": "user", "content": "<image>What is this animal?"}, {"role": "assistant", "content": "This is a brown panda, a very rare species."}], "images": ["lmdb://panda_img@/path/to/wildlife_lmdb"]}
{"messages": [{"role": "user", "content": "<image>and<image>What's the difference between these two animals?"}, {"role": "assistant", "content": "The first image is a tiger, and the second image is a lion."}], "images": ["lmdb://tiger_img@/path/to/animals_lmdb", "lmdb://lion_img@/path/to/animals_lmdb"]}
{"messages": [{"role": "user", "content": "<audio>What did the audio say?"}, {"role": "assistant", "content": "The weather is really nice today."}], "audios": ["/xxx/x.mp3"]}
{"messages": [{"role": "system", "content": "You are a helpful and harmless assistant."}, {"role": "user", "content": "<image>What is in the image, <video>What is in the video?"}, {"role": "assistant", "content": "The image shows an elephant, and the video shows a puppy running on the grass."}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
```
44 changes: 43 additions & 1 deletion swift/llm/template/vision_utils.py
@@ -4,7 +4,7 @@
import os
import re
from io import BytesIO
from typing import Any, Callable, Dict, List, Optional, TypeVar, Union

import numpy as np
import requests
@@ -13,6 +13,13 @@

from swift.utils import get_env_args

# Try to import lmdb, but don't fail if it's not available
try:
    import lmdb
    LMDB_AVAILABLE = True
except ImportError:
    LMDB_AVAILABLE = False

# >>> internvl
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
@@ -99,6 +106,9 @@ def rescale_image(img: Image.Image, max_pixels: int) -> Image.Image:

_T = TypeVar('_T')

# Cache for LMDB environments and read transactions to avoid reopening
_LMDB_ENV_CACHE: Dict[str, Any] = {}
_LMDB_TXN_CACHE: Dict[str, Any] = {}

def load_file(path: Union[str, bytes, _T]) -> Union[BytesIO, _T]:
    res = path
@@ -111,6 +121,38 @@ def load_file(path: Union[str, bytes, _T]) -> Union[BytesIO, _T]:
                request_kwargs['timeout'] = timeout
            content = requests.get(path, **request_kwargs).content
            res = BytesIO(content)
        elif path.startswith('lmdb://'):
            if not LMDB_AVAILABLE:
                raise ImportError(
                    "LMDB support requires the 'lmdb' package to be installed. "
                    "Please install it with 'pip install lmdb'."
                )
            # Parse the LMDB URI format: lmdb://key@path_to_lmdb
            _, _, lmdb_url = path.partition('lmdb://')
            key, sep, lmdb_dir = lmdb_url.partition('@')

            # Verify format validity with a single check
            if not sep or not key or not lmdb_dir or '@' in lmdb_dir:
                raise ValueError("LMDB path must be in the format lmdb://key@path_to_lmdb (with exactly one '@')")

            # Use a cached environment or create a new one
            env = _LMDB_ENV_CACHE.get(lmdb_dir)
            if env is None:
                env = lmdb.open(lmdb_dir, readonly=True, lock=False, max_readers=1024, max_spare_txns=2)
                _LMDB_ENV_CACHE[lmdb_dir] = env

            # Get or create a long-lived read transaction
            txn = _LMDB_TXN_CACHE.get(lmdb_dir)
            if txn is None:
                txn = env.begin(write=False)
                _LMDB_TXN_CACHE[lmdb_dir] = txn

            # Look up the value through the cached transaction
            encoded_key = key.encode()
            data = txn.get(encoded_key)
            if data is None:
                raise KeyError(f"Key '{key}' not found in LMDB at '{lmdb_dir}'")
            res = BytesIO(data)
        elif os.path.exists(path) or (not path.startswith('data:') and len(path) <= 200):
            path = os.path.abspath(os.path.expanduser(path))
            with open(path, 'rb') as f:
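For reference, a minimal sketch of how a resource addressed this way would be resolved through `load_file`; the key and database path are the placeholder values used in the docs above:

```python
from swift.llm.template.vision_utils import load_file

# For lmdb:// URIs, load_file returns a BytesIO holding the stored value;
# the key and database path here are placeholders, not real data.
buf = load_file('lmdb://rabbit_img@/path/to/animals_lmdb')
image_bytes = buf.read()
```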