Model Loading

X-AnyLabeling currently includes many built-in general-purpose models. For a detailed list, please refer to the Model Zoo.

Loading Built-in Models

Before enabling the AI-assisted labeling feature, you need to load a model. This can be done via the AI icon button in the left sidebar or by using the shortcut Ctrl+A.

Typically, when you select a model from the dropdown list, the application checks if the corresponding model files exist in the user's directory at ~/xanylabeling_data/models/${model_name}. If they exist, the model is loaded directly. Otherwise, the application automatically downloads the files from the network to the specified directory.
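
For reference, a quick way to check whether a given built-in model has already been downloaded is to look for its folder under that directory. A minimal sketch, assuming ${model_name} corresponds to the name field in the model's configuration file:

from pathlib import Path

model_name = "yolov5s-r20230520"  # the `name` field of the model you want to check
model_dir = Path.home() / "xanylabeling_data" / "models" / model_name
print("cached locally" if model_dir.is_dir() else "will be downloaded on first load")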

Note: All built-in models are hosted by default on GitHub Releases. Therefore, you need a stable internet connection with access to GitHub; otherwise, the download might fail.

For users who fail to load models due to network issues, options include downloading the model offline and loading it manually, or modifying the model download source.

Offline Model Download

  • Open the model_zoo.md file, find the entry for the desired model, and download its model file and configuration file.
  • Edit the configuration file, point model_path to the downloaded model file, and adjust other hyperparameters as needed.
  • In the tool's interface, click Load Custom Model and select the path to the configuration file.

Modify Model Download Source

For details, please refer to section 7.7 Model Download Source Configuration in user_guide.md.

Loading Adapted User Custom Models

Adapted Models refer to models that have already been integrated into X-AnyLabeling, requiring no custom inference code from the user. A list of adapted models can be found in the Model Zoo.

In this tutorial, we will use the YOLOv5s model as an example to detail how to load a custom adapted model.

a. Model Conversion

Suppose you have trained a model locally. First, you can optionally convert the trained PyTorch model to X-AnyLabeling's default ONNX file format. Specifically, execute:

python export.py --weights yolov5s.pt --include onnx

Note: The current version does not support dynamic inputs, so do not set the --dynamic parameter.

Additionally, it is highly recommended to import the exported *.onnx file using the Netron online tool to check the input and output node information, ensuring dimensions and other details are as expected.
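
If you prefer checking from the command line instead of (or in addition to) Netron, the same information can be read from an ONNX Runtime session. A minimal sketch, assuming the exported file is named yolov5s.onnx:

import onnxruntime as ort

# Open the exported model and print its input/output specs to confirm
# the names, static shapes, and dtypes are what you expect.
session = ort.InferenceSession("yolov5s.onnx", providers=["CPUExecutionProvider"])
for inp in session.get_inputs():
    print("input :", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)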

(Screenshot: the exported ONNX model inspected in Netron)

b. Model Configuration

Once the onnx file is ready, you can browse the Model Zoo file to find and copy the configuration file for the corresponding model.

Taking yolov5s.yaml as an example, let's look at its content:

type: yolov5
name: yolov5s-r20230520
provider: Ultralytics
display_name: YOLOv5s
model_path: https://github.com/CVHub520/X-AnyLabeling/releases/download/v0.1.0/yolov5s.onnx
iou_threshold: 0.45
conf_threshold: 0.25
classes:
  - person
  - bicycle
  - car
  ...

| Field | Description | Modifiable |
| --- | --- | --- |
| type | Model type identifier, cannot be customized. | |
| name | Index name of the model configuration file, keep default. | |
| provider | Model provider, can be modified based on actual situation. | ✔️ |
| display_name | Name shown in the model dropdown list in the UI, customizable. | ✔️ |
| model_path | Model loading path, supports relative and absolute paths. | ✔️ |
| iou_threshold | IoU threshold for Non-Maximum Suppression (NMS). | ✔️ |
| conf_threshold | Confidence threshold for NMS. | ✔️ |
| classes | List of model labels, must match the training labels. | ✔️ |

Note that not all fields apply to every model. Refer to the definition of the specific model.

For example, looking at the implementation of the YOLO base model, it offers additional optional configuration items:

| Field | Description |
| --- | --- |
| filter_classes | Specify the classes used during inference. |
| agnostic | Use class-agnostic NMS. |

Here's a typical example:

type: yolov5
name: yolov5s-r20230520
provider: Ultralytics
display_name: YOLOv5s
model_path: /path/to/your/custom_yolov5s.onnx # Modified path
iou_threshold: 0.60
conf_threshold: 0.25
agnostic: True
filter_classes:
  - person
  - car
classes:
  - person
  - bicycle
  - car
  - ...

Note that only when using older versions of YOLOv5 (v5.0 and below) do you need to specify the anchors and stride fields in the configuration file; otherwise, omit these fields to avoid inference errors. Example:

type: yolov5
...
stride: 32
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

Tip: For segmentation models, you can specify the epsilon_factor parameter to control the smoothness of the output contour points. The default value is 0.005.
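
As a rough illustration of what such a factor typically controls, contour simplification in OpenCV is usually driven by an epsilon proportional to the contour perimeter. The sketch below shows that common pattern; it is not the exact X-AnyLabeling implementation:

import cv2
import numpy as np

def simplify_contour(contour: np.ndarray, epsilon_factor: float = 0.005) -> np.ndarray:
    # Epsilon is taken as a fraction of the contour perimeter, so a larger
    # factor yields a coarser (smoother) polygon with fewer points.
    epsilon = epsilon_factor * cv2.arcLength(contour, True)
    return cv2.approxPolyDP(contour, epsilon, True)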

c. Model Loading

After understanding the above, modify the model_path field in the configuration file and optionally adjust other hyperparameters as needed.

The software currently supports both relative and absolute paths for model loading. When entering the model path, be mindful of escape characters (for example, on Windows use forward slashes or escaped backslashes).

Finally, in the model dropdown list in the top menu bar of the interface, find the ...Load Custom Model option, and then import the prepared configuration file to complete the custom model loading process.

Loading Unadapted User Custom Models

Unadapted Models refer to models that have not yet been integrated into X-AnyLabeling. Users need to follow the implementation steps below for integration.

Here, we use a multi-class semantic segmentation model, U-Net, as an example. Follow these implementation steps:

a. Training and Exporting Model

Export the ONNX model, ensuring the output node dimension is [1, C, H, W], where C is the total number of classes (including the background class).

Friendly Reminder: Exporting to ONNX is optional. You can choose other model formats like PyTorch, OpenVINO, or TensorRT based on your needs. For an example using Segment-Anything-2 for video object tracking, refer to the Installation Guide, the configuration file definition sam2_hiera_base_video.yaml, and the corresponding implementation segment_anything_2_video.py.
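
For the PyTorch-to-ONNX step, a minimal export sketch is shown below. The network here is only a stand-in (replace it with your trained U-Net); the point is that the exported graph takes a fixed-size image and produces a [1, C, H, W] output, where C includes the background class (e.g., C = 3 for cat, dog, and _background_ as in the configuration defined in the next step):

import torch
import torch.nn as nn

# Stand-in for your trained U-Net: a 1x1 conv mapping 3 input channels to
# C = 3 output channels, so the ONNX output stays [1, C, H, W].
model = nn.Conv2d(3, 3, kernel_size=1)
model.eval()

dummy = torch.randn(1, 3, 512, 512)  # fixed input size; no dynamic axes
torch.onnx.export(
    model,
    dummy,
    "best.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=12,
)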

b. Define Configuration File

First, create a new configuration file, e.g., unet.yaml, in the configuration directory:

type: unet
name: unet-r20250101
display_name: U-Net (ResNet34)
provider: xxx
conf_threshold: 0.5
model_path: /path/to/best.onnx
classes:
  - cat
  - dog
  - _background_

Where:

| Field | Description |
| --- | --- |
| type | Specifies the model type. Ensure it is distinct from existing types to keep the identifier unique. |
| name | Defines the model index for internal reference and management. Avoid conflicts with existing indices. |
| display_name | The model name displayed in the UI for easy identification. Ensure uniqueness. |

These three fields are mandatory. Add other fields as needed, such as provider, model path, hyperparameters, etc.

c. Add Configuration File

Next, add the above configuration file to the model management file:

...

- model_name: "unet-r20250101"
  config_file: ":/unet.yaml"
...

d. Configure UI Components

In this step, you can add UI components as needed: simply add the model_type to the corresponding list in the relevant file, as in the sketch below.
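
A purely hypothetical sketch of what such a list looks like (the actual variable name and file are defined by the project; only the idea of appending your model type matters):

# Hypothetical widget list -- the real name and location come from the project source.
# Each list enumerates the model types that should display a given UI component.
_MODELS_WITH_RUN_BUTTON = [
    "yolov5",
    # ...existing entries...
    "unet",  # newly added model type
]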

e. Define Inference Service

A key step in defining the inference service is inheriting the Model base class, which allows you to implement model-specific forward inference logic.

Specifically, create a new file unet.py in the model inference service directory. Here's an example:

import logging
import os

import cv2
import numpy as np
from PyQt5 import QtCore
from PyQt5.QtCore import QCoreApplication

from anylabeling.app_info import __preferred_device__
from anylabeling.views.labeling.shape import Shape
from anylabeling.views.labeling.utils.opencv import qt_img_to_rgb_cv_img
from .model import Model
from .types import AutoLabelingResult
from .engines.build_onnx_engine import OnnxBaseModel


class UNet(Model):
    """Semantic segmentation model using UNet"""

    class Meta:
        required_config_names = [
            "type",
            "name",
            "display_name",
            "model_path",
            "classes",
        ]
        widgets = ["button_run"]
        output_modes = {
            "polygon": QCoreApplication.translate("Model", "Polygon"),
        }
        default_output_mode = "polygon"

    def __init__(self, model_config, on_message) -> None:
        # Run the parent class's init method
        super().__init__(model_config, on_message)
        model_name = self.config["type"]
        model_abs_path = self.get_model_abs_path(self.config, "model_path")
        if not model_abs_path or not os.path.isfile(model_abs_path):
            raise FileNotFoundError(
                QCoreApplication.translate(
                    "Model",
                    f"Could not download or initialize {model_name} model.",
                )
            )
        self.net = OnnxBaseModel(model_abs_path, __preferred_device__)
        self.classes = self.config["classes"]
        self.input_shape = self.net.get_input_shape()[-2:]

    def preprocess(self, input_image):
        input_h, input_w = self.input_shape
        image = cv2.resize(input_image, (input_w, input_h))
        image = np.transpose(image, (2, 0, 1))
        image = image.astype(np.float32) / 255.0
        image = (image - 0.5) / 0.5
        image = np.expand_dims(image, axis=0)
        return image

    def postprocess(self, image, outputs):
        n, c, h, w = outputs.shape
        image_height, image_width = image.shape[:2]
        # Obtain the category index of each pixel
        # target shape: (1, h, w)
        outputs = np.argmax(outputs, axis=1)
        results = []
        for i in range(c):
            # Skip the background label
            if self.classes[i] == '_background_':
                continue
            # Get the category index of each pixel for the first batch by adding [0].
            mask = outputs[0] == i
            # Rescaled to original shape
            mask_resized = cv2.resize(mask.astype(np.uint8), (image_width, image_height))
            # Get the contours
            contours, _ = cv2.findContours(mask_resized, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
            # Append the contours along with their respective class labels
            results.append((self.classes[i], [np.squeeze(contour).tolist() for contour in contours]))
        return results

    def predict_shapes(self, image, image_path=None):
        if image is None:
            return []

        try:
            image = qt_img_to_rgb_cv_img(image, image_path)
        except Exception as e:  # noqa
            logging.warning("Could not inference model")
            logging.warning(e)
            return []

        blob = self.preprocess(image)
        outputs = self.net.get_ort_inference(blob)
        results = self.postprocess(image, outputs)
        shapes = []
        for item in results:
            label, contours = item
            for points in contours:
                # Make sure the polygon is closed by repeating the first point
                points.append(points[0])
                shape = Shape(flags={})
                for point in points:
                    shape.add_point(QtCore.QPointF(point[0], point[1]))
                shape.shape_type = "polygon"
                shape.closed = True
                shape.fill_color = "#000000"
                shape.line_color = "#000000"
                shape.label = label
                shape.selected = False
                shapes.append(shape)

        result = AutoLabelingResult(shapes, replace=True)
        return result

    def unload(self):
        del self.net

Here:

  • In the Meta class:
    • required_config_names: Specifies mandatory fields in the model config file for proper initialization.
    • widgets: Specifies controls (buttons, dropdowns, etc.) to display for this service. See this file for definitions.
    • output_modes: Specifies the output shape types supported (e.g., polygon, rectangle, rotated box).
    • default_output_mode: Specifies the default output shape type.
  • predict_shapes and unload are abstract methods that must be implemented. They define the inference process and resource release logic, respectively.

f. Add to Model Management

After the above steps, open the model configuration file. Add the corresponding model type field (e.g., unet) to the _CUSTOM_MODELS list and, if necessary, add the model name to relevant configuration sections.
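
The registration itself is just one more entry in that list; a minimal sketch (the surrounding contents of the file will differ):

# In the model configuration module:
_CUSTOM_MODELS = [
    "yolov5",
    # ...existing entries...
    "unet",  # must match the `type` field in unet.yaml
]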

Tip: If you don't know how to implement specific widgets, use the search panel, enter relevant keywords, and examine the implementation logic of available widgets.

Finally, go to the Model Manager class file. In the _load_model method, initialize your instance as follows:

...

class ModelManager(QObject):
    """Model manager"""

    def __init__(self):
        ...
    ...
    def _load_model(self, model_id):
        """Load and return model info"""
        if self.loaded_model_config is not None:
            self.loaded_model_config["model"].unload()
            self.loaded_model_config = None
            self.auto_segmentation_model_unselected.emit()

        model_config = copy.deepcopy(self.model_configs[model_id])
        if model_config["type"] == "yolov5":
            ...
        elif model_config["type"] == "unet":
            from .unet import UNet

            try:
                model_config["model"] = UNet(
                    model_config, on_message=self.new_model_status.emit
                )
                self.auto_segmentation_model_unselected.emit()
            except Exception as e:  # noqa
                self.new_model_status.emit(
                    self.tr(
                        "Error in loading model: {error_message}".format(
                            error_message=str(e)
                        )
                    )
                )
                print(
                    "Error in loading model: {error_message}".format(
                        error_message=str(e)
                    )
                )
                return
          ...
    ...

⚠️Note:

  • The model type field must match the type field defined in the configuration file (Step b. Define Configuration File).
  • If the model is based on SAM (Segment Anything Model) interaction patterns, replace self.auto_segmentation_model_unselected.emit() with self.auto_segmentation_model_selected.emit() to trigger the corresponding functionality.

Model Export

This section provides specific examples of converting custom models to the ONNX format for quick integration into X-AnyLabeling.

Classification

InternImage introduces a large-scale Convolutional Neural Network (CNN) model utilizing deformable convolutions as core operators. This achieves large effective receptive fields, adaptive spatial aggregation, and reduced inductive bias, enabling the learning of stronger, more robust patterns from extensive data. It surpasses current CNNs and Vision Transformers on benchmarks.

| Attribute | Value |
| --- | --- |
| Paper Title | InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions |
| Publishing Units | Shanghai AI Laboratory, Tsinghua University, Nanjing University, etc. |
| Publication Date | CVPR'23 |

Please refer to this tutorial.

This tutorial provides users with a method to quickly build lightweight, high-precision, and practical person attribute classification models using PaddleClas PULC (Practical Ultra-Lightweight image Classification). The model can be widely used in pedestrian analysis, tracking scenarios, etc.

| Attribute | Value |
| --- | --- |
| Publishing Units | PaddlePaddle Team (Baidu) |

Please refer to this tutorial.

This tutorial provides users with a method to quickly build lightweight, high-precision, and practical vehicle attribute classification models using PaddleClas PULC. The model is suitable for vehicle recognition, road monitoring, etc.

| Attribute | Value |
| --- | --- |
| Publishing Units | PaddlePaddle Team (Baidu) |

Please refer to this tutorial.

Object Detection

RF-DETR is the first real-time model to exceed 60 AP on the Microsoft COCO benchmark while maintaining competitive performance at base sizes. It also achieves state-of-the-art performance on RF100-VL, an object detection benchmark that measures domain adaptability to real-world problems. RF-DETR is comparable in speed to current real-time object detection models.

Organization: Roboflow

Please refer to this tutorial.

Author: Kaixuan Hu

Please refer to this tutorial.

| Attribute | Value |
| --- | --- |
| Paper Title | YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors |
| Publishing Units | Institute of Information Science, Academia Sinica, Taiwan |

python export.py --weights yolov7.pt --img-size 640 --grid

Note: The --grid parameter must be included when running this command.

| Attribute | Value |
| --- | --- |
| Paper Title | Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism |
| Publishing Units | Huawei Noah's Ark Lab |
| Publication Date | NeurIPS'23 |

# Clone the repository first
git clone https://github.com/huawei-noah/Efficient-Computing.git
cd Efficient-Computing/Detection/Gold-YOLO
# Run export for desired model weight
python deploy/ONNX/export_onnx.py --weights Gold_n_dist.pt --simplify --ort
# Or other weights: Gold_s_pre_dist.pt, Gold_m_pre_dist.pt, Gold_l_pre_dist.pt

DAMO-YOLO is a fast and accurate object detection method developed by the TinyML team at Alibaba DAMO Academy's Data Analytics and Intelligence Lab. It achieves state-of-the-art performance by incorporating new techniques, including a Neural Architecture Search (NAS) backbone, an efficient re-parameterized Generalized-FPN (RepGFPN), a lightweight head, AlignedOTA label assignment, and distillation enhancement.

| Attribute | Value |
| --- | --- |
| Paper Title | DAMO-YOLO: A Report on Real-Time Object Detection |
| Publishing Units | Alibaba Group |
| Publication Date | Arxiv'22 |

# Clone the repository first
git clone https://github.com/tinyvision/DAMO-YOLO.git
cd DAMO-YOLO
# Run converter for a specific config and checkpoint
python tools/converter.py -f configs/damoyolo_tinynasL25_S.py -c damoyolo_tinynasL25_S.pth --batch_size 1 --img_size 640

Real-Time Detection Transformer (RT-DETR) is the first known real-time end-to-end object detector. RT-DETR-L achieves 53.0% AP on COCO val2017 at 114 FPS on a T4 GPU, while RT-DETR-X achieves 54.8% AP at 74 FPS, surpassing all YOLO detectors of the same scale in speed and accuracy. RT-DETR-R50 achieves 53.1% AP at 108 FPS, outperforming DINO-Deformable-DETR-R50 by 2.2% AP with about 21x faster FPS.

| Attribute | Value |
| --- | --- |
| Paper Title | RT-DETR: DETRs Beat YOLOs on Real-time Object Detection |
| Publishing Units | Baidu Inc. |
| Publication Date | Arxiv'23 (CVPR'24) |

Please refer to external tutorials or the official repository for ONNX export instructions, as direct commands might vary. Example article (Chinese): https://zhuanlan.zhihu.com/p/628660998.

Hyper-YOLO is a novel object detection method that integrates hypergraph computation to capture complex high-order associations between visual features. It introduces a Hypergraph Computation-enhanced Semantic Collection and Scattering (HGC-SCS) framework, transforming visual feature maps into semantic space and constructing hypergraphs for high-order information propagation.

| Attribute | Value |
| --- | --- |
| Paper Title | Hyper-YOLO: When Visual Object Detection Meets Hypergraph Computation |
| Publishing Units | Tsinghua University, Xi'an Jiaotong University |
| Publication Date | TPAMI'25 (preprint available) |

Download the model, install dependencies, then modify the Hyper-YOLO/ultralytics/export.py file (or a similar export script within that project), setting batch=1 and half=False:

# Example modification within Hyper-YOLO/ultralytics/export.py (or a similar export script)
from pathlib import Path
from ultralytics import YOLO

if __name__ == '__main__':
    model_path = 'hyper-yolon-seg.pt' # Or your specific model weight file
    if isinstance(model_path, (str, Path)):
        model = YOLO(model_path)

    # Ensure export arguments are set correctly
    output_filename = model.export(
        imgsz=640,
        batch=1,         # Set batch size to 1
        format='onnx',   # Specify ONNX format
        int8=False,
        half=False,      # Set half to False
        device="0",      # Or "cpu"
        verbose=False
    )
    print(f"Model exported to {output_filename}")

Then run the export script (adjust path as needed):

python3 Hyper-YOLO/ultralytics/export.py

Segment Anything

The Segment Anything Model (SAM) generates high-quality object masks from input prompts like points or boxes. It can produce masks for all objects in an image and was trained on a dataset of 11 million images and 1.1 billion masks. SAM demonstrates strong zero-shot performance on various segmentation tasks.

| Attribute | Value |
| --- | --- |
| Paper Title | Segment Anything |
| Publishing Units | Meta AI Research, FAIR |
| Publication Date | ICCV'23 |

For ONNX export, refer to community exporters like https://github.com/vietanhdev/samexporter#sam-exporter or the official repository for potential tools.

EfficientViT (the backbone behind EfficientViT-SAM) is a family of vision models designed for efficient high-resolution dense prediction. It uses a novel lightweight multi-scale linear attention module as its core building block, achieving global receptive fields and multi-scale learning with hardware-efficient operations. EfficientViT-SAM adapts this backbone for promptable segmentation.

| Attribute | Value |
| --- | --- |
| Paper Title | EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction |
| Publishing Units | MIT HAN Lab |
| Publication Date | ICCV'23 |

For ONNX export, refer to the EfficientViT-SAM export tooling in the MIT HAN Lab EfficientViT project (the CVHub520/efficientvit repository referenced by X-AnyLabeling appears to be a fork of it). Note that the EfficientViT hosted at https://github.com/microsoft/Cream/tree/main/EfficientViT is a different model that shares the name and is not the one used here.

SAM-Med2D is a specialized model developed to address the challenges of applying state-of-the-art image segmentation techniques to medical images.

| Attribute | Value |
| --- | --- |
| Paper Title | SAM-Med2D |
| Publishing Units | OpenGVLab |
| Publication Date | Arxiv'23 |

Refer to the deployment instructions in the official repository: https://github.com/OpenGVLab/SAM-Med2D#%EF%B8%8F-deploy (the CVHub520/SAM-Med2D repository referenced by X-AnyLabeling appears to be a fork of it).

HQ-SAM is an enhanced version of the Segment Anything Model (SAM) designed to improve mask prediction quality, especially for complex structures, while maintaining SAM's efficiency and zero-shot capabilities. It achieves this through an improved decoding process and additional training on a specialized dataset.

| Attribute | Value |
| --- | --- |
| Paper Title | Segment Anything in High Quality |
| Publishing Units | ETH Zurich, HKUST |
| Publication Date | NeurIPS'23 |

Refer to the official HQ-SAM repository or potentially forks like https://github.com/CVHub520/sam-hq for ONNX export tutorials or scripts.

EdgeSAM is an accelerated variant of the Segment Anything Model (SAM), optimized for efficient execution on edge devices with minimal performance compromise. It claims significant speedups over the original SAM and MobileSAM on edge hardware.

| Attribute | Value |
| --- | --- |
| Paper Title | EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM |
| Publishing Units | S-Lab, Nanyang Technological University; Shanghai AI Laboratory |
| Publication Date | Arxiv'23 |

Refer to the official repository's export script: https://github.com/chongzhou96/EdgeSAM/blob/main/scripts/export_onnx_model.py.

Grounding

Grounding DINO is a state-of-the-art (SOTA) zero-shot object detection model excelling at detecting objects not defined during training. Its ability to adapt to new objects and scenes makes it highly versatile for real-world applications. It performs well in Referring Expression Comprehension (REC), identifying and locating specific objects or regions in images based on text descriptions. Grounding DINO simplifies object detection by eliminating hand-designed components like Non-Maximum Suppression (NMS).

| Attribute | Value |
| --- | --- |
| Paper Title | Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection |
| Publishing Units | IDEA-CVR, IDEA-Research |
| Publication Date | Arxiv'23 |

Please refer to this tutorial.

YOLO-World enhances the YOLO series by incorporating vision-language modeling, enabling efficient open-vocabulary object detection that excels in various tasks.

| Attribute | Value |
| --- | --- |
| Paper Title | YOLO-World: Real-Time Open-Vocabulary Object Detection |
| Publishing Units | Tencent AI Lab, ARC Lab, Tencent PCG, Huazhong University of Science and Technology |
| Publication Date | Arxiv'24 |

# Ensure ultralytics package is installed and updated
# pip install -U ultralytics
# Clone the ultralytics repo if needed for specific export scripts, otherwise use the pip package
# git clone https://github.com/ultralytics/ultralytics.git
# cd ultralytics
# Use the yolo command line interface
yolo export model=yolov8s-worldv2.pt format=onnx opset=13 simplify

GeCo is a unified architecture for few-shot counting, achieving high-precision object detection, segmentation, and counting through novel dense queries and a counting loss.

| Attribute | Value |
| --- | --- |
| Paper Title | GeCo: Query-Based Anchors for Fine-Grained Multi-Object Counting, Detection, and Segmentation |
| Publishing Units | University of Ljubljana |
| Publication Date | NeurIPS'24 |

Please refer to this tutorial.

Image Tagging

RAM (Recognize Anything Model) is a robust image tagging model known for its exceptional image recognition capabilities. RAM excels in zero-shot generalization, is cost-effective, reproducible, and relies on open-source, annotation-free datasets. Its flexibility makes it suitable for a wide range of applications.

| Attribute | Value |
| --- | --- |
| Paper Title | Recognize Anything: A Strong Image Tagging Model |
| Publishing Units | OPPO Research Institute, IDEA-Research, AI Robotics |
| Publication Date | Arxiv'23 |

Please refer to this tutorial. (RAM is released in the recognize-anything repository, which also hosts the related Tag2Text model.)