Add PEFT benchmarking script in thunder/benchmarks #1978

Open · wprazuch wants to merge 3 commits into main

Conversation

@wprazuch (Contributor) commented Apr 22, 2025

Before submitting
  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements) -> Discussed with @IvanYashchuk
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests? -> To verify what kind of tests to add

What does this PR do?

It introduces a benchmarking script for the PEFT fine-tuning scenario, which supports:

  • compiler setup (inductor/thunder/eager)
  • single/multi-GPU setup (fsdp2)

To execute:

python thunder/benchmarks/benchmark_peft.py \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --devices 1 \
    --trust-remote-code \
    --attn-implementation sdpa \
    --max-steps 10 \
    --mbs 1 \
    --seq-length 4096 \
    --jit-backend thunder

For multi-GPU:

torchrun --nproc_per_node=8 --master_port=12345 thunder/benchmarks/benchmark_peft.py \
    --model meta-llama/CodeLlama-34b-Instruct-hf \
    --strategy fsdp2 \
    --devices 8 \
    --mbs 1 \
    --seq-length 1024 \
    --max-steps 10 \
    --jit-backend eager \
    --attn-implementation sdpa \
    --trust-remote-code

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

# Configure model for static shapes before FSDP2
if hasattr(model, "config"):
model.config.use_cache = True
model.config.max_position_embeddings = args.seq_length
Collaborator:

I think we can move setting the model to static shapes into a function, as I see this in multiple different places.
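
A minimal sketch of such a helper (the function name is hypothetical, not from the PR):

def configure_static_shapes(model: torch.nn.Module, seq_length: int) -> torch.nn.Module:
    # Hypothetical helper collecting the repeated static-shape configuration.
    if hasattr(model, "config"):
        model.config.use_cache = True
        model.config.max_position_embeddings = seq_length
    return model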

return model


def setup_fsdp2(model: torch.nn.Module, devices: int, verbose: bool = False) -> torch.nn.Module:
Collaborator:

cc: @crcrpar for review.

Collaborator:

ack

dynamo_config.cache_size_limit = 64
# Disable gradient checkpointing for Thunder
if hasattr(model, "gradient_checkpointing_enable"):
model.gradient_checkpointing_disable()
Collaborator:

What happens if this is not called?

"""Parse command line arguments."""
parser = argparse.ArgumentParser()
parser.add_argument("--model", default="meta-llama/Llama-3.2-1B")
parser.add_argument("--strategy", type=str, default="auto", choices=["auto", "ddp", "fsdp2"])
Collaborator:

I think ddp is not supported with this script.
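
If that is the case, one option (a sketch, not part of the review itself) is to drop ddp from the accepted choices so the script fails early with a clear argparse error:

parser.add_argument("--strategy", type=str, default="auto", choices=["auto", "fsdp2"])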

)
logger.info(f"Base model loaded on meta device")

# Configure model for static shapes
Collaborator:

What happens if this is not done?

logger.info(f"Configured model for static shapes with sequence length: {args.seq_length}")

# Materialize the model on CUDA
model = model.to_empty(device=f"cuda:{LOCAL_RANK}")
Collaborator:

In the case of FSDP, I think materialization should happen after the setup_fsdp2 step. Otherwise, we will get an OOM for a model that would have worked with FSDP.

Contributor Author:

Yes, I think that is correct. However, when materializing it after setup_fsdp2, I see about a 30% slowdown in throughput compared to the current ordering. Is that "normal"?
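
For reference, a minimal sketch of the shard-then-materialize ordering discussed here (build_model is a hypothetical constructor; LOCAL_RANK is taken from the script):

import torch
from torch.distributed._composable.fsdp import fully_shard

# Construct on the meta device so no full-size weights are ever allocated.
with torch.device("meta"):
    model = build_model()  # hypothetical

# Shard first, so each rank only ever owns its parameter shards ...
for module in model.modules():
    if isinstance(module, (torch.nn.Linear, torch.nn.Embedding)):
        fully_shard(module)
fully_shard(model)

# ... then materialize only the local shards on this rank's GPU.
model.to_empty(device=f"cuda:{LOCAL_RANK}")
# Parameters still need to be (re)initialized or loaded afterwards.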

if "lora" in name.lower():
if not param.requires_grad:
if args.verbose:
logger.warning(f"LoRA parameter {name} does not require grad!")
Collaborator:

Should these be asserts instead? I think if requires_grad was set up wrong, we shouldn't proceed with collecting the numbers.
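
A sketch of the stricter variant, assuming we do want to fail fast rather than only log:

if "lora" in name.lower():
    assert param.requires_grad, f"LoRA parameter {name} does not require grad!"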

Collaborator:

How about using one of the existing requirements files, maybe https://github.com/Lightning-AI/lightning-thunder/blob/main/requirements/devel.txt?

@@ -0,0 +1,724 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Collaborator @crcrpar commented Apr 24, 2025:

I'm not quite sure about the license (also, shouldn't the year be 2025?).

from torch.distributed import DeviceMesh, init_process_group
from torch.distributed._composable.fsdp import fully_shard
from torch.nn.attention import SDPBackend, sdpa_kernel
from tqdm import tqdm
Collaborator:

I'm not sure if we have tqdm included in any of the requirements at the moment. Would transformers or some other package install it as a dependency?

import random
import time
from contextlib import contextmanager
from distutils.version import LooseVersion
Collaborator:

nit-picking

Suggested change
from distutils.version import LooseVersion
from looseversion import LooseVersion

as we do in

from looseversion import LooseVersion

return args


def get_tokenizer(model_name: str, trust_remote_code: bool, fallback_model: str = "gpt2") -> Any:
Collaborator:

Can we refine the type annotation of the return value and then remove from typing import Any?
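
For example, a sketch using the tokenizer base class (an assumption about what get_tokenizer actually returns):

from transformers import PreTrainedTokenizerBase

def get_tokenizer(model_name: str, trust_remote_code: bool, fallback_model: str = "gpt2") -> PreTrainedTokenizerBase:
    ...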

import time
from contextlib import contextmanager
from distutils.version import LooseVersion
from typing import Any, List, Optional
Collaborator:

Suggested change
from typing import Any, List, Optional

Any has one use but I guess we can do away with it

Comment on lines +23 to +34
import numpy as np
import torch
import torch.nn.functional as F
import transformers
from datasets import Dataset
from loguru import logger
from peft import LoraConfig, get_peft_model
from torch.distributed import DeviceMesh, init_process_group
from torch.distributed._composable.fsdp import fully_shard
from torch.nn.attention import SDPBackend, sdpa_kernel
from tqdm import tqdm
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
Collaborator:

Could you clean up these imports? At a glance, I'm not quite convinced by the imports of numpy and transformers.
As for the transformers import, we already import three names from it, so I think it'd be a bit cleaner to avoid import transformers.

if isinstance(module, (torch.nn.Linear, torch.nn.Embedding)):
    if verbose:
        logger.info(f"Wrapping layer {name} with FSDP2")
    fully_shard(module, mesh=mesh)
Collaborator:

Just for consistency with the code below:

Suggested change
fully_shard(module, mesh=mesh)
fully_shard(module, mesh=mesh, reshard_after_forward=True)

logger.info(f"Set static cache size to sequence length: {args.seq_length}")

executors = thunder.get_default_executors()
xforms: list = [NvtxProfileTransform()]
Collaborator:

Why do we always use this one? I guess there's some overhead with it.
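
One option would be to attach the transform only on request, e.g. behind a hypothetical --nvtx flag (a sketch, not in the current script):

xforms: list = []
if args.nvtx:  # hypothetical CLI flag
    xforms.append(NvtxProfileTransform())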

Comment on lines +244 to +257
if backend == "torchjit":
logger.info("Compiling model with torch.compile")
dist_print("Resetting cache size for torch.compile")
import torch._dynamo.config as dynamo_config

# Fixes recompilation issues with inductor
dynamo_config.cache_size_limit = 64
model = torch.compile(model)
elif backend == "thunder":
import thunder
import thunder.dynamo
import torch._dynamo.config as dynamo_config
from thunder.dev_utils.nvtx_profile_transform import NvtxProfileTransform
from thunder.executors.transformer_engineex import transformer_engine_ex
Collaborator:

I might be missing something non-negligible, but could you have all the imports at the beginning of the file?

Contributor Author @wprazuch:

@kshitij12345 @crcrpar Thanks a lot for the review and for the many valid points above, which escaped my attention. I wanted to let you know that I will be OOTO starting tomorrow, and I am not sure if anyone from my team will take care of this PR during my absence. I will implement the fixes for the above points with the highest priority once I come back.
