This repository was archived by the owner on Oct 9, 2024. It is now read-only.

Add configs to run int4 inference #37

Open · wants to merge 5 commits into base: main
19 changes: 16 additions & 3 deletions bloom-inference-scripts/bloom-ds-inference.py
@@ -44,7 +44,7 @@
parser = ArgumentParser()

parser.add_argument("--name", required=True, type=str, help="model_name")
parser.add_argument("--dtype", type=str, help="float16 or int8", choices=["int8", "float16"], default="float16")
parser.add_argument("--dtype", type=str, help="float16 or int8 or int4", choices=["int8", "float16", "int4"], default="float16")
parser.add_argument("--local_rank", required=False, type=int, help="used by dist launchers")
parser.add_argument("--batch_size", default=1, type=int, help="batch size")
parser.add_argument("--benchmark", action="store_true", help="additionally run benchmark")
@@ -100,7 +100,7 @@ def get_checkpoint_files(model_name_or_path):


model_name = args.name
-infer_dtype = args.dtype
+infer_dtype = args.dtype if args.dtype != 'int4' else 'int8'
stas00 (Contributor) commented on Nov 18, 2022:
would it make for a more user-friendly API to

  1. keep the dtype intact
  2. drop quantization_bits
  3. let deepspeed.init_inference derive the number of bits from dtype?

Not only is the currently suggested override confusing, I also fail to see what purpose is served by carrying the same information twice, in dtype and in quantization_bits.
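
For concreteness, a minimal sketch of that suggestion (the helper and the mapping below are illustrative only, not an existing DeepSpeed API):

```python
from typing import Optional

# Illustrative mapping only: torch has no int4 dtype, so "int4"/"int8" here are the
# user-facing strings from --dtype, not torch dtypes.
DTYPE_TO_QUANT_BITS = {"int4": 4, "int8": 8}

def quant_bits_from_dtype(dtype: str) -> Optional[int]:
    """Derive the quantization bit width from the single --dtype argument,
    so callers never pass a separate quantization_bits value."""
    return DTYPE_TO_QUANT_BITS.get(dtype)  # None -> no weight quantization

# e.g. deepspeed.init_inference could (hypothetically) do this internally:
#   bits = quant_bits_from_dtype(args.dtype)   # "int4" -> 4, "float16" -> None
```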

Contributor:

oh, wait, torch.int4 still doesn't exist, does it?

let's find the feature request.

Contributor:

still not implemented: pytorch/pytorch#74627

so that's why you had to do the odd workarounds, right?
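
In other words, the workaround keeps "int4" purely as a user-facing choice; a condensed view of the pattern in this PR's diff, with explanatory comments added:

```python
# torch.int4 does not exist (pytorch/pytorch#74627), so tensors stay on the int8 path;
# asking for "int4" only changes how many bits the weight quantizer uses.
infer_dtype = args.dtype if args.dtype != "int4" else "int8"   # dtype seen by torch/DeepSpeed
bits = {"int8": 8, "int4": 4}.get(args.dtype)                  # quantization bit width (None for fp16)
```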

Collaborator:

I guess we can drop it once it's implemented, @stas00?
For now, this might be the best way to do it.

Contributor:

it's pointless to wait, since they won't have int3 and int12

awan-10:

> would it make for a more user-friendly API to keep the dtype intact, drop quantization_bits, and let deepspeed.init_inference derive the number of bits from dtype?

@stas00 and @RezaYazdaniAminabadi - just clarifying that we have introduced a new DeepSpeedInferenceConfig that can be passed to init_inference. We are keeping it backwards compatible, but if we are okay with making changes to this file, I would advocate for writing a config dictionary for DeepSpeed and passing that to init_inference instead of the various kwargs. Please see here for an example: https://gist.github.com/awan-10/6e3d5c756be3a876522e860c6bbf702d#file-bloom-ds-inference-py-L173

Also, see the docs for the new config: https://deepspeed.readthedocs.io/en/latest/inference-init.html
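
A rough sketch of that config-dictionary style, using only keys that already appear in this PR's diff (see the gist and docs above for the full set of fields; `model` and `world_size` are assumed to be set up earlier in the script, as in bloom-ds-inference.py):

```python
import torch
import deepspeed

# Sketch only: num_bits would be 4 for --dtype int4 and 8 for --dtype int8.
ds_config = {
    "replace_with_kernel_inject": True,
    "quant": {
        "enabled": True,
        "weight": {"num_bits": 4},
    },
}

model = deepspeed.init_inference(
    model,
    config=ds_config,
    mp_size=world_size,
    dtype=torch.int8,   # int4 weights still ride on the int8 dtype path
)
```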

stas00 (Contributor):

That definitely works.

@awan-10, may I suggest you make the inference config accept dict_or_path just like ZeRO does? It might be easier for some users to write out a separate file.
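
A tiny sketch of what dict_or_path handling usually looks like (illustrative only — this is the behaviour being requested here, not something init_inference does at the time of this thread):

```python
import json

def load_inference_config(dict_or_path):
    """Accept either an in-memory config dict or a path to a JSON file,
    the way ZeRO configs can be given as a file."""
    if isinstance(dict_or_path, dict):
        return dict_or_path
    with open(dict_or_path) as f:
        return json.load(f)

# Both of these would then be accepted:
#   cfg = load_inference_config({"quant": {"enabled": True}})
#   cfg = load_inference_config("ds_inference_config.json")
```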

awan-10:

@stas00 - thanks for the suggestion. Created an issue so we can track it: deepspeedai/DeepSpeed#2532. Mike and I will work on it.

Contributor:

Thank you very much, @awan-10


tp_presharded_mode = True if model_name in tp_presharded_models else False

@@ -171,7 +171,19 @@ def write_checkponts_json():
deepspeed.runtime.utils.see_memory_usage("pre-ds-inference-init", force=True)

if kernel_inject:
-    kwargs = dict(replace_with_kernel_inject=True)
+    if args.dtype == 'int8':
+        bits = 8
+    if args.dtype == 'int4':
+        bits = 4
+    ds_config = {
+        "replace_with_kernel_inject": True,
+        "quant": {
+            "enabled": True,
+            "weight": {
+                "num_bits": bits
+            }
+        }
+    }
else:
    kwargs = dict(injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")})

@@ -188,6 +200,7 @@ def write_checkponts_json():
# checkpoints_json=None
model = deepspeed.init_inference(
    model,
+    config=ds_config,
    mp_size=world_size,
    base_dir=repo_root,
    dtype=getattr(torch, infer_dtype),
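
With this change, launching the script for 4-bit inference would presumably mirror the existing int8 invocation, e.g. (GPU count and model name are illustrative):

```bash
deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py \
    --name bigscience/bloom --dtype int4 --batch_size 1 --benchmark
```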