
Continuously optimize AutoScheme RAM consumption#1703

Open
lvliang-intel wants to merge 2 commits into main from lvl/autoscheme_ram_opt

Conversation

@lvliang-intel
Contributor

Description

Continuously optimize AutoScheme RAM consumption. This PR applies the following optimizations:

1. Meta-skeleton loading
2. Selective non-block materialization
3. Block-wise offloading and reload
4. One-block-at-a-time execution
5. Release-before-reload around AutoScheme

In short: keep the model structure, not the weights; materialize only the small always-needed parts; stream blocks from the checkpoint on demand.
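The block-streaming idea behind items 3 and 4 can be sketched in isolation. This is not the PR's actual code: `weight_map` mirrors the tensor-name-to-shard-file mapping of a safetensors index, while `load_tensor` and `process_block` are hypothetical callbacks standing in for real checkpoint I/O.

```python
# Illustrative sketch: group a checkpoint's tensor names by transformer
# block, then materialize, process, and release one block at a time, so
# peak RAM is bounded by a single block rather than the whole model.

def iter_blocks(weight_map, num_blocks, prefix="model.layers."):
    """Yield (block_index, tensor_names) so each block can be loaded alone."""
    for i in range(num_blocks):
        block_prefix = f"{prefix}{i}."
        yield i, [name for name in weight_map if name.startswith(block_prefix)]


def run_one_block_at_a_time(weight_map, num_blocks, load_tensor, process_block):
    """Stream blocks from the checkpoint; track how many tensors are live at once."""
    peak_live_tensors = 0
    for i, tensor_names in iter_blocks(weight_map, num_blocks):
        block = {name: load_tensor(name) for name in tensor_names}  # reload on demand
        peak_live_tensors = max(peak_live_tensors, len(block))
        process_block(i, block)
        block.clear()  # release before the next block is materialized
    return peak_live_tensors
```

With this shape, only one block's tensors are ever resident, which is what keeps the peak well below a full in-memory model.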

Test results with Llama-3.1-8B (compare_low_cpu_mem_usage.py):

=== Summary ===

case                       exit  peak_ram_gb  peak_vram_gb  wall_time_s  quant_tune_s
disable_low_cpu_mem_usage  0     29.51        14.37         1417.22      1154.88
default_low_cpu_mem_usage  0     12.02        14.67         2057.75      1842.60

=== Delta (disable - default) ===
peak_ram_gb:  17.49
peak_vram_gb: -0.30
wall_time_s:  -640.53
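The comparison script itself is not shown in this PR. A minimal harness for producing such a table might look like the sketch below; it uses stdlib `tracemalloc` (Python-heap peak) and `time`, whereas a real harness would more likely sample process RSS (e.g. with psutil) to match the peak_ram_gb column.

```python
# Hypothetical measurement helper, not the PR's compare_low_cpu_mem_usage.py.
import time
import tracemalloc


def measure(fn):
    """Run fn(), returning (result, peak_heap_gb, wall_time_s)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn()
    wall_time_s = time.perf_counter() - t0
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak_bytes / 2**30, wall_time_s


# Toy comparison: holding four 8 MiB buffers at once vs. one at a time.
_, peak_all, _ = measure(lambda: [bytearray(8 * 2**20) for _ in range(4)])
_, peak_one, _ = measure(lambda: max(len(bytearray(8 * 2**20)) for _ in range(4)))
```

The "one at a time" variant frees each buffer before allocating the next, mirroring the trade-off in the table: lower peak memory at the cost of extra reload time.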

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Copilot AI review requested due to automatic review settings April 17, 2026 05:43
Contributor

Copilot AI left a comment


Pull request overview

Reduces peak CPU RAM during AutoScheme by avoiding holding full model weights in memory and instead using a meta-device “skeleton” plus on-demand block reload.

Changes:

  • Adds meta-skeleton loading and selective non-block layer materialization utilities in the offload module.
  • Updates AutoScheme (DeltaLoss) flow to support meta-skeleton models and more aggressive per-block memory release.
  • Updates compressor AutoScheme generation to optionally release and reload the model around scheme generation.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Changed files:

  • auto_round/utils/offload.py: introduces meta-skeleton + non-block materialization helpers and refactors checkpoint-loading helpers used by OffloadManager.
  • auto_round/auto_scheme/delta_loss.py: loads the meta skeleton when low CPU+GPU mem usage is enabled; materializes non-block layers; frees stored block inputs earlier.
  • auto_round/compressors/base.py: releases the model before AutoScheme and reloads it afterward to reduce peak RAM in combined low_cpu_mem_usage + low_gpu_mem_usage mode.
  • test/test_cpu/schemes/test_auto_scheme_low_cpu_mem.py: adjusts the test to use the renamed/private block reload helper.

Comment on lines +644 to +646
if _model_path is not None and os.path.isdir(_model_path):
    _need_reload = True


Copilot AI Apr 17, 2026


The low-CPU-memory reload path is gated on os.path.isdir(_model_path). For many Hugging Face loads, config._name_or_path is a repo id (not a directory), so this optimization will silently not activate. Consider resolving repo ids to a local snapshot directory (e.g., via huggingface_hub.snapshot_download(local_files_only=True) or an existing helper) rather than requiring _name_or_path to already be a local dir.

Suggested change
- if _model_path is not None and os.path.isdir(_model_path):
-     _need_reload = True
+ if isinstance(_model_path, str) and _model_path:
+     if os.path.isdir(_model_path):
+         _need_reload = True
+     else:
+         try:
+             from huggingface_hub import snapshot_download
+
+             _resolved_model_path = snapshot_download(_model_path, local_files_only=True)
+             if os.path.isdir(_resolved_model_path):
+                 _model_path = _resolved_model_path
+                 _need_reload = True
+         except Exception:
+             pass

Comment on lines 671 to +690
layer_config = self.scheme_generator.get_layer_config()

if _need_reload:
    logger.info("Reloading model after AutoScheme")
    self.model, self.tokenizer = llm_load_model(
        _model_path,
        device="cpu",
        trust_remote_code=self.trust_remote_code,
    )
    self.model = self.model.eval()
    check_and_mark_quantized_module(self.model)
    # Re-apply module structure updates that quantize() applied before AutoScheme
    formats = self.formats if hasattr(self, "formats") else None
    if not self.diffusion and formats is not None:
        self.model = update_module(
            self.model, formats=formats, trust_remote_code=self.trust_remote_code, cleanup_original=False
        )
    for n, m in self.model.named_modules():
        m.global_name = n
    self.shared_cache_keys = get_shared_keys(self.model)

Copilot AI Apr 17, 2026


When _need_reload is true, self.model is deleted/set to None before calling get_layer_config(), and reloaded afterward. If get_layer_config() raises (OOM, dataset error, etc.), the reload block is skipped and the compressor instance is left in a broken state (self.model is None). Wrap the AutoScheme call in a try/finally so the model is reliably restored (or at least the object ends up in a consistent state) even on exceptions.

Suggested change
- layer_config = self.scheme_generator.get_layer_config()
- if _need_reload:
-     logger.info("Reloading model after AutoScheme")
-     self.model, self.tokenizer = llm_load_model(
-         _model_path,
-         device="cpu",
-         trust_remote_code=self.trust_remote_code,
-     )
-     self.model = self.model.eval()
-     check_and_mark_quantized_module(self.model)
-     # Re-apply module structure updates that quantize() applied before AutoScheme
-     formats = self.formats if hasattr(self, "formats") else None
-     if not self.diffusion and formats is not None:
-         self.model = update_module(
-             self.model, formats=formats, trust_remote_code=self.trust_remote_code, cleanup_original=False
-         )
-     for n, m in self.model.named_modules():
-         m.global_name = n
-     self.shared_cache_keys = get_shared_keys(self.model)
+ try:
+     layer_config = self.scheme_generator.get_layer_config()
+ finally:
+     if _need_reload:
+         logger.info("Reloading model after AutoScheme")
+         self.model, self.tokenizer = llm_load_model(
+             _model_path,
+             device="cpu",
+             trust_remote_code=self.trust_remote_code,
+         )
+         self.model = self.model.eval()
+         check_and_mark_quantized_module(self.model)
+         # Re-apply module structure updates that quantize() applied before AutoScheme
+         formats = self.formats if hasattr(self, "formats") else None
+         if not self.diffusion and formats is not None:
+             self.model = update_module(
+                 self.model, formats=formats, trust_remote_code=self.trust_remote_code, cleanup_original=False
+             )
+         for n, m in self.model.named_modules():
+             m.global_name = n
+         self.shared_cache_keys = get_shared_keys(self.model)
Comment on lines 302 to +325
@@ -322,7 +322,7 @@ def test_clear_and_load_model_block(self, tiny_opt_model_path):
     assert current_params < original_params

     # Load back from model files
-    load_block_from_model_files(tiny_opt_model_path, block_name, block)
+    _load_block_from_model_files(tiny_opt_model_path, block_name, block)

Copilot AI Apr 17, 2026


This test now imports and calls the private helper _load_block_from_model_files. Tests typically should exercise the public API surface; relying on a leading-underscore symbol makes refactors harder. If this helper is intended to be stable, consider exposing a public load_block_from_model_files (even as a thin alias) and use that in the test instead.
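The suggested backward-compatible alias could look like the sketch below. The module layout and the body of the private helper are hypothetical; only the alias-plus-DeprecationWarning pattern is the point.

```python
# Sketch of a public alias that forwards to the renamed private helper and
# emits a DeprecationWarning, preserving downstream imports.
import warnings


def _load_block_from_model_files(model_path, block_name, block):
    # Placeholder body; the real helper loads the block's tensors from disk.
    return (model_path, block_name)


def load_block_from_model_files(model_path, block_name, block):
    """Deprecated public alias; forwards to _load_block_from_model_files."""
    warnings.warn(
        "load_block_from_model_files is deprecated; use "
        "_load_block_from_model_files (or the public offload API) instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return _load_block_from_model_files(model_path, block_name, block)
```

With the alias in place, the test can keep calling the public name while internals are free to move.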

Comment on lines 251 to 252

Selectively loads only tensors belonging to *block_name* without loading

Copilot AI Apr 17, 2026


load_block_from_model_files appears to have been replaced by the private _load_block_from_model_files. Since the old name looked like a public helper, removing/renaming it can break downstream imports. Consider keeping a public load_block_from_model_files wrapper/alias (optionally with a deprecation warning) that forwards to _load_block_from_model_files to preserve backward compatibility.

Comment on lines +205 to +209
# Collect all matching tensor names
matching: dict[str, str] = {} # tensor_name -> shard_file
for tensor_name, shard_file in weight_map.items():
for layer_name in layer_names:
if tensor_name == layer_name or tensor_name.startswith(layer_name + "."):

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_load_layers_from_model_files builds matching via a nested loop over weight_map.items() and layer_names, which is O(num_tensors × num_layers). For large models this can be a noticeable startup cost (and duplicates work already done when computing non_block_layer_names). Consider restructuring to avoid the nested scan (e.g., precompute a prefix set / trie, or generate exact tensor-name lists for the requested layers and look them up in weight_map).

Suggested change
- # Collect all matching tensor names
- matching: dict[str, str] = {}  # tensor_name -> shard_file
- for tensor_name, shard_file in weight_map.items():
-     for layer_name in layer_names:
-         if tensor_name == layer_name or tensor_name.startswith(layer_name + "."):
+ # Collect all matching tensor names without a nested scan over all
+ # requested layers for each tensor. A tensor matches if its full name is
+ # requested directly or if any dotted module prefix is requested.
+ requested_layers = set(layer_names)
+ matching: dict[str, str] = {}  # tensor_name -> shard_file
+ for tensor_name, shard_file in weight_map.items():
+     if tensor_name in requested_layers:
+         matching[tensor_name] = shard_file
+         continue
+     prefix_end = -1
+     while True:
+         prefix_end = tensor_name.find(".", prefix_end + 1)
+         if prefix_end == -1:
+             break
+         if tensor_name[:prefix_end] in requested_layers:
+             matching[tensor_name] = shard_file
+             break

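The prefix-set lookup in the suggestion can be exercised standalone. This self-contained version (hypothetical function name; the real helper lives inside _load_layers_from_model_files) returns the tensor-to-shard mapping directly:

```python
def match_tensors(weight_map: dict[str, str], layer_names: list[str]) -> dict[str, str]:
    """Match tensors to requested layers via exact name or dotted-prefix lookup.

    Runs in O(num_tensors * name_length) instead of O(num_tensors * num_layers).
    """
    requested = set(layer_names)
    matching: dict[str, str] = {}
    for tensor_name, shard_file in weight_map.items():
        if tensor_name in requested:  # exact tensor name requested
            matching[tensor_name] = shard_file
            continue
        dot = tensor_name.find(".")
        while dot != -1:  # test every dotted module prefix against the set
            if tensor_name[:dot] in requested:
                matching[tensor_name] = shard_file
                break
            dot = tensor_name.find(".", dot + 1)
    return matching
```

Per-tensor work now depends only on the number of dots in the name, so the cost no longer grows with the number of requested layers.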
@lvliang-intel
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

