Continuously optimize AutoScheme RAM consumption #1703

lvliang-intel wants to merge 2 commits into main from
Conversation
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
for more information, see https://pre-commit.ci
Pull request overview
Reduces peak CPU RAM during AutoScheme by avoiding holding full model weights in memory and instead using a meta-device “skeleton” plus on-demand block reload.
Changes:
- Adds meta-skeleton loading and selective non-block layer materialization utilities in the offload module.
- Updates AutoScheme (DeltaLoss) flow to support meta-skeleton models and more aggressive per-block memory release.
- Updates compressor AutoScheme generation to optionally release and reload the model around scheme generation.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| auto_round/utils/offload.py | Introduces meta-skeleton + non-block materialization helpers and refactors checkpoint-loading helpers used by OffloadManager. |
| auto_round/auto_scheme/delta_loss.py | Loads meta skeleton when low CPU+GPU mem usage is enabled; materializes non-block layers; frees stored block inputs earlier. |
| auto_round/compressors/base.py | Releases model before AutoScheme and reloads afterward to reduce peak RAM in combined low_cpu_mem_usage + low_gpu_mem_usage mode. |
| test/test_cpu/schemes/test_auto_scheme_low_cpu_mem.py | Adjusts test to use the renamed/private block reload helper. |
```python
if _model_path is not None and os.path.isdir(_model_path):
    _need_reload = True
```
The low-CPU-memory reload path is gated on os.path.isdir(_model_path). For many Hugging Face loads, config._name_or_path is a repo id (not a directory), so this optimization will silently not activate. Consider resolving repo ids to a local snapshot directory (e.g., via huggingface_hub.snapshot_download(local_files_only=True) or an existing helper) rather than requiring _name_or_path to already be a local dir.
Suggested change:

```python
if isinstance(_model_path, str) and _model_path:
    if os.path.isdir(_model_path):
        _need_reload = True
    else:
        try:
            from huggingface_hub import snapshot_download

            _resolved_model_path = snapshot_download(_model_path, local_files_only=True)
            if os.path.isdir(_resolved_model_path):
                _model_path = _resolved_model_path
                _need_reload = True
        except Exception:
            pass
```
```python
layer_config = self.scheme_generator.get_layer_config()

if _need_reload:
    logger.info("Reloading model after AutoScheme")
    self.model, self.tokenizer = llm_load_model(
        _model_path,
        device="cpu",
        trust_remote_code=self.trust_remote_code,
    )
    self.model = self.model.eval()
    check_and_mark_quantized_module(self.model)
    # Re-apply module structure updates that quantize() applied before AutoScheme
    formats = self.formats if hasattr(self, "formats") else None
    if not self.diffusion and formats is not None:
        self.model = update_module(
            self.model, formats=formats, trust_remote_code=self.trust_remote_code, cleanup_original=False
        )
    for n, m in self.model.named_modules():
        m.global_name = n
    self.shared_cache_keys = get_shared_keys(self.model)
```
When _need_reload is true, self.model is deleted/set to None before calling get_layer_config(), and reloaded afterward. If get_layer_config() raises (OOM, dataset error, etc.), the reload block is skipped and the compressor instance is left in a broken state (self.model is None). Wrap the AutoScheme call in a try/finally so the model is reliably restored (or at least the object ends up in a consistent state) even on exceptions.
Suggested change:

```python
try:
    layer_config = self.scheme_generator.get_layer_config()
finally:
    if _need_reload:
        logger.info("Reloading model after AutoScheme")
        self.model, self.tokenizer = llm_load_model(
            _model_path,
            device="cpu",
            trust_remote_code=self.trust_remote_code,
        )
        self.model = self.model.eval()
        check_and_mark_quantized_module(self.model)
        # Re-apply module structure updates that quantize() applied before AutoScheme
        formats = self.formats if hasattr(self, "formats") else None
        if not self.diffusion and formats is not None:
            self.model = update_module(
                self.model, formats=formats, trust_remote_code=self.trust_remote_code, cleanup_original=False
            )
        for n, m in self.model.named_modules():
            m.global_name = n
        self.shared_cache_keys = get_shared_keys(self.model)
```
```diff
@@ -322,7 +322,7 @@ def test_clear_and_load_model_block(self, tiny_opt_model_path):
     assert current_params < original_params

     # Load back from model files
-    load_block_from_model_files(tiny_opt_model_path, block_name, block)
+    _load_block_from_model_files(tiny_opt_model_path, block_name, block)
```
This test now imports and calls the private helper _load_block_from_model_files. Tests typically should exercise the public API surface; relying on a leading-underscore symbol makes refactors harder. If this helper is intended to be stable, consider exposing a public load_block_from_model_files (even as a thin alias) and use that in the test instead.
```python
Selectively loads only tensors belonging to *block_name* without loading
```
load_block_from_model_files appears to have been replaced by the private _load_block_from_model_files. Since the old name looked like a public helper, removing/renaming it can break downstream imports. Consider keeping a public load_block_from_model_files wrapper/alias (optionally with a deprecation warning) that forwards to _load_block_from_model_files to preserve backward compatibility.
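A backward-compatible alias along the lines the comment suggests could look like this (a minimal sketch; the helper's signature is assumed from the test change above, and the private helper's body is elided):

```python
import warnings


def _load_block_from_model_files(model_path, block_name, block):
    """Private helper; real loading logic lives in offload.py (elided here)."""
    ...


def load_block_from_model_files(model_path, block_name, block):
    """Deprecated public alias that forwards to the private helper."""
    warnings.warn(
        "load_block_from_model_files is deprecated; "
        "use _load_block_from_model_files instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return _load_block_from_model_files(model_path, block_name, block)
```

Callers importing the old name keep working and get a one-time nudge toward the new one.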
```python
# Collect all matching tensor names
matching: dict[str, str] = {}  # tensor_name -> shard_file
for tensor_name, shard_file in weight_map.items():
    for layer_name in layer_names:
        if tensor_name == layer_name or tensor_name.startswith(layer_name + "."):
```
_load_layers_from_model_files builds matching via a nested loop over weight_map.items() and layer_names, which is O(num_tensors × num_layers). For large models this can be a noticeable startup cost (and duplicates work already done when computing non_block_layer_names). Consider restructuring to avoid the nested scan (e.g., precompute a prefix set / trie, or generate exact tensor-name lists for the requested layers and look them up in weight_map).
Suggested change:

```python
# Collect all matching tensor names without a nested scan over all
# requested layers for each tensor. A tensor matches if its full name is
# requested directly or if any dotted module prefix is requested.
requested_layers = set(layer_names)
matching: dict[str, str] = {}  # tensor_name -> shard_file
for tensor_name, shard_file in weight_map.items():
    if tensor_name in requested_layers:
        matching[tensor_name] = shard_file
        continue
    prefix_end = -1
    while True:
        prefix_end = tensor_name.find(".", prefix_end + 1)
        if prefix_end == -1:
            break
        if tensor_name[:prefix_end] in requested_layers:
            matching[tensor_name] = shard_file
            break
```
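The prefix-set idea can be shown as a standalone function (illustrative sketch; `weight_map` and `layer_names` mirror the structures assumed in the review comment, and the function name is hypothetical):

```python
def match_tensors(weight_map: dict[str, str], layer_names: list[str]) -> dict[str, str]:
    """Map each tensor belonging to a requested layer to its shard file.

    A tensor matches when its full name is requested directly, or when any
    dotted module prefix of the name is a requested layer. Set-membership
    checks on prefixes avoid the O(num_tensors * num_layers) nested scan.
    """
    requested = set(layer_names)
    matching: dict[str, str] = {}
    for tensor_name, shard_file in weight_map.items():
        if tensor_name in requested:
            matching[tensor_name] = shard_file
            continue
        # Walk dotted prefixes: "a.b.weight" -> "a", then "a.b"
        end = tensor_name.find(".")
        while end != -1:
            if tensor_name[:end] in requested:
                matching[tensor_name] = shard_file
                break
            end = tensor_name.find(".", end + 1)
    return matching
```

Each tensor name is scanned once regardless of how many layers were requested, so the cost scales with the total length of tensor names rather than with their product against the layer list.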
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
Description

Continuously optimize AutoScheme RAM consumption. Applied the following optimizations:
1. Meta-skeleton loading: keep the model structure, not the weights.
2. Selective non-block materialization: only materialize the small, always-needed parts.
3. Block-wise offloading and reload: stream blocks from the checkpoint on demand.
4. One-block-at-a-time execution.
5. Release-before-reload around AutoScheme.
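Streaming blocks on demand relies on the sharded-checkpoint index that Hugging Face saves next to the weights (`model.safetensors.index.json`, whose `weight_map` maps each tensor name to its shard file). A minimal sketch of locating the shards that hold one block's tensors (the helper name is hypothetical, not the PR's actual API):

```python
import json
from pathlib import Path


def shards_for_block(model_dir: str, block_name: str) -> set[str]:
    """Return the shard files that hold the tensors of one block,
    according to the sharded-checkpoint weight map."""
    index = json.loads(Path(model_dir, "model.safetensors.index.json").read_text())
    prefix = block_name + "."
    return {
        shard
        for tensor_name, shard in index["weight_map"].items()
        if tensor_name == block_name or tensor_name.startswith(prefix)
    }
```

With the shard set in hand, only those files need to be opened to materialize a block, which is what keeps the peak RAM bounded by roughly one block at a time rather than the full model.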
Test result with Llama-3.1-8B (`compare_low_cpu_mem_usage.py`):

```
=== Summary ===
case                       exit  peak_ram_gb  peak_vram_gb  wall_time_s  quant_tune_s
disable_low_cpu_mem_usage  0     29.51        14.37         1417.22      1154.88
default_low_cpu_mem_usage  0     12.02        14.67         2057.75      1842.60

=== Delta (disable - default) ===
peak_ram_gb: 17.49
peak_vram_gb: -0.30
wall_time_sec: -640.53
```
Type of Change
Related Issues
Fixes or relates to #
Checklist Before Submitting