Continuously optimize AutoScheme RAM consumption #1703

lvliang-intel wants to merge 2 commits into main from
Conversation
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
for more information, see https://pre-commit.ci
Pull request overview
Reduces peak CPU RAM during AutoScheme by avoiding holding full model weights in memory and instead using a meta-device “skeleton” plus on-demand block reload.
Changes:
- Adds meta-skeleton loading and selective non-block layer materialization utilities in the offload module.
- Updates AutoScheme (DeltaLoss) flow to support meta-skeleton models and more aggressive per-block memory release.
- Updates compressor AutoScheme generation to optionally release and reload the model around scheme generation.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| auto_round/utils/offload.py | Introduces meta-skeleton + non-block materialization helpers and refactors checkpoint-loading helpers used by OffloadManager. |
| auto_round/auto_scheme/delta_loss.py | Loads meta skeleton when low CPU+GPU mem usage is enabled; materializes non-block layers; frees stored block inputs earlier. |
| auto_round/compressors/base.py | Releases model before AutoScheme and reloads afterward to reduce peak RAM in combined low_cpu_mem_usage + low_gpu_mem_usage mode. |
| test/test_cpu/schemes/test_auto_scheme_low_cpu_mem.py | Adjusts test to use the renamed/private block reload helper. |
```python
if _model_path is not None and os.path.isdir(_model_path):
    _need_reload = True
```
The low-CPU-memory reload path is gated on os.path.isdir(_model_path). For many Hugging Face loads, config._name_or_path is a repo id (not a directory), so this optimization will silently not activate. Consider resolving repo ids to a local snapshot directory (e.g., via huggingface_hub.snapshot_download(local_files_only=True) or an existing helper) rather than requiring _name_or_path to already be a local dir.
Suggested change:

```python
if isinstance(_model_path, str) and _model_path:
    if os.path.isdir(_model_path):
        _need_reload = True
    else:
        try:
            from huggingface_hub import snapshot_download

            _resolved_model_path = snapshot_download(_model_path, local_files_only=True)
            if os.path.isdir(_resolved_model_path):
                _model_path = _resolved_model_path
                _need_reload = True
        except Exception:
            pass
```
```python
layer_config = self.scheme_generator.get_layer_config()

if _need_reload:
    logger.info("Reloading model after AutoScheme")
    self.model, self.tokenizer = llm_load_model(
        _model_path,
        device="cpu",
        trust_remote_code=self.trust_remote_code,
    )
    self.model = self.model.eval()
    check_and_mark_quantized_module(self.model)
    # Re-apply module structure updates that quantize() applied before AutoScheme
    formats = self.formats if hasattr(self, "formats") else None
    if not self.diffusion and formats is not None:
        self.model = update_module(
            self.model, formats=formats, trust_remote_code=self.trust_remote_code, cleanup_original=False
        )
    for n, m in self.model.named_modules():
        m.global_name = n
    self.shared_cache_keys = get_shared_keys(self.model)
```
When _need_reload is true, self.model is deleted/set to None before calling get_layer_config(), and reloaded afterward. If get_layer_config() raises (OOM, dataset error, etc.), the reload block is skipped and the compressor instance is left in a broken state (self.model is None). Wrap the AutoScheme call in a try/finally so the model is reliably restored (or at least the object ends up in a consistent state) even on exceptions.
Suggested change:

```python
try:
    layer_config = self.scheme_generator.get_layer_config()
finally:
    if _need_reload:
        logger.info("Reloading model after AutoScheme")
        self.model, self.tokenizer = llm_load_model(
            _model_path,
            device="cpu",
            trust_remote_code=self.trust_remote_code,
        )
        self.model = self.model.eval()
        check_and_mark_quantized_module(self.model)
        # Re-apply module structure updates that quantize() applied before AutoScheme
        formats = self.formats if hasattr(self, "formats") else None
        if not self.diffusion and formats is not None:
            self.model = update_module(
                self.model, formats=formats, trust_remote_code=self.trust_remote_code, cleanup_original=False
            )
        for n, m in self.model.named_modules():
            m.global_name = n
        self.shared_cache_keys = get_shared_keys(self.model)
```
```diff
@@ -322,7 +322,7 @@ def test_clear_and_load_model_block(self, tiny_opt_model_path):
     assert current_params < original_params

     # Load back from model files
-    load_block_from_model_files(tiny_opt_model_path, block_name, block)
+    _load_block_from_model_files(tiny_opt_model_path, block_name, block)
```
This test now imports and calls the private helper _load_block_from_model_files. Tests typically should exercise the public API surface; relying on a leading-underscore symbol makes refactors harder. If this helper is intended to be stable, consider exposing a public load_block_from_model_files (even as a thin alias) and use that in the test instead.
```python
Selectively loads only tensors belonging to *block_name* without loading
```
load_block_from_model_files appears to have been replaced by the private _load_block_from_model_files. Since the old name looked like a public helper, removing/renaming it can break downstream imports. Consider keeping a public load_block_from_model_files wrapper/alias (optionally with a deprecation warning) that forwards to _load_block_from_model_files to preserve backward compatibility.
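A backward-compatible alias along the lines the comment suggests could look like this (a minimal sketch; the helper's signature is assumed from the test change above, and the private helper's body is elided):

```python
import warnings


def _load_block_from_model_files(model_path, block_name, block):
    """Private helper; real loading logic lives in offload.py (elided here)."""
    ...


def load_block_from_model_files(model_path, block_name, block):
    """Deprecated public alias that forwards to the private helper."""
    warnings.warn(
        "load_block_from_model_files is deprecated; "
        "use _load_block_from_model_files instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return _load_block_from_model_files(model_path, block_name, block)
```

Callers importing the old name keep working and get a one-time nudge toward the new one.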
```python
# Collect all matching tensor names
matching: dict[str, str] = {}  # tensor_name -> shard_file
for tensor_name, shard_file in weight_map.items():
    for layer_name in layer_names:
        if tensor_name == layer_name or tensor_name.startswith(layer_name + "."):
```
_load_layers_from_model_files builds matching via a nested loop over weight_map.items() and layer_names, which is O(num_tensors × num_layers). For large models this can be a noticeable startup cost (and duplicates work already done when computing non_block_layer_names). Consider restructuring to avoid the nested scan (e.g., precompute a prefix set / trie, or generate exact tensor-name lists for the requested layers and look them up in weight_map).
Suggested change:

```python
# Collect all matching tensor names without a nested scan over all
# requested layers for each tensor. A tensor matches if its full name is
# requested directly or if any dotted module prefix is requested.
requested_layers = set(layer_names)
matching: dict[str, str] = {}  # tensor_name -> shard_file
for tensor_name, shard_file in weight_map.items():
    if tensor_name in requested_layers:
        matching[tensor_name] = shard_file
        continue
    prefix_end = -1
    while True:
        prefix_end = tensor_name.find(".", prefix_end + 1)
        if prefix_end == -1:
            break
        if tensor_name[:prefix_end] in requested_layers:
            matching[tensor_name] = shard_file
            break
```
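The prefix-set idea can be shown as a standalone function (illustrative sketch; `weight_map` and `layer_names` mirror the structures assumed in the review comment, and the function name is hypothetical):

```python
def match_tensors(weight_map: dict[str, str], layer_names: list[str]) -> dict[str, str]:
    """Map each tensor belonging to a requested layer to its shard file.

    A tensor matches when its full name is requested directly, or when any
    dotted module prefix of the name is a requested layer. Set-membership
    checks on prefixes avoid the O(num_tensors * num_layers) nested scan.
    """
    requested = set(layer_names)
    matching: dict[str, str] = {}
    for tensor_name, shard_file in weight_map.items():
        if tensor_name in requested:
            matching[tensor_name] = shard_file
            continue
        # Walk dotted prefixes: "a.b.weight" -> "a", then "a.b"
        end = tensor_name.find(".")
        while end != -1:
            if tensor_name[:end] in requested:
                matching[tensor_name] = shard_file
                break
            end = tensor_name.find(".", end + 1)
    return matching
```

Each tensor name is scanned once regardless of how many layers were requested, so the cost scales with the total length of tensor names rather than with their product against the layer list.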
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
Description

Continuously optimize AutoScheme RAM consumption. Applied the following optimizations:
1. Meta-skeleton loading: keep the model structure, not the weights.
2. Selective non-block materialization: only materialize the small, always-needed parts.
3. Block-wise offloading and reload: stream blocks from the checkpoint on demand.
4. One-block-at-a-time execution.
5. Release-before-reload around AutoScheme.
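Streaming blocks on demand relies on the sharded-checkpoint index that Hugging Face saves next to the weights (`model.safetensors.index.json`, whose `weight_map` maps each tensor name to its shard file). A minimal sketch of locating the shards that hold one block's tensors (the helper name is hypothetical, not the PR's actual API):

```python
import json
from pathlib import Path


def shards_for_block(model_dir: str, block_name: str) -> set[str]:
    """Return the shard files that hold the tensors of one block,
    according to the sharded-checkpoint weight map."""
    index = json.loads(Path(model_dir, "model.safetensors.index.json").read_text())
    prefix = block_name + "."
    return {
        shard
        for tensor_name, shard in index["weight_map"].items()
        if tensor_name == block_name or tensor_name.startswith(prefix)
    }
```

With the shard set in hand, only those files need to be opened to materialize a block, which is what keeps the peak RAM bounded by roughly one block at a time rather than the full model.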
Test result with Llama-3.1-8B (`compare_low_cpu_mem_usage.py`):

```
=== Summary ===
case                       exit  peak_ram_gb  peak_vram_gb  wall_time_s  quant_tune_s
disable_low_cpu_mem_usage  0     29.51        14.37         1417.22      1154.88
default_low_cpu_mem_usage  0     12.02        14.67         2057.75      1842.60

=== Delta (disable - default) ===
peak_ram_gb: 17.49
peak_vram_gb: -0.30
wall_time_sec: -640.53
```
Type of Change
Related Issues
Fixes or relates to #
Checklist Before Submitting