Add AMD GPU support to accelpath (WIP) #205

Draft
zerefwayne wants to merge 6 commits into EESSI:main from zerefwayne:amdaccel

zerefwayne commented Apr 17, 2026

Temporarily disabled nvidia_smi to run amd code

  • Extract amdgcn_cc using amd-smi
    aayushj@j14n2 /h/a/c/s/init (amdaccel) $ ./eessi_archdetect.sh accelpath
    accel/amd/gfx90a
    
  • Use logic in llvm-project to extract amdgcn_cc
    aayushj@j14n2 /h/a/c/s/init (amdaccel) $ ./eessi_archdetect.sh accelpath
    accel/amd/gfx90a
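
A minimal sketch of the decoding step behind the second bullet, assuming the llvm-project logic reads a `gfx_target_version` integer from the KFD topology and that this integer encodes major * 10000 + minor * 100 + stepping. That encoding and the function name `gfx_from_target_version` are assumptions for illustration; the actual code in this PR is not shown here.

```bash
#!/usr/bin/env bash
# Sketch: decode a KFD 'gfx_target_version' integer into a gfx target name.
# Assumed encoding: major * 10000 + minor * 100 + stepping.
# gfx_from_target_version is a hypothetical helper, not taken from the PR.
gfx_from_target_version() {
    local ver="$1"
    local major=$(( ver / 10000 ))
    local minor=$(( (ver / 100) % 100 ))
    local step=$(( ver % 100 ))
    # The stepping is printed in hexadecimal, which is why 10 becomes 'a'.
    printf 'gfx%d%d%x\n' "$major" "$minor" "$step"
}

gfx_from_target_version 90010    # prints gfx90a
```

Printing the stepping in hex is what turns a value of 10 into the trailing `a` of `gfx90a`.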
    

@zerefwayne zerefwayne changed the title Amdaccel Add AMD GPU support to accelpath Apr 17, 2026
@zerefwayne zerefwayne changed the title Add AMD GPU support to accelpath Add AMD GPU support to accelpath (WIP) Apr 17, 2026
Author

zerefwayne commented Apr 17, 2026

etp/mi210

aayushj@j14n2 /h/a/c/s/init (amdaccel) $ ./eessi_archdetect.sh cpupath
x86_64/intel/icelake
aayushj@j14n2 /h/a/c/s/init (amdaccel) $ ./eessi_archdetect.sh -d accelpath
2026-04-17 11:42:41 [DEBUG] accelpath: Override variable set as ''
2026-04-17 11:42:41 [DEBUG] nvidia_accelpath: nvidia-smi command not found
2026-04-17 11:42:41 [DEBUG] amd_accelpath: KFD sysfs path found @ /sys/devices/virtual/kfd/kfd/topology/nodes
2026-04-17 11:42:41 [DEBUG] amd_accelpath: AMDGCN compute capability 'gfx90a' derived from KFD node 2
2026-04-17 11:42:41 [DEBUG] accelpath: result: accel/amd/gfx90a
accel/amd/gfx90a

lumi/mi250x

joglekar@nid007956:~/software-layer-scripts/init> ./eessi_archdetect.sh cpupath
x86_64/amd/zen3
joglekar@nid007956:~/software-layer-scripts/init> ./eessi_archdetect.sh -d accelpath
2026-04-17 13:15:35 [DEBUG] accelpath: Override variable set as ''
2026-04-17 13:15:35 [DEBUG] nvidia_accelpath: nvidia-smi command not found
2026-04-17 13:15:35 [DEBUG] amd_accelpath: KFD sysfs path found @ /sys/devices/virtual/kfd/kfd/topology/nodes
2026-04-17 13:15:35 [DEBUG] amd_accelpath: AMDGCN compute capability 'gfx90a' derived from KFD node 8
2026-04-17 13:15:35 [DEBUG] accelpath: result: accel/amd/gfx90a
accel/amd/gfx90a
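
The debug lines above report the GPU on KFD node 2 on one system and node 8 on the other, which suggests that earlier nodes are skipped. A rough sketch of such a walk follows, assuming each node directory has a `properties` file with `key value` lines, that CPU-only nodes report `gfx_target_version 0`, and that nodes are numbered contiguously from 0. The function name and the root-directory parameter are hypothetical and exist so the walk can be tried against a test tree.

```bash
#!/usr/bin/env bash
# Sketch: find the first KFD topology node that reports a GPU.
# first_kfd_gpu_version is a hypothetical helper; kfd_root is a parameter
# so the function can be pointed at a fake tree instead of the real sysfs.
first_kfd_gpu_version() {
    local kfd_root="${1:-/sys/devices/virtual/kfd/kfd/topology/nodes}"
    local node=0 props ver
    # Assumption: node directories are numbered 0, 1, 2, ... without gaps.
    while [ -d "$kfd_root/$node" ]; do
        props="$kfd_root/$node/properties"
        if [ -r "$props" ]; then
            ver=$(awk '$1 == "gfx_target_version" { print $2; exit }' "$props")
            # Assumption: CPU-only nodes report 0, so they are skipped.
            if [ -n "$ver" ] && [ "$ver" -ne 0 ]; then
                echo "$node $ver"
                return 0
            fi
        fi
        node=$((node + 1))
    done
    return 1   # no GPU node found
}
```

Called without arguments it reads the sysfs path from the log; called with a directory it reads that tree instead, which makes it easy to exercise on a machine without a GPU.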

snellius/rome (no GPU)

aayushj@tcn351 /h/a/c/s/init (amdaccel) $ ./eessi_archdetect.sh cpupath
x86_64/amd/zen2
aayushj@tcn351 /h/a/c/s/init (amdaccel) $ ./eessi_archdetect.sh accelpath
aayushj@tcn351 /h/a/c/s/init (amdaccel) $ ./eessi_archdetect.sh -d accelpath
2026-04-17 11:33:41 [DEBUG] accelpath: Override variable set as ''
2026-04-17 11:33:41 [DEBUG] nvidia_accelpath: nvidia-smi command found @ /usr/bin/nvidia-smi
2026-04-17 11:33:41 [DEBUG] nvidia_accelpath: nvidia-smi command failed, see output in /tmp/nvidia_smi_out.31GYb
2026-04-17 11:33:41 [DEBUG] amd_accelpath: KFD sysfs path not found. Falling back to amd-smi.
2026-04-17 11:33:41 [DEBUG] amd_accelpath: amd-smi command not found
2026-04-17 11:33:41 [DEBUG] accelpath: No supported accelerators found on this system
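
The log above shows the order in which backends are tried: NVIDIA first, then KFD sysfs, then amd-smi, and finally nothing. A small sketch of that "first probe that succeeds wins" pattern is below. The probe functions are hypothetical stand-ins, and returning a non-zero status when nothing is found is an assumption; the exit status of the real script is not shown in this thread.

```bash
#!/usr/bin/env bash
# Sketch: try each detection backend in order and stop at the first success.
# A probe is any command that prints a result and returns 0 on success.
detect_first() {
    local probe result
    for probe in "$@"; do
        if result=$("$probe" 2>/dev/null) && [ -n "$result" ]; then
            echo "$result"
            return 0
        fi
    done
    return 1   # nothing found: print nothing, signal failure via exit status
}

# Hypothetical stand-in probes, for illustration only.
probe_nvidia() { return 1; }                 # e.g. nvidia-smi missing or failing
probe_amd()    { echo "accel/amd/gfx90a"; }  # e.g. KFD sysfs lookup succeeded

detect_first probe_nvidia probe_amd    # prints accel/amd/gfx90a
```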

snellius/gpu_a100

aayushj@gcn18 /h/a/c/s/init (amdaccel) $ ./eessi_archdetect.sh cpupath
x86_64/intel/icelake
aayushj@gcn18 /h/a/c/s/init (amdaccel) $ ./eessi_archdetect.sh -d accelpath
2026-04-17 11:38:33 [DEBUG] accelpath: Override variable set as ''
2026-04-17 11:38:33 [DEBUG] nvidia_accelpath: nvidia-smi command found @ /usr/bin/nvidia-smi
2026-04-17 11:38:33 [DEBUG] nvidia_accelpath: CUDA compute capability '80' derived from nvidia-smi output 'NVIDIA A100-SXM4-40GB, 1, 590.48.01, 8.0'
2026-04-17 11:38:33 [DEBUG] accelpath: result: accel/nvidia/cc80
accel/nvidia/cc80

snellius/gpu_h100

aayushj@gcn122 /h/a/c/s/init (amdaccel) $ ./eessi_archdetect.sh cpupath
x86_64/amd/zen4
aayushj@gcn122 /h/a/c/s/init (amdaccel) $ ./eessi_archdetect.sh -d accelpath
2026-04-17 11:41:31 [DEBUG] accelpath: Override variable set as ''
2026-04-17 11:41:31 [DEBUG] nvidia_accelpath: nvidia-smi command found @ /usr/bin/nvidia-smi
2026-04-17 11:41:31 [DEBUG] nvidia_accelpath: CUDA compute capability '90' derived from nvidia-smi output 'NVIDIA H100, 1, 590.48.01, 9.0'
2026-04-17 11:41:31 [DEBUG] accelpath: result: accel/nvidia/cc90
accel/nvidia/cc90
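
For the NVIDIA lines, the debug output shows `'80'` derived from a comma-separated line ending in `8.0`. A minimal sketch of that step, assuming the compute capability is always the last field and the dot is simply dropped; the function name is hypothetical and the nvidia-smi query that produces the line is not shown in this thread.

```bash
#!/usr/bin/env bash
# Sketch: derive the 'ccNN' number from one line of nvidia-smi CSV output.
# cc_from_smi_line is a hypothetical helper, not taken from the script.
cc_from_smi_line() {
    local line="$1"
    local cc="${line##*, }"          # text after the last ", ", e.g. "8.0"
    printf '%s\n' "$cc" | tr -d '.'  # drop the dot: "8.0" -> "80"
}

cc_from_smi_line 'NVIDIA A100-SXM4-40GB, 1, 590.48.01, 8.0'    # prints 80
```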

Thyre commented Apr 17, 2026

My personal computer:

$ amdgpu-arch
gfx1201
gfx1036
$ bash eessi_archdetect.sh accelpath
accel/amd/gfx1201

My notebook:

$ amdgpu-arch
gfx1152
$ bash eessi_archdetect.sh accelpath
accel/amd/gfx1152

OACISS Odyssey (MI300A):

[reuter1@odyssey ~]$ /opt/rocm/llvm/bin/amdgpu-arch
gfx942
gfx942
gfx942
gfx942
[reuter1@odyssey ~]$ bash eessi_archdetect.sh accelpath
accel/amd/gfx942

OACISS Illyad (H100 + MI210):

[reuter1@illyad ~]$ nvptx-arch
sm_90
[reuter1@illyad ~]$ amdgpu-arch
gfx90a
gfx90a
[reuter1@illyad ~]$ bash eessi_archdetect.sh accelpath
accel/nvidia/cc90

OACISS Instinct (M40 + MI100 + A770):

reuter1@instinct:~$ /opt/rocm/llvm/bin/nvptx-arch
sm_52
reuter1@instinct:~$ /opt/rocm/llvm/bin/amdgpu-arch
gfx908
reuter1@instinct:~$ bash eessi_archdetect.sh accelpath
accel/nvidia/cc52

Comment thread init/eessi_archdetect.sh
if [[ $? -eq 0 ]]; then
log "DEBUG" "accelpath: result: ${nv_res}"
echo "$nv_res"
return 0
Member

Hmm, we haven't really thought about this yet, but what happens when there are multiple GPUs of different generations, or both AMD and NVIDIA GPUs?

I guess you just have to use an override, but it would be nice if we supported a mode that listed all the possibilities for the override (and explained how to set it).
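
One possible shape for such a listing mode, as a sketch only: take the detected gfx targets one per line (the format `amdgpu-arch` prints elsewhere in this thread) and emit each unique candidate path once. The function name is hypothetical, and how the override is actually set is left out because it is not shown here.

```bash
#!/usr/bin/env bash
# Sketch: turn detected gfx targets (one per line on stdin) into the unique
# accelerator paths a user could choose from, keeping first-seen order.
# list_amd_candidates is a hypothetical helper, not part of the PR.
list_amd_candidates() {
    awk 'NF && !seen[$1]++ { print "accel/amd/" $1 }'
}

printf 'gfx1201\ngfx1036\n' | list_amd_candidates
```

On a node with four identical GPUs this collapses to a single candidate, while a desktop with a dedicated and an integrated GPU yields two.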

Member

Not necessarily for this PR, but worth an issue once this is merged

For now, to keep things simple, we agreed to prefer NVIDIA over AMD and to pick the first accelerator that is found.
It is true, though, that users might want to override this.

Depending on how a system enumerates its accelerators, detection might pick up integrated graphics ahead of the dedicated accelerator, or a user might prefer a particular accelerator because it performs better for their use case.

In most cases, especially in HPC, I do not expect to see different kinds of accelerators on a node.
In the consumer space this is a lot more common, especially with integrated graphics.
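
The selection rule described in this comment (NVIDIA before AMD, first match wins) can be sketched as follows. The function name and the newline-separated candidate list are assumptions made for illustration; the script in this PR may structure the same rule differently.

```bash
#!/usr/bin/env bash
# Sketch: pick one accelerator path from a newline-separated candidate list,
# preferring NVIDIA over AMD and taking the first match per vendor.
# pick_accel is a hypothetical helper, not taken from the PR.
pick_accel() {
    local candidates="$1" vendor match
    for vendor in nvidia amd; do
        match=$(printf '%s\n' "$candidates" | grep "^accel/$vendor/" | head -n 1)
        if [ -n "$match" ]; then
            echo "$match"
            return 0
        fi
    done
    return 1   # no supported accelerator among the candidates
}

# Mixed node like the H100 + MI210 system above:
pick_accel "$(printf 'accel/amd/gfx90a\naccel/nvidia/cc90')"    # prints accel/nvidia/cc90
```

Keeping the vendor order in a single loop makes it straightforward to change the preference later, or to bypass it entirely when an override is set.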
