Documentation ¶
Overview ¶
Package nccl monitors the status of NCCL. The component is optional and is enabled only if the host has NVIDIA GPUs.
Index ¶
Constants ¶
const (
	// repeated messages may indicate GPU communication issues, which may happen due to fabric manager issues
	// e.g.,
	// [Thu Oct 10 03:06:53 2024] pt_main_thread[2536443]: segfault at 7f797fe00000 ip 00007f7c7ac69996 sp 00007f7c12fd7c30 error 4 in libnccl.so.2[7f7c7ac00000+d3d3000]
	EventNameNCCLSegfaultInLibncclFromDmesg           = "nccl_segfault_in_libnccl_from_dmesg"
	EventKeyNCCLSegfaultInLibncclFromDmesgUnixSeconds = "unix_seconds"
	EventKeyNCCLSegfaultInLibncclFromDmesgLogLine     = "log_line"
)
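As an illustration of how these constants might be used, the sketch below matches a dmesg line against a segfault-in-libnccl pattern and builds an event payload keyed by the two EventKey constants. The regular expression, the buildEvent helper, and the map-based payload are assumptions made for this example; the package's actual matcher and event type are not shown on this page and may differ. The three constants are redeclared locally only so the example compiles on its own.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Redeclared from the package so this sketch is self-contained.
const (
	EventNameNCCLSegfaultInLibncclFromDmesg           = "nccl_segfault_in_libnccl_from_dmesg"
	EventKeyNCCLSegfaultInLibncclFromDmesgUnixSeconds = "unix_seconds"
	EventKeyNCCLSegfaultInLibncclFromDmesgLogLine     = "log_line"
)

// segfaultInLibnccl is a hypothetical pattern derived from the example
// log line in the constant's comment; it is not the package's own regex.
var segfaultInLibnccl = regexp.MustCompile(
	`segfault at [0-9a-f]+ ip [0-9a-f]+ sp [0-9a-f]+ error \d+ in libnccl\.so`,
)

// buildEvent is a hypothetical helper. It returns the event name and a
// key/value payload when the line reports a segfault inside libnccl,
// and ok=false for any other line.
func buildEvent(line string, unixSeconds int64) (name string, info map[string]string, ok bool) {
	if !segfaultInLibnccl.MatchString(line) {
		return "", nil, false
	}
	info = map[string]string{
		EventKeyNCCLSegfaultInLibncclFromDmesgUnixSeconds: strconv.FormatInt(unixSeconds, 10),
		EventKeyNCCLSegfaultInLibncclFromDmesgLogLine:     line,
	}
	return EventNameNCCLSegfaultInLibncclFromDmesg, info, true
}

func main() {
	line := "[Thu Oct 10 03:06:53 2024] pt_main_thread[2536443]: segfault at 7f797fe00000 " +
		"ip 00007f7c7ac69996 sp 00007f7c12fd7c30 error 4 in libnccl.so.2[7f7c7ac00000+d3d3000]"

	// 1728529613 is the bracketed timestamp interpreted as UTC (an assumption;
	// dmesg does not record the time zone in the line itself).
	if name, info, ok := buildEvent(line, 1728529613); ok {
		fmt.Println(name)
		fmt.Println(info[EventKeyNCCLSegfaultInLibncclFromDmesgUnixSeconds])
		fmt.Println(info[EventKeyNCCLSegfaultInLibncclFromDmesgLogLine])
	}
}
```

Because the constant's comment notes that repeated messages are what may indicate GPU communication issues, a consumer of such events would likely count occurrences over a time window rather than alert on a single line.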
Variables ¶
This section is empty.
Functions ¶
Types ¶