Omnia

Ansible playbook-based tools for deploying Slurm and Kubernetes clusters for High Performance Computing, Machine Learning, Deep Learning, and High-Performance Data Analytics

This project is maintained by dellhpc

Setting Up Grafana

Using Grafana, users can poll multiple devices and create graphs and visualizations of key system metrics such as Temperature, System Power Consumption, Memory Usage, IO Usage, CPU Usage, Total Memory Power, System Output Power, Total Fan Power, Total Storage Power, System Input Power, Total CPU Power, RPM Readings, Total Heat Dissipation, Power to Cool Ratio, and System Air Flow Efficiency.

Many of these metrics are collected using iDRAC telemetry, which streams telemetry data from your servers to a centralized log/metrics server. For more information, refer to the iDRAC telemetry documentation.

Prerequisites

  1. Update control_plane/input_params/login_vars.yml with the Grafana username and password.
  2. Fill in all parameters in telemetry/input_params/telemetry_login_vars.yml.
  3. Fill in all parameters in telemetry/input_params/telemetry_base_vars.yml.
  4. Find the IP address of the Grafana UI using:

kubectl get svc -n grafana
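
The lookup above can also be scripted. The following is a minimal sketch, not part of Omnia: the function name `grafana_url` is hypothetical, it assumes the first service in the `grafana` namespace is the Grafana UI, and it assumes the service has either an external (LoadBalancer) IP or a cluster IP. Port 5000 is the one used in the login URL in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Grafana UI URL for the first service
# found in the "grafana" namespace.
grafana_url() {
  local ip
  # Prefer an external LoadBalancer IP, if one has been assigned (assumption).
  ip=$(kubectl get svc -n grafana \
        -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}' 2>/dev/null)
  # Otherwise fall back to the service's cluster IP.
  if [ -z "$ip" ]; then
    ip=$(kubectl get svc -n grafana -o jsonpath='{.items[0].spec.clusterIP}')
  fi
  [ -n "$ip" ] || { echo "No Grafana service IP found" >&2; return 1; }
  echo "https://${ip}:5000"
}
```

After sourcing the file, calling `grafana_url` prints the address to open in the browser. If the namespace holds more than one service, check the output of `kubectl get svc -n grafana` by hand instead.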

Logging into Grafana

Access the Grafana UI at https://<Grafana UI IP>:5000 from a supported browser.

Note: Always enable JavaScript in your browser. Running Grafana without JavaScript enabled in the browser is not supported.
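
Before opening the browser, it can help to confirm that the UI is answering. The sketch below is an illustration only: the function name `grafana_is_up` is hypothetical, it relies on Grafana's `/api/health` endpoint, and it assumes the UI is served over HTTPS on port 5000 with a certificate the client may not trust (hence `-k`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if Grafana answers on its health endpoint.
# Usage: grafana_is_up <Grafana UI IP>
grafana_is_up() {
  local ip="$1" code
  # -k skips certificate verification (assumes a self-signed certificate);
  # -w prints only the HTTP status code; -o discards the response body.
  code=$(curl -k -s -o /dev/null -w '%{http_code}' --max-time 10 \
         "https://${ip}:5000/api/health")
  [ "$code" = "200" ]
}
```

For example, `grafana_is_up <Grafana UI IP> && echo "Grafana is reachable"` prints the message only when the health endpoint returns HTTP 200.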

Prerequisites to Enabling Slurm Telemetry

Initiating Telemetry

  1. Once control_plane.yml and omnia.yml are executed, run the following command from omnia/telemetry:

ansible-playbook telemetry.yml

Note: Telemetry collection is only initiated on iDRACs on AWX that have a Datacenter license and are running firmware version 4 or higher.
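
Because the prerequisites require every parameter in the two telemetry variable files to be filled in, a small pre-flight check can catch blanks before the playbook starts. This is a rough sketch under several assumptions: the helper names `empty_vars` and `start_telemetry` are hypothetical, the files are assumed to be plain (not vault-encrypted) YAML with flat `key: value` entries, and the script is assumed to run from the omnia directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list keys in a YAML vars file whose value is empty
# (nothing after the colon, or an empty quoted string).
empty_vars() {
  grep -nE "^[A-Za-z0-9_]+:[[:space:]]*(\"\"|'')?[[:space:]]*(#.*)?\$" "$1"
}

# Hypothetical wrapper: refuse to start telemetry while either vars file
# still has empty values. Assumes it is run from the omnia directory.
start_telemetry() {
  local f
  for f in telemetry/input_params/telemetry_login_vars.yml \
           telemetry/input_params/telemetry_base_vars.yml; do
    # grep exits 0 when it finds at least one empty value.
    if empty_vars "$f"; then
      echo "Fill in the parameters listed above in $f" >&2
      return 1
    fi
  done
  # Run the playbook from omnia/telemetry, as described in this guide.
  (cd telemetry && ansible-playbook telemetry.yml)
}
```

Calling `start_telemetry` prints the line numbers of any empty parameters and stops, or runs `ansible-playbook telemetry.yml` once both files pass. The check is deliberately simple and does not understand nested YAML, so treat a clean result as a hint rather than a guarantee.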

Adding a New Node to Telemetry

After initiation, new nodes can be added to telemetry by running the following command from omnia/telemetry:

ansible-playbook add_idrac_node.yml