Omnia

Ansible playbook-based tools for deploying Slurm and Kubernetes clusters for High Performance Computing, Machine Learning, Deep Learning, and High-Performance Data Analytics

This project is maintained by dellhpc

Monitor Kubernetes and Slurm

Omnia provides playbooks to configure additional software components for Kubernetes such as JupyterHub and Kubeflow. For workload management (submitting, controlling, and managing jobs) of HPC, AI, and Data Analytics clusters, you can access Kubernetes and Slurm dashboards and other supported applications.

Before accessing the dashboards

To access any of the dashboards, ensure that a compatible web browser is installed. If you are connecting remotely to your Linux server through SSH by using MobaXterm (a version later than 8) or another X11 client, follow the steps below to launch the Firefox browser:

Note: After a PuTTY or MobaXterm session ends, you must run the export DISPLAY=:10.0 command again in each new session; otherwise, Firefox cannot be launched.
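The note above can be wrapped in a small helper so that the display variable is restored before every launch. This is a minimal sketch: the function names set_x11_display and launch_firefox are hypothetical, and :10.0 is the display value quoted in the note, which may differ in your session.

```shell
# Hypothetical helper: restore the X11 display used by the forwarded session.
# :10.0 is the value from the note above; adjust it if your session differs.
set_x11_display() {
  export DISPLAY=":10.0"
}

# Hypothetical helper: set the display, then start Firefox in the background
# if it is installed on this node.
launch_firefox() {
  set_x11_display
  if command -v firefox >/dev/null 2>&1; then
    firefox "$@" &
  else
    echo "firefox is not installed on this node" >&2
    return 1
  fi
}
```

For example, launch_firefox https://manager.example.com opens the FreeIPA URL used later on this page.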

Access FreeIPA Dashboard

The FreeIPA Dashboard can be accessed from the control plane, manager, and login nodes. To access the dashboard:

  1. Install the Firefox Browser.
  2. Open the Firefox Browser and enter the URL: https://<hostname>. For example, enter https://manager.example.com.
  3. Enter the username and password. If the admin or user has obtained a Kerberos ticket, then the credentials need not be provided.

Note: To obtain a Kerberos ticket, perform the following actions:

  1. Enter kinit <username>
  2. When prompted, enter the password.
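The ticket steps above can be sketched as a helper that asks for a password only when no valid ticket is cached. The function name ensure_ticket is hypothetical; klist -s is the silent mode of klist, which reports through its exit status whether the credential cache holds a valid ticket.

```shell
# Hypothetical helper: request a Kerberos ticket for the given user only when
# no valid ticket is already cached.
ensure_ticket() {
  if klist -s; then
    echo "valid Kerberos ticket already cached"
  else
    # kinit prompts for the password of the given user.
    kinit "$1"
  fi
}
```

For example, ensure_ticket admin prompts for the admin password only on first use or after the ticket has expired.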

An administrator can create users on the login node using FreeIPA. Users are prompted to change their passwords at first login.

Access Kubernetes Dashboard

  1. To verify if the Kubernetes-dashboard service is in the Running state, run kubectl get pods --namespace kubernetes-dashboard.
  2. To start the Kubernetes dashboard, run kubectl proxy.
  3. To retrieve the access token (stored base64-encoded in the secret), run kubectl get secret -n kubernetes-dashboard $(kubectl get serviceaccount admin-user -n kubernetes-dashboard -o jsonpath="{.secrets[0].name}") -o jsonpath="{.data.token}" | base64 --decode.
  4. Copy the decoded token value.
  5. In a web browser on the control plane (for control_plane.yml) or the manager node (for omnia.yml), enter http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/.
  6. Select the authentication method as Token.
  7. On the Kubernetes Dashboard, paste the copied token and click Sign in to access the Kubernetes Dashboard.
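The command-line part of the steps above can be collected into a few shell functions. This is a sketch, not part of Omnia: the function names dashboard_url, dashboard_token, and start_dashboard are hypothetical, and port 8001 is assumed because it is the port that appears in the URL of step 5.

```shell
NAMESPACE="kubernetes-dashboard"

# Build the proxied dashboard URL from step 5 for a given local proxy port.
dashboard_url() {
  port="${1:-8001}"
  echo "http://localhost:${port}/api/v1/namespaces/${NAMESPACE}/services/https:kubernetes-dashboard:/proxy/"
}

# Step 3: print the decoded token of the admin-user service account.
dashboard_token() {
  secret_name=$(kubectl get serviceaccount admin-user -n "$NAMESPACE" \
    -o jsonpath="{.secrets[0].name}")
  kubectl get secret "$secret_name" -n "$NAMESPACE" \
    -o jsonpath="{.data.token}" | base64 --decode
}

# Steps 1 and 2: list the dashboard pods, then start the proxy in the background.
start_dashboard() {
  kubectl get pods --namespace "$NAMESPACE"
  kubectl proxy --port=8001 &
  echo "Dashboard URL: $(dashboard_url 8001)"
}
```

Run start_dashboard first, then dashboard_token, and paste the printed token into the sign-in page. On clusters where service accounts do not receive an automatically generated secret, the secret lookup in dashboard_token may return nothing.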

Access Kubeflow Dashboard

  1. Before accessing the Kubeflow Dashboard, run kubectl -n kubeflow get applications -o yaml profiles. Wait until profiles-deployment enters the Ready state.
  2. To retrieve the external IP or cluster IP, run kubectl get services istio-ingressgateway --namespace istio-system.
  3. In a web browser installed on the manager node, enter the external IP or cluster IP to open the Kubeflow Central Dashboard.
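Instead of reading the address from the table that kubectl prints in step 2, the IP can be extracted with a jsonpath query. The sketch below assumes a hypothetical function name, service_ip, and prefers the load balancer's external IP, falling back to the cluster IP when no external IP is assigned.

```shell
# Hypothetical helper: print the external IP of a service, or its cluster IP
# when no external IP has been assigned.
# Usage: service_ip <service-name> <namespace>
service_ip() {
  svc="$1"
  ns="$2"
  ip=$(kubectl get service "$svc" --namespace "$ns" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null)
  if [ -z "$ip" ]; then
    ip=$(kubectl get service "$svc" --namespace "$ns" \
      -o jsonpath='{.spec.clusterIP}')
  fi
  echo "$ip"
}
```

For example, service_ip istio-ingressgateway istio-system prints the address to enter in the browser in step 3.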

For more information about the Kubeflow Central Dashboard, see https://www.kubeflow.org/docs/components/central-dash/overview/.

Access JupyterHub Dashboard

  1. To verify if the JupyterHub services are running, run kubectl get pods --namespace jupyterhub.
  2. Ensure that the pod names starting with hub and proxy are in the Running state.
  3. To retrieve the external IP or cluster IP, run kubectl get services proxy-public --namespace jupyterhub.
  4. In a web browser installed on the manager node, enter the external IP or cluster IP to open the JupyterHub Dashboard.
  5. JupyterHub is running with a default dummy authenticator. Enter any username and password combination to access the dashboard.
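The check in steps 1 and 2 can be automated by filtering the pod listing. This sketch assumes the default column layout of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE); the function name hub_and_proxy_running is hypothetical.

```shell
# Hypothetical helper: read `kubectl get pods` output on standard input and
# succeed only if at least one pod whose name starts with "hub" or "proxy"
# is listed and every such pod reports the Running status.
hub_and_proxy_running() {
  awk 'NR > 1 && ($1 ~ /^hub/ || $1 ~ /^proxy/) {
         seen++
         if ($3 != "Running") bad++
       }
       END { exit ((seen > 0 && bad == 0) ? 0 : 1) }'
}
```

A typical call is kubectl get pods --namespace jupyterhub | hub_and_proxy_running && echo "JupyterHub is ready".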

For more information about configuring usernames and passwords for access to the JupyterHub Dashboard, see https://zero-to-jupyterhub.readthedocs.io/en/stable/jupyterhub/customization.html.

Access Prometheus UI

Prometheus is installed in one of two ways:

A. When Prometheus is installed as a Kubernetes role.

B. When Prometheus is installed on the host.

  1. Navigate to the Prometheus folder. The default path is /var/lib/prometheus-2.23.0.linux-amd64/.
  2. Start the web server: ./prometheus.
  3. To launch the Prometheus UI, in the web browser, enter http://localhost:9090.
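The host-based steps above can be scripted so that the UI address is reported only after the server responds. This is a sketch under stated assumptions: the function names prometheus_health_url and start_prometheus are hypothetical, the folder is the default path from step 1, and the check uses the /-/healthy endpoint that Prometheus exposes.

```shell
PROM_DIR="/var/lib/prometheus-2.23.0.linux-amd64"   # default path from step 1

# URL of the Prometheus health endpoint for a given host and port.
prometheus_health_url() {
  echo "http://${1:-localhost}:${2:-9090}/-/healthy"
}

# Steps 1 and 2: start the server in the background, then poll the health
# endpoint a few times before reporting the UI address from step 3.
start_prometheus() {
  ( cd "$PROM_DIR" && ./prometheus ) &
  for attempt in 1 2 3 4 5; do
    if curl -fsS "$(prometheus_health_url)" >/dev/null 2>&1; then
      echo "Prometheus UI is available at http://localhost:9090"
      return 0
    fi
    sleep 2
  done
  echo "Prometheus did not become healthy" >&2
  return 1
}
```

The server is started from inside its own folder because it looks for prometheus.yml in the current directory by default.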


Accessing Cluster metrics (fetched by Prometheus) on Grafana

Note: Both the control plane and HPC clusters can be monitored on these dashboards by toggling the datasource at the top of each dashboard.

Accessing Control Plane metrics (fetched by Prometheus) on Grafana

Prometheus DataSource

Note: Both the control plane and HPC clusters can be monitored on these dashboards by toggling the datasource at the top of each dashboard:

| Data Source | Description | Source |
|---|---|---|
| hpc-prometheus-manager-nodeIP | Manages the Kubernetes and Slurm Cluster on the Manager and Compute nodes. | This datasource is set up when omnia.yml is run. |
| control_plane_prometheus | Monitors the Single Node cluster running on the Control Plane. | This datasource is set up when control_plane.yml is run. |

Prometheus DataSource

| Type | Subtype | Dashboard Name | Available DataSources |
|---|---|---|---|
| | | CoreDNS | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Kubernetes | | API Types | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Kubernetes | Compute Resources | Cluster | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Kubernetes | Compute Resources | Namespace (Pods) | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Kubernetes | Compute Resources | Node (Pods) | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Kubernetes | Compute Resources | Pod | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Kubernetes | Compute Resources | Workload | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Kubernetes | | Kubelet | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Kubernetes | Networking | Cluster | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Kubernetes | Networking | Namespace (Pods) | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Kubernetes | Networking | Namespace (Workload) | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Kubernetes | Networking | Pod | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Kubernetes | Networking | Workload | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Kubernetes | | Scheduler | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Kubernetes | | Stateful Sets | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| | | Prometheus Overview | control-plane-prometheus, hpc-prometheus-manager-nodeIP |
| Slurm | | CPUs/GPUs, Jobs, Nodes, Scheduler | hpc-prometheus-manager-nodeIP |
| Slurm | | Node Exporter Server Metrics | hpc-prometheus-manager-nodeIP |