Resolution: If control_plane.yml has run, a version file is created at /opt/omnia/omnia_version.
| File name | Purpose | Associated Variable (base_vars.yml) | Format | Sample File Path |
|-----------|---------|-------------------------------------|--------|------------------|
| Host mapping | Mapping file listing all devices (barring iDRAC) and provisioned hosts for DHCP configurations | host_mapping_file_path | xx:yy:zz:aa:bb,server,172.17.0.5 | omnia/examples/host_mapping_file_os_provisioning.csv |
| Management mapping file | Mapping file listing iDRACs for DHCP configurations | mgmt_mapping_file_path | xx:yy:zz:aa:bb,172.17.0.5 | omnia/examples/mapping_device_file.csv |
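For reference, a minimal host mapping file following the Format column above might look like this (the MAC addresses, hostnames, and IPs are placeholders; the full sample ships at omnia/examples/host_mapping_file_os_provisioning.csv):
cat > host_mapping_file.csv << 'EOF'
aa:bb:cc:dd:ee:01,compute01,172.17.0.10
aa:bb:cc:dd:ee:02,compute02,172.17.0.11
EOF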
Why does mounting the NFS share fail with No route to host?
Potential Cause: A mismatch between the shares listed in /etc/exports and the entries in omnia_config.yml under nfs_client_params.
Potential Cause:
The required services are not running on the control plane. Verify the service status using:
systemctl status sssd-kcm.socket
systemctl status sssd.service
Resolution:
systemctl start sssd-kcm.socket
systemctl start sssd.service
Re-run control_plane.yml using the tags init and security: ansible-playbook control_plane.yml --tags init,security
Why does the gather facts task get stuck when re-running omnia.yml?
Potential Cause: Corrupted entries in the /root/.ansible/cp/ folder. For more information on this issue, check this out!
Resolution: Clear the directory /root/.ansible/cp/
using the following commands:
cd /root/.ansible/cp/
rm -rf *
Alternatively, run the task manually:
cd omnia/tools
ansible-playbook gather_facts_resolution.yml
Why does the task fail with Failed to mount NFS client. Make sure NFS Server is running on IP xx.xx.xx.xx?
Potential Cause: The required NFS services are blocked by the firewall on the NFS server.
Resolution: Allow each required service through the firewall using firewall-cmd --permanent --add-service=<service name> and then reload the firewall using firewall-cmd --reload.
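As an illustration, assuming the standard firewalld service names for NFS (nfs, rpc-bind, and mountd), the commands might look like this:
firewall-cmd --permanent --add-service=nfs
firewall-cmd --permanent --add-service=rpc-bind
firewall-cmd --permanent --add-service=mountd
firewall-cmd --reload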
Why does configure_device_cli fail when awx_web_support is set to true in base_vars.yml?
Potential Cause: CLI templates require that AWX be disabled when they are deployed.
Resolution: Set awx_web_support to false when deploying configure_device_cli.
What to do when omnia.yml fails with nfs-server.service might not be running on NFS Server. Please check or start services?
Potential Cause: nfs-server.service is not running on the target node.
Resolution: Use the following commands to bring up the service:
systemctl start nfs-server.service
systemctl enable nfs-server.service
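As an optional check (not part of the original steps), you can confirm the server is exporting shares once the service is up:
systemctl status nfs-server.service        # should report active (running)
showmount -e <NFS server IP>               # lists the exported shares visible to clients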
Potential Cause: Temporary network glitches may cause a loss of information.
Resolution: Re-run the playbook collect_node_info.yml
to repopulate the data. Use the command ansible-playbook collect_node_info.yml
to run the playbook.
Why does collect_node_info.yml fail for some targets?
Potential Cause: Invalid credentials provided in login_vars.yml.
Resolution: Update the credentials in login_vars.yml if the target is a server.
Why do Kubernetes pods show ImagePullBack or ErrPullImage errors in their status?
Potential Cause:
* The errors occur when the Docker pull limit is exceeded.
Resolution:
* For omnia.yml and control_plane.yml: Provide the Docker username and password for the Docker Hub account in the omnia_config.yml file and execute the playbook.
* For HPC clusters, during omnia.yml execution, a Kubernetes secret 'dockerregcred' is created in the default namespace and patched to the service account. Users need to copy this secret into their own namespace while deploying custom applications and reference it as imagePullSecrets in the YAML file to avoid ErrImagePull (see the sketch after the note below). Click here for more info
Note: If the playbook is already executed and the pods are in ImagePullBack state, then run
kubeadm reset -f
in all the nodes before re-executing the playbook with the docker credentials.
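A minimal sketch of reusing the dockerregcred secret in another namespace (the namespace myapp and the spec snippet are hypothetical; only the secret name and the default namespace come from the steps above):
kubectl create namespace myapp                             # hypothetical target namespace
# Copy the secret created by omnia.yml from the default namespace into that namespace
kubectl get secret dockerregcred -n default -o yaml | sed 's/namespace: default/namespace: myapp/' | kubectl apply -f -
# Then reference it in the pod/deployment spec so images pull with the Docker Hub credentials:
#   spec:
#     imagePullSecrets:
#       - name: dockerregcred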
The connection to the server head_node_ip:port was refused - did you specify the right host or port?
On the control plane or the manager node, run the following commands:
* swapoff -a
* systemctl restart kubelet
To execute Omnia via the CLI by default, disable AWX (set awx_web_support in base_vars.yml to false).
What to do if control_plane.yml fails at the webui_awx stage?
In the webui_awx/files directory, delete the .tower_cli.cfg and .tower_vault_key files, and then re-run control_plane.yml.
Why does a task fail with no ipv4_secondaries present?
Potential Cause: If a shared LOM environment is in use, the management network/host network NIC may only have one IP assigned to it.
Resolution: Ensure that the NIC used for host and data connections has 2 IPs assigned to it.
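As an illustration (the connection name eno1 and the addresses are placeholders), a second IP can be added to the shared NIC with nmcli:
nmcli connection modify eno1 +ipv4.addresses 172.17.0.10/24
nmcli connection up eno1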
Why does Monitoring of Job - device_inventory_job aborted due to timeout happen?
Potential Cause:
This error occurs by design. There is a mismatch between the AWX version (20.0.0) and the AWX galaxy collection version (19.4.0) used by the control plane. At the time of design (Omnia 1.2.1), these were the latest available versions of AWX and the AWX galaxy collection. This will be fixed in later code releases.
Note: This failure does not stop the execution of other tasks. Check the AWX log to verify that the script has run successfully.
This error is known to Red Hat and is being addressed here. Red Hat has offered a user intervention here. Omnia recommends using any OS other than RHEL 8.3 in the event of this failure.
Why can't AWX job templates be run when awx_web_support is false in base_vars.yml?
As a prerequisite to running AWX job templates, AWX should be enabled by setting awx_web_support to true in base_vars.yml.
Potential Cause:
The provided device credentials may be invalid.
Resolution :
Manually validate/update the relevant login information on the AWX settings screen.
Why aren't dhcp.leases and mgmt_provisioned_hosts.yml updated in the Device Inventory Job/iDRAC inventory during control_plane.yml execution?
Potential Cause:
Certain IPs may not update in AWX immediately because the device may be assigned an IP previously and the DHCP lease has not expired.
Resolution:
Wait for the DHCP lease for the relevant device to expire or restart the switch/device to clear the lease.
Why are some hosts missing from the host list when control_plane.yml is run?
Hosts that are not in DHCP mode do not get populated in the host list when control_plane.yml is run.
Failure in talking to yum: Cannot find a valid baseurl for repo: base/7/x86_64.
Potential Cause:
There are connections missing on the NFS node.
Resolution:
Ensure that there are 3 NICs being used on the NFS node:
1. For provisioning the OS
2. For connecting to the internet (Management purposes)
3. For connecting to PowerVault (Data Connection)
What to do if the InfiniBand NIC does not come up?
1. Run ifup <InfiniBand NIC> to bring up the NIC.
2. If the NIC still does not come up, run zypper install -n rdma-core librdmacm1 libibmad5 libibumad3 infiniband-diags to install the IB NIC drivers. (If the drivers do not install smoothly, reboot the server to apply the required changes.)
3. Run service network status to verify that wicked.service is running.
4. Verify that the configuration file for the NIC is present in /etc/sysconfig/network.
5. Run ifup <InfiniBand NIC>.
6. Alternatively, re-run omnia.yml to activate the NIC.
What to do when a job fails with Error creating pod: container failed to start, ImagePullBackOff?
Potential Cause:
After running control_plane.yml, the AWX image got deleted due to space considerations (use df -h to diagnose the issue).
Resolution:
Delete unnecessary files from the partition and then run the following commands:
1. cd omnia/control_plane/roles/webui_awx/files
2. buildah bud -t custom-awx-ee awx_ee.yml
Potential Cause:
Lack of space in the root partition (/) causes Linux to clear files automatically (use df -h to diagnose the issue).
Resolution:
Delete large, unused files to clear the root partition (run find / -xdev -size +5M | xargs ls -lh | sort -n -k5 to identify these files). Before running Omnia Control Plane, it is recommended to have a minimum of 50% free space in the root partition. Once space has been freed, run kubeadm reset -f and re-run control_plane.yml.
Potential Cause:
The device name and connection name listed by the network manager in /etc/sysconfig/network-scripts/ifcfg-<nic name>
do not match.
Resolution:
Run nmcli connection to list all available connections and their attributes, and edit /etc/sysconfig/network-scripts/ifcfg-<nic name> using the vi editor so that the device and connection names match.
No. Before re-deploying the cluster, users have to manually delete all hosts from the AWX UI.
What to do if control_plane.yml fails?
Resolution:
Wait for the AWX UI to be accessible at http://<management-station-IP>:8081, and then run the control_plane.yml file again, where management-station-IP is the IP address of the management node.
Why does the task control_plane_common: Assert Value of idrac_support if mngmt_network container needed fail?
When device_config_support is set to true, idrac_support also needs to be set to true.
Why does the idrac.yml template hang during the import SCP file task on certain target nodes?
Potential Causes:
Resolution:
Why does idrac.yml fail for certain target nodes?
Potential Causes:
Resolution:
What to do if Kubernetes services are not up after the cluster reboots?
Wait for 15 minutes after the Kubernetes cluster reboots. Next, verify the status of the cluster using the following commands:
* kubectl get nodes on the manager node to get the real-time k8s cluster status.
* kubectl get pods --all-namespaces on the manager node to check which pods are in the Running state.
* kubectl cluster-info on the manager node to verify that both the k8s master and kubeDNS are in the Running state.
Run kubectl get pods --all-namespaces to verify that all pods are in the Running state. If a pod is stuck, delete it using kubectl delete pods <name of pod> and re-run omnia.yml, jupyterhub.yml, or kubeflow.yml.
Run the command kubectl get pods --namespace default to ensure the nfs-client pod and all Prometheus server pods are in the Running state.
Why does control_plane.yml fail during the Run import command task?
Cause:
The mounted .iso file is corrupt.
Resolution:
1. Check /var/log/cobbler/cobbler.log to view the error.
2. If the error message is repo verification failed, the .iso file is not mounted properly.
3. Verify that the downloaded .iso file is valid and correct.
4. Delete the Cobbler container using docker rm -f cobbler and re-run control_plane.yml.
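As a quick sanity check before re-running the playbook (the ISO path is a placeholder), you can verify the image and its mount:
sha256sum /path/to/os-image.iso      # compare against the checksum published for that release
mount | grep /mnt/iso                # confirm the ISO is actually mounted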
To enable routing, update the primary_dns and secondary_dns fields in base_vars.yml with the appropriate IPs (hostnames are currently not supported). For compute nodes that are not directly connected to the internet (i.e., only the host network is configured), this configuration allows for internet connectivity.
Potential Causes:
The target compute node does not have a configured PXE device with an active NIC.
Resolution:
1. Create a Non-RAID or virtual disk on the server.
2. Check if any system other than the management node is running cobblerd. If yes, stop the Cobbler container using the following commands: docker rm -f cobbler and docker image rm -f cobbler.
3. On the server, go to BIOS Setup -> Network Settings -> PXE Device. For each listed device (typically 4), configure an active NIC under PXE device settings.
Restart the following services on the manager node:
systemctl restart slurmdbd
systemctl restart slurmctld
systemctl restart prometheus-slurm-exporter
Run systemctl status slurmd to check the service on all the compute nodes.
Potential Cause: slurm.conf is not configured properly.
Recommended Actions:
Run slurmdbd -Dvvv and slurmctld -Dvvv, and check the /var/lib/log/slurmctld.log file for more information.
Cause: The Slurm database connection fails.
Recommended Actions:
Run slurmdbd -Dvvv and slurmctld -Dvvv, and check the /var/lib/log/slurmctld.log file.
Check netstat -antp | grep LISTEN for PIDs in the listening state.
If needed, restart the services:
systemctl restart slurmctld on the manager node
systemctl restart slurmdbd on the manager node
systemctl restart slurmd on the compute nodes
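As an additional check, assuming the default Slurm ports (6817 for slurmctld and 6819 for slurmdbd), confirm the daemons are listening after the restart:
ss -tlnp | grep -E ':6817|:6819'     # slurmctld / slurmdbd listening sockets
scontrol ping                        # reports whether the primary slurmctld is UP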
Potential Cause: The host network is faulty, causing DNS to be unresponsive.
Resolution:
1. Run kubeadm reset -f on all the nodes.
2. Edit the omnia_config.yml file to change the Kubernetes Pod Network CIDR. The suggested IP range is 192.168.0.0/16. Ensure that the IP provided is not in use on your host network.
3. Run ansible-playbook omnia.yml --skip-tags slurm to re-deploy Kubernetes.
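If the cluster nodes are already listed in an Ansible inventory, step 1 can be run across all of them at once (the inventory file name below is a placeholder):
ansible all -i inventory -b -m command -a "kubeadm reset -f"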
Potential Cause: Unstable or slow Internet connectivity.
Resolution:
Change the Kubernetes CNI from calico to flannel in omnia_config.yml and re-run the playbook.
Run kubectl rollout restart deployment awx -n awx from the control plane and try to re-run the job.
If the above solution doesn't work, delete the following and then re-run control_plane.yml:
* /var/nfs_awx
* omnia/control_plane/roles/webui_awx/files/.tower_cli.cfg
.Potential Cause: The directory being used by the client as a mount point is already in use by a different NFS export.
Resolution: Verify that the directory being used as a mount point is empty by using cd <client share path> | ls or mount | grep <client share path>. If empty, re-run the playbook.
Run kubectl get pods --all-namespaces and check /var/log/omnia/startup_omnia/startup_omnia_yyyy-mm-dd-HHMMSS.log for more information.
Verify the values of provision_os and iso_file_path in base_vars.yml. Re-run control_plane.yml with different values for provision_os and iso_file_path to restore the profiles.
If device_config_support is set to TRUE, a reboot is required; if device_config_support is set to FALSE, no reboots are required.
Why does a PermissionError occur while running the idrac.yml file or other .yml files from AWX?
Potential Cause: The "PermissionError: [Errno 13] Permission denied" error is displayed if you have used the ansible-vault decrypt or encrypt commands.
Resolution:
chmod 664 <filename>.yml
It is recommended to use the ansible-vault view or edit commands rather than the ansible-vault decrypt or encrypt commands.
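For example (using omnia_config.yml as the file), viewing or editing a vaulted file without breaking its permissions might look like this:
ansible-vault view omnia_config.yml      # prompts for the vault password, read-only
ansible-vault edit omnia_config.yml      # edits in place; encryption is preserved on save
chmod 664 omnia_config.yml               # restore permissions if a decrypt/encrypt cycle broke them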
racadm getremoteservicesstatus
Error: The specified disk is not available. - Unavailable disk (0.x) in disk range '0.x-x'
Run show disks to verify the availability of the specified disk.
Why is the error You cannot create a linear disk group when a virtual disk group exists on the system displayed?
At any given time, only one type of disk group can be created on the system. That is, all disk groups on the system have to exclusively be linear or virtual. To fix the issue, either delete the existing disk group or change the type of pool you are creating.
Provisioning a server using the BOSS controller is supported as of Omnia 1.2.1.
Potential Cause: Older firmware version in PowerEdge servers. Omnia supports only iDRAC 8 based Dell EMC PowerEdge Servers with firmware versions 2.75.75.75 and above and iDRAC 9 based Dell EMC PowerEdge Servers with Firmware versions 4.40.40.00 and above.
Delete the following:
/var/nfs_awx
/<project name>/control_plane/roles/webui_awx/files/.tower_cli.cfg
Once complete, it's safe to re-run control_plane.yml.
Potential Cause: The control_plane playbook does not support hostnames with an underscore in them, such as 'mgmt_station'.
As defined in RFC 822, the only legal characters are the following:
* Alphanumeric (a-z and 0-9): Both uppercase and lowercase letters are acceptable, and the hostname is case-insensitive. In other words, dvader.empire.gov is identical to DVADER.EMPIRE.GOV and Dvader.Empire.Gov.
* Hyphen (-): Neither the first nor the last character in a hostname field should be a hyphen.
* Period (.): The period should be used only to delimit fields in a hostname (e.g., dvader.empire.gov).
Potential Cause: Your Docker pull limit has been exceeded. For more information, click here.
Resolution: Delete the existing deployment using helm delete jupyterhub -n jupyterhub, then re-run jupyterhub.yml.
Potential Cause: Your Docker pull limit has been exceeded. For more information, click here.
Resolution: Delete the existing deployment using kfctl delete -V -f /root/k8s/omnia-kubeflow/kfctl_k8s_istio.v1.0.2.yaml, then re-run kubeflow.yml.
No. During Cobbler-based deployment, only one OS is supported at a time. If the user would like to deploy both, deploy one first, unmount /mnt/iso, and then re-run Cobbler for the second OS.
Due to the latest catalog.xml file, firmware updates may fail for certain components. Omnia execution doesn't get interrupted, but an error gets logged on AWX. For now, download those individual updates manually.
Why does infiniband.yml fail for a new InfiniBand switch?
To configure a new InfiniBand switch, the HTTP and JSON gateways must be enabled. To verify that they are enabled, run:
show web
(To check if HTTP is enabled)
show json-gw
(To check if JSON Gateway is enabled)
To correct the issue, run:
web http enable
(To enable the HTTP gateway)
json-gw enable
(To enable the JSON gateway)
Why does the BeeGFS-client service fail?
Potential Causes:
* SELinux may be enabled (run sestatus to diagnose the issue).
* The required BeeGFS services may not be running on the server (run systemctl status beegfs-mgmtd, systemctl status beegfs-meta, systemctl status beegfs-storage to diagnose the issue).
Resolution:
If SELinux is enabled, disable it in /etc/sysconfig/selinux and reboot the server. For more information, check /var/log/beegfs-client.log.
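For the SELinux cause above, a minimal sketch of the fix (this permanently disables SELinux; confirm that fits your security policy before applying it):
sestatus                                                   # confirm SELinux is currently enforcing
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/sysconfig/selinux
reboot
cat /var/log/beegfs-client.log                             # re-check the client log after the reboot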
How many NICs are configured by idrac.yml?
Up to 4 active NICs can be configured by idrac.yml. Past the first 4 NICs, all NICs will be ignored.
While the NIC may qualify as active, it may not qualify as a PXE device NIC (it may be a Mellanox NIC). In such a situation, Omnia assumes that PXE device settings are already configured and proceeds to attempt a PXE boot.
If this is not the case, manually configure a PXE device NIC and re-run idrac.yml
to proceed.
Why does control_plane.yml fail with 'Error: kinit: Connection refused while getting default ccache' while completing the control plane security role?
Run systemctl start sssd-kcm.socket and re-run control_plane.yml.
Potential Causes: Required repositories may not be enabled by your Red Hat subscription.
Resolution: Enable all required repositories via your Red Hat subscription.
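As an illustration (the repository IDs below are typical RHEL 8 names; the exact IDs depend on your release and subscription), repositories are enabled with subscription-manager:
subscription-manager repos --list-enabled                              # see what is already enabled
subscription-manager repos --enable=rhel-8-for-x86_64-baseos-rpms
subscription-manager repos --enable=rhel-8-for-x86_64-appstream-rpms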
Potential Cause:
The hostnames of the manager and login nodes are not set in the correct format.
Resolution:
If you have enabled the option to install the login node in the cluster, set the hostnames of the nodes in the format hostname.domainname. For example, manager.omnia.test is a valid hostname for the login node. Note: To find the cause of a FreeIPA server or client installation failure, check /var/log/ipaserver-install.log on the manager node or /var/log/ipaclient-install.log on the login node.
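For example (login.omnia.test is a hypothetical login node name; manager.omnia.test comes from the text above), the hostnames can be set with hostnamectl:
hostnamectl set-hostname manager.omnia.test      # on the manager node
hostnamectl set-hostname login.omnia.test        # on the login node
hostname -f                                      # verify the fully qualified name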
Potential Cause: The network config file for the public NIC on the control plane does not define any DNS entries.
Resolution: Ensure the fields DNS1 and DNS2 are updated appropriately in the file /etc/sysconfig/network-scripts/ifcfg-<NIC name>.
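A minimal sketch of the relevant lines (the NIC name eno2 and the DNS server IPs are placeholders):
# /etc/sysconfig/network-scripts/ifcfg-eno2
DNS1=192.168.1.1
DNS2=8.8.8.8
# After editing, reload and re-activate the connection, e.g.:
nmcli connection reload && nmcli connection up eno2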