Issue
DKP uses the Cluster API architecture, in which one Kubernetes cluster (the management cluster) manages other Kubernetes clusters. This can cause resource utilization issues on the machine hosting the management cluster, especially when the management cluster is provisioning a large number of nodes for a workload cluster.
Cause
When the management cluster provisions a workload cluster, it spins up one pod per node being provisioned, and each pod executes the Ansible playbook. This can cause the host to reach its memory allocation limit, which may result in some pods being terminated.
Solution / Workaround
Our Engineering team is continually working on making the CAP* provisioners more efficient. In the meantime, our suggested workaround is to leverage Cluster API's ability to pivot the management cluster and/or to scale the machine deployment.
With Cluster API's pivot capability, you move the management of the cluster's lifecycle into the workload cluster itself, making it self-managed. This gives you several options for deploying a large number of nodes. Consider the strategies below:
1. Create a small workload cluster, then make it self-managed. Once the cluster is self-managed, provision the additional nodes that you require.
2. Cluster API also provides a MachineDeployment custom resource, which owns the cluster's Machine objects and can be scaled up or down. You can create a small self-managed cluster first, then scale the MachineDeployment to the desired number of replicas.
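As a rough sketch of strategy 2, a MachineDeployment can be scaled with `kubectl scale` once the cluster is self-managed. The cluster name, MachineDeployment name, namespace, and replica count below are hypothetical placeholders; list your own MachineDeployments first to find the real names:

```shell
# List the MachineDeployments in the cluster's namespace
# (namespace "default" is an assumption; adjust to your setup).
kubectl get machinedeployments -n default

# Scale the MachineDeployment ("my-cluster-md-0" is a placeholder name)
# to the desired number of worker replicas.
kubectl scale machinedeployment my-cluster-md-0 --replicas=20 -n default

# Watch the new Machines come up.
kubectl get machines -n default -w
```

Because the cluster is self-managed at this point, the provisioning pods created for the new replicas are scheduled across the workload cluster's own nodes rather than on a single host.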
With the two strategies above, you spread the provisioning workload across multiple nodes instead of the single node that hosts the management cluster when it is not pivoted.
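The pivot itself is performed with the upstream Cluster API tooling; a minimal sketch using `clusterctl move` is shown below. The kubeconfig file names and namespace are assumptions for illustration, and DKP's own CLI may wrap this step, so consult the DKP documentation for the exact command for your version:

```shell
# Pivot: move the Cluster API resources (Cluster, Machines,
# MachineDeployments, etc.) from the current management cluster
# into the workload cluster, making it self-managed.
# "management.kubeconfig" and "workload.kubeconfig" are placeholder paths.
clusterctl move \
  --kubeconfig management.kubeconfig \
  --to-kubeconfig workload.kubeconfig \
  --namespace default
```

After the move completes, lifecycle operations (such as scaling node counts) are reconciled by the controllers running inside the workload cluster itself.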
Note: The information above is only a high-level overview of the strategies and does not include provider-specific details. Consult the documentation for the provider you are using when implementing these recommendations.