Issue
DKP 2.x is built on the ClusterAPI architecture, in which one Kubernetes cluster (the management cluster) manages another Kubernetes cluster (the workload cluster). This can cause resource utilization issues on the machine that hosts the management cluster, especially when the management cluster is provisioning a large number of nodes for the workload cluster.
Cause
When the management cluster provisions a workload cluster, it spins up one pod per node being provisioned, and each pod executes the Ansible playbook. With many nodes, this can push the host to its memory allocation limit, which may cause some pods to be terminated.
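As a rough illustration of why the limit is reached, the arithmetic can be sketched as follows. This is a minimal sketch, not a measurement: the function name and all memory figures (host memory, reserved memory, memory per provisioning pod) are hypothetical placeholders that you would replace with values observed in your own environment.

```python
def max_concurrent_provisioning_pods(host_memory_mib, reserved_mib, per_pod_mib):
    """Estimate how many provisioning pods fit in the host's free memory.

    All three inputs are assumptions supplied by the caller:
    host_memory_mib: total memory of the machine hosting the management cluster
    reserved_mib:    memory already used by the OS and management components
    per_pod_mib:     memory consumed by one provisioning pod
    """
    if per_pod_mib <= 0:
        raise ValueError("per_pod_mib must be positive")
    available = host_memory_mib - reserved_mib
    if available <= 0:
        return 0
    # One pod is started per node, so this is also the number of nodes
    # that can be provisioned at the same time.
    return available // per_pod_mib


def fits_on_host(node_count, host_memory_mib, reserved_mib, per_pod_mib):
    """Return True if provisioning node_count nodes at once stays within memory."""
    limit = max_concurrent_provisioning_pods(host_memory_mib, reserved_mib, per_pod_mib)
    return node_count <= limit


if __name__ == "__main__":
    # Hypothetical figures: 16 GiB host, 4 GiB reserved, 512 MiB per pod.
    limit = max_concurrent_provisioning_pods(16384, 4096, 512)
    print("Estimated concurrent provisioning pods:", limit)
    print("100 nodes at once fits:", fits_on_host(100, 16384, 4096, 512))
```

If the number of nodes you need exceeds the estimate, the host is likely to hit the condition described above.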
Solution / Workaround
Our Engineering team is continually working on making the CAP* provisioners more efficient.
Our suggested workaround at the moment is to leverage ClusterAPI's capability of pivoting the management cluster and/or scaling the machine deployment.
With ClusterAPI's pivot capability, you move the management cluster onto the workload cluster, which makes that cluster self-managed. This gives you several options for deploying a large number of nodes. Please check the suggested strategies below:
1. Create a small workload cluster, then pivot the management cluster to it. With the small cluster's 3 control plane (CP) and 4 worker nodes, provision the large number of nodes that you require.
2. ClusterAPI also provides a CRD called MachineDeployment, which manages Machine objects and can be scaled up or down. You can create a small self-managed cluster first, then scale the MachineDeployment to the desired number of replicas.
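The two strategies above can be sketched with the upstream ClusterAPI tooling. This is a minimal sketch under stated assumptions: it uses the upstream `clusterctl move` and `kubectl scale` commands rather than any provider-specific DKP workflow, and the kubeconfig path, namespace, and MachineDeployment name are hypothetical placeholders. The helpers only build and print the commands so they can be reviewed before anything is executed.

```python
import shlex


def pivot_command(target_kubeconfig, namespace="default"):
    """Build a `clusterctl move` command that pivots ClusterAPI objects
    from the current management cluster to the cluster behind target_kubeconfig."""
    return [
        "clusterctl", "move",
        "--to-kubeconfig", target_kubeconfig,
        "--namespace", namespace,
    ]


def scale_command(machine_deployment, replicas, kubeconfig, namespace="default"):
    """Build a `kubectl scale` command for a MachineDeployment.

    kubeconfig should point at the cluster that currently holds the
    ClusterAPI objects (the self-managed cluster after a pivot).
    """
    if replicas < 0:
        raise ValueError("replicas must not be negative")
    return [
        "kubectl",
        "--kubeconfig", kubeconfig,
        "--namespace", namespace,
        "scale", "machinedeployment", machine_deployment,
        f"--replicas={replicas}",
    ]


if __name__ == "__main__":
    # Hypothetical names: adjust to your own cluster before running anything.
    kubeconfig = "workload.conf"
    print(shlex.join(pivot_command(kubeconfig)))
    print(shlex.join(scale_command("my-cluster-md-0", 50, kubeconfig)))
```

Printing the commands instead of running them keeps the sketch safe to try; the exact steps for your infrastructure provider may differ.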
With either of the 2 strategies mentioned, you are essentially spreading the provisioning workload across multiple nodes, instead of placing it all on the single node that hosts the management cluster when it is not pivoted.
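If you also want to limit how many nodes are provisioned at the same time, one possible approach is to raise the replica count in batches rather than in a single step. This is an assumption layered on top of the strategies above, not a documented DKP procedure; the function name and the batch size are hypothetical.

```python
def scale_steps(current, target, batch_size):
    """Return the intermediate replica counts when scaling from current
    to target in increments of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if target < current:
        raise ValueError("this sketch only covers scaling up")
    steps = []
    replicas = current
    while replicas < target:
        # Never overshoot the requested target.
        replicas = min(replicas + batch_size, target)
        steps.append(replicas)
    return steps


if __name__ == "__main__":
    # Hypothetical example: grow from 4 workers to 50 in batches of 20.
    for replicas in scale_steps(4, 50, 20):
        print("scale MachineDeployment to", replicas, "replicas, then wait for the nodes to become ready")
```

Each value would be applied as the replica count of the MachineDeployment, waiting for the new nodes to become ready before moving to the next step.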
Note: The information above is only a high level overview of the strategies and does not include the details for each provider. Please see the specifics of the provider being used when implementing the recommendation.