In DKP 2.4, we transitioned from MinIO to Ceph for cluster storage. We have identified an issue with the Ceph application deployment that will prevent you from standing up a new cluster on-premises, and we have remediation steps for you below.
You must add at least 40 GB of raw storage to each worker node in your cluster. For example, you can attach a new 40 GB virtual hard disk to each worker node VM in your hypervisor; the OS should pick it up immediately.
You can use the lsblk tool with the -f flag to confirm that you have added this storage correctly:
[twindebank@wka1 proc]$ lsblk -f
NAME FSTYPE LABEL UUID MOUNTPOINT
sda
├─sda1 xfs a55feb32-3e73-4ac3-96cd-4bd788f7088a /boot
└─sda2 LVM2_member G5ulZJ-CNZk-yE2p-qIj4-FrMk-7ZlO-TPWxaR
├─centos-root xfs a1daa026-9179-4793-aa6b-b42963753798 /
└─centos-home xfs 50a00d22-73ee-4973-bd1e-ca1ad5f8e403 /home
sdb xfs 67e584b7-ff10-4ac3-b837-748d2db06051 /var/lib/kubelet/pods/b6bec7ab-91d6-4869-ae31-bac042a56eeb/volumes/kubernetes.io~local-volume/local-pv-6ef33f00
sdc xfs 9e2d3881-e770-4ab9-ade1-a7cb76e023f0 /mnt/disks/9e2d3881-e770-4ab9-ade1-a7cb76e023f0
sdd xfs 1f35c84c-2604-4993-a799-217236d2f73a /mnt/disks/1f35c84c-2604-4993-a799-217236d2f73a
sde
sr0
Note that the example above has three 150 GB volumes (sdb, sdc, sdd) mounted for use as Persistent Volumes on this worker node, in addition to the raw storage, which shows as sde in the output. You can easily differentiate formatted from raw storage by checking for an FSTYPE entry: raw storage will not have one.
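If you have many worker nodes, you can script this check; a minimal sketch using the same lsblk output, where the awk filter simply keeps whole-disk devices that report no FSTYPE. (Caveat: devices such as sr0 CD-ROM drives also report no FSTYPE, so sanity-check the list.)

```shell
# Print top-level block devices with no filesystem signature
# (candidate raw disks for Ceph). -d: whole disks only, -n: no header.
lsblk -dn -o NAME,FSTYPE | awk 'NF == 1 { print $1 }'
```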
After ensuring each of your worker nodes has a 40 GB raw disk attached, you can deploy DKP as normal. When you are ready to install Kommander, you must generate a config file, modify it, and then apply the modified config to your Kommander installation:
1. Generate the config file:
./dkp install kommander --init > kommander.yaml
2. Modify the rook-ceph-cluster section:
rook-ceph-cluster:
  enabled: true
  values: |
    cephClusterSpec:
      storage:
        storageClassDeviceSets: []
        useAllDevices: true
        useAllNodes: true
3. Apply the file during installation:
./dkp install kommander --installer-config kommander.yaml
You can also edit the CephCluster object directly if you do not want to re-run the Kommander installation at this time. In that case, still update your kommander.yaml file to reflect the necessary changes so they are not lost the next time you upgrade:
1. Get the CephCluster Object:
kubectl get cephcluster -n kommander
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL
dkp-ceph-cluster /var/lib/rook 3 73m Ready Cluster created successfully HEALTH_OK
2. Edit the CephCluster object to include the same useAllDevices: true and useAllNodes: true values we added to kommander.yaml:
kubectl edit cephcluster dkp-ceph-cluster -n kommander
The storage section of the object should then look like this:
storage:
  storageClassDeviceSets:
  - count: 4
    name: rook-ceph-osd-set1
    placement:
      topologySpreadConstraints:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - rook-ceph-osd
            - rook-ceph-osd-prepare
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - rook-ceph-osd
            - rook-ceph-osd-prepare
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
    portable: true
    resources: {}
    volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 40Gi
        volumeMode: Block
      status: {}
  useAllDevices: true
  useAllNodes: true
3. Check the Jobs in the Kommander namespace to ensure that there is now a new job created for every worker in your cluster:
kubectl get job -n kommander
NAME COMPLETIONS DURATION AGE
check-dkp-ceph-crd 1/1 8m6s 54m
check-dkp-loki 0/1 2m6s 2m6s
check-dkp-velero 0/1 2m6s 2m6s
delete-node-exporter-daemonset 1/1 16s 55m
delete-prometheus-adapter-deployment 1/1 16s 47m
dex-grpc-certs 1/1 19s 57m
rook-ceph-osd-prepare-rook-ceph-osd-set1-data-05dq5m 0/1 39m 39m
rook-ceph-osd-prepare-rook-ceph-osd-set1-data-1xl9nc 0/1 39m 39m
rook-ceph-osd-prepare-rook-ceph-osd-set1-data-2xhvs7 0/1 39m 39m
rook-ceph-osd-prepare-rook-ceph-osd-set1-data-3grcdz 0/1 39m 39m
rook-ceph-osd-prepare-wka1.daclusta 1/1 16s 26s
rook-ceph-osd-prepare-wka2.daclusta 1/1 13s 25s
rook-ceph-osd-prepare-wka3.daclusta 0/1 25s 25s
rook-ceph-osd-prepare-wka4.daclusta 1/1 16s 25s
rook-ceph-osd-prepare-wka5.daclusta 1/1 11s 24s
rook-ceph-osd-prepare-wka6.daclusta 1/1 12s 24s
rook-ceph-osd-prepare-wka7.daclusta 0/1 23s 23s
rook-ceph-osd-prepare-wka8.daclusta 0/1 23s 23s
You can see that we now have 8 rook-ceph-osd-prepare-<node-name> jobs running, one per worker, which means our modifications to the CephCluster object were successful.
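Rather than re-running kubectl get job to poll, you can block until the prepare jobs finish; a minimal sketch, assuming Rook labels these jobs with app=rook-ceph-osd-prepare:

```shell
# Wait up to 10 minutes for every osd-prepare job to complete.
# Label assumption: Rook tags these jobs with app=rook-ceph-osd-prepare.
selector="app=rook-ceph-osd-prepare"
if ! kubectl -n kommander wait --for=condition=complete job \
     --selector="$selector" --timeout=600s; then
  echo "some osd-prepare jobs did not complete in time" >&2
fi
```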
4. Check the status of your Kommander installation with kubectl get hr -A. If you do not apply the fix above quickly enough, some of your Helm releases may have timed out and will not retry installation:
object-bucket-claims
velero
grafana-loki
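A quick way to surface only the stalled releases is to filter on the READY column; a sketch assuming a recent Flux CRD layout, where READY is the fourth column of kubectl get hr -A:

```shell
# List HelmReleases whose READY column is anything other than "True".
kubectl get hr -A --no-headers | awk '$4 != "True"'
```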
You can fix this in one of several ways:
1. Patch the Helm Releases, toggling spec.suspend to true and then back to false, to cause them to retry:
kubectl -n kommander patch helmrelease velero --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": true}]'
kubectl -n kommander patch helmrelease velero --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'
kubectl -n kommander patch helmrelease object-bucket-claims --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": true}]'
kubectl -n kommander patch helmrelease object-bucket-claims --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'
kubectl -n kommander patch helmrelease grafana-loki --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": true}]'
kubectl -n kommander patch helmrelease grafana-loki --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'
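The six commands above are the same patch applied to three releases, so they can be collapsed into a loop; a sketch assuming a POSIX shell:

```shell
# Cycle spec.suspend true -> false on each stuck HelmRelease so that
# Flux retries the install. Same JSON patch as the commands above.
for hr in velero object-bucket-claims grafana-loki; do
  for value in true false; do
    kubectl -n kommander patch helmrelease "$hr" --type='json' \
      -p="[{\"op\": \"replace\", \"path\": \"/spec/suspend\", \"value\": $value}]"
  done
done
```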
2. Delete the Helm Release objects in question and then retry:
kubectl delete hr -n kommander object-bucket-claims velero grafana-loki
3. Patch the Helm Releases at install time to have infinite retries instead of 30:
This can only be done ahead of time and cannot be applied to an existing Kommander Cluster.
Clone the Applications Repository for your specific Kommander version:
git clone https://github.com/mesosphere/kommander-applications.git --branch v2.4.0
Locate the Helm Release for the specific application under /kommander-applications/services/ and edit the Helm Release object to set retries to -1:
install:
  crds: CreateReplace
  remediation:
    retries: -1
upgrade:
  crds: CreateReplace
  remediation:
    retries: -1
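If you would rather not edit each service directory by hand, you can rewrite every default retry count in the clone in one pass; a sketch assuming GNU sed/xargs and that the manifests spell the shipped default exactly as retries: 30 (as noted above):

```shell
# Replace the default retry count with -1 (infinite) across all
# manifests in the cloned applications repo that contain it.
grep -rl 'retries: 30' kommander-applications/services/ \
  | xargs -r sed -i 's/retries: 30/retries: -1/g'
```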
Install Kommander specifying the Kommander applications repository directory containing your changes:
./dkp install kommander --installer-config kommander.yaml --kommander-applications-repository ./kommander-applications