TL;DR
This issue is acknowledged and work is in progress to resolve it:
- Create a design document for the controller #181
- Extended status check for reconciliation #207
- Add a helper set library to facilitate implementation of the new health check procedure #252
- Separate ConfigMap templating and creation #259
- Add utility functions for new status check #260
- Scale Up / Scale Down functionality #265
Please read further to understand the background reasoning.
Summary
The status field of the EtcdCluster is self-referential and is not determined by the status of an actual etcd cluster.
Background
Kubernetes controllers are level-based. This generally means that a controller should be able to determine its status from the state of the surrounding environment. It is generally considered a bad idea to read the status of the root object (in this case, the EtcdCluster CR).
Remember, status should be able to be reconstituted from the state of the world, so it’s generally not a good idea to read from the status of the root object. Instead, you should reconstruct it every run.
Kubebuilder book — Implementing a controller
Controllers also do not distinguish between the types of events that trigger reconciliation (whether a resource was created, updated, deleted, and so on). Trying to fight this restriction may cause complications further down the line. For example, implementing statuses that describe the lifecycle of a resource and then relying on them to select the controller's mode of operation can lead to new failure modes which need to be handled in different ways, growing the complexity of the codebase.
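To make the distinction concrete, here is a minimal sketch of a level-based reconcile loop that recomputes status from observed state on every run instead of branching on the previously stored status. It is illustrative only: the API import path, the Status.Conditions and Spec.Replicas fields, and the assumption that the StatefulSet shares the EtcdCluster's name are not taken from the operator's actual code.

```go
package controller

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	etcdv1alpha1 "github.com/aenix-io/etcd-operator/api/v1alpha1" // assumed import path
)

type EtcdClusterReconciler struct {
	client.Client
}

// Reconcile rebuilds the EtcdCluster status from what it observes on every run,
// never from the status it wrote previously.
func (r *EtcdClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var cluster etcdv1alpha1.EtcdCluster
	if err := r.Get(ctx, req.NamespacedName, &cluster); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Observe the world: here only the StatefulSet; ideally also the etcd members themselves.
	var sts appsv1.StatefulSet
	err := r.Get(ctx, req.NamespacedName, &sts)
	if err != nil && !apierrors.IsNotFound(err) {
		return ctrl.Result{}, err
	}
	stsReady := err == nil && sts.Status.ReadyReplicas == *cluster.Spec.Replicas

	// ... create or update owned resources from spec + observations,
	// without consulting the previously written cluster.Status ...

	// Recompute status from scratch instead of reading it back.
	ready := metav1.ConditionFalse
	if stsReady {
		ready = metav1.ConditionTrue
	}
	meta.SetStatusCondition(&cluster.Status.Conditions, metav1.Condition{
		Type:               "Ready",
		Status:             ready,
		Reason:             "ObservedFromStatefulSet",
		ObservedGeneration: cluster.Generation,
	})
	return ctrl.Result{}, r.Status().Update(ctx, &cluster)
}
```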
Issue
In its current state, the controller relies on an EtcdCluster's status field to determine the next steps in the reconciliation logic. If the controller reads an incorrect status without actually validating it, the etcd cluster will fail to launch. A contrived example can reproduce the issue as follows.
First, apply the following YAML:
```yaml
---
apiVersion: etcd.aenix.io/v1alpha1
kind: EtcdCluster
metadata:
  name: test
  namespace: default
spec:
  replicas: 3
  podTemplate:
    spec:
      containers:
        - name: etcd
          image: "nginx"
          args: ["-c", "sleep 3600"]
          command: ["/bin/sh"]
          livenessProbe:
            httpGet:
              host: ifconfig.me
              path: /
              port: 80
              scheme: HTTP
          readinessProbe:
            httpGet:
              host: ifconfig.me
              path: /
              port: 80
              scheme: HTTP
          startupProbe:
            httpGet:
              host: ifconfig.me
              path: /
              port: 80
              scheme: HTTP
```
This object will generate a StatefulSet that launches and passes all readiness checks, but it will not in fact build a working cluster; the three containers will simply run sleep 3600. The cluster-state ConfigMap will nevertheless set the value of ETCD_INITIAL_CLUSTER_STATE to existing.
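For illustration, the resulting cluster-state ConfigMap then looks roughly like the sketch below; the ConfigMap name and the exact set of keys are assumptions, not copied from the controller's output.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-cluster-state   # illustrative name
  namespace: default
data:
  # Should still be "new": no etcd cluster was ever actually bootstrapped.
  ETCD_INITIAL_CLUSTER_STATE: existing
```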
Next, attempt to fix the problem by applying a correct manifest:
```yaml
---
apiVersion: etcd.aenix.io/v1alpha1
kind: EtcdCluster
metadata:
  name: test
  namespace: default
spec:
  replicas: 3
```
The StatefulSet's pod spec will be updated to values that can launch an etcd cluster, but since the initial cluster state is erroneously set to "existing", a cluster will never be bootstrapped.
Analysis
In my opinion, the root cause of this problem is that no attempt is made to verify the "actual" status of the etcd cluster. Although the logic of the controller is currently quite simple and one must craft rather contrived bad inputs to break it, the examples do show that a failed transaction here or there, such as a dropped packet or a crashed node, can leave an EtcdCluster (the custom resource) in an unrecoverable state.
Since communication with the etcd cluster will have to be implemented in any case, for instance to support scaling up and down, I think it is prudent to start work on such a feature now and use it to determine the status of the cluster.
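As a rough sketch of what such a check could look like, the snippet below uses the official etcd Go client (go.etcd.io/etcd/client/v3) to probe the cluster before any status is written. The helper name, the endpoints argument, and the timeouts are assumptions for illustration, not part of the operator today.

```go
package controller

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// probeEtcdCluster is a hypothetical helper: it connects to the cluster's client
// endpoints and reports whether etcd is actually serving requests, so the
// controller can derive the EtcdCluster status from the real cluster rather
// than from its own previously written status field.
func probeEtcdCluster(ctx context.Context, endpoints []string) (bool, error) {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return false, fmt.Errorf("connecting to etcd: %w", err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	// A linearizable read only succeeds when the cluster has quorum, which
	// distinguishes "the pods are Ready" from "etcd is actually working".
	if _, err := cli.Get(ctx, "health-probe"); err != nil {
		return false, nil
	}

	// The member list shows how many members have actually joined the cluster.
	members, err := cli.MemberList(ctx)
	if err != nil {
		return false, nil
	}
	return len(members.Members) > 0, nil
}
```

The reconciler could then run such a probe on every pass and derive both the EtcdCluster status and the value written into the cluster-state ConfigMap from the result.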