[Dual Controller] heartbeat

Each OVS node runs a heartbeat service. Every 10 seconds the service tries to take a polling lock in Arakoon.
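
A minimal sketch of that polling loop, assuming a non-blocking lock. The `try_lock`, `unlock` and `run_check` callables are hypothetical stand-ins (the actual Arakoon lock API is not shown in this issue), and `unlock` is assumed to be safe to call even if the check already released the lock early.

```python
import time
from typing import Callable

HEARTBEAT_INTERVAL = 10  # seconds, as stated in the description


def heartbeat_loop(try_lock: Callable[[], bool],
                   unlock: Callable[[], None],
                   run_check: Callable[[], None],
                   rounds: int,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Try the polling lock once per interval; return how often the check ran."""
    ran = 0
    for _ in range(rounds):
        if try_lock():  # hypothetical non-blocking lock attempt
            try:
                run_check()
                ran += 1
            finally:
                unlock()
        # If the lock was not taken, another node is doing the check.
        sleep(HEARTBEAT_INTERVAL)
    return ran
```

Passing `sleep` as a parameter keeps the loop testable without waiting for real time to pass.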

(see google doc for if/else layout below)

The controller-heartbeat, a small systemd service, executes the following:
- Take the controller-heartbeat lock
- If it can take the lock
  - Call `GET /alba/nodes/` through the API to get the local summary of each ASD node (only real ASD nodes; filter out the dual controller nodes and the ASD nodes that are part of the global backend).
    `client.get('/alba/nodes/', {'contents' : 'local_summary,type'})`
  - Loop over each albanode
    - If nr of OSDs in error > nr of OSDs in (OK, warning)
      - Take the failover-<node-id>-lock
      - If it can take the fail-over lock
        - Release the controller-heartbeat lock (so another node can check other nodes)
        - Check if ASD manager B is online by executing a GET on the root of the ASD API.
        - In case there is no second ASD Manager defined, release the fail-over lock.
        - If the response status code != 200, repeat the check every 5 seconds, 3 times. If it continues to fail, release the fail-over lock.
        - If the response status code = 200, kill the controller with the failing disks through the IPMI extension: `ipmitool -I lanplus -H <IPMIP> -U <username> -P <password> chassis power off`
        - Check every 5 seconds through the IPMI extension (`ipmitool -I lanplus -H <IPMIP> -U <username> -P <password> chassis status`). If (System Power != off) after 30 seconds, release the fail-over lock.
        - For each OSD, set the node id to empty. If it cannot be set to empty, leave the old node id.
        - For each OSD, call OSD_move on the API of the passive ASD manager. The API is called OSD per OSD (serially).
        - If all OSDs are moved, release the failover-<node-id>-lock
      - Else do nothing, as someone else is doing the failover already
    - Else nothing to do, as not enough disks are in error
- Else nothing to do, as another node is doing the check.
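
The failover trigger and the two timed checks in the flow above can be sketched as small helpers. This is an illustration only: the shape of the `local_summary` data (a dict with `error`, `ok` and `warning` counts) is an assumption, and `needs_failover` and `poll_until` are hypothetical names, not part of the framework.

```python
import time
from typing import Callable, Dict


def needs_failover(osd_states: Dict[str, int]) -> bool:
    """True when more OSDs are in error than in OK and warning combined."""
    healthy = osd_states.get('ok', 0) + osd_states.get('warning', 0)
    return osd_states.get('error', 0) > healthy


def poll_until(predicate: Callable[[], bool],
               attempts: int,
               delay: float,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call predicate up to `attempts` times, waiting `delay` seconds between tries."""
    for attempt in range(attempts):
        if predicate():
            return True
        if attempt < attempts - 1:  # no need to wait after the last try
            sleep(delay)
    return False
```

Under these assumptions, the ASD manager check could be `poll_until(manager_is_online, attempts=3, delay=5)` and the power-off verification `poll_until(power_is_off, attempts=6, delay=5)`, where both predicates are placeholders for the HTTP and `ipmitool` calls. The attempt counts are one reading of "3 times" and "30 sec". In both cases a `False` result means the fail-over lock is released.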

**OSD_move**
- The OSD move function is a new API call on the ASD manager; its input is osd_id. It executes the following actions:
  - Check if the node id is empty. If it is not empty, stop with the error `OSD still owned by ASD Manager <hostname of manager>`; else continue.
  - Node ID update to transfer the ASD to the other ASD manager. https://github.com/openvstorage/framework-alba-plugin/blob/e274a3d29ea8a0b73314c98ffd31c1209041eeac/ovs/lib/alba.py#L142 , `AlbaCLI.run(command='update-osd', config=config_location, named_params={'long-id': osd_id, 'node_id': <ASD manager id>})`
    - If this fails, log an error and leave the OSD assigned to the old ASD manager.
  - OSD update to update the IP in ALBA https://github.com/openvstorage/framework-alba-plugin/blob/e274a3d29ea8a0b73314c98ffd31c1209041eeac/ovs/lib/alba.py#L83
    - If this fails, log an error and leave the old IPs.
  - Start the ASD process
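
The OSD_move steps could look roughly like the sketch below. All names here are hypothetical: the three callables stand in for the `update-osd` call, the IP update in ALBA and the ASD process start. Two control-flow choices are assumptions that the description leaves open: the move stops when the node ID update fails, and it continues with the old IPs when the IP update fails.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger('osd_move')


class OSDStillOwnedError(Exception):
    """Raised when the OSD still has a node id assigned."""


def osd_move(osd_id: str,
             current_node_id: Optional[str],
             owner_hostname: str,
             update_node_id: Callable[[str], None],
             update_ips: Callable[[str], None],
             start_asd: Callable[[str], None]) -> bool:
    # Step 1: refuse to move an OSD that is still owned by a manager.
    if current_node_id:
        raise OSDStillOwnedError(
            'OSD still owned by ASD Manager {0}'.format(owner_hostname))
    # Step 2: transfer ownership; on failure, log and stop (assumption).
    try:
        update_node_id(osd_id)
    except Exception:
        logger.exception('Node ID update failed for OSD %s', osd_id)
        return False
    # Step 3: update the IPs; on failure, log and keep the old IPs (assumption).
    try:
        update_ips(osd_id)
    except Exception:
        logger.exception('IP update failed for OSD %s, keeping old IPs', osd_id)
    # Step 4: start the ASD process on this manager.
    start_asd(osd_id)
    return True
```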

https://docs.google.com/document/d/1Jzptv2gkq7xbnStq9r93fHsT8Yfw8qo4_1hYyzfilqc

[Dual Controller] heartbeat #573
