Each OVS node runs a heartbeat service. Every 10 seconds the service tries to take a polling lock in Arakoon.
(See the Google doc linked below for the if/else layout.)
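The polling behaviour can be illustrated with a minimal sketch. It assumes the Arakoon lock can be modelled as any object with a non-blocking `acquire()` and a `release()`; here a `threading.Lock` stands in for it. The names `heartbeat_loop`, `do_check` and `POLL_INTERVAL` are hypothetical and not part of the framework.

```python
import threading
import time

POLL_INTERVAL = 10  # seconds between attempts, as stated in the description


def heartbeat_loop(lock, do_check, iterations, sleep=time.sleep):
    """Try to take `lock` once per interval and run `do_check` only when it is acquired.

    `lock` is a stand-in for the Arakoon polling lock (assumption: any object
    with acquire(blocking=False) and release()).
    Returns the number of times the check actually ran.
    """
    runs = 0
    for _ in range(iterations):
        if lock.acquire(blocking=False):
            try:
                do_check()
                runs += 1
            finally:
                lock.release()
        # If the lock was not acquired, another node is doing the check.
        sleep(POLL_INTERVAL)
    return runs
```

Passing `sleep` as a parameter keeps the loop testable without real delays; a real service would loop indefinitely rather than for a fixed number of iterations.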
The controller-heartbeat, a small systemd service, executes the following:
- Take the controller-heartbeat lock.
- If it can take the lock:
  - Execute a GET on /alba/nodes/ through the API to get the local summary of each ASD node (only real ASD nodes; filter out the dual controller nodes and the ASD nodes that are part of the global backend).
    `client.get('/alba/nodes/', {'contents' : 'local_summary,type'})`
  - Loop over each albanode:
    - If nr of OSDs in error > nr of OSDs in (OK, warning):
      - Take the failover--lock.
      - If it can take the failover lock:
        - Release the controller-heartbeat lock (so another node can check other nodes).
        - Check if ASD manager B is online by executing a GET on the root of the ASD API.
          - In case there is no second ASD Manager defined, release the failover lock.
          - If the response status code != 200, repeat the check every 5 seconds, 3 times. If it continues to fail, release the failover lock.
          - If the response status code = 200, kill the controller with the failing disks through the IPMI extension.
            `ipmitool -I lanplus -H <IPMIP> -U <username> -P <password> chassis power off`
        - Check every 5 seconds through the IPMI extension (`ipmitool -I lanplus -H <IPMIP> -U <username> -P <password> chassis status`); if (System Power != off) after 30 seconds, release the failover lock.
        - For each OSD, set the node id to empty. If it cannot be set to empty, leave the old node id.
        - For each OSD, call OSD_move on the API of the passive ASD manager. The API is called OSD per OSD (serially).
        - If all OSDs are moved, release the failover--lock.
      - Else do nothing, as someone else is already doing the failover.
    - Else nothing to do, as not enough disks are in error.
- Else nothing to do, as another node is doing the check.
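Two pieces of the flow above lend themselves to a short sketch: the decision whether a node needs a failover, and the "check every N seconds, give up after M tries" pattern used for both the ASD manager check and the power-off check. This is only an illustration. The state strings `'error'`, `'ok'` and `'warning'` and the helper names `needs_failover` and `wait_until` are assumptions; the actual layout of `local_summary` is not shown in this issue.

```python
import time


def needs_failover(osd_states):
    """Return True when more OSDs are in error than in OK or warning state.

    `osd_states` is a list of state strings (hypothetical representation).
    """
    errors = sum(1 for state in osd_states if state == 'error')
    healthy = sum(1 for state in osd_states if state in ('ok', 'warning'))
    return errors > healthy


def wait_until(probe, attempts, interval, sleep=time.sleep):
    """Call `probe` up to `attempts` times, `interval` seconds apart.

    Returns True as soon as `probe()` is truthy, False if every attempt fails.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

Under these assumptions, the ASD manager check would be `wait_until(probe, attempts=3, interval=5)` with a probe that returns whether the status code is 200, and the power-off check would use a probe that inspects the `chassis status` output, with enough attempts to cover 30 seconds.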
OSD_move
- The OSD move function is a new API call on the ASD manager; its input is osd_id. It executes the following actions:
  - Check if the node id is empty. If it is not empty, stop with the error "OSD still owned by ASD Manager"; else continue.
  - Update the node ID to transfer the ASD to the other ASD manager:
    `AlbaCLI.run(command='update-osd', config=config_location, named_params={'long-id': osd_id, 'node_id': })`
    For reference, the existing call in framework-alba-plugin/ovs/lib/alba.py (line 142 in e274a3d):
    `AlbaCLI.run(command='update-osd', config=config_location, named_params={'long-id': osd_id, 'ip': ','.join(ips)})`
    - If this fails, log an error and leave the OSD assigned to the old ASD manager.
  - Update the OSD to update the IP in ALBA, see framework-alba-plugin/ovs/lib/alba.py (line 83 in e274a3d):
    `def update_osds(osds, alba_node_guid):`
    - If this fails, log an error and leave the old IPs.
  - Start the ASD process.
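The OSD_move steps could be sketched as follows. This is not the framework's implementation: the OSD is modelled as a plain dict, and `update_node_id`, `update_ips` and `start_asd` are hypothetical hooks standing in for the real ALBA and ASD manager calls. The extra parameters are there only to keep the sketch self-contained, since the proposed API takes just `osd_id`. It also assumes that a failed node ID update stops the move, while a failed IP update is logged and the move continues; the issue does not state this explicitly.

```python
import logging

logger = logging.getLogger(__name__)


class OSDStillOwnedError(Exception):
    """Raised when the OSD's node id has not been cleared yet."""


def osd_move(osd, new_node_id, new_ips, update_node_id, update_ips, start_asd):
    """Move one OSD to a new ASD manager.

    `osd` is a dict with 'osd_id', 'node_id' and 'ips' keys (hypothetical model).
    Returns True if the OSD was transferred, False if the node id update failed.
    """
    osd_id = osd['osd_id']
    # Step 1: refuse to move an OSD that is still claimed by a manager.
    if osd['node_id']:
        raise OSDStillOwnedError('OSD {0} still owned by ASD Manager'.format(osd_id))
    # Step 2: transfer ownership; on failure, log and stop (assumption).
    try:
        update_node_id(osd_id, new_node_id)
    except Exception:
        logger.exception('Could not update node id of OSD %s', osd_id)
        return False
    osd['node_id'] = new_node_id
    # Step 3: update the IPs in ALBA; on failure, log and keep the old IPs.
    try:
        update_ips(osd_id, new_ips)
        osd['ips'] = new_ips
    except Exception:
        logger.exception('Could not update IPs of OSD %s', osd_id)
    # Step 4: start the ASD process.
    start_asd(osd_id)
    return True
```

Because the heartbeat calls OSD_move serially, one OSD at a time, a caller can stop or continue on the first `False` result depending on how partial moves should be handled.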
https://docs.google.com/document/d/1Jzptv2gkq7xbnStq9r93fHsT8Yfw8qo4_1hYyzfilqc