[Dual Controller] heartbeat #573

@wimpers

Each OVS node runs a heartbeat service. Every 10 seconds the service tries to take a polling lock in Arakoon.
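A minimal sketch of that polling loop, for illustration only. `InMemoryLock` is a hypothetical stand-in for the Arakoon-backed lock (the real lock API is not shown in this issue), and `heartbeat_loop`, `rounds` and `sleep` are names made up here so the loop can be exercised without waiting. The sketch holds the lock for the whole check, which is a simplification of the flow described in this issue.

```python
import time


class InMemoryLock:
    """Hypothetical stand-in for an Arakoon-backed lock."""

    def __init__(self):
        self._owner = None

    def try_acquire(self, owner):
        # Non-blocking: succeeds only when nobody holds the lock.
        if self._owner is None:
            self._owner = owner
            return True
        return False

    def release(self, owner):
        # Only the current holder can release the lock.
        if self._owner == owner:
            self._owner = None


def heartbeat_loop(lock, node_id, check, interval=10.0, rounds=None, sleep=time.sleep):
    """Try to take the lock every `interval` seconds and run `check` on success.

    `rounds` limits the number of attempts (None means run forever).
    Returns how many times `check` was executed.
    """
    executed = 0
    attempts = 0
    while rounds is None or attempts < rounds:
        if lock.try_acquire(node_id):
            try:
                check()
                executed += 1
            finally:
                lock.release(node_id)
        # Otherwise another node is doing the check; nothing to do this round.
        attempts += 1
        if rounds is None or attempts < rounds:
            sleep(interval)
    return executed
```

Passing `sleep=lambda seconds: None` and a small `rounds` value makes the loop easy to test; a real service would keep the defaults.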

(See the Google doc linked below for the if/else layout.)

The controller-heartbeat, a small systemd service, executes the following:

  • Take the controller-heartbeat lock.
  • If it can take the lock:
    • Through the API, execute a GET on /alba/nodes/ to get the local summary of each ASD node (only real ASD nodes; filter out the dual controller nodes and the ASD nodes that are part of the global backend).
      client.get('/alba/nodes/', {'contents' : 'local_summary,type'})
    • Loop over each albanode:
      • If nr of OSDs in error > nr of OSDs in (OK, warning):
        • Take the failover lock.
        • If it can take the failover lock:
          • Release the controller-heartbeat lock (so another node can check other nodes).
          • Check if ASD manager B is online by executing a GET on the root of the ASD API.
          • In case there is no second ASD manager defined, release the failover lock.
          • If the response status code != 200, repeat the check every 5 secs, 3 times. If it continues to fail, release the failover lock.
          • If the response status code = 200, kill the controller with the failing disks through the IPMI extension: ipmitool -I lanplus -H <IPMIP> -U <username> -P <password> chassis power off
          • Check every 5 secs through the IPMI extension (ipmitool -I lanplus -H <IPMIP> -U <username> -P <password> chassis status) and if (System Power != off) after 30 secs, release the failover lock.
          • For each OSD, set the node id to empty. If it cannot be set to empty, leave the old node id.
          • For each OSD, call OSD_move on the API of the passive ASD manager. The API is called OSD per OSD (serially).
          • If all OSDs are moved, release the failover lock.
        • Else do nothing, as someone else is already doing the failover.
      • Else nothing to do, as not enough disks are in error.
  • Else nothing to do, as another node is doing the check.
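The decision points in the flow above can be sketched as small helpers. This is a hedged illustration, not the actual implementation: the function names are hypothetical, the `summary` dict with `error`/`ok`/`warning` counts is an assumed shape (the real `local_summary` layout is not shown in this issue), and "repeat 3 times" is read here as one initial check plus up to three repeats. The only external fact relied on is that `ipmitool ... chassis status` prints a `System Power : on|off` line.

```python
import time


def needs_failover(summary):
    """True when OSDs in error outnumber OSDs in OK or warning state.

    `summary` is an assumed dict of counts, e.g. {"error": 3, "ok": 1, "warning": 0}.
    """
    healthy = summary.get("ok", 0) + summary.get("warning", 0)
    return summary.get("error", 0) > healthy


def passive_manager_online(probe, retries=3, delay=5.0, sleep=time.sleep):
    """Call `probe()` (returns an HTTP status code) until it returns 200.

    Does one initial check plus up to `retries` repeats, `delay` seconds apart.
    """
    for attempt in range(retries + 1):
        if probe() == 200:
            return True
        if attempt < retries:
            sleep(delay)
    return False


def power_is_off(chassis_status_output):
    """Parse the 'System Power' line of `ipmitool chassis status` output."""
    for line in chassis_status_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "System Power":
            return value.strip().lower() == "off"
    # No 'System Power' line found: treat the state as unknown, not off.
    return False


def wait_for_power_off(read_status, interval=5.0, timeout=30.0, sleep=time.sleep):
    """Poll `read_status()` every `interval` seconds for up to `timeout` seconds."""
    waited = 0.0
    while True:
        if power_is_off(read_status()):
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

When `passive_manager_online` or `wait_for_power_off` returns `False`, the caller would release the failover lock, as the steps above describe. The `probe` and `read_status` callables are injected so that the HTTP call and the `ipmitool` invocation stay outside the sketch.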

OSD_move

  • The OSD move function is a new API call on the ASD manager; its input is osd_id. It executes the following actions:
  • Check if the node id is empty. If it is not empty, stop with the error "OSD still owned by ASD Manager"; else:
  • Update the node ID to transfer the ASD to the other ASD manager.
    AlbaCLI.run(command='update-osd', config=config_location, named_params={'long-id': osd_id, 'ip': ','.join(ips)})
    AlbaCLI.run(command='update-osd', config=config_location, named_params={'long-id': osd_id, 'node_id': <node_id>})
  • If this fails, log an error and leave the OSD assigned to the old ASD manager.
  • Update the OSD to update the IP in ALBA.
    def update_osds(osds, alba_node_guid):
  • If this fails, log an error and leave the old IPs.
  • Start the ASD process.
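The OSD_move steps above could look roughly like the following sketch. Everything here is hypothetical: the `osd` dict, `OSDStillOwnedError`, and the injected `update_node_id`, `update_ips` and `start_asd` callables stand in for the real ASD manager and `AlbaCLI` calls. It also assumes that a failed node ID update aborts the move, while a failed IP update is only logged and the ASD is still started; the issue text does not state this explicitly.

```python
import logging

logger = logging.getLogger("osd_move")


class OSDStillOwnedError(Exception):
    """Raised when the OSD still has a node id, i.e. another ASD manager owns it."""


def osd_move(osd, new_node_id, new_ips, update_node_id, update_ips, start_asd):
    """Move one OSD to this ASD manager. Returns True when the OSD was taken over.

    `osd` is an assumed dict with the keys "osd_id", "node_id" and "ips".
    """
    if osd.get("node_id"):
        raise OSDStillOwnedError("OSD still owned by ASD Manager")

    # Step 1: claim the OSD by setting the new node id.
    try:
        update_node_id(osd["osd_id"], new_node_id)
    except Exception:
        logger.exception("Node id update failed for OSD %s", osd["osd_id"])
        return False  # the OSD stays with the old ASD manager
    osd["node_id"] = new_node_id

    # Step 2: publish the new IPs; on failure keep the old ones.
    try:
        update_ips(osd["osd_id"], new_ips)
    except Exception:
        logger.exception("IP update failed for OSD %s", osd["osd_id"])
    else:
        osd["ips"] = list(new_ips)

    # Step 3: start the ASD process for this OSD.
    start_asd(osd["osd_id"])
    return True
```

The caller in the failover flow would invoke this once per OSD, serially, and release the failover lock when every call has returned `True`.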

https://docs.google.com/document/d/1Jzptv2gkq7xbnStq9r93fHsT8Yfw8qo4_1hYyzfilqc
