SRE Maintenance

From Delft Solutions
Revision as of 08:11, 30 May 2024 by Dortund (talk | contribs)

Access to hosts

When an employee needs root access to a host, either for client work or SRE maintenance, and they are not the First Responder for that week, they can ask the First Responder to add them and the hosts to the sla-temporary-access groups. Once the work is completed, the employee must inform the First Responder so that they and the hosts can be removed from the groups again.

Notes

Sometimes the host has an older cache of IPA (either because the employee logged in shortly before they and the host were added, or for other reasons). The cache can be cleared using the following commands:

  1. apt install sssd-tools
  2. sss_cache -E

The first command installs the necessary package if it is not already present. The second command clears the cache. If the employee still cannot gain root access after this, a restart of the sssd service might help:

  • service sssd restart (WARNING: this is a dangerous operation and can leave even the First Responder without access. Use with caution and only as a last resort.)
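
The cache-clearing steps above can be wrapped in a small shell function. This is only a sketch: the function names and the DRY_RUN variable are illustrative and not part of any existing tooling, and it assumes a Debian-style host where apt is available and the script is run as root. It deliberately leaves out the sssd restart, since that is a last resort.

```shell
#!/bin/sh
# Sketch: clear the SSSD cache on a host (run as root).
# DRY_RUN is a hypothetical helper variable: when set to 1, commands
# are printed instead of executed.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

refresh_sssd_cache() {
    # Install sssd-tools only when the sss_cache binary is missing.
    if ! command -v sss_cache >/dev/null 2>&1; then
        run apt install -y sssd-tools
    fi
    # -E invalidates all cached entries.
    run sss_cache -E
}
```

Setting DRY_RUN=1 before calling refresh_sssd_cache shows what would be executed without touching the host, which is a cheap check before running it for real.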

For further debugging of any problems, the following commands might prove useful:

  • Check sudo rights for a user: 'sudo -l -U <user>'
  • Check which IPA groups a user is part of according to a host: 'groups <user>', and/or 'getent group sla-temporary-access'
  • Check which IPA groups the host is part of according to a host: 'getent netgroup sla-temporary-access'
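
The debugging commands above can be collected into one function, so that the complete picture for a user is printed in one go. This is a minimal sketch: the function name check_temp_access is made up for illustration, and the group name defaults to sla-temporary-access as used on this page.

```shell
#!/bin/sh
# Sketch: print the access-related state for a user as seen by this host.
# Usage: check_temp_access <user> [group]

check_temp_access() {
    cta_user="$1"
    cta_group="${2:-sla-temporary-access}"

    echo "== sudo rights for $cta_user =="
    sudo -l -U "$cta_user"

    echo "== groups of $cta_user according to this host =="
    groups "$cta_user"

    echo "== members of group $cta_group =="
    getent group "$cta_group"

    echo "== hosts in netgroup $cta_group =="
    getent netgroup "$cta_group"
}
```

If the user or the host is missing from the last two outputs while the First Responder has already added them, the host is probably still working from a stale cache.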

Maintenance Periods

If you are working on a host and you expect some triggers to fire, you need to put the host in a maintenance period. Do note that you should:

  • Acknowledge triggers that fired for the host while it was under maintenance. Include in the acknowledgement message that this was during a maintenance period.
  • Keep maintenance periods scoped to as few hosts/triggers as possible. For example, for maintenance on Ceph where we don't expect critical triggers to fire, restrict the maintenance period to non-critical triggers by setting the tag 'ceph_health' to equal the value 'warning'.
  • Remove the maintenance period after the work is completed.