System Operation Failures
Alert Group | Alert Name | Description | Alert Processing Algorithm
---|---|---|---
blackbox.rules | EndpointDown | The endpoint (for example, a web service or an LDAP server) is unavailable | Go to the endpoint node, use the logs to determine the cause of the failure, and restore the service.
blackbox.rules | SSLCertExpiringSoon | The SSL certificate expires in N days | Reissue the certificate or contact the issuer (see the certificate-check sketch below the table).
clickhouse.rules | ClickhouseInsertRateLow | Low data insertion rate into ClickHouse | Check the average insertion rate on the graphs, verify the ClickHouse status, and check the chwriter service.
clickhouse.rules | DiskSpacePredictionCH | Disk space usage in ClickHouse will exceed N% in 3 days | Clean up old data in ClickHouse using chpolicy. |
consul.rules | ConsulServicesCountDecrease | More than N% of service processes are not running | Investigate why services are no longer registered in Consul. |
infra.rules | CPUUsageHigh | CPU usage exceeds N% | Determine the cause of the increased consumption and take action if necessary.
infra.rules | MemoryUsageHigh | Memory usage exceeds N% | Determine the cause of the increased consumption and take action if necessary.
infra.rules | SwapUsageHigh | Swap usage exceeds N% | Determine the cause of the increased consumption and take action if necessary.
infra.rules | LoadAverageHigh | High load average | Determine the cause of the increased load and take action if necessary.
infra.rules | DiskSpaceUsage | Disk space usage exceeds N% | Depending on the node type, take the following actions: 1) for MongoDB, unload archival collections, run compaction, and, as a last resort, initiate a data transfer; 2) for PostgreSQL, check the log sizes (log rotation, log cleaning); 3) for ClickHouse, clean up old data using chpolicy (see the partition-size sketch below the table); 4) otherwise, investigate further.
infra.rules | DiskInodesUsageHigh | Inode usage exceeds N% | Investigate the cause.
infra.rules | SystemReboot | System rebooted | Determine the reason for the reboot. |
liftbridge.rules | CorrelatorQueueTooLarge | Message queue for the correlator service exceeds N | Check on the graphs how long the situation has persisted, review the correlator logs, and restart the service if necessary.
liftbridge.rules | ClassifierQueueTooLarge | Message queue for the classifier service exceeds N | Check on the graphs how long the situation has persisted, review the classifier logs, and restart the service if necessary.
liftbridge.rules | UncommitedMessagesTooMuch | Not all messages are replicated to all replicas according to the ISR number | Check the logs of all Liftbridge services; one of the cluster nodes may be unavailable, or the Liftbridge migrations may not have completed.
liftbridge.rules | StreamInvertedValue | Cursor shift occurred | Fix by recreating the stream. |
mongo.rules | MongoClasterServerCountChange | The number of members in the MongoDB cluster has decreased | Determine why the member is unavailable and restore its availability.
mongo.rules | MongoConnectionLow | No open connections to MongoDB | Determine why there are no active connections. |
mongo.rules | MongoReplicationLag | MongoDB replication lag | Check the logs and run a resync if necessary (see the replication-lag sketch below the table).
noc.rules | FMNoEscalations | No incidents are being created in the external system | Check the escalator logs and service status, and review the incident graphs.
noc.rules | FmTooManyAlerts | High percentage of incidents | Check the situation on the graphs.
noc.rules | LateTasksOnPool | Polling task execution delay | Check the graphs and the activator logs.
noc.rules | LateTasksScheduler | Scheduler task queue overload | Check the scheduler logs and review the incident graphs. If there are no incidents, the cause may be slow database response times.
noc.rules | HighTracesPerSecond | High trace generation rate from the activator | Check the activator logs; the cause is likely hardware profile errors.
noc.rules | HighTracesPerSecond | High trace generation rate from a non-activator service | Check the service logs; the cause may be database unavailability.
postgres.rules | PostgresqlDeadlocksHigh | Deadlocks detected in PostgreSQL | If the situation repeats frequently, look for the cause in the PostgreSQL logs.
postgres.rules | PostgresqlBackendsLow | The number of free PostgreSQL connections is low | Install pgbouncer, increase the number of threads, and check for PostgreSQL connections that never terminate.
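
For SSLCertExpiringSoon, a quick way to confirm how many days remain before the certificate expires is to read it directly from the endpoint. The sketch below is a minimal, hedged example; the hostname and port are placeholders, not values from this document.

```python
# Minimal sketch: report how many days remain before an endpoint's TLS
# certificate expires. HOST and PORT are assumed placeholders.
import socket
import ssl
from datetime import datetime, timezone

HOST = "example.com"  # substitute the alerting target
PORT = 443

context = ssl.create_default_context()
with socket.create_connection((HOST, PORT), timeout=5) as sock:
    with context.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()

# 'notAfter' uses the format "Jun  1 12:00:00 2025 GMT"
not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
not_after = not_after.replace(tzinfo=timezone.utc)
days_left = (not_after - datetime.now(timezone.utc)).days
print(f"{HOST}: certificate expires in {days_left} days")
```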
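For DiskSpacePredictionCH and the ClickHouse branch of DiskSpaceUsage, it helps to see which partitions occupy the most disk space before deciding what chpolicy should clean. The sketch below queries the ClickHouse system.parts table over the HTTP interface; the host, port, and lack of authentication are assumptions, and chpolicy itself is configured separately in NOC.

```python
# Minimal sketch: list the largest active ClickHouse partitions.
# Assumes the default HTTP interface on localhost:8123 with no auth.
import urllib.parse
import urllib.request

QUERY = """
SELECT database, table, partition, sum(bytes_on_disk) AS bytes
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY bytes DESC
LIMIT 20
FORMAT TabSeparated
"""

url = "http://localhost:8123/?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url, timeout=10) as resp:
    for line in resp.read().decode().splitlines():
        database, table, partition, size = line.split("\t")
        print(f"{database}.{table} partition {partition}: {int(size) / 1e9:.1f} GB")
```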
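For MongoReplicationLag, the lag per member can be checked from the replica-set status before deciding whether a resync is needed. The sketch below uses pymongo's `replSetGetStatus` command; the connection URI is a placeholder for the actual cluster address.

```python
# Minimal sketch: show replication lag per replica-set member.
# The connection URI is an assumed placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("replSetGetStatus")

primary_optime = next(
    m["optimeDate"] for m in status["members"] if m["stateStr"] == "PRIMARY"
)
for member in status["members"]:
    lag = (primary_optime - member["optimeDate"]).total_seconds()
    print(f'{member["name"]} ({member["stateStr"]}): lag {lag:.0f}s')
```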