
System Operation Failures

| Alert Group | Alert Name | Description | Alert Processing Algorithm |
| --- | --- | --- | --- |
| blackbox.rules | EndpointDown | The endpoint is unavailable (for example, a web service or LDAP server) | Go to the endpoint node, use the logs to determine the cause of the failure, and restore the service. |
| blackbox.rules | SSLCertExpiringSoon | The SSL certificate expires in (number of days) | Reissue the certificate or contact the issuer. |
| clickhouse.rules | ClickhouseInsertRateLow | Low data insertion rate into ClickHouse | Check the average insertion rate on the graphs, verify ClickHouse status, and check chwriter. |
| clickhouse.rules | DiskSpacePredictionCH | Disk space usage in ClickHouse will exceed N% in 3 days | Clean up old data in ClickHouse using chpolicy. |
| consul.rules | ConsulServicesCountDecrease | More than N% of service processes are not running | Investigate why services are no longer registered in Consul. |
| infra.rules | CPUUsageHigh | CPU usage exceeds N% | Determine the cause of the increased consumption and take action if necessary. |
| infra.rules | MemoryUsageHigh | Memory usage exceeds N% | Determine the cause of the increased consumption and take action if necessary. |
| infra.rules | SwapUsageHigh | Swap usage exceeds N% | Determine the cause of the increased consumption and take action if necessary. |
| infra.rules | LoadAverageHigh | High load average | Determine the cause of the increased load and take action if necessary. |
| infra.rules | DiskSpaceUsage | Disk space usage exceeds N% | Depending on the node type: 1) MongoDB: unload archival collections, run compaction, and, as a last resort, initiate data transfer; 2) Postgres: check log sizes (log rotation, log cleaning); 3) ClickHouse: use chpolicy and clean up old data; 4) Investigate further. |
| infra.rules | DiskInodesUsageHigh | Inode usage exceeds N% | Investigate the cause. |
| infra.rules | SystemReboot | System rebooted | Determine the reason for the reboot. |
| liftbridge.rules | CorrelatorQueueTooLarge | Message queue for the correlator service exceeds N | Check how long the condition has lasted on the graphs, review the correlator logs, and restart the service if necessary. |
| liftbridge.rules | ClassifierQueueTooLarge | Message queue for the classifier service exceeds N | Check how long the condition has lasted on the graphs, review the classifier logs, and restart the service if necessary. |
| liftbridge.rules | UncommitedMessagesTooMuch | Not all messages are replicated to all replicas according to the ISR number | Check the logs of all Liftbridge services; one of the cluster nodes may be unavailable, or Liftbridge migrations may not have completed. |
| liftbridge.rules | StreamInvertedValue | Cursor shift occurred | Recreate the stream. |
| mongo.rules | MongoClasterServerCountChange | The MongoDB cluster has shrunk | Determine why the member is unavailable and restore its availability. |
| mongo.rules | MongoConnectionLow | No open connections to MongoDB | Determine why there are no active connections. |
| mongo.rules | MongoReplicationLag | MongoDB replication lag | Check the logs and run a resync if necessary. |
| noc.rules | FMNoEscalations | The number of incidents created in the external system is zero | Check the escalator logs and service status, and review the incident graphs. |
| noc.rules | FmTooManyAlerts | High percentage of incidents | Check the situation on the graphs. |
| noc.rules | LateTasksOnPool | Polling task execution delay | Check the graphs and the activator logs. |
| noc.rules | LateTasksScheduler | Scheduler task queue overload | Check the scheduler logs and review the incident graphs. If no incidents are present, the cause may be slow database response time. |
| noc.rules | HighTracesPerSecond | High trace generation rate from activator | Check the activator logs; the cause is probably hardware profile errors. |
| noc.rules | HighTracesPerSecond | High trace generation rate from non-activator | Check the service logs; the cause may be database unavailability. |
| postgres.rules | PostgresqlDeadlocksHigh | Deadlocks detected in PostgreSQL | If the situation recurs frequently, look for the cause in the PostgreSQL logs. |
| postgres.rules | PostgresqlBackendsLow | The number of free connections to PostgreSQL is low | Install pgbouncer, increase the number of threads, and check which PostgreSQL connections never terminate. |
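
The DiskSpacePredictionCH alert fires on a forecast ("will exceed N% in 3 days") rather than on the current value. The rule expression itself is not shown here, so the following is only a minimal sketch of how such a forecast can be computed, assuming a least-squares linear extrapolation similar in spirit to Prometheus's `predict_linear`. The function names `predict_linear` and `will_exceed`, the sample data, and the 90% threshold are all illustrative assumptions, not part of the actual rule.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, disk usage in percent)


def predict_linear(samples: Sequence[Sample], horizon_s: float) -> float:
    """Fit a least-squares line through the samples and extrapolate it
    horizon_s seconds past the last sample's timestamp."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_s)


def will_exceed(samples: Sequence[Sample], threshold_pct: float,
                horizon_days: float = 3.0) -> bool:
    """Return True if the extrapolated usage crosses threshold_pct
    within horizon_days (hypothetical alert condition)."""
    return predict_linear(samples, horizon_days * 86400) > threshold_pct


# Hypothetical data: hourly samples over one day, growing 0.25% per hour.
growing = [(h * 3600.0, 70.0 + 0.25 * h) for h in range(24)]
# Hypothetical data: usage that stays flat at 50%.
flat = [(h * 3600.0, 50.0) for h in range(24)]

print(will_exceed(growing, threshold_pct=90.0))
print(will_exceed(flat, threshold_pct=90.0))
```

A forecast of this kind gives time to clean up old data with chpolicy before the disk actually fills, at the cost of occasional false alarms when growth is bursty rather than steady.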