System Operation Failures

Alert Group	Alert Name	Description	Alert Processing Algorithm
blackbox.rules	EndpointDown	The Endpoint is unavailable, for example, a web service or LDAP server	Go to the endpoint node, use logs to determine the cause of the failure, and restore the service's operation.
blackbox.rules	SSLCertExpiringSoon	The SSL certificate expires in N days	Reissue the certificate or contact the issuer.
clickhouse.rules	ClickhouseInsertRateLow	Low data insertion rate into ClickHouse	Check average insertion rate in graphs, validate ClickHouse status, and check chwriter.
clickhouse.rules	DiskSpacePredictionCH	Disk space usage in ClickHouse will exceed N% in 3 days	Clean up old data in ClickHouse using chpolicy.
consul.rules	ConsulServicesCountDecrease	More than N% of service processes are not running	Investigate why services are no longer registered in Consul.
infra.rules	CPUUsageHigh	CPU usage exceeds N%	Determine the cause of increased consumption and take action if necessary.
infra.rules	MemoryUsageHigh	Memory usage exceeds N%	Determine the cause of increased consumption and take action if necessary.
infra.rules	SwapUsageHigh	Swap usage exceeds N%	Determine the cause of increased consumption and take action if necessary.
infra.rules	LoadAverageHigh	High load average	Determine the cause of increased consumption and take action if necessary.
infra.rules	DiskSpaceUsage	Disk space usage exceeds N%	Depending on the node type: 1) MongoDB: unload archival collections, run compaction, and, as a last resort, initiate data transfer; 2) Postgres: check log sizes (log rotation, log cleanup); 3) ClickHouse: remove old data using chpolicy; 4) any other node: investigate the cause.
infra.rules	DiskInodesUsageHigh	Inode usage exceeds N%	Investigate the cause.
infra.rules	SystemReboot	System rebooted	Determine the reason for the reboot.
liftbridge.rules	CorrelatorQueueTooLarge	Message queue for the correlator service exceeds N	Check the duration of the situation in graphs, access correlator logs, and restart the service if necessary.
liftbridge.rules	ClassifierQueueTooLarge	Message queue for the classifier service exceeds N	Check the duration of the situation in graphs, access classifier logs, and restart the service if necessary.
liftbridge.rules	UncommitedMessagesTooMuch	Not all messages are replicated to all replicas according to the ISR number	Check the logs of all Liftbridge services; one of the cluster nodes may be unavailable, or Liftbridge migrations may not have completed.
liftbridge.rules	StreamInvertedValue	Cursor shift occurred	Fix by recreating the stream.
mongo.rules	MongoClasterServerCountChange	The number of MongoDB cluster members has decreased	Determine why the member is unavailable and restore its availability.
mongo.rules	MongoConnectionLow	No open connections to MongoDB	Determine why there are no active connections.
mongo.rules	MongoReplicationLag	MongoDB replication lag	Check logs, run resync if necessary.
noc.rules	FMNoEscalations	Number of created incidents in the external system is zero	Check escalator logs and service status, and review incident graphs.
noc.rules	FmTooManyAlerts	High percentage of incidents	Check the situation on graphs.
noc.rules	LateTasksOnPool	Polling task execution delay	Check graphs, check activator logs.
noc.rules	LateTasksScheduler	Scheduler task queue overload	Check scheduler logs and review incident graphs. If there are no incidents, the cause may be slow database response times.
noc.rules	HighTracesPerSecond	High trace generation rate from activator	Check activator logs; the cause is probably hardware profile errors.
noc.rules	HighTracesPerSecond	High trace generation rate from non-activator	Check service logs; the cause may be database unavailability.
postgres.rules	PostgresqlDeadlocksHigh	Deadlocks detected in PostgreSQL	If the situation repeats frequently, look for the cause in the PostgreSQL logs.
postgres.rules	PostgresqlBackendsLow	Low number of free connections to PostgreSQL	Install pgbouncer, increase the number of threads, and check which PostgreSQL connections never terminate.
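
The DiskSpacePredictionCH alert above fires on a forecast ("will exceed N% in 3 days") rather than on the current value. The table does not say how the forecast is computed; a common approach is a least-squares linear extrapolation over recent samples, and the sketch below assumes that approach. The function names `predict_linear` and `will_exceed`, the sample data, and the thresholds are all hypothetical and only illustrate the idea.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, disk usage in percent)


def predict_linear(samples: Sequence[Sample], horizon_s: float) -> float:
    """Fit a least-squares line through the samples and return the value
    predicted horizon_s seconds after the last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + horizon_s)


def will_exceed(samples: Sequence[Sample], threshold_pct: float, days: float = 3) -> bool:
    """True if the extrapolated usage is above threshold_pct after `days` days."""
    return predict_linear(samples, days * 86400) > threshold_pct


# Hypothetical data: usage grows by 5 percentage points per day.
history = [(0, 50.0), (86400, 55.0), (172800, 60.0)]
print(will_exceed(history, threshold_pct=70))  # forecast is about 75%, so True
print(will_exceed(history, threshold_pct=80))  # forecast stays below 80%, so False
```

A forecast like this reacts to the growth rate, so it can warn while current usage is still well below the threshold, leaving time to clean up old data with chpolicy.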