System Operation Failures

| Alert Group | Alert Name | Description | Alert Processing Algorithm |
| --- | --- | --- | --- |
| blackbox.rules | EndpointDown | The endpoint is unavailable, for example, a web service or LDAP server | Go to the endpoint node, use the logs to determine the cause of the failure, and restore the service. |
| blackbox.rules | SSLCertExpiringSoon | The SSL certificate expires in N days | Reissue the certificate or contact the issuer. |
| clickhouse.rules | ClickhouseInsertRateLow | Low data insertion rate into ClickHouse | Check the average insertion rate on the graphs, verify the ClickHouse status, and check chwriter. |
| clickhouse.rules | DiskSpacePredictionCH | Disk space usage in ClickHouse will exceed N% in 3 days | Clean up old data in ClickHouse using chpolicy. |
| consul.rules | ConsulServicesCountDecrease | More than N% of service processes are not running | Investigate why the services are no longer registered in Consul. |
| infra.rules | CPUUsageHigh | CPU usage exceeds N% | Determine the cause of the increased consumption and take action if necessary. |
| infra.rules | MemoryUsageHigh | Memory usage exceeds N% | Determine the cause of the increased consumption and take action if necessary. |
| infra.rules | SwapUsageHigh | Swap usage exceeds N% | Determine the cause of the increased consumption and take action if necessary. |
| infra.rules | LoadAverageHigh | High load average | Determine the cause of the increased load and take action if necessary. |
| infra.rules | DiskSpaceUsage | Disk space usage exceeds N% | Depending on the node type: 1) MongoDB: unload archival collections, run compaction, and, as a last resort, initiate data transfer; 2) Postgres: check log sizes (log rotation, log cleaning); 3) ClickHouse: use chpolicy to clean up old data; 4) Investigate further. |
| infra.rules | DiskInodesUsageHigh | Inode usage exceeds N% | Investigate. |
| infra.rules | SystemReboot | The system rebooted | Determine the reason for the reboot. |
| liftbridge.rules | CorrelatorQueueTooLarge | The message queue for the correlator service exceeds N | Check on the graphs how long the condition has lasted, review the correlator logs, and restart the service if necessary. |
| liftbridge.rules | ClassifierQueueTooLarge | The message queue for the classifier service exceeds N | Check on the graphs how long the condition has lasted, review the classifier logs, and restart the service if necessary. |
| liftbridge.rules | UncommitedMessagesTooMuch | Not all messages are replicated to all replicas according to the ISR number | Check the logs of all Liftbridge services; one of the cluster nodes may be unavailable, or Liftbridge migrations may not have completed. |
| liftbridge.rules | StreamInvertedValue | A cursor shift occurred | Fix by recreating the stream. |
| mongo.rules | MongoClasterServerCountChange | The MongoDB cluster has shrunk | Determine why the member is unavailable and restore its availability. |
| mongo.rules | MongoConnectionLow | No open connections to MongoDB | Determine why there are no active connections. |
| mongo.rules | MongoReplicationLag | MongoDB replication lag | Check the logs and run a resync if necessary. |
| noc.rules | FMNoEscalations | The number of incidents created in the external system is zero | Check the escalator logs and the service status, and review the incident graphs. |
| noc.rules | FmTooManyAlerts | High percentage of incidents | Check the situation on the graphs. |
| noc.rules | LateTasksOnPool | Polling task execution is delayed | Check the graphs and the activator logs. |
| noc.rules | LateTasksScheduler | The scheduler task queue is overloaded | Check the scheduler logs and review the incident graphs. If there are no incidents, the cause may be slow database response time. |
| noc.rules | HighTracesPerSecond | High trace generation rate from activator | Check the activator logs; the cause is probably hardware profile errors. |
| noc.rules | HighTracesPerSecond | High trace generation rate from non-activator | Check the service logs; the cause is possibly database unavailability. |
| postgres.rules | PostgresqlDeadlocksHigh | Deadlocks detected in PostgreSQL | If the situation recurs frequently, look for the cause in the PostgreSQL logs. |
| postgres.rules | PostgresqlBackendsLow | The number of free connections to PostgreSQL is low | Install pgbouncer, increase the number of threads, and check which PostgreSQL connections never terminate. |
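
The DiskSpacePredictionCH alert fires on a forecast rather than on the current value: usage "will exceed N% in 3 days". The rule expression itself is not shown in this document, so the following is only a minimal sketch of how such a forecast could be computed, assuming a simple least-squares line fitted to recent usage samples (similar in spirit to the `predict_linear` function in PromQL). The function names, the sample format, and the 90% threshold are hypothetical and chosen for illustration only.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, disk usage in percent)

THREE_DAYS = 3 * 24 * 3600  # forecast horizon from the alert description


def fit_line(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Least-squares fit of usage = slope * t + intercept."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must cover more than one timestamp")
    cov = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    slope = cov / var_t
    return slope, mean_u - slope * mean_t


def predicted_usage(samples: Sequence[Sample], horizon: float = THREE_DAYS) -> float:
    """Extrapolate the fitted line `horizon` seconds past the last sample."""
    slope, intercept = fit_line(samples)
    last_t = samples[-1][0]
    return slope * (last_t + horizon) + intercept


def will_exceed(samples: Sequence[Sample], threshold: float,
                horizon: float = THREE_DAYS) -> bool:
    """True if the forecast crosses the threshold (N% in the table above)."""
    return predicted_usage(samples, horizon) > threshold


if __name__ == "__main__":
    # Hypothetical data: hourly samples for one day, growing 0.5% per hour from 60%.
    growing = [(h * 3600.0, 60.0 + 0.5 * h) for h in range(24)]
    print(will_exceed(growing, threshold=90.0))  # True
```

A forecast of this kind reacts to the growth rate, so it can fire while current usage is still well below the threshold. That is why the processing step is to clean up old data in advance instead of waiting for DiskSpaceUsage to trigger.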