
NOC Internal Metrics

Service Metrics

Each service serves its own metrics on request to the /mon/ URL.

All metrics carry the following tags:

  • service - the name of the service, hard-coded in the service itself
  • pool - for a sharded service, the name of the pool it works with
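As a rough illustration of what "tagged" means here, the sketch below keeps counters keyed by metric name plus a set of tags. The registry, the helper names (`metric_key`, `inc`) and the tag values are hypothetical and are not taken from the NOC code base; only the metric name `http_requests` and the tag names `service`, `pool` and `method` come from this page.

```python
from collections import defaultdict

# Hypothetical in-memory registry: counters keyed by metric name plus sorted tags.
counters = defaultdict(int)


def metric_key(name, **tags):
    """Build a hashable key from a metric name and its tags."""
    return (name, tuple(sorted(tags.items())))


def inc(name, value=1, **tags):
    """Increment the counter identified by the metric name and tags."""
    counters[metric_key(name, **tags)] += value


# Every metric carries the common tags; these values are made up.
COMMON = {"service": "web", "pool": "default"}

inc("http_requests", method="GET", **COMMON)
inc("http_requests", method="GET", **COMMON)
inc("http_requests", method="POST", **COMMON)

for key, count in sorted(counters.items()):
    print(key, count)
```

Sorting the tags before building the key means the same combination of tags always maps to the same counter, regardless of the order in which the tags were passed.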

General Service Metrics

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| service | | Service (core.service.base) | Service name |
| status | | | Service status - always returns True |
| pid | | | Service process ID |
| pool | | | Name of the pool the service works with |
| node | | | Server name, taken from config.node |
| uptime | | | Time since the service started (in seconds) |
| mon_requests | | | Number of requests to /mon/ |
| http_requests | method (GET / POST / PUT) | | Number of HTTP requests to the service |
| http_response | status | | Number of responses returned by the service (by status) |
| spans | | spans | Number of requests to the service for telemetry |
| errors | | debug | Number of tracebacks in the service |
| unique_errors | | debug | Number of unique (new) errors in the service (counted since launch) |
| err_<code> | | error | Number of errors, by code |

Cache metrics

Most services work with a cache, which is handled by a separate component. There are two types of cache:

  • internal (L1) cache - held in the service's own memory and limited to a certain number of items
  • external (L2) cache - provided by the cache service specified in the system settings. It can be backed by:
      • mongodb (the default, unless otherwise specified)
      • memcached - an external memcached service
      • redis - an external redis service

Each element of the system that uses the cache is assigned a unique key, cache_key. It is used as a tag.

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| cache_requests | cache_key | cachedmethod | Number of requests to the cache, by key |
| cache_hits | cache_key | cachedmethod | Number of successful requests (hits) in the cache |
| cache_hits | cache_level: internal | cachedmethod | Number of requests served by the internal cache |
| cache_hits | cache_level: external | cachedmethod | Number of requests served by the external cache |
| cache_misses | cache_key | cachedmethod | Number of requests that missed the cache |
| cache_locks_acquires | cache_key | cachedmethod | Number of cache accesses |
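One common way to read these counters is as a hit ratio per cache_key. The sketch below is only an illustration: the function name and the sample values are made up, and it assumes that every lookup is counted as either a hit or a miss.

```python
def cache_hit_ratio(hits, misses):
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


# Made-up counter values for a single cache_key.
sample = {"cache_hits": 180, "cache_misses": 20}

ratio = cache_hit_ratio(sample["cache_hits"], sample["cache_misses"])
print(f"hit ratio: {ratio:.0%}")
```

Returning None for an unused key avoids a division by zero and keeps "no traffic" distinct from "0% hits".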

HTTP client metrics

The built-in HTTP client provides the following metrics:

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| httpclient_requests | method (method name) | http_client.fetch | Number of completed requests |
| httpclient_timeouts | | http_client.fetch | Number of requests that ended with a timeout error |
| httpclient_proxy_timeouts | | http_client.fetch | Number of requests that ended with a proxy error |

Client RPC Metrics

The RPC client is used to interact with those system services that support the JSON-RPC protocol. It supports the following tags:

  • called_service - the name of the called service
  • method - the name of the called method in the service (the list depends on the called service)

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| rpc_call | method (method name), called_service (name of the called service) | http_client.fetch | Number of calls to a specific method of a given service |

DCS metrics

The DCS client is used to work with the Consul service-discovery service. Consul is used for:

  • Service discovery. A service is looked up by name, and the IP address and port (address:port) of the nearest instance are returned
  • Registering the service (at startup)
  • Deregistering the service (at shutdown)
  • Locking (if the service may only run as a single instance)
  • Obtaining a slot (when sharding by objects)

The client provides the following metrics:

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| dcs_resolver_activeservices | name (service name) | ResolverBase | Request to set the active service |
| dcs_resolver_requests | | ResolverBase | Total number of requests for the nearest service |
| dcs_resolver_hints | | ResolverBase | |
| dcs_resolver_success | | ResolverBase | Number of service requests that completed successfully |
| errors | type: dcs_resolver_timeout | ResolverBase | Number of service requests that failed |

Threadpool metrics

System services that use multithreaded processing rely on a thread pool (threadpool). This component manages the threads and provides the following metrics:

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| <name>_max_workers | | ThreadPoolExecutor | Maximum number of threads |
| <name>_idle_workers | | ThreadPoolExecutor | Number of idle threads |
| <name>_running_workers | | ThreadPoolExecutor | Number of busy threads |
| <name>_submitted_tasks | | ThreadPoolExecutor | Number of completed tasks |
| <name>_queued_jobs | | ThreadPoolExecutor | Number of jobs waiting in the queue |
| <name>_uptime | | ThreadPoolExecutor | Thread running time |

  • <name> - the name of the threadpool. The following names are available:
      • script - used by the Activator service to run scripts
      • query - used by the BI service
      • max - used by the Web and NBI services
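The thread counters above can be combined into a utilization figure and a simple saturation check. This is a sketch only: the helper names and the sample reading are made up, and it assumes that each metric name is prefixed with the threadpool name (here `script`).

```python
def pool_utilization(max_workers, running_workers):
    """Fraction of the pool's threads that are busy (0.0 to 1.0)."""
    if max_workers <= 0:
        return 0.0
    return running_workers / max_workers


def is_saturated(idle_workers, queued_jobs):
    """True when jobs are waiting and no thread is free to take them."""
    return idle_workers == 0 and queued_jobs > 0


# Made-up reading for a pool named "script".
reading = {
    "script_max_workers": 10,
    "script_idle_workers": 0,
    "script_running_workers": 10,
    "script_queued_jobs": 4,
}

utilization = pool_utilization(
    reading["script_max_workers"], reading["script_running_workers"]
)
saturated = is_saturated(
    reading["script_idle_workers"], reading["script_queued_jobs"]
)
print(f"utilization={utilization:.0%} saturated={saturated}")
```

A pool that stays saturated for long periods is a hint that either its maximum number of threads is too low or the jobs it runs are too slow.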

Scheduler metrics

System services that work with tasks use the scheduler component (scheduler). It is responsible for handling tasks (planning, dispatching for execution, and so on) and provides the following metrics:

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| <service>_jobs_started | | Scheduler | Total number of tasks started (while the service has been running) |
| <service>_jobs_retries_exceeded | | Scheduler | Number of tasks that exceeded the maximum number of execution attempts |
| <service>_jobs_burst | | Scheduler | Number of tasks exceeding the maximum |
| <service>_bulk_failed | | Scheduler | Number of errors when updating statuses in the collection |
| <service>_cache_set_requests | | Scheduler | Number of scheduler cache saves |
| <service>_cache_set_errors | | Scheduler | Number of errors while saving the scheduler cache |

Activator

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| error | type:invalid_script | ActivatorAPI.script | Number of calls to a nonexistent script |
| error | type:script_error | ActivatorAPI.script | Number of errors during script execution |
| error | type:snmp_v1_error | ActivatorAPI.snmp_v1_get | Number of SNMP v1 request errors |
| error | type:snmp_v2_error | ActivatorAPI.snmp_v2c_get | Number of SNMP v2c request errors |
| error | type:http_error_<code> | ActivatorAPI.http_get | Number of HTTP request errors (by code) |

Discovery

todo

SAE

todo

Ping

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| ignorable_ping_errors | | PingSocket.ping | Number of ignored errors when the collector receives an ICMP message |
| ping_recvfrom_errors | | PingSocket.ping | Number of errors when the collector receives an ICMP message |
| ping_unknown_icmp_packets | | PingSocket.ping | Number of ICMP packets belonging to another service |
| ping_time_stepbacks | | PingSocket.ping | Number of packets carrying a time later than the system time |
| ping_check_recover | | PingSocket.ping | Number of IP address availability recoveries |
| ping_objects | | Pingservice | Number of objects checked by probes |
| down_objects | | Pingservice | Number of unavailable objects |
| ping_probe_create | | Pingservice | Number of probes created (one object = one probe) |
| ping_probe_update | | Pingservice | Number of probe updates |
| ping_probe_delete | | Pingservice | Number of probes deleted |
| ping_check_total | | Pingservice | Number of checks performed |
| ping_check_skips | | Pingservice | Number of skipped checks |
| ping_check_success | | Pingservice | Number of successful checks |
| ping_check_fail | | Pingservice | Number of failed checks |

Collectors

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| trap_msg_in | | TrapServer.on_read | Number of incoming UDP SNMP Trap packets |
| events_out | | TrapCollectorService.register_message | Number of events sent to the classifier |
| sources_changed | | TrapCollectorService.update_source, SyslogCollectorService.update_source | Number of updates to source IP address information |
| sources_deleted | | TrapCollectorService.sources_deleted, SyslogCollectorService | Number of deletions of information by IP address |
| error | type:decode_failed | TrapServer.on_read | |
| error | type:socket_listen_error | on_activate | |
| error | type:object_not_found | TrapCollectorService.lookup_config | |
| syslog_msg_in | | SyslogServer.on_read | Number of incoming UDP syslog packets |

Classifier

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| lag_us | | ClassifierService.on_event | Delay relative to the message creation time at the source |
| events_preprocessed | | ClassifierService | Number of events classified by preprocessing |
| events_processed | | ClassifierService | Number of events received for processing |
| events_unk_object | | ClassifierService | Number of events from an unknown object |
| events_unk_duplicated | | ClassifierService | Number of duplicate events detected by codebook |
| events_duplicated | | ClassifierService | Number of classified events for which a duplicate was detected |
| events_disposed | | ClassifierService | Number of classified events sent to the correlator |
| events_classified | | ClassifierService | Number of classified events (a classification rule matched) |
| events_unknown | | ClassifierService | Number of unclassified events (no rule found) |
| events_suppressed | | ClassifierService | Number of events suppressed because they were repeats |
| events_deleted | | ClassifierService | Number of events deleted by a classification rule |
| events_failed | | ClassifierService | Number of events matched by preprocessing with an invalid class |
| events_syslog | | ClassifierService | Number of events from the Syslog collector |
| events_snmp_trap | | ClassifierService | Number of events from the SNMP Trap collector |
| events_system | | ClassifierService | Number of events from system services |
| events_other | | ClassifierService | Number of events from unknown sources |
| rules_checked | | RuleSet.find_rule | Number of rules checked |
| esm_lookups | | XRuleLookup.lookup_rules | Number of XRules checked |

Correlator

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| alarm_correlated_rule | | CorrelatorService.set_root_cause | Number of alarms with a root cause |
| alarm_change_mo | | CorrelatorService.raise_alarm | Number of ManagedObject changes in an alarm made with eval_expression |
| alarm_reopen | | CorrelatorService.raise_alarm | Number of reopened alarms |
| alarm_contribute | | CorrelatorService.raise_alarm | Number of events contributing to alarms |
| alarm_raise | | CorrelatorService.raise_alarm | Number of alarms raised |
| alarm_drop | | CorrelatorService.correlate | Number of dropped alarms (counted when the handler returned severity 0) |
| unknown_object | | CorrelatorService.clear_alarm | Number of failed alarm closures due to a missing ManagedObject |
| alarm_clear | | CorrelatorService.clear_alarm | Number of closed alarms |
| alarm_dispose | | CorrelatorService.dispose_worker | Number of received events |
| alarm_dispose_error | | CorrelatorService.dispose_worker | Number of errors when processing received events |
| event_lookup_failed | | CorrelatorService.lookup_event | Number of errors when looking up events by ID |
| event_lookups | | CorrelatorService.dispose_worker | Number of event lookups in the database by ID |
| event_hints | | CorrelatorService.get_event_from_hint | Number of times event information was taken from the message |
| alarm_correlated_topology | | CorrelatorService.topology_rca | Number of root causes set |
| detached_root | | check.check_close_consequence | Number of root cause detachments (when the main alarm is closed while subordinate alarms remain) |
| errors | type: alarm_handler | CorrelatorService.correlate | Handler runtime errors |

Escalator

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| escalation_missed_alarm | | escalator.escalate | At the time of escalation, the alarm had been deleted |
| escalation_already_closed | | escalator.escalate | At the time of escalation, the alarm was already closed |
| escalation_alarm_is_not_root | | escalator.escalate | At the time of escalation, a root cause had been set for the alarm (in this case the alarm is escalated as part of its parent) |
| escalation_not_found | | escalator.escalate | The escalation had been deleted (checked during escalation) |
| escalation_throttled | | escalator.escalate | The escalation was stopped because the escalation limit check was triggered |
| escalation_stop_on_maintenance | | escalator.escalate | The escalation was stopped because the equipment is covered by a Maintenance Window |
| escalation_tt_retry | | escalator.escalate | A temporary error was detected while creating the TT (incident, trouble ticket) in the external system; the escalation will be retried later |
| escalation_tt_create | | escalator.escalate | Number of incidents created in the external system |
| escalation_tt_fail | | escalator.escalate | Number of errors when creating incidents in the external system |
| escalation_tt_comment | | escalator.escalate | Number of comments added to incidents in the external system |
| escalation_tt_comment_fail | | escalator.escalate | Number of errors when adding comments to incidents in the external system |
| escalation_notify | | escalator.escalate | Number of notifications sent |
| escalation_closed_while_escalated | | escalator.escalate | Number of alarms found closed during escalation |
| escalation_already_deescalated | | escalator.notify_close | De-escalation (incident closing) for the alarm has already been performed |
| escalation_tt_close | | escalator.notify_close | Number of incidents closed in the external system |
| escalation_tt_close_retry | | escalator.notify_close | Number of retries when closing incidents in the external system |
| escalation_tt_close_fail | | escalator.notify_close | Number of errors when closing incidents in the external system |
| maintenance_tt_create | | escalator.maintenance | |
| maintenance_tt_fail | | escalator.maintenance | |
| maintenance_tt_close | | escalator.maintenance | |
| maintenance_tt_close_fail | | escalator.maintenance | |

Mailsender

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| smtp_response | code (SMTP code) | MailSenderService.send_mail | Number of messages sent (by SMTP server response code) |

System-wide metrics (self-monitoring)

Subsystem metrics are calculated from information in the database (Postgres or MongoDB) and require the selfmon service to be installed.

Task

Many system services run tasks at scheduled times. The scheduler is responsible for this. Technically, it is implemented as a queue of tasks in MongoDB. The collection of tasks it works with is named according to the template noc.scheduler.<scheduler_name>.<shard>. Tasks are either one-time or periodic. One-time after execution
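The collection naming template can be expressed as a small helper. This is a sketch only: the function name and the example values (`discovery`, `default`) are hypothetical, and only the template itself comes from the text.

```python
def scheduler_collection(scheduler_name, shard):
    """Build a collection name from the template
    noc.scheduler.<scheduler_name>.<shard>."""
    if not scheduler_name or not shard:
        raise ValueError("both scheduler_name and shard are required")
    return f"noc.scheduler.{scheduler_name}.{shard}"


# Example with made-up scheduler and shard names.
print(scheduler_collection("discovery", "default"))
```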

Tags are added to all metrics:

  • scheduler_name - name of the scheduler, usually the same as the name of the service
  • pool - the name of the shard

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| task_pool_total | scheduler_name, pool | selfmon.task | Total number of tasks |
| task_exception_count | - | | Number of tasks with an execution error |
| task_running_count | - | | Number of tasks in the running state |
| task_late_count | - | | Number of late tasks (tasks whose scheduled start time has already passed) |
| task_lag_seconds | - | | For late tasks, the size of the delay (in seconds) |
| task_box_time_avg_seconds | - | | Average task execution time (calculated for the equipment discovery service) |
| task_periodic_time_avg_seconds | - | | Average execution time |

Inventory

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| inventory_iface_count | | selfmon.inventory | Total number of interfaces in the system (calculated from the inv.interfaces collection) |
| inventory_iface_physical_count | | | Total number of physical interfaces in the system (calculated from the inv.interfaces collection) |
| inventory_link_count | | | Total number of links in the system (calculated from the inv.links collection) |
| inventory_subinterface_count | | | Total number of subinterfaces in the system (calculated from the inv.interfaces collection) |
| inventory_managedobject_total | | selfmon.managedobject | |
| inventory_managedobject_managed | | | Total number of managed objects (ManagedObject) in the active state (is_managed is set) |
| inventory_managedobject_unmanaged | | selfmon.managedobject | |

FM

Some of the metrics carry additional tags:

  • ac_group - the alarm group. Possible values:
      • availablility - ICMP availability alarms (class NOC | Managed Object | Ping Failed)
      • discovery - system alarms (class Discovery | ..) generated by discovery problems
      • other - all remaining alarms (Syslog / SNMP Trap)
  • pool - the pool of services that handled the alarm (this includes pinger, classifier, correlator)
  • shard - the analogue of pool for the external system

| Metric name | Tag value | Location | Meaning |
| --- | --- | --- | --- |
| fm_events_active_total | | selfmon.fm | Total number of active events (calculated from the fm.events.active collection) |
| fm_events_active_last_lag_seconds | | selfmon.fm | Difference (in seconds) between the current time and the creation time of the last event (calculated from the fm.events.active collection) |
| fm_alarms_active_total | | selfmon.fm | Total number of active alarms (calculated from the fm.alarms.active collection) |
| fm_alarms_archived_total | | selfmon.fm | Total number of archived alarms (calculated from the fm.alarms.archived collection) |
| fm_alarms_active_last_lag_seconds | | selfmon.fm | Difference (in seconds) between the current time and the creation time of the last alarm (calculated from the fm.alarms.active collection) |
| fm_alarms_active_late_count | | selfmon.fm | Number of events due to equipment unavailability (class NOC |
| fm_alarms_active_pool_count | ac_group, pool | selfmon.fm | Number of active alarms, broken down by pool and class |
| fm_alarms_active_withroot_pool_count | ac_group, pool | selfmon.fm | Number of active alarms with a root cause, broken down by pool and class |
| fm_alarms_active_withoutroot_pool_count | ac_group, pool | selfmon.fm | Number of active alarms without a root cause, broken down by pool and class |
| fm_escalation_pool_count | shard | selfmon.fm | Number of escalations in the queue |
| fm_escalation_first_lag_seconds | shard | selfmon.fm | Difference (in seconds) between the current time and the time of the first escalation in the queue |
| fm_escalation_lag_seconds | shard | selfmon.fm | Difference (in seconds) between the current time and the time of the last escalation in the queue |

Each group of metrics has its own settings in the config.selfmon section:

  • enable_managedobject - enable metrics collection for managedobject
  • managedobject_ttl - metrics update interval for managedobject

Similar settings exist for each of the other sections:

  • enable_task
  • task_ttl
  • enable_inventory
  • inventory_ttl
  • enable_fm
  • fm_ttl
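The enable_* / *_ttl pairs suggest a pattern in which each group of metrics is recomputed at most once per interval and skipped entirely when disabled. The sketch below illustrates that pattern only; the class name `TTLCollector`, its parameters, and the assumption that the ttl is expressed in seconds are hypothetical and do not describe the actual selfmon implementation.

```python
import time


class TTLCollector:
    """Recompute a group of metrics at most once per ttl seconds."""

    def __init__(self, enabled, ttl, compute, clock=time.monotonic):
        self.enabled = enabled    # plays the role of enable_<section>
        self.ttl = ttl            # plays the role of <section>_ttl
        self.compute = compute    # callable returning a dict of metrics
        self.clock = clock        # injectable clock, useful for testing
        self._expires = None
        self._cached = {}

    def get(self):
        """Return cached metrics, refreshing them when the ttl has elapsed."""
        if not self.enabled:
            return {}
        now = self.clock()
        if self._expires is None or now >= self._expires:
            self._cached = self.compute()
            self._expires = now + self.ttl
        return self._cached


# Demonstration with a fake clock and a made-up metric value.
calls = []
fake_now = [0.0]


def compute_fm():
    calls.append(1)
    return {"fm_alarms_active_total": len(calls)}


fm = TTLCollector(enabled=True, ttl=30, compute=compute_fm, clock=lambda: fake_now[0])
fm.get()             # first call computes the metrics
fake_now[0] = 10.0
fm.get()             # within the ttl: served from the cached result
fake_now[0] = 31.0
fm.get()             # ttl elapsed: metrics are recomputed

disabled = TTLCollector(enabled=False, ttl=30, compute=compute_fm)
```

Caching by ttl keeps potentially expensive database queries from running on every metrics request, at the cost of values that can be up to one interval old.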