NOC Internal Metrics¶
Service Metrics¶
Given to a separate service for reference to the link /mon/
All metrics are tagged:
- service - the name of the service. It is fixed in the code
- pool - in the case of a sharded service, an indication of the pool
General Service Metrics¶
Metric name | Tag value | A place | Physical meaning |
---|---|---|---|
service | Service (core.service.base) | Service name | |
status | Service status - always returns True | ||
pid | Service process number | ||
pool | The name of the pool with which the service works | ||
node | The server name is filled in config.node | ||
uptime | Time since the service started (in seconds) | ||
mon_requests | Number of requests by reference /mon/ | ||
http_requests | method (GET / POST / PUT) | The number of calls to HTTP service | |
http_response | status | Number of responses returned by the service (by status) | |
spans | spans | The number of requests for the return of telemetry to the service | |
errors | debug | The number of traces in the service | |
unique_errors | debug | The number of unique (new) errors in the service (counted from the moment of launch) | |
err_<code> | error | The number of errors by code |
Cache metrics¶
Most services provide work with the cache. A separate component is responsible for this. The cache is divided into 2 types:
- internal (L1, internal) cache is represented by intra-service memory for a certain number of pieces of information
external ( L2, external) cache is organized by the cache service specified in the system settings. It can be located:
mongodb (by default, unless otherwise specified)
- memcached - external service memcached
- redis - external service redis
Each element of the system that uses the cache is assigned a unique key - cache_key. It is used as tag.
Metric name | Tag value | A place | Physical meaning |
---|---|---|---|
cache_requests | cache_key | cachedmethod | The number of requests to the cache by key |
cache_hits | cache_key | cachedmethod | The number of successful requests (hits) in the cache |
cache_hits | cache_level: internal | cachedmethod | Number of queries processed (passed) by internal cache |
cache_hits | cache_level: external | cachedmethod | Number of queries processed (passed) by external cache |
cache_misses | cache_key | cachedmethod | The number of requests past the cache (missed) |
cache_locks_acquires | cache_key | cachedmethod | Number of cache accesses |
HTTP client metrics¶
Built-in HTTP client supports metrics:
Metric name | Tag value | A place | Physical meaning |
---|---|---|---|
httpclient_requests | method (method name) | http_client.fetch | Number of completed requests |
httpclient_timeouts | http_client.fetch | Number of requests with an error timeout | |
httpclient_proxy_timeouts | http_client.fetch | The number of requests with a proxy error |
Client RPC Metrics¶
A client RPC used to interact with a part of system services that support the protocol JSON-RPC. Supports the following tags:
- called_service - the name of the called service
- method - the name of the called method in the service (the list depends on the called service)
Metric name | Tag value | A place | Physical meaning |
---|---|---|---|
rpc_call | method (method name), called_service (name of the service being called) | http_client.fetch | The number of calls to a specific method in a given service |
DCS metrics¶
DCS client is used to work with the services service Consul. Consul is used for:
- Search services. A request is made to search for a service by name, the IP address is returned: the port of the nearest service
- To register yourself (at startup)
- For unregistration (at a stop)
- For blocking (if the launched service can work in a single copy)
- To get a slot (in the case of sharding by objects)
The client provides the following metrics:
Metric name | Tag value | A place | Physical meaning |
---|---|---|---|
dcs_resolver_activeservices | name (service name) | ResolverBase | Request for exhibiting active service. |
dcs_resolver_requests | ResolverBase | The total number of requests for the nearest service | |
dcs_resolver_hints | ResolverBase | ||
dcs_resolver_success | ResolverBase | The number of requests for service, completed success | |
errors | type: dcs_resolver_timeout | ResolverBase | The number of service requests that failed |
Threadpool metrics¶
In system services using multi-thread processing is used pool of threads( threadpool). This component is responsible for managing flows and provides the following metrics:
Metric name | Tag value | A place | Physical meaning | |
---|---|---|---|---|
_max_workers | ThreadPoolExecutor | Maximum number of threads | ||
_idle_workers | ThreadPoolExecutor | Number of idle threads | ||
_running_workers | ThreadPoolExecutor | The number of busy threads | ||
_submitted_tasks | ThreadPoolExecutor | Number of completed tasks | ||
_queued_jobs | ThreadPoolExecutor | Number of jobs waiting (in the queue) | ||
_uptime | ThreadPoolExecutor | Flow time |
- The name of the threadpool. The following items are available: script - used by the Activator service to run scripts
- query - used by the service BI
- max - use services Web,NBI
Scheduler metrics¶
In system services using work with tasks, the scheduler component ( scheduler) is used. It is responsible for working with tasks (planning, sending for execution ...). Provides the following metrics:
Metric name Tag value A place Physical meaning <service>_jobs_started
Scheduler The total number of running tasks (during operation) <service>_jobs_retries_exceeded
Scheduler Number of tasks exceeding the maximum number of executions <service>_jobs_burst
Scheduler The number of tasks exceeding the maximum <service>_bulk_failed
Scheduler The number of update status errors in the collection <service>_cache_set_requests
Scheduler Number of Scheduler Cache Saves <service>_cache_set_errors
Scheduler The number of errors while saving the scheduler cache Activator¶
Metric name Tag value A place Physical meaning error type:invalid_script ActivatorAPI.script The number of calls to a non-existent script type:script_error ActivatorAPI.script The number of errors during the execution of the script type:snmp_v1_error ActivatorAPI.snmp_v1_get The number of SNMP V1 request errors type:snmp_v2_error ActivatorAPI.snmp_v2c_get Number of SNMP V2 Request Errors type:http_error_<code>
ActivatorAPI.http_get The number of HTTP request errors (divided by code) Discovery¶
todo
SAE¶
todo
Ping¶
Metric name Tag value A place Physical meaning ignorable_ping_errors PingSocket.ping The number of ignored errors when the collector receives an ICMP message ping_recvfrom_errors PingSocket.ping The number of errors when the collector receives an ICMP message ping_unknown_icmp_packets PingSocket.ping ICMP packet belonging to another service ping_time_stepbacks PingSocket.ping The number of packages containing more time system ping_check_recover PingSocket.ping Number of IP address availability recoveries ping_objects Pingservice The number of objects checked by the sample down_objects Pingservice Number of unavailable objects ping_probe_create Pingservice Number of samples (one object = one sample) ping_probe_update Pingservice The number of updates in the samples ping_probe_delete Pingservice Number of samples removed ping_check_total Pingservice The number of checks performed ping_check_skips Pingservice The number of missed checks ping_check_success Pingservice The number of successful checks ping_check_fail Pingservice The number of failed checks Collectors¶
Metric name Tag value A place Physical meaning trap_msg_in TrapServer.on_read The number of incoming UPD SNMP Trap packets events_out TrapCollectorService.register_message The number of events in the direction of the classifier sources_changed TrapCollectorService.update_source, SyslogCollectorService.update_source Updating information on source IP addresses sources_deleted TrapCollectorService.sources_deleted, SyslogCollectorService Deleting information by IP address error type:decode_failed TrapServer.on_read error type:socket_listen_error on_activate error type:object_not_found TrapCollectorService.lookup_config syslog_msg_in SyslogServer.on_read The number of incoming UPD syslog packages Classifier¶
Metric name Tag value A place Physical meaning lag_us ClassifierService.on_event Delay versus message creation time at source events_preprocessed ClassifierService The number of events classified by pre-processing events_processed ClassifierService The number of events received for processing events_unk_object ClassifierService The number of events from an unknown object events_unk_duplicated ClassifierService Number of duplicate events detected by codebook events_duplicated ClassifierService The number of classified events that have a duplicate detected. events_disposed ClassifierService The number of classified events sent to the correlator events_classified ClassifierService The number of classified events (there was a match with the classification rule) events_unknown ClassifierService Number of unclassified (no rule found) events events_suppressed ClassifierService Number of events suppressed due to replay events_deleted ClassifierService Number of events deleted based on classification rule events_failed ClassifierService The number of events that fell under the preprocessing with an invalid class events_syslog ClassifierService The number of events from the Syslog collector events_snmp_trap ClassifierService The number of events from the SNMP Trap collector events_system ClassifierService The number of events from system services events_other ClassifierService The number of events from unknown sources rules_checked RuleSet.find_rule The number of checked rules esm_lookups XRuleLookup.lookup_rules The number of checked rules XRules Correlator¶
Metric name Tag value A place Physical meaning alarm_correlated_rule CorrelatorService.set_root_cause The number of accidents with the root cause alarm_change_mo CorrelatorService.raise_alarm Number of ManagedObject changes in crash with eval_expression alarm_reopen CorrelatorService.raise_alarm Number of reopen accidents alarm_contribute CorrelatorService.raise_alarm The number of events involved in accidents alarm_raise CorrelatorService.raise_alarm Number of alarms raised alarm_drop CorrelatorService.correlate Chilo missed accidents (executed if the handler returned Severity 0) unknown_object CorrelatorService.clear_alarm Number of failed crash closures due to lack of ManagedObject alarm_clear CorrelatorService.clear_alarm Number of closed accidents alarm_dispose CorrelatorService.dispose_worker The number of received events alarm_dispose_error CorrelatorService.dispose_worker The number of errors when processing received events event_lookup_failed CorrelatorService.lookup_event The number of errors when searching for events by ID event_lookups CorrelatorService.dispose_worker The number of searches for events in the database by ID event_hints CorrelatorService.get_event_from_hint Number of use of information on the event from the message alarm_correlated_topology CorrelatorService.topology_rca The number of primed causes detached_root check.check_close_consequence The number of trips of the root cause (in case of closing the main accident and the remaining subordinates) errors type: alarm_handler CorrelatorService.correlate Runtime Handler Errors Escalator¶
Metric name Tag value A place Physical meaning escalation_missed_alarm escalator.escalate At the time of the escalation, the accident was removed. escalation_already_closed escalator.escalate At the time of the escalation, the accident was closed escalation_alarm_is_not_root escalator.escalate At the time of the escalation, the root cause of the accident was exposed (in this case, the accident is escalated as part of the parent) escalation_not_found escalator.escalate Escalation was removed (checked during escalation) escalation_throttled escalator.escalate The escalation was stopped because triggered check for exceeding the escalation limit escalation_stop_on_maintenance escalator.escalate The escalation was stopped because equipment covered with Maintanance Window escalation_tt_retry escalator.escalate During the creation of the TT (Incident, Trouble ticket) in the external system was detected Temporary Error, the escalation went to repeat later escalation_tt_create escalator.escalate The number of generated incidents in the external system escalation_tt_fail escalator.escalate The number of errors when creating incidents in the external system escalation_tt_comment escalator.escalate Number of comments added to events in the external system escalation_tt_comment_fail escalator.escalate The number of errors when commenting comments on the incidents in the external system escalation_notify escalator.escalate The number of sent notifications escalation_closed_while_escalated escalator.escalate Number of closed accidents detected during escalation escalation_already_deescalated escalator.notify_close De-escalation (incident closing) for an accident has already been made escalation_tt_close escalator.notify_close The number of incidents closed in the external system escalation_tt_close_retry escalator.notify_close The number of repetitions of closing incidents in the external system escalation_tt_close_fail escalator.notify_close The number of errors when closing incidents in the external system maintenance_tt_create escalator.maintenance maintenance_tt_fail escalator.maintenance maintenance_tt_close escalator.maintenance maintenance_tt_close_fail escalator.maintenance Mailsender¶
Metric name Tag value A place Physical meaning smtp_response code (SMTP code) MailSenderService.send_mail Number of sent messages (divided by SMTP server response codes) System-wide metrics (self-monitoring)¶
Subsystem metrics are calculated based on information from the database ( Postgres or MongoDB) and require an installed service seflmon.
Task¶
In many services of the system, tasks are performed with time reference. It is responsible for this scheduler. Technically, it is implemented as a queue of tasks in MongoDB. Collection of tasks with which it works is called a template:
noc.scheduler.<scheduler_name>.<shard>
. The tasks themselves are one-time and periodical. One-time after executionTags are added to all metrics:
- scheduler_name - name of the scheduler, usually the same as the name of the service
- pool - the name of the shard
Metric name Tag value A place Physical meaning task_pool_total scheduler_name, pool selfmon.task Total number of tasks task_exception_count - Number of tasks with execution error task_running_count - The number of tasks in the state запущено task_late_count - The number of delayed tasks (start time later than the current) task_lag_seconds - In case of delayed tasks, the delay value of the task (in seconds) task_box_time_avg_seconds - Average task completion time (counted for equipment survey service (discovery)) task_periodic_time_avg_seconds - Average lead time Inventory¶
Metric name Tag value A place Physical meaning inventory_iface_count selfmon.inventory The total number of interfaces in the system (calculated from the collection of inv.interfaces) inventory_iface_physical_count The total number of physical interfaces in the system (calculated from the collection of inv.interfaces) inventory_link_count Total number of links in the system (calculated from the inv.links collection) inventory_subinterface_count Total number of subinterfaces in the system (calculated from the collection of inv.interfaces) inventory_managedobject_total selfmon.managedobject inventory_managedobject_managed The total number of management objects (ManagedObject) in the active state (ticked is_managed) inventory_managedobject_unmanaged selfmon.managedobject FM¶
For part of the metrics tags are added:
ac_group - group of accidents. Present:
availablility- NOC | Managed Object | Ping FailedICMP accessibility crashes
- discovery- System crashes (class Discovery | ..) generated by survey issues
other - The rest of the accident (Syslog / SNMP Trap)
pool - a pool of services which practiced an accident (this includes pinger, classifier, correlator)
- shard - analogue of the pool for the external system
Metric name Tag value A place Physical meaning fm_events_active_total selfmon.fm The total number of active events (calculated from the fm.events.active collection) fm_events_active_last_lag_seconds selfmon.fm The difference (in seconds) between the current time and the time of the last message creation (counted according to the fm.events.active collection) fm_alarms_active_total selfmon.fm The total number of active alarms (calculated from the fm.alarms.active collection) fm_alarms_archived_total selfmon.fm The total number of archived accidents (counted in the fm.alarms.archived collection) fm_alarms_active_last_lag_seconds selfmon.fm The difference (in seconds) between the current time and the time of the last crash creation (calculated from the fm.alarms.active collection) fm_alarms_active_late_count selfmon.fm The number of events due to equipment unavailability (class NOC fm_alarms_active_pool_count ac_group, pool selfmon.fm Number of active alarms with splitting in pool and class fm_alarms_active_withroot_pool_count ac_group, pool selfmon.fm The number of active accidents with the underlying cause with splitting into pool and class fm_alarms_active_withoutroot_pool_count ac_group, pool selfmon.fm Number of active accidents without root cause with splitting by pool and class fm_escalation_pool_count shard selfmon.fm The number of escalations in the queue fm_escalation_first_lag_seconds shard selfmon.fm The difference (in seconds) between the current time and the time of the first escalation in the queue fm_escalation_lag_seconds shard selfmon.fm The difference (in seconds) between the current time and the time of the last escalation in the queue For each of the groups of metrics there are settings in the section config.selfmon:
- enable_managedobject - enable metrics collection by managedobject
- managedobject_ttl - metrics update interval for managedobject
Similar settings are for each section:
- enable_task
- task_ttl
- enable_inventory
- inventory_ttl
- enable_fm
- fm_ttl