Перейти к содержанию

NOC 20.4 is Released

In accordance to our Release Policy we're proudly present release 20.4.

20.4 release contains of 225 bugfixes, optimisations and improvements. Refer to the Release Notes for details.

Highlights

Generic Message Exchange

NOC can send notifications to email/telegram via Notification groups on alarms and configuration changes. Notifications are useful to take human attention to possible problem. To notify push data to external system NOC uses DataStream approach. External systems have to pull changes and process them according own logic.

NOC 20.4 generalises all data pushed to external systems to the concepts of messages. Message is the piece of data which can be passed from NOC to outside. Messages can be of different types:

  • alarms
  • object inventory data
  • configuration
  • configuration change
  • reboot
  • new object
  • system login
  • etc.

NOC can generate messages on certain condition. Humans and soulless robots can have interest in messages. So we need some kind of routing.

NOC 20.4 introduces new service, called Message Exchanger or mx. Like mail servers, mx receives the message, processes it headers and decides where to route the message. mx relies on family of the sender processes. Each kind of sender can deliver the message outside of the system. Each sender supports particular exchange protocol, hiding delivery details from mx. mx can transform messages or apply the templates to convert delivered message to desired format. mx, senders, message generation and transport conventions became the viable part of NOC called Generalized Message Exchange or GMX.

NOC 20.4 introduces kafkasender service, used to push data to a Kafka message bus. We're planning to convert other senders (mailsender, tgsender, etc) to GMX in the NOC 21.1.

Kafka Integration

NOC 20.4 introduces the kafkasender service, the part of GMX. Kafka became mainstream message bus in telecom operation, and NOC is being able to push all data, available via DataStream to a Kafka for following routing and processing, reducing amount of mutual system-to-system integrations.

Biosegmentation

Biosegmentation has been introduced in NOC 20.3 as ad-hoc segmentation process. Process relies on the series of trials. Each trial can lead to merging or fixing the structure of segments tree. Current implementation relies on inter-segment links. But sometimes the segment hierarchy must be established before the linking process.

NOC 20.4 introduces additional MAC-based biosegmentation approach, called Vacuum Bulling, allowing to build segment hierarchy basing on MAC addresses, collected on interfaces.

Ordered Message Queue

NOC uses NSQ as internal message queue. Lightweight and hi-performance solution shows good result usually. But after the time architectural corner cases became more and more visible:

  • NSQ designed to be always-on-dial solution. nsqd is on every host, communicating to publisher via localhost loopback. In modern container world that fact being bug, not a feature. Reliance on absolute reliability of connection between publisher and broker became unacceptable.
  • Subscribers have to communicate with nsqlookup service to find the hosts containing data. Then they have to establish direct connection with them. Official python NSQ client uses up to 5 tcp connections. So amount of connection grows fast with grow of amount producers and subscribers.
  • Official python NSQ client's error handling is far from ideal. Code base is old and obscure and hard to maintain. No asyncio version is available.
  • No fault tolerance. Failed nsqd will lead to the lost messages. No message replication at all.
  • Out-of-order messages. Message order may change due to internal nsqd implementation and to client logic. Applications like fault management relies on message order. Closing events must follow opening ones. Otherwise the hanging alarms will pollute the system.

During the researches we'd decided we need message system with commit-log approach. Though Kafka is industrial standard, its dependency on JVM and Zookeeper may be a burden. We stopped on Liftbridge. Liftbridge is clean and simple implementation of proven Kafka storage and replication algorithms.

We'd ported events topics to Liftbridge, fixing critical events ordering problem. GMX topics uses Liftbridge too. Next release (21.1) will address remaining topics.

FastAPI

We'd starting migration from Tornado to FastAPI. Main motivation is:

  • Tornado has bring generator-based asynchronous programming to Python2. Python3 has introduced native asynchronous programming along with asyncio library. Later Tornado versions are simple wrappers atop asyncio.
  • FastAPI uses Pydantic for request and response validation. We'd considered Pydantic very useful during out ETL refactoring
  • FastAPI generates OpenAPI/Swagger scheme, improving integration capabilities.
  • FastAPI is fast.

We'd ported login service to FastAPI. JWT had replaced Tornado's signed cookies. We'd also implemented the set of OAuth2-based endpoints for our next-generation UI.

ETL Improvements

ETL has relied on CSV format to store extracted data. Though it simple and wraps SQL responses in obvious way, it have some limitation:

  • Metadata of extracted fields stored outside of extractor, in the loader.
  • Field order hardcoded in loader
  • Fields has no type information, leading to leaky validation
  • No native way to pass complex data structures, like list and nested documents
  • Extractors must return empty data for long time deprecated fields

NOC 20.4 introduces new extractor API. Instead of lists, passed to CSV, extractor returns pydantic model instances. Pydantic models are defined in separate modules and reused by both extractors and loaders. Interface between extractor and loader became well-defined. Models perform data validation on extraction and load stages. So errors in extractor will lead to informative error message and to the stopping of process.

ETL now uses JSON Line format (jsonl) - a bunch of JSON structures for each row, separated by newlines. So it is possible to store structures with arbitrary complexity. We'd ever provided the tool to convert legacy extracted data to a new format.

SNMP Rate Limiting

NOC 20.4 allows to limit a rate of SNMP requests basing on profile or platform settings. This reduces impact on the platforms with weak CPU or slow control-to-dataplane bus.

orjson

orjson is used instead of ujson for JSON serialization/deserialization.

New profiles

  • KUB Nano
  • Qtech.QFC

Migration

Tower Upgrade

Please upgrade Tower up to 1.0.0 or later before continuing NOC installation/upgrade process. See Tower upgrade process documentation<https://code.getnoc.com/noc/tower/-/blob/master/UPDATING.md>_ for more details.

Elder versions of Tower will stop deploy with following error message

Liftbridge/NATS

NOC 20.4 introduces Liftbridge service for ordered message queue. You should deploy at least 1 Liftbridge and 1 NATS service instance. See more details in Tower's service configuration section.

ETL

Run fix after upgrade

./noc fix apply fix_etl_jsonl