NOC 20.4

13.11.2020

In accordance to our Release Policy we’re proudly present release 20.4.

20.4 release contains of 225 bugfixes, optimisations and improvements. Refer to the Release Notes for details.

Highlights

Generic Message Exchange

NOC can send notifications to email/telegram via Notification groups on alarms and configuration changes. Notifications are useful to take human attention to possible problem. To notify push data to external system NOC uses DataStream approach. External systems have to pull changes and process them according own logic.

NOC 20.4 generalises all data pushed to external systems to the concepts of messages. Message is the piece of data which can be passed from NOC to outside. Messages can be of different types:

  • alarms
  • object inventory data
  • configuration
  • configuration change
  • reboot
  • new object
  • system login
  • etc.

NOC can generate messages on certain condition. Humans and soulless robots can have interest in messages. So we need some kind of routing.

NOC 20.4 introduces new service, called Message Exchanger or mx. Like mail servers, mx receives the message, processes it headers and decides where to route the message. mx relies on family of the sender processes. Each kind of sender can deliver the message outside of the system. Each sender supports particular exchange protocol, hiding delivery details from mx. mx can transform messages or apply the templates to convert delivered message to desired format. mx, senders, message generation and transport conventions became the viable part of NOC called Generalized Message Exchange or GMX.

NOC 20.4 introduces kafkasender service, used to push data to a Kafka message bus. We’re planning to convert other senders (mailsender, tgsender, etc) to GMX in the NOC 21.1.

Kafka Integration

NOC 20.4 introduces the kafkasender service, the part of GMX. Kafka became mainstream message bus in telecom operation, and NOC is being able to push all data, available via DataStream to a Kafka for following routing and processing, reducing amount of mutual system-to-system integrations.

Biosegmentation

Biosegmentation has been introduced in NOC 20.3 as ad-hoc segmentation process. Process relies on the series of trials. Each trial can lead to merging or fixing the structure of segments tree. Current implementation relies on inter-segment links. But sometimes the segment hierarchy must be established before the linking process.

NOC 20.4 introduces additional MAC-based biosegmentation approach, called Vacuum Bulling, allowing to build segment hierarchy basing on MAC addresses, collected on interfaces.

Ordered Message Queue

NOC uses NSQ as internal message queue. Lightweight and hi-performance solution shows good result usually. But after the time architectural corner cases became more and more visible:

  • NSQ designed to be always-on-dial solution. nsqd is on every host, communicating to publisher via localhost loopback. In modern container world that fact being bug, not a feature. Reliance on absolute reliability of connection between publisher and broker became unacceptable.
  • Subscribers have to communicate with nsqlookup service to find the hosts containing data. Then they have to establish direct connection with them. Official python NSQ client uses up to 5 tcp connections. So amount of connection grows fast with grow of amount producers and subscribers.
  • Official python NSQ client’s error handling is far from ideal. Code base is old and obscure and hard to maintain. No asyncio version is available.
  • No fault tolerance. Failed nsqd will lead to the lost messages. No message replication at all.
  • Out-of-order messages. Message order may change due to internal nsqd implementation and to client logic. Applications like fault management relies on message order. Closing events must follow opening ones. Otherwise the hanging alarms will pollute the system.

During the researches we’d decided we need message system with commit-log approach. Though Kafka is industrial standard, its dependency on JVM and Zookeeper may be a burden. We stopped on Liftbridge. Liftbridge is clean and simple implementation of proven Kafka storage and replication algorithms.

We’d ported events topics to Liftbridge, fixing critical events ordering problem. GMX topics uses Liftbridge too. Next release (21.1) will address remaining topics.

FastAPI

We’d starting migration from Tornado to FastAPI. Main motivation is:

  • Tornado has bring generator-based asynchronous programming to Python2. Python3 has introduced native asynchronous programming along with asyncio library. Later Tornado versions are simple wrappers atop asyncio.
  • FastAPI uses Pydantic for request and response validation. We’d considered Pydantic very useful during out ETL refactoring
  • FastAPI generates OpenAPI/Swagger scheme, improving integration capabilities.
  • FastAPI is fast.

We’d ported login service to FastAPI. JWT had replaced Tornado’s signed cookies. We’d also implemented the set of OAuth2-based endpoints for our next-generation UI.

ETL Improvements

ETL has relied on CSV format to store extracted data. Though it simple and wraps SQL responses in obvious way, it have some limitation:

  • Metadata of extracted fields stored outside of extractor, in the loader.
  • Field order hardcoded in loader
  • Fields has no type information, leading to leaky validation
  • No native way to pass complex data structures, like list and nested documents
  • Extractors must return empty data for long time deprecated fields

NOC 20.4 introduces new extractor API. Instead of lists, passed to CSV, extractor returns pydantic model instances. Pydantic models are defined in separate modules and reused by both extractors and loaders. Interface between extractor and loader became well-defined. Models perform data validation on extraction and load stages. So errors in extractor will lead to informative error message and to the stopping of process.

ETL now uses JSON Line format (jsonl) - a bunch of JSON structures for each row, separated by newlines. So it is possible to store structures with arbitrary complexity. We’d ever provided the tool to convert legacy extracted data to a new format.

SNMP Rate Limiting

NOC 20.4 allows to limit a rate of SNMP requests basing on profile or platform settings. This reduces impact on the platforms with weak CPU or slow control-to-dataplane bus.

orjson

orjson is used instead of ujson for JSON serialization/deserialization.

New profiles

  • KUB Nano
  • Qtech.QFC

Migration

Tower Upgrade

Please upgrade Tower up to 1.0.0 or later before continuing NOC installation/upgrade process. See Tower upgrade process documentation<https://code.getnoc.com/noc/tower/-/blob/master/UPDATING.md>_ for more details.

Elder versions of Tower will stop deploy with following error message

Liftbridge/NATS

NOC 20.4 introduces Liftbridge service for ordered message queue. You should deploy at least 1 Liftbridge and 1 NATS service instance. See more details in Tower’s service configuration section.

ETL

Run fix after upgrade

./noc fix apply fix_etl_jsonl

New features

MR Title
!1668 Added function get alarms for controllers and devices for periodic job.
!4223 FastAPI login service
!4256 Add Project to ETL
!4274 New profile Qtech.QFC
!4290 Liftbridge client
!4361 #1363 ifdesc: Interface autocreation
!4388 Add new controller profile KUB Nano
!4398 mx service
!4403 kafkasender service
!4473 #1368 Model Interface scopes
!4488 #892 ETL JSON format
!4519 noc/noc#1356 SNMP Rate Limit
!4538 Configurable LDAP server policies
!4567 Biosegmentation: Vacuum bulling

Improvements

MR Title
!4225 Fix ddash refid
!4233 Allow alternative locations for binary speedup modules
!4236 Catch when sentry-sdk module enabled but not installed.
!4246 Fix Qtech.BFC profile
!4261 noc/noc#1304 Replace ujson with orjson
!4264 runtime optimization ReportMaxMetrics
!4275 ElectronR.KO01M profile scripts
!4278 noc/noc#1383 Add IfPath collator to confdb
!4280 noc/noc#1381 Add alarm_consequence_policy to TTSystem settings.
!4281 #1384 Add source-ip aaa hints.
!4287 Add round argument to metric scale function
!4293 Debian-based docker image
!4296 Change python to python3 when use ./noc
!4314 Update Card for Sensor Controller
!4320 Fill capabilities for beef.
!4338 New Grafana dashboards
!4344 Profile fix controllers
!4348 exp_decay window function
!4349 Controller/fix2
!4354 add_interface-type_Juniper_JUNOSe
!4358 Fix Qtech.BFC profile
!4364 LiftBridgeClient: Proper handling of message headers
!4369 LiftBridgeClient: fetch_metadata() stream and wait_for_stream parameters
!4380 Add to_json for thresholdprofile
!4383 Update threshold handler
!4384 Add collators to some profiles.
!4389 Electron fix profile
!4391 add new metric Qtech.BFC
!4394 fix some controllers ddash/metrics
!4396 Fix inerfaces name Qtech.BFC
!4399 Up report MAX_ITERATOR to 800 000.
!4402 mx: Use FastAPIService
!4405 liftbridge cursor persistence api
!4407 add_columns_total_reportmaxmetrics
!4416 Add csv+zip format to ReportDetails.
!4417 Add Long Alarm Archive options to ReportAlarm, from Clickhouse table.
!4428 Add available_only options to ReportDiscoveryTopologyProblem.
!4432 Reset NetworkSegment TTL cache after remove.
!4433 Change is_uplink criterias priority on segment MAC discovery.
!4439 fix_reportmaxmetrics
!4447 Add octets_in_sum and octets_out_sum columns to ReportMetrics.
!4453 ConfDB syslog
!4455 Fix controllers profiles, ddash
!4457 Fix get_iface_metrics
!4462 noc/noc#1392 Add search port by contains ifdescription token to ifdecr discovery.
!4464 LiftBridge client: Connection pooling
!4470 Add ReportMovedMacApplication application.
!4475 Add sorted to tags application.
!4477 noc/noc#1416 Extend ConfDB meta section.
!4479 Add get_confdb_query method to ManagedObjectSelector and MatchPrefix ConfDB function.
!4480 Add csv_zip file format to MetricsDetail Report.
!4483 noc/noc#1397 Additional biosegtrial criteria to policy.
!4486 Add migrate_ts field to ReportMovedMac.
!4501 noc/noc#1428 Add InterfaceDiscoveryApplicator for fill ConfDB info from interface discovery.
!4508 add_csvzip_reportmaxmetrics
!4511 Fix ./noc discovery for LB
!4515 noc/noc#1432 lb client: Configurable message size limit
!4516 fix csv_import view
!4517 Additional options to segment command
!4535 Bump networkx/numpy requirements
!4539 lb client: increased resilience
!4547 Add JOB_CLASS param to core.defer util.
!4549 ETL model Reference
!4551 add column reboots in fm.reportalarmdetail
!4553 fix processing trunk port vlan for HP A3100-24 (v 5.20.99)
!4565 Add ttl-policy argument to link command.
!4571 Filter Multicast MACs on Moved MAC report.
!4573 Add api_unlimited_row_limit param
!4579 liftBridge: publish_async waits for all the acks
!4582 noc/noc#1371 Add schedule_discovery_config handler to events.discovery.
!4592 noc/noc#1400 Migrate InterfaceClassification to ConfDB.
!4602 Add MatchAllVLAN and MatchAnyVLAN function to ConfDB.
!4607 Bump pytest version
!4624 add metrics “Subscribers
!4629 noc/noc#1440 Use all macs on ‘Discovery ID cache poison’ report.
!4630 Convert limit from dcs to int.
!4632 Add Telephony SIP metrics graph.
!4633 Always uplinks calculate.

Bugfixes

MR Title
!4249 Fix card MO
!4251 Fix status RNR
!4258 Change field_num on ReportObjectStat
!4269 noc/noc#1374 Fix typo on datastream format check.
!4285 Fix Profile Check Summary typo.
!4303 #1335 ConfDB: Fix and inside or combination
!4310 Fix RNR affected AD
!4319 Add err_status to beef snmp_getbulk_response method.
!4321 Convert oid on snmp raw_varbinds.
!4322 Fix event clean
!4327 Convert set to list on orjson dumps.
!4328 Add xmac discovery to ReportDiscoveryResult.
!4363 ./noc migrate-liftbridge: Do not create streams for disabled services
!4368 Fix hash_int()
!4373 Fix typo on Calcify Biosegmentation policy.
!4409 Add get_pool_partitions method to TrapCollectorService.
!4418 Add id field to project etl loader.
!4419 Fix multiple segment args on discovery command.
!4423 noc/noc#1399 Delete Permissions and Favorites on wipe user.
!4424 noc/noc#1375 Fix DEFAULT_STENCIL use on SegmentTopology.
!4425 noc/noc#1396 AlarmEscalation. Use item delay for consequence escalation.
!4426 Fix extapp group regex splitter to non-greedy.
!4430 Fix ManagedObject _reset_caches key for _id_cache.
!4452 noc/noc#1406 Use system username for JWT.
!4461 noc/noc#1229 Fix user cleanup Django Admin Log.
!4472 Add audience param to is_logged jwt.decode.
!4474 Add 120 sec to out_of_order escalation time.
!4485 noc/noc#688 Fix invalidate l1 cache for ManagedObject.
!4492 Skipping files if already compressed on destination.
!4497 noc/noc#1427 Fix whois ARIN url.
!4498 Fix object data use.
!4502 Move orjson defaults to jsonutils.
!4505 Bump ssh2-python to 0.23.
!4506 pm/utils -> Fix dict
!4507 Some etl loader fixes.
!4513 noc/noc#1423 Convert pubkey to bytes.
!4514 Convert empty object data to list on 0020 migration.
!4518 Fix vendors and handlers migrations
!4522 Fix typo on ifdescr discovery.
!4524 #1312 Consistent VPN ID generation
!4540 Fix customfields for mongoengine.
!4555 Revert uvicorn to 0.12.1.
!4561 Fix typo on interfaceprofile UI Application.
!4564 Fix trace when execute other script that command on MRT.
!4569 Fix typo on MRT service.
!4575 Add static_service_groups and static_client_groups clean_map to managedobject etl loader.
!4590 Fix login cookie ttl
!4594 Fix ETL loader change.
!4595 Fix extra filter when set extra order.
!4598 Fix datetime field on Service ETL model.
!4614 Fix SNMP_GET_OIDS on get_chassis_id scripts to list.
!4627 noc/noc#1439 Fix tag contains query for non latin symbol.

Code Cleanup

MR Title
!4254 Cleanup flake.
!4301 Fix vendor docs test
!4317 Updated .dockerignore
!4360 Remove unused dependencies: tornadis, mistune
!4362 Update blinker, bsdiff, cachetools, crontab, progressbar2, psycopg2, python-dateutil versions
!4465 Remove legacy scripts/ci-run
!4496 Fix formatting
!4533 Bump requirements
!4587 Fix collect beef for orjson.
!4589 Fix some lint errors
!4622 Fix Service etl model.

Profile Changes

Cisco.IOS

MR Title
!4316 Update Cisco.IOS profile to support more physical interfaces

Cisco.IOSXR

MR Title
!4408 added interfacetypes for IOSXR platform
MR Title
!4355 DLink.DxS.get_metrics. Fix SNMP Error when ‘CPU
!4434 Fix Dlink.DxS profile.

EdgeCore.ES

MR Title
!4556 EdgeCore.ES.get_spanning_tree. Fix getting port_id for Trunk interface.

Eltex.MES

MR Title
!4217 test tacacs1.yml crashed. AssertionError: assert [] == [(right syntax)]
!4262 Eltex.MES.get_capabilities. Fix detect stack mode by SNMP.
!4523 Eltex.MES.get_vlans. Use Generic script.
!4615 Eltex.MES. Add 1.3.6.1.4.1.89.53.4.1.7.1 to display_snmp.

Eltex.MES24xx

MR Title
!4381 Fix Eltex.MES24xx.get_interfaces script

Extreme.XOS

MR Title
!4404 Fix Extreme.XOS.get_lldp_neighbors script

Generic

MR Title
!4239 Generic.get_capabilities add ‘SNMP
!4342 Generic.get_arp. Cleanup snmp for py3
!4613 Generic.get_chassis_id. Add ‘LLDP-MIB::lldpLocChassisId’ oid to display_hints.

Huawei.MA5600T

MR Title
!4611 Huawei.MA5600T.get_spanning_tree. Fix waited command.

Huawei.VRP

MR Title
!4422 Huawei.VRP. Add NE8000 version detect.
!4550 Huawei.VRP fix normalize_enable_stp
!4557 Huawei.VRP. Check nexthop type on ConfDB route normalizer.

Juniper.JUNOS

MR Title
!4324 Fix Juniper.JUNOS.get_chassis_id script
!4377 Fix Juniper.JUNOS.get_interfaces script

NAG.SNR

MR Title
!4351 Fix NAG.SNR.get_interfaces script
!4481 Fix NAG.SNR.get_lldp_neighbors script

Qtech.QSW

MR Title
!4576 Fix Qtech.QSW profile

Qtech.QSW2800

MR Title
!4444 Qtech.QSW2800. Add sdiag prompt.
!4542 Fix Qtech.QSW2800.get_version script

Ubiquiti.AirOS

MR Title
!4240 Ubiquiti.AirOS.get_version. Cleanup for py3.

rare

MR Title
!4214 ConfDB tests profile Raisecom.RCIOS.
!4241 Alstec.MSPU.get_version. Fix HappyBaby platform regex.
!4265 Fix ZTE.ZXA10 profile
!4272 Eltex.WOPLR. Add get_interface_type method to profile.
!4279 Update Rotek.BT profile
!4288 Add Enterasys.EOS profile
!4295 Fix metric name
!4302 add snmp in profile Juniper.JUNOSe
!4313 Rotek.BT fix get_metrics
!4335 add snmp in profile Alcatel.TIMOS
!4353 Update ZTE.ZXA10 profile to support C610
!4365 Fix prompt matching in Fortinet.Fortigate profile
!4371 Alcatel.OS62xx.get_version. Set always_prefer to S for better platform detect.
!4376 fix_get_lldp_neighbors_NSN.TIMOS
!4406 Add AcmePacket.NetNet profile.
!4431 noc/noc#1391 Cisco.WLC. Add get_interface_type method.
!4536 add_bras_metrics_Juniper_JUNOSe
!4570 Fix h3c get_switchport
!4578 Eltex.ESR add snmp support
!4583 Update DCN.DCWS profile.py
!4585 Update sa/profiles/DCN/DCWS/get_config.py
!4586 Ericsson.SEOS.get_interfaces. Migrate to Generic SNMP.
!4596 Fix DLink.DxS_Smart profile
!4600 Huawei.VRP3.get_interface_status_ex. Fix return in/out speed as kbit/sec.
!4610 Huawei.VRP3.get_interface_status_ex. Fix trace when SNMP Timeout.
!4617 NSN.TIMOS.get_interfaces. Fix empty MAC on output.

Collections Changes

MR Title
!4277 Add more Juniper part number
!4282 Add new caps - Sensor
!4294 New Environment metrics
!4305 Fix bad json on collection.
!4307 Cleanup HP fm.eventclassificationrule.
!4337 Fix get metrics script for controller
!4345 Fix dev.specs SNMP chassis for Huawei and Generic.
!4411 Add some Juniper models
!4451 Add some Juniper models
!4460 noc/noc#1411 Add PhonePeer MetricScope.
!4499 Fix default username BI dashboard.
!4520 sa.profilecheckrules: Eltex
!4625 Add AcmePacket Vendor.

Deploy Changes

MR Title
!4478 noc/noc#1241 Merge ansible deploy to master repo
!4623 Add liftbridge deployflow
!4637 Fix auth path redirect
!4640 Catch trace on etl loader when delete lost mapping.
!4643 Change start condition