NOC 20.4

The 20.4 release contains [225](https://code.getnoc.com/noc/noc/merge_requests?scope=all&state=merged&milestone_title=20.4) bugfixes, optimisations and improvements.

Highlights

Generic Message Exchange

NOC can send notifications by email or Telegram via Notification Groups on alarms and configuration changes. Notifications are useful for drawing human attention to a possible problem. To deliver data to external systems, NOC uses the DataStream approach: external systems have to pull changes and process them according to their own logic.

NOC 20.4 generalises all data pushed to external systems into the concept of messages. A message is a piece of data that can be passed from NOC to the outside world. Messages can be of different types:

  • alarms
  • object inventory data
  • configuration
  • configuration change
  • reboot
  • new object
  • system login
  • etc.

NOC can generate messages under certain conditions. Both humans and soulless robots may be interested in messages, so we need some kind of routing.

NOC 20.4 introduces a new service called Message Exchanger, or mx. Like a mail server, mx receives a message, processes its headers and decides where to route it. mx relies on a family of sender processes. Each kind of sender delivers messages outside the system over a particular exchange protocol, hiding the delivery details from mx. mx can also transform messages or apply templates to convert a message to the desired format before delivery. Together, mx, the senders, and the message generation and transport conventions form a part of NOC called Generalized Message Exchange, or GMX.

NOC 20.4 introduces the kafkasender service, used to push data to a Kafka message bus. We are planning to convert the other senders (mailsender, tgsender, etc.) to GMX in NOC 21.1.
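
The header-based routing described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the actual mx implementation: the `Message`, `Route` and `Exchanger` classes, the `Message-Type` header name and the way senders are registered are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def identity(body: bytes) -> bytes:
    """Default transformation: pass the body through unchanged."""
    return body


@dataclass
class Message:
    headers: Dict[str, str]  # hypothetical header names, e.g. "Message-Type"
    body: bytes


@dataclass
class Route:
    match: Dict[str, str]  # every listed header must be equal for the route to apply
    sender: str  # name of the sender that receives the message
    transform: Callable[[bytes], bytes] = identity  # optional format conversion


class Exchanger:
    """Toy mx: inspects headers and hands the message to the matching senders."""

    def __init__(self) -> None:
        self.routes: List[Route] = []
        self.senders: Dict[str, Callable[[bytes], None]] = {}

    def route(self, msg: Message) -> List[str]:
        delivered = []
        for r in self.routes:
            if all(msg.headers.get(k) == v for k, v in r.match.items()):
                # The sender hides the delivery protocol from the exchanger.
                self.senders[r.sender](r.transform(msg.body))
                delivered.append(r.sender)
        return delivered


# Usage: two in-memory "senders" stand in for real delivery services.
outbox: Dict[str, List[bytes]] = {"kafka": [], "mail": []}
mx = Exchanger()
mx.senders["kafka"] = outbox["kafka"].append
mx.senders["mail"] = outbox["mail"].append
mx.routes.append(Route(match={"Message-Type": "alarm"}, sender="kafka"))
mx.routes.append(
    Route(match={"Message-Type": "alarm"}, sender="mail", transform=lambda b: b"ALARM: " + b)
)
mx.routes.append(Route(match={"Message-Type": "reboot"}, sender="kafka"))

delivered = mx.route(Message({"Message-Type": "alarm"}, b"link down"))
```

Keeping the protocol-specific code inside the senders means that adding a new destination requires only a new sender and a route, without touching the routing logic.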

Kafka Integration

NOC 20.4 introduces the kafkasender service, a part of GMX. Kafka has become the mainstream message bus in telecom operations, and NOC is now able to push all data available via DataStream to Kafka for further routing and processing, reducing the number of mutual system-to-system integrations.

Biosegmentation

Biosegmentation was introduced in NOC 20.3 as an ad-hoc segmentation process. The process relies on a series of trials. Each trial can lead to merging or fixing the structure of the segment tree. The current implementation relies on inter-segment links, but sometimes the segment hierarchy must be established before the linking process.

NOC 20.4 introduces an additional MAC-based biosegmentation approach, called Vacuum Bulling, which builds the segment hierarchy based on MAC addresses collected on interfaces.

Ordered Message Queue

NOC uses NSQ as its internal message queue. This lightweight, high-performance solution usually shows good results, but over time its architectural corner cases have become more and more visible:

  • NSQ is designed to be an always-on-dial solution: nsqd runs on every host and communicates with the publisher via the localhost loopback. In the modern container world this is a bug, not a feature. Relying on an absolutely reliable connection between publisher and broker has become unacceptable.
  • Subscribers have to communicate with the nsqlookup service to find the hosts containing data, and then establish direct connections to them. The official Python NSQ client uses up to 5 TCP connections, so the number of connections grows fast as the number of producers and subscribers grows.
  • The official Python NSQ client's error handling is far from ideal. The code base is old, obscure and hard to maintain. No asyncio version is available.
  • No fault tolerance. A failed nsqd leads to lost messages. There is no message replication at all.
  • Out-of-order messages. Message order may change due to the internal nsqd implementation and to client logic. Applications like fault management rely on message order: closing events must follow opening ones, otherwise hanging alarms will pollute the system.
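
The last point can be illustrated with a toy model. It is a sketch of the general problem only, not NOC's fault management logic; the `active_alarms` function and the event tuples are hypothetical.

```python
from typing import Iterable, Set, Tuple

Event = Tuple[str, str]  # (kind, alarm key); kind is "open" or "close"


def active_alarms(events: Iterable[Event]) -> Set[str]:
    """Apply opening and closing events in arrival order, return alarms left open."""
    active: Set[str] = set()
    for kind, key in events:
        if kind == "open":
            active.add(key)
        else:
            # A close for an alarm that is not open yet has nothing to close.
            active.discard(key)
    return active


in_order = [("open", "link-1"), ("close", "link-1")]
swapped = [("close", "link-1"), ("open", "link-1")]

print(active_alarms(in_order))  # set()
print(active_alarms(swapped))  # {'link-1'}
```

With the two events swapped, the closing event arrives first and is lost, so the alarm stays open forever.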

During our research we decided that we need a message system with a commit-log approach. Though Kafka is the industry standard, its dependency on the JVM and Zookeeper may be a burden. We settled on Liftbridge, a clean and simple implementation of the proven Kafka storage and replication algorithms.
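
The commit-log idea itself fits in a few lines. The sketch below is an in-memory toy meant only to show why ordering is preserved; it is not the Liftbridge API, and the `CommitLog` class and its method names are hypothetical.

```python
from typing import Dict, List, Tuple


class CommitLog:
    """Toy append-only log: each message keeps its offset and is never reordered."""

    def __init__(self) -> None:
        self._entries: List[bytes] = []
        self._cursors: Dict[str, int] = {}  # next offset to read, per subscriber

    def append(self, data: bytes) -> int:
        self._entries.append(data)
        return len(self._entries) - 1  # offset of the stored message

    def read(self, subscriber: str, limit: int = 10) -> List[Tuple[int, bytes]]:
        # Reading does not remove anything; it only walks the log from the cursor.
        start = self._cursors.get(subscriber, 0)
        return list(enumerate(self._entries[start:start + limit], start))

    def commit(self, subscriber: str, offset: int) -> None:
        # Remember progress so that a restarted subscriber resumes after `offset`.
        self._cursors[subscriber] = offset + 1


log = CommitLog()
for payload in (b"open", b"close"):
    log.append(payload)

first = log.read("fm", limit=1)  # [(0, b"open")]
log.commit("fm", first[-1][0])
rest = log.read("fm")  # [(1, b"close")]
replay = log.read("audit")  # a new subscriber sees the whole history in order
```

Because subscribers only move a cursor over an immutable sequence, every subscriber observes the messages in the order in which they were appended.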

We have ported the events topics to Liftbridge, fixing the critical event ordering problem. GMX topics use Liftbridge too. The next release (21.1) will address the remaining topics.

FastAPI

We have started the migration from Tornado to FastAPI. The main motivations are:

  • Tornado brought generator-based asynchronous programming to Python 2. Python 3 introduced native asynchronous programming along with the asyncio library. Later Tornado versions are simple wrappers atop asyncio.
  • FastAPI uses Pydantic for request and response validation. We found Pydantic very useful during our ETL refactoring.
  • FastAPI generates an OpenAPI/Swagger schema, improving integration capabilities.
  • FastAPI is fast.

We have ported the login service to FastAPI. JWT has replaced Tornado's signed cookies. We have also implemented a set of OAuth2-based endpoints for our next-generation UI.
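
For readers unfamiliar with JWT, the sketch below shows how an HS256-signed token is built and checked using only the standard library. It is an illustration of the token format, not the code of the login service: the secret, the claim values and the helper names are assumptions, the sketch accepts HS256 only, and production code should rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Any, Dict


def _b64(data: bytes) -> str:
    # JWT uses URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    # Restore the padding stripped by _b64 before decoding.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def encode(claims: Dict[str, Any], secret: bytes) -> str:
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps(claims).encode())
    signature = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64(signature)}"


def decode(token: str, secret: bytes, audience: str) -> Dict[str, Any]:
    header, payload, signature = token.split(".")
    expected = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, _unb64(signature)):
        raise ValueError("bad signature")
    claims = json.loads(_unb64(payload))
    if claims.get("aud") != audience:
        raise ValueError("wrong audience")
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    return claims


SECRET = b"not-a-real-secret"  # placeholder value for the example
token = encode({"sub": "admin", "aud": "auth", "exp": int(time.time()) + 3600}, SECRET)
claims = decode(token, SECRET, audience="auth")
```

Unlike a signed cookie, the token carries its own claims (subject, audience, expiry) in a standard format, so any service that knows the key can validate it.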

ETL Improvements

ETL used to rely on the CSV format to store extracted data. Though it is simple and wraps SQL responses in an obvious way, it has some limitations:

  • Metadata of the extracted fields is stored outside of the extractor, in the loader.
  • Field order is hardcoded in the loader.
  • Fields have no type information, leading to leaky validation.
  • No native way to pass complex data structures, like lists and nested documents.
  • Extractors must return empty data for long-deprecated fields.

NOC 20.4 introduces a new extractor API. Instead of lists passed to CSV, the extractor returns pydantic model instances. Pydantic models are defined in separate modules and reused by both extractors and loaders, so the interface between extractor and loader becomes well-defined. Models perform data validation at both the extraction and load stages, so an error in an extractor leads to an informative error message and stops the process.

ETL now uses the JSON Lines format (jsonl): one JSON structure per row, separated by newlines. This makes it possible to store structures of arbitrary complexity. We have even provided a tool to convert legacy extracted data to the new format.
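
A rough sketch of the idea of typed records stored as JSON Lines is shown below. To stay dependency-free it uses a standard-library dataclass as a stand-in for a pydantic model; the `ManagedObjectRow` record, its fields and the helper functions are hypothetical and do not reflect the actual NOC ETL models.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class ManagedObjectRow:
    """Hypothetical record shared by extractor and loader."""

    id: str
    name: str
    tags: List[str] = field(default_factory=list)  # nested data, awkward in CSV

    def __post_init__(self) -> None:
        # Minimal type check; pydantic does this (and much more) declaratively.
        if not isinstance(self.id, str) or not isinstance(self.name, str):
            raise TypeError("id and name must be strings")


def dump_jsonl(rows: Iterable[ManagedObjectRow]) -> str:
    # One JSON document per line.
    return "\n".join(json.dumps(asdict(r)) for r in rows)


def load_jsonl(text: str) -> Iterator[ManagedObjectRow]:
    for line in text.splitlines():
        if line.strip():
            # Validation runs again on load, because the same model is reused.
            yield ManagedObjectRow(**json.loads(line))


data = dump_jsonl(
    [ManagedObjectRow("1", "sw-core", ["core", "spb"]), ManagedObjectRow("2", "sw-access")]
)
rows = list(load_jsonl(data))
```

Fields are addressed by name rather than by position, so the column order no longer matters and a deprecated field can simply be dropped from the model.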

SNMP Rate Limiting

NOC 20.4 allows limiting the rate of SNMP requests based on profile or platform settings. This reduces the impact on platforms with a weak CPU or a slow control-to-dataplane bus.
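
How the limit is implemented inside NOC is not described here. As a general illustration, a request rate can be capped by spacing requests at fixed intervals; the `RateLimiter` class below is a hypothetical helper, and the injected clock exists only to make the demonstration predictable.

```python
import time
from typing import Callable


class RateLimiter:
    """Spaces requests at least 1/rate seconds apart."""

    def __init__(self, rate: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.interval = 1.0 / rate  # seconds between two requests
        self.clock = clock
        self.next_allowed = clock()

    def delay(self) -> float:
        """Reserve the next slot and return how long the caller must wait for it."""
        now = self.clock()
        wait = max(0.0, self.next_allowed - now)
        self.next_allowed = max(now, self.next_allowed) + self.interval
        return wait


# A frozen clock shows the spacing for a limit of 4 requests per second.
limiter = RateLimiter(rate=4, clock=lambda: 0.0)
delays = [limiter.delay() for _ in range(3)]
print(delays)  # [0.0, 0.25, 0.5]
```

In real use the caller would sleep for the returned value, for example `time.sleep(limiter.delay())`, before sending each request to the device.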

orjson

orjson is used instead of ujson for JSON serialization/deserialization.

New profiles

  • KUB Nano
  • Qtech.QFC

Migration

Tower Upgrade

Please upgrade Tower to 1.0.0 or later before continuing the NOC installation/upgrade process. See the [Tower upgrade process documentation](https://code.getnoc.com/noc/tower/-/blob/master/UPDATING.md) for more details.

Older versions of Tower will stop the deployment with the following error message

Liftbridge/NATS

NOC 20.4 introduces the Liftbridge service for the ordered message queue. You should deploy at least 1 Liftbridge and 1 NATS service instance. See Tower's service configuration section for more details.

ETL

Run the fix after the upgrade:

`$ ./noc fix apply fix_etl_jsonl`

New features

| MR | Title |
| --- | --- |
| MR1668 | Added function get alarms for controllers and devices for periodic job. |
| MR4223 | FastAPI login service |
| MR4256 | Add Project to ETL |
| MR4274 | New profile Qtech.QFC |
| MR4290 | Liftbridge client |
| MR4361 | #1363 ifdesc: Interface autocreation |
| MR4388 | Add new controller profile KUB Nano |
| MR4398 | mx service |
| MR4403 | kafkasender service |
| MR4473 | #1368 Model Interface scopes |
| MR4488 | #892 ETL JSON format |
| MR4519 | noc/noc#1356 SNMP Rate Limit |
| MR4538 | Configurable LDAP server policies |
| MR4567 | Biosegmentation: Vacuum bulling |

Improvements

| MR | Title |
| --- | --- |
| MR4225 | Fix ddash refid |
| MR4233 | Allow alternative locations for binary speedup modules |
| MR4236 | Catch when sentry-sdk module enabled but not installed. |
| MR4246 | Fix Qtech.BFC profile |
| MR4261 | noc/noc#1304 Replace ujson with orjson |
| MR4264 | runtime optimization ReportMaxMetrics |
| MR4275 | ElectronR.KO01M profile scripts |
| MR4278 | noc/noc#1383 Add IfPath collator to confdb |
| MR4280 | noc/noc#1381 Add alarm_consequence_policy to TTSystem settings. |
| MR4281 | #1384 Add source-ip aaa hints. |
| MR4287 | Add round argument to metric scale function |
| MR4293 | Debian-based docker image |
| MR4296 | Change python to python3 when use ./noc |
| MR4314 | Update Card for Sensor Controller |
| MR4320 | Fill capabilities for beef. |
| MR4338 | New Grafana dashboards |
| MR4344 | Profile fix controllers |
| MR4348 | exp_decay window function |
| MR4349 | Controller/fix2 |
| MR4354 | add_interface-type_Juniper_JUNOSe |
| MR4358 | Fix Qtech.BFC profile |
| MR4364 | LiftBridgeClient: Proper handling of message headers |
| MR4369 | LiftBridgeClient: fetch_metadata() stream and wait_for_stream parameters |
| MR4380 | Add to_json for thresholdprofile |
| MR4383 | Update threshold handler |
| MR4384 | Add collators to some profiles. |
| MR4389 | Electron fix profile |
| MR4391 | add new metric Qtech.BFC |
| MR4394 | fix some controllers ddash/metrics |
| MR4396 | Fix inerfaces name Qtech.BFC |
| MR4399 | Up report MAX_ITERATOR to 800 000. |
| MR4402 | mx: Use FastAPIService |
| MR4405 | liftbridge cursor persistence api |
| MR4407 | add_columns_total_reportmaxmetrics |
| MR4416 | Add csv+zip format to ReportDetails. |
| MR4417 | Add Long Alarm Archive options to ReportAlarm, from Clickhouse table. |
| MR4428 | Add available_only options to ReportDiscoveryTopologyProblem. |
| MR4432 | Reset NetworkSegment TTL cache after remove. |
| MR4433 | Change is_uplink criterias priority on segment MAC discovery. |
| MR4439 | fix_reportmaxmetrics |
| MR4447 | Add octets_in_sum and octets_out_sum columns to ReportMetrics. |
| MR4453 | ConfDB syslog |
| MR4455 | Fix controllers profiles, ddash |
| MR4457 | Fix get_iface_metrics |
| MR4462 | noc/noc#1392 Add search port by contains ifdescription token to ifdecr discovery. |
| MR4464 | LiftBridge client: Connection pooling |
| MR4470 | Add ReportMovedMacApplication application. |
| MR4475 | Add sorted to tags application. |
| MR4477 | noc/noc#1416 Extend ConfDB meta section. |
| MR4479 | Add get_confdb_query method to ManagedObjectSelector and MatchPrefix ConfDB function. |
| MR4480 | Add csv_zip file format to MetricsDetail Report. |
| MR4483 | noc/noc#1397 Additional biosegtrial criteria to policy. |
| MR4486 | Add migrate_ts field to ReportMovedMac. |
| MR4501 | noc/noc#1428 Add InterfaceDiscoveryApplicator for fill ConfDB info from interface discovery. |
| MR4508 | add_csvzip_reportmaxmetrics |
| MR4511 | Fix ./noc discovery for LB |
| MR4515 | noc/noc#1432 lb client: Configurable message size limit |
| MR4516 | fix csv_import view |
| MR4517 | Additional options to segment command |
| MR4535 | Bump networkx/numpy requirements |
| MR4539 | lb client: increased resilience |
| MR4547 | Add JOB_CLASS param to core.defer util. |
| MR4549 | ETL model Reference |
| MR4551 | add column reboots in fm.reportalarmdetail |
| MR4553 | fix processing trunk port vlan for HP A3100-24 (v5.20.99) |
| MR4565 | Add ttl-policy argument to link command. |
| MR4571 | Filter Multicast MACs on Moved MAC report. |
| MR4573 | Add api_unlimited_row_limit param |
| MR4579 | liftBridge: publish_async waits for all the acks |
| MR4582 | noc/noc#1371 Add schedule_discovery_config handler to events.discovery. |
| MR4592 | noc/noc#1400 Migrate InterfaceClassification to ConfDB. |
| MR4602 | Add MatchAllVLAN and MatchAnyVLAN function to ConfDB. |
| MR4607 | Bump pytest version |
| MR4624 | add metrics Subscribers \| Summary Alcatel.TIMOS |
| MR4629 | noc/noc#1440 Use all macs on 'Discovery ID cache poison' report. |
| MR4630 | Convert limit from dcs to int. |
| MR4632 | Add Telephony SIP metrics graph. |
| MR4633 | Always uplinks calculate. |

Bugfixes

| MR | Title |
| --- | --- |
| MR4249 | Fix card MO |
| MR4251 | Fix status RNR |
| MR4258 | Change field_num on ReportObjectStat |
| MR4269 | noc/noc#1374 Fix typo on datastream format check. |
| MR4285 | Fix Profile Check Summary typo. |
| MR4303 | #1335 ConfDB: Fix and inside or combination |
| MR4310 | Fix RNR affected AD |
| MR4319 | Add err_status to beef snmp_getbulk_response method. |
| MR4321 | Convert oid on snmp raw_varbinds. |
| MR4322 | Fix event clean |
| MR4327 | Convert set to list on orjson dumps. |
| MR4328 | Add xmac discovery to ReportDiscoveryResult. |
| MR4363 | ./noc migrate-liftbridge: Do not create streams for disabled services |
| MR4368 | Fix hash_int() |
| MR4373 | Fix typo on Calcify Biosegmentation policy. |
| MR4409 | Add get_pool_partitions method to TrapCollectorService. |
| MR4418 | Add id field to project etl loader. |
| MR4419 | Fix multiple segment args on discovery command. |
| MR4423 | noc/noc#1399 Delete Permissions and Favorites on wipe user. |
| MR4424 | noc/noc#1375 Fix DEFAULT_STENCIL use on SegmentTopology. |
| MR4425 | noc/noc#1396 AlarmEscalation. Use item delay for consequence escalation. |
| MR4426 | Fix extapp group regex splitter to non-greedy. |
| MR4430 | Fix ManagedObject _reset_caches key for _id_cache. |
| MR4452 | noc/noc#1406 Use system username for JWT. |
| MR4461 | noc/noc#1229 Fix user cleanup Django Admin Log. |
| MR4472 | Add audience param to is_logged jwt.decode. |
| MR4474 | Add 120 sec to out_of_order escalation time. |
| MR4485 | noc/noc#688 Fix invalidate l1 cache for ManagedObject. |
| MR4492 | Skipping files if already compressed on destination. |
| MR4497 | noc/noc#1427 Fix whois ARIN url. |
| MR4498 | Fix object data use. |
| MR4502 | Move orjson defaults to jsonutils. |
| MR4505 | Bump ssh2-python to 0.23. |
| MR4506 | pm/utils -> Fix dict |
| MR4507 | Some etl loader fixes. |
| MR4513 | noc/noc#1423 Convert pubkey to bytes. |
| MR4514 | Convert empty object data to list on 0020 migration. |
| MR4518 | Fix vendors and handlers migrations |
| MR4522 | Fix typo on ifdescr discovery. |
| MR4524 | #1312 Consistent VPN ID generation |
| MR4540 | Fix customfields for mongoengine. |
| MR4555 | Revert uvicorn to 0.12.1. |
| MR4561 | Fix typo on interfaceprofile UI Application. |
| MR4564 | Fix trace when execute other script that command on MRT. |
| MR4569 | Fix typo on MRT service. |
| MR4575 | Add static_service_groups and static_client_groups clean_map to managedobject etl loader. |
| MR4590 | Fix login cookie ttl |
| MR4594 | Fix ETL loader change. |
| MR4595 | Fix extra filter when set extra order. |
| MR4598 | Fix datetime field on Service ETL model. |
| MR4614 | Fix SNMP_GET_OIDS on get_chassis_id scripts to list. |
| MR4627 | noc/noc#1439 Fix tag contains query for non latin symbol. |

Code Cleanup

| MR | Title |
| --- | --- |
| MR4254 | Cleanup flake. |
| MR4301 | Fix vendor docs test |
| MR4317 | Updated .dockerignore |
| MR4360 | Remove unused dependencies: tornadis, mistune |
| MR4362 | Update blinker, bsdiff, cachetools, crontab, progressbar2, psycopg2, python-dateutil versions |
| MR4465 | Remove legacy scripts/ci-run |
| MR4496 | Fix formatting |
| MR4533 | Bump requirements |
| MR4587 | Fix collect beef for orjson. |
| MR4589 | Fix some lint errors |
| MR4622 | Fix Service etl model. |

Profile Changes

Cisco.IOS

| MR | Title |
| --- | --- |
| MR4316 | Update Cisco.IOS profile to support more physical interfaces |

Cisco.IOSXR

| MR | Title |
| --- | --- |
| MR4408 | added interfacetypes for IOSXR platform |

DLink.DxS

| MR | Title |
| --- | --- |
| MR4355 | DLink.DxS.get_metrics. Fix SNMP Error when 'CPU Usage' metric. |
| MR4434 | Fix Dlink.DxS profile. |

EdgeCore.ES

| MR | Title |
| --- | --- |
| MR4556 | EdgeCore.ES.get_spanning_tree. Fix getting port_id for Trunk interface. |

Eltex.MES

| MR | Title |
| --- | --- |
| MR4217 | test tacacs1.yml crashed. AssertionError: assert \[\] == \[(right syntax)\] |
| MR4262 | Eltex.MES.get_capabilities. Fix detect stack mode by SNMP. |
| MR4523 | Eltex.MES.get_vlans. Use Generic script. |
| MR4615 | Eltex.MES. Add 1.3.6.1.4.1.89.53.4.1.7.1 to display_snmp. |

Eltex.MES24xx

| MR | Title |
| --- | --- |
| MR4381 | Fix Eltex.MES24xx.get_interfaces script |

Extreme.XOS

| MR | Title |
| --- | --- |
| MR4404 | Fix Extreme.XOS.get_lldp_neighbors script |

Generic

| MR | Title |
| --- | --- |
| MR4239 | Generic.get_capabilities add SNMP \| OID \|EnterpriseID len check. |
| MR4342 | Generic.get_arp. Cleanup snmp for py3 |
| MR4613 | Generic.get_chassis_id. Add 'LLDP-MIB::lldpLocChassisId' oid to display_hints. |

Huawei.MA5600T

| MR | Title |
| --- | --- |
| MR4611 | Huawei.MA5600T.get_spanning_tree. Fix waited command. |

Huawei.VRP

| MR | Title |
| --- | --- |
| MR4422 | Huawei.VRP. Add NE8000 version detect. |
| MR4550 | Huawei.VRP fix normalize_enable_stp |
| MR4557 | Huawei.VRP. Check nexthop type on ConfDB route normalizer. |

Juniper.JUNOS

| MR | Title |
| --- | --- |
| MR4324 | Fix Juniper.JUNOS.get_chassis_id script |
| MR4377 | Fix Juniper.JUNOS.get_interfaces script |

NAG.SNR

| MR | Title |
| --- | --- |
| MR4351 | Fix NAG.SNR.get_interfaces script |
| MR4481 | Fix NAG.SNR.get_lldp_neighbors script |

Qtech.QSW

| MR | Title |
| --- | --- |
| MR4576 | Fix Qtech.QSW profile |

Qtech.QSW2800

| MR | Title |
| --- | --- |
| MR4444 | Qtech.QSW2800. Add sdiag prompt. |
| MR4542 | Fix Qtech.QSW2800.get_version script |

Ubiquiti.AirOS

| MR | Title |
| --- | --- |
| MR4240 | Ubiquiti.AirOS.get_version. Cleanup for py3. |

rare

| MR | Title |
| --- | --- |
| MR4214 | ConfDB tests profile Raisecom.RCIOS. |
| MR4241 | Alstec.MSPU.get_version. Fix HappyBaby platform regex. |
| MR4265 | Fix ZTE.ZXA10 profile |
| MR4272 | Eltex.WOPLR. Add get_interface_type method to profile. |
| MR4279 | Update Rotek.BT profile |
| MR4288 | Add Enterasys.EOS profile |
| MR4295 | Fix metric name |
| MR4302 | add snmp in profile Juniper.JUNOSe |
| MR4313 | Rotek.BT fix get_metrics |
| MR4335 | add snmp in profile Alcatel.TIMOS |
| MR4353 | Update ZTE.ZXA10 profile to support C610 |
| MR4365 | Fix prompt matching in Fortinet.Fortigate profile |
| MR4371 | Alcatel.OS62xx.get_version. Set always_prefer to S for better platform detect. |
| MR4376 | fix_get_lldp_neighbors_NSN.TIMOS |
| MR4406 | Add AcmePacket.NetNet profile. |
| MR4431 | noc/noc#1391 Cisco.WLC. Add get_interface_type method. |
| MR4536 | add_bras_metrics_Juniper_JUNOSe |
| MR4570 | Fix h3c get_switchport |
| MR4578 | Eltex.ESR add snmp support |
| MR4583 | Update DCN.DCWS profile.py |
| MR4585 | Update sa/profiles/DCN/DCWS/get_config.py |
| MR4586 | Ericsson.SEOS.get_interfaces. Migrate to Generic SNMP. |
| MR4596 | Fix DLink.DxS_Smart profile |
| MR4600 | Huawei.VRP3.get_interface_status_ex. Fix return in/out speed as kbit/sec. |
| MR4610 | Huawei.VRP3.get_interface_status_ex. Fix trace when SNMP Timeout. |
| MR4617 | NSN.TIMOS.get_interfaces. Fix empty MAC on output. |

Collections Changes

| MR | Title |
| --- | --- |
| MR4277 | Add more Juniper part number |
| MR4282 | Add new caps - Sensor \| Controller |
| MR4294 | New Environment metrics |
| MR4305 | Fix bad json on collection. |
| MR4307 | Cleanup HP fm.eventclassificationrule. |
| MR4337 | Fix get metrics script for controller |
| MR4345 | Fix dev.specs SNMP chassis for Huawei and Generic. |
| MR4411 | Add some Juniper models |
| MR4451 | Add some Juniper models |
| MR4460 | noc/noc#1411 Add PhonePeer MetricScope. |
| MR4499 | Fix default username BI dashboard. |
| MR4520 | sa.profilecheckrules: Eltex \| MES \| MES5448 sysObjectID.0 |
| MR4625 | Add AcmePacket Vendor. |

Deploy Changes

| MR | Title |
| --- | --- |
| MR4478 | noc/noc#1241 Merge ansible deploy to master repo |
| MR4623 | Add liftbridge deployflow |
| MR4637 | Fix auth path redirect |
| MR4640 | Catch trace on etl loader when delete lost mapping. |
| MR4643 | Change start condition |