Discover, Visualize and Monitor Link Aggregation Problems – Part 2

By Dale Smith
28th August 2017


Link Aggregation Misconfiguration (Continued)
In this 2-part blog post series we explore several common scenarios of link aggregation misconfiguration. In Part 1, we looked at the scenario in which discrete 10G links that should have been included in a Link Aggregation Group (LAG) were not (for whatever reason). We observed that with the right tooling, it became fairly straightforward to diagnose and resolve this kind of misconfiguration. In particular, the automated discovery and visualization capabilities of NetSpyGlass enabled an immediate visual diagnosis that ports on both ends of a number of discrete 10G links were completely left out of the LAG. This accounted for the sub-par aggregated link performance and enabled an immediate resolution of the problem.

In this Part 2 post, we explore the more challenging case of link aggregation misconfiguration, where visual cues are less likely to help pinpoint the problem, at least initially. We will look at how NetSpyGlass monitoring variables can be used to automate configuration checks programmatically and to generate appropriate alerts and visual cues under well-defined conditions.

IEEE Standard 802.1AX-2014
Before we get started, let’s update our reference to the currently applicable standards governing link aggregation. In the Part 1 post, we referenced IEEE 802.3ad-2000. While this standard is still widely referenced, it has been superseded by IEEE Standard 802.1AX-2008, which has in turn been superseded by IEEE Standard 802.1AX-2014. Incidentally, the IEEE provides downloadable PDF versions of many standards free of charge. To download the current Link Aggregation standard in PDF format, go to the IEEE Standard 802.1AX-2014 download page, scroll to the bottom of the page, select a user type, enter an email address, then download.

Link Aggregation Group (LAG), Revisited
From Section 1.1 of IEEE 802.1AX-2014 we have:

Link Aggregation provides protocols, procedures, and managed objects that allow the following:

  • One or more parallel instances of full-duplex point-to-point links to be aggregated together to form a Link Aggregation Group (LAG), such that a MAC Client can treat the LAG as if it were a single link.
  • A resilient interconnect using multiple full-duplex point-to-point links among one to three nodes in a network and one to three nodes in another, separately administered, network, along with a means to ensure that frames belonging to any given service will use the same physical path in both directions between the two networks.

This standard defines the MAC-independent Link Aggregation capability and general information relevant to specific MAC types that support Link Aggregation. The capabilities defined are compatible with previous versions of this standard.

Link Aggregation Control Protocol (LACP)
And, from Section 6.4 of IEEE 802.1AX-2014 we have:

The Link Aggregation Control Protocol (LACP) provides a standardized means for exchanging information between Partner Systems on a link to allow their Link Aggregation Control instances to reach agreement on the identity of the LAG to which the link belongs, move the link to that LAG, and enable its transmission and reception functions in an orderly manner.

LAG vs LACP
These two concepts (i.e. LAG and LACP) are important to have clearly defined before exploring more challenging link aggregation misconfiguration scenarios. Note that although LACP is defined by the 802.1AX standard, it is not required in order to create a LAG, nor is LACP the same thing as link aggregation. An 802.1AX LAG can be created without employing LACP: LACP is used to construct LAGs dynamically, as opposed to building them statically. Network devices will support static LAGs, dynamic LAGs (via LACP), or both.

One key advantage of LACP is that it enables devices to confirm that both ends of a link are configured for link aggregation. With static link aggregation, a cabling or configuration error could go undetected, resulting in network behavior that is unexpected and very likely undesirable. Both static and dynamic LAGs (via LACP) can detect physical link failures within the LAG and continue to forward traffic using the other functioning links within that LAG. LACP can also detect switch or port failures that would not otherwise result in any kind of loss-of-link notification. Finally, it is worth keeping in mind that when a port is configured as a static member of a LAG, it will neither transmit nor receive LACP messages.

As noted in the Part 1 post, there are situations in which it will not be obvious that a link aggregation misconfiguration even exists. The above discussion of LAG vs LACP provides some insight into why this would happen and why (absent the proper tooling) it is very challenging to diagnose and resolve. Network operators will observe unusual traffic patterns and/or sub-par performance of an aggregated link as their only tangible clue that a potential misconfiguration exists.


The More Challenging Link Aggregation Misconfiguration
In this more difficult-to-diagnose case of link aggregation misconfiguration, ports have been successfully added to a LAG but the aggregated link exhibits unexpected behavior. In the adjacent figure, we see a NetSpyGlass-generated network architecture map containing a number of aggregated links. In the upper left-hand corner of the map we see two core routers with a high-capacity aggregated link connecting them. But, just as in the Part 1 scenario, the network operations team observes unexpected levels of performance, particularly in terms of traffic-carrying capacity. In contrast with the Part 1 misconfiguration scenario, however, there are no immediate visual cues to indicate a possible source of configuration error. Everything appears to be fine: admin and operational status are both “Up”. This is exactly how a network map would appear when ports have been added to a LAG on one side of a connection but (for whatever reason) the corresponding ports have not been properly added (or not added at all) to the appropriate LAG on the other side of the connection.
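To make the failure mode concrete, here is a minimal Python sketch of a one-sided membership check. It is not NetSpyGlass code: the port names, the `one_sided_members` helper and the hand-written inventory are all hypothetical, and it assumes you already know which ports each router has bundled into its LAG.

```python
from typing import List, Set, Tuple

# Hypothetical inventory: each link is (port on router A, port on router B).
Link = Tuple[str, str]


def one_sided_members(links: List[Link],
                      lag_a: Set[str],
                      lag_b: Set[str]) -> List[Link]:
    """Return the links where exactly one end is a member of its LAG."""
    return [(a, b) for a, b in links if (a in lag_a) != (b in lag_b)]


# Four parallel 10G links between two core routers (made-up port names).
links = [
    ("xe-0/0/0", "xe-1/0/0"),
    ("xe-0/0/1", "xe-1/0/1"),
    ("xe-0/0/2", "xe-1/0/2"),
    ("xe-0/0/3", "xe-1/0/3"),
]
lag_a = {"xe-0/0/0", "xe-0/0/1", "xe-0/0/2", "xe-0/0/3"}  # all four bundled on A
lag_b = {"xe-1/0/0", "xe-1/0/1"}                          # two never added on B

mismatched = one_sided_members(links, lag_a, lag_b)
print(mismatched)
```

Every port in this example can still report admin and operational status “Up”, which is why a status-only view of the map gives no hint that two of the four links are not part of the bundle on router B.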

[Figure: NetSpyGlass Discovery & Visualization]


Diagnosis and Resolution
So, how to diagnose and resolve this kind of misconfiguration when (apparently) there is so little to work with? Referring back to our earlier discussion of LACP, it turns out that we can diagnose this kind of misconfiguration only by looking at the device LACP protocol state on either side of the link. Of course the problem with this is that a network operator would have no obvious indication after provisioning a LAG link that anything is misconfigured so they would have no cause to look at LACP protocol states. The consequence of this misconfiguration is as previously mentioned – the effective capacity of the LAG link is reduced.
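For readers who have not looked at LACP state before, the sketch below decodes the one-octet Actor or Partner state carried in an LACPDU, using the flag names from IEEE 802.1AX with bit 0 as the least significant bit. The `is_forwarding` rule (in sync, collecting and distributing) is our own simplifying assumption about what “healthy” means, not a definition taken from the standard or from NetSpyGlass.

```python
from typing import List

# LACP state flags, bit 0 (least significant) first, as laid out for the
# Actor_State / Partner_State octet of an LACPDU in IEEE 802.1AX.
LACP_STATE_FLAGS = (
    "LACP_Activity", "LACP_Timeout", "Aggregation", "Synchronization",
    "Collecting", "Distributing", "Defaulted", "Expired",
)


def decode_lacp_state(octet: int) -> List[str]:
    """Return the names of the flags that are set in a state octet."""
    return [name for bit, name in enumerate(LACP_STATE_FLAGS)
            if octet & (1 << bit)]


def is_forwarding(octet: int) -> bool:
    """Assumption: a member carries traffic only when it is in sync and
    both collecting and distributing frames."""
    required = {"Synchronization", "Collecting", "Distributing"}
    return required.issubset(decode_lacp_state(octet))


healthy = 0x3D    # Activity, Aggregation, Synchronization, Collecting, Distributing
defaulted = 0x47  # Activity, Timeout, Aggregation, Defaulted (no partner heard)

print(decode_lacp_state(healthy), is_forwarding(healthy))
print(decode_lacp_state(defaulted), is_forwarding(defaulted))
```

A port whose peer was never added to the LAG typically ends up in a state like the second one: the link is up, but the port never synchronizes with a partner. One caveat: state values read over SNMP may be encoded with a different bit order than the on-the-wire octet, so check the encoding before reusing this decoder on MIB data.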


Visualizing LAG Misconfiguration with LACP Analysis
However, with the proper tooling, even this most challenging misconfiguration can be diagnosed and resolved in an automated manner. For example, NetSpyGlass “looks at LACP” so the network operator doesn’t need to. It computes a “synthetic” variable called portAggregatorBandwidth that has a value of 100% if LACP shows that all physical ports in the LAG are actually passing traffic. The value of this variable will drop below 100% if any of the LACP-supported ports in the LAG fails to pass traffic for any reason. The operator then has a visual cue for spotting misconfigured LAG links on the network map: simply choose the synthetic variable “portAggregatorBandwidth” in the map legend, and the misconfigured link appears in red (as shown in the adjacent network map). This NetSpyGlass ability to analyze LACP protocol data only applies to devices with support for IEEE8023-LAG-MIB. Because there isn’t a separate LACP MIB, LACP state data resides in the LAG MIB itself.
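As a rough illustration of what a variable like this represents, here is one way such a percentage could be computed. This is a sketch only: the post does not describe how NetSpyGlass actually derives portAggregatorBandwidth, and the `MemberPort` type, its fields and the speed-weighted formula are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MemberPort:
    """Hypothetical view of one physical port configured in a LAG."""
    name: str
    speed_bps: int
    passing_traffic: bool  # e.g. derived from the port's LACP state


def aggregator_bandwidth_percent(members: Iterable[MemberPort]) -> float:
    """Share of configured LAG bandwidth that is actually usable, in percent."""
    members = list(members)
    configured = sum(m.speed_bps for m in members)
    if configured == 0:
        return 0.0  # empty LAG: nothing is usable
    active = sum(m.speed_bps for m in members if m.passing_traffic)
    return 100.0 * active / configured


TEN_GIG = 10_000_000_000
lag = [
    MemberPort("xe-0/0/0", TEN_GIG, True),
    MemberPort("xe-0/0/1", TEN_GIG, True),
    MemberPort("xe-0/0/2", TEN_GIG, True),
    MemberPort("xe-0/0/3", TEN_GIG, False),  # up, but not passing traffic
]
print(aggregator_bandwidth_percent(lag))  # 75.0
```

Weighting by port speed rather than counting ports keeps the number meaningful if a bundle ever mixes speeds, although most LAG implementations require members of equal speed anyway.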

[Figure: NetSpyGlass Discovery & Visualization, with the misconfigured LAG link highlighted in red]


Proactive vs Reactive
Again, we see that with the right tooling, we can “visually” diagnose and resolve even the most challenging LAG misconfigurations. But, how about taking this one step further and proactively recognizing that a misconfiguration exists as part of the LAG provisioning workflow? Couldn’t some kind of alert be created to immediately notify network operators that a LAG misconfiguration problem exists at the outset?

Short answer: yes! But again, only with the right tooling.

NetSpyGlass provides users with a number of monitoring variables that are easily accessible via Python scripts executed by NetSpyGlass’ embedded Python interpreter. This means we can programmatically automate LACP processing as part of a LAG configuration check and trigger an alert under user-defined conditions. This can be accomplished with an alert that monitors device LACP states via the portAggregatorBandwidth variable, which is expected to have a value of 100% when all configured LAG members are online and passing traffic. The value computed for this variable drops below 100% only if a LAG member port is down or misconfigured. Such an alert provides an effective way to proactively identify LACP-supported LAG links that have become degraded due to some kind of misconfiguration.

The example alert below uses a condition function to check if the value of the portAggregatorBandwidth variable is below 100%:

alert(
    name='lagPartiallyDegraded',
    input=import_var('portAggregatorBandwidth'),
    condition=lambda _, value: value < 100,
    description='$alert.deviceName:$alert.componentName :: One or more LAG members have failed, combined bundle bandwidth is below 100%',
    details={},
    duration=300,
    percent_duration=100,
    notification_time=300,
    fan_out=True
)
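To get a feel for how the timing parameters might interact with the condition, here is a standalone simulation in plain Python. It does not use the NetSpyGlass API. The `should_fire` helper is hypothetical, and its reading of `duration` (length of the look-back window in seconds) and `percent_duration` (share of samples in that window that must satisfy the condition) is our assumption for illustration, not documented behavior quoted from this post.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, portAggregatorBandwidth in %)

# Same condition as in the alert above; the first argument is ignored.
condition = lambda _, value: value < 100


def should_fire(samples: List[Sample], now: int,
                duration: int = 300,
                percent_duration: float = 100.0) -> bool:
    """Assumed semantics: fire when at least `percent_duration` percent of
    the samples from the last `duration` seconds satisfy the condition."""
    window = [(ts, v) for ts, v in samples if now - duration < ts <= now]
    if not window:
        return False
    matching = sum(1 for _, v in window if condition(None, v))
    return 100.0 * matching / len(window) >= percent_duration


# One sample per minute over five minutes.
degraded = [(60, 75.0), (120, 75.0), (180, 75.0), (240, 75.0), (300, 75.0)]
flapping = [(60, 75.0), (120, 100.0), (180, 75.0), (240, 75.0), (300, 75.0)]

print(should_fire(degraded, now=300))   # degraded for the whole window
print(should_fire(flapping, now=300))   # one healthy sample breaks the 100% rule
print(should_fire(flapping, now=300, percent_duration=80))
```

Under this reading, requiring 100% of a five-minute window filters out brief dips during provisioning, while lowering `percent_duration` trades that stability for earlier notification.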

Summary & Conclusion
The above scenario represents one of the most challenging LAG misconfiguration scenarios. Having an effective toolset for automated diagnosis and resolution with integrated visualization enables network operators to quickly resolve even the most difficult link aggregation problems. In Part 1 of this post series we saw how NetSpyGlass automated device discovery and visualization provided immediate visual cues that link aggregation misconfiguration was caused by a number of physical 10G ports not being added to the LAG on both sides of the connection.

In Part 2, we saw how to leverage the LACP state analysis capabilities that NetSpyGlass provides to diagnose when physical ports are added to a LAG on one side of a connection but not added on the other side. We saw that this scenario was particularly challenging because there were no immediate visual cues to the misconfiguration. In each of the scenarios we explored, the aggregated link performance failed to meet expectations.

Finally, we went one step further to show how NetSpyGlass can transform reactive diagnosis and resolution into a more proactive workflow to ensure proper LAG configuration during provisioning. This was accomplished with a Python script that triggers an alert and visual cue on the network map when the value of a monitoring variable related to LAG capacity falls below 100%.

Again, the right tooling makes all the difference. NetSpyGlass with its embedded Python interpreter provides NetOps teams with workflow automation capabilities that save time, reduce the risk of unplanned downtime and ensure that network performance expectations are consistently achieved.

About the Author

Dale Smith

Dale Smith enjoys operating at the intersection of web strategy and digital infrastructure. With a particular interest in monitoring automation for 21st century networks, he aims to be a helpful resource for networking professionals preparing for the software-defined future.
