Testing and monitoring MSTP in a heterogeneous network

Introduction


As any network grows, the administrator sooner or later faces, among other things, three problems: accidental outages caused by broken links, loops appearing in the switch topology, and insufficient bandwidth on individual links.

To combat these evils, humanity, as you know (in particular from several articles on Habr, Wikipedia, and plenty of other places), invented and uses various versions of the Spanning Tree protocol. The general idea is that switches in a network with more or less arbitrary connectivity collectively decide, by certain rules, which links between them to use for forwarding packets.

Pros and cons


It is worth noting that from time to time people wonder whether it is needed at all (here, for example). The various thoughts on this come down (as far as I know) to three ideas:

  1. Let's put duplicate links everywhere, with all sorts of LAG / LACP aggregated pairs of wires
  2. Forget layer 2, we will route everything at layer 3
  3. Let's live on real or virtual switch stacks


It is clear that every network has its own “design considerations” and it never hurts to think twice, but each of these approaches has its drawbacks. The first raises the cost of infrastructure in real conditions if fault tolerance has to be maintained. An example: there are ten switches in a potential “ring”, and the optics are already laid. Cheap, 4 fibers. If you want two independent links to each switch, which are then bundled into an aggregated interface, you will have to either lay a lot of new cable or multiplex several wavelengths onto one fiber, which is not cheap either. And if you “route everything”, the switches will have to be replaced with routers (I exaggerate, but the point stands) and you get extra latency out of nowhere. Alas.

Juniper has working and reliable distributed (up to 80 km, it seems) switch stacks. But they cost as much as an airplane. So that is out.

Experience, the son of hard mistakes


After reading and looking at all this, we decided to try running it ourselves. Judging by the first impression from the various manuals, everything looked rosy: set it up correctly and it will more or less sort itself out and fly.
Our arsenal consisted of Cisco 2960s and D-Links of various kinds. We wanted happiness in the form of MSTP for a couple of VLANs. There was no test bench, so we tried to assemble everything on the live network (at night, at minimum load, and so on). Why MSTP specifically? Because it is at least some kind of standard. There is a chance of running the system on equipment from different vendors without major losses. And, again, of putting otherwise blocked links to partial use.

It did not work on the first attempt: Cisco takes a rather long time to reconverge MSTP, and the slightest mistake in assigning VLANs to instances means the system does not come up and you are likely to lose management access.

We realized that a spanning tree without monitoring and a quick way to see what state it is in right now is worthless, just like a RAID array, for example.

Test bench


We rolled back the configuration and put a Cisco 3750 stack in the core, which removed the performance issues of the pair of 2960s and postponed the bandwidth problems for a good while, leaving only the question of link redundancy.

We assembled a test bench of three 2960s and two D-Links, disconnected from the main network, and began to play.

Tools


First, ports were set aside on one of the Ciscos to manage all the other switches, and the management interfaces on them were placed in a separate VLAN that we did not plan to run through MSTP, so as not to lose connectivity with the equipment during the experiments.

It turned out that some D-Link models support a very limited number of MST instances, which makes life harder but not impossible. The ones we have support up to 7, alas.

Next, scripts were written in Perl with Net::Telnet that can do two key things:

  1. Automatically and uniformly configure switches of different models
  2. Capture information sufficient to display the state of the tree


In case it comes in handy for someone, here are the minimal commands for D-Link as an example:

config stp version mstp
config stp mst_config_id name %cfname revision_level %revision
create stp instance_id 1
config stp instance_id 1 add_vlan %inst1vlans
create stp instance_id 2
config stp instance_id 2 add_vlan %inst2vlans
create stp instance_id 3
config stp instance_id 3 add_vlan %inst3vlans
create stp instance_id 4
config stp instance_id 4 add_vlan %inst4vlans
create stp instance_id 5
config stp instance_id 5 add_vlan %inst5vlans
create stp instance_id 6
config stp instance_id 6 add_vlan %inst6vlans
config stp ports 1:1-1:26 state enable
enable stp
config stp trap new_root enable
config stp trap topo_change enable

(This particular example is for the nominally stackable D-Links. On non-stackable models the port numbers are written without the ":".)
And for Cisco:

conf t
no spanning-tree mst configuration
spanning-tree mode mst
spanning-tree mst configuration
 name %cfname
 revision %revision
 instance 1 vlan %inst1vlans
 instance 2 vlan %inst2vlans
 instance 3 vlan %inst3vlans
 instance 4 vlan %inst4vlans
 instance 5 vlan %inst5vlans
 instance 6 vlan %inst6vlans
exit
exit


Instead of %cfname we substitute the configuration name, and instead of %revision the revision number (a natural number, one or above).
%inst1vlans is a comma-separated list of VLAN tags for the first instance.
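
The placeholder substitution described above can be sketched roughly as follows. This is only an illustration in Python, not the original Perl scripts; the `render` helper, the shortened template, and the sample values (`lab`, the VLAN numbers) are all hypothetical.

```python
import re

# Shortened copy of the Cisco template from the text (two instances only).
CISCO_TEMPLATE = """\
conf t
spanning-tree mode mst
spanning-tree mst configuration
 name %cfname
 revision %revision
 instance 1 vlan %inst1vlans
 instance 2 vlan %inst2vlans
exit
exit"""


def render(template, values):
    """Replace every %name placeholder with values[name].

    A placeholder with no matching key raises KeyError, so a half-filled
    config is never produced silently.
    """
    return re.sub(r"%(\w+)", lambda m: str(values[m.group(1)]), template)


# Hypothetical sample values, for illustration only.
cisco_config = render(CISCO_TEMPLATE, {
    "cfname": "lab",
    "revision": 1,
    "inst1vlans": "10,20",
    "inst2vlans": "30",
})
```

Feeding the same values dictionary to both the D-Link and the Cisco template is one way to keep the name, revision and VLAN lists from drifting apart between vendors.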

For fine tuning (so that load balancing actually kicks in, traffic spreads across the links, and so on) you need to go through the ports and set priorities. This is better done by hand.

How to look at it?


Ideally, you could poll the different switches over SNMP and see more or less the same data in more or less the same tables. But despite the standard protocols (both SNMP and MSTP appear to be standardized), every vendor has its own idea of beauty, and there is no free lunch. Or at least we did not find one. For some reason Cisco exposes data for the CIST but not for the other MST instances. Why is entirely unclear...
So we had to pick up a file and reinvent the wheel: a program that logs in to the switches over the same telnet, pulls the data from them, parses it, and displays it. For displaying this kind of data, graphviz is ideal (to my subjective taste, of course): you feed it a simple text file as input, and it lays out the graph for you, draws the arrows, and even inserts hyperlinks if you ask nicely.
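
As a rough illustration of the "simple text file" that graphviz expects, here is a minimal Python sketch that turns already-parsed link states into DOT text. The `to_dot` function, the switch names, and the choice of drawing non-forwarding links dashed and grey are assumptions made for this example, not a description of the original program.

```python
def to_dot(links):
    """Build Graphviz DOT text from (switch_a, switch_b, state) tuples."""
    lines = ["graph mstp {"]
    for a, b, state in links:
        # Assumption: forwarding links are solid black, everything else
        # (for example blocking) is dashed grey.
        forwarding = state == "forwarding"
        style = "solid" if forwarding else "dashed"
        color = "black" if forwarding else "grey"
        lines.append(
            f'  "{a}" -- "{b}" [label="{state}", style={style}, color={color}];'
        )
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-switch ring with one blocked link.
links = [
    ("cisco-1", "cisco-2", "forwarding"),
    ("cisco-2", "dlink-1", "forwarding"),
    ("dlink-1", "cisco-1", "blocking"),
]
dot_text = to_dot(links)
```

Saved to a file such as `tree.dot`, the text can be rendered with `dot -Tpng tree.dot -o tree.png`.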

The resulting diagram looked something like this:

The commands for obtaining the information are, for D-Link:

show stp ports %port


and for Cisco:

show spanning-tree interface Gi %device/%port detail


Pitfalls


Spanning tree is entitled to take up to a minute to converge, and this is normal.

The basic settings must be identical on all switches.
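
Since MSTP switches only form a single region when the configuration name, the revision and the VLAN-to-instance mapping all agree, a check like the following can catch a mismatch before anything is pushed to the equipment. It is a minimal Python sketch; the data layout, the function names, and the sample switches are hypothetical.

```python
def region_key(cfg):
    """Normalize one switch's MST settings into a comparable tuple."""
    mapping = tuple(
        sorted((inst, tuple(sorted(vlans))) for inst, vlans in cfg["instances"].items())
    )
    return (cfg["name"], cfg["revision"], mapping)


def find_mismatches(configs, reference):
    """Return the names of switches whose settings differ from the reference."""
    ref = region_key(configs[reference])
    return sorted(name for name, cfg in configs.items() if region_key(cfg) != ref)


# Hypothetical intended settings for three switches; dlink-1 has VLAN 20
# in the wrong instance.
configs = {
    "cisco-1": {"name": "lab", "revision": 1, "instances": {1: [10, 20], 2: [30]}},
    "cisco-2": {"name": "lab", "revision": 1, "instances": {1: [20, 10], 2: [30]}},
    "dlink-1": {"name": "lab", "revision": 1, "instances": {1: [10], 2: [20, 30]}},
}
mismatched = find_mismatches(configs, "cisco-1")
```

The order in which VLANs are listed does not matter here, only which instance each VLAN ends up in.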

For all switches to have all instances, they must have all the VLANs. (The picture above, however, includes switches that do not have all the VLANs!)
Using spanning tree alone (without service instances created purely for investigation), it is impossible to determine which port of switch A connects to switch B and which to switch C if A is closer to the root than both B and C in every MST instance.

Summary


You can make standardized protocols work if you act carefully, take your time, and keep a close eye on your hands and your equipment.