Basic SD-WAN for 7.4.x
(2026-04-08)
What This Is About
(This is a refinement of my previous sdwan writings.)
In the course of my day-to-day duties I've been seeing a lot of sdwan configs that are misbehaving, causing users to believe that their internet links are worse than they actually are. I've therefore put together a couple of sample configurations that should be usable in the basic cases, so that sdwan does what we want without unduly preventing users from getting done what they need to.
The main issue I'm encountering is SLA-driven flap, where a link is repeatedly taken out of, then put back into, service. Usually this kills all the active sessions because when a route is removed from the routing table, the address used to NAT out the connections changes, which means the sessions get rejected by the far end.
In an ideal world all ISPs provide links of stellar quality and you don't have to worry about blips in packet loss or latency. However, out here in the real world we have offices with some pretty terrible links, and while in theory you want to not use a link if it is terrible, you have to consider the alternative. And if there is no alternative, why are you messing with it? You have what you have, and in the absence of a better option a bad link beats a down link in 95% of the scenarios that come up.
My primary objectives when setting up or reconfiguring a sdwan for internet are:
- minimize or eliminate link flap by making things excessively tolerant and turning connection issues back on the client applications; and
- if a link is routinely bad because of packet loss or excessive latency/jitter, that is an ISP problem and should be pursued with them rather than masked with sdwan trickery.
What This Is Not
If your circumstances differ from the beginning premise of each example in any way, such that you might reasonably ask but what if... or how about... then this page is not for you. You will have to seek your answers elsewhere.
So I have two pet peeves.
Peeve #1: One-armed sdwan configured with SLA targets
There is no point having a one-armed sdwan with SLA targets because there’s no alternative link around to maybe be able to take the traffic. In these scenarios, a terrible link will be better than an artificially-downed link every single time. In my view the default one-armed sdwan config needs to look something like this:
config system sdwan
    set status enable
    config zone
        edit "virtual-wan-link"
        next
    end
    config members
        edit 1
            set interface "wan1"
            set gateway a.b.c.d
        next
    end
    config health-check
        edit "Ping_SLA"
            set server "8.8.4.4" "9.9.9.9"
            set interval 1000
            set probe-timeout 1000
            set update-static-route disable
            set sla-fail-log-period 60
            set sla-pass-log-period 0
            set members 1
        next
    end
    config service
        edit 1
            set name "Default_Outbound"
            set dst "all"
            set src "all"
            set priority-members 1
        next
    end
end
A few comments here:
- Note the absence of wan2. If it isn’t active, don’t muddle things by pre-adding it. When the new ISP comes we can just add it at that point.
- I prefer ping tests to DNS tests since the interaction with the DNS server adds more variability to the return timing
- I usually prefer not to use google as a ping target because they seem to be aggressively peered with ISPs and so latency can be artificially low
- Running tests once per second, with a one-second timeout, reduces the chance of an artificial down event
- The sla log periods are set the way they are because A) the logs are excessively noisy and B) I’ve never run into anyone who’s cared about the historic values of SLA tests that passed, ever
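How many probes have to fail before a member is considered dead is governed by the health-check's failtime setting, and recoverytime does the same for bringing it back. To the best of my knowledge both default to 5 on 7.4.x, which would explain why the config above never mentions them, but treat those values as an assumption and confirm them on your own build (show full-configuration lists them). A minimal sketch of spelling them out:

config system sdwan
    config health-check
        edit "Ping_SLA"
            # assumed 7.4.x defaults, written out only to make them visible
            set failtime 5
            set recoverytime 5
        next
    end
end

With a 1000 ms interval, that works out to roughly five seconds of continuous failure before the member is declared dead.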
Peeve #2: Sdwan configurations with excessively low SLA thresholds
In this case, bad (or simply busy) links can be arbitrarily marked as not sufficient for use (and with “Delete route” set, not sufficient for ANY use). At best, you will end up with links flapping as they go in and out of SLA; at worst, you will end up with all links marked as unsuitable for use even though they’d actually pass traffic if used. My suggested starting point for a dual-armed sdwan, where one link is specifically for regular use and the other is expressly a backup:
config system sdwan
    set status enable
    config zone
        edit "virtual-wan-link"
        next
    end
    config members
        edit 1
            set interface "wan1"
            set gateway a.b.c.d
        next
        edit 2
            set interface "wan2"
            set gateway e.f.g.h
        next
    end
    config health-check
        edit "Ping_SLA"
            set server "8.8.4.4" "9.9.9.9"
            set interval 1000
            set probe-timeout 1000
            set update-static-route disable
            set sla-fail-log-period 60
            set sla-pass-log-period 0
            set members 1 2
        next
    end
    config service
        edit 1
            set name "Default_Outbound"
            set dst "all"
            set src "all"
            set priority-members 1 2
        next
    end
end
Notes here:
- The primary goal with this setup is to minimize flapping events by making things excessively tolerant and turning connection issues back on the client applications
- Main link is used exclusively unless marked down
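- Main link will only be marked down if it fails five consecutive ping tests (this relies on the health-check's failtime being left at what I understand to be its default of 5)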
- Lack of a “remove route” means that it will be up to the active application to detect that the link stopped working and reconnect or die accordingly
- Lack of a “remove route” also means that active sessions on wan2 will stay on wan2 when wan1 comes back up
- Personally I’m not comfortable with the “load balancing” algorithms in use by 7.4.x so I’d not use them. I like the idea of source/dest-ip-hashing, but that no longer seems to be an option.
- In the event that traffic x (say voip) needed to go on wan2 while everything else rides wan1, I’d make the service section look like:
config service
    edit 2
        set name "VoIP"
        set dst "VoIP Service Server Group"
        set src "Phones"
        set priority-members 2
    next
    edit 1
        set name "Default_Outbound"
        set dst "all"
        set src "all"
        set priority-members 1 2
    next
end
- Here phones will use wan2 unless it is down, in which case their traffic falls through to rule 1, and will then use wan1.
- I’d probably keep service #2’s dst/src loose, picking only “all the phones” or “all the voip servers” rather than both.
Scenarios from here only get more complicated, but in general my view is that specific SLA targets are rarely needed, and tight SLA targets are never needed:
- If latency, jitter, or packet loss is important as a link selection criterion, then the tests should be constructed so that they run against the relevant internet servers (i.e. the specific servers used by the phone vendors) rather than the usual DNS suspects.
- SLA targets should be extremely loose so that links which are “up” can satisfy them, but we can choose which link to use based on which one is “better”. Tight rules set a threshold that effectively says “over this threshold this link should not be used for this purpose” and run the risk that none of the links satisfy the criteria. When this happens, both links are unusable even though they are functionally up. As I said above, a bad link beats a down link in most scenarios.
- The “Remove Route” setting should be avoided so that sessions which are using link A when link B becomes better don’t get arbitrarily dropped.
- In general, “Remove Route” is saying “this link is so bad it should not be used under any circumstances, ever.”
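As a rough illustration of the first two points, a health check aimed at the vendor's own servers with deliberately loose targets might look something like the sketch below. Everything specific in it is an assumption made for the example: the name VoIP_SLA, the 192.0.2.x addresses (documentation placeholders standing in for the phone vendor's real servers), and the threshold numbers, which are only meant to show "generous", not to be recommended values.

config system sdwan
    config health-check
        edit "VoIP_SLA"
            # placeholder addresses; substitute the phone vendor's real servers
            set server "192.0.2.10" "192.0.2.20"
            set interval 1000
            set probe-timeout 1000
            set update-static-route disable
            set members 1 2
            config sla
                edit 1
                    # illustrative loose targets: latency ms, jitter ms, loss percent
                    set link-cost-factor latency jitter packet-loss
                    set latency-threshold 500
                    set jitter-threshold 200
                    set packetloss-threshold 20
                next
            end
        next
    end
end

The intent is that any link which is passing traffic at all clears these numbers comfortably, so the target separates "up" from "down" rather than "good" from "slightly less good".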