
Basic SDWAN Setup

(2025-06)

What This Is About

In the course of my day-to-day duties I've been seeing a lot of sdwan configs that misbehave, leading users to believe that their internet links are worse than they actually are. I've therefore put together a few sample configurations that should be usable in the basic cases, so that sdwan does what we want without unduly preventing users from getting done what they need to.

The main issue I'm encountering is SLA-driven flap, where a link is repeatedly taken out of, then put back into, service. Usually this kills all the active sessions because when a route is removed from the routing table, the address used to NAT out the connections changes, which means the sessions get rejected by the far end.

In an ideal world all ISPs provide links of stellar quality and you don't have to worry about blips in packet loss or latency. However, out here in the real world we have offices with some pretty terrible links, and while in theory you don't want to use a link if it is terrible, you have to consider the alternative. And if there is no alternative, why are you messing with it? You have what you have, and in the absence of a better option a bad link beats a down link in 95% of the scenarios that come up.

What This Is Not

If your circumstances differ from the beginning premise of each example in any way, such that you might reasonably ask "but what if..." or "how about..." then this page is not for you. You will have to seek your answers elsewhere.

Designing sdwan

When thinking about sdwan, there are several conditions you need to think about:

  1. How do I want to pick the egress selected for a given purpose (ie do I have an egress order preference or sla criteria?)

This is the basic premise behind the sdwan. In the simple case we want to use a primary internet unless it is down, at which point we want to use the secondary instead.

At the same time, we might want to use the secondary for phone or video traffic, to increase the odds that primary traffic won't interfere; this might depend on latency, packet loss, or jitter settings to decide if the primary is better than the secondary.

This is also where you'd consider load balancing based on link usage or other criteria to get use out of both links, being aware that the decision to use this has its own consequences.

  2. Are there thresholds over which I don't want to use a given interface for a given purpose for new sessions?

In the simple case there are no purposes other than the general case so this question is meaningless. If you have phones or something that is latency, loss, or jitter-sensitive, you'd identify the thresholds of each that you would deem acceptable for use, beyond which you would prefer an alternate.

  3. Are there thresholds over which I don't want to use a given interface for new sessions?

This is the general case. In the general case you probably want to use the primary link exclusively unless it is down. For more sensitive service, you might prefer to use a secondary link once the primary has degraded slightly -- but not to the point where you want to stop using the primary entirely.

This is probably one of the biggest problems I see with sdwan. If you have tests and thresholds which are too sensitive, and you have the SLA test set to remove the route when the SLA is in "failed" mode, the firewall is more likely to pull a link out of service, effectively killing all the active sessions. (This is because once the routing table changes, all session path cache is flushed and the next packet gets run through the routing table again -- which will select a different interface, which will end up with a different NAT, which won't be recognized by the far end, and the connection will -- eventually -- time out and die.) Similarly, once the SLA passes again, the route gets re-added, and the sessions all switch interfaces again, which means they'll die again. If this happens excessively then the best case scenario is that the sdwan will flap, and users will get really irate.

Avoiding flap should be one of the highest priorities for your sdwan configuration. One of my new mantras is "shitty internet is generally more useful than no internet." This means that if a session is on a degraded line and the user can tolerate the degraded service level, it's better to leave the connection there than to kill it by "moving" it to another link. If the user can't tolerate the service level they can kill and restart the session themselves.

  4. Are there thresholds over which I don't want to use a given interface at all (ie -- kill all existing sessions and re-point to an alternate, knowing that when the link is brought back up there could be implications)?

In the general case, the answer to this is the same as for question #3. In more complicated settings you might prefer new sessions go on a secondary once the primary has degraded to some level but not kill the sessions on the primary. But beyond some threshold you deem the link completely unusable and stop anything from trying to use it at all.

As far as the return-to-service implications go, you have to remember that the default behaviour, returning to the primary immediately by re-adding the route for the newly returned link, is how we can get into a flapping state. If you've placed a link hard down by removing the route then you're going to get a routing table change when it comes back; that's unavoidable. So ideally you only want to mark a link as down when it is completely unusable, to the point where it's worth the pain of killing all long-running sessions. Arguably you'd almost never need to get into this state, as most services will eventually time out on their own and either restart or error out.
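
As a rough illustration of damping flap at the health-check level, the sketch below raises the number of consecutive probe failures and successes required before a member changes state, and leaves the static route alone. The failtime and recoverytime values of 10 are illustrative assumptions rather than recommendations, and the test name and servers are simply borrowed from the examples further down. The # lines are annotations for the reader, not part of the configuration.

```
config system sdwan
    config health-check
        edit "Ping_SLA"
            set server "8.8.4.4" "9.9.9.9"
            set interval 1000
            set probe-timeout 1000
            # assumed value: consecutive failed probes before the member is marked dead
            set failtime 10
            # assumed value: consecutive good probes before the member is marked alive again
            set recoverytime 10
            # leave the routing table alone when the test fails
            set update-static-route disable
            set members 1 2
        next
    end
end
```

With a one-second interval, these values would mean roughly ten seconds of consecutive failures before the link is considered down, and roughly ten seconds of consecutive successes before it is considered back. Check the defaults and allowed ranges for your FortiOS version before relying on them.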

Performance SLAs and Service link selection

When you are defining a service, you can select specific criteria under which a rule will select from a set of interfaces. This can be cost, link quality in terms of latency/loss/jitter, a specific SLA, traffic levels, or combinations of these things.

If you set a specific SLA test that the link has to satisfy, the link obviously has to satisfy that SLA test in order to be selected. But what isn't obvious is that if you don't set a specific SLA test, the link has to be considered as being up by any SLA test. In other words: if a link isn't considered as up by any SLA test, it can't be selected by any service rule.

This is the other issue I see frequently -- sdwan zones completely down because both links included fail all the SLA tests.
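
For reference, tying a service rule to a specific SLA test, as described above, might look something like the sketch below. This goes beyond what the simple cases on this page need. The "Voice_SLA" name and all three threshold values are hypothetical, chosen only to show where the pieces go; the "Phones Subnet" address object is borrowed from a later example. The # lines are annotations for the reader, not part of the configuration.

```
config system sdwan
    config health-check
        edit "Voice_SLA"
            set server "8.8.4.4"
            set members 1 2
            config sla
                edit 1
                    # hypothetical targets; pick values that suit your own service
                    set link-cost-factor latency jitter packet-loss
                    set latency-threshold 150
                    set jitter-threshold 30
                    set packetloss-threshold 2
                next
            end
        next
    end
    config service
        edit 1
            set name "Phones"
            set mode sla
            set dst "all"
            set src "Phones Subnet"
            config sla
                # refers to SLA target 1 inside the "Voice_SLA" health check
                edit "Voice_SLA"
                    set id 1
                next
            end
            set priority-members 2 1
        next
    end
end
```

The tighter the targets, the more often a member will fall out of SLA, which is exactly the sensitivity problem described earlier, so err on the loose side.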

The design

So when it comes to SLA tests, there are three types:

  • ones which select a link based on performance metrics, if necessary;
  • ones which keep a link up as long as possible; and
  • ones which provide useful extra context information.

Two Internet Services

In the simplest condition, you have two ISPs. One is a primary, and the other is for failover in the event of the primary becoming unusable.

So the answers to our four questions become:

  1. I want to use the primary link unless it is down
  2. No, if the link is up I want to use it
  3. No, if the link is up I want to use it
  4. No, if the link is up I want to use it. I would argue that there's no circumstance under which we'd want to actively kill all outbound sessions when a link goes down and then kill them again when it comes back -- applications should be able to deal with the dropped line themselves. It will also mean that sessions moved to the backup connection will stay there when the primary comes back until they end for other reasons (naturally or otherwise).

(sdwan) # show
config system sdwan
    set status enable
    config zone
        edit "virtual-wan-link"
        next
    end
    config members
        edit 1
            set interface "wan1"
        next
        edit 2
            set interface "wan2"
            set gateway x.x.x.x
        next
    end
    config health-check
        edit "DNS_SLA"
            set server "8.8.8.8" "1.1.1.1"
            set protocol dns
            set interval 1000
            set probe-timeout 1000
            set update-static-route disable
            set members 1 2
        next
        edit "Ping_SLA"
            set server "8.8.4.4" "9.9.9.9"
            set interval 1000
            set probe-timeout 1000
            set update-static-route disable
            set members 1 2
        next
    end
    config service
        edit 1
            set name "Default_Outbound"
            set dst "all"
            set src "all"
            set priority-members 2 1
        next
    end
end

How This Works

Here we have two SLA rules defined.

In this configuration the Ping_SLA is doing most of the heavy lifting.

  • we've set the probe-timeout to 1 second, increasing the tolerance for slow replies;
  • we've set the probe frequency to 1 second, increasing the amount of time before a link is set as failed;
  • we do NOT remove the static route when the link fails.

The DNS_SLA is for additional context, generating graphs which might be useful for illustrating line quality issues, demonstrating that the quality issues are link-based, not target-based. It also has the same increased interval and probe-timeout as the Ping_SLA, and likewise does not remove the static route.

Then, in the service definition, we have a straightforward "use link 2 if it's up, if not use link 1."

When the primary fails, new sessions should be put on the backup, while sessions still on the primary will time out in due course; when the primary returns, new sessions will go on there while sessions still on the backup continue until they end for other reasons.
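
To check that the configuration above behaves this way, something like the following diagnostic commands can show the state of each health check, each member, and which member each service rule currently selects. This is a sketch that assumes a reasonably recent FortiOS release: the command names have changed between versions (older releases use "virtual-wan-link" in place of "sdwan", and some newer ones use "service4" in place of "service"), so treat the exact spelling as an assumption to verify on your own unit.

```
diagnose sys sdwan health-check
diagnose sys sdwan member
diagnose sys sdwan service
```

Running these before and after pulling the primary link is a reasonable way to confirm that the service rule moves to the backup without the routing table changing.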

One-Armed SDWAN

A one-armed sdwan is future-proofing. Since you have to remove or otherwise shuffle around rules and things before you can add interfaces to an sdwan zone, it makes sense to create the sdwan infrastructure ahead of time, so that adding a possible second interface later is much easier and requires less de-configuration in order to get to where you want to be.

So the answers to our four questions become:

(sdwan) # show
config system sdwan
    set status enable
    config zone
        edit "virtual-wan-link"
        next
    end
    config members
        edit 1
            set interface "wan1"
            set gateway x.x.x.x
        next
    end
    config health-check
        edit "DNS_SLA"
            set server "8.8.8.8" "1.1.1.1"
            set protocol dns
            set interval 1000
            set probe-timeout 1000
            set update-static-route disable
            set members 1
        next
        edit "Wan1-Latch"
            set server "x.x.x.x"
            set interval 1000
            set probe-timeout 1000
            set members 1
        next
    end
    config service
        edit 1
            set name "Default_Outbound"
            set dst "all"
            set src "all"
            set priority-members 1
        next
    end
end

How it works

Here I again have two SLA tests. Again the DNS_SLA is for link quality context of the full connection to the internet.

The "Wan1-Latch" test targets the link's own next hop into the internet, ie the first device on the ISP's network as far as you are concerned. If this device is not reachable, then the link is hard-down. Here it doesn't matter whether or not we remove the route when the link is down, because it's hard-down and packets have nowhere to go. This should actively kill the sessions and let the active users know that the internet is gone.

Two ISP Links, One For Data, One For Phones, Mutual Fail-over

So if you have two good internet links you might as well split the more congestion-sensitive traffic off to a separate link so that the routine web browsing etc doesn't interfere as much.

Here because we have two services to worry about, we ask the four questions above once for each service.

Still, because we're not super concerned about thresholds, the answers to the questions are the same for both services:

config system sdwan
    set status enable
    config zone
        edit "virtual-wan-link"
        next
        edit "Internet"
        next
    end
    config members
        edit 1
            set interface "wan1"
            set zone "Internet"
            set gateway a.b.c.d
        next
        edit 2
            set interface "wan2"
            set zone "Internet"
            set gateway e.f.g.h
        next
    end
    config health-check
        edit "Internet"
            set server "1.1.1.1" "8.8.8.8"
            set interval 1000
            set probe-timeout 1000
            set update-static-route disable
            set members 0
        next
    end
    config service
        edit 1
            set name "Phones"
            set dst "all"
            set src "Phones Subnet"
            set priority-members 2 1
        next
        edit 2
            set name "Data"
            set dst "all"
            set src "all"
            set priority-members 1 2
        next
    end
end

How it works

This is the initial case, twice, with one of them backwards from the other. The phones traffic should be sent out the phones interface if it's up; if the phones ISP is down, that traffic goes out the data interface.

Similarly, anything not caught by the Phones service is handled by the Data service through the data interface, with the fail-over being the phones line.

Conclusion

You have probably noticed that none of my examples include specific SLA target values as criteria for link usability. This is because in my opinion, there is no need for them in these simple cases, mostly for the reasons put forward in the introduction:

What's the alternative?

If you have no alternative, don't be pre-emptively messing with link availability.

Secondly, unless you have a reason to be actively killing sessions when a link becomes "unusable" instead of waiting for the application to figure it out itself, don't mess with the route table.

If you have a more complicated situation, for example an ADVPN mesh with BGP routes and multiple paths and differing costs then yes there's call there for more complicated rules. But the vast majority of sites I see don't need these.

And finally, remember what I said off the top: If your circumstances differ from the beginning premise of each example in any way, such that you might reasonably ask "but what if..." or "how about..." then this page is not for you. You will have to seek your answers elsewhere.