Basic SDWAN Setup

(2025-06)

What This Is About

In the course of my day-to-day duties I've been seeing a lot of sdwan configs that are misbehaving, causing users to believe that their internet links are worse than they actually are. I've therefore put together a couple of sample configurations that should be usable in the basic cases, so that sdwan does what we want without unduly preventing users from getting done what they need to.

What This Is Not

If your circumstances differ from the beginning premise of each example in any way, such that you might reasonably ask "but what if..." or "how about...", then this page is not for you. You will have to seek your answers elsewhere, although there might be some relevant discussion about this at the bottom of this page.

Designing sdwan

When designing sdwan, there are five questions you need to think about:

  1. how do I want to pick the egress for a given purpose (i.e. do I have an egress order preference or SLA criteria)

This is the basic premise behind the sdwan. In the simple case we want to use a primary internet unless it is down, at which point we want to use the secondary instead.

At the same time, we might want to use the secondary for phone or video traffic, to increase the odds that primary traffic won't interfere; this might depend on latency, packet loss, or jitter settings to decide if the primary is better than the secondary.

This is also where you'd consider load balancing based on link usage or other criteria to get use out of both links, being aware that the decision to use this has its own consequences.

  2. are there thresholds over which I don't want to use a given interface for a given purpose for new sessions

In the simple case there are no purposes other than the general case so this question is meaningless. If you have phones or something that is latency, loss, or jitter-sensitive, you'd identify the thresholds of each that you would deem acceptable for use, beyond which you would prefer an alternate.
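
As a rough sketch of what per-purpose thresholds could look like on FortiOS, the fragment below defines a health check with one SLA target and a rule that prefers a member only while it meets that target. The names "Voice_SLA" and "Voice_Outbound", the address object "Phones", the probe server, and every threshold value are placeholders assumed for illustration; check the option names against your FortiOS version and pick numbers your phone system can actually tolerate. The lines starting with # are annotations, not part of the config.

```
config system sdwan
    config health-check
        edit "Voice_SLA"
            set server "8.8.8.8"
            set interval 1000
            set probe-timeout 1000
            # this test only steers new sessions; it never touches the routing table
            set update-static-route disable
            set members 1 2
            config sla
                edit 1
                    set link-cost-factor latency jitter packet-loss
                    # assumed thresholds: latency and jitter in ms, loss in percent
                    set latency-threshold 150
                    set jitter-threshold 30
                    set packetloss-threshold 2
                next
            end
        next
    end
    config service
        edit 2
            set name "Voice_Outbound"
            set mode sla
            set dst "all"
            # "Phones" is a hypothetical address object for the voice devices
            set src "Phones"
            config sla
                edit "Voice_SLA"
                    set id 1
                next
            end
            # assumes member 2 is the link you would rather put voice on
            set priority-members 2 1
        next
    end
end
```

Rules are matched top-down, so a rule like this has to sit above any catch-all rule or it will never be hit.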

  3. are there thresholds over which I don't want to use a given interface for new sessions

This is the general case. In the general case you probably want to use the primary link exclusively unless it is down. For more sensitive service, you might prefer to use a secondary link once the primary has degraded slightly -- but not to the point where you want to stop using the primary entirely.

This is probably one of the biggest problems I see with sdwan. If you have tests and thresholds which are too sensitive, and you have the SLA test set to remove the route when the SLA is in "failed" mode, the firewall is more likely to pull a link out of service, effectively killing all the active sessions. (This is because once the routing table changes, all session path cache is flushed and the next packet gets run through the routing table again -- which will select a different interface, which will end up with a different NAT, which won't be recognized by the far end, and the connection will -- eventually -- time out and die.) Similarly, once the SLA passes again, the route gets re-added, and the sessions all switch interfaces again, which means they'll die again. If this happens excessively then the best case scenario is that the sdwan will flap, and users will get really irate.

Avoiding flap should be one of the highest priorities for your sdwan configuration. One of my new mantras is: shitty internet is generally more useful than no internet. This means if a session is on a degraded line and the user can tolerate the degraded service level, it's better to leave the connection there than to kill it by "moving" it to another link. If the user can't tolerate the service level they can kill and restart the session themselves.
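
One way to make a health check less twitchy, as a sketch: FortiOS health checks have failtime and recoverytime settings, which control how many consecutive probes must fail before a member is considered lost and how many must succeed before it is considered recovered. The health check name and the values below are assumptions for illustration, not recommendations. The lines starting with # are annotations, not part of the config.

```
config system sdwan
    config health-check
        edit "Ping_SLA"
            # probe once a second, wait up to a second for each reply
            set interval 1000
            set probe-timeout 1000
            # assumed value: 10 failed probes in a row before the member is marked dead
            set failtime 10
            # assumed value: 30 good probes in a row before it is marked alive again
            set recoverytime 30
        next
    end
end
```

With a one-second interval this means roughly ten seconds of total loss before the link is pulled and roughly thirty seconds of clean replies before it is put back, which should ride out short blips without a routing table change.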

  4. are there thresholds over which I don't want to use a given interface at all (ie -- kill all existing sessions and re-point to an alternate)

In the general case, the answer to this is the same as for question #3. In more complicated settings you might prefer new sessions go on a secondary once the primary has degraded to some level but not kill the sessions on the primary. But beyond some threshold you deem the link completely unusable and stop anything from trying to use it at all.

  5. what do I want to happen when an interface which is down returns to service

The answer to this depends on how you got here. The default behaviour, returning to the primary immediately by re-adding the route for the newly-returned link, is how we can get into a flapping state. If you've placed a link hard down by removing the route then you're going to get a routing table change when it comes back; that's unavoidable. So ideally you only want to mark a link as down when it is completely unusable, to the point where it's worth the pain of killing all long-running sessions. Arguably you'd almost never need to get into this state, as most services will eventually time out on their own and either restart or error out.
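
For the softer case, where a rule rather than the routing table decides when to go back to the preferred link, FortiOS has a per-rule hold-down-time setting that delays the switch back from the backup member to the preferred one. A minimal sketch follows, assuming rule 1 exists and is in sla or priority mode (as far as I know the setting has no effect on manual-mode rules); the 300 second value is an assumption. The line starting with # is an annotation, not part of the config.

```
config system sdwan
    config service
        edit 1
            # assumed value: wait 5 minutes after the preferred member recovers before moving back
            set hold-down-time 300
        next
    end
end
```

This does nothing for a link that was taken hard down by route removal; that case still produces a routing table change when the route is re-added.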

The design

So when it comes to SLA tests, you want two:

  • one which selects a link based on performance metrics; and
  • one which keeps a link up as long as possible.

Two Internet Services

In the simplest condition, you have two ISPs. One is a primary, and the other is for failover in the event of the primary becoming unusable.

So the answers to our five questions become:

  1. I want to use the primary link unless it is down
  2. No, if the link is up I want to use it
  3. No, if the link is up I want to use it
(sdwan) # show
config system sdwan
    set status enable
    config zone
        edit "virtual-wan-link"
        next
    end
    config members
        edit 1
            set interface "wan1"
        next
        edit 2
            set interface "wan2"
            set gateway x.x.x.x
        next
    end
    config health-check
        edit "DNS_SLA"
            set server "8.8.8.8" "1.1.1.1"
            set protocol dns
            set interval 1000
            set probe-timeout 1000
            set update-static-route disable
            set members 1 2
        next
        edit "Ping_SLA"
            set server "8.8.4.4" "9.9.9.9"
            set interval 1000
            set probe-timeout 1000
            set members 1 2
        next
    end
    config service
        edit 1
            set name "Default_Outbound"
            set dst "all"
            set src "all"
            set priority-members 2 1
        next
    end
end
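
To check what the firewall is actually doing with a config like this, the commands below are a starting point. They are the forms I know from releases that use the config system sdwan syntax; the exact command names vary between FortiOS releases, so treat them as assumptions to verify on your version. The lines starting with # are annotations, not commands.

```
# per-member probe results (state, latency, jitter, packet loss) for each health check
diagnose sys sdwan health-check

# which member each sdwan rule is currently selecting
diagnose sys sdwan service

# confirm whether a health check has removed or restored a default route
get router info routing-table all
```

Watching the first command during a known-bad period is a quick way to see whether a test is failing often enough to cause the flapping described above.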