VPN tunnel keeps going down. I'm at a loss

TL;DR: My site-to-site tunnel goes down periodically. If I manually reset it, it lasts about 7.5 hours. If it eventually resets itself, the time it lasts varies.


Hi everyone. I have a problem with my site-to-site tunnel and I can’t figure it out.

I have my main network and I have a satellite office.
I’ve set up a tunnel between the two and it periodically goes down for hours at a time.
The external interface on the satellite side never goes down. I can always ping it.

On my main side, I am using a Palo Alto (PA-3050 x2 (HA)) managed by Panorama.
On the satellite side, I am using a Juniper SRX 100.

[Palo Alto] <> [tunnel] <> [Juniper]

I’ve tried the following:

  • Lowered the MTU to 1350 on the Juniper.
  • Swapped one Juniper for another, both factory reset.
  • Disabled all ALG inspection on the Juniper.
  • Deleted the settings on the Palo Alto side and recreated them.

I’m sure there’s more that I’m forgetting.

The only thing that seemed to make a difference was the last one.
I didn’t create the Palo Alto side of the tunnel and noticed some discrepancies, such as the lifetime seconds being different on the two sides.
I set it to 8 hours.

Changing the lifetime actually did make a difference.
Now instead of going down every hour or so, it lasts about 7 hours and 30 minutes.

Also, I notice the tunnel takes around 10 minutes to start passing traffic once the firewalls show that it’s up.
In other words, I’ll reset the tunnel and it shows both IKE and IPsec are connected, but I can’t ping through it until about 10 minutes later.

It seems like the tunnel is dying before its 8-hour lifetime is up, and if it eventually re-establishes itself, the two sides get out of sync sooner or later.

Or who knows? Maybe the time has nothing to do with it.

So here are some logs and configurations.
100.50.10.33 is our home network (Palo Alto).
200.1.1.74 is our remote network (Juniper).
10.20.20.1 is the internal interface that I’m pinging continuously.

I replaced the real IPs with fake ones for this post.

Palo Alto config: https://imgur.com/a/awPM9Ut
Juniper config (Pastebin.com): set version 10.4R4.5; set system host-name RemoteOffice_FW; set system time-zo…
Palo Alto logs: http://devante.org/pa_logs.html
Pings (warning, 11 MB text file): https://drive.google.com/open?id=1SimthgtZaV2eekD6iYWDKndw8dr-eI5s

Breakdown of pings:

04/27 23:34:30 Up 7 hours, 26 minutes and 1 second
04/28 07:00:31 Down

04/28 07:10:27 Up 7 hours, 26 minutes and 29 seconds
04/28 14:36:56 Down

04/28 22:23:21 Up 7 hours, 26 minutes and 20 seconds
04/29 05:49:41 Down

04/29 08:08:26 Up 7 hours, 28 minutes and 26 seconds
04/29 15:36:52 Down

04/29 16:15:53 Up 7 hours, 27 minutes and 26 seconds
04/29 23:43:19 Down

04/29 23:53:19 Up 7 hours, 26 minutes and 22 seconds
04/30 07:19:41 Down

04/30 08:09:42 Up 7 hours, 34 minutes and 3 seconds
04/30 15:43:45 Down
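
For anyone who wants to check the breakdown above, here is a rough Python sketch that recomputes each uptime from the up/down timestamps. The timestamps are copied from the list; the year is not given in the post, so YEAR is only a placeholder, and the helper names are made up for illustration.

```python
from datetime import datetime

# Assumption: the post gives no year; any year works for these late-April dates.
YEAR = 2019

# (tunnel up, tunnel down) pairs copied from the ping breakdown.
EVENTS = [
    ("04/27 23:34:30", "04/28 07:00:31"),
    ("04/28 07:10:27", "04/28 14:36:56"),
    ("04/28 22:23:21", "04/29 5:49:41"),
    ("04/29 8:08:26", "04/29 15:36:52"),
    ("04/29 16:15:53", "04/29 23:43:19"),
    ("04/29 23:53:19", "04/30 7:19:41"),
    ("04/30 8:09:42", "04/30 15:43:45"),
]

CONFIGURED_LIFETIME = 8 * 3600  # the 8-hour lifetime set on the Palo Alto side


def parse(stamp, year=YEAR):
    """Turn 'MM/DD HH:MM:SS' into a datetime (hypothetical helper)."""
    return datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")


def uptimes_seconds(events):
    """Seconds between each 'up' and the following 'down'."""
    return [int((parse(down) - parse(up)).total_seconds()) for up, down in events]


durations = uptimes_seconds(EVENTS)
for (up, down), secs in zip(EVENTS, durations):
    hours, rem = divmod(secs, 3600)
    minutes, seconds = divmod(rem, 60)
    short_by = CONFIGURED_LIFETIME - secs
    print(f"{up} -> {down}: {hours}h {minutes:02d}m {seconds:02d}s "
          f"({secs} s, {short_by} s short of 8 hours)")
```

Every run comes in under the 8-hour lifetime, which is what made me suspect a timer rather than a flaky link.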

Edit: The problem ended up being that the keepalive time on the Juniper side was different from the Palo Alto side.

I had a similar issue with an SRX and an ASR 1006-X; turning off PFS (I wasn’t happy about it) resolved it. IPsec’s monster list of parameters isn’t always interpreted the same way by all vendors. It sounds like you’re on the right path. It’s not the most secure idea, but I’d consider turning all settings down to the barest of minimums, then adding and removing Phase 1 and then Phase 2 values to see if it improves. Are there any debug values or logging information from either side you can provide?

The Juniper Phase 2 shows a 28000-second timer, while the other side has a timer of 8 hours (28800 seconds). That is a difference of 800 seconds.
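
The arithmetic, as a quick Python sketch. The two values are the ones quoted in this thread; the variable and function names are just for illustration.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s lifetime into seconds."""
    return hours * 3600 + minutes * 60 + seconds


palo_alto_p2 = to_seconds(hours=8)  # Palo Alto side: 8 hours
juniper_p2 = 28000                  # Juniper side: value seen in the config

difference = palo_alto_p2 - juniper_p2
print(f"Palo Alto: {palo_alto_p2} s, Juniper: {juniper_p2} s, "
      f"difference: {difference} s")
```

So 28000 looks like a typo for 28800 rather than a deliberate setting.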

I’ve also noticed issues when the Phase 1 and Phase 2 timeouts are too close to each other. I usually go for 3600 for P2 and 28800 for P1.

Edit: or 28800 and 86400
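
To show how far apart those pairs are, here is a small Python sketch that computes the Phase 1 to Phase 2 ratio for each. The lifetimes are the ones suggested above; the function name is hypothetical, and the sketch does not claim any particular ratio is required.

```python
# (Phase 1, Phase 2) lifetimes in seconds, as suggested in the reply above.
SUGGESTED = [(28800, 3600), (86400, 28800)]


def lifetime_ratio(phase1, phase2):
    """How many Phase 2 lifetimes fit into one Phase 1 lifetime."""
    return phase1 / phase2


ratios = [lifetime_ratio(p1, p2) for p1, p2 in SUGGESTED]
for (p1, p2), ratio in zip(SUGGESTED, ratios):
    print(f"P1={p1} s, P2={p2} s: P1 is {ratio:g}x P2")
```

Either way, Phase 2 rekeys several times within a single Phase 1 lifetime, so the two timers do not expire at nearly the same moment.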

Is it a keepalive issue? Is traffic going over the VPN continuously, or only intermittently?

DH group 2 is only 1024 bits. You really need to up that puppy. And I would move up from SHA-1 as well.

Check the time on both ends.

You’ll need to configure traceoptions, at least, on the Juniper. There isn’t much we can do without the debug output.

Do you by chance have AT&T U-verse as the ISP line at one of the sites? We’ve been seeing a ton of IPsec issues caused by the ESP ALG on their modems.

PFS off is always the first thing I try when using endpoints from different vendors.

Just wanted to say thanks, this is the thing I missed and it has been up ever since!

I’ll take the different times under advisement.

The remote office’s VoIP phones go through the VPN and they must have some type of continuous traffic because if they lose connection, they immediately go into discover mode.

Also, I’ve had continuous pings going over the tunnel this entire time.

I’ll try to get that this weekend. I actually don’t know much about traceoptions on the Junipers. Do you know what command I should use?

Thanks for the reply by the way.

I fixed the time being different on the Juniper side (28000 instead of 28800) yesterday and it has been up ever since so that might be it!

One wonders if they’re doing carrier-grade NAT as well. Not all devices have NAT-T turned on by default.

No worries. Timing’s definitely critical when it comes to IPsec tunnels functioning properly. Here’s some documentation on traceoptions:

https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/security-edit-traceoptions-ike.html

https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/traceoptions-edit-services-ipsec-vpn.html