DevHeads.net

AWS timeout

Hello list,

Bit of a weird one here. I have hosts at AWS sending mail across a
Checkpoint VPN to my main private relay server (it basically serves to relay
mail to O365 for in house applications). The problem is that the sending
client never receives BYE from server after QUIT. The mail goes through and
is delivered ok. This is bad because our timeout is 300s and if you have
anything more than a small amount of mail to send, your connections waiting
to timeout build up at the client and cause problems with applications. Mail
from non-AWS sources does not have this problem across other legs of our
Checkpoint VPN.

I have done packet captures at source, destination, and the two Checkpoint
FWs which are the VPN endpoints. I can clearly see what appears to be
filtering occurring on the QUIT, but I am really struggling to determine
where said filtering is coming from. I can recreate issue with a simple
mailx command, originally it was reported from a java mail client (so
multiple clients exhibit same issue).

And before you ask, we have already had AWS support remove the SMTP
throttling for this host. We have also looked at Checkpoint logging and can
find no evidence it is being throttled at those devices.

Source host is Centos 7. Destination host is Centos 6. Both have latest
kernel and patches for everything. Postfix version is
postfix-2.6.6-8.el6.x86_64.

Thanks for any suggestions.

Postfix -n below
alias_database = hash:/etc/aliases
alias_maps = hash:/etc/aliases
command_directory = /usr/sbin
config_directory = /etc/postfix
daemon_directory = /usr/libexec/postfix
data_directory = /var/lib/postfix
debug_peer_level = 2
html_directory = no
inet_interfaces = all
inet_protocols = all
mail_owner = postfix
mailbox_size_limit = 78643200
mailq_path = /usr/bin/mailq.postfix
manpage_directory = /usr/share/man
message_size_limit = 78643200
mydestination =
$myhostname,localhost.$mydomain,localhost,idcsmtp2.$mydomain,idcsmtp5.$mydomain,relay.$mydomain
mynetworks = 10.0.0.0/8, 198.112.99.0/24, ***.***.***.***/21
mynetworks_style = class
myorigin = $myhostname
newaliases_path = /usr/bin/newaliases.postfix
notify_classes = bounce, 2bounce, delay, protocol, resource, software
queue_directory = /var/spool/postfix
readme_directory = /usr/share/doc/postfix-2.6.6/README_FILES
relay_domains = $mydestination, $mydomain, example.com, example.com
relayhost = [example.protection.outlook.com]
sample_directory = /usr/share/doc/postfix-2.6.6/samples
sendmail_path = /usr/sbin/sendmail.postfix
setgid_group = postdrop
unknown_local_recipient_reject_code = 550

source pcap
<a href="https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/ETWH_87w1AhCtSdpgzeoNwYBitZBm4HecmF0WBR2RHKOfA?e=Ln03Cq" title="https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/ETWH_87w1AhCtSdpgzeoNwYBitZBm4HecmF0WBR2RHKOfA?e=Ln03Cq">https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/ETWH_87...</a>

source Checkpoint VPN endpoint pcap
<a href="https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/ET3RjKJ_3xdOuvog3fPe8xQBUUUK8qQ7VdOSsnA6oiK4yw?e=vADRze" title="https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/ET3RjKJ_3xdOuvog3fPe8xQBUUUK8qQ7VdOSsnA6oiK4yw?e=vADRze">https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/ET3RjKJ...</a>

destination Checkpoint VPN endpoint pcap
<a href="https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/EdGaYYr_pzBNiKkji2kb034BvBv3LJ6ooMRjXz-2ddp9NA?e=iDaBr0" title="https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/EdGaYYr_pzBNiKkji2kb034BvBv3LJ6ooMRjXz-2ddp9NA?e=iDaBr0">https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/EdGaYYr...</a>

destination postfix pcap
<a href="https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/EUKXDSXYitxArxc4EeVIELIBVrgufBDnYGIZ3K9BC6GCAQ?e=FnUDNM" title="https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/EUKXDSXYitxArxc4EeVIELIBVrgufBDnYGIZ3K9BC6GCAQ?e=FnUDNM">https://insideidc-my.sharepoint.com/:u:/g/personal/fhare_idc_com/EUKXDSX...</a>

Comments

Re: AWS timeout

By Wietse Venema at 05/13/2019 - 14:45

If you look at the non-VPN captures, then you will see the following:

- In one trace, we see a client ACK 138, followed by a client packet
with ".<CR><LF>" (data 443:446, ACK 128, and a timestamp field
tht is unlike those of al other packets in the stream).

- In the other trace, we see that the ACK and ".<CR><LF>" packets
are sent as one packet, with a normal timestamp field.

- After this, the TCP connection is messed up, the server keeps
transmitting "Queued as xxx", and the client keeps transmitting
QUIT.

This looks like the VPN mucks with TCP and screws up the protocol.

Get a better VPN. If yiu must use the VPN, maybe sending SMTP over TLS
will change the problem.

Wietse

Re: AWS timeout

By Wietse Venema at 05/13/2019 - 14:48

Wietse Venema:
There was one typo: 128 instead of 138. I fixed it below.

Re: AWS timeout

By Wietse Venema at 05/13/2019 - 19:27

Wietse Venema:
I verified this another time. On the SMTP client side, there is one packet,
and on the SMTP server side there are two.

Here is a diff -U output for client-side and server-side pcap files.

-TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 1
38, win 211, options [nop,nop,TS val 2968348176 ecr 687501956], length 3: SMTP:
.
+TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [.], ack 138, win 211, o
ptions [nop,nop,TS val 2968348176 ecr 687501956], length 0
+TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 138, win 211, options [nop,nop,TS val 1 ecr 0], length 3: SMTP: .

On the client side we see that the client sends one TCP packet with
three data bytes (offset 443:446, content ".<CR><LF>"), an ACK for
server offset 138, and Timestamp Value 2968348176 and Timestamp
Echo Reply 687501956.

On the server side we see that the server receives two TCP packets.

- One packet with the same ACK for server offset 138, and the same
Timestamp Value 2968348176 and Timestamp Echo Reply 687501956, as
the client sent, but no data bytes.

- One packet with the same ACK for server offset 138, and the same
three data bytes as the client sent, but with a Timestamp Value of
1 and a Timestamp Echo Reply of 0.

And from here on TCP appears to be stuck. The server repeatedly
sends "Ok: queued as 58CE8C0058<CR><LF>" and the client repeatedly
sends "QUIT<CR><LF>".

Anyway, something is breaking TCP and it isn't Postfix. Maybe the
VPN is trying to be too smart, maybe something in your TCP stack
is trying to shape traffic and messing up.

Wietse

Re: AWS timeout

By John Fawcett at 05/14/2019 - 03:01

On 14/05/2019 01:27, Wietse Venema wrote:
The TSval of 1 was generated on the server side VPN node. Seems that the
packet is accepted anyway on the server because we can see that Postfix
responds to that. When the next packet containing the QUIT command comes
throught the TSVal has jumped back to the original clock sequence (as
generated on the client) 2968348359. I can't see why the Linux kernel
would reject that jump, but maybe it's happening on a lower level.

I'd suggest looking into potential bugs in the tcp/ip stack of the vpn
node when it generates a TSval of 1 on that packet.

Another possibility would be to see if turning off tcp timestamps on the
client solves it (systctl net.ipv4.tcp_timestamps=0). I'm not sure I'd
leave it that way, but could be useful to test it.

Re: AWS timeout

By fhare at 05/14/2019 - 12:36

John,

Wow, good one dude! Turning off tcp timestamps on the client instantly
fixed it. So I guess I bring this to Checkpoint Support and see what they
say, THAT should be fun.

Thanks!

Ris: AWS timeout

By John Fawcett at 05/14/2019 - 15:20

&gt; Wietse Venema:<br>
&gt;&gt; If you look at the non-VPN captures, then you will see the following:<br>
&gt;&gt;<br>
&gt;&gt; - In one trace, we see a client ACK 138, followed by a client packet<br>
&gt;&gt;   with &quot;.&lt;CR&gt;&lt;LF&gt;&quot; (data 443:446, ACK 138, and a timestamp field<br>
&gt;&gt;   tht is unlike those of al other packets in the stream).<br>
&gt;&gt;<br>
&gt;&gt; - In the other trace, we see that the ACK and &quot;.&lt;CR&gt;&lt;LF&gt;&quot; packets<br>
&gt;&gt;   are sent as one packet, with a normal timestamp field.<br>
&gt;&gt;<br>
&gt;&gt; - After this, the TCP connection is messed up, the server keeps<br>
&gt;&gt;   transmitting &quot;Queued as xxx&quot;, and the client keeps transmitting<br>
&gt;&gt;   QUIT.<br>
&gt;&gt;<br>
&gt;&gt; This looks like the VPN mucks with TCP and screws up the protocol.<br>
&gt; I verified this another time. On the SMTP client side, there is one packet,<br>
&gt; and on the SMTP server side there are two.<br>
&gt;<br>
&gt; Here is a diff -U output for client-side and server-side pcap files.<br>
&gt;<br>
&gt; -TIMESTAMP IP 10.110.2.9.33256 &gt; 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 1<br>
&gt; 38, win 211, options [nop,nop,TS val 2968348176 ecr 687501956], length 3: SMTP: <br>
&gt; .<br>
&gt; +TIMESTAMP IP 10.110.2.9.33256 &gt; 10.1.3.134.smtp: Flags [.], ack 138, win 211, o<br>
&gt; ptions [nop,nop,TS val 2968348176 ecr 687501956], length 0<br>
&gt; +TIMESTAMP IP 10.110.2.9.33256 &gt; 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 138, win 211, options [nop,nop,TS val 1 ecr 0], length 3: SMTP: .<br>
&gt;<br>
&gt; On the client side we see that the client sends one TCP packet with<br>
&gt; three data bytes (offset 443:446, content &quot;.&lt;CR&gt;&lt;LF&gt;&quot;), an ACK for<br>
&gt; server offset 138, and Timestamp Value 2968348176 and Timestamp<br>
&gt; Echo Reply 687501956.<br>
&gt;<br>
&gt; On the server side we see that the server receives two TCP packets.<br>
&gt;<br>
&gt; - One packet with the same ACK for server offset 138, and the same<br>
&gt; Timestamp Value 2968348176 and Timestamp Echo Reply 687501956, as<br>
&gt; the client sent, but no data bytes.<br>
this looks like a duplicate ACK that was generated on the server side<br>
VPN node.<br>
&gt; - One packet with the same ACK for server offset 138, and the same<br>
&gt; three data bytes as the client sent, but with a Timestamp Value of<br>
&gt; 1 and a Timestamp Echo Reply of 0.<br>
<br>
The TSval of 1 was generated on the server side VPN node. Seems that the<br>
packet is accepted anyway on the server because we can see that Postfix<br>
responds to that. When the next packet containing the QUIT command comes<br>
throught the TSVal has jumped back to the original clock sequence (as<br>
generated on the client) 2968348359. I can&#39;t see why the Linux kernel<br>
would reject that jump, but maybe it&#39;s happening on a lower level.<br>
<br>
I&#39;d suggest looking into potential bugs in the tcp/ip stack of the vpn<br>
node when it generates a TSval of 1 on that packet.<br>
<br>
Another possibility would be to see if turning off tcp timestamps on the<br>
client solves it (systctl net.ipv4.tcp_timestamps=0). I&#39;m not sure I&#39;d<br>
leave it that way, but could be useful to test it.<br>
<br>
&gt;<br>
&gt; And from here on TCP appears to be stuck. The server repeatedly<br>
&gt; sends &quot;Ok: queued as 58CE8C0058&lt;CR&gt;&lt;LF&gt;&quot; and the client repeatedly<br>
&gt; sends &quot;QUIT&lt;CR&gt;&lt;LF&gt;&quot;.<br>
&gt;<br>
&gt; Anyway, something is breaking TCP and it isn&#39;t Postfix. Maybe the<br>
&gt; VPN is trying to be too smart, maybe something in your TCP stack<br>
&gt; is trying to shape traffic and messing up.<br>
&gt;<br>
&gt;       Wietse<br>
<br>
<br>
</blockquote></div>
</blockquote></div>

Re: Ris: AWS timeout

By Durga Prasad Malyala at 05/14/2019 - 22:39

No surprises here. Weitse ranks along the alltime greats of Computing.

Cheers/DP

Re: Ris: AWS timeout

By Ron Wheeler at 05/14/2019 - 23:35

You never see him interviewed on TV.

If people knew how much of the email travels over the internet as a
result of his work, he would be a tech star.

Ron

On 2019-05-14 10:39 p.m., Durga Prasad Malyala wrote:

Re: Ris: AWS timeout

By LuKreme at 05/15/2019 - 10:46

On 14 May 2019, at 21:35, Ron Wheeler <rwheeler@artifact-software.com> wrote:
I really doubt he is interested in notoriety.

Re: Ris: AWS timeout

By fhare at 05/14/2019 - 15:45

Yes, thank you Weitse as well, sorry for the omission!

Best Regards

On Tue, May 14, 2019 at 3:20 PM <a href="mailto: ... at voipsupport dot it"> ... at voipsupport dot it</a> < ... at voipsupport dot it>
wrote:

Re: AWS timeout

By John Fawcett at 05/13/2019 - 22:44

On 14/05/2019 01:27, Wietse Venema wrote:
The packet sent by the client with the QUIT command never makes it to
Postfix on the server, which therefore remains waiting for 300 seconds
before sending the 421 response for timeout. Specifically, Postfix never
sends a BYE since it doesn't see the QUIT. The TCP traffic after the
QUIT is retransmissions that start on the server due the QUIT packet not
being received.

If the QUIT packet were being lost along the way over the VPN, what I
would expect is for it to show up in the source pcap but not in the
destination pcap, but it is in both and that seems to suggest it is
being lost afterwards.

fhare, can I ask if you took the destination trace on the server itself
or on an intermediate node (for example on a vmware host)?

John