Michal Kubecek 7838ee
From: Neal Cardwell <ncardwell@google.com>
Michal Kubecek 7838ee
Date: Thu, 3 Aug 2017 09:19:54 -0400
Michal Kubecek 7838ee
Subject: tcp: fix xmit timer to only be reset if data ACKed/SACKed
Michal Kubecek 7838ee
Patch-mainline: v4.13-rc5
Michal Kubecek 7838ee
Git-commit: df92c8394e6ea0469e8056946ef8add740ab8046
Michal Kubecek 7838ee
References: bsc#1076830
Michal Kubecek 7838ee
Michal Kubecek 7838ee
Fix a TCP loss recovery performance bug raised recently on the netdev
Michal Kubecek 7838ee
list, in two threads:
Michal Kubecek 7838ee
Michal Kubecek 7838ee
(i)  July 26, 2017: netdev thread "TCP fast retransmit issues"
Michal Kubecek 7838ee
(ii) July 26, 2017: netdev thread:
Michal Kubecek 7838ee
     "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
Michal Kubecek 7838ee
     outstanding TLP retransmission"
Michal Kubecek 7838ee
Michal Kubecek 7838ee
The basic problem is that incoming TCP packets that did not indicate
Michal Kubecek 7838ee
forward progress could cause the xmit timer (TLP or RTO) to be rearmed
Michal Kubecek 7838ee
and pushed back in time. In certain corner cases this could result in
Michal Kubecek 7838ee
the following problems noted in these threads:
Michal Kubecek 7838ee
Michal Kubecek 7838ee
 - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
Michal Kubecek 7838ee
   could cause TCP to repeatedly schedule TLPs forever. We kept
Michal Kubecek 7838ee
   sending TLPs after every ~200ms, which elicited bogus SACKs, which
Michal Kubecek 7838ee
   caused more TLPs, ad infinitum; we never fired an RTO to fill in
Michal Kubecek 7838ee
   the holes.
Michal Kubecek 7838ee
Michal Kubecek 7838ee
 - Incoming data segments could, in some cases, cause us to reschedule
Michal Kubecek 7838ee
   our RTO or TLP timer further out in time, for no good reason. This
Michal Kubecek 7838ee
   could cause repeated inbound data to result in stalls in outbound
Michal Kubecek 7838ee
   data, in the presence of packet loss.
Michal Kubecek 7838ee
Michal Kubecek 7838ee
This commit fixes these bugs by changing the TLP and RTO ACK
Michal Kubecek 7838ee
processing to:
Michal Kubecek 7838ee
Michal Kubecek 7838ee
 (a) Only reschedule the xmit timer once per ACK.
Michal Kubecek 7838ee
Michal Kubecek 7838ee
 (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
Michal Kubecek 7838ee
     ACK indicates sufficient forward progress (a packet was
Michal Kubecek 7838ee
     cumulatively ACKed, or we got a SACK for a packet that was sent
Michal Kubecek 7838ee
     before the most recent retransmit of the write queue head).
Michal Kubecek 7838ee
Michal Kubecek 7838ee
This brings us back into closer compliance with the RFCs, since, as
Michal Kubecek 7838ee
the comment for tcp_rearm_rto() notes, we should only restart the RTO
Michal Kubecek 7838ee
timer after forward progress on the connection. Previously we were
Michal Kubecek 7838ee
restarting the xmit timer even in these cases where there was no
Michal Kubecek 7838ee
forward progress.
Michal Kubecek 7838ee
Michal Kubecek 7838ee
As a side benefit, this commit simplifies and speeds up the TCP timer
Michal Kubecek 7838ee
arming logic. We had been calling inet_csk_reset_xmit_timer() three
Michal Kubecek 7838ee
times on normal ACKs that cumulatively acknowledged some data:
Michal Kubecek 7838ee
Michal Kubecek 7838ee
1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
Michal Kubecek 7838ee
        if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
Michal Kubecek 7838ee
               tcp_rearm_rto(sk);
Michal Kubecek 7838ee
Michal Kubecek 7838ee
2) Once in tcp_clean_rtx_queue(), to update the RTO:
Michal Kubecek 7838ee
        if (flag & FLAG_ACKED) {
Michal Kubecek 7838ee
               tcp_rearm_rto(sk);
Michal Kubecek 7838ee
Michal Kubecek 7838ee
3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
Michal Kubecek 7838ee
   to TLP:
Michal Kubecek 7838ee
        if (icsk->icsk_pending == ICSK_TIME_RETRANS)
Michal Kubecek 7838ee
               tcp_schedule_loss_probe(sk);
Michal Kubecek 7838ee
Michal Kubecek 7838ee
This commit, by only rescheduling the xmit timer once per ACK,
Michal Kubecek 7838ee
simplifies the code and reduces CPU overhead.
Michal Kubecek 7838ee
Michal Kubecek 7838ee
This commit was tested in an A/B test with Google web server
Michal Kubecek 7838ee
traffic. SNMP stats and request latency metrics were within noise
Michal Kubecek 7838ee
levels, substantiating that for normal web traffic patterns this is a
Michal Kubecek 7838ee
rare issue. This commit was also tested with packetdrill tests to
Michal Kubecek 7838ee
verify that it fixes the timer behavior in the corner cases discussed
Michal Kubecek 7838ee
in the netdev threads mentioned above.
Michal Kubecek 7838ee
Michal Kubecek 7838ee
This patch is a bug fix patch intended to be queued for -stable
Michal Kubecek 7838ee
relases.
Michal Kubecek 7838ee
Michal Kubecek 7838ee
Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
Michal Kubecek 7838ee
Reported-by: Klavs Klavsen <kl@vsen.dk>
Michal Kubecek 7838ee
Reported-by: Mao Wenan <maowenan@huawei.com>
Michal Kubecek 7838ee
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Michal Kubecek 7838ee
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Michal Kubecek 7838ee
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Michal Kubecek 7838ee
Acked-by: Eric Dumazet <edumazet@google.com>
Michal Kubecek 7838ee
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek 7838ee
Acked-by: Michal Kubecek <mkubecek@suse.cz>
Michal Kubecek 7838ee
Michal Kubecek 7838ee
---
Michal Kubecek 7838ee
 net/ipv4/tcp_input.c  | 25 ++++++++++++++++---------
Michal Kubecek 7838ee
 net/ipv4/tcp_output.c |  9 ---------
Michal Kubecek 7838ee
 2 files changed, 16 insertions(+), 18 deletions(-)
Michal Kubecek 7838ee
Michal Kubecek 7838ee
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
Michal Kubecek 7838ee
index 932c532c7a34..48141dec4e2d 100644
Michal Kubecek 7838ee
--- a/net/ipv4/tcp_input.c
Michal Kubecek 7838ee
+++ b/net/ipv4/tcp_input.c
Michal Kubecek 7838ee
@@ -110,6 +110,7 @@ int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
Michal Kubecek 7838ee
 #define FLAG_ORIG_SACK_ACKED	0x200 /* Never retransmitted data are (s)acked	*/
Michal Kubecek 7838ee
 #define FLAG_SND_UNA_ADVANCED	0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
Michal Kubecek 7838ee
 #define FLAG_DSACKING_ACK	0x800 /* SACK blocks contained D-SACK info */
Michal Kubecek 7838ee
+#define FLAG_SET_XMIT_TIMER	0x1000 /* Set TLP or RTO timer */
Michal Kubecek 7838ee
 #define FLAG_SACK_RENEGING	0x2000 /* snd_una advanced to a sacked seq */
Michal Kubecek 7838ee
 #define FLAG_UPDATE_TS_RECENT	0x4000 /* tcp_replace_ts_recent() */
Michal Kubecek 7838ee
 #define FLAG_NO_CHALLENGE_ACK	0x8000 /* do not call tcp_send_challenge_ack()	*/
Michal Kubecek 7838ee
@@ -3015,6 +3016,13 @@ void tcp_rearm_rto(struct sock *sk)
Michal Kubecek 7838ee
 	}
Michal Kubecek 7838ee
 }
Michal Kubecek 7838ee
 
Michal Kubecek 7838ee
+/* Try to schedule a loss probe; if that doesn't work, then schedule an RTO. */
Michal Kubecek 7838ee
+static void tcp_set_xmit_timer(struct sock *sk)
Michal Kubecek 7838ee
+{
Michal Kubecek 7838ee
+	if (!tcp_schedule_loss_probe(sk))
Michal Kubecek 7838ee
+		tcp_rearm_rto(sk);
Michal Kubecek 7838ee
+}
Michal Kubecek 7838ee
+
Michal Kubecek 7838ee
 /* If we get here, the whole TSO packet has not been acked. */
Michal Kubecek 7838ee
 static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
Michal Kubecek 7838ee
 {
Michal Kubecek 7838ee
@@ -3177,7 +3185,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
Michal Kubecek 7838ee
 					ca_rtt_us);
Michal Kubecek 7838ee
 
Michal Kubecek 7838ee
 	if (flag & FLAG_ACKED) {
Michal Kubecek 7838ee
-		tcp_rearm_rto(sk);
Michal Kubecek 7838ee
+		flag |= FLAG_SET_XMIT_TIMER;  /* set TLP or RTO timer */
Michal Kubecek 7838ee
 		if (unlikely(icsk->icsk_mtup.probe_size &&
Michal Kubecek 7838ee
 			     !after(tp->mtu_probe.probe_seq_end, tp->snd_una))) {
Michal Kubecek 7838ee
 			tcp_mtup_probe_success(sk);
Michal Kubecek 7838ee
@@ -3205,7 +3213,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
Michal Kubecek 7838ee
 		 * after when the head was last (re)transmitted. Otherwise the
Michal Kubecek 7838ee
 		 * timeout may continue to extend in loss recovery.
Michal Kubecek 7838ee
 		 */
Michal Kubecek 7838ee
-		tcp_rearm_rto(sk);
Michal Kubecek 7838ee
+		flag |= FLAG_SET_XMIT_TIMER;  /* set TLP or RTO timer */
Michal Kubecek 7838ee
 	}
Michal Kubecek 7838ee
 
Michal Kubecek 7838ee
 	if (icsk->icsk_ca_ops->pkts_acked) {
Michal Kubecek 7838ee
@@ -3577,9 +3585,6 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
Michal Kubecek 7838ee
 	if (after(ack, tp->snd_nxt))
Michal Kubecek 7838ee
 		goto invalid_ack;
Michal Kubecek 7838ee
 
Michal Kubecek 7838ee
-	if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
Michal Kubecek 7838ee
-		tcp_rearm_rto(sk);
Michal Kubecek 7838ee
-
Michal Kubecek 7838ee
 	if (after(ack, prior_snd_una)) {
Michal Kubecek 7838ee
 		flag |= FLAG_SND_UNA_ADVANCED;
Michal Kubecek 7838ee
 		icsk->icsk_retransmits = 0;
Michal Kubecek 7838ee
@@ -3644,18 +3649,20 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
Michal Kubecek 7838ee
 	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una, &acked,
Michal Kubecek 7838ee
 				    &sack_state);
Michal Kubecek 7838ee
 
Michal Kubecek 7838ee
+	if (tp->tlp_high_seq)
Michal Kubecek 7838ee
+		tcp_process_tlp_ack(sk, ack, flag);
Michal Kubecek 7838ee
+	/* If needed, reset TLP/RTO timer; RACK may later override this. */
Michal Kubecek 7838ee
+	if (flag & FLAG_SET_XMIT_TIMER)
Michal Kubecek 7838ee
+		tcp_set_xmit_timer(sk);
Michal Kubecek 7838ee
+
Michal Kubecek 7838ee
 	if (tcp_ack_is_dubious(sk, flag)) {
Michal Kubecek 7838ee
 		is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
Michal Kubecek 7838ee
 		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit);
Michal Kubecek 7838ee
 	}
Michal Kubecek 7838ee
-	if (tp->tlp_high_seq)
Michal Kubecek 7838ee
-		tcp_process_tlp_ack(sk, ack, flag);
Michal Kubecek 7838ee
 
Michal Kubecek 7838ee
 	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
Michal Kubecek 7838ee
 		sk_dst_confirm(sk);
Michal Kubecek 7838ee
 
Michal Kubecek 7838ee
-	if (icsk->icsk_pending == ICSK_TIME_RETRANS)
Michal Kubecek 7838ee
-		tcp_schedule_loss_probe(sk);
Michal Kubecek 7838ee
 	delivered = tp->delivered - delivered;	/* freshly ACKed or SACKed */
Michal Kubecek 7838ee
 	lost = tp->lost - lost;			/* freshly marked lost */
Michal Kubecek 7838ee
 	tcp_rate_gen(sk, delivered, lost, sack_state.rate);
Michal Kubecek 7838ee
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
Michal Kubecek 7838ee
index de3428c97751..c8cdfac1b3a9 100644
Michal Kubecek 7838ee
--- a/net/ipv4/tcp_output.c
Michal Kubecek 7838ee
+++ b/net/ipv4/tcp_output.c
Michal Kubecek 7838ee
@@ -2349,21 +2349,12 @@ bool tcp_schedule_loss_probe(struct sock *sk)
Michal Kubecek 7838ee
 	u32 rtt = usecs_to_jiffies(tp->srtt_us >> 3);
Michal Kubecek 7838ee
 	u32 timeout, rto_delta_us;
Michal Kubecek 7838ee
 
Michal Kubecek 7838ee
-	/* No consecutive loss probes. */
Michal Kubecek 7838ee
-	if (WARN_ON(icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)) {
Michal Kubecek 7838ee
-		tcp_rearm_rto(sk);
Michal Kubecek 7838ee
-		return false;
Michal Kubecek 7838ee
-	}
Michal Kubecek 7838ee
 	/* Don't do any loss probe on a Fast Open connection before 3WHS
Michal Kubecek 7838ee
 	 * finishes.
Michal Kubecek 7838ee
 	 */
Michal Kubecek 7838ee
 	if (tp->fastopen_rsk)
Michal Kubecek 7838ee
 		return false;
Michal Kubecek 7838ee
 
Michal Kubecek 7838ee
-	/* TLP is only scheduled when next timer event is RTO. */
Michal Kubecek 7838ee
-	if (icsk->icsk_pending != ICSK_TIME_RETRANS)
Michal Kubecek 7838ee
-		return false;
Michal Kubecek 7838ee
-
Michal Kubecek 7838ee
 	/* Schedule a loss probe in 2*RTT for SACK capable connections
Michal Kubecek 7838ee
 	 * in Open state, that are either limited by cwnd or application.
Michal Kubecek 7838ee
 	 */
Michal Kubecek 7838ee
-- 
Michal Kubecek 7838ee
2.16.2
Michal Kubecek 7838ee