* [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
From: Daniel Borkmann @ 2014-04-14 19:45 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-sctp, Matija Glavinic Pecotic, Alexander Sverdlin,
	Vlad Yasevich

This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
to reflect real state of the receiver's buffer") as it introduced a
serious performance regression on SCTP over IPv4 and IPv6, though not
as dramatic on the latter. Measurements were done on a 10Gbit/s link
with ixgbe NICs.

Current state:

[root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
Time: Fri, 11 Apr 2014 17:56:21 GMT
Connecting to host 192.168.241.3, port 5201
      Cookie: Lab200slot2.1397238981.812898.548918
[  4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.09   sec  20.8 MBytes   161 Mbits/sec
[  4]   1.09-2.13   sec  10.8 MBytes  86.8 Mbits/sec
[  4]   2.13-3.15   sec  3.57 MBytes  29.5 Mbits/sec
[  4]   3.15-4.16   sec  4.33 MBytes  35.7 Mbits/sec
[  4]   4.16-6.21   sec  10.4 MBytes  42.7 Mbits/sec
[  4]   6.21-6.21   sec  0.00 Bytes    0.00 bits/sec
[  4]   6.21-7.35   sec  34.6 MBytes   253 Mbits/sec
[  4]   7.35-11.45  sec  22.0 MBytes  45.0 Mbits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-12.51  sec  16.0 MBytes   126 Mbits/sec
[  4]  12.51-13.59  sec  20.3 MBytes   158 Mbits/sec
[  4]  13.59-14.65  sec  13.4 MBytes   107 Mbits/sec
[  4]  14.65-16.79  sec  33.3 MBytes   130 Mbits/sec
[  4]  16.79-16.79  sec  0.00 Bytes    0.00 bits/sec
[  4]  16.79-17.82  sec  5.94 MBytes  48.7 Mbits/sec
(etc)

[root@Lab200slot2 ~]#  iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
Time: Fri, 11 Apr 2014 19:08:41 GMT
Connecting to host 2001:db8:0:f101::1, port 5201
      Cookie: Lab200slot2.1397243321.714295.2b3f7c
[  4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   169 MBytes  1.42 Gbits/sec
[  4]   1.00-2.00   sec   201 MBytes  1.69 Gbits/sec
[  4]   2.00-3.00   sec   188 MBytes  1.58 Gbits/sec
[  4]   3.00-4.00   sec   174 MBytes  1.46 Gbits/sec
[  4]   4.00-5.00   sec   165 MBytes  1.39 Gbits/sec
[  4]   5.00-6.00   sec   199 MBytes  1.67 Gbits/sec
[  4]   6.00-7.00   sec   163 MBytes  1.36 Gbits/sec
[  4]   7.00-8.00   sec   174 MBytes  1.46 Gbits/sec
[  4]   8.00-9.00   sec   193 MBytes  1.62 Gbits/sec
[  4]   9.00-10.00  sec   196 MBytes  1.65 Gbits/sec
[  4]  10.00-11.00  sec   157 MBytes  1.31 Gbits/sec
[  4]  11.00-12.00  sec   175 MBytes  1.47 Gbits/sec
[  4]  12.00-13.00  sec   192 MBytes  1.61 Gbits/sec
[  4]  13.00-14.00  sec   199 MBytes  1.67 Gbits/sec
(etc)

After patch:

[root@Lab200slot2 ~]#  iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
Time: Mon, 14 Apr 2014 16:40:48 GMT
Connecting to host 192.168.240.3, port 5201
      Cookie: Lab200slot2.1397493648.413274.65e131
[  4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   240 MBytes  2.02 Gbits/sec
[  4]   1.00-2.00   sec   239 MBytes  2.01 Gbits/sec
[  4]   2.00-3.00   sec   240 MBytes  2.01 Gbits/sec
[  4]   3.00-4.00   sec   239 MBytes  2.00 Gbits/sec
[  4]   4.00-5.00   sec   245 MBytes  2.05 Gbits/sec
[  4]   5.00-6.00   sec   240 MBytes  2.01 Gbits/sec
[  4]   6.00-7.00   sec   240 MBytes  2.02 Gbits/sec
[  4]   7.00-8.00   sec   239 MBytes  2.01 Gbits/sec

With this revert applied, SCTP performance on latest upstream is back
to normal for both IPv4 and IPv6, matching the throughput of the 3.4.2
test kernel; throughput is steady and the interval reports are smooth
again.

Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
Reported-by: Peter Butler <pbutler@sonusnet.com>
Reported-by: Dongsheng Song <dongsheng.song@gmail.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Peter Butler <pbutler@sonusnet.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
---
 As commit ef2820a735f7 affects kernels 3.11 and onwards, this needs
 a rework by Matija for net-next, so that this fix can go back to
 -stable and restore performance for the 3.11-3.15 kernels.

 include/net/sctp/structs.h | 14 +++++++-
 net/sctp/associola.c       | 82 ++++++++++++++++++++++++++++++++++++----------
 net/sctp/sm_statefuns.c    |  2 +-
 net/sctp/socket.c          |  6 ++++
 net/sctp/ulpevent.c        |  8 ++---
 5 files changed, 87 insertions(+), 25 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 6ee76c8..d992ca3 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1653,6 +1653,17 @@ struct sctp_association {
 	/* This is the last advertised value of rwnd over a SACK chunk. */
 	__u32 a_rwnd;
 
+	/* Number of bytes by which the rwnd has slopped over.  The rwnd is
+	 * allowed to slop over by at most the association's frag_point.
+	 */
+	__u32 rwnd_over;
+
+	/* Keeps track of rwnd pressure.  This happens when we have
+	 * a window, but no receive buffer (i.e., small packets).  This one
+	 * is released slowly (1 PMTU at a time).
+	 */
+	__u32 rwnd_press;
+
 	/* This is the sndbuf size in use for the association.
 	 * This corresponds to the sndbuf size for the association,
 	 * as specified in the sk->sndbuf.
@@ -1881,7 +1892,8 @@ void sctp_assoc_update(struct sctp_association *old,
 __u32 sctp_association_get_next_tsn(struct sctp_association *);
 
 void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
-void sctp_assoc_rwnd_update(struct sctp_association *, bool);
+void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
+void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
 void sctp_assoc_set_primary(struct sctp_association *,
 			    struct sctp_transport *);
 void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 4f6d6f9..39579c3 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1395,35 +1395,44 @@ static inline bool sctp_peer_needs_update(struct sctp_association *asoc)
 	return false;
 }
 
-/* Update asoc's rwnd for the approximated state in the buffer,
- * and check whether SACK needs to be sent.
- */
-void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool update_peer)
+/* Increase asoc's rwnd by len and send any window update SACK if needed. */
+void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
 {
-	int rx_count;
 	struct sctp_chunk *sack;
 	struct timer_list *timer;
 
-	if (asoc->ep->rcvbuf_policy)
-		rx_count = atomic_read(&asoc->rmem_alloc);
-	else
-		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
+	if (asoc->rwnd_over) {
+		if (asoc->rwnd_over >= len) {
+			asoc->rwnd_over -= len;
+		} else {
+			asoc->rwnd += (len - asoc->rwnd_over);
+			asoc->rwnd_over = 0;
+		}
+	} else {
+		asoc->rwnd += len;
+	}
 
-	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
-		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
-	else
-		asoc->rwnd = 0;
+	/* If we had window pressure, start recovering it
+	 * once our rwnd has reached the accumulated pressure
+	 * threshold.  The idea is to recover slowly, but up
+	 * to the initial advertised window.
+	 */
+	if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
+		int change = min(asoc->pathmtu, asoc->rwnd_press);
+		asoc->rwnd += change;
+		asoc->rwnd_press -= change;
+	}
 
-	pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
-		 __func__, asoc, asoc->rwnd, rx_count,
-		 asoc->base.sk->sk_rcvbuf);
+	pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
+		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
+		 asoc->a_rwnd);
 
 	/* Send a window update SACK if the rwnd has increased by at least the
 	 * minimum of the association's PMTU and half of the receive buffer.
 	 * The algorithm used is similar to the one described in
 	 * Section 4.2.3.3 of RFC 1122.
 	 */
-	if (update_peer && sctp_peer_needs_update(asoc)) {
+	if (sctp_peer_needs_update(asoc)) {
 		asoc->a_rwnd = asoc->rwnd;
 
 		pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
@@ -1445,6 +1454,45 @@ void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool update_peer)
 	}
 }
 
+/* Decrease asoc's rwnd by len. */
+void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
+{
+	int rx_count;
+	int over = 0;
+
+	if (unlikely(!asoc->rwnd || asoc->rwnd_over))
+		pr_debug("%s: association:%p has asoc->rwnd:%u, "
+			 "asoc->rwnd_over:%u!\n", __func__, asoc,
+			 asoc->rwnd, asoc->rwnd_over);
+
+	if (asoc->ep->rcvbuf_policy)
+		rx_count = atomic_read(&asoc->rmem_alloc);
+	else
+		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
+
+	/* If we've reached or overflowed our receive buffer, announce
+	 * a 0 rwnd if rwnd would still be positive.  Store the
+	 * potential pressure overflow so that the window can be restored
+	 * back to its original value.
+	 */
+	if (rx_count >= asoc->base.sk->sk_rcvbuf)
+		over = 1;
+
+	if (asoc->rwnd >= len) {
+		asoc->rwnd -= len;
+		if (over) {
+			asoc->rwnd_press += asoc->rwnd;
+			asoc->rwnd = 0;
+		}
+	} else {
+		asoc->rwnd_over = len - asoc->rwnd;
+		asoc->rwnd = 0;
+	}
+
+	pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
+		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
+		 asoc->rwnd_press);
+}
 
 /* Build the bind address list for the association based on info from the
  * local endpoint and the remote peer.
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 01e0024..ae9fbeb 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -6178,7 +6178,7 @@ static int sctp_eat_data(const struct sctp_association *asoc,
 	 * PMTU.  In cases, such as loopback, this might be a rather
 	 * large spill over.
 	 */
-	if ((!chunk->data_accepted) && (!asoc->rwnd ||
+	if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
 	    (datalen > asoc->rwnd + asoc->frag_point))) {
 
 		/* If this is the next TSN, consider reneging to make
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index e13519e..ff20e2d 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -2115,6 +2115,12 @@ static int sctp_recvmsg(struct kiocb *iocb, struct sock *sk,
 		sctp_skb_pull(skb, copied);
 		skb_queue_head(&sk->sk_receive_queue, skb);
 
+		/* When only a partial message is copied to the user, increase
+		 * rwnd by that amount. If all the data in the skb is read,
+		 * rwnd is updated when the event is freed.
+		 */
+		if (!sctp_ulpevent_is_notification(event))
+			sctp_assoc_rwnd_increase(event->asoc, copied);
 		goto out;
 	} else if ((event->msg_flags & MSG_NOTIFICATION) ||
 		   (event->msg_flags & MSG_EOR))
diff --git a/net/sctp/ulpevent.c b/net/sctp/ulpevent.c
index 8d198ae..85c6465 100644
--- a/net/sctp/ulpevent.c
+++ b/net/sctp/ulpevent.c
@@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(struct sctp_ulpevent *event,
 	skb = sctp_event2skb(event);
 	/* Set the owner and charge rwnd for bytes received.  */
 	sctp_ulpevent_set_owner(event, asoc);
-	sctp_assoc_rwnd_update(asoc, false);
+	sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
 
 	if (!skb->data_len)
 		return;
@@ -1011,7 +1011,6 @@ static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
 {
 	struct sk_buff *skb, *frag;
 	unsigned int	len;
-	struct sctp_association *asoc;
 
 	/* Current stack structures assume that the rcv buffer is
 	 * per socket.   For UDP style sockets this is not true as
@@ -1036,11 +1035,8 @@ static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
 	}
 
 done:
-	asoc = event->asoc;
-	sctp_association_hold(asoc);
+	sctp_assoc_rwnd_increase(event->asoc, len);
 	sctp_ulpevent_release_owner(event);
-	sctp_assoc_rwnd_update(asoc, true);
-	sctp_association_put(asoc);
 }
 
 static void sctp_ulpevent_release_frag_data(struct sctp_ulpevent *event)
-- 
1.7.11.7
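
For illustration, the rwnd accounting that this revert restores can be
modeled as a small, self-contained C program. This is a userspace
sketch, not kernel code: struct assoc, rwnd_increase(), rwnd_decrease()
and the buf_full flag are illustrative names that merely mirror the
diff above, with locking, SACK generation and the socket accounting
left out.

#include <stdio.h>

struct assoc {
	unsigned int rwnd;       /* current window estimate */
	unsigned int rwnd_over;  /* bytes the window slopped over */
	unsigned int rwnd_press; /* window withheld under buffer pressure */
	unsigned int pathmtu;    /* caps how fast pressure is recovered */
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Mirrors sctp_assoc_rwnd_increase() minus the window update SACK. */
static void rwnd_increase(struct assoc *a, unsigned int len)
{
	if (a->rwnd_over) {
		if (a->rwnd_over >= len) {
			a->rwnd_over -= len;
		} else {
			a->rwnd += len - a->rwnd_over;
			a->rwnd_over = 0;
		}
	} else {
		a->rwnd += len;
	}

	/* Recover accumulated pressure slowly, one PMTU at a time. */
	if (a->rwnd_press && a->rwnd >= a->rwnd_press) {
		unsigned int change = min_u(a->pathmtu, a->rwnd_press);

		a->rwnd += change;
		a->rwnd_press -= change;
	}
}

/* Mirrors sctp_assoc_rwnd_decrease(); buf_full stands in for the
 * rx_count >= sk_rcvbuf check against the socket receive buffer.
 */
static void rwnd_decrease(struct assoc *a, unsigned int len, int buf_full)
{
	if (a->rwnd >= len) {
		a->rwnd -= len;
		if (buf_full) {
			a->rwnd_press += a->rwnd;
			a->rwnd = 0;
		}
	} else {
		a->rwnd_over = len - a->rwnd;
		a->rwnd = 0;
	}
}

int main(void)
{
	struct assoc a = { .rwnd = 8192, .pathmtu = 1500 };

	rwnd_decrease(&a, 3000, 0);
	rwnd_decrease(&a, 6000, 0);	/* slops over by 808 bytes */
	printf("rwnd=%u over=%u\n", a.rwnd, a.rwnd_over);	/* 0 808 */

	rwnd_increase(&a, 9000);	/* reader drained the data */
	printf("rwnd=%u over=%u\n", a.rwnd, a.rwnd_over);	/* 8192 0 */
	return 0;
}

The point of this scheme is that the window tracks bytes actually
charged to and released by the ULP events, instead of being recomputed
from buffer occupancy on every update as in the reverted commit.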

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
From: Vlad Yasevich @ 2014-04-14 19:57 UTC (permalink / raw)
  To: Daniel Borkmann, davem
  Cc: netdev, linux-sctp, Matija Glavinic Pecotic, Alexander Sverdlin

On 04/14/2014 03:45 PM, Daniel Borkmann wrote:
> This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
> to reflect real state of the receiver's buffer") as it introduced a
> serious performance regression on SCTP over IPv4 and IPv6, though not
> as dramatic on the latter. Measurements were done on a 10Gbit/s link
> with ixgbe NICs.
> 
> Current state:
 ...
> With this revert applied, SCTP performance on latest upstream is back
> to normal for both IPv4 and IPv6, matching the throughput of the 3.4.2
> test kernel; throughput is steady and the interval reports are smooth
> again.
> 
> Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
> Reported-by: Peter Butler <pbutler@sonusnet.com>
> Reported-by: Dongsheng Song <dongsheng.song@gmail.com>
> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
> Tested-by: Peter Butler <pbutler@sonusnet.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
> Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com>
> Cc: Vlad Yasevich <vyasevich@gmail.com>

Acked-by: Vlad Yasevich <vyasevich@gmail.com>

The base approach is sound.  The idea is to calculate rwnd based
on the available receiver buffer.  The algorithm chosen, however,
gives a much higher preference to small data and penalizes large
data transfers.  We need to figure out something else here..

-vlad
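
For comparison, the removed lines in the associola.c hunk above
computed the window directly from socket buffer occupancy. Restated as
plain C (an illustrative sketch; sk_rcvbuf and rx_count are taken as
plain parameters here):

/* Advertise half of whatever is left in the receive buffer, or zero.
 * A large amount of queued data thus collapses the advertised window
 * even while the application is still draining it.
 */
static unsigned int rwnd_from_buffer(int sk_rcvbuf, int rx_count)
{
	int free_space = sk_rcvbuf - rx_count;

	return free_space > 0 ? (unsigned int)free_space >> 1 : 0;
}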

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
From: David Miller @ 2014-04-14 20:48 UTC (permalink / raw)
  To: dborkman
  Cc: netdev, linux-sctp, matija.glavinic-pecotic.ext,
	alexander.sverdlin, vyasevich

From: Daniel Borkmann <dborkman@redhat.com>
Date: Mon, 14 Apr 2014 21:45:17 +0200

> This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
> to reflect real state of the receiver's buffer") as it introduced a
> serious performance regression on SCTP over IPv4 and IPv6, though not
> as dramatic on the latter. Measurements were done on a 10Gbit/s link
> with ixgbe NICs.
> 
> Current state:
 ...
> With this revert applied, SCTP performance on latest upstream is back
> to normal for both IPv4 and IPv6, matching the throughput of the 3.4.2
> test kernel; throughput is steady and the interval reports are smooth
> again.
> 
> Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
> Reported-by: Peter Butler <pbutler@sonusnet.com>
> Reported-by: Dongsheng Song <dongsheng.song@gmail.com>
> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
> Tested-by: Peter Butler <pbutler@sonusnet.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>

Applied and queued up for -stable.

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-14 19:45 ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's bu Daniel Borkmann
@ 2014-04-15  6:43   ` Alexander Sverdlin
  -1 siblings, 0 replies; 37+ messages in thread
From: Alexander Sverdlin @ 2014-04-15  6:43 UTC (permalink / raw)
  To: ext Daniel Borkmann, davem
  Cc: netdev, linux-sctp, Matija Glavinic Pecotic, Vlad Yasevich

Hello Daniel,

On 14/04/14 21:45, ext Daniel Borkmann wrote:
> This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
> to reflect real state of the receiver's buffer") as it introduced a
> serious performance regression on SCTP over IPv4 and IPv6, though a not
> as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.

Could you please share the other HW details? I wonder how much CPU power one needs for such throughput?

-- 
Best regards,
Alexander Sverdlin.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-15  6:43   ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Alexander Sverdlin
@ 2014-04-15  7:08     ` Daniel Borkmann
  -1 siblings, 0 replies; 37+ messages in thread
From: Daniel Borkmann @ 2014-04-15  7:08 UTC (permalink / raw)
  To: Alexander Sverdlin
  Cc: davem, netdev, linux-sctp, Matija Glavinic Pecotic,
	Vlad Yasevich, pbutler

Hi Matija,

[cc'ing Peter]

On 04/15/2014 08:43 AM, Alexander Sverdlin wrote:
> On 14/04/14 21:45, ext Daniel Borkmann wrote:
>> This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
>> to reflect real state of the receiver's buffer") as it introduced a
>> serious performance regression on SCTP over IPv4 and IPv6, though a not
>> as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.
>
> Could you please share other HW details? I wonder how much CPU power one needs for such a throughput?

If you would like to know the exact specifics from the bug report, or the
numbers from the commit message, I refer you to Peter's setup [1]: he used
ixgbe NICs, as these are among the few with SCTP checksum offloading
available.

Thanks,

Daniel

  [1] http://www.spinics.net/lists/linux-sctp/msg03290.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-14 20:48   ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' David Miller
@ 2014-04-15  8:46     ` Alexander Sverdlin
  -1 siblings, 0 replies; 37+ messages in thread
From: Alexander Sverdlin @ 2014-04-15  8:46 UTC (permalink / raw)
  To: ext David Miller, dborkman
  Cc: netdev, linux-sctp, matija.glavinic-pecotic.ext, vyasevich

Hi!

On 14/04/14 22:48, ext David Miller wrote:
>> This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
>> to reflect real state of the receiver's buffer") as it introduced a
>> serious performance regression on SCTP over IPv4 and IPv6, though a not
>> as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.
>>
>> Current state:
>  ...
>> With the reverted patch applied, the SCTP/IPv4 performance is back
>> to normal on latest upstream for IPv4 and IPv6 and has same throughput
>> as 3.4.2 test kernel, steady and interval reports are smooth again.
>>
>> Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
>> Reported-by: Peter Butler <pbutler@sonusnet.com>
>> Reported-by: Dongsheng Song <dongsheng.song@gmail.com>
>> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
>> Tested-by: Peter Butler <pbutler@sonusnet.com>
>> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> 
> Applied and queued up for -stable.

Shouldn't this actually be fixed in the SCTP congestion control code?
The rwnd calculation is not responsible for congestion control.
This revert also reintroduces a serious bug which leads to SCTP getting stuck
completely in particular multi-homed use cases (refer to
http://www.spinics.net/lists/linux-sctp/msg02516.html).

We are not arguing against another version of the patch, but:
- you are choosing speed over stability here
- you are masking the problem by reverting code that is not responsible for
  the problem observed

-- 
Best regards,
Alexander Sverdlin.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-15  8:46     ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Alexander Sverdlin
@ 2014-04-15  8:57       ` Daniel Borkmann
  -1 siblings, 0 replies; 37+ messages in thread
From: Daniel Borkmann @ 2014-04-15  8:57 UTC (permalink / raw)
  To: Alexander Sverdlin
  Cc: ext David Miller, netdev, linux-sctp,
	matija.glavinic-pecotic.ext, vyasevich, Peter Butler

On 04/15/2014 10:46 AM, Alexander Sverdlin wrote:
...
> Shouldn't this actually be fixed in the SCTP congestion control code?
> The rwnd calculation is not responsible for congestion control.
> This revert also reintroduces a serious bug which leads to SCTP getting stuck
> completely in particular multi-homed use cases (refer to
> http://www.spinics.net/lists/linux-sctp/msg02516.html).
>
> We are not arguing against another version of the patch, but:
> - you are choosing speed over stability here
> - you are masking the problem by reverting code that is not responsible for
>   the problem observed

So on 10Gbit Ethernet it is reasonable to regress from 2Gbit down to 50Mbit???
Did you actually measure that? I'm not arguing against the original approach of
the fix, but you need to rework it differently for net-next, as Vlad already
stated; that's all.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-15  6:43   ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Alexander Sverdlin
@ 2014-04-15 14:27     ` Butler, Peter
  -1 siblings, 0 replies; 37+ messages in thread
From: Butler, Peter @ 2014-04-15 14:27 UTC (permalink / raw)
  To: Alexander Sverdlin, ext Daniel Borkmann, davem
  Cc: netdev, linux-sctp, Matija Glavinic Pecotic, Vlad Yasevich

I'm not sure if the following processing power is actually required to reproduce the problem, but this is my setup:

- 2 single-board computers (SBCs), each a Xeon C5528 @ 2.13GHz (4-core with HT = 8 cores each)
- 10Gb Intel 82599EB NICs (IXGBE); each SBC has 2 of these but only 1 was used for this SCTP testing (i.e. NOT multi-homed)
- 10Gb backplane, approx. 0.2 ms RTT (however I also tested this with RTT = 10 ms, 20 ms, 50 ms and it is reproducible in all scenarios)
- send buffer size = 1200000, receive buffer size = 3000000
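
In case it helps with reproduction: the buffer sizes above are plain
SO_SNDBUF/SO_RCVBUF values. A minimal sketch of how they would be applied
to the test socket (illustrative only, untested here; apply_buf_sizes is a
made-up helper, and the kernel clamps these values to net.core.wmem_max /
net.core.rmem_max, so those sysctls may need raising first):

#include <sys/socket.h>

/* Hypothetical helper: apply the send/receive buffer sizes listed above
 * to socket sd.  Note the kernel internally doubles the value passed in.
 */
static int apply_buf_sizes(int sd)
{
	int sndbuf = 1200000;	/* send buffer size from the setup above */
	int rcvbuf = 3000000;	/* receive buffer size from the setup above */

	if (setsockopt(sd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
		return -1;
	return setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
}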




-----Original Message-----
From: linux-sctp-owner@vger.kernel.org [mailto:linux-sctp-owner@vger.kernel.org] On Behalf Of Alexander Sverdlin
Sent: April-15-14 2:44 AM
To: ext Daniel Borkmann; davem@davemloft.net
Cc: netdev@vger.kernel.org; linux-sctp@vger.kernel.org; Matija Glavinic Pecotic; Vlad Yasevich
Subject: Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"

Hello Daniel,

On 14/04/14 21:45, ext Daniel Borkmann wrote:
> This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd 
> management to reflect real state of the receiver's buffer") as it 
> introduced a serious performance regression on SCTP over IPv4 and 
> IPv6, though a not as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.

Could you please share other HW details? I wonder how much CPU power one needs for such a throughput?

--
Best regards,
Alexander Sverdlin.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-14 19:57   ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Vlad Yasevich
@ 2014-04-16  6:57     ` Matija Glavinic Pecotic
  -1 siblings, 0 replies; 37+ messages in thread
From: Matija Glavinic Pecotic @ 2014-04-16  6:57 UTC (permalink / raw)
  To: ext Vlad Yasevich
  Cc: Daniel Borkmann, davem, netdev, linux-sctp, Alexander Sverdlin

Hello Vlad,

On 04/14/2014 09:57 PM, ext Vlad Yasevich wrote:
> The base approach is sound.  The idea is to calculate rwnd based
> on receiver buffer available.  The algorithm chosen, however,
> gives a much higher preference to small data and penalizes large
> data transfers.  We need to figure out something else here...

I don't follow you here. Could you please explain what you see as the penalty?

Thanks,

Matija

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-16  6:57     ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Matija Glavinic Pecotic
@ 2014-04-16  8:39       ` Dongsheng Song
  -1 siblings, 0 replies; 37+ messages in thread
From: Dongsheng Song @ 2014-04-16  8:39 UTC (permalink / raw)
  To: Matija Glavinic Pecotic
  Cc: ext Vlad Yasevich, Daniel Borkmann, davem, netdev, linux-sctp,
	Alexander Sverdlin

From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s;
the penalty is 99%.

http://www.spinics.net/lists/linux-sctp/msg03308.html


On Wed, Apr 16, 2014 at 2:57 PM, Matija Glavinic Pecotic
<matija.glavinic-pecotic.ext@nsn.com> wrote:
>
> Hello Vlad,
>
> On 04/14/2014 09:57 PM, ext Vlad Yasevich wrote:
> > The base approach is sound.  The idea is to calculate rwnd based
> > on receiver buffer available.  The algorithm chosen, however,
> > gives a much higher preference to small data and penalizes large
> > data transfers.  We need to figure out something else here...
>
> I don't follow you here. Could you please explain what you see as the penalty?
>
> Thanks,
>
> Matija

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-16  8:39       ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Dongsheng Song
@ 2014-04-16  9:02         ` Alexander Sverdlin
  -1 siblings, 0 replies; 37+ messages in thread
From: Alexander Sverdlin @ 2014-04-16  9:02 UTC (permalink / raw)
  To: ext Dongsheng Song, Matija Glavinic Pecotic
  Cc: ext Vlad Yasevich, Daniel Borkmann, davem, netdev, linux-sctp

Hi Dongsheng!

On 16/04/14 10:39, ext Dongsheng Song wrote:
> From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s;
> the penalty is 99%.

The question was, do you see this as a problem of the new rwnd algorithm?
If yes, how exactly? The algorithm actually has no preference for any amount of data.
It was fine-tuned before to serve as a congestion control algorithm, but this should
be located elsewhere. Perhaps, indeed, a reuse of the congestion control modules from
TCP would be possible...

> http://www.spinics.net/lists/linux-sctp/msg03308.html
> 
> 
> On Wed, Apr 16, 2014 at 2:57 PM, Matija Glavinic Pecotic
> <matija.glavinic-pecotic.ext@nsn.com> wrote:
>>
>> Hello Vlad,
>>
>> On 04/14/2014 09:57 PM, ext Vlad Yasevich wrote:
>>> The base approach is sound.  The idea is to calculate rwnd based
>>> on receiver buffer available.  The algorithm chosen, however,
>>> gives a much higher preference to small data and penalizes large
>>> data transfers.  We need to figure out something else here...
>>
>> I don't follow you here. Could you please explain what you see as the penalty?
>>
>> Thanks,
>>
>> Matija
> 
> 

-- 
Best regards,
Alexander Sverdlin.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-16  9:02         ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Alexander Sverdlin
@ 2014-04-16 11:55           ` Matija Glavinic Pecotic
  -1 siblings, 0 replies; 37+ messages in thread
From: Matija Glavinic Pecotic @ 2014-04-16 11:55 UTC (permalink / raw)
  To: Alexander Sverdlin, ext Dongsheng Song, ext Vlad Yasevich,
	Daniel Borkmann
  Cc: davem, netdev, linux-sctp

Hello,

On 16.04.2014 11:02, Alexander Sverdlin wrote:
> Hi Dongsheng!
>
> On 16/04/14 10:39, ext Dongsheng Song wrote:
>> From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s;
>> the penalty is 99%.
>
> The question was, do you see this as a problem of the new rwnd algorithm?
> If yes, how exactly? The algorithm actually has no preference for any amount of data.
> It was fine-tuned before to serve as a congestion control algorithm, but this should
> be located elsewhere. Perhaps, indeed, a reuse of the congestion control modules from
> TCP would be possible...

It's also worth noting that SCTP specifies RFC 2581 for congestion
control. TCP obsoleted that one in favor of RFC 5681.

@Vlad, after Alexander's comment, it seems that you were referring
to the performance penalty. At first, I understood you to mean some penalty
in the rwnd calculation against the buffer/rwnd value/something else. That's
why I asked.

It might also be that we are hitting SWS. I remember us observing
some scenarios in which SWS avoidance was broken; the new rwnd code might
have triggered it fully.
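
By SWS I mean the usual receiver-side avoidance rule of not advertising
tiny window increases. A rough sketch of the classic condition, purely
illustrative and not the stack's actual code (the helper name is made up):

/* Classic receiver-side SWS avoidance: only announce a window update
 * once the freed space reaches one MTU or half the receive buffer,
 * whichever is smaller.
 */
static int sws_allows_window_update(unsigned int free_space,
				    unsigned int rcvbuf,
				    unsigned int mtu)
{
	unsigned int min_step = mtu < rcvbuf / 2 ? mtu : rcvbuf / 2;

	return free_space >= min_step;
}

If a check like this misfires, the window can stay shut even though the
buffer has been drained.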

In any case, after some thought in the meantime, I'm pretty much sure
that we need to improve congestion control and that the new rwnd calculation
is the correct approach.

>> http://www.spinics.net/lists/linux-sctp/msg03308.html
>>
>>
>> On Wed, Apr 16, 2014 at 2:57 PM, Matija Glavinic Pecotic
>> <matija.glavinic-pecotic.ext@nsn.com> wrote:
>>>
>>> Hello Vlad,
>>>
>>> On 04/14/2014 09:57 PM, ext Vlad Yasevich wrote:
>>>> The base approach is sound.  The idea is to calculate rwnd based
>>>> on receiver buffer available.  The algorithm chosen, however,
>>>> gives a much higher preference to small data and penalizes large
>>>> data transfers.  We need to figure out something else here...
>>>
>>> I don't follow you here. Could you please explain what you see as the penalty?
>>>
>>> Thanks,
>>>
>>> Matija
>>
>>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-16 11:55           ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Matija Glavinic Pecotic
@ 2014-04-16 13:32             ` Vlad Yasevich
  -1 siblings, 0 replies; 37+ messages in thread
From: Vlad Yasevich @ 2014-04-16 13:32 UTC (permalink / raw)
  To: Matija Glavinic Pecotic, Alexander Sverdlin, ext Dongsheng Song,
	Daniel Borkmann
  Cc: davem, netdev, linux-sctp

On 04/16/2014 07:55 AM, Matija Glavinic Pecotic wrote:
> Hello,
> 
> On 16.04.2014 11:02, Alexander Sverdlin wrote:
>> Hi Dongsheng!
>>
>> On 16/04/14 10:39, ext Dongsheng Song wrote:
>>> From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s;
>>> the penalty is 99%.
>>
>> The question was, do you see this as a problem of the new rwnd algorithm?
>> If yes, how exactly? The algorithm actually has no preference for any
>> amount of data.
>> It was fine-tuned before to serve as a congestion control algorithm, but
>> this should be located elsewhere. Perhaps, indeed, a reuse of the
>> congestion control modules from TCP would be possible...
> 
> It's also worth noting that SCTP specifies RFC 2581 for congestion
> control. TCP obsoleted that one in favor of RFC 5681.
> 
> @Vlad, after Alexander's comment, it seems that you were referring
> to the performance penalty. At first, I understood you to mean some penalty
> in the rwnd calculation against the buffer/rwnd value/something else. That's
> why I asked.
> 
> It might also be that we are hitting SWS. I remember us observing
> some scenarios in which SWS avoidance was broken; the new rwnd code might
> have triggered it fully.
> 
> In any case, after some thought in the meantime, I'm pretty much sure
> that we need to improve congestion control and that the new rwnd calculation
> is the correct approach.

I am not sure where congestion control is broken.  It might be nice to
add a periodic SCTP_STATUS call to netperf/iperf to see what the state
of the congestion window and peer receive window is.

Alternatively, a quick stap script to examine these values could also
be useful.
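
Something along these lines in the test app would show both values
(sketch only, untested; dump_sctp_status is a made-up helper, and for a
one-to-many socket you would also have to fill in sstat_assoc_id):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

/* Dump the peer's advertised rwnd and the primary path's cwnd for the
 * association on socket sd via the SCTP_STATUS socket option.
 */
static void dump_sctp_status(int sd)
{
	struct sctp_status status;
	socklen_t len = sizeof(status);

	memset(&status, 0, sizeof(status));
	if (getsockopt(sd, IPPROTO_SCTP, SCTP_STATUS, &status, &len) == 0)
		printf("peer rwnd=%u cwnd=%u unacked=%u\n",
		       status.sstat_rwnd,
		       status.sstat_primary.spinfo_cwnd,
		       (unsigned int)status.sstat_unackdata);
}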

-vlad

> 
>>> http://www.spinics.net/lists/linux-sctp/msg03308.html
>>>
>>>
>>> On Wed, Apr 16, 2014 at 2:57 PM, Matija Glavinic Pecotic
>>> <matija.glavinic-pecotic.ext@nsn.com> wrote:
>>>>
>>>> Hello Vlad,
>>>>
>>>> On 04/14/2014 09:57 PM, ext Vlad Yasevich wrote:
>>>>> The base approach is sound.  The idea is to calculate rwnd based
>>>>> on receiver buffer available.  The algorithm chosen, however,
>>>>> gives a much higher preference to small data and penalizes large
>>>>> data transfers.  We need to figure out something else here...
>>>>
>>>> I don't follow you here. Could you please explain what you see as the
>>>> penalty?
>>>>
>>>> Thanks,
>>>>
>>>> Matija
>>>
>>>
>>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-16  9:02         ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Alexander Sverdlin
@ 2014-04-16 18:50           ` Vlad Yasevich
  -1 siblings, 0 replies; 37+ messages in thread
From: Vlad Yasevich @ 2014-04-16 18:50 UTC (permalink / raw)
  To: Alexander Sverdlin, ext Dongsheng Song, Matija Glavinic Pecotic
  Cc: Daniel Borkmann, davem, netdev, linux-sctp

On 04/16/2014 05:02 AM, Alexander Sverdlin wrote:
> Hi Dongsheng!
>
> On 16/04/14 10:39, ext Dongsheng Song wrote:
>> From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s;
>> the penalty is 99%.
>
> The question was, do you see this as a problem of the new rwnd algorithm?
> If yes, how exactly?

The algorithm isn't wrong, but the implementation appears to have
a bug with window update SACKs.  The problem is that
sk->sk_rmem_alloc is updated by the skb destructor when
the skb is freed.  This happens after we call sctp_assoc_rwnd_update(),
which tries to send the update SACK.  As a result, in the default
config with per-socket accounting, the test
    if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
uses a stale value for rx_count and results in the advertisement
of a decreased rwnd instead of what is really available.
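
To illustrate the ordering, a toy model only (not the real code paths):

#include <stdio.h>

static unsigned int rmem_alloc;			/* stands in for sk->sk_rmem_alloc */
static const unsigned int rcvbuf = 4096;	/* stands in for sk->sk_rcvbuf */

/* Mirrors the rwnd computation: window = rcvbuf - rx_count, floored at 0. */
static void rwnd_update(void)
{
	unsigned int rx_count = rmem_alloc;

	printf("advertised window: %u\n",
	       rcvbuf > rx_count ? rcvbuf - rx_count : 0);
}

int main(void)
{
	rmem_alloc = 4096;	/* one message fills the whole receive buffer */

	rwnd_update();		/* buggy order: runs before the skb destructor,
				 * sees rx_count == 4096, advertises 0 */
	rmem_alloc -= 4096;	/* the skb destructor finally releases memory */
	rwnd_update();		/* with the correct order we would advertise 4096 */
	return 0;
}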

Can you try this patch without the revert applied?

Thanks
-vlad

diff --git a/net/sctp/ulpevent.c b/net/sctp/ulpevent.c
index 8d198ae..cc2d440 100644
--- a/net/sctp/ulpevent.c
+++ b/net/sctp/ulpevent.c
@@ -1011,7 +1011,6 @@ static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
 {
 	struct sk_buff *skb, *frag;
 	unsigned int	len;
-	struct sctp_association *asoc;

 	/* Current stack structures assume that the rcv buffer is
 	 * per socket.   For UDP style sockets this is not true as
@@ -1036,11 +1035,7 @@ static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
 	}

 done:
-	asoc = event->asoc;
-	sctp_association_hold(asoc);
 	sctp_ulpevent_release_owner(event);
-	sctp_assoc_rwnd_update(asoc, true);
-	sctp_association_put(asoc);
 }

 static void sctp_ulpevent_release_frag_data(struct sctp_ulpevent *event)
@@ -1071,12 +1066,21 @@ done:
  */
 void sctp_ulpevent_free(struct sctp_ulpevent *event)
 {
+	struct sctp_association *asoc = event->asoc;
+
 	if (sctp_ulpevent_is_notification(event))
 		sctp_ulpevent_release_owner(event);
 	else
 		sctp_ulpevent_release_data(event);

 	kfree_skb(sctp_event2skb(event));
+	/* The socket is locked and the association can't go anywhere
+	 * since we are walking the ulpqueue.  No need to hold
+	 * another ref on the association.  Now that the skb has been
+	 * freed and accounted for everywhere, see if we need to send
+	 * a window update SACK.
+	 */
+	sctp_assoc_rwnd_update(asoc, true);
 }

 /* Purge the skb lists holding ulpevents. */


> The algorithm actually has no preference for any amount of data.
> It was fine-tuned before to serve as a congestion control algorithm, but
> this should be located elsewhere. Perhaps, indeed, a reuse of the
> congestion control modules from TCP would be possible...
>
>> http://www.spinics.net/lists/linux-sctp/msg03308.html
>>
>>
>> On Wed, Apr 16, 2014 at 2:57 PM, Matija Glavinic Pecotic
>> <matija.glavinic-pecotic.ext@nsn.com> wrote:
>>>
>>> Hello Vlad,
>>>
>>> On 04/14/2014 09:57 PM, ext Vlad Yasevich wrote:
>>>> The base approach is sound.  The idea is to calculate rwnd based
>>>> on receiver buffer available.  The algorithm chosen, however,
>>>> gives a much higher preference to small data and penalizes large
>>>> data transfers.  We need to figure out something else here...
>>>
>>> I don't follow you here. Could you please explain what you see as the
>>> penalty?
>>>
>>> Thanks,
>>>
>>> Matija
>>
>>
>

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-16 18:50           ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Vlad Yasevich
@ 2014-04-16 19:05             ` Daniel Borkmann
  -1 siblings, 0 replies; 37+ messages in thread
From: Daniel Borkmann @ 2014-04-16 19:05 UTC (permalink / raw)
  To: Vlad Yasevich
  Cc: Alexander Sverdlin, ext Dongsheng Song, Matija Glavinic Pecotic,
	davem, netdev, linux-sctp

On 04/16/2014 08:50 PM, Vlad Yasevich wrote:
> On 04/16/2014 05:02 AM, Alexander Sverdlin wrote:
>> Hi Dongsheng!
>>
>> On 16/04/14 10:39, ext Dongsheng Song wrote:
>>> From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s;
>>> the penalty is 99%.
>>
>> The question was, do you see this as a problem of the new rwnd algorithm?
>> If yes, how exactly?

[ It appears the default-config ./test_timetolive from the lksctp-test
   suite triggered that as well, i.e. the app never woke up from the
   3 sec timeout. ]

> The algorithm isn't wrong, but the implementation appears to have
> a bug with window update SACKs.  The problem is that
> sk->sk_rmem_alloc is updated by the skb destructor when
> the skb is freed.  This happens after we call sctp_assoc_rwnd_update(),
> which tries to send the update SACK.  As a result, in the default
> config with per-socket accounting, the test
>      if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> uses a stale value for rx_count and results in the advertisement
> of a decreased rwnd instead of what is really available.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-16 19:05             ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Daniel Borkmann
@ 2014-04-16 19:24               ` Matija Glavinic Pecotic
  -1 siblings, 0 replies; 37+ messages in thread
From: Matija Glavinic Pecotic @ 2014-04-16 19:24 UTC (permalink / raw)
  To: ext Daniel Borkmann
  Cc: Vlad Yasevich, Alexander Sverdlin, ext Dongsheng Song, davem,
	netdev, linux-sctp

On 04/16/2014 09:05 PM, ext Daniel Borkmann wrote:
> On 04/16/2014 08:50 PM, Vlad Yasevich wrote:
>> On 04/16/2014 05:02 AM, Alexander Sverdlin wrote:
>>> Hi Dongsheng!
>>>
>>> On 16/04/14 10:39, ext Dongsheng Song wrote:
>>>> From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s;
>>>> the penalty is 99%.
>>>
>>> The question was, do you see this as a problem of the new rwnd algorithm?
>>> If yes, how exactly?
> 
> [ It appears the default-config ./test_timetolive from the lksctp-test
>   suite triggered that as well, i.e. the app never woke up from the
>   3 sec timeout. ]

We had a different case there. The test wasn't hanging due to decreased performance, but due to the fact that with the patch the sender created a very large message, as opposed to the situation before the patch, where the test message was much smaller.

http://www.spinics.net/lists/linux-sctp/msg03185.html

>> The algorithm isn't wrong, but the implementation appears to have
>> a bug with window update SACKs.  The problem is that
>> sk->sk_rmem_alloc is updated by the skb destructor when
>> the skb is freed.  This happens after we call sctp_assoc_rwnd_update(),
>> which tries to send the update SACK.  As a result, in the default
>> config with per-socket accounting, the test
>>      if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>> uses a stale value for rx_count and results in the advertisement
>> of a decreased rwnd instead of what is really available.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-16 19:24               ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Matija Glavinic Pecotic
@ 2014-04-16 19:47                 ` Vlad Yasevich
  -1 siblings, 0 replies; 37+ messages in thread
From: Vlad Yasevich @ 2014-04-16 19:47 UTC (permalink / raw)
  To: Matija Glavinic Pecotic, ext Daniel Borkmann
  Cc: Alexander Sverdlin, ext Dongsheng Song, davem, netdev, linux-sctp

On 04/16/2014 03:24 PM, Matija Glavinic Pecotic wrote:
> On 04/16/2014 09:05 PM, ext Daniel Borkmann wrote:
>> On 04/16/2014 08:50 PM, Vlad Yasevich wrote:
>>> On 04/16/2014 05:02 AM, Alexander Sverdlin wrote:
>>>> Hi Dongsheng!
>>>>
>>>> On 16/04/14 10:39, ext Dongsheng Song wrote:
>>>>> From my testing, netperf throughput dropped from 600 Mbit/s to 6 Mbit/s;
>>>>> the penalty is 99%.
>>>>
>>>> The question was, do you see this as a problem of the new rwnd algorithm?
>>>> If yes, how exactly?
>>
>> [ It appears the default-config ./test_timetolive from the lksctp-test
>>   suite triggered that as well, i.e. the app never woke up from the
>>   3 sec timeout. ]
> 
> We had a different case there. The test wasn't hanging due to decreased performance, but due to the fact that with the patch the sender created a very large message, as opposed to the situation before the patch, where the test message was much smaller.
> 
> http://www.spinics.net/lists/linux-sctp/msg03185.html

The problem with the test is that it tries to completely fill the
receive window by using a single SCTP message.  This all goes well
and the test expects a 0-rwnd to be advertised.

The test then consumes said message.  At this point, the test expects
the window to be opened and subsequent messages to be sent or timed out.
This doesn't happen, because the window update is not sent.  So
the sender thinks that the window is closed, which it technically is,
since we never actually update asoc->rwnd.  But the receive buffer
is empty since we drained the data.
We have a stuck association.

This is hard to hit when traffic is always flowing one way or the other,
but in a test it's easy.
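
As a toy model of the sender's view (illustrative only, not the lksctp
test itself):

#include <stdio.h>

int main(void)
{
	unsigned int peer_rwnd = 300000;	/* receiver's advertised a_rwnd */
	unsigned int msg = 300000;		/* one message fills the window */

	peer_rwnd -= msg;	/* DATA is sent; the SACK advertises a_rwnd of 0 */
	printf("after the big send: peer rwnd = %u\n", peer_rwnd);

	/* The receiver drains its buffer here, but without a window-update
	 * SACK our peer_rwnd stays 0, so the next send blocks until it
	 * times out: the association is stuck.
	 */
	if (peer_rwnd == 0)
		printf("next send blocks: no window update received\n");
	return 0;
}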

-vlad

> 
>>> The algorithm isn't wrong, but the implementation appears to have
>>> a bug with window update SACKs.  The problem is that
>>> sk->sk_rmem_alloc is updated by the skb destructor when
>>> the skb is freed.  This happens after we call sctp_assoc_rwnd_update(),
>>> which tries to send the update SACK.  As a result, in the default
>>> config with per-socket accounting, the test
>>>      if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>>> uses a stale value for rx_count and results in the advertisement
>>> of a decreased rwnd instead of what is really available.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
  2014-04-16 19:47                 ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Vlad Yasevich
@ 2014-04-21 19:12                   ` Matija Glavinic Pecotic
  -1 siblings, 0 replies; 37+ messages in thread
From: Matija Glavinic Pecotic @ 2014-04-21 19:12 UTC (permalink / raw)
  To: ext Vlad Yasevich
  Cc: ext Daniel Borkmann, Alexander Sverdlin, ext Dongsheng Song,
	davem, netdev, linux-sctp

On 04/16/2014 09:47 PM, ext Vlad Yasevich wrote:
> On 04/16/2014 03:24 PM, Matija Glavinic Pecotic wrote:
>> On 04/16/2014 09:05 PM, ext Daniel Borkmann wrote:
>>> On 04/16/2014 08:50 PM, Vlad Yasevich wrote:
>>>> On 04/16/2014 05:02 AM, Alexander Sverdlin wrote:
>>>>> Hi Dongsheng!
>>>>>
>>>>> On 16/04/14 10:39, ext Dongsheng Song wrote:
>>>>>> From my testing, netperf throughput dropped from 600 Mbit/s to
>>>>>> 6 Mbit/s; the penalty is 99 %.
>>>>>
>>>>> The question was, do you see this as a problem of the new rwnd algorithm?
>>>>> If yes, how exactly?
>>>
>>> [ Default config ./test_timetolive from the lksctp-test suite
>>>   triggered that as well, it appears, showing that the app never
>>>   woke up from the 3 sec timeout. ]
>>
>> We had a different case there. The test wasn't hanging due to decreased performance, but due to the fact that with the patch the sender created a very large message, as opposed to the situation before the patch, where the test message was of a much smaller size.
>>
>> http://www.spinics.net/lists/linux-sctp/msg03185.html
> 
> The problem with the test is that it tries to completely fill the
> receive window by using a single SCTP message.  This all goes well,
> and the test expects a 0-rwnd to be advertised.
> 
> The test then consumes said message.  At this point, the test expects
> the window to be opened and subsequent messages to be sent or to time
> out.  This doesn't happen, because the window update is not sent.  So
> the sender thinks that the window is closed, which it technically is,
> since we never actually update asoc->rwnd.  But the receive buffer
> is empty, since we drained the data.
> We have a stuck association.
> 
> This is hard to hit when traffic is always flowing one way or the
> other, but in a test, it's easy.

I'm not sure we hit exactly this scenario in this test case.

The problem with this TC is that it relied on the fact that once SO_RCVBUF is set on the socket, and later changed, a_rwnd will stay at the initial value (the same as what was discussed when this TC was fixed a few months ago -> http://www.spinics.net/lists/linux-sctp/msg03185.html).

What happened once rwnd became "honest" is that the TC no longer used a small value to create the fill message (fillmsg = malloc(gstatus.sstat_rwnd+RWND_SLOP);), i.e. (SMALL_RCVBUF+RWND_SLOP); due to the new behavior, it used the later advertised value, or what is referred to in the TC as the original value. This value is even bigger than the REALLY_BIG value in the TC, and in my case it is 164k:

> Sending the message of size 163837...

With these parameters, we will deplete the receiver in just two SCTP packets on lo (~65k of data plus a really big overhead due to MAXSEG being set to 100). We can confirm this by looking at the assocs state at the time the TC hangs:

glavinic@slon:~$ cat /proc/net/sctp/assocs 
 ASSOC     SOCK   STY SST ST HBKT ASSOC-ID TX_QUEUE RX_QUEUE UID INODE LPORT RPORT LADDRS <-> RADDRS HBINT INS OUTS MAXRT T1X T2X RTXC wmema wmemq sndbuf rcvbuf
f2c0d800 f599c780 0   7   3  8332   13   753877        0    1000 27635 1024   1025  127.0.0.1 <-> *127.0.0.1 	    7500    10    10   10    0    0    15604  1428953  1153600   163840   163840
f2c09800 f599cb40 0   10  3  9101   14        0   327916    1000 27636 1025   1024  127.0.0.1 <-> *127.0.0.1 	    7500    10    10   10    0    0        0        1        0   163840   327680
glavinic@slon:~$ 

What happens is that the TC hangs on sending the message. Since the TC hung there, we never get to the point where we start reading, and we have locked ourselves up forever.
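
For reference, the sending side of the TC reduces to something like
this (a sketch built around the test's own malloc line quoted above;
<netinet/sctp.h> and libsctp are assumed, and the trailing
sctp_sendmsg() argument values are placeholders, not the test's
actual ones):

  /* fillmsg is sized from the advertised rwnd, which with an honest
   * rwnd is ~164k instead of SMALL_RCVBUF + RWND_SLOP: */
  char *fillmsg = malloc(gstatus.sstat_rwnd + RWND_SLOP);
  memset(fillmsg, 'X', gstatus.sstat_rwnd + RWND_SLOP);

  /* With MAXSEG set to 100, every 100-byte chunk carries full
   * per-chunk and skb overhead on the receiver, so the 163840-byte
   * rcvbuf is depleted after roughly two 64k packets on lo and this
   * blocking send never returns -- the TC never reaches its reads: */
  sctp_sendmsg(sk, fillmsg, gstatus.sstat_rwnd + RWND_SLOP,
               (struct sockaddr *)&peer, sizeof(peer),
               0 /* ppid */, 0 /* flags */, 0 /* stream */,
               0 /* ttl */, 0 /* context */);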

On the other hand, I see a possible pitfall with this late rwnd update, especially for large messages accompanied by large MTUs when we come close to closing the rwnd, so I'm looking forward to the retest.

Regards,

Matija
 
> -vlad
> 
>>
>>>> The algorithm isn't wrong, but the implementation appears to have
>>>> a bug with window update SACKs.  The problem is that
>>>> sk->sk_rmem_alloc is updated by the skb destructor when the
>>>> skb is freed.  This happens after we call sctp_assoc_rwnd_update(),
>>>> which tries to send the update SACK.  As a result, in the default
>>>> config with per-socket accounting, the test
>>>>      if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>>>> uses the wrong value for rx_count and results in advertisement
>>>> of a decreased rwnd instead of what is really available.
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2014-04-21 19:14 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-14 19:45 [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Daniel Borkmann
2014-04-14 19:45 ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's bu Daniel Borkmann
2014-04-14 19:57 ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Vlad Yasevich
2014-04-14 19:57   ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Vlad Yasevich
2014-04-16  6:57   ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Matija Glavinic Pecotic
2014-04-16  6:57     ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Matija Glavinic Pecotic
2014-04-16  8:39     ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Dongsheng Song
2014-04-16  8:39       ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Dongsheng Song
2014-04-16  9:02       ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Alexander Sverdlin
2014-04-16  9:02         ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Alexander Sverdlin
2014-04-16 11:55         ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Matija Glavinic Pecotic
2014-04-16 11:55           ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Matija Glavinic Pecotic
2014-04-16 13:32           ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Vlad Yasevich
2014-04-16 13:32             ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Vlad Yasevich
2014-04-16 18:50         ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Vlad Yasevich
2014-04-16 18:50           ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Vlad Yasevich
2014-04-16 19:05           ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Daniel Borkmann
2014-04-16 19:05             ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Daniel Borkmann
2014-04-16 19:24             ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Matija Glavinic Pecotic
2014-04-16 19:24               ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Matija Glavinic Pecotic
2014-04-16 19:47               ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Vlad Yasevich
2014-04-16 19:47                 ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Vlad Yasevich
2014-04-21 19:12                 ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Matija Glavinic Pecotic
2014-04-21 19:12                   ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Matija Glavinic Pecotic
2014-04-14 20:48 ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" David Miller
2014-04-14 20:48   ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' David Miller
2014-04-15  8:46   ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Alexander Sverdlin
2014-04-15  8:46     ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Alexander Sverdlin
2014-04-15  8:57     ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Daniel Borkmann
2014-04-15  8:57       ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Daniel Borkmann
2014-04-15  6:43 ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Alexander Sverdlin
2014-04-15  6:43   ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Alexander Sverdlin
2014-04-15  7:08   ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Daniel Borkmann
2014-04-15  7:08     ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Daniel Borkmann
2014-04-15 14:27   ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" Butler, Peter
2014-04-15 14:27     ` [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver' Butler, Peter
2014-04-16 18:36 ` Vlad Yasevich
