All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v2 net-next 0/7] tcp: Make use of MSG_EOR in tcp_sendmsg
@ 2016-04-18 22:46 Martin KaFai Lau
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 1/7] tcp: Carry txstamp_ack in tcp_fragment_tstamp Martin KaFai Lau
                   ` (6 more replies)
  0 siblings, 7 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

v2:
~ Rework based on the recent work
  "add TX timestamping via cmsg" by
  Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
~ This version takes the MSG_EOR bit as a signal of
  end-of-response-message and leave the selective
  timestamping job to the cmsg
~ Changes based on the v1 feedback (like avoid
  unlikely check in a loop and adding tcp_sendpage
  support)
~ The first 3 patches are bug fixes.  The fixes in this
  series depend on the newly introduced txstamp_ack in
  net-next.  I will make relevant patches against net after
  getting some feedback.
~ The test results are based on the recently posted net fix:
  "tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks"
~ Due to the lacking cmsg support in packetdrill (or I just
  could not find it), a BPF prog is used to kprobe
  to sock_queue_err_skb() and print out the value of
  serr->ee.ee_data.  The BPF prog (run-able from bcc) is
  attached at the end.

Some of the following is stolen from a commit message of
the following patch to serve as a high level description of
the objective in this series:

This patchset allows the user process to use MSG_EOR during
tcp_sendmsg to tell the kernel that it is the last byte
of an application response message.

It is currently useful when the end-user has turned on any bit of the
SOF_TIMESTAMPING_TX_RECORD_MASK (either by setsockopt or cmsg).
The kernel will then mark the newly added tcb->eor_info bit so
that the shinfo->tskey will not be overwritten (i.e. lost) in
the later skb append/collapse operation.

Each skb can only track one tskey (which is the seq number of the
last byte of the message).   To allow tracking the last byte of
multiple response messages (e.g. HTTP2), this patch takes an
approach by not appending to the previous skb during tcp_sendmsg
if this previous skb's eor information (only shinfo->tskey for now)
will be overwritten.  A similar case also happens during
retransmission.

This approach avoids introducing another list to track the tskey.
The downside is that it will have less tso benefit and/or more
outgoing packets.  Practically, due to the amount of measurement
data generated, sampling is usually used in production. (i.e. not
every connection is tracked).

One of our use case is at the webserver.  The webserver tracks
the HTTP2 response latency by measuring when the webserver sends
the first byte to the socket till the TCP ACK of the last byte
is received.  In the cases where we don't have client side
measurement, measuring from the server side is the only option.
In the cases we have the client side measurement, the server side
data can also be used to justify/cross-check-with the client
side data.

The TCP PRR paper [1] also measures a similar metrics:
"The TCP latency of a HTTP response when the server sends the first
byte until it receives the acknowledgment (ACK) for the last byte."

[1] Proportional Rate Reduction for TCP:
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37486.pdf

BPF prog used for testing:
~~~~~
#!/usr/bin/env python

from __future__ import print_function
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>
#include <linux/errqueue.h>

#ifdef memset
#undef memset
#endif

int trace_err_skb(struct pt_regs *ctx)
{
	struct sk_buff *skb = (struct sk_buff *)ctx->si;
	struct sock *sk = (struct sock *)ctx->di;
	struct sock_exterr_skb *serr;
	u32 ee_data = 0;

	if (!sk || !skb)
		return 0;

	serr = SKB_EXT_ERR(skb);
	bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data);
	bpf_trace_printk("ee_data:%u\\n", ee_data);

	return 0;
};
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb")
print("Attached to kprobe")
b.trace_print()

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 net-next 1/7] tcp: Carry txstamp_ack in tcp_fragment_tstamp
  2016-04-18 22:46 [RFC PATCH v2 net-next 0/7] tcp: Make use of MSG_EOR in tcp_sendmsg Martin KaFai Lau
@ 2016-04-18 22:46 ` Martin KaFai Lau
  2016-04-19  5:21   ` Soheil Hassas Yeganeh
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 2/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_collapse_retrans Martin KaFai Lau
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

When a tcp skb is sliced into two smaller skbs (e.g. in
tcp_fragment() and tso_fragment()),  it does not carry
the txstamp_ack bit to the newly created skb if it is needed.
The end result is a timestamping event (SCM_TSTAMP_ACK) will
be missing from the sk->sk_error_queue.

This patch carries this bit to the new skb2 (if needed)
in tcp_fragment_tstamp().

BPF Output Before:
~~~~~~
<No output due to missing SCM_TSTAMP_ACK timestamp>

BPF Output After:
~~~~~~
<...>-2050  [000] d.s.   100.928763: : ee_data:14599

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 14600) = 14600
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0

0.200 > . 1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257

0.300 close(4) = 0
0.300 > F. 14601:14601(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 14602:14602(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6451b83..0527ce9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1123,6 +1123,8 @@ static void tcp_fragment_tstamp(struct sk_buff *skb, struct sk_buff *skb2)
 		shinfo->tx_flags &= ~tsflags;
 		shinfo2->tx_flags |= tsflags;
 		swap(shinfo->tskey, shinfo2->tskey);
+		TCP_SKB_CB(skb2)->txstamp_ack = TCP_SKB_CB(skb)->txstamp_ack;
+		TCP_SKB_CB(skb)->txstamp_ack = 0;
 	}
 }
 
-- 
2.5.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 net-next 2/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_collapse_retrans
  2016-04-18 22:46 [RFC PATCH v2 net-next 0/7] tcp: Make use of MSG_EOR in tcp_sendmsg Martin KaFai Lau
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 1/7] tcp: Carry txstamp_ack in tcp_fragment_tstamp Martin KaFai Lau
@ 2016-04-18 22:46 ` Martin KaFai Lau
  2016-04-19  5:32   ` Soheil Hassas Yeganeh
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 3/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_shifted_skb Martin KaFai Lau
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

If two skbs are merged/collapsed during retransmission, the current
logic does not merge the tx_flags, tskey and txstamp_ack.  The end
result is the SCM_TSTAMP_ACK timestamp could be missing for a
packet that the end-user has specifically turned on
SOF_TIMESTAMPING_TX_ACK (e.g. by cmsg).

The patch:
1. Merge the tx_flags and txstamp_ack
2. Overwrite the tskey with the later skb (next_skb)

BPF Output Before:
~~~~~~
<no-output-due-to-missing-tstamp-event>

BPF Output After:
~~~~~~
packetdrill-2092  [001] d.s.   453.998486: : ee_data:1459

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
0.200 write(4, ..., 11680) = 11680

0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 13141 win 257

0.400 close(4) = 0
0.400 > F. 13141:13141(0) ack 1
0.500 < F. 1:1(0) ack 13142 win 257
0.500 > . 13142:13142(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_output.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0527ce9..889ed96 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2443,6 +2443,22 @@ u32 __tcp_select_window(struct sock *sk)
 	return window;
 }
 
+static void tcp_skb_collapse_tstamp(struct sk_buff *skb,
+				    const struct sk_buff *next_skb)
+{
+	const struct skb_shared_info *next_shinfo = skb_shinfo(next_skb);
+
+	if (unlikely(next_shinfo->tx_flags & SKBTX_ANY_TSTAMP)) {
+		struct skb_shared_info *shinfo = skb_shinfo(skb);
+		u8 tsflags = next_shinfo->tx_flags & SKBTX_ANY_TSTAMP;
+
+		shinfo->tx_flags |= tsflags;
+		shinfo->tskey = next_shinfo->tskey;
+		TCP_SKB_CB(skb)->txstamp_ack =
+			!!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
+	}
+}
+
 /* Collapses two adjacent SKB's during retransmission. */
 static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 {
@@ -2486,6 +2502,8 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 
 	tcp_adjust_pcount(sk, next_skb, tcp_skb_pcount(next_skb));
 
+	tcp_skb_collapse_tstamp(skb, next_skb);
+
 	sk_wmem_free_skb(sk, next_skb);
 }
 
-- 
2.5.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 net-next 3/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_shifted_skb
  2016-04-18 22:46 [RFC PATCH v2 net-next 0/7] tcp: Make use of MSG_EOR in tcp_sendmsg Martin KaFai Lau
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 1/7] tcp: Carry txstamp_ack in tcp_fragment_tstamp Martin KaFai Lau
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 2/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_collapse_retrans Martin KaFai Lau
@ 2016-04-18 22:46 ` Martin KaFai Lau
  2016-04-19  5:38   ` Soheil Hassas Yeganeh
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg Martin KaFai Lau
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

After receiving sacks, tcp_shifted_skb() will collapse
skbs if possible.  tx_flags/tskey/txstamp_ack also has
to be merged in this case.

This patch resues the tcp_skb_collapse_tstamp() to handle
them.

BPF Output Before:
~~~~~
<no-output-due-to-missing-tstamp-event>

BPF Output After:
~~~~~
<...>-2024  [007] d.s.    88.644374: : ee_data:14599

Packetdrill Script:
~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 1460) = 1460
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 13140) = 13140
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0

0.200 > P. 1:1461(1460) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:14601(5840) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 14601 win 257

0.400 close(4) = 0
0.400 > F. 14601:14601(0) ack 1
0.500 < F. 1:1(0) ack 14602 win 257
0.500 > . 14602:14602(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h     | 2 ++
 net/ipv4/tcp_input.c  | 1 +
 net/ipv4/tcp_output.c | 4 ++--
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index fd40f8c..c0ef054 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -557,6 +557,8 @@ void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
 bool tcp_schedule_loss_probe(struct sock *sk);
+void tcp_skb_collapse_tstamp(struct sk_buff *skb,
+			     const struct sk_buff *next_skb);
 
 /* tcp_input.c */
 void tcp_resume_early_retransmit(struct sock *sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5e45a9c..75e8336 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1309,6 +1309,7 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
 	if (skb == tcp_highest_sack(sk))
 		tcp_advance_highest_sack(sk, skb);
 
+	tcp_skb_collapse_tstamp(prev, skb);
 	tcp_unlink_write_queue(skb, sk);
 	sk_wmem_free_skb(sk, skb);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 889ed96..d21a78f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2443,8 +2443,8 @@ u32 __tcp_select_window(struct sock *sk)
 	return window;
 }
 
-static void tcp_skb_collapse_tstamp(struct sk_buff *skb,
-				    const struct sk_buff *next_skb)
+void tcp_skb_collapse_tstamp(struct sk_buff *skb,
+			     const struct sk_buff *next_skb)
 {
 	const struct skb_shared_info *next_shinfo = skb_shinfo(next_skb);
 
-- 
2.5.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg
  2016-04-18 22:46 [RFC PATCH v2 net-next 0/7] tcp: Make use of MSG_EOR in tcp_sendmsg Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 3/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_shifted_skb Martin KaFai Lau
@ 2016-04-18 22:46 ` Martin KaFai Lau
  2016-04-18 23:18   ` Eric Dumazet
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 5/7] tcp: Make use of MSG_EOR in tcp_sendpage Martin KaFai Lau
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

This patch allows the user process to use MSG_EOR during
tcp_sendmsg to tell the kernel that it is the last byte
of an application response message.

It is currently useful when the end-user has turned on any bit of the
SOF_TIMESTAMPING_TX_RECORD_MASK (either by setsockopt or cmsg).
The kernel will then mark the newly added tcb->eor_info bit so
that the shinfo->tskey will not be overwritten (i.e. lost) in
the later skb append/collapse operation.

With selective SOF_TIMESTAMPING_TX_ACK (by cmsg) and MSG_EOR (this
patch), the user application can specially tell which outgoing byte
it wants to track its ACK and ask the kernel not to lose this
tracking info in the later skb append/collapse action.

This patch handles the append case in tcp_sendmsg.  The later
patches will handle the collapse during retransmission and
skb slicing in tcp_fragment()/tso_fragment().

One of our use case is at the webserver.  The webserver tracks
the HTTP2 response latency by measuring when the webserver sends
the first byte to the socket till the TCP ACK of the last byte
is received.  In the cases where we don't have client side
measurement, measuring from the server side is the only option.
In the cases we have the client side measurement, the server side
data can also be used to justify/cross-check-with the client
side data.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h |  5 ++++-
 net/ipv4/tcp.c    | 21 +++++++++++++++++----
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index c0ef054..f3c5dcb 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -762,7 +762,10 @@ struct tcp_skb_cb {
 
 	__u8		ip_dsfield;	/* IPv4 tos or IPv6 dsfield	*/
 	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
-			unused:7;
+			eor_info:1,	/* Any EOR marked info that prevents
+					 * skbs from merging.
+					 */
+			unused:6;
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
 	union {
 		struct inet_skb_parm	h4;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4d73858..2918f42 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -428,15 +428,18 @@ void tcp_init_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_init_sock);
 
-static void tcp_tx_timestamp(struct sock *sk, u16 tsflags, struct sk_buff *skb)
+static void tcp_tx_timestamp(struct sock *sk, u16 tsflags, struct sk_buff *skb,
+			     int flags)
 {
 	if (sk->sk_tsflags || tsflags) {
 		struct skb_shared_info *shinfo = skb_shinfo(skb);
 		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
 		sock_tx_timestamp(sk, tsflags, &shinfo->tx_flags);
-		if (shinfo->tx_flags & SKBTX_ANY_TSTAMP)
+		if (shinfo->tx_flags & SKBTX_ANY_TSTAMP) {
 			shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+			tcb->eor_info = !!(flags & MSG_EOR);
+		}
 		tcb->txstamp_ack = !!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
 	}
 }
@@ -874,6 +877,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 	return mss_now;
 }
 
+static bool tcp_sendmsg_noappend(struct sock *sk, u16 tx_tsflags)
+{
+	return unlikely((tx_tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK) &&
+			tcp_send_head(sk) &&
+			TCP_SKB_CB(tcp_write_queue_tail(sk))->eor_info);
+}
+
 static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 				size_t size, int flags)
 {
@@ -959,7 +969,7 @@ new_segment:
 		offset += copy;
 		size -= copy;
 		if (!size) {
-			tcp_tx_timestamp(sk, sk->sk_tsflags, skb);
+			tcp_tx_timestamp(sk, sk->sk_tsflags, skb, 0);
 			goto out;
 		}
 
@@ -1145,6 +1155,9 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 
 	sg = !!(sk->sk_route_caps & NETIF_F_SG);
 
+	if (tcp_sendmsg_noappend(sk, sockc.tsflags))
+		goto new_segment;
+
 	while (msg_data_left(msg)) {
 		int copy = 0;
 		int max = size_goal;
@@ -1249,7 +1262,7 @@ new_segment:
 
 		copied += copy;
 		if (!msg_data_left(msg)) {
-			tcp_tx_timestamp(sk, sockc.tsflags, skb);
+			tcp_tx_timestamp(sk, sockc.tsflags, skb, flags);
 			goto out;
 		}
 
-- 
2.5.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 net-next 5/7] tcp: Make use of MSG_EOR in tcp_sendpage
  2016-04-18 22:46 [RFC PATCH v2 net-next 0/7] tcp: Make use of MSG_EOR in tcp_sendmsg Martin KaFai Lau
                   ` (3 preceding siblings ...)
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg Martin KaFai Lau
@ 2016-04-18 22:46 ` Martin KaFai Lau
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 6/7] tcp: Carry eor_info in tcp_fragment_tstamp() and tcp_skb_collapse_tstamp() Martin KaFai Lau
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 7/7] tcp: Avoid losing eor_info when collapsing skbs Martin KaFai Lau
  6 siblings, 0 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

It reuses the tcp_sendmsg_noappend() to decide if a new_segment
is needed before entering the loop.  More checks could be added
later for the tcp_sendpage case to decide if a new_segment is
needed immediately.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2918f42..6bb33b8 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -913,6 +913,9 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
 		goto out_err;
 
+	if (tcp_sendmsg_noappend(sk, sk->sk_tsflags))
+		goto new_segment;
+
 	while (size > 0) {
 		struct sk_buff *skb = tcp_write_queue_tail(sk);
 		int copy, i;
@@ -969,7 +972,7 @@ new_segment:
 		offset += copy;
 		size -= copy;
 		if (!size) {
-			tcp_tx_timestamp(sk, sk->sk_tsflags, skb, 0);
+			tcp_tx_timestamp(sk, sk->sk_tsflags, skb, flags);
 			goto out;
 		}
 
-- 
2.5.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 net-next 6/7] tcp: Carry eor_info in tcp_fragment_tstamp() and tcp_skb_collapse_tstamp()
  2016-04-18 22:46 [RFC PATCH v2 net-next 0/7] tcp: Make use of MSG_EOR in tcp_sendmsg Martin KaFai Lau
                   ` (4 preceding siblings ...)
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 5/7] tcp: Make use of MSG_EOR in tcp_sendpage Martin KaFai Lau
@ 2016-04-18 22:46 ` Martin KaFai Lau
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 7/7] tcp: Avoid losing eor_info when collapsing skbs Martin KaFai Lau
  6 siblings, 0 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

Like txstamp_ack bit, if needed, the eor_info bit should also be carried
to the new skb2 when splitting a skb
or
to the prev skb from the next_skb when collapsing skbs.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_output.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d21a78f..e71336c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1125,6 +1125,8 @@ static void tcp_fragment_tstamp(struct sk_buff *skb, struct sk_buff *skb2)
 		swap(shinfo->tskey, shinfo2->tskey);
 		TCP_SKB_CB(skb2)->txstamp_ack = TCP_SKB_CB(skb)->txstamp_ack;
 		TCP_SKB_CB(skb)->txstamp_ack = 0;
+		TCP_SKB_CB(skb2)->eor_info = TCP_SKB_CB(skb)->eor_info;
+		TCP_SKB_CB(skb)->eor_info = 0;
 	}
 }
 
@@ -2456,6 +2458,8 @@ void tcp_skb_collapse_tstamp(struct sk_buff *skb,
 		shinfo->tskey = next_shinfo->tskey;
 		TCP_SKB_CB(skb)->txstamp_ack =
 			!!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
+		if (TCP_SKB_CB(next_skb)->eor_info)
+			TCP_SKB_CB(skb)->eor_info = 1;
 	}
 }
 
-- 
2.5.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 net-next 7/7] tcp: Avoid losing eor_info when collapsing skbs
  2016-04-18 22:46 [RFC PATCH v2 net-next 0/7] tcp: Make use of MSG_EOR in tcp_sendmsg Martin KaFai Lau
                   ` (5 preceding siblings ...)
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 6/7] tcp: Carry eor_info in tcp_fragment_tstamp() and tcp_skb_collapse_tstamp() Martin KaFai Lau
@ 2016-04-18 22:46 ` Martin KaFai Lau
  6 siblings, 0 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

When collapsing skbs during tcp_collapse_retrans and tcp_shift_skb_data,
this patch is to avoid collapsing to a prev skb that has eor_info mark
and the next_skb also has the tskey set (i.e. the prev skb will lose
a eor marked tskey info).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h     | 6 ++++++
 net/ipv4/tcp_input.c  | 3 +++
 net/ipv4/tcp_output.c | 7 +++++--
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index f3c5dcb..56a287a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -812,6 +812,12 @@ static inline int tcp_skb_mss(const struct sk_buff *skb)
 	return TCP_SKB_CB(skb)->tcp_gso_size;
 }
 
+static inline bool  tcp_skb_can_collapse_eor(const struct sk_buff *skb,
+					     const struct sk_buff *next_skb)
+{
+	return !(TCP_SKB_CB(skb)->eor_info && skb_shinfo(next_skb)->tskey);
+}
+
 /* Events passed to congestion control interface */
 enum tcp_ca_event {
 	CA_EVENT_TX_START,	/* first transmit when no packets in flight */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 75e8336..e60922d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1368,6 +1368,9 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
 	if ((TCP_SKB_CB(prev)->sacked & TCPCB_TAGBITS) != TCPCB_SACKED_ACKED)
 		goto fallback;
 
+	if (!tcp_skb_can_collapse_eor(prev, skb))
+		goto fallback;
+
 	in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) &&
 		  !before(end_seq, TCP_SKB_CB(skb)->end_seq);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e71336c..1ae21f1 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2512,7 +2512,8 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 }
 
 /* Check if coalescing SKBs is legal. */
-static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
+static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *to,
+			     const struct sk_buff *skb)
 {
 	if (tcp_skb_pcount(skb) > 1)
 		return false;
@@ -2526,6 +2527,8 @@ static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
 	/* Some heurestics for collapsing over SACK'd could be invented */
 	if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
 		return false;
+	if (!tcp_skb_can_collapse_eor(to, skb))
+		return false;
 
 	return true;
 }
@@ -2546,7 +2549,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
 		return;
 
 	tcp_for_write_queue_from_safe(skb, tmp, sk) {
-		if (!tcp_can_collapse(sk, skb))
+		if (!tcp_can_collapse(sk, to, skb))
 			break;
 
 		space -= skb->len;
-- 
2.5.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg Martin KaFai Lau
@ 2016-04-18 23:18   ` Eric Dumazet
  2016-04-18 23:43     ` kafai
  2016-04-19  9:47     ` David Laight
  0 siblings, 2 replies; 28+ messages in thread
From: Eric Dumazet @ 2016-04-18 23:18 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

On Mon, 2016-04-18 at 15:46 -0700, Martin KaFai Lau wrote:
> This patch allows the user process to use MSG_EOR during
> tcp_sendmsg to tell the kernel that it is the last byte
> of an application response message.
> 
> It is currently useful when the end-user has turned on any bit of the
> SOF_TIMESTAMPING_TX_RECORD_MASK (either by setsockopt or cmsg).
> The kernel will then mark the newly added tcb->eor_info bit so
> that the shinfo->tskey will not be overwritten (i.e. lost) in
> the later skb append/collapse operation.
> 
> With selective SOF_TIMESTAMPING_TX_ACK (by cmsg) and MSG_EOR (this
> patch), the user application can specially tell which outgoing byte
> it wants to track its ACK and ask the kernel not to lose this
> tracking info in the later skb append/collapse action.
> 
> This patch handles the append case in tcp_sendmsg.  The later
> patches will handle the collapse during retransmission and
> skb slicing in tcp_fragment()/tso_fragment().
> 
> One of our use case is at the webserver.  The webserver tracks
> the HTTP2 response latency by measuring when the webserver sends
> the first byte to the socket till the TCP ACK of the last byte
> is received.  In the cases where we don't have client side
> measurement, measuring from the server side is the only option.
> In the cases we have the client side measurement, the server side
> data can also be used to justify/cross-check-with the client
> side data.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---

MSG_EOR should not depend on SKBTX_ANY_TSTAMP

Really, simply using send(fd, ..., len, MSG_EOR) should instruct TCP to
mark the cooked skb as a non candidate for future coalescing.

netperf could then get an option to set this MSG_EOR ;)

I believe Soheil was working on such simple alternative ?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg
  2016-04-18 23:18   ` Eric Dumazet
@ 2016-04-18 23:43     ` kafai
  2016-04-19  0:06       ` Eric Dumazet
  2016-04-19  9:47     ` David Laight
  1 sibling, 1 reply; 28+ messages in thread
From: kafai @ 2016-04-18 23:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

On Mon, Apr 18, 2016 at 04:18:13PM -0700, Eric Dumazet wrote:
> On Mon, 2016-04-18 at 15:46 -0700, Martin KaFai Lau wrote:
> > This patch allows the user process to use MSG_EOR during
> > tcp_sendmsg to tell the kernel that it is the last byte
> > of an application response message.
> >
> > It is currently useful when the end-user has turned on any bit of the
> > SOF_TIMESTAMPING_TX_RECORD_MASK (either by setsockopt or cmsg).
> > The kernel will then mark the newly added tcb->eor_info bit so
> > that the shinfo->tskey will not be overwritten (i.e. lost) in
> > the later skb append/collapse operation.
> >
> > With selective SOF_TIMESTAMPING_TX_ACK (by cmsg) and MSG_EOR (this
> > patch), the user application can specially tell which outgoing byte
> > it wants to track its ACK and ask the kernel not to lose this
> > tracking info in the later skb append/collapse action.
> >
> > This patch handles the append case in tcp_sendmsg.  The later
> > patches will handle the collapse during retransmission and
> > skb slicing in tcp_fragment()/tso_fragment().
> >
> > One of our use case is at the webserver.  The webserver tracks
> > the HTTP2 response latency by measuring when the webserver sends
> > the first byte to the socket till the TCP ACK of the last byte
> > is received.  In the cases where we don't have client side
> > measurement, measuring from the server side is the only option.
> > In the cases we have the client side measurement, the server side
> > data can also be used to justify/cross-check-with the client
> > side data.
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Yuchung Cheng <ycheng@google.com>
> > ---
>
> MSG_EOR should not depend on SKBTX_ANY_TSTAMP
>
> Really, simply using send(fd, ..., len, MSG_EOR) should instruct TCP to
> mark the cooked skb as a non candidate for future coalescing.
It was one of my earlier local attempt.  There are cases that coalescing
will not lose the tskey, so I trashed it.

If we mark eor only based on MSG_EOR, we can still do checks on
prev_skb's tskey and next_skb's tskey before coalescing two skbs
or
you meant simply don't coalesce if the prev_skb has eor marked?

>
> netperf could then get an option to set this MSG_EOR ;)
Not sure how it is related.  Can you share how netperf can
benefit from MSG_EOR in TCP tests without any of the
SOF_TIMESTAMPING_TX_RECORD_MASK.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg
  2016-04-18 23:43     ` kafai
@ 2016-04-19  0:06       ` Eric Dumazet
  2016-04-19  2:27         ` Martin KaFai Lau
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2016-04-19  0:06 UTC (permalink / raw)
  To: kafai
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

On Mon, 2016-04-18 at 16:43 -0700, kafai@fb.com wrote:

> >
> > netperf could then get an option to set this MSG_EOR ;)
> Not sure how it is related.  Can you share how netperf can
> benefit from MSG_EOR in TCP tests without any of the
> SOF_TIMESTAMPING_TX_RECORD_MASK.

Simply setting MSG_EOR would be orthogonal to other timestamping stuff.

Maybe the application does not _want_ to be notified when skb is sent or
acknowledged, but would like some kind of "tcpdump awareness" or
something about burst control, who knows...

It should only be a request from user space to ask TCP to not aggregate
stuff on future sendpage()/sendmsg() on the skb carrying this new flag.

We already have other flags to ask for timestamping stuff, and they
could be used at the same time.

If the stack needs to be changed to properly fragment skb (or
aggregating them at retransmit), this is a separate concern.

Note that you do not need to automatically assert MSG_BOR (Begin of
Request) : MSG_EOR should really control the fact that last byte sent
marks the skb as being a non candidate for aggregation.

This would keep tcp_sendmsg() reasonnably fast.
Your tcp_sendmsg_noappend() is quite expensive :(

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg
  2016-04-19  0:06       ` Eric Dumazet
@ 2016-04-19  2:27         ` Martin KaFai Lau
  2016-04-19  2:50           ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2016-04-19  2:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

On Mon, Apr 18, 2016 at 05:06:57PM -0700, Eric Dumazet wrote:
> It should only be a request from user space to ask TCP to not aggregate
> stuff on future sendpage()/sendmsg() on the skb carrying this new flag.
>
How about something like this.  Please advise if tcp_sendmsg_noappend can
be simpler.

diff --git a/include/net/tcp.h b/include/net/tcp.h
index c0ef054..ac31798 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -762,7 +762,8 @@ struct tcp_skb_cb {

 	__u8		ip_dsfield;	/* IPv4 tos or IPv6 dsfield	*/
 	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
-			unused:7;
+			eor:1,		/* Is skb MSG_EOR marked */
+			unused:6;
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
 	union {
 		struct inet_skb_parm	h4;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4d73858..12772be 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -874,6 +874,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 	return mss_now;
 }

+static bool tcp_sendmsg_noappend(const struct sock *sk)
+{
+	const struct sk_buff *skb = tcp_write_queue_tail(sk);
+
+	return (!skb || TCP_SKB_CB(skb)->eor);
+}
+
 static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 				size_t size, int flags)
 {
@@ -903,6 +910,9 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
 		goto out_err;

+	if (tcp_sendmsg_noappend(sk))
+		goto new_segment;
+
 	while (size > 0) {
 		struct sk_buff *skb = tcp_write_queue_tail(sk);
 		int copy, i;
@@ -960,6 +970,7 @@ new_segment:
 		size -= copy;
 		if (!size) {
 			tcp_tx_timestamp(sk, sk->sk_tsflags, skb);
+			TCP_SKB_CB(skb)->eor = !!(flags & MSG_EOR);
 			goto out;
 		}

@@ -1145,6 +1156,9 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)

 	sg = !!(sk->sk_route_caps & NETIF_F_SG);

+	if (tcp_sendmsg_noappend(sk))
+		goto new_segment;
+
 	while (msg_data_left(msg)) {
 		int copy = 0;
 		int max = size_goal;
@@ -1250,6 +1264,7 @@ new_segment:
 		copied += copy;
 		if (!msg_data_left(msg)) {
 			tcp_tx_timestamp(sk, sockc.tsflags, skb);
+			TCP_SKB_CB(skb)->eor = !!(flags & MSG_EOR);
 			goto out;
 		}

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg
  2016-04-19  2:27         ` Martin KaFai Lau
@ 2016-04-19  2:50           ` Eric Dumazet
  2016-04-19  3:18             ` Martin KaFai Lau
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2016-04-19  2:50 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

On Mon, 2016-04-18 at 19:27 -0700, Martin KaFai Lau wrote:
> On Mon, Apr 18, 2016 at 05:06:57PM -0700, Eric Dumazet wrote:
> > It should only be a request from user space to ask TCP to not aggregate
> > stuff on future sendpage()/sendmsg() on the skb carrying this new flag.
> >
> How about something like this.  Please advise if tcp_sendmsg_noappend can
> be simpler.
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index c0ef054..ac31798 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -762,7 +762,8 @@ struct tcp_skb_cb {
> 
>  	__u8		ip_dsfield;	/* IPv4 tos or IPv6 dsfield	*/
>  	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
> -			unused:7;
> +			eor:1,		/* Is skb MSG_EOR marked */
> +			unused:6;
>  	__u32		ack_seq;	/* Sequence number ACK'd	*/
>  	union {
>  		struct inet_skb_parm	h4;
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 4d73858..12772be 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -874,6 +874,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
>  	return mss_now;
>  }
> 
> +static bool tcp_sendmsg_noappend(const struct sock *sk)
> +{
> +	const struct sk_buff *skb = tcp_write_queue_tail(sk);
> +
> +	return (!skb || TCP_SKB_CB(skb)->eor);
> +}
> +
>  static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
>  				size_t size, int flags)
>  {
> @@ -903,6 +910,9 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
>  	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
>  		goto out_err;
> 
> +	if (tcp_sendmsg_noappend(sk))
> +		goto new_segment;
> +
>  	while (size > 0) {
>  		struct sk_buff *skb = tcp_write_queue_tail(sk);
>  		int copy, i;
> @@ -960,6 +970,7 @@ new_segment:
>  		size -= copy;
>  		if (!size) {
>  			tcp_tx_timestamp(sk, sk->sk_tsflags, skb);
> +			TCP_SKB_CB(skb)->eor = !!(flags & MSG_EOR);
>  			goto out;
>  		}
> 
> @@ -1145,6 +1156,9 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> 
>  	sg = !!(sk->sk_route_caps & NETIF_F_SG);
> 
> +	if (tcp_sendmsg_noappend(sk))
> +		goto new_segment;
> +
>  	while (msg_data_left(msg)) {
>  		int copy = 0;
>  		int max = size_goal;
> @@ -1250,6 +1264,7 @@ new_segment:
>  		copied += copy;
>  		if (!msg_data_left(msg)) {
>  			tcp_tx_timestamp(sk, sockc.tsflags, skb);
> +			TCP_SKB_CB(skb)->eor = !!(flags & MSG_EOR);
>  			goto out;
>  		}

I believe it is slightly wrong (to do the goto new_segment if there is
no data to send)

I would instead use this fast path, doing the test _when_ we already
have an skb to test for.

diff --git a/include/net/tcp.h b/include/net/tcp.h
index fd40f8c64d5f..015851634195 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -760,7 +760,8 @@ struct tcp_skb_cb {
 
 	__u8		ip_dsfield;	/* IPv4 tos or IPv6 dsfield	*/
 	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
-			unused:7;
+			eor:1,		/* Is skb MSG_EOR marked */
+			unused:6;
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
 	union {
 		struct inet_skb_parm	h4;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4d73858991af..1944c19885ab 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -908,7 +908,9 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 		int copy, i;
 		bool can_coalesce;
 
-		if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0) {
+		if (!tcp_send_head(sk) ||
+		    (copy = size_goal - skb->len) <= 0 ||
+		    TCP_SKB_CB(skb)->eor) {
 new_segment:
 			if (!sk_stream_memory_free(sk))
 				goto wait_for_sndbuf;
@@ -960,6 +962,7 @@ new_segment:
 		size -= copy;
 		if (!size) {
 			tcp_tx_timestamp(sk, sk->sk_tsflags, skb);
+			TCP_SKB_CB(skb)->eor = !!(flags & MSG_EOR);
 			goto out;
 		}
 
@@ -1156,7 +1159,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 			copy = max - skb->len;
 		}
 
-		if (copy <= 0) {
+		if (copy <= 0 || TCP_SKB_CB(skb)->eor) {
 new_segment:
 			/* Allocate new segment. If the interface is SG,
 			 * allocate skb fitting to single page.
@@ -1250,6 +1253,7 @@ new_segment:
 		copied += copy;
 		if (!msg_data_left(msg)) {
 			tcp_tx_timestamp(sk, sockc.tsflags, skb);
+			TCP_SKB_CB(skb)->eor = !!(flags & MSG_EOR);
 			goto out;
 		}
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6451b83d81e9..acfbff81ef47 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1171,6 +1171,8 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
 	TCP_SKB_CB(skb)->tcp_flags = flags & ~(TCPHDR_FIN | TCPHDR_PSH);
 	TCP_SKB_CB(buff)->tcp_flags = flags;
 	TCP_SKB_CB(buff)->sacked = TCP_SKB_CB(skb)->sacked;
+	TCP_SKB_CB(buff)->eor = TCP_SKB_CB(skb)->eor;
+	TCP_SKB_CB(skb)->eor = 0;
 
 	if (!skb_shinfo(skb)->nr_frags && skb->ip_summed != CHECKSUM_PARTIAL) {
 		/* Copy and checksum data tail into the new buffer. */
@@ -1730,6 +1732,8 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
 
 	/* This packet was never sent out yet, so no SACK bits. */
 	TCP_SKB_CB(buff)->sacked = 0;
+	TCP_SKB_CB(buff)->eor = TCP_SKB_CB(skb)->eor;
+	TCP_SKB_CB(skb)->eor = 0;
 
 	buff->ip_summed = skb->ip_summed = CHECKSUM_PARTIAL;
 	skb_split(skb, buff, len);
@@ -2471,6 +2475,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 
 	/* Merge over control information. This moves PSH/FIN etc. over */
 	TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
+	TCP_SKB_CB(skb)->eor = TCP_SKB_CB(next_skb)->eor;
 
 	/* All done, get rid of second SKB and account for it so
 	 * packet counting does not break.
@@ -2502,7 +2507,8 @@ static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
 	/* Some heurestics for collapsing over SACK'd could be invented */
 	if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
 		return false;
-
+	if (TCP_SKB_CB(skb)->eor)
+		return false;
 	return true;
 }
 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg
  2016-04-19  2:50           ` Eric Dumazet
@ 2016-04-19  3:18             ` Martin KaFai Lau
  2016-04-19  3:25               ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2016-04-19  3:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

On Mon, Apr 18, 2016 at 07:50:41PM -0700, Eric Dumazet wrote:
> I believe it is slightly wrong (to do the goto new_segment if there is
> no data to send)
Aha. Thanks for pointing it out.

>
> I would instead use this fast path, doing the test _when_ we already
> have an skb to test for.
The v1 was doing a check in the loop but the feedback was, instead
of doing this unlikely(test) repeatedly in the loop, do it before
entering the loop and do a goto new_segment if needed.

I agree that doing it in the loop is easier to follow/read
and checking TCP_SKB_CB(skb)->eor is cheaper than my v1.
I will respin with your suggestion.

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 6451b83d81e9..acfbff81ef47 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1171,6 +1171,8 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
>  	TCP_SKB_CB(skb)->tcp_flags = flags & ~(TCPHDR_FIN | TCPHDR_PSH);
>  	TCP_SKB_CB(buff)->tcp_flags = flags;
>  	TCP_SKB_CB(buff)->sacked = TCP_SKB_CB(skb)->sacked;
> +	TCP_SKB_CB(buff)->eor = TCP_SKB_CB(skb)->eor;
> +	TCP_SKB_CB(skb)->eor = 0;
>
>  	if (!skb_shinfo(skb)->nr_frags && skb->ip_summed != CHECKSUM_PARTIAL) {
>  		/* Copy and checksum data tail into the new buffer. */
> @@ -1730,6 +1732,8 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
>
>  	/* This packet was never sent out yet, so no SACK bits. */
>  	TCP_SKB_CB(buff)->sacked = 0;
> +	TCP_SKB_CB(buff)->eor = TCP_SKB_CB(skb)->eor;
> +	TCP_SKB_CB(skb)->eor = 0;
>
>  	buff->ip_summed = skb->ip_summed = CHECKSUM_PARTIAL;
>  	skb_split(skb, buff, len);
> @@ -2471,6 +2475,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>
>  	/* Merge over control information. This moves PSH/FIN etc. over */
>  	TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
> +	TCP_SKB_CB(skb)->eor = TCP_SKB_CB(next_skb)->eor;
>
>  	/* All done, get rid of second SKB and account for it so
>  	 * packet counting does not break.
> @@ -2502,7 +2507,8 @@ static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
>  	/* Some heurestics for collapsing over SACK'd could be invented */
>  	if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
>  		return false;
> -
> +	if (TCP_SKB_CB(skb)->eor)
> +		return false;
>  	return true;
>  }
Thanks for this diff.  It confirms that I probably understand your last
suggestion correctly.  I also have similar diff for the sacks handling.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg
  2016-04-19  3:18             ` Martin KaFai Lau
@ 2016-04-19  3:25               ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2016-04-19  3:25 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

On Mon, 2016-04-18 at 20:18 -0700, Martin KaFai Lau wrote:
> On Mon, Apr 18, 2016 at 07:50:41PM -0700, Eric Dumazet wrote:
> > I believe it is slightly wrong (to do the goto new_segment if there is
> > no data to send)
> Aha. Thanks for pointing it out.
> 
> >
> > I would instead use this fast path, doing the test _when_ we already
> > have an skb to test for.
> The v1 was doing a check in the loop but the feedback was, instead
> of doing this unlikely(test) repeatedly in the loop, do it before
> entering the loop and do a goto new_segment if needed.
> 
> I agree that doing it in the loop is easier to follow/read
> and checking TCP_SKB_CB(skb)->eor is cheaper than my v1.
> I will respin with your suggestion.

I do not remember the suggestion. It is better to do the test in the
loop, as you could have multiple threads doing a sendmsg() on the same
socket, and eventually sleeping in sk_stream_wait_memory()

A bit convoluted, but who knows what can be done by some applications ;)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 1/7] tcp: Carry txstamp_ack in tcp_fragment_tstamp
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 1/7] tcp: Carry txstamp_ack in tcp_fragment_tstamp Martin KaFai Lau
@ 2016-04-19  5:21   ` Soheil Hassas Yeganeh
  2016-04-19 17:39     ` Martin KaFai Lau
  0 siblings, 1 reply; 28+ messages in thread
From: Soheil Hassas Yeganeh @ 2016-04-19  5:21 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team,
	Soheil Hassas Yeganeh

On Mon, Apr 18, 2016 at 6:46 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> When a tcp skb is sliced into two smaller skbs (e.g. in
> tcp_fragment() and tso_fragment()),  it does not carry
> the txstamp_ack bit to the newly created skb if it is needed.
> The end result is a timestamping event (SCM_TSTAMP_ACK) will
> be missing from the sk->sk_error_queue.
>
> This patch carries this bit to the new skb2 (if needed)
> in tcp_fragment_tstamp().
>
> BPF Output Before:
> ~~~~~~
> <No output due to missing SCM_TSTAMP_ACK timestamp>
>
> BPF Output After:
> ~~~~~~
> <...>-2050  [000] d.s.   100.928763: : ee_data:14599
>
> Packetdrill Script:
> ~~~~~~
> +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
> +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
> +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
> 0.200 < . 1:1(0) ack 1 win 257
> 0.200 accept(3, ..., ...) = 4
> +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
>
> +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
> 0.200 write(4, ..., 14600) = 14600
> +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
>
> 0.200 > . 1:7301(7300) ack 1
> 0.200 > P. 7301:14601(7300) ack 1
>
> 0.300 < . 1:1(0) ack 14601 win 257
>
> 0.300 close(4) = 0
> 0.300 > F. 14601:14601(0) ack 1
> 0.400 < F. 1:1(0) ack 16062 win 257
> 0.400 > . 14602:14602(0) ack 2
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc:  Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
>  net/ipv4/tcp_output.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 6451b83..0527ce9 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1123,6 +1123,8 @@ static void tcp_fragment_tstamp(struct sk_buff *skb, struct sk_buff *skb2)
>                 shinfo->tx_flags &= ~tsflags;
>                 shinfo2->tx_flags |= tsflags;
>                 swap(shinfo->tskey, shinfo2->tskey);
> +               TCP_SKB_CB(skb2)->txstamp_ack = TCP_SKB_CB(skb)->txstamp_ack;
> +               TCP_SKB_CB(skb)->txstamp_ack = 0;
>         }
>  }

Thanks for the fixes! I was going to submit similar patches. :-)

Could you please submit the timestamping patches separately as non RFCs? Thanks!

> --
> 2.5.1
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 2/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_collapse_retrans
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 2/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_collapse_retrans Martin KaFai Lau
@ 2016-04-19  5:32   ` Soheil Hassas Yeganeh
  2016-04-19 17:28     ` Martin KaFai Lau
  0 siblings, 1 reply; 28+ messages in thread
From: Soheil Hassas Yeganeh @ 2016-04-19  5:32 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

On Mon, Apr 18, 2016 at 6:46 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> If two skbs are merged/collapsed during retransmission, the current
> logic does not merge the tx_flags, tskey and txstamp_ack.  The end
> result is the SCM_TSTAMP_ACK timestamp could be missing for a
> packet that the end-user has specifically turned on
> SOF_TIMESTAMPING_TX_ACK (e.g. by cmsg).
>
> The patch:
> 1. Merge the tx_flags and txstamp_ack
> 2. Overwrite the tskey with the later skb (next_skb)
>
> BPF Output Before:
> ~~~~~~
> <no-output-due-to-missing-tstamp-event>
>
> BPF Output After:
> ~~~~~~
> packetdrill-2092  [001] d.s.   453.998486: : ee_data:1459
>
> Packetdrill Script:
> ~~~~~~
> +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
> +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
> +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
> 0.200 < . 1:1(0) ack 1 win 257
> 0.200 accept(3, ..., ...) = 4
> +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
>
> 0.200 write(4, ..., 730) = 730
> +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
> 0.200 write(4, ..., 730) = 730
> +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
> 0.200 write(4, ..., 11680) = 11680
>
> 0.200 > P. 1:731(730) ack 1
> 0.200 > P. 731:1461(730) ack 1
> 0.200 > . 1461:8761(7300) ack 1
> 0.200 > P. 8761:13141(4380) ack 1
>
> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
> 0.300 > P. 1:1461(1460) ack 1
> 0.400 < . 1:1(0) ack 13141 win 257
>
> 0.400 close(4) = 0
> 0.400 > F. 13141:13141(0) ack 1
> 0.500 < F. 1:1(0) ack 13142 win 257
> 0.500 > . 13142:13142(0) ack 2
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>

Cc:  Soheil Hassas Yeganeh <soheil@google.com>

> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
>  net/ipv4/tcp_output.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 0527ce9..889ed96 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2443,6 +2443,22 @@ u32 __tcp_select_window(struct sock *sk)
>         return window;
>  }
>
> +static void tcp_skb_collapse_tstamp(struct sk_buff *skb,
> +                                   const struct sk_buff *next_skb)
> +{
> +       const struct skb_shared_info *next_shinfo = skb_shinfo(next_skb);
> +
> +       if (unlikely(next_shinfo->tx_flags & SKBTX_ANY_TSTAMP)) {
> +               struct skb_shared_info *shinfo = skb_shinfo(skb);
> +               u8 tsflags = next_shinfo->tx_flags & SKBTX_ANY_TSTAMP;

nit: maybe move this local variable out of the if block?

      tsflags = ...
      if (unlikely(tsflags)) { ... }

> +
> +               shinfo->tx_flags |= tsflags;
> +               shinfo->tskey = next_shinfo->tskey;
> +               TCP_SKB_CB(skb)->txstamp_ack =
> +                       !!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);

Maybe we can skip a conditional jump here (because of !!), by simply
using the cached bit in next_skb:
TCP_SKB_CB(skb)->txstamp_ack = TCP_SKB_CB(next_skb)->txstamp_ack;

> +       }
> +}
> +
>  /* Collapses two adjacent SKB's during retransmission. */
>  static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>  {
> @@ -2486,6 +2502,8 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>
>         tcp_adjust_pcount(sk, next_skb, tcp_skb_pcount(next_skb));
>
> +       tcp_skb_collapse_tstamp(skb, next_skb);
> +
>         sk_wmem_free_skb(sk, next_skb);
>  }

Really nice fixes! thanks.

> --
> 2.5.1
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 3/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_shifted_skb
  2016-04-18 22:46 ` [RFC PATCH v2 net-next 3/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_shifted_skb Martin KaFai Lau
@ 2016-04-19  5:38   ` Soheil Hassas Yeganeh
  0 siblings, 0 replies; 28+ messages in thread
From: Soheil Hassas Yeganeh @ 2016-04-19  5:38 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

On Mon, Apr 18, 2016 at 6:46 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> After receiving sacks, tcp_shifted_skb() will collapse
> skbs if possible.  tx_flags/tskey/txstamp_ack also has
> to be merged in this case.
>
> This patch resues the tcp_skb_collapse_tstamp() to handle
> them.
>
> BPF Output Before:
> ~~~~~
> <no-output-due-to-missing-tstamp-event>
>
> BPF Output After:
> ~~~~~
> <...>-2024  [007] d.s.    88.644374: : ee_data:14599
>
> Packetdrill Script:
> ~~~~~
> +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
> +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
> +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
> 0.200 < . 1:1(0) ack 1 win 257
> 0.200 accept(3, ..., ...) = 4
> +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
>
> 0.200 write(4, ..., 1460) = 1460
> +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
> 0.200 write(4, ..., 13140) = 13140
> +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
>
> 0.200 > P. 1:1461(1460) ack 1
> 0.200 > . 1461:8761(7300) ack 1
> 0.200 > P. 8761:14601(5840) ack 1
>
> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
> 0.300 > P. 1:1461(1460) ack 1
> 0.400 < . 1:1(0) ack 14601 win 257
>
> 0.400 close(4) = 0
> 0.400 > F. 14601:14601(0) ack 1
> 0.500 < F. 1:1(0) ack 14602 win 257
> 0.500 > . 14602:14602(0) ack 2
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
>  include/net/tcp.h     | 2 ++
>  net/ipv4/tcp_input.c  | 1 +
>  net/ipv4/tcp_output.c | 4 ++--
>  3 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index fd40f8c..c0ef054 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -557,6 +557,8 @@ void tcp_send_ack(struct sock *sk);
>  void tcp_send_delayed_ack(struct sock *sk);
>  void tcp_send_loss_probe(struct sock *sk);
>  bool tcp_schedule_loss_probe(struct sock *sk);
> +void tcp_skb_collapse_tstamp(struct sk_buff *skb,
> +                            const struct sk_buff *next_skb);
>
>  /* tcp_input.c */
>  void tcp_resume_early_retransmit(struct sock *sk);
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 5e45a9c..75e8336 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -1309,6 +1309,7 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
>         if (skb == tcp_highest_sack(sk))
>                 tcp_advance_highest_sack(sk, skb);
>
> +       tcp_skb_collapse_tstamp(prev, skb);
>         tcp_unlink_write_queue(skb, sk);
>         sk_wmem_free_skb(sk, skb);
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 889ed96..d21a78f 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2443,8 +2443,8 @@ u32 __tcp_select_window(struct sock *sk)
>         return window;
>  }
>
> -static void tcp_skb_collapse_tstamp(struct sk_buff *skb,
> -                                   const struct sk_buff *next_skb)
> +void tcp_skb_collapse_tstamp(struct sk_buff *skb,
> +                            const struct sk_buff *next_skb)
>  {
>         const struct skb_shared_info *next_shinfo = skb_shinfo(next_skb);
>

nice, thanks for the fix!

> --
> 2.5.1
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg
  2016-04-18 23:18   ` Eric Dumazet
  2016-04-18 23:43     ` kafai
@ 2016-04-19  9:47     ` David Laight
  2016-04-19 12:19       ` Eric Dumazet
  1 sibling, 1 reply; 28+ messages in thread
From: David Laight @ 2016-04-19  9:47 UTC (permalink / raw)
  To: 'Eric Dumazet', Martin KaFai Lau
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

From: Eric Dumazet
> Sent: 19 April 2016 00:18
...
> MSG_EOR should not depend on SKBTX_ANY_TSTAMP
> 
> Really, simply using send(fd, ..., len, MSG_EOR) should instruct TCP to
> mark the cooked skb as a non candidate for future coalescing.

Isn't that very similar to the inverse of MSG_MORE?
Or a send with Nagle disabled?

	David


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg
  2016-04-19  9:47     ` David Laight
@ 2016-04-19 12:19       ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2016-04-19 12:19 UTC (permalink / raw)
  To: David Laight
  Cc: Martin KaFai Lau, netdev, Eric Dumazet, Neal Cardwell,
	Soheil Hassas Yeganeh, Willem de Bruijn, Yuchung Cheng,
	Kernel Team

On Tue, 2016-04-19 at 09:47 +0000, David Laight wrote:
> From: Eric Dumazet
> > Sent: 19 April 2016 00:18
> ...
> > MSG_EOR should not depend on SKBTX_ANY_TSTAMP
> > 
> > Really, simply using send(fd, ..., len, MSG_EOR) should instruct TCP to
> > mark the cooked skb as a non candidate for future coalescing.
> 
> Isn't that very similar to the inverse of MSG_MORE?

Yes. But not specifying MSG_MORE does not mean we want to force send
tiny packets. TCP is allowed to aggregate since it is a stream protocol,
since it reduces overhead.

> Or a send with Nagle disabled?

No.  tcp_sendmsg() still can coalesce/aggregate data to the last packet
in write queue, if it was not yet sent, regardless of Nagle.

If you enable or disable Nagle, following command will only send first
packets with 724 bytes, but following ones will be very big.

netperf -t TCP_STREAM -- -m 724

05:16:53.707778 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [S], seq 711876787, win 29200, options [mss 1460,sackOK,TS val 320048472 ecr 0,nop,wscale 7], length 0
05:16:53.707908 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [S.], seq 2592965917, ack 711876788, win 28960, options [mss 1460,sackOK,TS val 23660532 ecr 320048472,nop,wscale 7], length 0
05:16:53.708001 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [.], ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 0
05:16:53.708045 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 1:725, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 724
05:16:53.708053 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 725, win 238, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708083 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 725:2173, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 1448
05:16:53.708097 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 2173, win 261, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708094 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 2173:3621, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 1448
05:16:53.708107 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 3621, win 283, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708132 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 3621:6517, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 2896
05:16:53.708133 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 6517:7965, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 1448
05:16:53.708148 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 6517, win 329, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708152 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 7965, win 351, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708157 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 7965:10861, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 2896
05:16:53.708165 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 10861, win 396, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708168 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 10861:12309, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 1448
05:16:53.708177 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 12309, win 419, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708220 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [.], seq 12309:16653, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 4344
05:16:53.708240 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 16653, win 487, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708335 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 16653:28237, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 11584
05:16:53.708378 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 28237, win 668, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708453 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 28237:42717, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 14480
05:16:53.708496 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 42717, win 894, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708537 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [.], seq 42717:49957, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 7240
05:16:53.708585 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 49957:57197, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 7240
05:16:53.708641 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 49957, win 1007, options [nop,nop,TS val 23660533 ecr 320048473], length 0
05:16:53.708651 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 57197, win 1120, options [nop,nop,TS val 23660533 ecr 320048473], length 0
05:16:53.708806 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 57197:83261, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 26064
05:16:53.708836 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 83261, win 1528, options [nop,nop,TS val 23660533 ecr 320048473], length 0
05:16:53.708944 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [.], seq 83261:97741, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660533], length 14480
05:16:53.708967 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 97741, win 1754, options [nop,nop,TS val 23660533 ecr 320048473], length 0
05:16:53.709005 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [.], seq 97741:104981, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660533], length 7240
05:16:53.709036 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 104981:109325, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660533], length 4344
05:16:53.709058 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 104981, win 1867, options [nop,nop,TS val 23660533 ecr 320048473], length 0
05:16:53.709065 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 109325, win 1924, options [nop,nop,TS val 23660533 ecr 320048473], length 0
05:16:53.709251 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 109325:135389, ack 1, win 229, options [nop,nop,TS val 320048474 ecr 23660533], length 26064
05:16:53.709332 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [.], seq 135389:141181, ack 1, win 229, options [nop,nop,TS val 320048474 ecr 23660533], length 5792
05:16:53.709423 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 135389, win 1924, options [nop,nop,TS val 23660533 ecr 320048474], length 0
05:16:53.709434 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 141181, win 1924, options [nop,nop,TS val 23660533 ecr 320048474], length 0
05:16:53.709730 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 141181:191861, ack 1, win 229, options [nop,nop,TS val 320048474 ecr 23660533], length 50680

Using MSG_EOR on the send(fd, buffer, 724, MSG_EOR) would force TCP to
send tiny datagrams and not big GSO/TSO packets.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 2/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_collapse_retrans
  2016-04-19  5:32   ` Soheil Hassas Yeganeh
@ 2016-04-19 17:28     ` Martin KaFai Lau
  2016-04-19 17:35       ` Eric Dumazet
  2016-04-19 17:42       ` Soheil Hassas Yeganeh
  0 siblings, 2 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2016-04-19 17:28 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

On Tue, Apr 19, 2016 at 01:32:14AM -0400, Soheil Hassas Yeganeh wrote:
> > +               TCP_SKB_CB(skb)->txstamp_ack =
> > +                       !!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
>
> Maybe we can skip a conditional jump here (because of !!), by simply
> using the cached bit in next_skb:
> TCP_SKB_CB(skb)->txstamp_ack = TCP_SKB_CB(next_skb)->txstamp_ack;
Recall the tx_flags are merged/combined (and so should be the txstamp_ack).
Would there be a case that TCP_SKB_CB(skb)->txstamp_ack is 1 and
TCP_SKB_CB(next_skb)->txstamp_ack is 0?

I can change it like the following which may help in showing the intention:
if (TCP_SKB_CB(next_skb)->txstamp_ack)
	TCP_SKB_CB(skb)->txstamp_ack = 1;

A bit off topic, I feel like the SKBTX_ACK_TSTAMP and txstamp_ack are sort
of redundant but I have not look into the details yet, so not completely
sure.  It wwould be a separate cleanup patch if it is the case.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 2/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_collapse_retrans
  2016-04-19 17:28     ` Martin KaFai Lau
@ 2016-04-19 17:35       ` Eric Dumazet
  2016-04-19 18:18         ` Martin KaFai Lau
  2016-04-19 17:42       ` Soheil Hassas Yeganeh
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2016-04-19 17:35 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Soheil Hassas Yeganeh, netdev, Neal Cardwell,
	Soheil Hassas Yeganeh, Willem de Bruijn, Yuchung Cheng,
	Kernel Team

On Tue, Apr 19, 2016 at 10:28 AM, Martin KaFai Lau <kafai@fb.com> wrote:

> A bit off topic, I feel like the SKBTX_ACK_TSTAMP and txstamp_ack are sort
> of redundant but I have not look into the details yet, so not completely
> sure.  It wwould be a separate cleanup patch if it is the case.

Please read 6b084928baac562ed61866f540a96120e9c9ddb7 changelog ;)

A cache line miss avoidance is critical

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 1/7] tcp: Carry txstamp_ack in tcp_fragment_tstamp
  2016-04-19  5:21   ` Soheil Hassas Yeganeh
@ 2016-04-19 17:39     ` Martin KaFai Lau
  2016-04-19 17:44       ` Soheil Hassas Yeganeh
  0 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2016-04-19 17:39 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

On Tue, Apr 19, 2016 at 01:21:04AM -0400, Soheil Hassas Yeganeh wrote:
> Could you please submit the timestamping patches separately as non RFCs? Thanks!
Agree.  I will re-spin.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 2/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_collapse_retrans
  2016-04-19 17:28     ` Martin KaFai Lau
  2016-04-19 17:35       ` Eric Dumazet
@ 2016-04-19 17:42       ` Soheil Hassas Yeganeh
  1 sibling, 0 replies; 28+ messages in thread
From: Soheil Hassas Yeganeh @ 2016-04-19 17:42 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

On Tue, Apr 19, 2016 at 1:28 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> On Tue, Apr 19, 2016 at 01:32:14AM -0400, Soheil Hassas Yeganeh wrote:
>> > +               TCP_SKB_CB(skb)->txstamp_ack =
>> > +                       !!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
>>
>> Maybe we can skip a conditional jump here (because of !!), by simply
>> using the cached bit in next_skb:
>> TCP_SKB_CB(skb)->txstamp_ack = TCP_SKB_CB(next_skb)->txstamp_ack;
> Recall the tx_flags are merged/combined (and so should be the txstamp_ack).

Oh sure, sorry, I missed an "or":

TCP_SKB_CB(skb)->txstamp_ack |= TCP_SKB_CB(next_skb)->txstamp_ack;

> Would there be a case that TCP_SKB_CB(skb)->txstamp_ack is 1 and
> TCP_SKB_CB(next_skb)->txstamp_ack is 0?
>
> I can change it like the following which may help in showing the intention:
> if (TCP_SKB_CB(next_skb)->txstamp_ack)
>         TCP_SKB_CB(skb)->txstamp_ack = 1;
>
> A bit off topic, I feel like the SKBTX_ACK_TSTAMP and txstamp_ack are sort
> of redundant but I have not look into the details yet, so not completely
> sure.  It wwould be a separate cleanup patch if it is the case.

As Eric mentioned, this is needed to avoid a cache-line miss in
accessing the shared info.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 1/7] tcp: Carry txstamp_ack in tcp_fragment_tstamp
  2016-04-19 17:39     ` Martin KaFai Lau
@ 2016-04-19 17:44       ` Soheil Hassas Yeganeh
  0 siblings, 0 replies; 28+ messages in thread
From: Soheil Hassas Yeganeh @ 2016-04-19 17:44 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

On Tue, Apr 19, 2016 at 1:39 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> On Tue, Apr 19, 2016 at 01:21:04AM -0400, Soheil Hassas Yeganeh wrote:
>> Could you please submit the timestamping patches separately as non RFCs? Thanks!
> Agree.  I will re-spin.

Great, thank you very much!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 2/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_collapse_retrans
  2016-04-19 17:35       ` Eric Dumazet
@ 2016-04-19 18:18         ` Martin KaFai Lau
  2016-04-19 18:24           ` Soheil Hassas Yeganeh
  0 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2016-04-19 18:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Soheil Hassas Yeganeh, netdev, Neal Cardwell,
	Soheil Hassas Yeganeh, Willem de Bruijn, Yuchung Cheng,
	Kernel Team

On Tue, Apr 19, 2016 at 10:35:52AM -0700, Eric Dumazet wrote:
> On Tue, Apr 19, 2016 at 10:28 AM, Martin KaFai Lau <kafai@fb.com> wrote:
>
> > A bit off topic, I feel like the SKBTX_ACK_TSTAMP and txstamp_ack are sort
> > of redundant but I have not look into the details yet, so not completely
> > sure.  It wwould be a separate cleanup patch if it is the case.
>
> Please read 6b084928baac562ed61866f540a96120e9c9ddb7 changelog ;)
>
> A cache line miss avoidance is critical
I looked at the patch but I probably am missing something :(
Is checking txstamp_ack alone enough and SKBTX_ACK_TSTAMP is not needed
since they are always set together?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 2/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_collapse_retrans
  2016-04-19 18:18         ` Martin KaFai Lau
@ 2016-04-19 18:24           ` Soheil Hassas Yeganeh
  2016-04-21 20:25             ` Willem de Bruijn
  0 siblings, 1 reply; 28+ messages in thread
From: Soheil Hassas Yeganeh @ 2016-04-19 18:24 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Eric Dumazet, netdev, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

On Tue, Apr 19, 2016 at 2:18 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> On Tue, Apr 19, 2016 at 10:35:52AM -0700, Eric Dumazet wrote:
>> On Tue, Apr 19, 2016 at 10:28 AM, Martin KaFai Lau <kafai@fb.com> wrote:
>>
>> > A bit off topic, I feel like the SKBTX_ACK_TSTAMP and txstamp_ack are sort
>> > of redundant but I have not look into the details yet, so not completely
>> > sure.  It wwould be a separate cleanup patch if it is the case.
>>
>> Please read 6b084928baac562ed61866f540a96120e9c9ddb7 changelog ;)
>>
>> A cache line miss avoidance is critical
> I looked at the patch but I probably am missing something :(
> Is checking txstamp_ack alone enough and SKBTX_ACK_TSTAMP is not needed
> since they are always set together?

That's right, the check on "(shinfo->tx_flags & SKBTX_ACK_TSTAMP)" in
tcp_ack_tstamp() is redundant and I had a patch prepared to remove it.
But I thought it's better to wait for
https://patchwork.ozlabs.org/patch/611938/ to be merged first.

Feel free to remove it in your patches, if you'd prefer that.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 net-next 2/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_collapse_retrans
  2016-04-19 18:24           ` Soheil Hassas Yeganeh
@ 2016-04-21 20:25             ` Willem de Bruijn
  0 siblings, 0 replies; 28+ messages in thread
From: Willem de Bruijn @ 2016-04-21 20:25 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: Martin KaFai Lau, Eric Dumazet, netdev, Neal Cardwell,
	Soheil Hassas Yeganeh, Willem de Bruijn, Yuchung Cheng,
	Kernel Team

On Tue, Apr 19, 2016 at 2:24 PM, Soheil Hassas Yeganeh
<soheil@google.com> wrote:
> On Tue, Apr 19, 2016 at 2:18 PM, Martin KaFai Lau <kafai@fb.com> wrote:
>> On Tue, Apr 19, 2016 at 10:35:52AM -0700, Eric Dumazet wrote:
>>> On Tue, Apr 19, 2016 at 10:28 AM, Martin KaFai Lau <kafai@fb.com> wrote:
>>>
>>> > A bit off topic, I feel like the SKBTX_ACK_TSTAMP and txstamp_ack are sort
>>> > of redundant but I have not look into the details yet, so not completely
>>> > sure.  It wwould be a separate cleanup patch if it is the case.

Yes, with the introduction of txstamp_ack, SKBTX_ACK_TSTAMP is completely
redundant.

>>>
>>> Please read 6b084928baac562ed61866f540a96120e9c9ddb7 changelog ;)
>>>
>>> A cache line miss avoidance is critical
>> I looked at the patch but I probably am missing something :(
>> Is checking txstamp_ack alone enough and SKBTX_ACK_TSTAMP is not needed
>> since they are always set together?
>
> That's right, the check on "(shinfo->tx_flags & SKBTX_ACK_TSTAMP)" in
> tcp_ack_tstamp() is redundant and I had a patch prepared to remove it.

You can even remove the flag completely and

-               tcb->txstamp_ack = !!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
+               if (tsflags & SOF_TIMESTAMPING_TX_ACK)
+                       tcb->txstamp_ack = 1;

> But I thought it's better to wait for
> https://patchwork.ozlabs.org/patch/611938/ to be merged first.
>
> Feel free to remove it in your patches, if you'd prefer that.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2016-04-21 20:26 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-18 22:46 [RFC PATCH v2 net-next 0/7] tcp: Make use of MSG_EOR in tcp_sendmsg Martin KaFai Lau
2016-04-18 22:46 ` [RFC PATCH v2 net-next 1/7] tcp: Carry txstamp_ack in tcp_fragment_tstamp Martin KaFai Lau
2016-04-19  5:21   ` Soheil Hassas Yeganeh
2016-04-19 17:39     ` Martin KaFai Lau
2016-04-19 17:44       ` Soheil Hassas Yeganeh
2016-04-18 22:46 ` [RFC PATCH v2 net-next 2/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_collapse_retrans Martin KaFai Lau
2016-04-19  5:32   ` Soheil Hassas Yeganeh
2016-04-19 17:28     ` Martin KaFai Lau
2016-04-19 17:35       ` Eric Dumazet
2016-04-19 18:18         ` Martin KaFai Lau
2016-04-19 18:24           ` Soheil Hassas Yeganeh
2016-04-21 20:25             ` Willem de Bruijn
2016-04-19 17:42       ` Soheil Hassas Yeganeh
2016-04-18 22:46 ` [RFC PATCH v2 net-next 3/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_shifted_skb Martin KaFai Lau
2016-04-19  5:38   ` Soheil Hassas Yeganeh
2016-04-18 22:46 ` [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg Martin KaFai Lau
2016-04-18 23:18   ` Eric Dumazet
2016-04-18 23:43     ` kafai
2016-04-19  0:06       ` Eric Dumazet
2016-04-19  2:27         ` Martin KaFai Lau
2016-04-19  2:50           ` Eric Dumazet
2016-04-19  3:18             ` Martin KaFai Lau
2016-04-19  3:25               ` Eric Dumazet
2016-04-19  9:47     ` David Laight
2016-04-19 12:19       ` Eric Dumazet
2016-04-18 22:46 ` [RFC PATCH v2 net-next 5/7] tcp: Make use of MSG_EOR in tcp_sendpage Martin KaFai Lau
2016-04-18 22:46 ` [RFC PATCH v2 net-next 6/7] tcp: Carry eor_info in tcp_fragment_tstamp() and tcp_skb_collapse_tstamp() Martin KaFai Lau
2016-04-18 22:46 ` [RFC PATCH v2 net-next 7/7] tcp: Avoid losing eor_info when collapsing skbs Martin KaFai Lau

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.