[RFC PATCH v2 net-next] net: Preserve skb delivery time during forward

* [RFC PATCH v2 net-next] net: Preserve skb delivery time during forward
@ 2021-12-15 20:11 Martin KaFai Lau
  2021-12-15 22:07 ` Julian Anastasov
  2021-12-16 15:32 ` Willem de Bruijn
  0 siblings, 2 replies; 10+ messages in thread
From: Martin KaFai Lau @ 2021-12-15 20:11 UTC (permalink / raw)
  To: netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Eric Dumazet,
	Jakub Kicinski, kernel-team, Willem de Bruijn

The skb->skb_mstamp_ns is used as the EDT (Earliest Department Time)
in TCP.  skb->skb_mstamp_ns is a union member of skb->tstamp.

When the skb traveling veth and being forwarded like below, the skb->tstamp
is reset to 0 at multiple points.

                                                                        (c: skb->tstamp = 0)
                                                                         vv
tcp-sender => veth@netns => veth@hostns(b: rx: skb->tstamp = real_clock) => fq@eth0
                         ^^
                        (a: skb->tstamp = 0)

(a) veth@netns TX to veth@hostns:
    skb->tstamp (mono clock) is a EDT and it is in future time.
    Reset to 0 so that it won't skip the net_timestamp_check at the
    RX side in (b).
(b) RX (netif_rx) in veth@hostns:
    net_timestamp_check puts a current time (real clock) in skb->tstamp.
(c) veth@hostns forward to fq@eth0:
    skb->tstamp is reset back to 0 again because fq is using
    mono clock.

This leads to an unstable TCP throughput issue described by Daniel in [0].

We also have a use case that a bpf runs at ingress@veth@hostns
to set EDT in skb->tstamp to limit the bandwidth usage
of a particular netns.  This EDT currently also gets
reset in step (c) as described above.

Unlike RFC v1 trying to migrate rx tstamp to mono first,
this patch is to preserve the EDT in skb->skb_mstamp_ns during forward.

The idea is to temporarily store skb->skb_mstamp_ns during forward.
skb_shinfo(skb)->hwtstamps is used as a temporary store and
it is union-ed with the newly added "u64 tx_delivery_tstamp".
hwtstamps should only be used when a packet is received or
sent out of a hw device.

During forward, skb->tstamp will be temporarily stored in
skb_shinfo(skb)->tx_delivery_tstamp and a new bit
(SKBTX_DELIVERY_TSTAMP) in skb_shinfo(skb)->tx_flags
will also be set to tell tx_delivery_tstamp is in use.
hwtstamps is accessed through the skb_hwtstamps() getter,
so unlikely(tx_flags & SKBTX_DELIVERY_TSTAMP) can
be tested in there and reset tx_delivery_tstamp to 0
before hwtstamps is used.

After moving the skb->tstamp to skb_shinfo(skb)->tx_delivery_tstamp,
the skb->tstamp will still be reset to 0 during forward.  Thus,
on the RX side (__netif_receive_skb_core), all existing code paths
will still get the received time in real clock and will work as-is.

When this skb finally xmit-ing out in __dev_queue_xmit(),
it will check the SKBTX_DELIVERY_TSTAMP bit in skb_shinfo(skb)->tx_flags
and restore the skb->tstamp from skb_shinfo(skb)->tx_delivery_tstamp
if needed.  This bit test is done immediately after another existing
bit test 'skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP'.

Another bit SKBTX_DELIVERY_TSTAMP_ALLOW_FWD is added
to skb_shinfo(skb)->tx_flags.  It is used to specify
the skb->tstamp is set as a delivery time and can be
temporarily stored during forward.  This bit is now set
when EDT is stored in skb->skb_mstamp_ns in tcp_output.c
This will avoid packet received from a NIC with real-clock
in skb->tstamp being forwarded without reset.

The change in af_packet.c is to avoid it calling skb_hwtstamps()
which will reset the skb_shinfo(skb)->tx_delivery_tstamp.
af_packet.c only wants to read the hwtstamps instead of
storing a time in it, so a new read only getter skb_hwtstamps_ktime()
is added.  Otherwise, a tcpdump will trigger this code path
and unnecessarily reset the EDT stored in tx_delivery_tstamp.

[Note: not all skb->tstamp=0 reset has been changed in this RFC yet]

[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h  | 52 ++++++++++++++++++++++++++++++++++++++++-
 net/bridge/br_forward.c |  2 +-
 net/core/dev.c          |  1 +
 net/core/filter.c       |  6 ++---
 net/core/skbuff.c       |  2 +-
 net/ipv4/ip_forward.c   |  2 +-
 net/ipv4/tcp_output.c   | 21 +++++++++++------
 net/ipv6/ip6_output.c   |  2 +-
 net/packet/af_packet.c  |  8 +++----
 9 files changed, 77 insertions(+), 19 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6535294f6a48..9bf0a1e2a1bd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -435,9 +435,17 @@ enum {
 	/* device driver is going to provide hardware time stamp */
 	SKBTX_IN_PROGRESS = 1 << 2,
 
+	/* shinfo stores a future tx_delivery_tstamp instead of hwtstamps */
+	SKBTX_DELIVERY_TSTAMP = 1 << 3,
+
 	/* generate wifi status information (where possible) */
 	SKBTX_WIFI_STATUS = 1 << 4,
 
+	/* skb->tstamp stored a future delivery time which
+	 * was set by a local sk and it can be fowarded.
+	 */
+	SKBTX_DELIVERY_TSTAMP_ALLOW_FWD = 1 << 5,
+
 	/* generate software time stamp when entering packet scheduling */
 	SKBTX_SCHED_TSTAMP = 1 << 6,
 };
@@ -530,7 +538,14 @@ struct skb_shared_info {
 	/* Warning: this field is not always filled in (UFO)! */
 	unsigned short	gso_segs;
 	struct sk_buff	*frag_list;
-	struct skb_shared_hwtstamps hwtstamps;
+	union {
+		/* If SKBTX_DELIVERY_TSTAMP is set in tx_flags,
+		 * tx_delivery_tstamp is stored instead of
+		 * hwtstamps.
+		 */
+		struct skb_shared_hwtstamps hwtstamps;
+		u64 tx_delivery_tstamp;
+	};
 	unsigned int	gso_type;
 	u32		tskey;
 
@@ -1463,9 +1478,44 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
 
 static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
 {
+	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP)) {
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP;
+		skb_shinfo(skb)->tx_delivery_tstamp = 0;
+	}
 	return &skb_shinfo(skb)->hwtstamps;
 }
 
+/* Caller only needs to read the hwtstamps as ktime.
+ * To update hwtstamps,  HW device driver should call the writable
+ * version skb_hwtstamps() that returns a pointer.
+ */
+static inline ktime_t skb_hwtstamps_ktime(const struct sk_buff *skb)
+{
+	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP))
+		return 0;
+	return skb_shinfo(skb)->hwtstamps.hwtstamp;
+}
+
+static inline void skb_scrub_tstamp(struct sk_buff *skb)
+{
+	if (skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP_ALLOW_FWD) {
+		skb_shinfo(skb)->tx_delivery_tstamp = skb->tstamp;
+		skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP;
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
+	}
+	skb->tstamp = 0;
+}
+
+static inline void skb_restore_delivery_time(struct sk_buff *skb)
+{
+	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP)) {
+		skb->tstamp = skb_shinfo(skb)->tx_delivery_tstamp;
+		skb_shinfo(skb)->tx_delivery_tstamp = 0;
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP;
+		skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
+	}
+}
+
 static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
 {
 	bool is_zcopy = skb && skb_shinfo(skb)->flags & SKBFL_ZEROCOPY_ENABLE;
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index ec646656dbf1..a3ba6195f2e3 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -62,7 +62,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);
 
 int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 	return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
 		       net, sk, skb, NULL, skb->dev,
 		       br_dev_queue_push_xmit);
diff --git a/net/core/dev.c b/net/core/dev.c
index a855e41bbe39..e9e7de758cba 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4039,6 +4039,7 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 
 	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
 		__skb_tstamp_tx(skb, NULL, NULL, skb->sk, SCM_TSTAMP_SCHED);
+	skb_restore_delivery_time(skb);
 
 	/* Disable soft irqs for various locks below. Also
 	 * stops preemption for RCU.
diff --git a/net/core/filter.c b/net/core/filter.c
index e4cc3aff5bf7..7c3c6abc25a9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2107,7 +2107,7 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 
 	dev_xmit_recursion_inc();
 	ret = dev_queue_xmit(skb);
@@ -2176,7 +2176,7 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
 		skb = skb_expand_head(skb, hh_len);
@@ -2274,7 +2274,7 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb,
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
 		skb = skb_expand_head(skb, hh_len);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a33247fdb8f5..a53ca773afb5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5324,7 +5324,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 
 	ipvs_reset(skb);
 	skb->mark = 0;
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);
 
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 00ec819f949b..f216fc97c6ce 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -79,7 +79,7 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
 
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 	return dst_output(net, sk, skb);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5079832af5c1..d5524ea5fcfa 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1223,6 +1223,12 @@ INDIRECT_CALLABLE_DECLARE(int ip_queue_xmit(struct sock *sk, struct sk_buff *skb
 INDIRECT_CALLABLE_DECLARE(int inet6_csk_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl));
 INDIRECT_CALLABLE_DECLARE(void tcp_v4_send_check(struct sock *sk, struct sk_buff *skb));
 
+static void tcp_set_skb_edt(struct sk_buff *skb, u64 edt)
+{
+	skb->skb_mstamp_ns = edt;
+	skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
+}
+
 /* This routine actually transmits TCP packets queued in by
  * tcp_do_sendmsg().  This is used by both the initial
  * transmission and possible later retransmissions.
@@ -1253,7 +1259,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	tp = tcp_sk(sk);
 	prior_wstamp = tp->tcp_wstamp_ns;
 	tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
-	skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
+	tcp_set_skb_edt(skb, tp->tcp_wstamp_ns);
 	if (clone_it) {
 		oskb = skb;
 
@@ -1589,7 +1595,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 
 	skb_split(skb, buff, len);
 
-	buff->tstamp = skb->tstamp;
+	tcp_set_skb_edt(buff, skb->skb_mstamp_ns);
 	tcp_fragment_tstamp(skb, buff);
 
 	old_factor = tcp_skb_pcount(skb);
@@ -2616,7 +2622,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 		if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
 			/* "skb_mstamp_ns" is used as a start point for the retransmit timer */
-			skb->skb_mstamp_ns = tp->tcp_wstamp_ns = tp->tcp_clock_cache;
+			tp->tcp_wstamp_ns = tp->tcp_clock_cache;
+			tcp_set_skb_edt(skb, tp->tcp_wstamp_ns);
 			list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
 			tcp_init_tso_segs(skb, mss_now);
 			goto repair; /* Skip network transmission */
@@ -3541,11 +3548,11 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	now = tcp_clock_ns();
 #ifdef CONFIG_SYN_COOKIES
 	if (unlikely(synack_type == TCP_SYNACK_COOKIE && ireq->tstamp_ok))
-		skb->skb_mstamp_ns = cookie_init_timestamp(req, now);
+		tcp_set_skb_edt(skb, cookie_init_timestamp(req, now));
 	else
 #endif
 	{
-		skb->skb_mstamp_ns = now;
+		tcp_set_skb_edt(skb, now);
 		if (!tcp_rsk(req)->snt_synack) /* Timestamp first SYNACK */
 			tcp_rsk(req)->snt_synack = tcp_skb_timestamp_us(skb);
 	}
@@ -3594,7 +3601,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb,
 				synack_type, &opts);
 
-	skb->skb_mstamp_ns = now;
+	tcp_set_skb_edt(skb, now);
 	tcp_add_tx_delay(skb, tp);
 
 	return skb;
@@ -3771,7 +3778,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 
 	err = tcp_transmit_skb(sk, syn_data, 1, sk->sk_allocation);
 
-	syn->skb_mstamp_ns = syn_data->skb_mstamp_ns;
+	tcp_set_skb_edt(syn, syn_data->skb_mstamp_ns);
 
 	/* Now full SYN+DATA was cloned and sent (or not),
 	 * remove the SYN from the original skb (syn_data)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 2995f8d89e7e..f5436afdafc7 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -440,7 +440,7 @@ static inline int ip6_forward_finish(struct net *net, struct sock *sk,
 	}
 #endif
 
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 	return dst_output(net, sk, skb);
 }
 
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index a1ffdb48cc47..79a57836853a 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -451,12 +451,12 @@ static int __packet_get_status(const struct packet_sock *po, void *frame)
 static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
 				   unsigned int flags)
 {
-	struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
+	ktime_t hwtstamp = skb_hwtstamps_ktime(skb);
 
-	if (shhwtstamps &&
-	    (flags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
-	    ktime_to_timespec64_cond(shhwtstamps->hwtstamp, ts))
+	if (hwtstamp && (flags & SOF_TIMESTAMPING_RAW_HARDWARE)) {
+		*ts = ktime_to_timespec64(hwtstamp);
 		return TP_STATUS_TS_RAW_HARDWARE;
+	}
 
 	if ((flags & SOF_TIMESTAMPING_SOFTWARE) &&
 	    ktime_to_timespec64_cond(skb->tstamp, ts))
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread