* [RFC PATCH net-next 1/2] net: Use mono clock in RX timestamp
@ 2021-12-07  2:01 Martin KaFai Lau
  2021-12-07  2:01 ` [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space Martin KaFai Lau
  0 siblings, 1 reply; 18+ messages in thread
From: Martin KaFai Lau @ 2021-12-07  2:01 UTC (permalink / raw)
  To: netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Eric Dumazet,
	Jakub Kicinski, kernel-team, Willem de Bruijn

The TX skb->skb_mstamp_ns is used as the EDT (Earliest Departure Time)
in TCP.  skb->skb_mstamp_ns is a union member of skb->tstamp.
The EDT uses the mono clock.  The EDT is lost when the
packet is forwarded.  For example, the tcp sender is in its own
netns and its packet will eventually be sent out from the host
eth0 like this:

                                                                      (skb->tstamp = 0)
                                                                      vv
tcp-sender => veth@netns => veth@hostns(rx: skb->tstamp = real_clock) => fq@eth0
                         ^^
                         (skb->tstamp = 0)

veth@netns forward to veth@hostns:
  (a) skb->tstamp is reset to 0
RX in veth@hostns:
  (b) skb->tstamp is stamped with the real clock if skb->tstamp is 0.
      Thus, the earlier reset of skb->tstamp to 0 in (a) is needed.
veth@hostns forward to fq@eth0:
  (c) skb->tstamp is reset back to 0 again because fq is using
      mono clock.

Resetting the skb->skb_mstamp_ns marked by the tcp-sender@netns
leads to the unstable TCP throughput issue described by Daniel in [0].

We also have a use case where a bpf prog runs at ingress@veth@hostns
to set an EDT in skb->tstamp in order to limit the bandwidth usage
of a particular netns.  This EDT (set by the bpf@ingress@veth@hostns)
currently also gets reset in step (c) as described above.
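
For context, below is a minimal sketch of such an ingress program.  It is
illustration only and not part of this series; the rate constant, the
single-slot map and the lack of any locking are simplifying assumptions.

/* Hypothetical tc ingress program (illustration only): pace a netns by
 * pushing each packet's earliest departure time forward.
 */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define RATE_BYTES_PER_SEC	12500000ull	/* assumed ~100Mbit/s cap */

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} next_edt SEC(".maps");

SEC("tc")
int set_edt(struct __sk_buff *skb)
{
	__u64 now = bpf_ktime_get_ns();		/* CLOCK_MONOTONIC */
	__u64 delay = (__u64)skb->len * 1000000000ull / RATE_BYTES_PER_SEC;
	__u32 key = 0;
	__u64 *edt;

	edt = bpf_map_lookup_elem(&next_edt, &key);
	if (!edt)
		return TC_ACT_OK;
	if (*edt < now)
		*edt = now;
	skb->tstamp = *edt;	/* mono EDT, later honored by fq@eth0 */
	*edt += delay;		/* racy across CPUs; fine for a sketch */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";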

This patch uses the mono clock instead of the real clock in the
RX timestamp.  Before exposing the RX timestamp to
userspace, skb_get_ktime() is used to convert the mono
clock value back to the real clock.
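
Only to illustrate the userspace-visible behavior (this is the
long-standing SO_TIMESTAMP API, nothing new from this series): a receiver
like the sketch below keeps getting wall-clock values in the cmsg, since
the mono-to-real conversion happens in skb_get_ktime() before the
timestamp reaches userspace.

/* Minimal SO_TIMESTAMP reader, for illustration only.  Error handling
 * is omitted.
 */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/uio.h>

static void print_rx_tstamp(int fd)
{
	char data[2048], ctrl[CMSG_SPACE(sizeof(struct timeval))];
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = ctrl, .msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cm;
	int on = 1;

	setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));
	if (recvmsg(fd, &msg, 0) < 0)
		return;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level == SOL_SOCKET &&
		    cm->cmsg_type == SCM_TIMESTAMP) {
			struct timeval tv;

			memcpy(&tv, CMSG_DATA(cm), sizeof(tv));
			printf("rx at %ld.%06ld (CLOCK_REALTIME)\n",
			       (long)tv.tv_sec, (long)tv.tv_usec);
		}
	}
}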

When forwarding a packet, instead of always resetting skb->tstamp,
this patch only resets it when skb->tstamp was marked by
userspace (detected by testing sock_flag(skb->sk, SOCK_TXTIME)).
This is done in a new helper, skb_scrub_tstamp().

Not all RX real clock usages of skb->tstamp have been audited
and changed, hence the RFC.  The RX timestamp has been in real
clock for a long time and I have likely missed some of
the reasons, so it will be useful to get feedback
before converting the remaining RX usages.

[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h      | 23 +++++++++++------------
 include/linux/timekeeping.h |  5 +++++
 include/net/pkt_sched.h     |  2 +-
 include/net/sock.h          |  4 ++--
 include/net/tcp.h           |  8 ++++----
 net/bridge/br_forward.c     |  2 +-
 net/core/filter.c           |  6 +++---
 net/core/skbuff.c           | 12 ++++++++++--
 net/ipv4/ip_forward.c       |  2 +-
 net/ipv4/tcp.c              |  6 +++---
 net/ipv4/tcp_output.c       | 16 ++++++++--------
 net/ipv6/ip6_output.c       |  2 +-
 net/socket.c                |  2 +-
 net/sunrpc/svcsock.c        |  9 +++------
 14 files changed, 54 insertions(+), 45 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index dd262bd8ddbe..b609bdc5398b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -753,10 +753,8 @@ struct sk_buff {
 		int			ip_defrag_offset;
 	};
 
-	union {
-		ktime_t		tstamp;
-		u64		skb_mstamp_ns; /* earliest departure time */
-	};
+	ktime_t		tstamp;
+
 	/*
 	 * This is the control buffer. It is free to use for every
 	 * layer. Please put your private variables there. If you
@@ -3694,6 +3692,7 @@ int skb_zerocopy(struct sk_buff *to, struct sk_buff *from,
 		 int len, int hlen);
 void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len);
 int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen);
+void skb_scrub_tstamp(struct sk_buff *skb);
 void skb_scrub_packet(struct sk_buff *skb, bool xnet);
 bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu);
 bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len);
@@ -3810,7 +3809,7 @@ void skb_init(void);
 
 static inline ktime_t skb_get_ktime(const struct sk_buff *skb)
 {
-	return skb->tstamp;
+	return ktime_mono_to_real_cond(skb->tstamp);
 }
 
 /**
@@ -3825,13 +3824,13 @@ static inline ktime_t skb_get_ktime(const struct sk_buff *skb)
 static inline void skb_get_timestamp(const struct sk_buff *skb,
 				     struct __kernel_old_timeval *stamp)
 {
-	*stamp = ns_to_kernel_old_timeval(skb->tstamp);
+	*stamp = ns_to_kernel_old_timeval(skb_get_ktime(skb));
 }
 
 static inline void skb_get_new_timestamp(const struct sk_buff *skb,
 					 struct __kernel_sock_timeval *stamp)
 {
-	struct timespec64 ts = ktime_to_timespec64(skb->tstamp);
+	struct timespec64 ts = ktime_to_timespec64(skb_get_ktime(skb));
 
 	stamp->tv_sec = ts.tv_sec;
 	stamp->tv_usec = ts.tv_nsec / 1000;
@@ -3840,7 +3839,7 @@ static inline void skb_get_new_timestamp(const struct sk_buff *skb,
 static inline void skb_get_timestampns(const struct sk_buff *skb,
 				       struct __kernel_old_timespec *stamp)
 {
-	struct timespec64 ts = ktime_to_timespec64(skb->tstamp);
+	struct timespec64 ts = ktime_to_timespec64(skb_get_ktime(skb));
 
 	stamp->tv_sec = ts.tv_sec;
 	stamp->tv_nsec = ts.tv_nsec;
@@ -3849,7 +3848,7 @@ static inline void skb_get_timestampns(const struct sk_buff *skb,
 static inline void skb_get_new_timestampns(const struct sk_buff *skb,
 					   struct __kernel_timespec *stamp)
 {
-	struct timespec64 ts = ktime_to_timespec64(skb->tstamp);
+	struct timespec64 ts = ktime_to_timespec64(skb_get_ktime(skb));
 
 	stamp->tv_sec = ts.tv_sec;
 	stamp->tv_nsec = ts.tv_nsec;
@@ -3857,12 +3856,12 @@ static inline void skb_get_new_timestampns(const struct sk_buff *skb,
 
 static inline void __net_timestamp(struct sk_buff *skb)
 {
-	skb->tstamp = ktime_get_real();
+	skb->tstamp = ktime_get_ns();
 }
 
 static inline ktime_t net_timedelta(ktime_t t)
 {
-	return ktime_sub(ktime_get_real(), t);
+	return ktime_sub(ktime_get_ns(), t);
 }
 
 static inline ktime_t net_invalid_timestamp(void)
@@ -4726,7 +4725,7 @@ static inline void skb_set_redirected(struct sk_buff *skb, bool from_ingress)
 #ifdef CONFIG_NET_REDIRECT
 	skb->from_ingress = from_ingress;
 	if (skb->from_ingress)
-		skb->tstamp = 0;
+		skb_scrub_tstamp(skb);
 #endif
 }
 
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 78a98bdff76d..575938469b10 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -149,6 +149,11 @@ static inline ktime_t ktime_mono_to_real(ktime_t mono)
 	return ktime_mono_to_any(mono, TK_OFFS_REAL);
 }
 
+static inline ktime_t ktime_mono_to_real_cond(ktime_t mono)
+{
+	return mono ? ktime_mono_to_any(mono, TK_OFFS_REAL) : 0;
+}
+
 static inline u64 ktime_get_ns(void)
 {
 	return ktime_to_ns(ktime_get());
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index bf79f3a890af..fbe885150f00 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -184,7 +184,7 @@ struct tc_taprio_qopt_offload *taprio_offload_get(struct tc_taprio_qopt_offload
 						  *offload);
 void taprio_offload_free(struct tc_taprio_qopt_offload *offload);
 
-/* Ensure skb_mstamp_ns, which might have been populated with the txtime, is
+/* Ensure tstamp, which might have been populated with the txtime, is
  * not mistaken for a software timestamp, because this will otherwise prevent
  * the dispatch of hardware timestamps to the socket.
  */
diff --git a/include/net/sock.h b/include/net/sock.h
index ae61cd0b650d..96f70087ae7a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2603,7 +2603,7 @@ void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
 static inline void
 sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
 {
-	ktime_t kt = skb->tstamp;
+	ktime_t kt = skb_get_ktime(skb);
 	struct skb_shared_hwtstamps *hwtstamps = skb_hwtstamps(skb);
 
 	/*
@@ -2640,7 +2640,7 @@ static inline void sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
 	if (sk->sk_flags & FLAGS_TS_OR_DROPS || sk->sk_tsflags & TSFLAGS_ANY)
 		__sock_recv_ts_and_drops(msg, sk, skb);
 	else if (unlikely(sock_flag(sk, SOCK_TIMESTAMP)))
-		sock_write_timestamp(sk, skb->tstamp);
+		sock_write_timestamp(sk, skb_get_ktime(skb));
 	else if (unlikely(sk->sk_stamp == SK_DEFAULT_STAMP))
 		sock_write_timestamp(sk, 0);
 }
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 44e442bf23f9..5f5c5910a985 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -810,13 +810,13 @@ static inline u32 tcp_stamp_us_delta(u64 t1, u64 t0)
 
 static inline u32 tcp_skb_timestamp(const struct sk_buff *skb)
 {
-	return tcp_ns_to_ts(skb->skb_mstamp_ns);
+	return tcp_ns_to_ts(skb->tstamp);
 }
 
 /* provide the departure time in us unit */
 static inline u64 tcp_skb_timestamp_us(const struct sk_buff *skb)
 {
-	return div_u64(skb->skb_mstamp_ns, NSEC_PER_USEC);
+	return div_u64(skb->tstamp, NSEC_PER_USEC);
 }
 
 
@@ -862,7 +862,7 @@ struct tcp_skb_cb {
 #define TCPCB_SACKED_RETRANS	0x02	/* SKB retransmitted		*/
 #define TCPCB_LOST		0x04	/* SKB is lost			*/
 #define TCPCB_TAGBITS		0x07	/* All tag bits			*/
-#define TCPCB_REPAIRED		0x10	/* SKB repaired (no skb_mstamp_ns)	*/
+#define TCPCB_REPAIRED		0x10	/* SKB repaired (no tstamp)	*/
 #define TCPCB_EVER_RETRANS	0x80	/* Ever retransmitted frame	*/
 #define TCPCB_RETRANS		(TCPCB_SACKED_RETRANS|TCPCB_EVER_RETRANS| \
 				TCPCB_REPAIRED)
@@ -2395,7 +2395,7 @@ static inline void tcp_add_tx_delay(struct sk_buff *skb,
 				    const struct tcp_sock *tp)
 {
 	if (static_branch_unlikely(&tcp_tx_delay_enabled))
-		skb->skb_mstamp_ns += (u64)tp->tcp_tx_delay * NSEC_PER_USEC;
+		skb->tstamp += (u64)tp->tcp_tx_delay * NSEC_PER_USEC;
 }
 
 /* Compute Earliest Departure Time for some control packets
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index ec646656dbf1..a3ba6195f2e3 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -62,7 +62,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);
 
 int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 	return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
 		       net, sk, skb, NULL, skb->dev,
 		       br_dev_queue_push_xmit);
diff --git a/net/core/filter.c b/net/core/filter.c
index 8271624a19aa..17bf90f49c64 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2108,7 +2108,7 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 
 	dev_xmit_recursion_inc();
 	ret = dev_queue_xmit(skb);
@@ -2177,7 +2177,7 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
 		skb = skb_expand_head(skb, hh_len);
@@ -2275,7 +2275,7 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb,
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
 		skb = skb_expand_head(skb, hh_len);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a33247fdb8f5..f091c7807a9e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4794,7 +4794,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 	if (hwtstamps)
 		*skb_hwtstamps(skb) = *hwtstamps;
 	else
-		skb->tstamp = ktime_get_real();
+		__net_timestamp(skb);
 
 	__skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
 }
@@ -5291,6 +5291,14 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 }
 EXPORT_SYMBOL(skb_try_coalesce);
 
+void skb_scrub_tstamp(struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+
+	if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME))
+		skb->tstamp = 0;
+}
+
 /**
  * skb_scrub_packet - scrub an skb
  *
@@ -5324,7 +5332,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 
 	ipvs_reset(skb);
 	skb->mark = 0;
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);
 
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 00ec819f949b..f216fc97c6ce 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -79,7 +79,7 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
 
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 	return dst_output(net, sk, skb);
 }
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6ab82e1a1d41..6b5feadc766e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1289,7 +1289,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			copy = size_goal;
 
 			/* All packets are restored as if they have
-			 * already been sent. skb_mstamp_ns isn't set to
+			 * already been sent. skb->tstamp isn't set to
 			 * avoid wrong rtt estimation.
 			 */
 			if (tp->repair)
@@ -1754,7 +1754,7 @@ void tcp_update_recv_tstamps(struct sk_buff *skb,
 			     struct scm_timestamping_internal *tss)
 {
 	if (skb->tstamp)
-		tss->ts[0] = ktime_to_timespec64(skb->tstamp);
+		tss->ts[0] = ktime_to_timespec64(skb_get_ktime(skb));
 	else
 		tss->ts[0] = (struct timespec64) {0};
 
@@ -3928,7 +3928,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk,
 	nla_put_u16(stats, TCP_NLA_TIMEOUT_REHASH, tp->timeout_rehash);
 	nla_put_u32(stats, TCP_NLA_BYTES_NOTSENT,
 		    max_t(int, 0, tp->write_seq - tp->snd_nxt));
-	nla_put_u64_64bit(stats, TCP_NLA_EDT, orig_skb->skb_mstamp_ns,
+	nla_put_u64_64bit(stats, TCP_NLA_EDT, orig_skb->tstamp,
 			  TCP_NLA_PAD);
 	if (ack_skb)
 		nla_put_u8(stats, TCP_NLA_TTL,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5079832af5c1..5da076f4cf84 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1253,7 +1253,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	tp = tcp_sk(sk);
 	prior_wstamp = tp->tcp_wstamp_ns;
 	tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
-	skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
+	skb->tstamp = tp->tcp_wstamp_ns;
 	if (clone_it) {
 		oskb = skb;
 
@@ -1391,7 +1391,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	skb_shinfo(skb)->gso_segs = tcp_skb_pcount(skb);
 	skb_shinfo(skb)->gso_size = tcp_skb_mss(skb);
 
-	/* Leave earliest departure time in skb->tstamp (skb->skb_mstamp_ns) */
+	/* Leave earliest departure time in skb->tstamp */
 
 	/* Cleanup our debris for IP stacks */
 	memset(skb->cb, 0, max(sizeof(struct inet_skb_parm),
@@ -2615,8 +2615,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		unsigned int limit;
 
 		if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
-			/* "skb_mstamp_ns" is used as a start point for the retransmit timer */
-			skb->skb_mstamp_ns = tp->tcp_wstamp_ns = tp->tcp_clock_cache;
+			/* "skb->tstamp" is used as a start point for the retransmit timer */
+			skb->tstamp = tp->tcp_wstamp_ns = tp->tcp_clock_cache;
 			list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
 			tcp_init_tso_segs(skb, mss_now);
 			goto repair; /* Skip network transmission */
@@ -3541,11 +3541,11 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	now = tcp_clock_ns();
 #ifdef CONFIG_SYN_COOKIES
 	if (unlikely(synack_type == TCP_SYNACK_COOKIE && ireq->tstamp_ok))
-		skb->skb_mstamp_ns = cookie_init_timestamp(req, now);
+		skb->tstamp = cookie_init_timestamp(req, now);
 	else
 #endif
 	{
-		skb->skb_mstamp_ns = now;
+		skb->tstamp = now;
 		if (!tcp_rsk(req)->snt_synack) /* Timestamp first SYNACK */
 			tcp_rsk(req)->snt_synack = tcp_skb_timestamp_us(skb);
 	}
@@ -3594,7 +3594,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb,
 				synack_type, &opts);
 
-	skb->skb_mstamp_ns = now;
+	skb->tstamp = now;
 	tcp_add_tx_delay(skb, tp);
 
 	return skb;
@@ -3771,7 +3771,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 
 	err = tcp_transmit_skb(sk, syn_data, 1, sk->sk_allocation);
 
-	syn->skb_mstamp_ns = syn_data->skb_mstamp_ns;
+	syn->tstamp = syn_data->tstamp;
 
 	/* Now full SYN+DATA was cloned and sent (or not),
 	 * remove the SYN from the original skb (syn_data)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 2995f8d89e7e..f5436afdafc7 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -440,7 +440,7 @@ static inline int ip6_forward_finish(struct net *net, struct sock *sk,
 	}
 #endif
 
-	skb->tstamp = 0;
+	skb_scrub_tstamp(skb);
 	return dst_output(net, sk, skb);
 }
 
diff --git a/net/socket.c b/net/socket.c
index 7f64a6eccf63..6f3f54746737 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -871,7 +871,7 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 
 	memset(&tss, 0, sizeof(tss));
 	if ((sk->sk_tsflags & SOF_TIMESTAMPING_SOFTWARE) &&
-	    ktime_to_timespec64_cond(skb->tstamp, tss.ts + 0))
+	    ktime_to_timespec64_cond(skb_get_ktime(skb), tss.ts + 0))
 		empty = 0;
 	if (shhwtstamps &&
 	    (sk->sk_tsflags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 478f857cdaed..32f6935d3f5f 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -470,12 +470,9 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
 
 	len = svc_addr_len(svc_addr(rqstp));
 	rqstp->rq_addrlen = len;
-	if (skb->tstamp == 0) {
-		skb->tstamp = ktime_get_real();
-		/* Don't enable netstamp, sunrpc doesn't
-		   need that much accuracy */
-	}
-	sock_write_timestamp(svsk->sk_sk, skb->tstamp);
+	if (skb->tstamp == 0)
+		__net_timestamp(skb);
+	sock_write_timestamp(svsk->sk_sk, skb_get_ktime(skb));
 	set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); /* there may be more data... */
 
 	len = skb->len;
-- 
2.30.2



* [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-07  2:01 [RFC PATCH net-next 1/2] net: Use mono clock in RX timestamp Martin KaFai Lau
@ 2021-12-07  2:01 ` Martin KaFai Lau
  2021-12-07 14:27   ` Willem de Bruijn
  0 siblings, 1 reply; 18+ messages in thread
From: Martin KaFai Lau @ 2021-12-07  2:01 UTC (permalink / raw)
  To: netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Eric Dumazet,
	Jakub Kicinski, kernel-team, Willem de Bruijn

The skb->tstamp may be set by a local sk (as a sender in tcp) which is
then forwarded and delivered to another sk (as a receiver).

An example:
    sender-sk => veth@netns =====> veth@host => receiver-sk
                             ^^^
                         __dev_forward_skb

The skb->tstamp is marked with a future TX time.  This future
skb->tstamp will confuse the receiver-sk.

This patch marks the skb when its skb->tstamp has been forwarded.
Before using skb->tstamp as a rx timestamp, it needs
to be re-stamped to avoid reporting a future time.  This is
done in the RX timestamp reading helper skb_get_ktime().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h | 14 +++++++++-----
 net/core/dev.c         |  4 +++-
 net/core/skbuff.c      |  6 +++++-
 3 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b609bdc5398b..bc4ae34c4e22 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -867,6 +867,7 @@ struct sk_buff {
 	__u8			decrypted:1;
 #endif
 	__u8			slow_gro:1;
+	__u8			fwd_tstamp:1;
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
@@ -3806,9 +3807,12 @@ static inline void skb_copy_to_linear_data_offset(struct sk_buff *skb,
 }
 
 void skb_init(void);
+void net_timestamp_set(struct sk_buff *skb);
 
-static inline ktime_t skb_get_ktime(const struct sk_buff *skb)
+static inline ktime_t skb_get_ktime(struct sk_buff *skb)
 {
+	if (unlikely(skb->fwd_tstamp))
+		net_timestamp_set(skb);
 	return ktime_mono_to_real_cond(skb->tstamp);
 }
 
@@ -3821,13 +3825,13 @@ static inline ktime_t skb_get_ktime(const struct sk_buff *skb)
  *	This function converts the offset back to a struct timeval and stores
  *	it in stamp.
  */
-static inline void skb_get_timestamp(const struct sk_buff *skb,
+static inline void skb_get_timestamp(struct sk_buff *skb,
 				     struct __kernel_old_timeval *stamp)
 {
 	*stamp = ns_to_kernel_old_timeval(skb_get_ktime(skb));
 }
 
-static inline void skb_get_new_timestamp(const struct sk_buff *skb,
+static inline void skb_get_new_timestamp(struct sk_buff *skb,
 					 struct __kernel_sock_timeval *stamp)
 {
 	struct timespec64 ts = ktime_to_timespec64(skb_get_ktime(skb));
@@ -3836,7 +3840,7 @@ static inline void skb_get_new_timestamp(const struct sk_buff *skb,
 	stamp->tv_usec = ts.tv_nsec / 1000;
 }
 
-static inline void skb_get_timestampns(const struct sk_buff *skb,
+static inline void skb_get_timestampns(struct sk_buff *skb,
 				       struct __kernel_old_timespec *stamp)
 {
 	struct timespec64 ts = ktime_to_timespec64(skb_get_ktime(skb));
@@ -3845,7 +3849,7 @@ static inline void skb_get_timestampns(const struct sk_buff *skb,
 	stamp->tv_nsec = ts.tv_nsec;
 }
 
-static inline void skb_get_new_timestampns(const struct sk_buff *skb,
+static inline void skb_get_new_timestampns(struct sk_buff *skb,
 					   struct __kernel_timespec *stamp)
 {
 	struct timespec64 ts = ktime_to_timespec64(skb_get_ktime(skb));
diff --git a/net/core/dev.c b/net/core/dev.c
index 4420086f3aeb..96cd31d9a359 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2058,12 +2058,14 @@ void net_disable_timestamp(void)
 }
 EXPORT_SYMBOL(net_disable_timestamp);
 
-static inline void net_timestamp_set(struct sk_buff *skb)
+void net_timestamp_set(struct sk_buff *skb)
 {
 	skb->tstamp = 0;
+	skb->fwd_tstamp = 0;
 	if (static_branch_unlikely(&netstamp_needed_key))
 		__net_timestamp(skb);
 }
+EXPORT_SYMBOL(net_timestamp_set);
 
 #define net_timestamp_check(COND, SKB)				\
 	if (static_branch_unlikely(&netstamp_needed_key)) {	\
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f091c7807a9e..181ddc989ead 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5295,8 +5295,12 @@ void skb_scrub_tstamp(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
 
-	if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME))
+	if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME)) {
 		skb->tstamp = 0;
+		skb->fwd_tstamp = 0;
+	} else if (skb->tstamp) {
+		skb->fwd_tstamp = 1;
+	}
 }
 
 /**
-- 
2.30.2



* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-07  2:01 ` [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space Martin KaFai Lau
@ 2021-12-07 14:27   ` Willem de Bruijn
  2021-12-07 21:48     ` Daniel Borkmann
  2021-12-08  0:07     ` Martin KaFai Lau
  0 siblings, 2 replies; 18+ messages in thread
From: Willem de Bruijn @ 2021-12-07 14:27 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team

On Mon, Dec 6, 2021 at 9:01 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> The skb->tstamp may be set by a local sk (as a sender in tcp) which then
> forwarded and delivered to another sk (as a receiver).
>
> An example:
>     sender-sk => veth@netns =====> veth@host => receiver-sk
>                              ^^^
>                         __dev_forward_skb
>
> The skb->tstamp is marked with a future TX time.  This future
> skb->tstamp will confuse the receiver-sk.
>
> This patch marks the skb if the skb->tstamp is forwarded.
> Before using the skb->tstamp as a rx timestamp, it needs
> to be re-stamped to avoid getting a future time.  It is
> done in the RX timestamp reading helper skb_get_ktime().
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/skbuff.h | 14 +++++++++-----
>  net/core/dev.c         |  4 +++-
>  net/core/skbuff.c      |  6 +++++-
>  3 files changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index b609bdc5398b..bc4ae34c4e22 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -867,6 +867,7 @@ struct sk_buff {
>         __u8                    decrypted:1;
>  #endif
>         __u8                    slow_gro:1;
> +       __u8                    fwd_tstamp:1;
>
>  #ifdef CONFIG_NET_SCHED
>         __u16                   tc_index;       /* traffic control index */
> @@ -3806,9 +3807,12 @@ static inline void skb_copy_to_linear_data_offset(struct sk_buff *skb,
>  }
>
>  void skb_init(void);
> +void net_timestamp_set(struct sk_buff *skb);
>
> -static inline ktime_t skb_get_ktime(const struct sk_buff *skb)
> +static inline ktime_t skb_get_ktime(struct sk_buff *skb)
>  {
> +       if (unlikely(skb->fwd_tstamp))
> +               net_timestamp_set(skb);
>         return ktime_mono_to_real_cond(skb->tstamp);

This changes timestamp behavior for existing applications, probably
worth mentioning in the commit message if nothing else. A timestamp
taken at the time of the recv syscall is not very useful.

If a forwarded timestamp is not a future delivery time (as those are
scrubbed), is it not correct to just deliver the original timestamp?
It probably was taken at some earlier __netif_receive_skb_core.

>  }
>
> -static inline void net_timestamp_set(struct sk_buff *skb)
> +void net_timestamp_set(struct sk_buff *skb)
>  {
>         skb->tstamp = 0;
> +       skb->fwd_tstamp = 0;
>         if (static_branch_unlikely(&netstamp_needed_key))
>                 __net_timestamp(skb);
>  }
> +EXPORT_SYMBOL(net_timestamp_set);
>
>  #define net_timestamp_check(COND, SKB)                         \
>         if (static_branch_unlikely(&netstamp_needed_key)) {     \
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f091c7807a9e..181ddc989ead 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5295,8 +5295,12 @@ void skb_scrub_tstamp(struct sk_buff *skb)
>  {
>         struct sock *sk = skb->sk;
>
> -       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME))
> +       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME)) {

There is a slight race here with the socket flipping the feature on/off.

>
>                 skb->tstamp = 0;
> +               skb->fwd_tstamp = 0;
> +       } else if (skb->tstamp) {
> +               skb->fwd_tstamp = 1;
> +       }

SO_TXTIME future delivery times are scrubbed, but TCP future delivery
times are not?

If adding a bit, might it be simpler to add a bit tstamp_is_edt, and
scrub based on that. That is also not open to the above race.


* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-07 14:27   ` Willem de Bruijn
@ 2021-12-07 21:48     ` Daniel Borkmann
  2021-12-07 23:06       ` Daniel Borkmann
  2021-12-08  8:18       ` Martin KaFai Lau
  2021-12-08  0:07     ` Martin KaFai Lau
  1 sibling, 2 replies; 18+ messages in thread
From: Daniel Borkmann @ 2021-12-07 21:48 UTC (permalink / raw)
  To: Willem de Bruijn, Martin KaFai Lau
  Cc: netdev, Alexei Starovoitov, David Miller, Eric Dumazet,
	Jakub Kicinski, kernel-team

On 12/7/21 3:27 PM, Willem de Bruijn wrote:
> On Mon, Dec 6, 2021 at 9:01 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>
>> The skb->tstamp may be set by a local sk (as a sender in tcp) which then
>> forwarded and delivered to another sk (as a receiver).
>>
>> An example:
>>      sender-sk => veth@netns =====> veth@host => receiver-sk
>>                               ^^^
>>                          __dev_forward_skb
>>
>> The skb->tstamp is marked with a future TX time.  This future
>> skb->tstamp will confuse the receiver-sk.
>>
>> This patch marks the skb if the skb->tstamp is forwarded.
>> Before using the skb->tstamp as a rx timestamp, it needs
>> to be re-stamped to avoid getting a future time.  It is
>> done in the RX timestamp reading helper skb_get_ktime().
>>
>> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
>> ---
>>   include/linux/skbuff.h | 14 +++++++++-----
>>   net/core/dev.c         |  4 +++-
>>   net/core/skbuff.c      |  6 +++++-
>>   3 files changed, 17 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index b609bdc5398b..bc4ae34c4e22 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -867,6 +867,7 @@ struct sk_buff {
>>          __u8                    decrypted:1;
>>   #endif
>>          __u8                    slow_gro:1;
>> +       __u8                    fwd_tstamp:1;
>>
>>   #ifdef CONFIG_NET_SCHED
>>          __u16                   tc_index;       /* traffic control index */
>> @@ -3806,9 +3807,12 @@ static inline void skb_copy_to_linear_data_offset(struct sk_buff *skb,
>>   }
>>
>>   void skb_init(void);
>> +void net_timestamp_set(struct sk_buff *skb);
>>
>> -static inline ktime_t skb_get_ktime(const struct sk_buff *skb)
>> +static inline ktime_t skb_get_ktime(struct sk_buff *skb)
>>   {
>> +       if (unlikely(skb->fwd_tstamp))
>> +               net_timestamp_set(skb);
>>          return ktime_mono_to_real_cond(skb->tstamp);
> 
> This changes timestamp behavior for existing applications, probably
> worth mentioning in the commit message if nothing else. A timestamp
> taking at the time of the recv syscall is not very useful.
> 
> If a forwarded timestamp is not a future delivery time (as those are
> scrubbed), is it not correct to just deliver the original timestamp?
> It probably was taken at some earlier __netif_receive_skb_core.
> 
>>   }
>>
>> -static inline void net_timestamp_set(struct sk_buff *skb)
>> +void net_timestamp_set(struct sk_buff *skb)
>>   {
>>          skb->tstamp = 0;
>> +       skb->fwd_tstamp = 0;
>>          if (static_branch_unlikely(&netstamp_needed_key))
>>                  __net_timestamp(skb);
>>   }
>> +EXPORT_SYMBOL(net_timestamp_set);
>>
>>   #define net_timestamp_check(COND, SKB)                         \
>>          if (static_branch_unlikely(&netstamp_needed_key)) {     \
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index f091c7807a9e..181ddc989ead 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -5295,8 +5295,12 @@ void skb_scrub_tstamp(struct sk_buff *skb)
>>   {
>>          struct sock *sk = skb->sk;
>>
>> -       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME))
>> +       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME)) {
> 
> There is a slight race here with the socket flipping the feature on/off.
> 
>>                  skb->tstamp = 0;
>> +               skb->fwd_tstamp = 0;
>> +       } else if (skb->tstamp) {
>> +               skb->fwd_tstamp = 1;
>> +       }
> 
> SO_TXTIME future delivery times are scrubbed, but TCP future delivery
> times are not?
> 
> If adding a bit, might it be simpler to add a bit tstamp_is_edt, and
> scrub based on that. That is also not open to the above race.

One other thing I wonder: BPF progs at a host-facing veth's tc ingress which
are not aware of skb->tstamp will then see a tstamp from the future, given we
intentionally bypass the net_timestamp_check(), and might get confused (or
would confuse higher-layer application logic)? Not quite sure yet if they
would be the only affected user.

With regards to the open question on the mono clock and time namespaces (which
cover mono + boottime offsets), it looks like it is not an issue as they
only affect syscall-facing APIs.

Thanks,
Daniel


* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-07 21:48     ` Daniel Borkmann
@ 2021-12-07 23:06       ` Daniel Borkmann
  2021-12-08  8:18       ` Martin KaFai Lau
  1 sibling, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2021-12-07 23:06 UTC (permalink / raw)
  To: Willem de Bruijn, Martin KaFai Lau
  Cc: netdev, Alexei Starovoitov, David Miller, Eric Dumazet,
	Jakub Kicinski, kernel-team

On 12/7/21 10:48 PM, Daniel Borkmann wrote:
> On 12/7/21 3:27 PM, Willem de Bruijn wrote:
>> On Mon, Dec 6, 2021 at 9:01 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>>
>>> The skb->tstamp may be set by a local sk (as a sender in tcp) which then
>>> forwarded and delivered to another sk (as a receiver).
>>>
>>> An example:
>>>      sender-sk => veth@netns =====> veth@host => receiver-sk
>>>                               ^^^
>>>                          __dev_forward_skb
>>>
>>> The skb->tstamp is marked with a future TX time.  This future
>>> skb->tstamp will confuse the receiver-sk.
>>>
>>> This patch marks the skb if the skb->tstamp is forwarded.
>>> Before using the skb->tstamp as a rx timestamp, it needs
>>> to be re-stamped to avoid getting a future time.  It is
>>> done in the RX timestamp reading helper skb_get_ktime().
>>>
>>> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
>>> ---
>>>   include/linux/skbuff.h | 14 +++++++++-----
>>>   net/core/dev.c         |  4 +++-
>>>   net/core/skbuff.c      |  6 +++++-
>>>   3 files changed, 17 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>> index b609bdc5398b..bc4ae34c4e22 100644
>>> --- a/include/linux/skbuff.h
>>> +++ b/include/linux/skbuff.h
>>> @@ -867,6 +867,7 @@ struct sk_buff {
>>>          __u8                    decrypted:1;
>>>   #endif
>>>          __u8                    slow_gro:1;
>>> +       __u8                    fwd_tstamp:1;
>>>
>>>   #ifdef CONFIG_NET_SCHED
>>>          __u16                   tc_index;       /* traffic control index */
>>> @@ -3806,9 +3807,12 @@ static inline void skb_copy_to_linear_data_offset(struct sk_buff *skb,
>>>   }
>>>
>>>   void skb_init(void);
>>> +void net_timestamp_set(struct sk_buff *skb);
>>>
>>> -static inline ktime_t skb_get_ktime(const struct sk_buff *skb)
>>> +static inline ktime_t skb_get_ktime(struct sk_buff *skb)
>>>   {
>>> +       if (unlikely(skb->fwd_tstamp))
>>> +               net_timestamp_set(skb);
>>>          return ktime_mono_to_real_cond(skb->tstamp);
>>
>> This changes timestamp behavior for existing applications, probably
>> worth mentioning in the commit message if nothing else. A timestamp
>> taking at the time of the recv syscall is not very useful.
>>
>> If a forwarded timestamp is not a future delivery time (as those are
>> scrubbed), is it not correct to just deliver the original timestamp?
>> It probably was taken at some earlier __netif_receive_skb_core.
>>
>>>   }
>>>
>>> -static inline void net_timestamp_set(struct sk_buff *skb)
>>> +void net_timestamp_set(struct sk_buff *skb)
>>>   {
>>>          skb->tstamp = 0;
>>> +       skb->fwd_tstamp = 0;
>>>          if (static_branch_unlikely(&netstamp_needed_key))
>>>                  __net_timestamp(skb);
>>>   }
>>> +EXPORT_SYMBOL(net_timestamp_set);
>>>
>>>   #define net_timestamp_check(COND, SKB)                         \
>>>          if (static_branch_unlikely(&netstamp_needed_key)) {     \
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index f091c7807a9e..181ddc989ead 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -5295,8 +5295,12 @@ void skb_scrub_tstamp(struct sk_buff *skb)
>>>   {
>>>          struct sock *sk = skb->sk;
>>>
>>> -       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME))
>>> +       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME)) {
>>
>> There is a slight race here with the socket flipping the feature on/off.
>>
>>>                  skb->tstamp = 0;
>>> +               skb->fwd_tstamp = 0;
>>> +       } else if (skb->tstamp) {
>>> +               skb->fwd_tstamp = 1;
>>> +       }
>>
>> SO_TXTIME future delivery times are scrubbed, but TCP future delivery
>> times are not?
>>
>> If adding a bit, might it be simpler to add a bit tstamp_is_edt, and
>> scrub based on that. That is also not open to the above race.
> 
> One other thing I wonder, BPF progs at host-facing veth's tc ingress which
> are not aware of skb->tstamp will then see a tstamp from future given we
> intentionally bypass the net_timestamp_check() and might get confused (or
> would confuse higher-layer application logic)? Not quite sure yet if they
> would be the only affected user.

Untested (& unoptimized wrt netdev cachelines), but worst case maybe something
like this could work ... not generic, but smaller risk wrt timestamp behavior
changes for applications when pushing up the stack (?).

Meaning, the attribute would be set for host-facing veths and the phys dev with
sch_fq. Not generic unfortunately, given this would require cooperation from the
BPF side on tc ingress of the host veths: since the net_timestamp_check()
is skipped, the prog would have to call net_timestamp_set() via a BPF helper if it
decides to return with TC_ACT_OK. (So the orchestrator would opt-in(/out) by setting
xnet_flush_tstamp to 0 on the devs it manages and updating the tstamp via the helper.. hm)

  include/linux/netdevice.h |  4 +++-
  include/linux/skbuff.h    |  6 +++++-
  net/core/dev.c            |  1 +
  net/core/filter.c         |  9 ++++++---
  net/core/net-sysfs.c      | 18 ++++++++++++++++++
  net/core/skbuff.c         | 15 +++++++++------
  6 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec42495a43a..df9141f92bbf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2172,6 +2172,7 @@ struct net_device {
  	struct timer_list	watchdog_timer;
  	int			watchdog_timeo;

+	u32			xnet_flush_tstamp;
  	u32                     proto_down_reason;

  	struct list_head	todo_list;
@@ -4137,7 +4138,8 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
  		return NET_RX_DROP;
  	}

-	skb_scrub_packet(skb, !net_eq(dev_net(dev), dev_net(skb->dev)));
+	__skb_scrub_packet(skb, !net_eq(dev_net(dev), dev_net(skb->dev)),
+			   READ_ONCE(dev->xnet_flush_tstamp));
  	skb->priority = 0;
  	return 0;
  }
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 686a666d073d..09b670bcd7fd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3688,7 +3688,11 @@ int skb_zerocopy(struct sk_buff *to, struct sk_buff *from,
  		 int len, int hlen);
  void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len);
  int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen);
-void skb_scrub_packet(struct sk_buff *skb, bool xnet);
+void __skb_scrub_packet(struct sk_buff *skb, bool xnet, bool tstamp);
+static __always_inline void skb_scrub_packet(struct sk_buff *skb, bool xnet)
+{
+	__skb_scrub_packet(skb, xnet, true);
+}
  bool skb_gso_validate_network_len(const struct sk_buff *skb, unsigned int mtu);
  bool skb_gso_validate_mac_len(const struct sk_buff *skb, unsigned int len);
  struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features);
diff --git a/net/core/dev.c b/net/core/dev.c
index 15ac064b5562..1678032bd5a3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10853,6 +10853,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
  	dev->gso_max_segs = GSO_MAX_SEGS;
  	dev->upper_level = 1;
  	dev->lower_level = 1;
+	dev->xnet_flush_tstamp = 1;
  #ifdef CONFIG_LOCKDEP
  	dev->nested_level = 0;
  	INIT_LIST_HEAD(&dev->unlink_list);
diff --git a/net/core/filter.c b/net/core/filter.c
index fe27c91e3758..69366af42141 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2107,7 +2107,8 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
  	}

  	skb->dev = dev;
-	skb->tstamp = 0;
+	if (READ_ONCE(dev->xnet_flush_tstamp))
+		skb->tstamp = 0;

  	dev_xmit_recursion_inc();
  	ret = dev_queue_xmit(skb);
@@ -2176,7 +2177,8 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
  	}

  	skb->dev = dev;
-	skb->tstamp = 0;
+	if (READ_ONCE(dev->xnet_flush_tstamp))
+		skb->tstamp = 0;

  	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
  		skb = skb_expand_head(skb, hh_len);
@@ -2274,7 +2276,8 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb,
  	}

  	skb->dev = dev;
-	skb->tstamp = 0;
+	if (READ_ONCE(dev->xnet_flush_tstamp))
+		skb->tstamp = 0;

  	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
  		skb = skb_expand_head(skb, hh_len);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 9c01c642cf9e..d8ad9dbbbf55 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -403,6 +403,23 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
  }
  NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);

+static int change_xnet_flush_tstamp(struct net_device *dev, unsigned long val)
+{
+	WRITE_ONCE(dev->xnet_flush_tstamp, val);
+	return 0;
+}
+
+static ssize_t xnet_flush_tstamp_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t len)
+{
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	return netdev_store(dev, attr, buf, len, change_xnet_flush_tstamp);
+}
+NETDEVICE_SHOW_RW(xnet_flush_tstamp, fmt_dec);
+
  static int change_napi_defer_hard_irqs(struct net_device *dev, unsigned long val)
  {
  	WRITE_ONCE(dev->napi_defer_hard_irqs, val);
@@ -651,6 +668,7 @@ static struct attribute *net_class_attrs[] __ro_after_init = {
  	&dev_attr_flags.attr,
  	&dev_attr_tx_queue_len.attr,
  	&dev_attr_gro_flush_timeout.attr,
+	&dev_attr_xnet_flush_tstamp.attr,
  	&dev_attr_napi_defer_hard_irqs.attr,
  	&dev_attr_phys_port_id.attr,
  	&dev_attr_phys_port_name.attr,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ba2f38246f07..b0f6b96c7b2a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5440,19 +5440,21 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
  EXPORT_SYMBOL(skb_try_coalesce);

  /**
- * skb_scrub_packet - scrub an skb
+ * __skb_scrub_packet - scrub an skb
   *
   * @skb: buffer to clean
   * @xnet: packet is crossing netns
+ * @tstamp: timestamp needs scrubbing
   *
- * skb_scrub_packet can be used after encapsulating or decapsulting a packet
+ * __skb_scrub_packet can be used after encapsulating or decapsulting a packet
   * into/from a tunnel. Some information have to be cleared during these
   * operations.
- * skb_scrub_packet can also be used to clean a skb before injecting it in
+ *
+ * __skb_scrub_packet can also be used to clean a skb before injecting it in
   * another namespace (@xnet == true). We have to clear all information in the
   * skb that could impact namespace isolation.
   */
-void skb_scrub_packet(struct sk_buff *skb, bool xnet)
+void __skb_scrub_packet(struct sk_buff *skb, bool xnet, bool tstamp)
  {
  	skb->pkt_type = PACKET_HOST;
  	skb->skb_iif = 0;
@@ -5472,9 +5474,10 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)

  	ipvs_reset(skb);
  	skb->mark = 0;
-	skb->tstamp = 0;
+	if (tstamp)
+		skb->tstamp = 0;
  }
-EXPORT_SYMBOL_GPL(skb_scrub_packet);
+EXPORT_SYMBOL_GPL(__skb_scrub_packet);

  /**
   * skb_gso_transport_seglen - Return length of individual segments of a gso packet
-- 
2.21.0


* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-07 14:27   ` Willem de Bruijn
  2021-12-07 21:48     ` Daniel Borkmann
@ 2021-12-08  0:07     ` Martin KaFai Lau
  2021-12-08  0:44       ` Willem de Bruijn
  1 sibling, 1 reply; 18+ messages in thread
From: Martin KaFai Lau @ 2021-12-08  0:07 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team

On Tue, Dec 07, 2021 at 09:27:55AM -0500, Willem de Bruijn wrote:
> On Mon, Dec 6, 2021 at 9:01 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > The skb->tstamp may be set by a local sk (as a sender in tcp) which then
> > forwarded and delivered to another sk (as a receiver).
> >
> > An example:
> >     sender-sk => veth@netns =====> veth@host => receiver-sk
> >                              ^^^
> >                         __dev_forward_skb
> >
> > The skb->tstamp is marked with a future TX time.  This future
> > skb->tstamp will confuse the receiver-sk.
> >
> > This patch marks the skb if the skb->tstamp is forwarded.
> > Before using the skb->tstamp as a rx timestamp, it needs
> > to be re-stamped to avoid getting a future time.  It is
> > done in the RX timestamp reading helper skb_get_ktime().
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  include/linux/skbuff.h | 14 +++++++++-----
> >  net/core/dev.c         |  4 +++-
> >  net/core/skbuff.c      |  6 +++++-
> >  3 files changed, 17 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index b609bdc5398b..bc4ae34c4e22 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -867,6 +867,7 @@ struct sk_buff {
> >         __u8                    decrypted:1;
> >  #endif
> >         __u8                    slow_gro:1;
> > +       __u8                    fwd_tstamp:1;
> >
> >  #ifdef CONFIG_NET_SCHED
> >         __u16                   tc_index;       /* traffic control index */
> > @@ -3806,9 +3807,12 @@ static inline void skb_copy_to_linear_data_offset(struct sk_buff *skb,
> >  }
> >
> >  void skb_init(void);
> > +void net_timestamp_set(struct sk_buff *skb);
> >
> > -static inline ktime_t skb_get_ktime(const struct sk_buff *skb)
> > +static inline ktime_t skb_get_ktime(struct sk_buff *skb)
> >  {
> > +       if (unlikely(skb->fwd_tstamp))
> > +               net_timestamp_set(skb);
> >         return ktime_mono_to_real_cond(skb->tstamp);
> 
> This changes timestamp behavior for existing applications, probably
> worth mentioning in the commit message if nothing else. A timestamp
> taking at the time of the recv syscall is not very useful.
> 
> If a forwarded timestamp is not a future delivery time (as those are
> scrubbed), is it not correct to just deliver the original timestamp?
> It probably was taken at some earlier __netif_receive_skb_core.
Makes sense.  I will compare with the current mono clock first before
resetting and also mention this behavior change in the commit message.

Do you think it will be too heavy to always compare with
the current time without testing the skb->fwd_tstamp bit
first?

> 
> >  }
> >
> > -static inline void net_timestamp_set(struct sk_buff *skb)
> > +void net_timestamp_set(struct sk_buff *skb)
> >  {
> >         skb->tstamp = 0;
> > +       skb->fwd_tstamp = 0;
> >         if (static_branch_unlikely(&netstamp_needed_key))
> >                 __net_timestamp(skb);
> >  }
> > +EXPORT_SYMBOL(net_timestamp_set);
> >
> >  #define net_timestamp_check(COND, SKB)                         \
> >         if (static_branch_unlikely(&netstamp_needed_key)) {     \
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index f091c7807a9e..181ddc989ead 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5295,8 +5295,12 @@ void skb_scrub_tstamp(struct sk_buff *skb)
> >  {
> >         struct sock *sk = skb->sk;
> >
> > -       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME))
> > +       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME)) {
> 
> There is a slight race here with the socket flipping the feature on/off.
Right, I think it is a race inherited from relating skb->tstamp to
a bit in sk, like the existing sch_etf.c does.
Directly setting a bit in the skb when setting skb->tstamp will help.

> 
> >
> >                 skb->tstamp = 0;
> > +               skb->fwd_tstamp = 0;
> > +       } else if (skb->tstamp) {
> > +               skb->fwd_tstamp = 1;
> > +       }
> 
> SO_TXTIME future delivery times are scrubbed, but TCP future delivery
> times are not?
It is not so much about scrubbing a future SO_TXTIME or a future TCP
delivery time for local delivery.

fwd_mono_tstamp may be a better name.  It is about the forwarded tstamp
being in mono.  e.g. the packet from a container-netns can be queued
at the fq@hostns (the case described in the patch 1 commit log).
Also, the bpf@ingress@veth@hostns can now expect that skb->tstamp is in
mono time.  The BPF side does not have a helper returning the real clock,
so it is safe to assume that a bpf prog is comparing (or setting)
skb->tstamp as mono as well.
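
(For illustration only, not code from this series: a tc ingress prog
treating skb->tstamp as a mono EDT would typically compare it against
bpf_ktime_get_ns(); the 2s drop horizon below is a made-up number.)

/* Hypothetical sketch: both sides of the comparison are CLOCK_MONOTONIC,
 * which is what makes the assumption above safe.
 */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int edt_check(struct __sk_buff *skb)
{
	__u64 now = bpf_ktime_get_ns();		/* mono */
	__u64 horizon = 2ull * 1000000000ull;	/* assumed 2s horizon */

	if (skb->tstamp > now + horizon)
		return TC_ACT_SHOT;		/* EDT too far out, drop */
	if (!skb->tstamp)
		skb->tstamp = now;		/* stamp as mono if missing */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";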

> If adding a bit, might it be simpler to add a bit tstamp_is_edt, and
> scrub based on that. That is also not open to the above race.
One of my earlier attempts did add a tstamp_is_tx_mono bit,
set it in tcp_output.c and then tested it before scrubbing.
Other than changing tcp_output.c (e.g. in __tcp_transmit_skb),
I ended up making another change on the bpf side to also set
this bit when the bpf_prog updates __sk_buff->tstamp.  Thus,
in this patch, I ended up setting a bit only in the forward path.

I can go back to retry the tstamp_is_edt/tstamp_is_tx_mono idea and
that can also avoid the race in testing sock_flag(sk, SOCK_TXTIME)
as you suggested.

Thanks for the review !


* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-08  0:07     ` Martin KaFai Lau
@ 2021-12-08  0:44       ` Willem de Bruijn
  2021-12-08  2:45         ` Martin KaFai Lau
  0 siblings, 1 reply; 18+ messages in thread
From: Willem de Bruijn @ 2021-12-08  0:44 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, netdev, Alexei Starovoitov, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team

> > > -static inline ktime_t skb_get_ktime(const struct sk_buff *skb)
> > > +static inline ktime_t skb_get_ktime(struct sk_buff *skb)
> > >  {
> > > +       if (unlikely(skb->fwd_tstamp))
> > > +               net_timestamp_set(skb);
> > >         return ktime_mono_to_real_cond(skb->tstamp);
> >
> > This changes timestamp behavior for existing applications, probably
> > worth mentioning in the commit message if nothing else. A timestamp
> > taking at the time of the recv syscall is not very useful.
> >
> > If a forwarded timestamp is not a future delivery time (as those are
> > scrubbed), is it not correct to just deliver the original timestamp?
> > It probably was taken at some earlier __netif_receive_skb_core.
> Make sense.  I will compare with the current mono clock first before
> resetting and also mention this behavior change in the commit message.
>
> Do you think it will be too heavy to always compare with
> the current time without testing the skb->fwd_tstamp bit
> first?

There are other examples of code using ktime_get and variants in the
hot path, such as FQ.

Especially if skb_get_ktime is called in recv() timestamp helpers, it
is perhaps acceptable. If not ideal. If we need an skb bit anyway,
then this is moot.

> >
> > >  }
> > >
> > > -static inline void net_timestamp_set(struct sk_buff *skb)
> > > +void net_timestamp_set(struct sk_buff *skb)
> > >  {
> > >         skb->tstamp = 0;
> > > +       skb->fwd_tstamp = 0;
> > >         if (static_branch_unlikely(&netstamp_needed_key))
> > >                 __net_timestamp(skb);
> > >  }
> > > +EXPORT_SYMBOL(net_timestamp_set);
> > >
> > >  #define net_timestamp_check(COND, SKB)                         \
> > >         if (static_branch_unlikely(&netstamp_needed_key)) {     \
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index f091c7807a9e..181ddc989ead 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -5295,8 +5295,12 @@ void skb_scrub_tstamp(struct sk_buff *skb)
> > >  {
> > >         struct sock *sk = skb->sk;
> > >
> > > -       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME))
> > > +       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME)) {
> >
> > There is a slight race here with the socket flipping the feature on/off.
> Right, I think it is an inherited race by relating skb->tstamp with
> a bit in sk, like the existing sch_etf.c.
> Directly setting a bit in skb when setting the skb->tstamp will help.
>
> >
> > >
> > >                 skb->tstamp = 0;
> > > +               skb->fwd_tstamp = 0;
> > > +       } else if (skb->tstamp) {
> > > +               skb->fwd_tstamp = 1;
> > > +       }
> >
> > SO_TXTIME future delivery times are scrubbed, but TCP future delivery
> > times are not?
> It is not too much about scrubbing future SO_TXTIME or future TCP
> delivery time for the local delivery.

The purpose of the above is to reset future delivery time whenever it
can be mistaken for a timestamp, right?

This function is called on forwarding, redirection, looping from
egress to ingress with __dev_forward_skb, etc. But then it breaks the
delivery time forwarding over veth that I thought was the purpose of
this patch series. I guess I'm a bit hazy about when exactly this is
supposed to be scrubbed.

> fwd_mono_tstamp may be a better name.  It is about the forwarded tstamp
> is in mono.

After your change skb->tstamp is no longer in CLOCK_REALTIME, right?

Somewhat annoyingly, that does not imply that it is always
CLOCK_MONOTONIC. Because while FQ uses that, ETF is programmed with
CLOCK_TAI.

Perhaps skb->delivery_time is the most specific description. And that
is easy to test for in skb_scrub_tstamp.


> e.g. the packet from a container-netns can be queued
> at the fq@hostns (the case described in patch 1 commit log).
> Also, the bpf@ingress@veth@hostns can now expect the skb->tstamp is in
> mono time.  BPF side does not have helper returning real clock, so it is
> safe to assume that bpf prog is comparing (or setting) skb->tstamp as
> mono also.
>
> > If adding a bit, might it be simpler to add a bit tstamp_is_edt, and
> > scrub based on that. That is also not open to the above race.
> It was one of my earlier attempts by adding tstamp_is_tx_mono and
> set it in tcp_output.c and then test it before scrubbing.
> Other than changing the tcp_output.c (e.g. in __tcp_transmit_skb),
> I ended up making another change on the bpf side to also set
> this bit when the bpf_prog is updating the __sk_buff->tstamp.  Thus,
> in this patch , I ended up setting a bit only in the forward path.
>
> I can go back to retry the tstamp_is_edt/tstamp_is_tx_mono idea and
> that can also avoid the race in testing sock_flag(sk, SOCK_TXTIME)
> as you suggested.

Sounds great, thanks


* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-08  0:44       ` Willem de Bruijn
@ 2021-12-08  2:45         ` Martin KaFai Lau
  0 siblings, 0 replies; 18+ messages in thread
From: Martin KaFai Lau @ 2021-12-08  2:45 UTC (permalink / raw)
  To: Willem de Bruijn, Daniel Borkmann
  Cc: netdev, Alexei Starovoitov, David Miller, Eric Dumazet,
	Jakub Kicinski, kernel-team

On Tue, Dec 07, 2021 at 07:44:05PM -0500, Willem de Bruijn wrote:
> > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > index f091c7807a9e..181ddc989ead 100644
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -5295,8 +5295,12 @@ void skb_scrub_tstamp(struct sk_buff *skb)
> > > >  {
> > > >         struct sock *sk = skb->sk;
> > > >
> > > > -       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME))
> > > > +       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME)) {
> > >
> > > There is a slight race here with the socket flipping the feature on/off.
> > Right, I think it is an inherited race by relating skb->tstamp with
> > a bit in sk, like the existing sch_etf.c.
> > Directly setting a bit in skb when setting the skb->tstamp will help.
> >
> > >
> > > >
> > > >                 skb->tstamp = 0;
> > > > +               skb->fwd_tstamp = 0;
> > > > +       } else if (skb->tstamp) {
> > > > +               skb->fwd_tstamp = 1;
> > > > +       }
> > >
> > > SO_TXTIME future delivery times are scrubbed, but TCP future delivery
> > > times are not?
> > It is not too much about scrubbing future SO_TXTIME or future TCP
> > delivery time for the local delivery.
> 
> The purpose of the above is to reset future delivery time whenever it
> can be mistaken for a timestamp, right?
> 
> This function is called on forwarding, redirection, looping from
> egress to ingress with __dev_forward_skb, etc. But then it breaks the
> delivery time forwarding over veth that I thought was the purpose of
> this patch series. I guess I'm a bit hazy when this is supposed to be
> scrubbed exactly.
> 
> > fwd_mono_tstamp may be a better name.  It is about the forwarded tstamp
> > is in mono.
> 
> After your change skb->tstamp is no longer in CLOCK_REALTIME, right?
Right.  The __net_timestamp() will use CLOCK_MONOTONIC.
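i.e. conceptually something like this (a minimal sketch of the clock
switch only, not the actual diff in patch 1):

	static inline void __net_timestamp(struct sk_buff *skb)
	{
		skb->tstamp = ktime_get();	/* was: ktime_get_real() */
	}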

> Somewhat annoyingly, that does not imply that it is always
> CLOCK_MONOTONIC. Because while FQ uses that, ETF is programmed with
> CLOCK_TAI.
Yes, it is the annoying part, so this patch keeps the tstamp
scrubbing for SO_TXTIME.

If a sk in veth@netns uses SO_TXTIME to set the tstamp in TAI and
it is not scrubbed here, the skb may get forwarded to the fq@hostns
and then get dropped.
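For context, such a sender would set this up roughly as below.  This is
only an illustrative userspace sketch (not code from this series); the
socket setup around it is assumed, and SO_TXTIME needs reasonably recent
kernel/glibc headers:

	#include <linux/net_tstamp.h>	/* struct sock_txtime */
	#include <sys/socket.h>
	#include <time.h>

	/* 'fd' is an already set up UDP socket (assumption for the sketch) */
	static int enable_txtime_tai(int fd)
	{
		struct sock_txtime cfg = {
			.clockid = CLOCK_TAI,
			.flags	 = 0,
		};

		/* per-packet delivery times are then passed via an SCM_TXTIME
		 * cmsg on sendmsg() and land in skb->tstamp in TAI, which is
		 * what sch_etf expects
		 */
		return setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg));
	}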

skb_get_ktime() also won't know which clock (mono or tai?) to compare
the tstamp against, and so whether it needs to be reset.
Alternatively, it can always be re-stamped (__net_timestamp()) much earlier
in the stack, before recvmsg(), e.g. just after sch_handle_ingress()
when TC_ACT_OK is returned, as Daniel also mentioned in another thread.
That would be limited to the bpf@ingress (and then bpf_redirect) usecase
instead of being generally applicable to ip[6]_forward.  However,
the benefit is a more limited scope of impact and potential breakage.
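Roughly, that re-stamp alternative would look like this in
__netif_receive_skb_core() (sketch only; the exact placement and the
condition would need more care):

	skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev, &another);
	if (another)
		goto another_round;
	if (!skb)
		goto out;
	/* re-stamp so local delivery never sees a tstamp (possibly a
	 * future EDT, possibly non-mono) left behind by a forwarding
	 * or redirecting bpf prog
	 */
	__net_timestamp(skb);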

> Perhaps skb->delivery_time is the most specific description. And that
> is easy to test for in skb_scrub_tstamp.
> 
> 
> > e.g. the packet from a container-netns can be queued
> > at the fq@hostns (the case described in patch 1 commit log).
> > Also, the bpf@ingress@veth@hostns can now expect the skb->tstamp is in
> > mono time.  BPF side does not have helper returning real clock, so it is
> > safe to assume that bpf prog is comparing (or setting) skb->tstamp as
> > mono also.
> >
> > > If adding a bit, might it be simpler to add a bit tstamp_is_edt, and
> > > scrub based on that. That is also not open to the above race.
> > It was one of my earlier attempts by adding tstamp_is_tx_mono and
> > set it in tcp_output.c and then test it before scrubbing.
> > Other than changing the tcp_output.c (e.g. in __tcp_transmit_skb),
> > I ended up making another change on the bpf side to also set
> > this bit when the bpf_prog is updating the __sk_buff->tstamp.  Thus,
> > in this patch , I ended up setting a bit only in the forward path.
> >
> > I can go back to retry the tstamp_is_edt/tstamp_is_tx_mono idea and
> > that can also avoid the race in testing sock_flag(sk, SOCK_TXTIME)
> > as you suggested.
> 
> Sounds great, thanks

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-07 21:48     ` Daniel Borkmann
  2021-12-07 23:06       ` Daniel Borkmann
@ 2021-12-08  8:18       ` Martin KaFai Lau
  2021-12-08  8:30         ` Martin KaFai Lau
  1 sibling, 1 reply; 18+ messages in thread
From: Martin KaFai Lau @ 2021-12-08  8:18 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Willem de Bruijn, netdev, Alexei Starovoitov, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team

On Tue, Dec 07, 2021 at 10:48:53PM +0100, Daniel Borkmann wrote:
> On 12/7/21 3:27 PM, Willem de Bruijn wrote:
> > On Mon, Dec 6, 2021 at 9:01 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > 
> > > The skb->tstamp may be set by a local sk (as a sender in tcp) which then
> > > forwarded and delivered to another sk (as a receiver).
> > > 
> > > An example:
> > >      sender-sk => veth@netns =====> veth@host => receiver-sk
> > >                               ^^^
> > >                          __dev_forward_skb
> > > 
> > > The skb->tstamp is marked with a future TX time.  This future
> > > skb->tstamp will confuse the receiver-sk.
> > > 
> > > This patch marks the skb if the skb->tstamp is forwarded.
> > > Before using the skb->tstamp as a rx timestamp, it needs
> > > to be re-stamped to avoid getting a future time.  It is
> > > done in the RX timestamp reading helper skb_get_ktime().
> > > 
> > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > ---
> > >   include/linux/skbuff.h | 14 +++++++++-----
> > >   net/core/dev.c         |  4 +++-
> > >   net/core/skbuff.c      |  6 +++++-
> > >   3 files changed, 17 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > index b609bdc5398b..bc4ae34c4e22 100644
> > > --- a/include/linux/skbuff.h
> > > +++ b/include/linux/skbuff.h
> > > @@ -867,6 +867,7 @@ struct sk_buff {
> > >          __u8                    decrypted:1;
> > >   #endif
> > >          __u8                    slow_gro:1;
> > > +       __u8                    fwd_tstamp:1;
> > > 
> > >   #ifdef CONFIG_NET_SCHED
> > >          __u16                   tc_index;       /* traffic control index */
> > > @@ -3806,9 +3807,12 @@ static inline void skb_copy_to_linear_data_offset(struct sk_buff *skb,
> > >   }
> > > 
> > >   void skb_init(void);
> > > +void net_timestamp_set(struct sk_buff *skb);
> > > 
> > > -static inline ktime_t skb_get_ktime(const struct sk_buff *skb)
> > > +static inline ktime_t skb_get_ktime(struct sk_buff *skb)
> > >   {
> > > +       if (unlikely(skb->fwd_tstamp))
> > > +               net_timestamp_set(skb);
> > >          return ktime_mono_to_real_cond(skb->tstamp);
> > 
> > This changes timestamp behavior for existing applications, probably
> > worth mentioning in the commit message if nothing else. A timestamp
> > taking at the time of the recv syscall is not very useful.
> > 
> > If a forwarded timestamp is not a future delivery time (as those are
> > scrubbed), is it not correct to just deliver the original timestamp?
> > It probably was taken at some earlier __netif_receive_skb_core.
> > 
> > >   }
> > > 
> > > -static inline void net_timestamp_set(struct sk_buff *skb)
> > > +void net_timestamp_set(struct sk_buff *skb)
> > >   {
> > >          skb->tstamp = 0;
> > > +       skb->fwd_tstamp = 0;
> > >          if (static_branch_unlikely(&netstamp_needed_key))
> > >                  __net_timestamp(skb);
> > >   }
> > > +EXPORT_SYMBOL(net_timestamp_set);
> > > 
> > >   #define net_timestamp_check(COND, SKB)                         \
> > >          if (static_branch_unlikely(&netstamp_needed_key)) {     \
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index f091c7807a9e..181ddc989ead 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -5295,8 +5295,12 @@ void skb_scrub_tstamp(struct sk_buff *skb)
> > >   {
> > >          struct sock *sk = skb->sk;
> > > 
> > > -       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME))
> > > +       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME)) {
> > 
> > There is a slight race here with the socket flipping the feature on/off.
> > 
> > >                  skb->tstamp = 0;
> > > +               skb->fwd_tstamp = 0;
> > > +       } else if (skb->tstamp) {
> > > +               skb->fwd_tstamp = 1;
> > > +       }
> > 
> > SO_TXTIME future delivery times are scrubbed, but TCP future delivery
> > times are not?
> > 
> > If adding a bit, might it be simpler to add a bit tstamp_is_edt, and
> > scrub based on that. That is also not open to the above race.
> 
> One other thing I wonder, BPF progs at host-facing veth's tc ingress which
> are not aware of skb->tstamp will then see a tstamp from future given we
> intentionally bypass the net_timestamp_check() and might get confused (or
> would confuse higher-layer application logic)? Not quite sure yet if they
> would be the only affected user.
Considering the variety of clocks used in skb->tstamp (real/mono, and also
tai in SO_TXTIME), in general I am not sure the tc-bpf can assume anything
about the skb->tstamp now.
Also, since there is only the mono clock bpf_ktime_get_ns() helper, the most
reasonable usage for tc-bpf now is to set the EDT, which is in mono.  This
seems to be the intention when __sk_buff->tstamp was added.
For ingress, it is real clock now.  Other than simply printing it out,
it is hard to think of a good way to use the value.  Also, although
it is unlikely, net_timestamp_check() does not always stamp the skb.
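For illustration, the tc-bpf EDT pattern I mean is roughly the following
sketch (the 1ms pacing value and prog name are made up):

	#include <linux/bpf.h>
	#include <linux/pkt_cls.h>
	#include <bpf/bpf_helpers.h>

	SEC("tc")
	int set_edt(struct __sk_buff *skb)
	{
		/* bpf_ktime_get_ns() is CLOCK_MONOTONIC, the same clock as
		 * fq's EDT, so pushing the tstamp ~1ms out paces the packet
		 */
		__u64 edt = bpf_ktime_get_ns() + 1000 * 1000;

		if (skb->tstamp < edt)
			skb->tstamp = edt;
		return TC_ACT_OK;
	}

	char _license[] SEC("license") = "GPL";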

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-08  8:18       ` Martin KaFai Lau
@ 2021-12-08  8:30         ` Martin KaFai Lau
  2021-12-08 18:27           ` Eric Dumazet
  2021-12-08 19:27           ` Daniel Borkmann
  0 siblings, 2 replies; 18+ messages in thread
From: Martin KaFai Lau @ 2021-12-08  8:30 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Willem de Bruijn, netdev, Alexei Starovoitov, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team

On Wed, Dec 08, 2021 at 12:18:46AM -0800, Martin KaFai Lau wrote:
> On Tue, Dec 07, 2021 at 10:48:53PM +0100, Daniel Borkmann wrote:
> > On 12/7/21 3:27 PM, Willem de Bruijn wrote:
> > > On Mon, Dec 6, 2021 at 9:01 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > 
> > > > The skb->tstamp may be set by a local sk (as a sender in tcp) which then
> > > > forwarded and delivered to another sk (as a receiver).
> > > > 
> > > > An example:
> > > >      sender-sk => veth@netns =====> veth@host => receiver-sk
> > > >                               ^^^
> > > >                          __dev_forward_skb
> > > > 
> > > > The skb->tstamp is marked with a future TX time.  This future
> > > > skb->tstamp will confuse the receiver-sk.
> > > > 
> > > > This patch marks the skb if the skb->tstamp is forwarded.
> > > > Before using the skb->tstamp as a rx timestamp, it needs
> > > > to be re-stamped to avoid getting a future time.  It is
> > > > done in the RX timestamp reading helper skb_get_ktime().
> > > > 
> > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > ---
> > > >   include/linux/skbuff.h | 14 +++++++++-----
> > > >   net/core/dev.c         |  4 +++-
> > > >   net/core/skbuff.c      |  6 +++++-
> > > >   3 files changed, 17 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > index b609bdc5398b..bc4ae34c4e22 100644
> > > > --- a/include/linux/skbuff.h
> > > > +++ b/include/linux/skbuff.h
> > > > @@ -867,6 +867,7 @@ struct sk_buff {
> > > >          __u8                    decrypted:1;
> > > >   #endif
> > > >          __u8                    slow_gro:1;
> > > > +       __u8                    fwd_tstamp:1;
> > > > 
> > > >   #ifdef CONFIG_NET_SCHED
> > > >          __u16                   tc_index;       /* traffic control index */
> > > > @@ -3806,9 +3807,12 @@ static inline void skb_copy_to_linear_data_offset(struct sk_buff *skb,
> > > >   }
> > > > 
> > > >   void skb_init(void);
> > > > +void net_timestamp_set(struct sk_buff *skb);
> > > > 
> > > > -static inline ktime_t skb_get_ktime(const struct sk_buff *skb)
> > > > +static inline ktime_t skb_get_ktime(struct sk_buff *skb)
> > > >   {
> > > > +       if (unlikely(skb->fwd_tstamp))
> > > > +               net_timestamp_set(skb);
> > > >          return ktime_mono_to_real_cond(skb->tstamp);
> > > 
> > > This changes timestamp behavior for existing applications, probably
> > > worth mentioning in the commit message if nothing else. A timestamp
> > > taking at the time of the recv syscall is not very useful.
> > > 
> > > If a forwarded timestamp is not a future delivery time (as those are
> > > scrubbed), is it not correct to just deliver the original timestamp?
> > > It probably was taken at some earlier __netif_receive_skb_core.
> > > 
> > > >   }
> > > > 
> > > > -static inline void net_timestamp_set(struct sk_buff *skb)
> > > > +void net_timestamp_set(struct sk_buff *skb)
> > > >   {
> > > >          skb->tstamp = 0;
> > > > +       skb->fwd_tstamp = 0;
> > > >          if (static_branch_unlikely(&netstamp_needed_key))
> > > >                  __net_timestamp(skb);
> > > >   }
> > > > +EXPORT_SYMBOL(net_timestamp_set);
> > > > 
> > > >   #define net_timestamp_check(COND, SKB)                         \
> > > >          if (static_branch_unlikely(&netstamp_needed_key)) {     \
> > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > index f091c7807a9e..181ddc989ead 100644
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -5295,8 +5295,12 @@ void skb_scrub_tstamp(struct sk_buff *skb)
> > > >   {
> > > >          struct sock *sk = skb->sk;
> > > > 
> > > > -       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME))
> > > > +       if (sk && sk_fullsock(sk) && sock_flag(sk, SOCK_TXTIME)) {
> > > 
> > > There is a slight race here with the socket flipping the feature on/off.
> > > 
> > > >                  skb->tstamp = 0;
> > > > +               skb->fwd_tstamp = 0;
> > > > +       } else if (skb->tstamp) {
> > > > +               skb->fwd_tstamp = 1;
> > > > +       }
> > > 
> > > SO_TXTIME future delivery times are scrubbed, but TCP future delivery
> > > times are not?
> > > 
> > > If adding a bit, might it be simpler to add a bit tstamp_is_edt, and
> > > scrub based on that. That is also not open to the above race.
> > 
> > One other thing I wonder, BPF progs at host-facing veth's tc ingress which
> > are not aware of skb->tstamp will then see a tstamp from future given we
> > intentionally bypass the net_timestamp_check() and might get confused (or
> > would confuse higher-layer application logic)? Not quite sure yet if they
> > would be the only affected user.
> Considering the variety of clock used in skb->tstamp (real/mono, and also
> tai in SO_TXTIME),  in general I am not sure if the tc-bpf can assume anything
> in the skb->tstamp now.
> Also, there is only mono clock bpf_ktime_get helper, the most reasonable usage
> now for tc-bpf is to set the EDT which is in mono.  This seems to be the
> intention when the __sk_buff->tstamp was added.
> For ingress, it is real clock now.  Other than simply printing it out,
> it is hard to think of a good way to use the value.  Also, although
> it is unlikely, net_timestamp_check() does not always stamp the skb.
For non-bpf ingress, hmmm.... yeah, not sure if it is indeed an issue :/
maybe save the tx tstamp first and then temporarily re-stamp with __net_timestamp()
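i.e. something like this very rough sketch, just to show the shape of the
idea (the function names are made up, and where exactly to park the saved
value is the open question):

	static void restamp_for_local_rx(struct sk_buff *skb, ktime_t *saved_edt)
	{
		*saved_edt = skb->tstamp;	/* keep the tx/EDT value */
		__net_timestamp(skb);		/* rx path sees a fresh rx tstamp */
	}

	static void restore_edt_for_egress(struct sk_buff *skb, ktime_t saved_edt)
	{
		skb->tstamp = saved_edt;	/* back in place before fq/etf */
	}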

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-08  8:30         ` Martin KaFai Lau
@ 2021-12-08 18:27           ` Eric Dumazet
  2021-12-08 20:48             ` Martin KaFai Lau
  2021-12-08 19:27           ` Daniel Borkmann
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2021-12-08 18:27 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Daniel Borkmann, Willem de Bruijn, netdev, Alexei Starovoitov,
	David Miller, Jakub Kicinski, Kernel-team

On Wed, Dec 8, 2021 at 12:30 AM Martin KaFai Lau <kafai@fb.com> wrote:

> For non bpf ingress, hmmm.... yeah, not sure if it is indeed an issue :/
> may be save the tx tstamp first and then temporarily restamp with __net_timestamp()

Martin, have you looked at time namespaces (CLONE_NEWTIME) ?

Perhaps we need to have more than one bit to describe time bases.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-08  8:30         ` Martin KaFai Lau
  2021-12-08 18:27           ` Eric Dumazet
@ 2021-12-08 19:27           ` Daniel Borkmann
  2021-12-08 22:19             ` Martin KaFai Lau
  1 sibling, 1 reply; 18+ messages in thread
From: Daniel Borkmann @ 2021-12-08 19:27 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, netdev, Alexei Starovoitov, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team

On 12/8/21 9:30 AM, Martin KaFai Lau wrote:
> On Wed, Dec 08, 2021 at 12:18:46AM -0800, Martin KaFai Lau wrote:
>> On Tue, Dec 07, 2021 at 10:48:53PM +0100, Daniel Borkmann wrote:
[...]
>>> One other thing I wonder, BPF progs at host-facing veth's tc ingress which
>>> are not aware of skb->tstamp will then see a tstamp from future given we
>>> intentionally bypass the net_timestamp_check() and might get confused (or
>>> would confuse higher-layer application logic)? Not quite sure yet if they
>>> would be the only affected user.
>> Considering the variety of clock used in skb->tstamp (real/mono, and also
>> tai in SO_TXTIME),  in general I am not sure if the tc-bpf can assume anything
>> in the skb->tstamp now.

But today that's either only 0 or real via __net_timestamp() if skb->tstamp is
read at bpf@ingress@veth@host, no?

>> Also, there is only mono clock bpf_ktime_get helper, the most reasonable usage
>> now for tc-bpf is to set the EDT which is in mono.  This seems to be the
>> intention when the __sk_buff->tstamp was added.

Yep, fwiw, that's also how we only use it in our code base today.

>> For ingress, it is real clock now.  Other than simply printing it out,
>> it is hard to think of a good way to use the value.  Also, although
>> it is unlikely, net_timestamp_check() does not always stamp the skb.
> For non bpf ingress, hmmm.... yeah, not sure if it is indeed an issue :/
> may be save the tx tstamp first and then temporarily restamp with __net_timestamp()

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-08 18:27           ` Eric Dumazet
@ 2021-12-08 20:48             ` Martin KaFai Lau
  2021-12-10 10:19               ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Martin KaFai Lau @ 2021-12-08 20:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Daniel Borkmann, Willem de Bruijn, netdev, Alexei Starovoitov,
	David Miller, Jakub Kicinski, Kernel-team

On Wed, Dec 08, 2021 at 10:27:51AM -0800, Eric Dumazet wrote:
> On Wed, Dec 8, 2021 at 12:30 AM Martin KaFai Lau <kafai@fb.com> wrote:
> 
> > For non bpf ingress, hmmm.... yeah, not sure if it is indeed an issue :/
> > may be save the tx tstamp first and then temporarily restamp with __net_timestamp()
> 
> Martin, have you looked at time namespaces (CLONE_NEWTIME) ?
> 
> Perhaps we need to have more than one bit to describe time bases.
My noob understanding is that it only affects the time returned
to the user by the syscall.  Could you explain how that
may affect the time in skb->tstamp?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-08 19:27           ` Daniel Borkmann
@ 2021-12-08 22:19             ` Martin KaFai Lau
  2021-12-09 12:58               ` Daniel Borkmann
  0 siblings, 1 reply; 18+ messages in thread
From: Martin KaFai Lau @ 2021-12-08 22:19 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Willem de Bruijn, netdev, Alexei Starovoitov, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team

On Wed, Dec 08, 2021 at 08:27:45PM +0100, Daniel Borkmann wrote:
> On 12/8/21 9:30 AM, Martin KaFai Lau wrote:
> > On Wed, Dec 08, 2021 at 12:18:46AM -0800, Martin KaFai Lau wrote:
> > > On Tue, Dec 07, 2021 at 10:48:53PM +0100, Daniel Borkmann wrote:
> [...]
> > > > One other thing I wonder, BPF progs at host-facing veth's tc ingress which
> > > > are not aware of skb->tstamp will then see a tstamp from future given we
> > > > intentionally bypass the net_timestamp_check() and might get confused (or
> > > > would confuse higher-layer application logic)? Not quite sure yet if they
> > > > would be the only affected user.
> > > Considering the variety of clock used in skb->tstamp (real/mono, and also
> > > tai in SO_TXTIME),  in general I am not sure if the tc-bpf can assume anything
> > > in the skb->tstamp now.
> 
> But today that's either only 0 or real via __net_timestamp() if skb->tstamp is
> read at bpf@ingress@veth@host, no?
I think I was trying to say that the CLOCK_REALTIME in __sk_buff->tstamp
is not practically useful in bpf@ingress other than as an increasing number.
There is no easy way to get the 'now' in CLOCK_REALTIME to compare with
in order to learn whether it is a future or current time.  Setting
it based on bpf_ktime_get_ns() in MONO is likely broken currently,
regardless of whether the skb is forwarded or delivered locally.

We have a use case that wants to change the forwarded EDT
in bpf@ingress@veth@host before forwarding.  If __sk_buff->tstamp has
to stay as the 'now' CLOCK_REALTIME at ingress, a separate way is
needed for the bpf to check/change the forwarded EDT.
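For concreteness, the kind of prog we have in mind is roughly the
following sketch (the 2ms cap and the ifindex are made up, and it assumes
the forwarded mono EDT is still visible in skb->tstamp at this hook):

	#include <linux/bpf.h>
	#include <linux/pkt_cls.h>
	#include <bpf/bpf_helpers.h>

	#define ETH0_IFINDEX 2			/* made up for the sketch */

	SEC("tc")
	int cap_netns_edt(struct __sk_buff *skb)
	{
		__u64 max_edt = bpf_ktime_get_ns() + 2 * 1000 * 1000;

		/* clamp the netns sender's EDT to bound its bandwidth */
		if (skb->tstamp > max_edt)
			skb->tstamp = max_edt;
		return bpf_redirect(ETH0_IFINDEX, 0);
	}

	char _license[] SEC("license") = "GPL";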

Daniel, do you have a suggestion on where to temporarily store
the forwarded EDT so that the bpf@ingress can access it?

> 
> > > Also, there is only mono clock bpf_ktime_get helper, the most reasonable usage
> > > now for tc-bpf is to set the EDT which is in mono.  This seems to be the
> > > intention when the __sk_buff->tstamp was added.
> 
> Yep, fwiw, that's also how we only use it in our code base today.
> 
> > > For ingress, it is real clock now.  Other than simply printing it out,
> > > it is hard to think of a good way to use the value.  Also, although
> > > it is unlikely, net_timestamp_check() does not always stamp the skb.
> > For non bpf ingress, hmmm.... yeah, not sure if it is indeed an issue :/
> > may be save the tx tstamp first and then temporarily restamp with __net_timestamp()

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-08 22:19             ` Martin KaFai Lau
@ 2021-12-09 12:58               ` Daniel Borkmann
  2021-12-10  1:37                 ` Martin KaFai Lau
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Borkmann @ 2021-12-09 12:58 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, netdev, Alexei Starovoitov, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team

On 12/8/21 11:19 PM, Martin KaFai Lau wrote:
> On Wed, Dec 08, 2021 at 08:27:45PM +0100, Daniel Borkmann wrote:
>> On 12/8/21 9:30 AM, Martin KaFai Lau wrote:
>>> On Wed, Dec 08, 2021 at 12:18:46AM -0800, Martin KaFai Lau wrote:
>>>> On Tue, Dec 07, 2021 at 10:48:53PM +0100, Daniel Borkmann wrote:
>> [...]
>>>>> One other thing I wonder, BPF progs at host-facing veth's tc ingress which
>>>>> are not aware of skb->tstamp will then see a tstamp from future given we
>>>>> intentionally bypass the net_timestamp_check() and might get confused (or
>>>>> would confuse higher-layer application logic)? Not quite sure yet if they
>>>>> would be the only affected user.
>>>> Considering the variety of clock used in skb->tstamp (real/mono, and also
>>>> tai in SO_TXTIME),  in general I am not sure if the tc-bpf can assume anything
>>>> in the skb->tstamp now.
>>
>> But today that's either only 0 or real via __net_timestamp() if skb->tstamp is
>> read at bpf@ingress@veth@host, no?
> I think I was trying to say the CLOCK_REALTIME in __sk_buff->tstamp
> is not practically useful in bpf@ingress other than an increasing number.
> No easy way to get the 'now' in CLOCK_REALTIME to compare with
> in order to learn if it is future or current time.  Setting
> it based on bpf_ktime_get_ns() in MONO is likely broken currently
> regardless of the skb is forwarded or delivered locally.
> 
> We have a use case that wants to change the forwarded EDT
> in bpf@ingress@veth@host before forwarding.  If it needs to
> keep __sk_buff->tstamp as the 'now' CLOCK_REALTIME in ingress,
> it needs to provide a separate way for the bpf to check/change
> the forwarded EDT.
> 
> Daniel, do you have suggestion on where to temporarily store
> the forwarded EDT so that the bpf@ingress can access?

Hm, was thinking maybe moving skb->skb_mstamp_ns into the shared info as
in skb_hwtstamps(skb)->skb_mstamp_ns could work. In other words, as a union
with hwtstamp to not bloat it further. And TCP stack as well as everything
else (like sch_fq) could switch to it natively (hwtstamp might only be used
on RX or TX completion from driver side if I'm not mistaken).

But then while this would solve the netns transfer, we would run into the
/same/ issue again when implementing a hairpinning LB where we loop from RX
to TX given this would have to be cleared somewhere again if driver populates
hwtstamp, so not really feasible and bloating shared info with a second
tstamp would bump it by one cacheline. :(

A cleaner BUT still non-generic solution compared to the previous diff I could
think of might be the below. So no change in behavior in general, but if the
bpf@ingress@veth@host needs to access the original tstamp, it could do so
via existing mapping we already have in BPF, and then it could transfer it
for all or certain traffic (up to the prog) via BPF code setting ...

   skb->tstamp = skb->hwtstamp

... and do the redirect from there to the phys dev with BPF_F_KEEP_TSTAMP
flag. Minimally intrusive, but unfortunately only accessible from BPF. Maybe the
use of skb_hwtstamps(skb)->nststamp could be extended though (?)
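For completeness, a prog at bpf@ingress@veth@host using that could then
look roughly like the sketch below (it assumes the diff further down is
applied, relies on the existing __sk_buff->hwtstamp mapping, and the
ifindex is made up):

	#include <linux/bpf.h>
	#include <linux/pkt_cls.h>
	#include <bpf/bpf_helpers.h>

	#define ETH0_IFINDEX 2			/* made up for the sketch */

	SEC("tc")
	int fwd_keep_edt(struct __sk_buff *skb)
	{
		/* restore the EDT that skb_xfer_tstamp() parked in shinfo
		 * (visible here via the hwtstamp mapping) ...
		 */
		if (skb->hwtstamp)
			skb->tstamp = skb->hwtstamp;

		/* ... and ask the kernel not to scrub it on the way out */
		return bpf_redirect(ETH0_IFINDEX, BPF_F_KEEP_TSTAMP);
	}

	char _license[] SEC("license") = "GPL";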

Thanks,
Daniel

  include/linux/skbuff.h   | 14 +++++++++++++-
  include/uapi/linux/bpf.h |  1 +
  net/core/filter.c        | 37 ++++++++++++++++++++++---------------
  net/core/skbuff.c        |  2 +-
  4 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c8cb7e697d47..b7c72fe7e1bb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -418,7 +418,10 @@ static inline bool skb_frag_must_loop(struct page *p)
   * &skb_shared_info. Use skb_hwtstamps() to get a pointer.
   */
  struct skb_shared_hwtstamps {
-	ktime_t	hwtstamp;
+	union {
+		ktime_t	hwtstamp;
+		ktime_t	nststamp;
+	};
  };

  /* Definitions for tx_flags in struct skb_shared_info */
@@ -3711,6 +3714,15 @@ int skb_mpls_dec_ttl(struct sk_buff *skb);
  struct sk_buff *pskb_extract(struct sk_buff *skb, int off, int to_copy,
  			     gfp_t gfp);

+static inline void skb_xfer_tstamp(struct sk_buff *skb)
+{
+	/* initns might still need access to the original
+	 * skb->tstamp from a netns, e.g. for EDT.
+	 */
+	skb_hwtstamps(skb)->nststamp = skb->tstamp;
+	skb->tstamp = 0;
+}
+
  static inline int memcpy_from_msg(void *data, struct msghdr *msg, int len)
  {
  	return copy_from_iter_full(data, len, &msg->msg_iter) ? 0 : -EFAULT;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ba5af15e25f5..12d10ab8bff7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5157,6 +5157,7 @@ enum {
  /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */
  enum {
  	BPF_F_INGRESS			= (1ULL << 0),
+	BPF_F_KEEP_TSTAMP		= (1ULL << 1),
  };

  /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
diff --git a/net/core/filter.c b/net/core/filter.c
index 6102f093d59a..e233b998387c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2097,7 +2097,8 @@ static inline int __bpf_rx_skb_no_mac(struct net_device *dev,
  	return ret;
  }

-static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
+static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb,
+			       bool keep_tstamp)
  {
  	int ret;

@@ -2108,7 +2109,8 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
  	}

  	skb->dev = dev;
-	skb->tstamp = 0;
+	if (!keep_tstamp)
+		skb_xfer_tstamp(skb);

  	dev_xmit_recursion_inc();
  	ret = dev_queue_xmit(skb);
@@ -2136,7 +2138,8 @@ static int __bpf_redirect_no_mac(struct sk_buff *skb, struct net_device *dev,
  	skb_pop_mac_header(skb);
  	skb_reset_mac_len(skb);
  	return flags & BPF_F_INGRESS ?
-	       __bpf_rx_skb_no_mac(dev, skb) : __bpf_tx_skb(dev, skb);
+	       __bpf_rx_skb_no_mac(dev, skb) :
+	       __bpf_tx_skb(dev, skb, flags & BPF_F_KEEP_TSTAMP);
  }

  static int __bpf_redirect_common(struct sk_buff *skb, struct net_device *dev,
@@ -2150,7 +2153,8 @@ static int __bpf_redirect_common(struct sk_buff *skb, struct net_device *dev,

  	bpf_push_mac_rcsum(skb);
  	return flags & BPF_F_INGRESS ?
-	       __bpf_rx_skb(dev, skb) : __bpf_tx_skb(dev, skb);
+	       __bpf_rx_skb(dev, skb) :
+	       __bpf_tx_skb(dev, skb, flags & BPF_F_KEEP_TSTAMP);
  }

  static int __bpf_redirect(struct sk_buff *skb, struct net_device *dev,
@@ -2177,7 +2181,6 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
  	}

  	skb->dev = dev;
-	skb->tstamp = 0;

  	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
  		skb = skb_expand_head(skb, hh_len);
@@ -2275,7 +2278,6 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb,
  	}

  	skb->dev = dev;
-	skb->tstamp = 0;

  	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
  		skb = skb_expand_head(skb, hh_len);
@@ -2367,7 +2369,7 @@ static int __bpf_redirect_neigh_v4(struct sk_buff *skb, struct net_device *dev,
  #endif /* CONFIG_INET */

  static int __bpf_redirect_neigh(struct sk_buff *skb, struct net_device *dev,
-				struct bpf_nh_params *nh)
+				struct bpf_nh_params *nh, bool keep_tstamp)
  {
  	struct ethhdr *ethh = eth_hdr(skb);

@@ -2380,6 +2382,8 @@ static int __bpf_redirect_neigh(struct sk_buff *skb, struct net_device *dev,
  	skb_pull(skb, sizeof(*ethh));
  	skb_unset_mac_header(skb);
  	skb_reset_network_header(skb);
+	if (!keep_tstamp)
+		skb_xfer_tstamp(skb);

  	if (skb->protocol == htons(ETH_P_IP))
  		return __bpf_redirect_neigh_v4(skb, dev, nh);
@@ -2392,9 +2396,9 @@ static int __bpf_redirect_neigh(struct sk_buff *skb, struct net_device *dev,

  /* Internal, non-exposed redirect flags. */
  enum {
-	BPF_F_NEIGH	= (1ULL << 1),
-	BPF_F_PEER	= (1ULL << 2),
-	BPF_F_NEXTHOP	= (1ULL << 3),
+	BPF_F_NEIGH	= (1ULL << 2),
+	BPF_F_PEER	= (1ULL << 3),
+	BPF_F_NEXTHOP	= (1ULL << 4),
  #define BPF_F_REDIRECT_INTERNAL	(BPF_F_NEIGH | BPF_F_PEER | BPF_F_NEXTHOP)
  };

@@ -2468,8 +2472,9 @@ int skb_do_redirect(struct sk_buff *skb)
  		return -EAGAIN;
  	}
  	return flags & BPF_F_NEIGH ?
-	       __bpf_redirect_neigh(skb, dev, flags & BPF_F_NEXTHOP ?
-				    &ri->nh : NULL) :
+	       __bpf_redirect_neigh(skb, dev,
+				    flags & BPF_F_NEXTHOP ? &ri->nh : NULL,
+				    flags & BPF_F_KEEP_TSTAMP) :
  	       __bpf_redirect(skb, dev, flags);
  out_drop:
  	kfree_skb(skb);
@@ -2480,7 +2485,8 @@ BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
  {
  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);

-	if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
+	if (unlikely(flags & (~(BPF_F_INGRESS | BPF_F_KEEP_TSTAMP) |
+			      BPF_F_REDIRECT_INTERNAL)))
  		return TC_ACT_SHOT;

  	ri->flags = flags;
@@ -2523,10 +2529,11 @@ BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
  {
  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);

-	if (unlikely((plen && plen < sizeof(*params)) || flags))
+	if (unlikely((plen && plen < sizeof(*params)) ||
+		     (flags & ~BPF_F_KEEP_TSTAMP)))
  		return TC_ACT_SHOT;

-	ri->flags = BPF_F_NEIGH | (plen ? BPF_F_NEXTHOP : 0);
+	ri->flags = BPF_F_NEIGH | (plen ? BPF_F_NEXTHOP : 0) | flags;
  	ri->tgt_index = ifindex;

  	BUILD_BUG_ON(sizeof(struct bpf_redir_neigh) != sizeof(struct bpf_nh_params));
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ba2f38246f07..782b152a000c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5472,7 +5472,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)

  	ipvs_reset(skb);
  	skb->mark = 0;
-	skb->tstamp = 0;
+	skb_xfer_tstamp(skb);
  }
  EXPORT_SYMBOL_GPL(skb_scrub_packet);

-- 
2.21.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-09 12:58               ` Daniel Borkmann
@ 2021-12-10  1:37                 ` Martin KaFai Lau
  2021-12-10 10:08                   ` Daniel Borkmann
  0 siblings, 1 reply; 18+ messages in thread
From: Martin KaFai Lau @ 2021-12-10  1:37 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Willem de Bruijn, netdev, Alexei Starovoitov, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team

On Thu, Dec 09, 2021 at 01:58:52PM +0100, Daniel Borkmann wrote:
> > Daniel, do you have suggestion on where to temporarily store
> > the forwarded EDT so that the bpf@ingress can access?
> 
> Hm, was thinking maybe moving skb->skb_mstamp_ns into the shared info as
> in skb_hwtstamps(skb)->skb_mstamp_ns could work. In other words, as a union
> with hwtstamp to not bloat it further. And TCP stack as well as everything
> else (like sch_fq) could switch to it natively (hwtstamp might only be used
> on RX or TX completion from driver side if I'm not mistaken).
> 
> But then while this would solve the netns transfer, we would run into the
> /same/ issue again when implementing a hairpinning LB where we loop from RX
> to TX given this would have to be cleared somewhere again if driver populates
> hwtstamp, so not really feasible and bloating shared info with a second
> tstamp would bump it by one cacheline. :(
If the EDT is stored in skb_hwtstamps,
skb->tstamp probably needs to be re-populated for the bpf@tc-egress,
but that should be minor since there is a skb_at_tc_ingress() test.

It seems fq does not touch the shinfo now, so that will be an extra cacheline
to bring in... hmm
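e.g. roughly this kind of re-population (purely illustrative sketch, not
a proposal yet; the helper name is made up):

	/* hand the parked EDT back to skb->tstamp for progs running at
	 * tc-egress so nothing changes for them
	 */
	static inline void skb_repopulate_edt(struct sk_buff *skb)
	{
		if (!skb_at_tc_ingress(skb) && !skb->tstamp)
			skb->tstamp = skb_hwtstamps(skb)->hwtstamp;
	}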

> A cleaner BUT still non-generic solution compared to the previous diff I could
> think of might be the below. So no change in behavior in general, but if the
> bpf@ingress@veth@host needs to access the original tstamp, it could do so
> via existing mapping we already have in BPF, and then it could transfer it
> for all or certain traffic (up to the prog) via BPF code setting ...
> 
>   skb->tstamp = skb->hwtstamp
> 
> ... and do the redirect from there to the phys dev with BPF_F_KEEP_TSTAMP
> flag. Minimal intrusive, but unfortunately only accessible for BPF. Maybe use
> of skb_hwtstamps(skb)->nststamp could be extended though (?)
I like the idea of temporarily storing a future mono EDT
in skb_shared_hwtstamps.

It may open up some possibilities.  Not sure yet how that may look,
but I will try to develop on this.

I may have to separate the fwd-edt problem from the __sk_buff->tstamp accessibility
@ingress to keep it simple first.
I will also try to make it generic before scaling back to a bpf-specific solution.

Thanks for the code and the idea!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-10  1:37                 ` Martin KaFai Lau
@ 2021-12-10 10:08                   ` Daniel Borkmann
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2021-12-10 10:08 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, netdev, Alexei Starovoitov, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team

On 12/10/21 2:37 AM, Martin KaFai Lau wrote:
> On Thu, Dec 09, 2021 at 01:58:52PM +0100, Daniel Borkmann wrote:
>>> Daniel, do you have suggestion on where to temporarily store
>>> the forwarded EDT so that the bpf@ingress can access?
>>
>> Hm, was thinking maybe moving skb->skb_mstamp_ns into the shared info as
>> in skb_hwtstamps(skb)->skb_mstamp_ns could work. In other words, as a union
>> with hwtstamp to not bloat it further. And TCP stack as well as everything
>> else (like sch_fq) could switch to it natively (hwtstamp might only be used
>> on RX or TX completion from driver side if I'm not mistaken).
>>
>> But then while this would solve the netns transfer, we would run into the
>> /same/ issue again when implementing a hairpinning LB where we loop from RX
>> to TX given this would have to be cleared somewhere again if driver populates
>> hwtstamp, so not really feasible and bloating shared info with a second
>> tstamp would bump it by one cacheline. :(
> If the edt is set at skb_hwtstamps,
> skb->tstamp probably needs to be re-populated for the bpf@tc-egress
> but should be minor since there is a skb_at_tc_ingress() test.
> 
> It seems fq does not need shinfo now, so that will be an extra cacheline to
> bring... hmm

Right. :/ The other thing I was wondering (but haven't looked into the
code enough yet to know whether it is feasible)... maybe skb_hwtstamps(skb)->hwtstamp
could be changed to cover both the hw and sw ingress tstamp (meaning, if the nic
doesn't provide one, then we fall back to the sw one and __net_timestamp() stores
it there, too), whereas skb->tstamp would always be an egress tstamp. However, it
might result in quite a lot of churn given the wider-spread use, and more importantly,
the performance implications are also not quite clear, as you mentioned above wrt
the extra cache miss.
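i.e. very roughly (sketch only; the helper name is made up and the churn
it would cause is ignored here):

	/* sw rx stamping falls back into the shinfo slot when the nic did
	 * not already fill it, so the ingress tstamp always lives there
	 * and skb->tstamp is left to egress use
	 */
	static void net_rx_timestamp(struct sk_buff *skb)
	{
		struct skb_shared_hwtstamps *hwts = skb_hwtstamps(skb);

		if (!hwts->hwtstamp)
			hwts->hwtstamp = ktime_get_real();
	}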

>> A cleaner BUT still non-generic solution compared to the previous diff I could
>> think of might be the below. So no change in behavior in general, but if the
>> bpf@ingress@veth@host needs to access the original tstamp, it could do so
>> via existing mapping we already have in BPF, and then it could transfer it
>> for all or certain traffic (up to the prog) via BPF code setting ...
>>
>>    skb->tstamp = skb->hwtstamp
>>
>> ... and do the redirect from there to the phys dev with BPF_F_KEEP_TSTAMP
>> flag. Minimal intrusive, but unfortunately only accessible for BPF. Maybe use
>> of skb_hwtstamps(skb)->nststamp could be extended though (?)
> I like the idea of the possibility in temporarily storing a future mono EDT
> in skb_shared_hwtstamps.
> 
> It may open up some possibilities.  Not sure how that may look like yet
> but I will try to develop on this.

Ok! One thing I noticed later in the diff: for the ingress direction,
aka phys -> host veth -> netns veth, we also do the skb_xfer_tstamp() switch
and might override the tstamp stored by the driver with the one from
__net_timestamp(), but maybe for netns'es that's acceptable (perhaps a test
for an existing skb->sk owner before skb_xfer_tstamp() could do the trick..).
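e.g. a sketch of that test, folded into the helper from the earlier diff
(just to show the idea, not a final proposal):

	static inline void skb_xfer_tstamp(struct sk_buff *skb)
	{
		/* only park the tstamp when a local sender set it
		 * (skb->sk owner still attached); otherwise clear as today
		 */
		if (skb->sk)
			skb_hwtstamps(skb)->nststamp = skb->tstamp;
		skb->tstamp = 0;
	}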

> I may have to separate the fwd-edt problem from __sk_buff->tstamp accessibility
> @ingress to keep it simple first.
> will try to make it generic also before scaling back to a bpf-specific solution.

Yeah, sounds good. If we can solve it generically, even better!

> Thanks for the code and the idea !
Thanks,
Daniel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
  2021-12-08 20:48             ` Martin KaFai Lau
@ 2021-12-10 10:19               ` Eric Dumazet
  0 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2021-12-10 10:19 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Daniel Borkmann, Willem de Bruijn, netdev, Alexei Starovoitov,
	David Miller, Jakub Kicinski, Kernel-team

On Wed, Dec 8, 2021 at 12:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Dec 08, 2021 at 10:27:51AM -0800, Eric Dumazet wrote:
> > On Wed, Dec 8, 2021 at 12:30 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > > For non bpf ingress, hmmm.... yeah, not sure if it is indeed an issue :/
> > > may be save the tx tstamp first and then temporarily restamp with __net_timestamp()
> >
> > Martin, have you looked at time namespaces (CLONE_NEWTIME) ?
> >
> > Perhaps we need to have more than one bit to describe time bases.
> My noob understanding is it only affects the time returning
> to the user in the syscall.  Could you explain how that
> may affect the time in skb->tstamp?

I would think it should affect timestamps (the rx/tx network ones),
otherwise user applications would be broken?

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-12-10 10:19 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-07  2:01 [RFC PATCH net-next 1/2] net: Use mono clock in RX timestamp Martin KaFai Lau
2021-12-07  2:01 ` [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space Martin KaFai Lau
2021-12-07 14:27   ` Willem de Bruijn
2021-12-07 21:48     ` Daniel Borkmann
2021-12-07 23:06       ` Daniel Borkmann
2021-12-08  8:18       ` Martin KaFai Lau
2021-12-08  8:30         ` Martin KaFai Lau
2021-12-08 18:27           ` Eric Dumazet
2021-12-08 20:48             ` Martin KaFai Lau
2021-12-10 10:19               ` Eric Dumazet
2021-12-08 19:27           ` Daniel Borkmann
2021-12-08 22:19             ` Martin KaFai Lau
2021-12-09 12:58               ` Daniel Borkmann
2021-12-10  1:37                 ` Martin KaFai Lau
2021-12-10 10:08                   ` Daniel Borkmann
2021-12-08  0:07     ` Martin KaFai Lau
2021-12-08  0:44       ` Willem de Bruijn
2021-12-08  2:45         ` Martin KaFai Lau
