* [PATCH v6 net-next 0/13] Preserve mono delivery time (EDT) in skb->tstamp
From: Martin KaFai Lau @ 2022-03-02 19:55 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

skb->tstamp was first used as the (rcv) timestamp.
The major usage is to report it to the user (e.g. SO_TIMESTAMP).

Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
during egress and is used by the qdisc (e.g. sch_fq) to decide when
the skb can be passed to the dev.
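
For illustration, the two existing uses of the same field look roughly
like this (a sketch only; tp is the tcp_sock, and skb_mstamp_ns is
TCP's union alias of skb->tstamp):

  /* ingress: (rcv) timestamp, e.g. reported via SO_TIMESTAMP */
  skb->tstamp = ktime_get_real();

  /* egress: TCP stores the EDT (mono clock) in the same field and
   * a qdisc like sch_fq paces the skb based on it
   */
  skb->skb_mstamp_ns = tp->tcp_wstamp_ns;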

Currently, there is no way to tell whether skb->tstamp has the (rcv)
timestamp or the delivery_time, so it is always reset to 0 whenever the
skb is forwarded between egress and ingress.

While it makes sense to always clear the (rcv) timestamp in skb->tstamp
to avoid confusing sch_fq, which expects the delivery_time, it is a
performance issue [0] to clear the delivery_time if the skb finally
egresses to a fq@phy-dev.

This set keeps the mono delivery_time and makes it available to
the final egress interface.  Please see the individual patches for
the details.

[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
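
At a high level, the forwarding-path reset becomes mono-aware.  A
sketch of the skb_clear_tstamp() helper added in patch 2:

  static inline void skb_clear_tstamp(struct sk_buff *skb)
  {
          if (skb->mono_delivery_time)
                  return;         /* keep the EDT for the egress qdisc */

          skb->tstamp = 0;        /* clear the (rcv) timestamp as before */
  }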

v6:
- Add kdoc and use non-UAPI type in patch 6 (Jakub)

v5:
netdev:
- Patch 3 in v4 is broken down into smaller patches 3, 4, and 5 in v5
- The mono_delivery_time bit clearing in __skb_tstamp_tx() is
  done in __net_timestamp() instead.  This is patch 4 in v5.
- v4 missed a skb_clear_delivery_time() for the 'skip_classify' case
  in dev.c.  That is fixed in patch 5 in v5 for correctness.
  The skb_clear_delivery_time() will be moved to a later
  stage in patch 10, so it was an intermediate error in v4.
- Added delivery time handling for nfnetlink_{log, queue}.c in patch 9 (Daniel)
- Added delivery time handling in the IPv6 IOAM hop-by-hop option which has
  an experimental IANA assigned value 49 in patch 8
- Added delivery time handling in nf_conntrack for the ipv6 defrag case
  in patch 7
- Removed unlikely() from testing skb->mono_delivery_time (Daniel)

bpf:
- Removed the skb->tstamp dance at ingress.  It depends on the bpf insn
  rewrite in patch 11 that returns 0 if skb->tstamp has the
  delivery_time.  This keeps backward compatibility with the existing
  tc-bpf@ingress (equivalent read semantics are sketched after this list).
- bpf_skb_set_delivery_time() will also allow dtime == 0 and
  dtime_type == BPF_SKB_DELIVERY_TIME_NONE as arguments
  in patch 12.
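
The patch 11 rewrite is roughly equivalent to the following read
semantics (kernel-side sketch, not the actual insn sequence):

  /* load of __sk_buff->tstamp at tc-bpf ingress */
  tstamp = skb->mono_delivery_time ? 0 : skb->tstamp;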
  
v4:
netdev:
- Pushed the skb_clear_delivery_time() from
  ip_local_deliver() and ip6_input() to
  ip_local_deliver_finish() and ip6_input_finish()
  to accommodate the ipvs forward path.
  This is the notable change in v4 on the netdev side.

    - Patch 3/8 first does the skb_clear_delivery_time() after
      sch_handle_ingress() in dev.c and this will make the
      tc-bpf forward path work via the bpf_redirect_*() helper.

    - The next patch 4/8 (new in v4) will then postpone the
      skb_clear_delivery_time() from dev.c to
      the ip_local_deliver_finish() and ip6_input_finish() after
      taking care of the tstamp usage in the ip defrag case.
      This will make the kernel forward path also work, e.g.
      the ip[6]_forward().

- Fixed a case missed in v3, where the skb->mono_delivery_time bit was
  not set when sending TCP rst/ack in some cases (e.g. from a ctl_sk).
  That case happens at ip_send_unicast_reply() and
  tcp_v6_send_response().  It is fixed in patch 1/8 (and
  then patch 3/8) in v4.

bpf:
- Added __sk_buff->delivery_time_type instead of adding
  __sk_buff->mono_delivery_time as in v3.  tc-bpf can stay with
  one __sk_buff->tstamp instead of having two 'time' fields
  where one is 0 and the other is not.
  tc-bpf can use the new __sk_buff->delivery_time_type to tell
  what is stored in __sk_buff->tstamp.
- The bpf_skb_set_delivery_time() helper is added to change
  __sk_buff->tstamp from a non-mono delivery_time to a
  mono delivery_time (a minimal usage sketch follows this list).
- Most of the convert_ctx_access() bpf insn rewrite in v3
  is gone, so no new rewrite is added for __sk_buff->tstamp.
  The only rewrite added is for reading the new
  __sk_buff->delivery_time_type.
- Added selftests, test_tc_dtime.c
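
A minimal tc-bpf sketch of the uapi above.  bpf_skb_set_delivery_time()
and BPF_SKB_DELIVERY_TIME_NONE are from this set; the exact helper
signature and the mono dtime_type name (assumed to be
BPF_SKB_DELIVERY_TIME_MONO here) are per patches 11 and 12:

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>

  SEC("tc")
  int dtime_prog(struct __sk_buff *skb)
  {
          /* delivery_time_type tells what __sk_buff->tstamp carries */
          if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_NONE)
                  /* e.g. set a mono EDT before bpf_redirect_*() */
                  bpf_skb_set_delivery_time(skb, bpf_ktime_get_ns(),
                                            BPF_SKB_DELIVERY_TIME_MONO);

          return TC_ACT_OK;
  }

  char _license[] SEC("license") = "GPL";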

v3:
- Feedback on v2 was that using shinfo(skb)->tx_flags could be racy.
- Considered reusing a few bits in skb->tstamp to represent
  different semantics; other than more code churn, it would break
  the bpf usecase which currently can write and then read back
  the skb->tstamp.
- Went back to the v1 idea of adding a bit to the skb and addressed
  the feedback on v1:
- Added one bit skb->mono_delivery_time to flag that
  the skb->tstamp has the mono delivery_time (EDT), instead
  of adding a bit to flag if the skb->tstamp has been forwarded or not.
- Instead of resetting the delivery_time back to the (rcv) timestamp
  during the recvmsg syscall, which may be too late and not useful,
  the delivery_time reset in v3 happens earlier, once the stack
  knows that the skb will be delivered locally.
- Handled the tapping@ingress case by af_packet
- No need to change the (rcv) timestamp to mono clock base as in v1.
  The added one bit to flag skb->mono_delivery_time is enough
  to keep the EDT delivery_time during forward.
- Added logic to the bpf side so that the existing bpf programs
  running at ingress can still get the (rcv) timestamp
  when reading __sk_buff->tstamp.  A new __sk_buff->mono_delivery_time
  is also added.  A test is still needed for this piece.

Martin KaFai Lau (13):
  net: Add skb->mono_delivery_time to distinguish mono delivery_time
    from (rcv) timestamp
  net: Add skb_clear_tstamp() to keep the mono delivery_time
  net: Handle delivery_time in skb->tstamp during network tapping with
    af_packet
  net: Clear mono_delivery_time bit in __skb_tstamp_tx()
  net: Set skb->mono_delivery_time and clear it after
    sch_handle_ingress()
  net: ip: Handle delivery_time in ip defrag
  net: ipv6: Handle delivery_time in ipv6 defrag
  net: ipv6: Get rcv timestamp if needed when handling hop-by-hop IOAM
    option
  net: Get rcv tstamp if needed in nfnetlink_{log, queue}.c
  net: Postpone skb_clear_delivery_time() until knowing the skb is
    delivered locally
  bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress
  bpf: Add __sk_buff->delivery_time_type and
    bpf_skb_set_skb_delivery_time()
  bpf: selftests: test skb->tstamp in redirect_neigh

 drivers/net/loopback.c                        |   2 +-
 include/linux/filter.h                        |   3 +-
 include/linux/skbuff.h                        |  74 ++-
 include/net/inet_frag.h                       |   2 +
 include/uapi/linux/bpf.h                      |  41 +-
 net/bridge/br_forward.c                       |   2 +-
 net/bridge/netfilter/nf_conntrack_bridge.c    |   5 +-
 net/core/dev.c                                |   8 +-
 net/core/filter.c                             | 178 ++++++-
 net/core/skbuff.c                             |   4 +-
 net/ieee802154/6lowpan/reassembly.c           |   1 +
 net/ipv4/inet_fragment.c                      |   1 +
 net/ipv4/ip_forward.c                         |   2 +-
 net/ipv4/ip_fragment.c                        |   1 +
 net/ipv4/ip_input.c                           |   1 +
 net/ipv4/ip_output.c                          |   6 +-
 net/ipv4/tcp_output.c                         |  16 +-
 net/ipv6/ioam6.c                              |  19 +-
 net/ipv6/ip6_input.c                          |   1 +
 net/ipv6/ip6_output.c                         |   7 +-
 net/ipv6/netfilter.c                          |   5 +-
 net/ipv6/netfilter/nf_conntrack_reasm.c       |   1 +
 net/ipv6/reassembly.c                         |   1 +
 net/ipv6/tcp_ipv6.c                           |   2 +-
 net/netfilter/ipvs/ip_vs_xmit.c               |   6 +-
 net/netfilter/nf_dup_netdev.c                 |   2 +-
 net/netfilter/nf_flow_table_ip.c              |   4 +-
 net/netfilter/nfnetlink_log.c                 |   6 +-
 net/netfilter/nfnetlink_queue.c               |   8 +-
 net/netfilter/nft_fwd_netdev.c                |   2 +-
 net/openvswitch/vport.c                       |   2 +-
 net/packet/af_packet.c                        |   4 +-
 net/sched/act_bpf.c                           |   2 +
 net/sched/cls_bpf.c                           |   2 +
 net/xfrm/xfrm_interface.c                     |   2 +-
 tools/include/uapi/linux/bpf.h                |  41 +-
 .../selftests/bpf/prog_tests/tc_redirect.c    | 434 ++++++++++++++++++
 .../selftests/bpf/progs/test_tc_dtime.c       | 349 ++++++++++++++
 38 files changed, 1174 insertions(+), 73 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_dtime.c

-- 
2.30.2



* [PATCH v6 net-next 01/13] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
From: Martin KaFai Lau @ 2022-03-02 19:55 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

skb->tstamp was first used as the (rcv) timestamp.
The major usage is to report it to the user (e.g. SO_TIMESTAMP).

Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
during egress and is used by the qdisc (e.g. sch_fq) to decide when
the skb can be passed to the dev.

Currently, there is no way to tell whether skb->tstamp has the (rcv)
timestamp or the delivery_time, so it is always reset to 0 whenever the
skb is forwarded between egress and ingress.

While it makes sense to always clear the (rcv) timestamp in skb->tstamp
to avoid confusing sch_fq, which expects the delivery_time, it is a
performance issue [0] to clear the delivery_time if the skb finally
egresses to a fq@phy-dev.  For example, when forwarding from egress to
ingress and then finally back to egress:

            tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                     ^              ^
                                     reset          reset

This patch adds one bit, skb->mono_delivery_time, to flag that skb->tstamp
is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.

The current use case is to keep the TCP mono delivery_time (EDT) and
to be used with sch_fq.  A later patch will also allow tc-bpf@ingress
to read and change the mono delivery_time.

In the future, another bit (e.g. skb->user_delivery_time) can be added
for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.

[ This patch is a prep work.  The following patches will
  get the other parts of the stack ready first.  Then another patch
  after that will finally set the skb->mono_delivery_time. ]

A skb_set_delivery_time() function is added.  It is used by tcp_output.c
and during ip[6] fragmentation to assign the delivery_time to
skb->tstamp and also to set skb->mono_delivery_time.
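
For reference, the helper below only stores kt in this patch; once the
bit setting is enabled in patch 5 of this set, it becomes (sketch
matching patch 5):

  static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
                                           bool mono)
  {
          skb->tstamp = kt;
          skb->mono_delivery_time = kt && mono;
  }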

A note on the change in ip_send_unicast_reply() in ip_output.c.
It is only used by TCP to send reset/ack out of a ctl_sk.
Like the new skb_set_delivery_time(), this patch sets
the skb->mono_delivery_time to 0 for now as a
placeholder.  It will be enabled in a later patch.
A similar case in tcp_ipv6 is handled with
skb_set_delivery_time() in tcp_v6_send_response().

[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h                     | 13 +++++++++++++
 net/bridge/netfilter/nf_conntrack_bridge.c |  5 +++--
 net/ipv4/ip_output.c                       |  7 +++++--
 net/ipv4/tcp_output.c                      | 16 +++++++++-------
 net/ipv6/ip6_output.c                      |  5 +++--
 net/ipv6/netfilter.c                       |  5 +++--
 net/ipv6/tcp_ipv6.c                        |  2 +-
 7 files changed, 37 insertions(+), 16 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d67941f78b92..803ffa63dea6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -795,6 +795,10 @@ typedef unsigned char *sk_buff_data_t;
  *	@dst_pending_confirm: need to confirm neighbour
  *	@decrypted: Decrypted SKB
  *	@slow_gro: state present at GRO time, slower prepare step required
+ *	@mono_delivery_time: When set, skb->tstamp has the
+ *		delivery_time in mono clock base (i.e. EDT).  Otherwise, the
+ *		skb->tstamp has the (rcv) timestamp at ingress and
+ *		delivery_time at egress.
  *	@napi_id: id of the NAPI struct this skb came from
  *	@sender_cpu: (aka @napi_id) source CPU in XPS
  *	@secmark: security marking
@@ -965,6 +969,7 @@ struct sk_buff {
 	__u8			decrypted:1;
 #endif
 	__u8			slow_gro:1;
+	__u8			mono_delivery_time:1;
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
@@ -3983,6 +3988,14 @@ static inline ktime_t net_timedelta(ktime_t t)
 	return ktime_sub(ktime_get_real(), t);
 }
 
+static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
+					 bool mono)
+{
+	skb->tstamp = kt;
+	/* Setting mono_delivery_time will be enabled later */
+	skb->mono_delivery_time = 0;
+}
+
 static inline u8 skb_metadata_len(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->meta_len;
diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c
index fdbed3158555..ebfb2a5c59e4 100644
--- a/net/bridge/netfilter/nf_conntrack_bridge.c
+++ b/net/bridge/netfilter/nf_conntrack_bridge.c
@@ -32,6 +32,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
 					   struct sk_buff *))
 {
 	int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	unsigned int hlen, ll_rs, mtu;
 	ktime_t tstamp = skb->tstamp;
 	struct ip_frag_state state;
@@ -81,7 +82,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
 			if (iter.frag)
 				ip_fraglist_prepare(skb, &iter);
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, data, skb);
 			if (err || !iter.frag)
 				break;
@@ -112,7 +113,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
 			goto blackhole;
 		}
 
-		skb2->tstamp = tstamp;
+		skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
 		err = output(net, sk, data, skb2);
 		if (err)
 			goto blackhole;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6df3545c891d..a9588e0c82c5 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -761,6 +761,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 {
 	struct iphdr *iph;
 	struct sk_buff *skb2;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	struct rtable *rt = skb_rtable(skb);
 	unsigned int mtu, hlen, ll_rs;
 	struct ip_fraglist_iter iter;
@@ -852,7 +853,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 				}
 			}
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, skb);
 
 			if (!err)
@@ -908,7 +909,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		/*
 		 *	Put this fragment into the sending queue.
 		 */
-		skb2->tstamp = tstamp;
+		skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
 		err = output(net, sk, skb2);
 		if (err)
 			goto fail;
@@ -1727,6 +1728,8 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 			  arg->csumoffset) = csum_fold(csum_add(nskb->csum,
 								arg->csum));
 		nskb->ip_summed = CHECKSUM_NONE;
+		/* Setting mono_delivery_time will be enabled later */
+		nskb->mono_delivery_time = 0;
 		ip_push_pending_frames(sk, &fl4);
 	}
 out:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e76bf1e9251e..2319531267c6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1253,7 +1253,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	tp = tcp_sk(sk);
 	prior_wstamp = tp->tcp_wstamp_ns;
 	tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
-	skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
+	skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true);
 	if (clone_it) {
 		oskb = skb;
 
@@ -1589,7 +1589,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 
 	skb_split(skb, buff, len);
 
-	buff->tstamp = skb->tstamp;
+	skb_set_delivery_time(buff, skb->tstamp, true);
 	tcp_fragment_tstamp(skb, buff);
 
 	old_factor = tcp_skb_pcount(skb);
@@ -2616,7 +2616,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 		if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
 			/* "skb_mstamp_ns" is used as a start point for the retransmit timer */
-			skb->skb_mstamp_ns = tp->tcp_wstamp_ns = tp->tcp_clock_cache;
+			tp->tcp_wstamp_ns = tp->tcp_clock_cache;
+			skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true);
 			list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
 			tcp_init_tso_segs(skb, mss_now);
 			goto repair; /* Skip network transmission */
@@ -3541,11 +3542,12 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	now = tcp_clock_ns();
 #ifdef CONFIG_SYN_COOKIES
 	if (unlikely(synack_type == TCP_SYNACK_COOKIE && ireq->tstamp_ok))
-		skb->skb_mstamp_ns = cookie_init_timestamp(req, now);
+		skb_set_delivery_time(skb, cookie_init_timestamp(req, now),
+				      true);
 	else
 #endif
 	{
-		skb->skb_mstamp_ns = now;
+		skb_set_delivery_time(skb, now, true);
 		if (!tcp_rsk(req)->snt_synack) /* Timestamp first SYNACK */
 			tcp_rsk(req)->snt_synack = tcp_skb_timestamp_us(skb);
 	}
@@ -3594,7 +3596,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb,
 				synack_type, &opts);
 
-	skb->skb_mstamp_ns = now;
+	skb_set_delivery_time(skb, now, true);
 	tcp_add_tx_delay(skb, tp);
 
 	return skb;
@@ -3771,7 +3773,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 
 	err = tcp_transmit_skb(sk, syn_data, 1, sk->sk_allocation);
 
-	syn->skb_mstamp_ns = syn_data->skb_mstamp_ns;
+	skb_set_delivery_time(syn, syn_data->skb_mstamp_ns, true);
 
 	/* Now full SYN+DATA was cloned and sent (or not),
 	 * remove the SYN from the original skb (syn_data)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index c5edc86b18bd..dad4e3d0492e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -813,6 +813,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	struct rt6_info *rt = (struct rt6_info *)skb_dst(skb);
 	struct ipv6_pinfo *np = skb->sk && !dev_recursion_level() ?
 				inet6_sk(skb->sk) : NULL;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	struct ip6_frag_state state;
 	unsigned int mtu, hlen, nexthdr_offset;
 	ktime_t tstamp = skb->tstamp;
@@ -903,7 +904,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			if (iter.frag)
 				ip6_fraglist_prepare(skb, &iter);
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, skb);
 			if (!err)
 				IP6_INC_STATS(net, ip6_dst_idev(&rt->dst),
@@ -962,7 +963,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		/*
 		 *	Put this fragment into the sending queue.
 		 */
-		frag->tstamp = tstamp;
+		skb_set_delivery_time(frag, tstamp, mono_delivery_time);
 		err = output(net, sk, frag);
 		if (err)
 			goto fail;
diff --git a/net/ipv6/netfilter.c b/net/ipv6/netfilter.c
index 6ab710b5a1a8..1da332450d98 100644
--- a/net/ipv6/netfilter.c
+++ b/net/ipv6/netfilter.c
@@ -121,6 +121,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 				  struct sk_buff *))
 {
 	int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	ktime_t tstamp = skb->tstamp;
 	struct ip6_frag_state state;
 	u8 *prevhdr, nexthdr = 0;
@@ -186,7 +187,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			if (iter.frag)
 				ip6_fraglist_prepare(skb, &iter);
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, data, skb);
 			if (err || !iter.frag)
 				break;
@@ -219,7 +220,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			goto blackhole;
 		}
 
-		skb2->tstamp = tstamp;
+		skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
 		err = output(net, sk, data, skb2);
 		if (err)
 			goto blackhole;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index e98af869ff3a..cb2bb7d2e907 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -940,7 +940,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
 		} else {
 			mark = sk->sk_mark;
 		}
-		buff->tstamp = tcp_transmit_time(sk);
+		skb_set_delivery_time(buff, tcp_transmit_time(sk), true);
 	}
 	fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark) ?: mark;
 	fl6.fl6_dport = t1->dest;
-- 
2.30.2



* [PATCH v6 net-next 02/13] net: Add skb_clear_tstamp() to keep the mono delivery_time
From: Martin KaFai Lau @ 2022-03-02 19:55 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.

If skb->tstamp has the mono delivery_time, clearing it can hurt
performance when the skb finally transmits out through fq@phy-dev.

The earlier patch added a skb->mono_delivery_time bit to
flag that skb->tstamp carries the mono delivery_time.

This patch adds a skb_clear_tstamp() helper which keeps
the mono delivery_time and clears everything else.

The delivery_time clearing will be postponed until the stack knows the
skb will be delivered locally.  It will be done in a later patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 drivers/net/loopback.c           |  2 +-
 include/linux/skbuff.h           | 10 +++++++++-
 net/bridge/br_forward.c          |  2 +-
 net/core/filter.c                |  6 +++---
 net/core/skbuff.c                |  2 +-
 net/ipv4/ip_forward.c            |  2 +-
 net/ipv6/ip6_output.c            |  2 +-
 net/netfilter/ipvs/ip_vs_xmit.c  |  6 +++---
 net/netfilter/nf_dup_netdev.c    |  2 +-
 net/netfilter/nf_flow_table_ip.c |  4 ++--
 net/netfilter/nft_fwd_netdev.c   |  2 +-
 net/openvswitch/vport.c          |  2 +-
 net/xfrm/xfrm_interface.c        |  2 +-
 13 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index d05f86fe78c9..720394c0639b 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -74,7 +74,7 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
 	skb_tx_timestamp(skb);
 
 	/* do not fool net_timestamp_check() with various clock bases */
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	skb_orphan(skb);
 
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 803ffa63dea6..27a28920e7b3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3996,6 +3996,14 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
 	skb->mono_delivery_time = 0;
 }
 
+static inline void skb_clear_tstamp(struct sk_buff *skb)
+{
+	if (skb->mono_delivery_time)
+		return;
+
+	skb->tstamp = 0;
+}
+
 static inline u8 skb_metadata_len(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->meta_len;
@@ -4852,7 +4860,7 @@ static inline void skb_set_redirected(struct sk_buff *skb, bool from_ingress)
 #ifdef CONFIG_NET_REDIRECT
 	skb->from_ingress = from_ingress;
 	if (skb->from_ingress)
-		skb->tstamp = 0;
+		skb_clear_tstamp(skb);
 #endif
 }
 
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index ec646656dbf1..02bb620d3b8d 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -62,7 +62,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);
 
 int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
 		       net, sk, skb, NULL, skb->dev,
 		       br_dev_queue_push_xmit);
diff --git a/net/core/filter.c b/net/core/filter.c
index 65869fd510e8..cfcf9b4d1ec2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2107,7 +2107,7 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	dev_xmit_recursion_inc();
 	ret = dev_queue_xmit(skb);
@@ -2176,7 +2176,7 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
 		skb = skb_expand_head(skb, hh_len);
@@ -2274,7 +2274,7 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb,
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
 		skb = skb_expand_head(skb, hh_len);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b32c5d782fe1..9abb0028309f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5381,7 +5381,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 
 	ipvs_reset(skb);
 	skb->mark = 0;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);
 
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 00ec819f949b..92ba3350274b 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -79,7 +79,7 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
 
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	return dst_output(net, sk, skb);
 }
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index dad4e3d0492e..50db9b20d746 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -440,7 +440,7 @@ static inline int ip6_forward_finish(struct net *net, struct sock *sk,
 	}
 #endif
 
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	return dst_output(net, sk, skb);
 }
 
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index d2e5a8f644b8..029171379884 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -610,7 +610,7 @@ static inline int ip_vs_tunnel_xmit_prepare(struct sk_buff *skb,
 		nf_reset_ct(skb);
 		skb_forward_csum(skb);
 		if (skb->dev)
-			skb->tstamp = 0;
+			skb_clear_tstamp(skb);
 	}
 	return ret;
 }
@@ -652,7 +652,7 @@ static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb,
 	if (!local) {
 		skb_forward_csum(skb);
 		if (skb->dev)
-			skb->tstamp = 0;
+			skb_clear_tstamp(skb);
 		NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb,
 			NULL, skb_dst(skb)->dev, dst_output);
 	} else
@@ -674,7 +674,7 @@ static inline int ip_vs_send_or_cont(int pf, struct sk_buff *skb,
 		ip_vs_drop_early_demux_sk(skb);
 		skb_forward_csum(skb);
 		if (skb->dev)
-			skb->tstamp = 0;
+			skb_clear_tstamp(skb);
 		NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb,
 			NULL, skb_dst(skb)->dev, dst_output);
 	} else
diff --git a/net/netfilter/nf_dup_netdev.c b/net/netfilter/nf_dup_netdev.c
index a579e59ee5c5..7873bd1389c3 100644
--- a/net/netfilter/nf_dup_netdev.c
+++ b/net/netfilter/nf_dup_netdev.c
@@ -19,7 +19,7 @@ static void nf_do_netdev_egress(struct sk_buff *skb, struct net_device *dev)
 		skb_push(skb, skb->mac_len);
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	dev_queue_xmit(skb);
 }
 
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 889cf88d3dba..f1d387129f02 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -376,7 +376,7 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 	nf_flow_nat_ip(flow, skb, thoff, dir, iph);
 
 	ip_decrease_ttl(iph);
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	if (flow_table->flags & NF_FLOWTABLE_COUNTER)
 		nf_ct_acct_update(flow->ct, tuplehash->tuple.dir, skb->len);
@@ -611,7 +611,7 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 	nf_flow_nat_ipv6(flow, skb, dir, ip6h);
 
 	ip6h->hop_limit--;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	if (flow_table->flags & NF_FLOWTABLE_COUNTER)
 		nf_ct_acct_update(flow->ct, tuplehash->tuple.dir, skb->len);
diff --git a/net/netfilter/nft_fwd_netdev.c b/net/netfilter/nft_fwd_netdev.c
index 619e394a91de..08e7a289738e 100644
--- a/net/netfilter/nft_fwd_netdev.c
+++ b/net/netfilter/nft_fwd_netdev.c
@@ -145,7 +145,7 @@ static void nft_fwd_neigh_eval(const struct nft_expr *expr,
 		return;
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	neigh_xmit(neigh_table, dev, addr, skb);
 out:
 	regs->verdict.code = verdict;
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index cf2ce5812489..82a74f998966 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -507,7 +507,7 @@ void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto)
 	}
 
 	skb->dev = vport->dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	vport->ops->send(skb);
 	return;
 
diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 57448fc519fc..4991e99ced9a 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -190,7 +190,7 @@ static void xfrmi_dev_uninit(struct net_device *dev)
 
 static void xfrmi_scrub_packet(struct sk_buff *skb, bool xnet)
 {
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	skb->pkt_type = PACKET_HOST;
 	skb->skb_iif = 0;
 	skb->ignore_df = 0;
-- 
2.30.2



* [PATCH v6 net-next 03/13] net: Handle delivery_time in skb->tstamp during network tapping with af_packet
From: Martin KaFai Lau @ 2022-03-02 19:55 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

A later patch will set the skb->mono_delivery_time to flag that skb->tstamp
is used as the mono delivery_time (EDT) instead of the (rcv) timestamp.
skb_clear_tstamp() will then keep this delivery_time during forwarding.

This patch makes the network tapping (with af_packet) handle
the delivery_time stored in skb->tstamp.

Regardless of tapping at ingress or egress, the tapped skb is
received by the af_packet socket, so it is ingress to the af_packet
socket, which expects the (rcv) timestamp.

When tapping at egress, dev_queue_xmit_nit() is used.  It already
expects that skb->tstamp may have the delivery_time, so it does
skb_clone()+net_timestamp_set() to ensure the cloned skb has
the (rcv) timestamp before passing it to the af_packet sk.
This patch only adds clearing of the skb->mono_delivery_time
bit in net_timestamp_set().

When tapping at ingress, it currently expects that skb->tstamp is either 0
or the (rcv) timestamp.  Meaning, the tapping-at-ingress path
already expects that skb->tstamp could be 0 and will get
the (rcv) timestamp from ktime_get_real() when needed.

There are two cases for tapping at ingress:

One case is when af_packet queues the skb to its sk_receive_queue.
The skb is either not shared or a new clone is created.  The newly
added skb_clear_delivery_time() is called to clear the
delivery_time (if any) and set the (rcv) timestamp if
needed before the skb is queued to the sk_receive_queue.

In the other case, the ingress skb is directly copied to the rx_ring
and tpacket_get_timestamp() is used to get the (rcv) timestamp.
The newly added skb_tstamp() is used in tpacket_get_timestamp()
to check the skb->mono_delivery_time bit before returning skb->tstamp.
As mentioned earlier, the tapping@ingress path already expects that
the skb may not have the (rcv) timestamp (because no sk has asked
for it) and handles this case by directly calling ktime_get_real().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h | 24 ++++++++++++++++++++++++
 net/core/dev.c         |  4 +++-
 net/packet/af_packet.c |  4 +++-
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 27a28920e7b3..7e2d796ece80 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3996,6 +3996,22 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
 	skb->mono_delivery_time = 0;
 }
 
+DECLARE_STATIC_KEY_FALSE(netstamp_needed_key);
+
+/* It is used in the ingress path to clear the delivery_time.
+ * If needed, set the skb->tstamp to the (rcv) timestamp.
+ */
+static inline void skb_clear_delivery_time(struct sk_buff *skb)
+{
+	if (skb->mono_delivery_time) {
+		skb->mono_delivery_time = 0;
+		if (static_branch_unlikely(&netstamp_needed_key))
+			skb->tstamp = ktime_get_real();
+		else
+			skb->tstamp = 0;
+	}
+}
+
 static inline void skb_clear_tstamp(struct sk_buff *skb)
 {
 	if (skb->mono_delivery_time)
@@ -4004,6 +4020,14 @@ static inline void skb_clear_tstamp(struct sk_buff *skb)
 	skb->tstamp = 0;
 }
 
+static inline ktime_t skb_tstamp(const struct sk_buff *skb)
+{
+	if (skb->mono_delivery_time)
+		return 0;
+
+	return skb->tstamp;
+}
+
 static inline u8 skb_metadata_len(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->meta_len;
diff --git a/net/core/dev.c b/net/core/dev.c
index 2d6771075720..6d81b5a7ef3f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2020,7 +2020,8 @@ void net_dec_egress_queue(void)
 EXPORT_SYMBOL_GPL(net_dec_egress_queue);
 #endif
 
-static DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
+DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
+EXPORT_SYMBOL(netstamp_needed_key);
 #ifdef CONFIG_JUMP_LABEL
 static atomic_t netstamp_needed_deferred;
 static atomic_t netstamp_wanted;
@@ -2081,6 +2082,7 @@ EXPORT_SYMBOL(net_disable_timestamp);
 static inline void net_timestamp_set(struct sk_buff *skb)
 {
 	skb->tstamp = 0;
+	skb->mono_delivery_time = 0;
 	if (static_branch_unlikely(&netstamp_needed_key))
 		__net_timestamp(skb);
 }
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index ab87f22cc7ec..1b93ce1a5600 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -460,7 +460,7 @@ static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
 		return TP_STATUS_TS_RAW_HARDWARE;
 
 	if ((flags & SOF_TIMESTAMPING_SOFTWARE) &&
-	    ktime_to_timespec64_cond(skb->tstamp, ts))
+	    ktime_to_timespec64_cond(skb_tstamp(skb), ts))
 		return TP_STATUS_TS_SOFTWARE;
 
 	return 0;
@@ -2199,6 +2199,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
 	spin_lock(&sk->sk_receive_queue.lock);
 	po->stats.stats1.tp_packets++;
 	sock_skb_set_dropcount(sk, skb);
+	skb_clear_delivery_time(skb);
 	__skb_queue_tail(&sk->sk_receive_queue, skb);
 	spin_unlock(&sk->sk_receive_queue.lock);
 	sk->sk_data_ready(sk);
@@ -2377,6 +2378,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	po->stats.stats1.tp_packets++;
 	if (copy_skb) {
 		status |= TP_STATUS_COPY;
+		skb_clear_delivery_time(copy_skb);
 		__skb_queue_tail(&sk->sk_receive_queue, copy_skb);
 	}
 	spin_unlock(&sk->sk_receive_queue.lock);
-- 
2.30.2



* [PATCH v6 net-next 04/13] net: Clear mono_delivery_time bit in __skb_tstamp_tx()
From: Martin KaFai Lau @ 2022-03-02 19:55 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

__skb_tstamp_tx() may clone the egress skb and queue the clone to
the sk_error_queue.  The outgoing skb may have the mono delivery_time
while the (rcv) timestamp is expected for the clone, so the
skb->mono_delivery_time bit needs to be cleared in the clone.

This patch adds the skb->mono_delivery_time clearing to the existing
__net_timestamp() and uses it in __skb_tstamp_tx().
The __net_timestamp() fast-path usage in dev.c is changed to directly
call ktime_get_real() since the mono_delivery_time bit is not set at
that point.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h | 1 +
 net/core/dev.c         | 4 ++--
 net/core/skbuff.c      | 2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7e2d796ece80..8e8a4af4f9e2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3981,6 +3981,7 @@ static inline void skb_get_new_timestampns(const struct sk_buff *skb,
 static inline void __net_timestamp(struct sk_buff *skb)
 {
 	skb->tstamp = ktime_get_real();
+	skb->mono_delivery_time = 0;
 }
 
 static inline ktime_t net_timedelta(ktime_t t)
diff --git a/net/core/dev.c b/net/core/dev.c
index 6d81b5a7ef3f..3ff686cc8c84 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2084,13 +2084,13 @@ static inline void net_timestamp_set(struct sk_buff *skb)
 	skb->tstamp = 0;
 	skb->mono_delivery_time = 0;
 	if (static_branch_unlikely(&netstamp_needed_key))
-		__net_timestamp(skb);
+		skb->tstamp = ktime_get_real();
 }
 
 #define net_timestamp_check(COND, SKB)				\
 	if (static_branch_unlikely(&netstamp_needed_key)) {	\
 		if ((COND) && !(SKB)->tstamp)			\
-			__net_timestamp(SKB);			\
+			(SKB)->tstamp = ktime_get_real();	\
 	}							\
 
 bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9abb0028309f..e5082836295b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4851,7 +4851,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 	if (hwtstamps)
 		*skb_hwtstamps(skb) = *hwtstamps;
 	else
-		skb->tstamp = ktime_get_real();
+		__net_timestamp(skb);
 
 	__skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
 }
-- 
2.30.2



* [PATCH v6 net-next 05/13] net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()
From: Martin KaFai Lau @ 2022-03-02 19:55 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

The previous patches handled the delivery_time before sch_handle_ingress().

This patch can now set the skb->mono_delivery_time to flag that skb->tstamp
is used as the mono delivery_time (EDT) instead of the (rcv) timestamp
and also clear it with skb_clear_delivery_time() after
sch_handle_ingress().  This will make bpf_redirect_*()
keep the mono delivery_time so that it can be used by a qdisc (fq) of
the egressing interface.

A later patch will postpone the skb_clear_delivery_time() until the
stack learns that the skb is being delivered locally, and that will
make other kernel forwarding paths (ip[6]_forward) able to keep
the delivery_time as well.  Thus, like the previous patches on using
the skb->mono_delivery_time bit, calling skb_clear_delivery_time()
is not limited to within CONFIG_NET_INGRESS, to avoid too much code
churn in this set.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h | 3 +--
 net/core/dev.c         | 8 ++++++--
 net/ipv4/ip_output.c   | 3 +--
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8e8a4af4f9e2..0f5fd53059cd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3993,8 +3993,7 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
 					 bool mono)
 {
 	skb->tstamp = kt;
-	/* Setting mono_delivery_time will be enabled later */
-	skb->mono_delivery_time = 0;
+	skb->mono_delivery_time = kt && mono;
 }
 
 DECLARE_STATIC_KEY_FALSE(netstamp_needed_key);
diff --git a/net/core/dev.c b/net/core/dev.c
index 3ff686cc8c84..0fc02cf32476 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5193,8 +5193,10 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
 			goto out;
 	}
 
-	if (skb_skip_tc_classify(skb))
+	if (skb_skip_tc_classify(skb)) {
+		skb_clear_delivery_time(skb);
 		goto skip_classify;
+	}
 
 	if (pfmemalloc)
 		goto skip_taps;
@@ -5223,12 +5225,14 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
 			goto another_round;
 		if (!skb)
 			goto out;
+		skb_clear_delivery_time(skb);
 
 		nf_skip_egress(skb, false);
 		if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0)
 			goto out;
-	}
+	} else
 #endif
+		skb_clear_delivery_time(skb);
 	skb_reset_redirect(skb);
 skip_classify:
 	if (pfmemalloc && !skb_pfmemalloc_protocol(skb))
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index a9588e0c82c5..00b4bf26fd93 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1728,8 +1728,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 			  arg->csumoffset) = csum_fold(csum_add(nskb->csum,
 								arg->csum));
 		nskb->ip_summed = CHECKSUM_NONE;
-		/* Setting mono_delivery_time will be enabled later */
-		nskb->mono_delivery_time = 0;
+		nskb->mono_delivery_time = !!transmit_time;
 		ip_push_pending_frames(sk, &fl4);
 	}
 out:
-- 
2.30.2



* [PATCH v6 net-next 06/13] net: ip: Handle delivery_time in ip defrag
From: Martin KaFai Lau @ 2022-03-02 19:55 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

A later patch will postpone the delivery_time clearing until the stack
knows the skb is being delivered locally.  That will allow other kernel
forwarding paths (e.g. ip[6]_forward) to keep the delivery_time as well.

An earlier attempt was to do skb_clear_delivery_time() in
ip_local_deliver() and ip6_input().  The discussion [0] requested
to move it one step later into ip_local_deliver_finish()
and ip6_input_finish() so that the delivery_time can be kept
for the ip_vs forwarding path also.

To do that, this patch also needs to take care of the (rcv) timestamp
usecase in ip_is_fragment().  The defrag path needs to expect a
delivery_time in skb->tstamp, so it needs to save the mono_delivery_time
bit in inet_frag_queue such that the delivery_time (if any) can be
restored in the final defragmented skb.

[Note that this will only happen when the locally generated skb is looping
 from egress to ingress over a virtual interface (e.g. veth, loopback...):
 skb->tstamp may have the delivery_time before it is known that it will
 be delivered locally and received by another sk.]

[0]: https://lore.kernel.org/netdev/ca728d81-80e8-3767-d5e-d44f6ad96e43@ssi.bg/

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/inet_frag.h  | 2 ++
 net/ipv4/inet_fragment.c | 1 +
 net/ipv4/ip_fragment.c   | 1 +
 3 files changed, 4 insertions(+)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 63540be0fc34..911ad930867d 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -70,6 +70,7 @@ struct frag_v6_compare_key {
  * @stamp: timestamp of the last received fragment
  * @len: total length of the original datagram
  * @meat: length of received fragments so far
+ * @mono_delivery_time: stamp has a mono delivery time (EDT)
  * @flags: fragment queue flags
  * @max_size: maximum received fragment size
  * @fqdir: pointer to struct fqdir
@@ -90,6 +91,7 @@ struct inet_frag_queue {
 	ktime_t			stamp;
 	int			len;
 	int			meat;
+	u8			mono_delivery_time;
 	__u8			flags;
 	u16			max_size;
 	struct fqdir		*fqdir;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 341096807100..63948f6aeca0 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -572,6 +572,7 @@ void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
 	skb_mark_not_on_list(head);
 	head->prev = NULL;
 	head->tstamp = q->stamp;
+	head->mono_delivery_time = q->mono_delivery_time;
 }
 EXPORT_SYMBOL(inet_frag_reasm_finish);
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index fad803d2d711..fb153569889e 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -349,6 +349,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		qp->iif = dev->ifindex;
 
 	qp->q.stamp = skb->tstamp;
+	qp->q.mono_delivery_time = skb->mono_delivery_time;
 	qp->q.meat += skb->len;
 	qp->ecn |= ecn;
 	add_frag_mem_limit(qp->q.fqdir, skb->truesize);
-- 
2.30.2



* [PATCH v6 net-next 07/13] net: ipv6: Handle delivery_time in ipv6 defrag
From: Martin KaFai Lau @ 2022-03-02 19:56 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

A later patch will postpone the delivery_time clearing until the stack
knows the skb is being delivered locally (i.e. calling
skb_clear_delivery_time() at ip_local_deliver_finish() for IPv4
and at ip6_input_finish() for IPv6).  That will allow other kernel
forwarding paths (e.g. ip[6]_forward) to keep the delivery_time as well.

Very similar IPv6 defrag code has been duplicated in
multiple places: regular IPv6, nf_conntrack, and 6lowpan.

Unlike the IPv4 defrag, which is done before ip_local_deliver_finish(),
the regular IPv6 defrag is done after ip6_input_finish().
Thus, no change should be needed in the regular IPv6 defrag
logic because skb_clear_delivery_time() should already have been called.

6lowpan also does not need special handling on delivery_time
because it is a non-inet packet_type.

However, nf_conntrack has a case in NF_INET_PRE_ROUTING that needs
to do the IPv6 defrag earlier.  Thus, it needs to save the
mono_delivery_time bit in the inet_frag_queue, similar
to how it is handled in the previous patch for the IPv4 defrag.

This patch chooses to do it consistently and stores the mono_delivery_time
in the inet_frag_queue for all cases, such that it will be easier
for the future refactoring effort on the IPv6 reasm code.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ieee802154/6lowpan/reassembly.c     | 1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c | 1 +
 net/ipv6/reassembly.c                   | 1 +
 3 files changed, 3 insertions(+)

diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index be6f06adefe0..a91283d1e5bf 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -130,6 +130,7 @@ static int lowpan_frag_queue(struct lowpan_frag_queue *fq,
 		goto err;
 
 	fq->q.stamp = skb->tstamp;
+	fq->q.mono_delivery_time = skb->mono_delivery_time;
 	if (frag_type == LOWPAN_DISPATCH_FRAG1)
 		fq->q.flags |= INET_FRAG_FIRST_IN;
 
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 5c47be29b9ee..7dd3629dd19e 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -264,6 +264,7 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
 		fq->iif = dev->ifindex;
 
 	fq->q.stamp = skb->tstamp;
+	fq->q.mono_delivery_time = skb->mono_delivery_time;
 	fq->q.meat += skb->len;
 	fq->ecn |= ecn;
 	if (payload_len > fq->q.max_size)
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 28e44782c94d..ff866f2a879e 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -194,6 +194,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
 		fq->iif = dev->ifindex;
 
 	fq->q.stamp = skb->tstamp;
+	fq->q.mono_delivery_time = skb->mono_delivery_time;
 	fq->q.meat += skb->len;
 	fq->ecn |= ecn;
 	add_frag_mem_limit(fq->q.fqdir, skb->truesize);
-- 
2.30.2



* [PATCH v6 net-next 08/13] net: ipv6: Get rcv timestamp if needed when handling hop-by-hop IOAM option
From: Martin KaFai Lau @ 2022-03-02 19:56 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

IOAM is a hop-by-hop option with a temporary IANA allocation (49).
Since it is hop-by-hop, it is processed before the input routing decision.
One of the traced data fields is the (rcv) timestamp.

When the locally generated skb is looping from egress to ingress over
a virtual interface (e.g. veth, loopback...), skb->tstamp may have the
delivery time before it is known that it will be delivered locally
and received by another sk.

Like the handling of network tapping (tcpdump) in the earlier patch,
this patch gets the timestamp if needed without overwriting the
delivery_time in skb->tstamp.  skb_tstamp_cond() is added to do the
ktime_get_real() with an extra cond arg checked on top of the
netstamp_needed_key static key.  skb_tstamp_cond() will also be used in
a later patch, and it needs the netstamp_needed_key check.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h | 11 +++++++++++
 net/ipv6/ioam6.c       | 19 +++++++++----------
 2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0f5fd53059cd..4b5b926a81f2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4028,6 +4028,17 @@ static inline ktime_t skb_tstamp(const struct sk_buff *skb)
 	return skb->tstamp;
 }
 
+static inline ktime_t skb_tstamp_cond(const struct sk_buff *skb, bool cond)
+{
+	if (!skb->mono_delivery_time && skb->tstamp)
+		return skb->tstamp;
+
+	if (static_branch_unlikely(&netstamp_needed_key) || cond)
+		return ktime_get_real();
+
+	return 0;
+}
+
 static inline u8 skb_metadata_len(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->meta_len;
diff --git a/net/ipv6/ioam6.c b/net/ipv6/ioam6.c
index e159eb4328a8..1098131ed90c 100644
--- a/net/ipv6/ioam6.c
+++ b/net/ipv6/ioam6.c
@@ -635,7 +635,8 @@ static void __ioam6_fill_trace_data(struct sk_buff *skb,
 				    struct ioam6_schema *sc,
 				    u8 sclen, bool is_input)
 {
-	struct __kernel_sock_timeval ts;
+	struct timespec64 ts;
+	ktime_t tstamp;
 	u64 raw64;
 	u32 raw32;
 	u16 raw16;
@@ -680,10 +681,9 @@ static void __ioam6_fill_trace_data(struct sk_buff *skb,
 		if (!skb->dev) {
 			*(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
 		} else {
-			if (!skb->tstamp)
-				__net_timestamp(skb);
+			tstamp = skb_tstamp_cond(skb, true);
+			ts = ktime_to_timespec64(tstamp);
 
-			skb_get_new_timestamp(skb, &ts);
 			*(__be32 *)data = cpu_to_be32((u32)ts.tv_sec);
 		}
 		data += sizeof(__be32);
@@ -694,13 +694,12 @@ static void __ioam6_fill_trace_data(struct sk_buff *skb,
 		if (!skb->dev) {
 			*(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
 		} else {
-			if (!skb->tstamp)
-				__net_timestamp(skb);
+			if (!trace->type.bit2) {
+				tstamp = skb_tstamp_cond(skb, true);
+				ts = ktime_to_timespec64(tstamp);
+			}
 
-			if (!trace->type.bit2)
-				skb_get_new_timestamp(skb, &ts);
-
-			*(__be32 *)data = cpu_to_be32((u32)ts.tv_usec);
+			*(__be32 *)data = cpu_to_be32((u32)(ts.tv_nsec / NSEC_PER_USEC));
 		}
 		data += sizeof(__be32);
 	}
-- 
2.30.2



* [PATCH v6 net-next 09/13] net: Get rcv tstamp if needed in nfnetlink_{log, queue}.c
From: Martin KaFai Lau @ 2022-03-02 19:56 UTC
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

If the skb has the (rcv) timestamp available, nfnetlink_{log, queue}.c
logs/outputs it to userspace.  When a locally generated skb is
looping from egress to ingress over a virtual interface (e.g. veth,
loopback...), skb->tstamp may have the delivery time before it is
known that the skb will be delivered locally and received by another
sk.  As with handling the delivery time in network tapping, use
ktime_get_real() to get the (rcv) timestamp.  The earlier added helper
skb_tstamp_cond() is used to do this.  false is passed as the second
'cond' arg so that doing ktime_get_real() or not depends only on the
netstamp_needed_key static key.
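
A condensed sketch of the nfqueue change below; the design point is
that the timestamp is resolved once and the same value gates both the
netlink size reservation and the attribute fill, so the two cannot
disagree:

	ktime_t tstamp = skb_tstamp_cond(entskb, false);

	/* reserve room only when a timestamp will actually be filled */
	if (tstamp)
		size += nla_total_size(sizeof(struct nfqnl_msg_packet_timestamp));
	...
	if (entry->state.hook <= NF_INET_FORWARD && tstamp) {
		struct timespec64 kts = ktime_to_timespec64(tstamp);

		ts.sec = cpu_to_be64(kts.tv_sec);
		ts.usec = cpu_to_be64(kts.tv_nsec / NSEC_PER_USEC);
	}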

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/netfilter/nfnetlink_log.c   | 6 ++++--
 net/netfilter/nfnetlink_queue.c | 8 +++++---
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index ae9c0756bba5..d97eb280cb2e 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c
@@ -460,6 +460,7 @@ __build_packet_message(struct nfnl_log_net *log,
 	sk_buff_data_t old_tail = inst->skb->tail;
 	struct sock *sk;
 	const unsigned char *hwhdrp;
+	ktime_t tstamp;
 
 	nlh = nfnl_msg_put(inst->skb, 0, 0,
 			   nfnl_msg_type(NFNL_SUBSYS_ULOG, NFULNL_MSG_PACKET),
@@ -588,9 +589,10 @@ __build_packet_message(struct nfnl_log_net *log,
 			goto nla_put_failure;
 	}
 
-	if (hooknum <= NF_INET_FORWARD && skb->tstamp) {
+	tstamp = skb_tstamp_cond(skb, false);
+	if (hooknum <= NF_INET_FORWARD && tstamp) {
 		struct nfulnl_msg_packet_timestamp ts;
-		struct timespec64 kts = ktime_to_timespec64(skb->tstamp);
+		struct timespec64 kts = ktime_to_timespec64(tstamp);
 		ts.sec = cpu_to_be64(kts.tv_sec);
 		ts.usec = cpu_to_be64(kts.tv_nsec / NSEC_PER_USEC);
 
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 8c15978d9258..db9b5357f2ca 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -392,6 +392,7 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue,
 	bool csum_verify;
 	char *secdata = NULL;
 	u32 seclen = 0;
+	ktime_t tstamp;
 
 	size = nlmsg_total_size(sizeof(struct nfgenmsg))
 		+ nla_total_size(sizeof(struct nfqnl_msg_packet_hdr))
@@ -407,7 +408,8 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue,
 		+ nla_total_size(sizeof(u_int32_t))	/* skbinfo */
 		+ nla_total_size(sizeof(u_int32_t));	/* cap_len */
 
-	if (entskb->tstamp)
+	tstamp = skb_tstamp_cond(entskb, false);
+	if (tstamp)
 		size += nla_total_size(sizeof(struct nfqnl_msg_packet_timestamp));
 
 	size += nfqnl_get_bridge_size(entry);
@@ -582,9 +584,9 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue,
 	if (nfqnl_put_bridge(entry, skb) < 0)
 		goto nla_put_failure;
 
-	if (entry->state.hook <= NF_INET_FORWARD && entskb->tstamp) {
+	if (entry->state.hook <= NF_INET_FORWARD && tstamp) {
 		struct nfqnl_msg_packet_timestamp ts;
-		struct timespec64 kts = ktime_to_timespec64(entskb->tstamp);
+		struct timespec64 kts = ktime_to_timespec64(tstamp);
 
 		ts.sec = cpu_to_be64(kts.tv_sec);
 		ts.usec = cpu_to_be64(kts.tv_nsec / NSEC_PER_USEC);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v6 net-next 10/13] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally
  2022-03-02 19:55 [PATCH v6 net-next 0/13] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
                   ` (8 preceding siblings ...)
  2022-03-02 19:56 ` [PATCH v6 net-next 09/13] net: Get rcv tstamp if needed in nfnetlink_{log, queue}.c Martin KaFai Lau
@ 2022-03-02 19:56 ` Martin KaFai Lau
  2022-03-02 20:30   ` Eric Dumazet
  2022-03-02 19:56 ` [PATCH v6 net-next 11/13] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress Martin KaFai Lau
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2022-03-02 19:56 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

The previous patches handled the delivery_time in the ingress path
before the routing decision is made.  This patch postpones clearing
the delivery_time in an skb until it is known that the skb is
delivered locally, and also sets the (rcv) timestamp if needed.  It
moves the skb_clear_delivery_time() from dev.c to
ip_local_deliver_finish() and ip6_input_finish().
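
For reference, a simplified sketch of what skb_clear_delivery_time()
does (the helper was introduced earlier in this series; this rendering
follows the series description rather than quoting the exact code):

	static inline void skb_clear_delivery_time(struct sk_buff *skb)
	{
		if (skb->mono_delivery_time) {
			skb->mono_delivery_time = 0;
			if (static_branch_unlikely(&netstamp_needed_key))
				skb->tstamp = ktime_get_real();
			else
				skb->tstamp = 0;
		}
	}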

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/dev.c       | 8 ++------
 net/ipv4/ip_input.c  | 1 +
 net/ipv6/ip6_input.c | 1 +
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 0fc02cf32476..3ff686cc8c84 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5193,10 +5193,8 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
 			goto out;
 	}
 
-	if (skb_skip_tc_classify(skb)) {
-		skb_clear_delivery_time(skb);
+	if (skb_skip_tc_classify(skb))
 		goto skip_classify;
-	}
 
 	if (pfmemalloc)
 		goto skip_taps;
@@ -5225,14 +5223,12 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
 			goto another_round;
 		if (!skb)
 			goto out;
-		skb_clear_delivery_time(skb);
 
 		nf_skip_egress(skb, false);
 		if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0)
 			goto out;
-	} else
+	}
 #endif
-		skb_clear_delivery_time(skb);
 	skb_reset_redirect(skb);
 skip_classify:
 	if (pfmemalloc && !skb_pfmemalloc_protocol(skb))
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index d94f9f7e60c3..95f7bb052784 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -226,6 +226,7 @@ void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int protocol)
 
 static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
+	skb_clear_delivery_time(skb);
 	__skb_pull(skb, skb_network_header_len(skb));
 
 	rcu_read_lock();
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index d4b1e2c5aa76..5b5ea35635f9 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -459,6 +459,7 @@ void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int nexthdr,
 
 static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
+	skb_clear_delivery_time(skb);
 	rcu_read_lock();
 	ip6_protocol_deliver_rcu(net, skb, 0, false);
 	rcu_read_unlock();
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v6 net-next 11/13] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress
  2022-03-02 19:55 [PATCH v6 net-next 0/13] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
                   ` (9 preceding siblings ...)
  2022-03-02 19:56 ` [PATCH v6 net-next 10/13] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally Martin KaFai Lau
@ 2022-03-02 19:56 ` Martin KaFai Lau
  2022-03-03 13:00   ` Daniel Borkmann
  2022-03-02 19:56 ` [PATCH v6 net-next 12/13] bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_delivery_time() Martin KaFai Lau
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2022-03-02 19:56 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

The current tc-bpf@ingress reads and writes the __sk_buff->tstamp
as a (rcv) timestamp, which currently could either be 0 (not available)
or ktime_get_real().  This patch keeps backward compatibility with the
(rcv) timestamp expectation at ingress.  If skb->tstamp has
the delivery_time, the bpf insn rewrite will read 0 for tc-bpf
running at ingress since the (rcv) timestamp is not available.  When
writing at ingress, it will also clear the skb->mono_delivery_time bit.

/* BPF_READ: a = __sk_buff->tstamp */
if (!skb->tc_at_ingress || !skb->mono_delivery_time)
	a = skb->tstamp;
else
	a = 0

/* BPF_WRITE: __sk_buff->tstamp = a */
if (skb->tc_at_ingress)
	skb->mono_delivery_time = 0;
skb->tstamp = a;

[ A note on the BPF_CGROUP_INET_INGRESS which can also access
  skb->tstamp.  At that point, the skb is delivered locally
  and skb_clear_delivery_time() has already been done,
  so the skb->tstamp will only have the (rcv) timestamp. ]

If the tc-bpf@egress writes 0 to skb->tstamp, the skb->mono_delivery_time
bit has to be cleared as well.  It could be done together during
convert_ctx_access().  However, a later patch will also expose
the skb->mono_delivery_time bit as __sk_buff->delivery_time_type.
Changing the delivery_time_type in the background may surprise
the user, e.g. a second read of __sk_buff->delivery_time_type
may need a READ_ONCE() to avoid compiler optimization.  Thus,
anticipating the needs of that later patch, this patch checks
!skb->tstamp after running the tc-bpf and clears the
skb->mono_delivery_time bit if needed.  See the earlier discussion
on v4 [0].

The bpf insn rewrite requires the skb's mono_delivery_time bit and
tc_at_ingress bit.  They are moved up in sk_buff so that the bpf rewrite
can be done at a fixed offset.  tc_skip_classify is moved together with
tc_at_ingress.  To free up one bit for mono_delivery_time, csum_not_inet
(a bit currently used by sctp) is moved down.

[0]: https://lore.kernel.org/bpf/20220217015043.khqwqklx45c4m4se@kafai-mbp.dhcp.thefacebook.com/
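
A minimal tc-bpf@ingress fragment illustrating the resulting behavior
(a sketch; the program and section names are illustrative):

	SEC("tc")
	int ingress_rcv_tstamp(struct __sk_buff *skb)
	{
		/* After this patch, a forwarded skb carrying a mono
		 * delivery_time reads back as 0 here, preserving the
		 * old (rcv) timestamp expectation at ingress.
		 */
		__u64 rcv_tstamp = skb->tstamp;

		if (rcv_tstamp)
			bpf_printk("rcv tstamp: %llu", rcv_tstamp);

		return TC_ACT_OK;
	}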

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h | 18 +++++++----
 net/core/filter.c      | 71 ++++++++++++++++++++++++++++++++++++------
 net/sched/act_bpf.c    |  2 ++
 net/sched/cls_bpf.c    |  2 ++
 4 files changed, 77 insertions(+), 16 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4b5b926a81f2..5445860e1ba6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -941,8 +941,12 @@ struct sk_buff {
 	__u8			vlan_present:1;	/* See PKT_VLAN_PRESENT_BIT */
 	__u8			csum_complete_sw:1;
 	__u8			csum_level:2;
-	__u8			csum_not_inet:1;
 	__u8			dst_pending_confirm:1;
+	__u8			mono_delivery_time:1;
+#ifdef CONFIG_NET_CLS_ACT
+	__u8			tc_skip_classify:1;
+	__u8			tc_at_ingress:1;
+#endif
 #ifdef CONFIG_IPV6_NDISC_NODETYPE
 	__u8			ndisc_nodetype:2;
 #endif
@@ -953,10 +957,6 @@ struct sk_buff {
 #ifdef CONFIG_NET_SWITCHDEV
 	__u8			offload_fwd_mark:1;
 	__u8			offload_l3_fwd_mark:1;
-#endif
-#ifdef CONFIG_NET_CLS_ACT
-	__u8			tc_skip_classify:1;
-	__u8			tc_at_ingress:1;
 #endif
 	__u8			redirected:1;
 #ifdef CONFIG_NET_REDIRECT
@@ -969,7 +969,7 @@ struct sk_buff {
 	__u8			decrypted:1;
 #endif
 	__u8			slow_gro:1;
-	__u8			mono_delivery_time:1;
+	__u8			csum_not_inet:1;
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
@@ -1047,10 +1047,16 @@ struct sk_buff {
 /* if you move pkt_vlan_present around you also must adapt these constants */
 #ifdef __BIG_ENDIAN_BITFIELD
 #define PKT_VLAN_PRESENT_BIT	7
+#define TC_AT_INGRESS_MASK		(1 << 0)
+#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 2)
 #else
 #define PKT_VLAN_PRESENT_BIT	0
+#define TC_AT_INGRESS_MASK		(1 << 7)
+#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 5)
 #endif
 #define PKT_VLAN_PRESENT_OFFSET	offsetof(struct sk_buff, __pkt_vlan_present_offset)
+#define TC_AT_INGRESS_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
+#define SKB_MONO_DELIVERY_TIME_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
 
 #ifdef __KERNEL__
 /*
diff --git a/net/core/filter.c b/net/core/filter.c
index cfcf9b4d1ec2..5072733743e9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8859,6 +8859,65 @@ static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si,
 	return insn;
 }
 
+static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn *si,
+						struct bpf_insn *insn)
+{
+	__u8 value_reg = si->dst_reg;
+	__u8 skb_reg = si->src_reg;
+
+#ifdef CONFIG_NET_CLS_ACT
+	__u8 tmp_reg = BPF_REG_AX;
+
+	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
+	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
+	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5);
+	/* @ingress, read __sk_buff->tstamp as the (rcv) timestamp,
+	 * so check the skb->mono_delivery_time.
+	 */
+	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
+			      SKB_MONO_DELIVERY_TIME_OFFSET);
+	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
+				SKB_MONO_DELIVERY_TIME_MASK);
+	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
+	/* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */
+	*insn++ = BPF_MOV64_IMM(value_reg, 0);
+	*insn++ = BPF_JMP_A(1);
+#endif
+
+	*insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg,
+			      offsetof(struct sk_buff, tstamp));
+	return insn;
+}
+
+static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_insn *si,
+						 struct bpf_insn *insn)
+{
+	__u8 value_reg = si->src_reg;
+	__u8 skb_reg = si->dst_reg;
+
+#ifdef CONFIG_NET_CLS_ACT
+	__u8 tmp_reg = BPF_REG_AX;
+
+	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
+	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
+	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 3);
+	/* Writing __sk_buff->tstamp at ingress as the (rcv) timestamp.
+	 * Clear the skb->mono_delivery_time.
+	 */
+	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
+			      SKB_MONO_DELIVERY_TIME_OFFSET);
+	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
+				~SKB_MONO_DELIVERY_TIME_MASK);
+	*insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg,
+			      SKB_MONO_DELIVERY_TIME_OFFSET);
+#endif
+
+	/* skb->tstamp = tstamp */
+	*insn++ = BPF_STX_MEM(BPF_DW, skb_reg, value_reg,
+			      offsetof(struct sk_buff, tstamp));
+	return insn;
+}
+
 static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 				  const struct bpf_insn *si,
 				  struct bpf_insn *insn_buf,
@@ -9167,17 +9226,9 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 		BUILD_BUG_ON(sizeof_field(struct sk_buff, tstamp) != 8);
 
 		if (type == BPF_WRITE)
-			*insn++ = BPF_STX_MEM(BPF_DW,
-					      si->dst_reg, si->src_reg,
-					      bpf_target_off(struct sk_buff,
-							     tstamp, 8,
-							     target_size));
+			insn = bpf_convert_tstamp_write(si, insn);
 		else
-			*insn++ = BPF_LDX_MEM(BPF_DW,
-					      si->dst_reg, si->src_reg,
-					      bpf_target_off(struct sk_buff,
-							     tstamp, 8,
-							     target_size));
+			insn = bpf_convert_tstamp_read(si, insn);
 		break;
 
 	case offsetof(struct __sk_buff, gso_segs):
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index a77d8908e737..fea2d78b9ddc 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -53,6 +53,8 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
 		bpf_compute_data_pointers(skb);
 		filter_res = bpf_prog_run(filter, skb);
 	}
+	if (unlikely(!skb->tstamp && skb->mono_delivery_time))
+		skb->mono_delivery_time = 0;
 	if (skb_sk_is_prefetched(skb) && filter_res != TC_ACT_OK)
 		skb_orphan(skb);
 
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index df19a847829e..c85b85a192bf 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -102,6 +102,8 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 			bpf_compute_data_pointers(skb);
 			filter_res = bpf_prog_run(prog->filter, skb);
 		}
+		if (unlikely(!skb->tstamp && skb->mono_delivery_time))
+			skb->mono_delivery_time = 0;
 
 		if (prog->exts_integrated) {
 			res->class   = 0;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v6 net-next 12/13] bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_delivery_time()
  2022-03-02 19:55 [PATCH v6 net-next 0/13] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
                   ` (10 preceding siblings ...)
  2022-03-02 19:56 ` [PATCH v6 net-next 11/13] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress Martin KaFai Lau
@ 2022-03-02 19:56 ` Martin KaFai Lau
  2022-03-02 19:56 ` [PATCH v6 net-next 13/13] bpf: selftests: test skb->tstamp in redirect_neigh Martin KaFai Lau
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 27+ messages in thread
From: Martin KaFai Lau @ 2022-03-02 19:56 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

* __sk_buff->delivery_time_type:
This patch adds __sk_buff->delivery_time_type.  It tells whether the
delivery_time is stored in __sk_buff->tstamp or not.

It will be most useful at ingress to tell if the __sk_buff->tstamp
has the (rcv) timestamp or the delivery_time.  If delivery_time_type
is 0 (BPF_SKB_DELIVERY_TIME_NONE), it has the (rcv) timestamp.

Two non-zero types are defined for the delivery_time_type,
BPF_SKB_DELIVERY_TIME_MONO and BPF_SKB_DELIVERY_TIME_UNSPEC.  UNSPEC
can only happen at egress because only the mono delivery_time can be
forwarded to ingress for now.  The clock of an UNSPEC delivery_time
can be deduced from skb->sk->sk_clockid, which is also how
sch_etf does it.

* Provide forwarded delivery_time to tc-bpf@ingress:
With the help of the new delivery_time_type, the tc-bpf has a way
to tell if the __sk_buff->tstamp has the (rcv) timestamp or
the delivery_time.  During bpf load time, the verifier will learn if
the bpf prog has accessed the new __sk_buff->delivery_time_type.
If it has, it means the tc-bpf@ingress expects that the
skb->tstamp could have the delivery_time.  The kernel will then
read the skb->tstamp as-is during bpf insn rewrite without
checking the skb->mono_delivery_time.  This is done by adding a
new prog->delivery_time_access bit.  The same goes for
writing skb->tstamp.

* bpf_skb_set_delivery_time():
The bpf_skb_set_delivery_time() helper is added to allow setting both
the delivery_time and the delivery_time_type at the same time.  If the
tc-bpf does not need to change the delivery_time_type, it can directly
write to the __sk_buff->tstamp as the existing tc-bpf already does.
The helper will be most useful at ingress for changing the
__sk_buff->tstamp from the (rcv) timestamp to a mono delivery_time
and then calling bpf_redirect_*().

bpf only has a mono clock helper (bpf_ktime_get_ns), the
current known use case is the mono EDT for fq, and only the
mono delivery time can be kept during forwarding for now, so
bpf_skb_set_delivery_time() only supports setting
BPF_SKB_DELIVERY_TIME_MONO.  It can be extended later when use cases
come up and the forwarding path also supports other clock bases.
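
A sketch of the intended ingress usage (DST_IFINDEX is an assumed,
illustrative ifindex and the 1ms offset is arbitrary):

	SEC("tc")
	int ingress_set_edt(struct __sk_buff *skb)
	{
		/* Replace the (rcv) timestamp with a mono EDT before
		 * redirecting to sch_fq on the egress device.
		 */
		if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_NONE) {
			__u64 edt = bpf_ktime_get_ns() + 1000000; /* now + 1ms */

			if (bpf_skb_set_delivery_time(skb, edt,
						      BPF_SKB_DELIVERY_TIME_MONO))
				return TC_ACT_SHOT;
		}

		return bpf_redirect_neigh(DST_IFINDEX, NULL, 0, 0);
	}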

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/filter.h         |   3 +-
 include/uapi/linux/bpf.h       |  41 +++++++-
 net/core/filter.c              | 169 ++++++++++++++++++++++++++-------
 tools/include/uapi/linux/bpf.h |  41 +++++++-
 4 files changed, 216 insertions(+), 38 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1cb1af917617..9bf26307247f 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -572,7 +572,8 @@ struct bpf_prog {
 				has_callchain_buf:1, /* callchain buffer allocated? */
 				enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */
 				call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */
-				call_get_func_ip:1; /* Do we call get_func_ip() */
+				call_get_func_ip:1, /* Do we call get_func_ip() */
+				delivery_time_access:1; /* Accessed __sk_buff->delivery_time_type */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index afe3d0d7f5f2..4eebea830613 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5086,6 +5086,37 @@ union bpf_attr {
  *	Return
  *		0 on success, or a negative error in case of failure. On error
  *		*dst* buffer is zeroed out.
+ *
+ * long bpf_skb_set_delivery_time(struct sk_buff *skb, u64 dtime, u32 dtime_type)
+ *	Description
+ *		Set a *dtime* (delivery time) to the __sk_buff->tstamp and also
+ *		change the __sk_buff->delivery_time_type to *dtime_type*.
+ *
+ *		When setting a delivery time (non zero *dtime*) to
+ *		__sk_buff->tstamp, only BPF_SKB_DELIVERY_TIME_MONO *dtime_type*
+ *		is supported.  It is the only delivery_time_type that will be
+ *		kept after bpf_redirect_*().
+ *
+ *		If there is no need to change the __sk_buff->delivery_time_type,
+ *		the delivery time can be directly written to __sk_buff->tstamp
+ *		instead.
+ *
+ *		*dtime* 0 and *dtime_type* BPF_SKB_DELIVERY_TIME_NONE
+ *		can be used to clear any delivery time stored in
+ *		__sk_buff->tstamp.
+ *
+ *		Only IPv4 and IPv6 skb->protocol are supported.
+ *
+ *		This function is most useful when it needs to set a
+ *		mono delivery time to __sk_buff->tstamp and then
+ *		bpf_redirect_*() to the egress of an iface.  For example,
+ *		changing the (rcv) timestamp in __sk_buff->tstamp at
+ *		ingress to a mono delivery time and then bpf_redirect_*()
+ *		to sch_fq@phy-dev.
+ *	Return
+ *		0 on success.
+ *		**-EINVAL** for invalid input
+ *		**-EOPNOTSUPP** for unsupported delivery_time_type and protocol
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5280,6 +5311,7 @@ union bpf_attr {
 	FN(xdp_load_bytes),		\
 	FN(xdp_store_bytes),		\
 	FN(copy_from_user_task),	\
+	FN(skb_set_delivery_time),      \
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5469,6 +5501,12 @@ union {					\
 	__u64 :64;			\
 } __attribute__((aligned(8)))
 
+enum {
+	BPF_SKB_DELIVERY_TIME_NONE,
+	BPF_SKB_DELIVERY_TIME_UNSPEC,
+	BPF_SKB_DELIVERY_TIME_MONO,
+};
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
@@ -5509,7 +5547,8 @@ struct __sk_buff {
 	__u32 gso_segs;
 	__bpf_md_ptr(struct bpf_sock *, sk);
 	__u32 gso_size;
-	__u32 :32;		/* Padding, future use. */
+	__u8  delivery_time_type;
+	__u32 :24;		/* Padding, future use. */
 	__u64 hwtstamp;
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 5072733743e9..88767f7da150 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7388,6 +7388,43 @@ static const struct bpf_func_proto bpf_sock_ops_reserve_hdr_opt_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_skb_set_delivery_time, struct sk_buff *, skb,
+	   u64, dtime, u32, dtime_type)
+{
+	/* skb_clear_delivery_time() is done for inet protocol */
+	if (skb->protocol != htons(ETH_P_IP) &&
+	    skb->protocol != htons(ETH_P_IPV6))
+		return -EOPNOTSUPP;
+
+	switch (dtime_type) {
+	case BPF_SKB_DELIVERY_TIME_MONO:
+		if (!dtime)
+			return -EINVAL;
+		skb->tstamp = dtime;
+		skb->mono_delivery_time = 1;
+		break;
+	case BPF_SKB_DELIVERY_TIME_NONE:
+		if (dtime)
+			return -EINVAL;
+		skb->tstamp = 0;
+		skb->mono_delivery_time = 0;
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_skb_set_delivery_time_proto = {
+	.func           = bpf_skb_set_delivery_time,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_ANYTHING,
+};
+
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -7749,6 +7786,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_gen_syncookie_proto;
 	case BPF_FUNC_sk_assign:
 		return &bpf_sk_assign_proto;
+	case BPF_FUNC_skb_set_delivery_time:
+		return &bpf_skb_set_delivery_time_proto;
 #endif
 	default:
 		return bpf_sk_base_func_proto(func_id);
@@ -8088,7 +8127,9 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type
 			return false;
 		info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
 		break;
-	case offsetofend(struct __sk_buff, gso_size) ... offsetof(struct __sk_buff, hwtstamp) - 1:
+	case offsetof(struct __sk_buff, delivery_time_type):
+		return false;
+	case offsetofend(struct __sk_buff, delivery_time_type) ... offsetof(struct __sk_buff, hwtstamp) - 1:
 		/* Explicitly prohibit access to padding in __sk_buff. */
 		return false;
 	default:
@@ -8443,6 +8484,15 @@ static bool tc_cls_act_is_valid_access(int off, int size,
 		break;
 	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
 		return false;
+	case offsetof(struct __sk_buff, delivery_time_type):
+		/* The convert_ctx_access() on reading and writing
+		 * __sk_buff->tstamp depends on whether the bpf prog
+		 * has used __sk_buff->delivery_time_type or not.
+		 * Thus, we need to set prog->delivery_time_access
+		 * earlier during is_valid_access() here.
+		 */
+		((struct bpf_prog *)prog)->delivery_time_access = 1;
+		return size == sizeof(__u8);
 	}
 
 	return bpf_skb_is_valid_access(off, size, type, prog, info);
@@ -8838,6 +8888,45 @@ static u32 flow_dissector_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+static struct bpf_insn *bpf_convert_dtime_type_read(const struct bpf_insn *si,
+						    struct bpf_insn *insn)
+{
+	__u8 value_reg = si->dst_reg;
+	__u8 skb_reg = si->src_reg;
+	__u8 tmp_reg = BPF_REG_AX;
+
+	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
+			      SKB_MONO_DELIVERY_TIME_OFFSET);
+	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
+				SKB_MONO_DELIVERY_TIME_MASK);
+	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
+	/* value_reg = BPF_SKB_DELIVERY_TIME_MONO */
+	*insn++ = BPF_MOV32_IMM(value_reg, BPF_SKB_DELIVERY_TIME_MONO);
+	*insn++ = BPF_JMP_A(IS_ENABLED(CONFIG_NET_CLS_ACT) ? 10 : 5);
+
+	*insn++ = BPF_LDX_MEM(BPF_DW, tmp_reg, skb_reg,
+			      offsetof(struct sk_buff, tstamp));
+	*insn++ = BPF_JMP_IMM(BPF_JNE, tmp_reg, 0, 2);
+	/* value_reg = BPF_SKB_DELIVERY_TIME_NONE */
+	*insn++ = BPF_MOV32_IMM(value_reg, BPF_SKB_DELIVERY_TIME_NONE);
+	*insn++ = BPF_JMP_A(IS_ENABLED(CONFIG_NET_CLS_ACT) ? 6 : 1);
+
+#ifdef CONFIG_NET_CLS_ACT
+	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
+	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
+	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
+	/* At ingress, value_reg = 0 */
+	*insn++ = BPF_MOV32_IMM(value_reg, 0);
+	*insn++ = BPF_JMP_A(1);
+#endif
+
+	/* value_reg = BPF_SKB_DELIVERY_TIME_UNSPEC */
+	*insn++ = BPF_MOV32_IMM(value_reg, BPF_SKB_DELIVERY_TIME_UNSPEC);
+
+	/* 15 insns with CONFIG_NET_CLS_ACT */
+	return insn;
+}
+
 static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si,
 						  struct bpf_insn *insn)
 {
@@ -8859,29 +8948,32 @@ static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si,
 	return insn;
 }
 
-static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn *si,
+static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_prog *prog,
+						const struct bpf_insn *si,
 						struct bpf_insn *insn)
 {
 	__u8 value_reg = si->dst_reg;
 	__u8 skb_reg = si->src_reg;
 
 #ifdef CONFIG_NET_CLS_ACT
-	__u8 tmp_reg = BPF_REG_AX;
-
-	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
-	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
-	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5);
-	/* @ingress, read __sk_buff->tstamp as the (rcv) timestamp,
-	 * so check the skb->mono_delivery_time.
-	 */
-	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
-			      SKB_MONO_DELIVERY_TIME_OFFSET);
-	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
-				SKB_MONO_DELIVERY_TIME_MASK);
-	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
-	/* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */
-	*insn++ = BPF_MOV64_IMM(value_reg, 0);
-	*insn++ = BPF_JMP_A(1);
+	if (!prog->delivery_time_access) {
+		__u8 tmp_reg = BPF_REG_AX;
+
+		*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
+		*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
+		*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5);
+		/* @ingress, read __sk_buff->tstamp as the (rcv) timestamp,
+		 * so check the skb->mono_delivery_time.
+		 */
+		*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
+				      SKB_MONO_DELIVERY_TIME_OFFSET);
+		*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
+					SKB_MONO_DELIVERY_TIME_MASK);
+		*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
+		/* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */
+		*insn++ = BPF_MOV64_IMM(value_reg, 0);
+		*insn++ = BPF_JMP_A(1);
+	}
 #endif
 
 	*insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg,
@@ -8889,27 +8981,30 @@ static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn *si,
 	return insn;
 }
 
-static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_insn *si,
+static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_prog *prog,
+						 const struct bpf_insn *si,
 						 struct bpf_insn *insn)
 {
 	__u8 value_reg = si->src_reg;
 	__u8 skb_reg = si->dst_reg;
 
 #ifdef CONFIG_NET_CLS_ACT
-	__u8 tmp_reg = BPF_REG_AX;
-
-	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
-	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
-	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 3);
-	/* Writing __sk_buff->tstamp at ingress as the (rcv) timestamp.
-	 * Clear the skb->mono_delivery_time.
-	 */
-	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
-			      SKB_MONO_DELIVERY_TIME_OFFSET);
-	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
-				~SKB_MONO_DELIVERY_TIME_MASK);
-	*insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg,
-			      SKB_MONO_DELIVERY_TIME_OFFSET);
+	if (!prog->delivery_time_access) {
+		__u8 tmp_reg = BPF_REG_AX;
+
+		*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
+		*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
+		*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 3);
+		/* Writing __sk_buff->tstamp at ingress as the (rcv) timestamp.
+		 * Clear the skb->mono_delivery_time.
+		 */
+		*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
+				      SKB_MONO_DELIVERY_TIME_OFFSET);
+		*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
+					~SKB_MONO_DELIVERY_TIME_MASK);
+		*insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg,
+				      SKB_MONO_DELIVERY_TIME_OFFSET);
+	}
 #endif
 
 	/* skb->tstamp = tstamp */
@@ -9226,9 +9321,13 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 		BUILD_BUG_ON(sizeof_field(struct sk_buff, tstamp) != 8);
 
 		if (type == BPF_WRITE)
-			insn = bpf_convert_tstamp_write(si, insn);
+			insn = bpf_convert_tstamp_write(prog, si, insn);
 		else
-			insn = bpf_convert_tstamp_read(si, insn);
+			insn = bpf_convert_tstamp_read(prog, si, insn);
+		break;
+
+	case offsetof(struct __sk_buff, delivery_time_type):
+		insn = bpf_convert_dtime_type_read(si, insn);
 		break;
 
 	case offsetof(struct __sk_buff, gso_segs):
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index afe3d0d7f5f2..4eebea830613 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5086,6 +5086,37 @@ union bpf_attr {
  *	Return
  *		0 on success, or a negative error in case of failure. On error
  *		*dst* buffer is zeroed out.
+ *
+ * long bpf_skb_set_delivery_time(struct sk_buff *skb, u64 dtime, u32 dtime_type)
+ *	Description
+ *		Set a *dtime* (delivery time) to the __sk_buff->tstamp and also
+ *		change the __sk_buff->delivery_time_type to *dtime_type*.
+ *
+ *		When setting a delivery time (non zero *dtime*) to
+ *		__sk_buff->tstamp, only BPF_SKB_DELIVERY_TIME_MONO *dtime_type*
+ *		is supported.  It is the only delivery_time_type that will be
+ *		kept after bpf_redirect_*().
+ *
+ *		If there is no need to change the __sk_buff->delivery_time_type,
+ *		the delivery time can be directly written to __sk_buff->tstamp
+ *		instead.
+ *
+ *		*dtime* 0 and *dtime_type* BPF_SKB_DELIVERY_TIME_NONE
+ *		can be used to clear any delivery time stored in
+ *		__sk_buff->tstamp.
+ *
+ *		Only IPv4 and IPv6 skb->protocol are supported.
+ *
+ *		This function is most useful when it needs to set a
+ *		mono delivery time to __sk_buff->tstamp and then
+ *		bpf_redirect_*() to the egress of an iface.  For example,
+ *		changing the (rcv) timestamp in __sk_buff->tstamp at
+ *		ingress to a mono delivery time and then bpf_redirect_*()
+ *		to sch_fq@phy-dev.
+ *	Return
+ *		0 on success.
+ *		**-EINVAL** for invalid input
+ *		**-EOPNOTSUPP** for unsupported delivery_time_type and protocol
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5280,6 +5311,7 @@ union bpf_attr {
 	FN(xdp_load_bytes),		\
 	FN(xdp_store_bytes),		\
 	FN(copy_from_user_task),	\
+	FN(skb_set_delivery_time),      \
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5469,6 +5501,12 @@ union {					\
 	__u64 :64;			\
 } __attribute__((aligned(8)))
 
+enum {
+	BPF_SKB_DELIVERY_TIME_NONE,
+	BPF_SKB_DELIVERY_TIME_UNSPEC,
+	BPF_SKB_DELIVERY_TIME_MONO,
+};
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
@@ -5509,7 +5547,8 @@ struct __sk_buff {
 	__u32 gso_segs;
 	__bpf_md_ptr(struct bpf_sock *, sk);
 	__u32 gso_size;
-	__u32 :32;		/* Padding, future use. */
+	__u8  delivery_time_type;
+	__u32 :24;		/* Padding, future use. */
 	__u64 hwtstamp;
 };
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v6 net-next 13/13] bpf: selftests: test skb->tstamp in redirect_neigh
  2022-03-02 19:55 [PATCH v6 net-next 0/13] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
                   ` (11 preceding siblings ...)
  2022-03-02 19:56 ` [PATCH v6 net-next 12/13] bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_delivery_time() Martin KaFai Lau
@ 2022-03-02 19:56 ` Martin KaFai Lau
  2022-03-03  0:36 ` [PATCH v6 net-next 0/13] Preserve mono delivery time (EDT) in skb->tstamp Eric Dumazet
  2022-03-03 14:50 ` patchwork-bot+netdevbpf
  14 siblings, 0 replies; 27+ messages in thread
From: Martin KaFai Lau @ 2022-03-02 19:56 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

This patch adds tests on forwarding the delivery_time for
the following cases:
- tcp/udp + ip4/ip6 + bpf_redirect_neigh
- tcp/udp + ip4/ip6 + ip[6]_forward
- bpf_skb_set_delivery_time
- The old rcv timestamp expectation on tc-bpf@ingress

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 .../selftests/bpf/prog_tests/tc_redirect.c    | 434 ++++++++++++++++++
 .../selftests/bpf/progs/test_tc_dtime.c       | 349 ++++++++++++++
 2 files changed, 783 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_dtime.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tc_redirect.c b/tools/testing/selftests/bpf/prog_tests/tc_redirect.c
index 647b0a833628..2b255e28ed26 100644
--- a/tools/testing/selftests/bpf/prog_tests/tc_redirect.c
+++ b/tools/testing/selftests/bpf/prog_tests/tc_redirect.c
@@ -17,6 +17,8 @@
 #include <linux/if_tun.h>
 #include <linux/limits.h>
 #include <linux/sysctl.h>
+#include <linux/time_types.h>
+#include <linux/net_tstamp.h>
 #include <sched.h>
 #include <stdbool.h>
 #include <stdio.h>
@@ -29,6 +31,11 @@
 #include "test_tc_neigh_fib.skel.h"
 #include "test_tc_neigh.skel.h"
 #include "test_tc_peer.skel.h"
+#include "test_tc_dtime.skel.h"
+
+#ifndef TCP_TX_DELAY
+#define TCP_TX_DELAY 37
+#endif
 
 #define NS_SRC "ns_src"
 #define NS_FWD "ns_fwd"
@@ -61,6 +68,7 @@
 #define CHK_PROG_PIN_FILE "/sys/fs/bpf/test_tc_chk"
 
 #define TIMEOUT_MILLIS 10000
+#define NSEC_PER_SEC 1000000000ULL
 
 #define log_err(MSG, ...) \
 	fprintf(stderr, "(%s:%d: errno: %s) " MSG "\n", \
@@ -440,6 +448,431 @@ static int set_forwarding(bool enable)
 	return 0;
 }
 
+static void rcv_tstamp(int fd, const char *expected, size_t s)
+{
+	struct __kernel_timespec pkt_ts = {};
+	char ctl[CMSG_SPACE(sizeof(pkt_ts))];
+	struct timespec now_ts;
+	struct msghdr msg = {};
+	__u64 now_ns, pkt_ns;
+	struct cmsghdr *cmsg;
+	struct iovec iov;
+	char data[32];
+	int ret;
+
+	iov.iov_base = data;
+	iov.iov_len = sizeof(data);
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+	msg.msg_control = &ctl;
+	msg.msg_controllen = sizeof(ctl);
+
+	ret = recvmsg(fd, &msg, 0);
+	if (!ASSERT_EQ(ret, s, "recvmsg"))
+		return;
+	ASSERT_STRNEQ(data, expected, s, "expected rcv data");
+
+	cmsg = CMSG_FIRSTHDR(&msg);
+	if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
+	    cmsg->cmsg_type == SO_TIMESTAMPNS_NEW)
+		memcpy(&pkt_ts, CMSG_DATA(cmsg), sizeof(pkt_ts));
+
+	pkt_ns = pkt_ts.tv_sec * NSEC_PER_SEC + pkt_ts.tv_nsec;
+	ASSERT_NEQ(pkt_ns, 0, "pkt rcv tstamp");
+
+	ret = clock_gettime(CLOCK_REALTIME, &now_ts);
+	ASSERT_OK(ret, "clock_gettime");
+	now_ns = now_ts.tv_sec * NSEC_PER_SEC + now_ts.tv_nsec;
+
+	if (ASSERT_GE(now_ns, pkt_ns, "check rcv tstamp"))
+		ASSERT_LT(now_ns - pkt_ns, 5 * NSEC_PER_SEC,
+			  "check rcv tstamp");
+}
+
+static void snd_tstamp(int fd, char *b, size_t s)
+{
+	struct sock_txtime opt = { .clockid = CLOCK_TAI };
+	char ctl[CMSG_SPACE(sizeof(__u64))];
+	struct timespec now_ts;
+	struct msghdr msg = {};
+	struct cmsghdr *cmsg;
+	struct iovec iov;
+	__u64 now_ns;
+	int ret;
+
+	ret = clock_gettime(CLOCK_TAI, &now_ts);
+	ASSERT_OK(ret, "clock_get_time(CLOCK_TAI)");
+	now_ns = now_ts.tv_sec * NSEC_PER_SEC + now_ts.tv_nsec;
+
+	iov.iov_base = b;
+	iov.iov_len = s;
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+	msg.msg_control = &ctl;
+	msg.msg_controllen = sizeof(ctl);
+
+	cmsg = CMSG_FIRSTHDR(&msg);
+	cmsg->cmsg_level = SOL_SOCKET;
+	cmsg->cmsg_type = SCM_TXTIME;
+	cmsg->cmsg_len = CMSG_LEN(sizeof(now_ns));
+	*(__u64 *)CMSG_DATA(cmsg) = now_ns;
+
+	ret = setsockopt(fd, SOL_SOCKET, SO_TXTIME, &opt, sizeof(opt));
+	ASSERT_OK(ret, "setsockopt(SO_TXTIME)");
+
+	ret = sendmsg(fd, &msg, 0);
+	ASSERT_EQ(ret, s, "sendmsg");
+}
+
+static void test_inet_dtime(int family, int type, const char *addr, __u16 port)
+{
+	int opt = 1, accept_fd = -1, client_fd = -1, listen_fd, err;
+	char buf[] = "testing testing";
+	struct nstoken *nstoken;
+
+	nstoken = open_netns(NS_DST);
+	if (!ASSERT_OK_PTR(nstoken, "setns dst"))
+		return;
+	listen_fd = start_server(family, type, addr, port, 0);
+	close_netns(nstoken);
+
+	if (!ASSERT_GE(listen_fd, 0, "listen"))
+		return;
+
+	/* Ensure the kernel puts the (rcv) timestamp for all skb */
+	err = setsockopt(listen_fd, SOL_SOCKET, SO_TIMESTAMPNS_NEW,
+			 &opt, sizeof(opt));
+	if (!ASSERT_OK(err, "setsockopt(SO_TIMESTAMPNS_NEW)"))
+		goto done;
+
+	if (type == SOCK_STREAM) {
+		/* Ensure the kernel set EDT when sending out rst/ack
+		 * from the kernel's ctl_sk.
+		 */
+		err = setsockopt(listen_fd, SOL_TCP, TCP_TX_DELAY, &opt,
+				 sizeof(opt));
+		if (!ASSERT_OK(err, "setsockopt(TCP_TX_DELAY)"))
+			goto done;
+	}
+
+	nstoken = open_netns(NS_SRC);
+	if (!ASSERT_OK_PTR(nstoken, "setns src"))
+		goto done;
+	client_fd = connect_to_fd(listen_fd, TIMEOUT_MILLIS);
+	close_netns(nstoken);
+
+	if (!ASSERT_GE(client_fd, 0, "connect_to_fd"))
+		goto done;
+
+	if (type == SOCK_STREAM) {
+		int n;
+
+		accept_fd = accept(listen_fd, NULL, NULL);
+		if (!ASSERT_GE(accept_fd, 0, "accept"))
+			goto done;
+
+		n = write(client_fd, buf, sizeof(buf));
+		if (!ASSERT_EQ(n, sizeof(buf), "send to server"))
+			goto done;
+		rcv_tstamp(accept_fd, buf, sizeof(buf));
+	} else {
+		snd_tstamp(client_fd, buf, sizeof(buf));
+		rcv_tstamp(listen_fd, buf, sizeof(buf));
+	}
+
+done:
+	close(listen_fd);
+	if (accept_fd != -1)
+		close(accept_fd);
+	if (client_fd != -1)
+		close(client_fd);
+}
+
+static int netns_load_dtime_bpf(struct test_tc_dtime *skel)
+{
+	struct nstoken *nstoken;
+
+#define PIN_FNAME(__file) "/sys/fs/bpf/" #__file
+#define PIN(__prog) ({							\
+		int err = bpf_program__pin(skel->progs.__prog, PIN_FNAME(__prog)); \
+		if (!ASSERT_OK(err, "pin " #__prog))		\
+			goto fail;					\
+		})
+
+	/* setup ns_src tc progs */
+	nstoken = open_netns(NS_SRC);
+	if (!ASSERT_OK_PTR(nstoken, "setns " NS_SRC))
+		return -1;
+	PIN(egress_host);
+	PIN(ingress_host);
+	SYS("tc qdisc add dev veth_src clsact");
+	SYS("tc filter add dev veth_src ingress bpf da object-pinned "
+	    PIN_FNAME(ingress_host));
+	SYS("tc filter add dev veth_src egress bpf da object-pinned "
+	    PIN_FNAME(egress_host));
+	close_netns(nstoken);
+
+	/* setup ns_dst tc progs */
+	nstoken = open_netns(NS_DST);
+	if (!ASSERT_OK_PTR(nstoken, "setns " NS_DST))
+		return -1;
+	PIN(egress_host);
+	PIN(ingress_host);
+	SYS("tc qdisc add dev veth_dst clsact");
+	SYS("tc filter add dev veth_dst ingress bpf da object-pinned "
+	    PIN_FNAME(ingress_host));
+	SYS("tc filter add dev veth_dst egress bpf da object-pinned "
+	    PIN_FNAME(egress_host));
+	close_netns(nstoken);
+
+	/* setup ns_fwd tc progs */
+	nstoken = open_netns(NS_FWD);
+	if (!ASSERT_OK_PTR(nstoken, "setns " NS_FWD))
+		return -1;
+	PIN(ingress_fwdns_prio100);
+	PIN(egress_fwdns_prio100);
+	PIN(ingress_fwdns_prio101);
+	PIN(egress_fwdns_prio101);
+	SYS("tc qdisc add dev veth_dst_fwd clsact");
+	SYS("tc filter add dev veth_dst_fwd ingress prio 100 bpf da object-pinned "
+	    PIN_FNAME(ingress_fwdns_prio100));
+	SYS("tc filter add dev veth_dst_fwd ingress prio 101 bpf da object-pinned "
+	    PIN_FNAME(ingress_fwdns_prio101));
+	SYS("tc filter add dev veth_dst_fwd egress prio 100 bpf da object-pinned "
+	    PIN_FNAME(egress_fwdns_prio100));
+	SYS("tc filter add dev veth_dst_fwd egress prio 101 bpf da object-pinned "
+	    PIN_FNAME(egress_fwdns_prio101));
+	SYS("tc qdisc add dev veth_src_fwd clsact");
+	SYS("tc filter add dev veth_src_fwd ingress prio 100 bpf da object-pinned "
+	    PIN_FNAME(ingress_fwdns_prio100));
+	SYS("tc filter add dev veth_src_fwd ingress prio 101 bpf da object-pinned "
+	    PIN_FNAME(ingress_fwdns_prio101));
+	SYS("tc filter add dev veth_src_fwd egress prio 100 bpf da object-pinned "
+	    PIN_FNAME(egress_fwdns_prio100));
+	SYS("tc filter add dev veth_src_fwd egress prio 101 bpf da object-pinned "
+	    PIN_FNAME(egress_fwdns_prio101));
+	close_netns(nstoken);
+
+#undef PIN
+
+	return 0;
+
+fail:
+	close_netns(nstoken);
+	return -1;
+}
+
+enum {
+	INGRESS_FWDNS_P100,
+	INGRESS_FWDNS_P101,
+	EGRESS_FWDNS_P100,
+	EGRESS_FWDNS_P101,
+	INGRESS_ENDHOST,
+	EGRESS_ENDHOST,
+	SET_DTIME,
+	__MAX_CNT,
+};
+
+const char *cnt_names[] = {
+	"ingress_fwdns_p100",
+	"ingress_fwdns_p101",
+	"egress_fwdns_p100",
+	"egress_fwdns_p101",
+	"ingress_endhost",
+	"egress_endhost",
+	"set_dtime",
+};
+
+enum {
+	TCP_IP6_CLEAR_DTIME,
+	TCP_IP4,
+	TCP_IP6,
+	UDP_IP4,
+	UDP_IP6,
+	TCP_IP4_RT_FWD,
+	TCP_IP6_RT_FWD,
+	UDP_IP4_RT_FWD,
+	UDP_IP6_RT_FWD,
+	UKN_TEST,
+	__NR_TESTS,
+};
+
+const char *test_names[] = {
+	"tcp ip6 clear dtime",
+	"tcp ip4",
+	"tcp ip6",
+	"udp ip4",
+	"udp ip6",
+	"tcp ip4 rt fwd",
+	"tcp ip6 rt fwd",
+	"udp ip4 rt fwd",
+	"udp ip6 rt fwd",
+};
+
+static const char *dtime_cnt_str(int test, int cnt)
+{
+	static char name[64];
+
+	snprintf(name, sizeof(name), "%s %s", test_names[test], cnt_names[cnt]);
+
+	return name;
+}
+
+static const char *dtime_err_str(int test, int cnt)
+{
+	static char name[64];
+
+	snprintf(name, sizeof(name), "%s %s errs", test_names[test],
+		 cnt_names[cnt]);
+
+	return name;
+}
+
+static void test_tcp_clear_dtime(struct test_tc_dtime *skel)
+{
+	int i, t = TCP_IP6_CLEAR_DTIME;
+	__u32 *dtimes = skel->bss->dtimes[t];
+	__u32 *errs = skel->bss->errs[t];
+
+	skel->bss->test = t;
+	test_inet_dtime(AF_INET6, SOCK_STREAM, IP6_DST, 0);
+
+	ASSERT_EQ(dtimes[INGRESS_FWDNS_P100], 0,
+		  dtime_cnt_str(t, INGRESS_FWDNS_P100));
+	ASSERT_EQ(dtimes[INGRESS_FWDNS_P101], 0,
+		  dtime_cnt_str(t, INGRESS_FWDNS_P101));
+	ASSERT_GT(dtimes[EGRESS_FWDNS_P100], 0,
+		  dtime_cnt_str(t, EGRESS_FWDNS_P100));
+	ASSERT_EQ(dtimes[EGRESS_FWDNS_P101], 0,
+		  dtime_cnt_str(t, EGRESS_FWDNS_P101));
+	ASSERT_GT(dtimes[EGRESS_ENDHOST], 0,
+		  dtime_cnt_str(t, EGRESS_ENDHOST));
+	ASSERT_GT(dtimes[INGRESS_ENDHOST], 0,
+		  dtime_cnt_str(t, INGRESS_ENDHOST));
+
+	for (i = INGRESS_FWDNS_P100; i < __MAX_CNT; i++)
+		ASSERT_EQ(errs[i], 0, dtime_err_str(t, i));
+}
+
+static void test_tcp_dtime(struct test_tc_dtime *skel, int family, bool bpf_fwd)
+{
+	__u32 *dtimes, *errs;
+	const char *addr;
+	int i, t;
+
+	if (family == AF_INET) {
+		t = bpf_fwd ? TCP_IP4 : TCP_IP4_RT_FWD;
+		addr = IP4_DST;
+	} else {
+		t = bpf_fwd ? TCP_IP6 : TCP_IP6_RT_FWD;
+		addr = IP6_DST;
+	}
+
+	dtimes = skel->bss->dtimes[t];
+	errs = skel->bss->errs[t];
+
+	skel->bss->test = t;
+	test_inet_dtime(family, SOCK_STREAM, addr, 0);
+
+	/* fwdns_prio100 prog does not read delivery_time_type, so
+	 * kernel puts the (rcv) timestamp in __sk_buff->tstamp
+	 */
+	ASSERT_EQ(dtimes[INGRESS_FWDNS_P100], 0,
+		  dtime_cnt_str(t, INGRESS_FWDNS_P100));
+	for (i = INGRESS_FWDNS_P101; i < SET_DTIME; i++)
+		ASSERT_GT(dtimes[i], 0, dtime_cnt_str(t, i));
+
+	for (i = INGRESS_FWDNS_P100; i < __MAX_CNT; i++)
+		ASSERT_EQ(errs[i], 0, dtime_err_str(t, i));
+}
+
+static void test_udp_dtime(struct test_tc_dtime *skel, int family, bool bpf_fwd)
+{
+	__u32 *dtimes, *errs;
+	const char *addr;
+	int i, t;
+
+	if (family == AF_INET) {
+		t = bpf_fwd ? UDP_IP4 : UDP_IP4_RT_FWD;
+		addr = IP4_DST;
+	} else {
+		t = bpf_fwd ? UDP_IP6 : UDP_IP6_RT_FWD;
+		addr = IP6_DST;
+	}
+
+	dtimes = skel->bss->dtimes[t];
+	errs = skel->bss->errs[t];
+
+	skel->bss->test = t;
+	test_inet_dtime(family, SOCK_DGRAM, addr, 0);
+
+	ASSERT_EQ(dtimes[INGRESS_FWDNS_P100], 0,
+		  dtime_cnt_str(t, INGRESS_FWDNS_P100));
+	/* non-mono delivery time is not forwarded */
+	ASSERT_EQ(dtimes[INGRESS_FWDNS_P101], 0,
+		  dtime_cnt_str(t, INGRESS_FWDNS_P101));
+	for (i = EGRESS_FWDNS_P100; i < SET_DTIME; i++)
+		ASSERT_GT(dtimes[i], 0, dtime_cnt_str(t, i));
+
+	for (i = INGRESS_FWDNS_P100; i < __MAX_CNT; i++)
+		ASSERT_EQ(errs[i], 0, dtime_err_str(t, i));
+}
+
+static void test_tc_redirect_dtime(struct netns_setup_result *setup_result)
+{
+	struct test_tc_dtime *skel;
+	struct nstoken *nstoken;
+	int err;
+
+	skel = test_tc_dtime__open();
+	if (!ASSERT_OK_PTR(skel, "test_tc_dtime__open"))
+		return;
+
+	skel->rodata->IFINDEX_SRC = setup_result->ifindex_veth_src_fwd;
+	skel->rodata->IFINDEX_DST = setup_result->ifindex_veth_dst_fwd;
+
+	err = test_tc_dtime__load(skel);
+	if (!ASSERT_OK(err, "test_tc_dtime__load"))
+		goto done;
+
+	if (netns_load_dtime_bpf(skel))
+		goto done;
+
+	nstoken = open_netns(NS_FWD);
+	if (!ASSERT_OK_PTR(nstoken, "setns fwd"))
+		goto done;
+	err = set_forwarding(false);
+	close_netns(nstoken);
+	if (!ASSERT_OK(err, "disable forwarding"))
+		goto done;
+
+	test_tcp_clear_dtime(skel);
+
+	test_tcp_dtime(skel, AF_INET, true);
+	test_tcp_dtime(skel, AF_INET6, true);
+	test_udp_dtime(skel, AF_INET, true);
+	test_udp_dtime(skel, AF_INET6, true);
+
+	/* Test the kernel ip[6]_forward path instead
+	 * of bpf_redirect_neigh().
+	 */
+	nstoken = open_netns(NS_FWD);
+	if (!ASSERT_OK_PTR(nstoken, "setns fwd"))
+		goto done;
+	err = set_forwarding(true);
+	close_netns(nstoken);
+	if (!ASSERT_OK(err, "enable forwarding"))
+		goto done;
+
+	test_tcp_dtime(skel, AF_INET, false);
+	test_tcp_dtime(skel, AF_INET6, false);
+	test_udp_dtime(skel, AF_INET, false);
+	test_udp_dtime(skel, AF_INET6, false);
+
+done:
+	test_tc_dtime__destroy(skel);
+}
+
 static void test_tc_redirect_neigh_fib(struct netns_setup_result *setup_result)
 {
 	struct nstoken *nstoken = NULL;
@@ -787,6 +1220,7 @@ static void *test_tc_redirect_run_tests(void *arg)
 	RUN_TEST(tc_redirect_peer_l3);
 	RUN_TEST(tc_redirect_neigh);
 	RUN_TEST(tc_redirect_neigh_fib);
+	RUN_TEST(tc_redirect_dtime);
 	return NULL;
 }
 
diff --git a/tools/testing/selftests/bpf/progs/test_tc_dtime.c b/tools/testing/selftests/bpf/progs/test_tc_dtime.c
new file mode 100644
index 000000000000..9d9e8e17b8a0
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tc_dtime.c
@@ -0,0 +1,349 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2022 Meta
+
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <linux/bpf.h>
+#include <linux/stddef.h>
+#include <linux/pkt_cls.h>
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#include <sys/socket.h>
+
+/* veth_src --- veth_src_fwd --- veth_dst_fwd --- veth_dst
+ *           |                                 |
+ *  ns_src   |              ns_fwd             |   ns_dst
+ *
+ * ns_src and ns_dst: ENDHOST namespace
+ *            ns_fwd: Forwarding namespace
+ */
+
+#define ctx_ptr(field)		(void *)(long)(field)
+
+#define ip4_src			__bpf_htonl(0xac100164) /* 172.16.1.100 */
+#define ip4_dst			__bpf_htonl(0xac100264) /* 172.16.2.100 */
+
+#define ip6_src			{ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, \
+				  0x00, 0x01, 0xde, 0xad, 0xbe, 0xef, 0xca, 0xfe }
+#define ip6_dst			{ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, \
+				  0x00, 0x02, 0xde, 0xad, 0xbe, 0xef, 0xca, 0xfe }
+
+#define v6_equal(a, b)		(a.s6_addr32[0] == b.s6_addr32[0] && \
+				 a.s6_addr32[1] == b.s6_addr32[1] && \
+				 a.s6_addr32[2] == b.s6_addr32[2] && \
+				 a.s6_addr32[3] == b.s6_addr32[3])
+
+volatile const __u32 IFINDEX_SRC;
+volatile const __u32 IFINDEX_DST;
+
+#define EGRESS_ENDHOST_MAGIC	0x0b9fbeef
+#define INGRESS_FWDNS_MAGIC	0x1b9fbeef
+#define EGRESS_FWDNS_MAGIC	0x2b9fbeef
+
+enum {
+	INGRESS_FWDNS_P100,
+	INGRESS_FWDNS_P101,
+	EGRESS_FWDNS_P100,
+	EGRESS_FWDNS_P101,
+	INGRESS_ENDHOST,
+	EGRESS_ENDHOST,
+	SET_DTIME,
+	__MAX_CNT,
+};
+
+enum {
+	TCP_IP6_CLEAR_DTIME,
+	TCP_IP4,
+	TCP_IP6,
+	UDP_IP4,
+	UDP_IP6,
+	TCP_IP4_RT_FWD,
+	TCP_IP6_RT_FWD,
+	UDP_IP4_RT_FWD,
+	UDP_IP6_RT_FWD,
+	UKN_TEST,
+	__NR_TESTS,
+};
+
+enum {
+	SRC_NS = 1,
+	DST_NS,
+};
+
+__u32 dtimes[__NR_TESTS][__MAX_CNT] = {};
+__u32 errs[__NR_TESTS][__MAX_CNT] = {};
+__u32 test = 0;
+
+static void inc_dtimes(__u32 idx)
+{
+	if (test < __NR_TESTS)
+		dtimes[test][idx]++;
+	else
+		dtimes[UKN_TEST][idx]++;
+}
+
+static void inc_errs(__u32 idx)
+{
+	if (test < __NR_TESTS)
+		errs[test][idx]++;
+	else
+		errs[UKN_TEST][idx]++;
+}
+
+static int skb_proto(int type)
+{
+	return type & 0xff;
+}
+
+static int skb_ns(int type)
+{
+	return (type >> 8) & 0xff;
+}
+
+static bool fwdns_clear_dtime(void)
+{
+	return test == TCP_IP6_CLEAR_DTIME;
+}
+
+static bool bpf_fwd(void)
+{
+	return test < TCP_IP4_RT_FWD;
+}
+
+/* -1: parse error: TC_ACT_SHOT
+ *  0: not testing traffic: TC_ACT_OK
+ * >0: first byte is the inet_proto, second byte has the netns
+ *     of the sender
+ */
+static int skb_get_type(struct __sk_buff *skb)
+{
+	void *data_end = ctx_ptr(skb->data_end);
+	void *data = ctx_ptr(skb->data);
+	__u8 inet_proto = 0, ns = 0;
+	struct ipv6hdr *ip6h;
+	struct iphdr *iph;
+
+	switch (skb->protocol) {
+	case __bpf_htons(ETH_P_IP):
+		iph = data + sizeof(struct ethhdr);
+		if (iph + 1 > data_end)
+			return -1;
+		if (iph->saddr == ip4_src)
+			ns = SRC_NS;
+		else if (iph->saddr == ip4_dst)
+			ns = DST_NS;
+		inet_proto = iph->protocol;
+		break;
+	case __bpf_htons(ETH_P_IPV6):
+		ip6h = data + sizeof(struct ethhdr);
+		if (ip6h + 1 > data_end)
+			return -1;
+		if (v6_equal(ip6h->saddr, (struct in6_addr)ip6_src))
+			ns = SRC_NS;
+		else if (v6_equal(ip6h->saddr, (struct in6_addr)ip6_dst))
+			ns = DST_NS;
+		inet_proto = ip6h->nexthdr;
+		break;
+	default:
+		return 0;
+	}
+
+	if ((inet_proto != IPPROTO_TCP && inet_proto != IPPROTO_UDP) || !ns)
+		return 0;
+
+	return (ns << 8 | inet_proto);
+}
+
+/* format: direction@iface@netns
+ * egress@veth_(src|dst)@ns_(src|dst)
+ */
+SEC("tc")
+int egress_host(struct __sk_buff *skb)
+{
+	int skb_type;
+
+	skb_type = skb_get_type(skb);
+	if (skb_type == -1)
+		return TC_ACT_SHOT;
+	if (!skb_type)
+		return TC_ACT_OK;
+
+	if (skb_proto(skb_type) == IPPROTO_TCP) {
+		if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO &&
+		    skb->tstamp)
+			inc_dtimes(EGRESS_ENDHOST);
+		else
+			inc_errs(EGRESS_ENDHOST);
+	} else {
+		if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_UNSPEC &&
+		    skb->tstamp)
+			inc_dtimes(EGRESS_ENDHOST);
+		else
+			inc_errs(EGRESS_ENDHOST);
+	}
+
+	skb->tstamp = EGRESS_ENDHOST_MAGIC;
+
+	return TC_ACT_OK;
+}
+
+/* ingress@veth_(src|dst)@ns_(src|dst) */
+SEC("tc")
+int ingress_host(struct __sk_buff *skb)
+{
+	int skb_type;
+
+	skb_type = skb_get_type(skb);
+	if (skb_type == -1)
+		return TC_ACT_SHOT;
+	if (!skb_type)
+		return TC_ACT_OK;
+
+	if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO &&
+	    skb->tstamp == EGRESS_FWDNS_MAGIC)
+		inc_dtimes(INGRESS_ENDHOST);
+	else
+		inc_errs(INGRESS_ENDHOST);
+
+	return TC_ACT_OK;
+}
+
+/* ingress@veth_(src|dst)_fwd@ns_fwd priority 100 */
+SEC("tc")
+int ingress_fwdns_prio100(struct __sk_buff *skb)
+{
+	int skb_type;
+
+	skb_type = skb_get_type(skb);
+	if (skb_type == -1)
+		return TC_ACT_SHOT;
+	if (!skb_type)
+		return TC_ACT_OK;
+
+	/* delivery_time is only available to the ingress
+	 * if the tc-bpf checks the skb->delivery_time_type.
+	 */
+	if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
+		inc_errs(INGRESS_FWDNS_P100);
+
+	if (fwdns_clear_dtime())
+		skb->tstamp = 0;
+
+	return TC_ACT_UNSPEC;
+}
+
+/* egress@veth_(src|dst)_fwd@ns_fwd priority 100 */
+SEC("tc")
+int egress_fwdns_prio100(struct __sk_buff *skb)
+{
+	int skb_type;
+
+	skb_type = skb_get_type(skb);
+	if (skb_type == -1)
+		return TC_ACT_SHOT;
+	if (!skb_type)
+		return TC_ACT_OK;
+
+	/* delivery_time is always available to egress even
+	 * the tc-bpf did not use the delivery_time_type.
+	 */
+	if (skb->tstamp == INGRESS_FWDNS_MAGIC)
+		inc_dtimes(EGRESS_FWDNS_P100);
+	else
+		inc_errs(EGRESS_FWDNS_P100);
+
+	if (fwdns_clear_dtime())
+		skb->tstamp = 0;
+
+	return TC_ACT_UNSPEC;
+}
+
+/* ingress@veth_(src|dst)_fwd@ns_fwd priority 101 */
+SEC("tc")
+int ingress_fwdns_prio101(struct __sk_buff *skb)
+{
+	__u64 expected_dtime = EGRESS_ENDHOST_MAGIC;
+	int skb_type;
+
+	skb_type = skb_get_type(skb);
+	if (skb_type == -1 || !skb_type)
+		/* Should have been handled in prio100 */
+		return TC_ACT_SHOT;
+
+	if (skb_proto(skb_type) == IPPROTO_UDP)
+		expected_dtime = 0;
+
+	if (skb->delivery_time_type) {
+		if (fwdns_clear_dtime() ||
+		    skb->delivery_time_type != BPF_SKB_DELIVERY_TIME_MONO ||
+		    skb->tstamp != expected_dtime)
+			inc_errs(INGRESS_FWDNS_P101);
+		else
+			inc_dtimes(INGRESS_FWDNS_P101);
+	} else {
+		if (!fwdns_clear_dtime() && expected_dtime)
+			inc_errs(INGRESS_FWDNS_P101);
+	}
+
+	if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO) {
+		skb->tstamp = INGRESS_FWDNS_MAGIC;
+	} else {
+		if (bpf_skb_set_delivery_time(skb, INGRESS_FWDNS_MAGIC,
+					      BPF_SKB_DELIVERY_TIME_MONO))
+			inc_errs(SET_DTIME);
+		if (!bpf_skb_set_delivery_time(skb, INGRESS_FWDNS_MAGIC,
+					       BPF_SKB_DELIVERY_TIME_UNSPEC))
+			inc_errs(SET_DTIME);
+	}
+
+	if (skb_ns(skb_type) == SRC_NS)
+		return bpf_fwd() ?
+			bpf_redirect_neigh(IFINDEX_DST, NULL, 0, 0) : TC_ACT_OK;
+	else
+		return bpf_fwd() ?
+			bpf_redirect_neigh(IFINDEX_SRC, NULL, 0, 0) : TC_ACT_OK;
+}
+
+/* egress@veth_(src|dst)_fwd@ns_fwd priority 101 */
+SEC("tc")
+int egress_fwdns_prio101(struct __sk_buff *skb)
+{
+	int skb_type;
+
+	skb_type = skb_get_type(skb);
+	if (skb_type == -1 || !skb_type)
+		/* Should have been handled in prio100 */
+		return TC_ACT_SHOT;
+
+	if (skb->delivery_time_type) {
+		if (fwdns_clear_dtime() ||
+		    skb->delivery_time_type != BPF_SKB_DELIVERY_TIME_MONO ||
+		    skb->tstamp != INGRESS_FWDNS_MAGIC)
+			inc_errs(EGRESS_FWDNS_P101);
+		else
+			inc_dtimes(EGRESS_FWDNS_P101);
+	} else {
+		if (!fwdns_clear_dtime())
+			inc_errs(EGRESS_FWDNS_P101);
+	}
+
+	if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO) {
+		skb->tstamp = EGRESS_FWDNS_MAGIC;
+	} else {
+		if (bpf_skb_set_delivery_time(skb, EGRESS_FWDNS_MAGIC,
+					      BPF_SKB_DELIVERY_TIME_MONO))
+			inc_errs(SET_DTIME);
+		if (!bpf_skb_set_delivery_time(skb, EGRESS_FWDNS_MAGIC,
+					       BPF_SKB_DELIVERY_TIME_UNSPEC))
+			inc_errs(SET_DTIME);
+	}
+
+	return TC_ACT_OK;
+}
+
+char __license[] SEC("license") = "GPL";
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 net-next 10/13] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally
  2022-03-02 19:56 ` [PATCH v6 net-next 10/13] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally Martin KaFai Lau
@ 2022-03-02 20:30   ` Eric Dumazet
  2022-03-02 22:33     ` Martin KaFai Lau
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2022-03-02 20:30 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Jakub Kicinski, kernel-team,
	Willem de Bruijn

On Wed, Mar 2, 2022 at 11:56 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> The previous patches handled the delivery_time in the ingress path
> before the routing decision is made.  This patch postpones clearing
> the delivery_time in an skb until it is known that the skb is
> delivered locally, and also sets the (rcv) timestamp if needed.  It
> moves the skb_clear_delivery_time() from dev.c to
> ip_local_deliver_finish() and ip6_input_finish().
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  net/core/dev.c       | 8 ++------
>  net/ipv4/ip_input.c  | 1 +
>  net/ipv6/ip6_input.c | 1 +
>  3 files changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0fc02cf32476..3ff686cc8c84 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5193,10 +5193,8 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
>                         goto out;
>         }
>
> -       if (skb_skip_tc_classify(skb)) {
> -               skb_clear_delivery_time(skb);
> +       if (skb_skip_tc_classify(skb))
>                 goto skip_classify;
> -       }
>
>         if (pfmemalloc)
>                 goto skip_taps;
> @@ -5225,14 +5223,12 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
>                         goto another_round;
>                 if (!skb)
>                         goto out;
> -               skb_clear_delivery_time(skb);
>
>                 nf_skip_egress(skb, false);
>                 if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0)
>                         goto out;
> -       } else
> +       }
>  #endif
> -               skb_clear_delivery_time(skb);
>         skb_reset_redirect(skb);
>  skip_classify:
>         if (pfmemalloc && !skb_pfmemalloc_protocol(skb))
> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index d94f9f7e60c3..95f7bb052784 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -226,6 +226,7 @@ void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int protocol)
>
>  static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
>  {
> +       skb_clear_delivery_time(skb);
>         __skb_pull(skb, skb_network_header_len(skb));
>
>         rcu_read_lock();
> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> index d4b1e2c5aa76..5b5ea35635f9 100644
> --- a/net/ipv6/ip6_input.c
> +++ b/net/ipv6/ip6_input.c
> @@ -459,6 +459,7 @@ void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int nexthdr,
>
>  static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
>  {
> +       skb_clear_delivery_time(skb);
>         rcu_read_lock();
>         ip6_protocol_deliver_rcu(net, skb, 0, false);
>         rcu_read_unlock();
> --
> 2.30.2
>

It is not clear to me why we need to clear the tstamp if the packet is
locally delivered?

TCP stack is using tstamp for incoming packets (look for
TCP_SKB_CB(skb)->has_rxtstamp)
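
For reference, a sketch of where that flag is set on rcv, abridged
from tcp_v4_rcv() in net/ipv4/tcp_ipv4.c (the ipv6 side is analogous):

	/* true when either a sw or a hw rx timestamp was taken */
	TCP_SKB_CB(skb)->has_rxtstamp =
			skb->tstamp || skb_hwtstamps(skb)->hwtstamp;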

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 net-next 10/13] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally
  2022-03-02 20:30   ` Eric Dumazet
@ 2022-03-02 22:33     ` Martin KaFai Lau
  2022-03-02 23:41       ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2022-03-02 22:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Jakub Kicinski, kernel-team,
	Willem de Bruijn

On Wed, Mar 02, 2022 at 12:30:14PM -0800, Eric Dumazet wrote:
> On Wed, Mar 2, 2022 at 11:56 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > [...]
> 
> It is not clear to me why we need to clear the tstamp if the packet is
> locally delivered?
It does not clear the rx tstamp in skb->tstamp.

It only clears the EDT in skb->tstamp when the skb
is transmitted out of a local tcp_sock and then looped back from
egress to ingress through a virtual interface like veth.

> 
> TCP stack is using tstamp for incoming packets (look for
> TCP_SKB_CB(skb)->has_rxtstamp)
skb_clear_delivery_time() will put the ktime_get_real() (rcv) timestamp
back into skb->tstamp so that the receiving tcp_sock can use it.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 net-next 10/13] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally
  2022-03-02 22:33     ` Martin KaFai Lau
@ 2022-03-02 23:41       ` Eric Dumazet
  2022-03-03  0:19         ` Martin KaFai Lau
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2022-03-02 23:41 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Jakub Kicinski, kernel-team,
	Willem de Bruijn

On Wed, Mar 2, 2022 at 2:34 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Mar 02, 2022 at 12:30:14PM -0800, Eric Dumazet wrote:
> > On Wed, Mar 2, 2022 at 11:56 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > > [...]
> >
> > It is not clear to me why we need to clear the tstamp if the packet is
> > locally delivered?
> It does not clear the rx tstamp in skb->tstamp.
>
> It only clears the EDT in skb->tstamp when the skb
> is transmitted out of a local tcp_sock and then looped back from
> egress to ingress through a virtual interface like veth.
>
> >
> > TCP stack is using tstamp for incoming packets (look for
> > TCP_SKB_CB(skb)->has_rxtstamp)
> skb_clear_delivery_time() will put the ktime_get_real() (rcv) timestamp
> back into skb->tstamp so that the receiving tcp_sock can use it.


Oh... I had to look at

+static inline void skb_clear_delivery_time(struct sk_buff *skb)
+{
+       if (skb->mono_delivery_time) {
+               skb->mono_delivery_time = 0;
+               if (static_branch_unlikely(&netstamp_needed_key))
+                       skb->tstamp = ktime_get_real();
+               else
+                       skb->tstamp = 0;
+       }
+}

Name was a bit confusing :)

And it seems you have a big opportunity to not call ktime_get_real()
when skb->sk is known at this point (early demux)
because few sockets actually enable timestamping ?
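
For illustration, an untested sketch of such a check on top of the
helper above, assuming the early-demuxed sk is already attached to
skb->sk and reusing the existing SK_FLAGS_TIMESTAMP mask:

	static inline void skb_clear_delivery_time(struct sk_buff *skb)
	{
		if (skb->mono_delivery_time) {
			skb->mono_delivery_time = 0;
			/* hypothetical shortcut: skip ktime_get_real()
			 * when the early-demuxed sk never asked for
			 * timestamps
			 */
			if (static_branch_unlikely(&netstamp_needed_key) &&
			    (!skb->sk ||
			     (skb->sk->sk_flags & SK_FLAGS_TIMESTAMP)))
				skb->tstamp = ktime_get_real();
			else
				skb->tstamp = 0;
		}
	}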

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 net-next 10/13] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally
  2022-03-02 23:41       ` Eric Dumazet
@ 2022-03-03  0:19         ` Martin KaFai Lau
  2022-03-03  0:47           ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2022-03-03  0:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Jakub Kicinski, kernel-team,
	Willem de Bruijn

On Wed, Mar 02, 2022 at 03:41:59PM -0800, Eric Dumazet wrote:
> On Wed, Mar 2, 2022 at 2:34 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Wed, Mar 02, 2022 at 12:30:14PM -0800, Eric Dumazet wrote:
> > > On Wed, Mar 2, 2022 at 11:56 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > [...]
> > >
> > > It is not clear to me why we need to clear the tstamp if the packet is
> > > locally delivered?
> > It does not clear the rx tstamp in skb->tstamp.
> >
> > It only clears the EDT in skb->tstamp when the skb
> > is transmitted out of a local tcp_sock and then looped back from
> > egress to ingress through a virtual interface like veth.
> >
> > >
> > > TCP stack is using tstamp for incoming packets (look for
> > > TCP_SKB_CB(skb)->has_rxtstamp)
> > skb_clear_delivery_time() will put the ktime_get_real() (rcv) timestamp
> > back into skb->tstamp so that the receiving tcp_sock can use it.
> 
> 
> Oh... I had to look at
> 
> +static inline void skb_clear_delivery_time(struct sk_buff *skb)
> +{
> +       if (skb->mono_delivery_time) {
> +               skb->mono_delivery_time = 0;
> +               if (static_branch_unlikely(&netstamp_needed_key))
> +                       skb->tstamp = ktime_get_real();
> +               else
> +                       skb->tstamp = 0;
> +       }
> +}
> 
> Name was a bit confusing :)
A few names were attempted in the earlier versions before
settling on delivery_time. :p

> 
> And it seems you have a big opportunity to not call ktime_get_real()
> when skb->sk is known at this point (early demux)
> because few sockets actually enable timestamping ?
iiuc, you are suggesting to also check the skb->sk (if early demuxed)
and test for SK_FLAGS_TIMESTAMP.

Without checking skb->sk here, it should not be worse than the
current ktime_get_real() done in dev.c, which also does not have the sk
available?  netstamp_needed_key should have been enabled as
long as at least one sk has asked for it.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 net-next 0/13] Preserve mono delivery time (EDT) in skb->tstamp
  2022-03-02 19:55 [PATCH v6 net-next 0/13] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
                   ` (12 preceding siblings ...)
  2022-03-02 19:56 ` [PATCH v6 net-next 13/13] bpf: selftests: test skb->tstamp in redirect_neigh Martin KaFai Lau
@ 2022-03-03  0:36 ` Eric Dumazet
  2022-03-03 14:50 ` patchwork-bot+netdevbpf
  14 siblings, 0 replies; 27+ messages in thread
From: Eric Dumazet @ 2022-03-03  0:36 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Jakub Kicinski, kernel-team,
	Willem de Bruijn

On Wed, Mar 2, 2022 at 11:55 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> skb->tstamp was first used as the (rcv) timestamp.
> The major usage is to report it to the user (e.g. SO_TIMESTAMP).
>
> Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
> during egress and used by the qdisc (e.g. sch_fq) to make decision on when
> the skb can be passed to the dev.
>
> Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
> or the delivery_time, so it is always reset to 0 whenever forwarded
> between egress and ingress.
>
> While it makes sense to always clear the (rcv) timestamp in skb->tstamp
> to avoid confusing sch_fq that expects the delivery_time, it is a
> performance issue [0] to clear the delivery_time if the skb finally
> egress to a fq@phy-dev.
>
> This set is to keep the mono delivery time and make it available to
> the final egress interface.  Please see individual patch for
> the details.
>
> [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
>

For the whole series

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 net-next 10/13] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally
  2022-03-03  0:19         ` Martin KaFai Lau
@ 2022-03-03  0:47           ` Eric Dumazet
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Dumazet @ 2022-03-03  0:47 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Jakub Kicinski, kernel-team,
	Willem de Bruijn

On Wed, Mar 2, 2022 at 4:20 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Mar 02, 2022 at 03:41:59PM -0800, Eric Dumazet wrote:

> > Name was a bit confusing :)
> A few names were attempted in the earlier versions before
> settling on delivery_time. :p
>
> >
> > And it seems you have a big opportunity to not call ktime_get_real()
> > when skb->sk is known at this point (early demux)
> > because few sockets actually enable timestamping ?
> iiuc, you are suggesting to also check the skb->sk (if early demuxed)
> and test for SK_FLAGS_TIMESTAMP.
>
> Without checking skb->sk here, it should not be worse than the
> current ktime_get_real() done in dev.c, which also does not have the sk
> available?  netstamp_needed_key should have been enabled as
> long as at least one sk has asked for it.

Yes, but there is typically at least one active timestamp consumer
enabling the static key.

This is a corner case anyway; not sure if hosts running VMs would have
a slow ktime_get_real().

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 net-next 03/13] net: Handle delivery_time in skb->tstamp during network tapping with af_packet
  2022-03-02 19:55 ` [PATCH v6 net-next 03/13] net: Handle delivery_time in skb->tstamp during network tapping with af_packet Martin KaFai Lau
@ 2022-03-03 10:48   ` Daniel Borkmann
  2022-03-03 19:17     ` Martin KaFai Lau
  0 siblings, 1 reply; 27+ messages in thread
From: Daniel Borkmann @ 2022-03-03 10:48 UTC (permalink / raw)
  To: Martin KaFai Lau, bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, David Miller, Eric Dumazet,
	Jakub Kicinski, kernel-team, Willem de Bruijn

On 3/2/22 8:55 PM, Martin KaFai Lau wrote:
[...]
> When tapping at ingress, it currently expects the skb->tstamp is either 0
> or the (rcv) timestamp.  Meaning, the tapping at ingress path
> has already expected the skb->tstamp could be 0 and it will get
> the (rcv) timestamp by ktime_get_real() when needed.
> 
> There are two cases for tapping at ingress:
> 
> One case is af_packet queues the skb to its sk_receive_queue.
> The skb is either not shared or new clone created.  The newly
> added skb_clear_delivery_time() is called to clear the
> delivery_time (if any) and set the (rcv) timestamp if
> needed before the skb is queued to the sk_receive_queue.
[...]
>   
> +DECLARE_STATIC_KEY_FALSE(netstamp_needed_key);
> +
> +/* It is used in the ingress path to clear the delivery_time.
> + * If needed, set the skb->tstamp to the (rcv) timestamp.
> + */
> +static inline void skb_clear_delivery_time(struct sk_buff *skb)
> +{
> +	if (skb->mono_delivery_time) {
> +		skb->mono_delivery_time = 0;
> +		if (static_branch_unlikely(&netstamp_needed_key))
> +			skb->tstamp = ktime_get_real();
> +		else
> +			skb->tstamp = 0;
> +	}
> +}
> +
>   static inline void skb_clear_tstamp(struct sk_buff *skb)
[...]
> @@ -2199,6 +2199,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
>   	spin_lock(&sk->sk_receive_queue.lock);
>   	po->stats.stats1.tp_packets++;
>   	sock_skb_set_dropcount(sk, skb);
> +	skb_clear_delivery_time(skb);

Maybe not fully clear from your description, but for ingress taps, we are allowed
to mangle the timestamp here because the main recv loop enters taps via deliver_skb(),
which bumps the skb->users refcount, so {t,}packet_rcv() always hits the
skb_shared(skb) case and then clones the skb (and for egress we are covered anyway,
given dev_queue_xmit_nit() will skb_clone() once anyway for the tx tstamp)?

>   	__skb_queue_tail(&sk->sk_receive_queue, skb);
>   	spin_unlock(&sk->sk_receive_queue.lock);
>   	sk->sk_data_ready(sk);
> @@ -2377,6 +2378,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>   	po->stats.stats1.tp_packets++;
>   	if (copy_skb) {
>   		status |= TP_STATUS_COPY;
> +		skb_clear_delivery_time(copy_skb);
>   		__skb_queue_tail(&sk->sk_receive_queue, copy_skb);
>   	}
>   	spin_unlock(&sk->sk_receive_queue.lock);
> 

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 net-next 11/13] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress
  2022-03-02 19:56 ` [PATCH v6 net-next 11/13] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress Martin KaFai Lau
@ 2022-03-03 13:00   ` Daniel Borkmann
  2022-03-03 20:43     ` Martin KaFai Lau
  0 siblings, 1 reply; 27+ messages in thread
From: Daniel Borkmann @ 2022-03-03 13:00 UTC (permalink / raw)
  To: Martin KaFai Lau, bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, David Miller, Eric Dumazet,
	Jakub Kicinski, kernel-team, Willem de Bruijn

On 3/2/22 8:56 PM, Martin KaFai Lau wrote:
> The current tc-bpf@ingress reads and writes the __sk_buff->tstamp
> as a (rcv) timestamp which currently could either be 0 (not available)
> or ktime_get_real().  This patch is to backward compatible with the
> (rcv) timestamp expectation at ingress.  If the skb->tstamp has
> the delivery_time, the bpf insn rewrite will read 0 for tc-bpf
> running at ingress as it is not available.  When writing at ingress,
> it will also clear the skb->mono_delivery_time bit.
> 
> /* BPF_READ: a = __sk_buff->tstamp */
> if (!skb->tc_at_ingress || !skb->mono_delivery_time)
> 	a = skb->tstamp;
> else
> 	a = 0
> 
> /* BPF_WRITE: __sk_buff->tstamp = a */
> if (skb->tc_at_ingress)
> 	skb->mono_delivery_time = 0;
> skb->tstamp = a;
> 
> [ A note on the BPF_CGROUP_INET_INGRESS which can also access
>    skb->tstamp.  At that point, the skb is delivered locally
>    and skb_clear_delivery_time() has already been done,
>    so the skb->tstamp will only have the (rcv) timestamp. ]
> 
> If the tc-bpf@egress writes 0 to skb->tstamp, the skb->mono_delivery_time
> has to be cleared also.  It could be done together during
> convert_ctx_access().  However, the latter patch will also expose
> the skb->mono_delivery_time bit as __sk_buff->delivery_time_type.
> Changing the delivery_time_type in the background may surprise
> the user, e.g. the 2nd read on __sk_buff->delivery_time_type
> may need a READ_ONCE() to avoid compiler optimization.  Thus,
> in expecting the needs in the latter patch, this patch does a
> check on !skb->tstamp after running the tc-bpf and clears the
> skb->mono_delivery_time bit if needed.  The earlier discussion
> on v4 [0].
> 
> The bpf insn rewrite requires the skb's mono_delivery_time bit and
> tc_at_ingress bit.  They are moved up in sk_buff so that bpf rewrite
> can be done at a fixed offset.  tc_skip_classify is moved together with
> tc_at_ingress.  To get one bit for mono_delivery_time, csum_not_inet is
> moved down and this bit is currently used by sctp.
> 
> [0]: https://lore.kernel.org/bpf/20220217015043.khqwqklx45c4m4se@kafai-mbp.dhcp.thefacebook.com/
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>   include/linux/skbuff.h | 18 +++++++----
>   net/core/filter.c      | 71 ++++++++++++++++++++++++++++++++++++------
>   net/sched/act_bpf.c    |  2 ++
>   net/sched/cls_bpf.c    |  2 ++
>   4 files changed, 77 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 4b5b926a81f2..5445860e1ba6 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -941,8 +941,12 @@ struct sk_buff {
>   	__u8			vlan_present:1;	/* See PKT_VLAN_PRESENT_BIT */
>   	__u8			csum_complete_sw:1;
>   	__u8			csum_level:2;
> -	__u8			csum_not_inet:1;
>   	__u8			dst_pending_confirm:1;
> +	__u8			mono_delivery_time:1;
> +#ifdef CONFIG_NET_CLS_ACT
> +	__u8			tc_skip_classify:1;
> +	__u8			tc_at_ingress:1;
> +#endif
>   #ifdef CONFIG_IPV6_NDISC_NODETYPE
>   	__u8			ndisc_nodetype:2;
>   #endif
> @@ -953,10 +957,6 @@ struct sk_buff {
>   #ifdef CONFIG_NET_SWITCHDEV
>   	__u8			offload_fwd_mark:1;
>   	__u8			offload_l3_fwd_mark:1;
> -#endif
> -#ifdef CONFIG_NET_CLS_ACT
> -	__u8			tc_skip_classify:1;
> -	__u8			tc_at_ingress:1;
>   #endif
>   	__u8			redirected:1;
>   #ifdef CONFIG_NET_REDIRECT
> @@ -969,7 +969,7 @@ struct sk_buff {
>   	__u8			decrypted:1;
>   #endif
>   	__u8			slow_gro:1;
> -	__u8			mono_delivery_time:1;
> +	__u8			csum_not_inet:1;
>   
>   #ifdef CONFIG_NET_SCHED
>   	__u16			tc_index;	/* traffic control index */
> @@ -1047,10 +1047,16 @@ struct sk_buff {
>   /* if you move pkt_vlan_present around you also must adapt these constants */
>   #ifdef __BIG_ENDIAN_BITFIELD
>   #define PKT_VLAN_PRESENT_BIT	7
> +#define TC_AT_INGRESS_MASK		(1 << 0)
> +#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 2)
>   #else
>   #define PKT_VLAN_PRESENT_BIT	0
> +#define TC_AT_INGRESS_MASK		(1 << 7)
> +#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 5)
>   #endif
>   #define PKT_VLAN_PRESENT_OFFSET	offsetof(struct sk_buff, __pkt_vlan_present_offset)
> +#define TC_AT_INGRESS_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
> +#define SKB_MONO_DELIVERY_TIME_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)

Just nit, but given PKT_VLAN_PRESENT_OFFSET, TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET
are all the same offsetof(struct sk_buff, __pkt_vlan_present_offset), maybe let's use just one single
define? If anyone moves them out, they would have to adapt as per the comment.

>   #ifdef __KERNEL__
>   /*
> diff --git a/net/core/filter.c b/net/core/filter.c
> index cfcf9b4d1ec2..5072733743e9 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -8859,6 +8859,65 @@ static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si,
>   	return insn;
>   }
>   
> +static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn *si,
> +						struct bpf_insn *insn)
> +{
> +	__u8 value_reg = si->dst_reg;
> +	__u8 skb_reg = si->src_reg;
> +
> +#ifdef CONFIG_NET_CLS_ACT
> +	__u8 tmp_reg = BPF_REG_AX;
> +
> +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
> +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);

nit: As far as I can see, can't si->dst_reg be used instead of AX?

> +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5);
> +	/* @ingress, read __sk_buff->tstamp as the (rcv) timestamp,
> +	 * so check the skb->mono_delivery_time.
> +	 */
> +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
> +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
> +				SKB_MONO_DELIVERY_TIME_MASK);
> +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
> +	/* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */
> +	*insn++ = BPF_MOV64_IMM(value_reg, 0);
> +	*insn++ = BPF_JMP_A(1);
> +#endif
> +
> +	*insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg,
> +			      offsetof(struct sk_buff, tstamp));
> +	return insn;
> +}
> +
> +static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_insn *si,
> +						 struct bpf_insn *insn)
> +{
> +	__u8 value_reg = si->src_reg;
> +	__u8 skb_reg = si->dst_reg;
> +
> +#ifdef CONFIG_NET_CLS_ACT
> +	__u8 tmp_reg = BPF_REG_AX;
> +
> +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
> +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);

Can't we get rid of tcf_bpf_act() and cls_bpf_classify() changes altogether by just doing:

   /* BPF_WRITE: __sk_buff->tstamp = a */
   skb->mono_delivery_time = !skb->tc_at_ingress && a;
   skb->tstamp = a;

(Untested) pseudo code:

   // or see comment on common SKB_FLAGS_OFFSET define or such
   BUILD_BUG_ON(TC_AT_INGRESS_OFFSET != SKB_MONO_DELIVERY_TIME_OFFSET)

   BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, SKB_MONO_DELIVERY_TIME_OFFSET)
   BPF_ALU32_IMM(BPF_OR, tmp_reg, SKB_MONO_DELIVERY_TIME_MASK)
   BPF_JMP32_IMM(BPF_JSET, tmp_reg, TC_AT_INGRESS_MASK, <clear>)
   BPF_JMP32_REG(BPF_JGE, value_reg, tmp_reg, <store>)
<clear>:
   BPF_ALU32_IMM(BPF_AND, tmp_reg, ~SKB_MONO_DELIVERY_TIME_MASK)
<store>:
   BPF_STX_MEM(BPF_B, skb_reg, tmp_reg, SKB_MONO_DELIVERY_TIME_OFFSET)
   BPF_STX_MEM(BPF_DW, skb_reg, value_reg, offsetof(struct sk_buff, tstamp))

(There's a small hack with the BPF_JGE for tmp_reg, so constant blinding for AX doesn't
get into our way.)

> +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 3);
> +	/* Writing __sk_buff->tstamp at ingress as the (rcv) timestamp.
> +	 * Clear the skb->mono_delivery_time.
> +	 */
> +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
> +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
> +				~SKB_MONO_DELIVERY_TIME_MASK);
> +	*insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg,
> +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> +#endif
> +
> +	/* skb->tstamp = tstamp */
> +	*insn++ = BPF_STX_MEM(BPF_DW, skb_reg, value_reg,
> +			      offsetof(struct sk_buff, tstamp));
> +	return insn;
> +}
> +
>   static u32 bpf_convert_ctx_access(enum bpf_access_type type,
>   				  const struct bpf_insn *si,
>   				  struct bpf_insn *insn_buf,
> @@ -9167,17 +9226,9 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
>   		BUILD_BUG_ON(sizeof_field(struct sk_buff, tstamp) != 8);
>   
>   		if (type == BPF_WRITE)
> -			*insn++ = BPF_STX_MEM(BPF_DW,
> -					      si->dst_reg, si->src_reg,
> -					      bpf_target_off(struct sk_buff,
> -							     tstamp, 8,
> -							     target_size));
> +			insn = bpf_convert_tstamp_write(si, insn);
>   		else
> -			*insn++ = BPF_LDX_MEM(BPF_DW,
> -					      si->dst_reg, si->src_reg,
> -					      bpf_target_off(struct sk_buff,
> -							     tstamp, 8,
> -							     target_size));
> +			insn = bpf_convert_tstamp_read(si, insn);
>   		break;
>   
>   	case offsetof(struct __sk_buff, gso_segs):
> diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
> index a77d8908e737..fea2d78b9ddc 100644
> --- a/net/sched/act_bpf.c
> +++ b/net/sched/act_bpf.c
> @@ -53,6 +53,8 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
>   		bpf_compute_data_pointers(skb);
>   		filter_res = bpf_prog_run(filter, skb);
>   	}
> +	if (unlikely(!skb->tstamp && skb->mono_delivery_time))
> +		skb->mono_delivery_time = 0;
>   	if (skb_sk_is_prefetched(skb) && filter_res != TC_ACT_OK)
>   		skb_orphan(skb);
>   
> diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
> index df19a847829e..c85b85a192bf 100644
> --- a/net/sched/cls_bpf.c
> +++ b/net/sched/cls_bpf.c
> @@ -102,6 +102,8 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>   			bpf_compute_data_pointers(skb);
>   			filter_res = bpf_prog_run(prog->filter, skb);
>   		}
> +		if (unlikely(!skb->tstamp && skb->mono_delivery_time))
> +			skb->mono_delivery_time = 0;
>   
>   		if (prog->exts_integrated) {
>   			res->class   = 0;
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 net-next 0/13] Preserve mono delivery time (EDT) in skb->tstamp
  2022-03-02 19:55 [PATCH v6 net-next 0/13] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
                   ` (13 preceding siblings ...)
  2022-03-03  0:36 ` [PATCH v6 net-next 0/13] Preserve mono delivery time (EDT) in skb->tstamp Eric Dumazet
@ 2022-03-03 14:50 ` patchwork-bot+netdevbpf
  14 siblings, 0 replies; 27+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-03-03 14:50 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, ast, andrii, daniel, davem, edumazet, kuba,
	kernel-team, willemb

Hello:

This series was applied to netdev/net-next.git (master)
by David S. Miller <davem@davemloft.net>:

On Wed, 2 Mar 2022 11:55:19 -0800 you wrote:
> skb->tstamp was first used as the (rcv) timestamp.
> The major usage is to report it to the user (e.g. SO_TIMESTAMP).
> 
> Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
> during egress and used by the qdisc (e.g. sch_fq) to make decision on when
> the skb can be passed to the dev.
> 
> [...]

Here is the summary with links:
  - [v6,net-next,01/13] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
    https://git.kernel.org/netdev/net-next/c/a1ac9c8acec1
  - [v6,net-next,02/13] net: Add skb_clear_tstamp() to keep the mono delivery_time
    https://git.kernel.org/netdev/net-next/c/de799101519a
  - [v6,net-next,03/13] net: Handle delivery_time in skb->tstamp during network tapping with af_packet
    https://git.kernel.org/netdev/net-next/c/27942a15209f
  - [v6,net-next,04/13] net: Clear mono_delivery_time bit in __skb_tstamp_tx()
    https://git.kernel.org/netdev/net-next/c/d93376f503c7
  - [v6,net-next,05/13] net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()
    https://git.kernel.org/netdev/net-next/c/d98d58a00261
  - [v6,net-next,06/13] net: ip: Handle delivery_time in ip defrag
    https://git.kernel.org/netdev/net-next/c/8672406eb5d7
  - [v6,net-next,07/13] net: ipv6: Handle delivery_time in ipv6 defrag
    https://git.kernel.org/netdev/net-next/c/335c8cf3b537
  - [v6,net-next,08/13] net: ipv6: Get rcv timestamp if needed when handling hop-by-hop IOAM option
    https://git.kernel.org/netdev/net-next/c/b6561f8491ca
  - [v6,net-next,09/13] net: Get rcv tstamp if needed in nfnetlink_{log, queue}.c
    https://git.kernel.org/netdev/net-next/c/80fcec675112
  - [v6,net-next,10/13] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally
    https://git.kernel.org/netdev/net-next/c/cd14e9b7b8d3
  - [v6,net-next,11/13] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress
    https://git.kernel.org/netdev/net-next/c/7449197d600d
  - [v6,net-next,12/13] bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_skb_delivery_time()
    https://git.kernel.org/netdev/net-next/c/8d21ec0e46ed
  - [v6,net-next,13/13] bpf: selftests: test skb->tstamp in redirect_neigh
    https://git.kernel.org/netdev/net-next/c/c803475fd8dd

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 net-next 03/13] net: Handle delivery_time in skb->tstamp during network tapping with af_packet
  2022-03-03 10:48   ` Daniel Borkmann
@ 2022-03-03 19:17     ` Martin KaFai Lau
  0 siblings, 0 replies; 27+ messages in thread
From: Martin KaFai Lau @ 2022-03-03 19:17 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team, Willem de Bruijn

On Thu, Mar 03, 2022 at 11:48:19AM +0100, Daniel Borkmann wrote:
> On 3/2/22 8:55 PM, Martin KaFai Lau wrote:
> [...]
> > When tapping at ingress, it currently expects the skb->tstamp is either 0
> > or the (rcv) timestamp.  Meaning, the tapping at ingress path
> > has already expected the skb->tstamp could be 0 and it will get
> > the (rcv) timestamp by ktime_get_real() when needed.
> > 
> > There are two cases for tapping at ingress:
> > 
> > One case is af_packet queues the skb to its sk_receive_queue.
> > The skb is either not shared or new clone created.  The newly
> > added skb_clear_delivery_time() is called to clear the
> > delivery_time (if any) and set the (rcv) timestamp if
> > needed before the skb is queued to the sk_receive_queue.
> [...]
> > +DECLARE_STATIC_KEY_FALSE(netstamp_needed_key);
> > +
> > +/* It is used in the ingress path to clear the delivery_time.
> > + * If needed, set the skb->tstamp to the (rcv) timestamp.
> > + */
> > +static inline void skb_clear_delivery_time(struct sk_buff *skb)
> > +{
> > +	if (skb->mono_delivery_time) {
> > +		skb->mono_delivery_time = 0;
> > +		if (static_branch_unlikely(&netstamp_needed_key))
> > +			skb->tstamp = ktime_get_real();
> > +		else
> > +			skb->tstamp = 0;
> > +	}
> > +}
> > +
> >   static inline void skb_clear_tstamp(struct sk_buff *skb)
> [...]
> > @@ -2199,6 +2199,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
> >   	spin_lock(&sk->sk_receive_queue.lock);
> >   	po->stats.stats1.tp_packets++;
> >   	sock_skb_set_dropcount(sk, skb);
> > +	skb_clear_delivery_time(skb);
> 
> Maybe not fully clear from your description, but for ingress taps, we are allowed
> to mangle the timestamp here because the main recv loop enters taps via deliver_skb(),
> which bumps the skb->users refcount, so {t,}packet_rcv() always hits the
> skb_shared(skb) case and then clones the skb (and for egress we are covered anyway,
> given dev_queue_xmit_nit() will skb_clone() once anyway for the tx tstamp)?
Yes, refcount_inc(&skb->users) followed by skb_clone() is also my
understanding of the ingress tapping path.  On top of that, in general,
the current {t,}packet_rcv() changes other fields of the skb before
queuing it, so it has to be a clone.
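
For reference, the refcount bump in question, abridged from
deliver_skb() in net/core/dev.c:

	static inline int deliver_skb(struct sk_buff *skb,
				      struct packet_type *pt_prev,
				      struct net_device *orig_dev)
	{
		if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
			return -ENOMEM;
		/* extra reference: skb_shared() is true in the tap */
		refcount_inc(&skb->users);
		return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
	}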

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 net-next 11/13] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress
  2022-03-03 13:00   ` Daniel Borkmann
@ 2022-03-03 20:43     ` Martin KaFai Lau
  2022-03-03 22:55       ` Daniel Borkmann
  0 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2022-03-03 20:43 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team, Willem de Bruijn

On Thu, Mar 03, 2022 at 02:00:37PM +0100, Daniel Borkmann wrote:
> On 3/2/22 8:56 PM, Martin KaFai Lau wrote:
> > If the tc-bpf@egress writes 0 to skb->tstamp, the skb->mono_delivery_time
> > has to be cleared also.  It could be done together during
> > convert_ctx_access().  However, the latter patch will also expose
> > the skb->mono_delivery_time bit as __sk_buff->delivery_time_type.
> > Changing the delivery_time_type in the background may surprise
> > the user, e.g. the 2nd read on __sk_buff->delivery_time_type
> > may need a READ_ONCE() to avoid compiler optimization.  Thus,
> > in expecting the needs in the latter patch, this patch does a
> > check on !skb->tstamp after running the tc-bpf and clears the
> > skb->mono_delivery_time bit if needed.  The earlier discussion
> > on v4 [0].

[ ... ]

> > @@ -1047,10 +1047,16 @@ struct sk_buff {
> >   /* if you move pkt_vlan_present around you also must adapt these constants */
> >   #ifdef __BIG_ENDIAN_BITFIELD
> >   #define PKT_VLAN_PRESENT_BIT	7
> > +#define TC_AT_INGRESS_MASK		(1 << 0)
> > +#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 2)
> >   #else
> >   #define PKT_VLAN_PRESENT_BIT	0
> > +#define TC_AT_INGRESS_MASK		(1 << 7)
> > +#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 5)
> >   #endif
> >   #define PKT_VLAN_PRESENT_OFFSET	offsetof(struct sk_buff, __pkt_vlan_present_offset)
> > +#define TC_AT_INGRESS_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
> > +#define SKB_MONO_DELIVERY_TIME_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
> 
> Just nit, but given PKT_VLAN_PRESENT_OFFSET, TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET
> are all the same offsetof(struct sk_buff, __pkt_vlan_present_offset), maybe let's use just one single
> define? If anyone moves them out, they would have to adapt as per the comment.
Makes sense.  I will update the comment, remove these two defines
and reuse the PKT_VLAN_PRESENT_OFFSET.  Considering it
is more bpf insn rewrite specific, I will do a
follow-up in filter.c and skbuff.h at bpf-next.

> >   #ifdef __KERNEL__
> >   /*
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index cfcf9b4d1ec2..5072733743e9 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -8859,6 +8859,65 @@ static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si,
> >   	return insn;
> >   }
> > +static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn *si,
> > +						struct bpf_insn *insn)
> > +{
> > +	__u8 value_reg = si->dst_reg;
> > +	__u8 skb_reg = si->src_reg;
> > +
> > +#ifdef CONFIG_NET_CLS_ACT
> > +	__u8 tmp_reg = BPF_REG_AX;
> > +
> > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
> > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
> 
> nit: As far as I can see, can't si->dst_reg be used instead of AX?
Ah.  This one got me also when using dst_reg as a tmp. dst_reg and src_reg
can be the same:

;           skb->tstamp == EGRESS_FWDNS_MAGIC)
     169:       r1 = *(u64 *)(r1 + 152)

> 
> > +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5);
> > +	/* @ingress, read __sk_buff->tstamp as the (rcv) timestamp,
> > +	 * so check the skb->mono_delivery_time.
> > +	 */
> > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
> > +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
> > +				SKB_MONO_DELIVERY_TIME_MASK);
> > +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
> > +	/* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */
> > +	*insn++ = BPF_MOV64_IMM(value_reg, 0);
> > +	*insn++ = BPF_JMP_A(1);
> > +#endif
> > +
> > +	*insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg,
> > +			      offsetof(struct sk_buff, tstamp));
> > +	return insn;
> > +}
> > +
> > +static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_insn *si,
> > +						 struct bpf_insn *insn)
> > +{
> > +	__u8 value_reg = si->src_reg;
> > +	__u8 skb_reg = si->dst_reg;
> > +
> > +#ifdef CONFIG_NET_CLS_ACT
> > +	__u8 tmp_reg = BPF_REG_AX;
> > +
> > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
> > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
> 
> Can't we get rid of tcf_bpf_act() and cls_bpf_classify() changes altogether by just doing:
> 
>   /* BPF_WRITE: __sk_buff->tstamp = a */
>   skb->mono_delivery_time = !skb->tc_at_ingress && a;
>   skb->tstamp = a;
It will then assume the bpf prog is writing a mono time.
Although mono should always be the case now,  this assumption will be
an issue in the future if we need to support non-mono.

> 
> (Untested) pseudo code:
> 
>   // or see comment on common SKB_FLAGS_OFFSET define or such
>   BUILD_BUG_ON(TC_AT_INGRESS_OFFSET != SKB_MONO_DELIVERY_TIME_OFFSET)
> 
>   BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, SKB_MONO_DELIVERY_TIME_OFFSET)
>   BPF_ALU32_IMM(BPF_OR, tmp_reg, SKB_MONO_DELIVERY_TIME_MASK)
>   BPF_JMP32_IMM(BPF_JSET, tmp_reg, TC_AT_INGRESS_MASK, <clear>)
This can save a BPF_ALU32_IMM(BPF_AND).  I will do that
together in the follow up. Thanks for the idea !

>   BPF_JMP32_REG(BPF_JGE, value_reg, tmp_reg, <store>)
> <clear>:
>   BPF_ALU32_IMM(BPF_AND, tmp_reg, ~SKB_MONO_DELIVERY_TIME_MASK)
> <store>:
>   BPF_STX_MEM(BPF_B, skb_reg, tmp_reg, SKB_MONO_DELIVERY_TIME_OFFSET)
>   BPF_STX_MEM(BPF_DW, skb_reg, value_reg, offsetof(struct sk_buff, tstamp))
> 
> (There's a small hack with the BPF_JGE for tmp_reg, so constant blinding for AX doesn't
> get into our way.)
> 
> > +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 3);
> > +	/* Writing __sk_buff->tstamp at ingress as the (rcv) timestamp.
> > +	 * Clear the skb->mono_delivery_time.
> > +	 */
> > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
> > +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
> > +				~SKB_MONO_DELIVERY_TIME_MASK);
> > +	*insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg,
> > +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> > +#endif
> > +
> > +	/* skb->tstamp = tstamp */
> > +	*insn++ = BPF_STX_MEM(BPF_DW, skb_reg, value_reg,
> > +			      offsetof(struct sk_buff, tstamp));
> > +	return insn;
> > +}
> > +

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 net-next 11/13] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress
  2022-03-03 20:43     ` Martin KaFai Lau
@ 2022-03-03 22:55       ` Daniel Borkmann
  2022-03-03 23:42         ` Martin KaFai Lau
  0 siblings, 1 reply; 27+ messages in thread
From: Daniel Borkmann @ 2022-03-03 22:55 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team, Willem de Bruijn

On 3/3/22 9:43 PM, Martin KaFai Lau wrote:
> On Thu, Mar 03, 2022 at 02:00:37PM +0100, Daniel Borkmann wrote:
>> On 3/2/22 8:56 PM, Martin KaFai Lau wrote:
>>> If the tc-bpf@egress writes 0 to skb->tstamp, the skb->mono_delivery_time
>>> has to be cleared also.  It could be done together during
>>> convert_ctx_access().  However, the latter patch will also expose
>>> the skb->mono_delivery_time bit as __sk_buff->delivery_time_type.
>>> Changing the delivery_time_type in the background may surprise
>>> the user, e.g. the 2nd read on __sk_buff->delivery_time_type
>>> may need a READ_ONCE() to avoid compiler optimization.  Thus,
>>> in expecting the needs in the latter patch, this patch does a
>>> check on !skb->tstamp after running the tc-bpf and clears the
>>> skb->mono_delivery_time bit if needed.  The earlier discussion
>>> on v4 [0].
> 
> [ ... ]
> 
>>> @@ -1047,10 +1047,16 @@ struct sk_buff {
>>>    /* if you move pkt_vlan_present around you also must adapt these constants */
>>>    #ifdef __BIG_ENDIAN_BITFIELD
>>>    #define PKT_VLAN_PRESENT_BIT	7
>>> +#define TC_AT_INGRESS_MASK		(1 << 0)
>>> +#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 2)
>>>    #else
>>>    #define PKT_VLAN_PRESENT_BIT	0
>>> +#define TC_AT_INGRESS_MASK		(1 << 7)
>>> +#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 5)
>>>    #endif
>>>    #define PKT_VLAN_PRESENT_OFFSET	offsetof(struct sk_buff, __pkt_vlan_present_offset)
>>> +#define TC_AT_INGRESS_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
>>> +#define SKB_MONO_DELIVERY_TIME_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
>>
>> Just nit, but given PKT_VLAN_PRESENT_OFFSET, TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET
>> are all the same offsetof(struct sk_buff, __pkt_vlan_present_offset), maybe let's use just one single
>> define? If anyone moves them out, they would have to adapt as per the comment.
> Makes sense.  I will update the comment, remove these two defines
> and reuse the PKT_VLAN_PRESENT_OFFSET.  Considering it
> is more bpf insn rewrite specific, I will do a
> follow-up in filter.c and skbuff.h at bpf-next.

Ok, sounds good!

>>>    #ifdef __KERNEL__
>>>    /*
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index cfcf9b4d1ec2..5072733743e9 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -8859,6 +8859,65 @@ static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si,
>>>    	return insn;
>>>    }
>>> +static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn *si,
>>> +						struct bpf_insn *insn)
>>> +{
>>> +	__u8 value_reg = si->dst_reg;
>>> +	__u8 skb_reg = si->src_reg;
>>> +
>>> +#ifdef CONFIG_NET_CLS_ACT
>>> +	__u8 tmp_reg = BPF_REG_AX;
>>> +
>>> +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
>>> +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
>>
>> nit: As far as I can see, can't si->dst_reg be used instead of AX?
> Ah.  This one got me also when using dst_reg as a tmp. dst_reg and src_reg
> can be the same:
> 
> ;           skb->tstamp == EGRESS_FWDNS_MAGIC)
>       169:       r1 = *(u64 *)(r1 + 152)

Ah true, good point. Probably makes sense to add a small comment explaining
the rationale for use of AX here.

>>> +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5);
>>> +	/* @ingress, read __sk_buff->tstamp as the (rcv) timestamp,
>>> +	 * so check the skb->mono_delivery_time.
>>> +	 */
>>> +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
>>> +			      SKB_MONO_DELIVERY_TIME_OFFSET);
>>> +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
>>> +				SKB_MONO_DELIVERY_TIME_MASK);
>>> +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
>>> +	/* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */
>>> +	*insn++ = BPF_MOV64_IMM(value_reg, 0);
>>> +	*insn++ = BPF_JMP_A(1);
>>> +#endif
>>> +
>>> +	*insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg,
>>> +			      offsetof(struct sk_buff, tstamp));
>>> +	return insn;
>>> +}
>>> +
>>> +static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_insn *si,
>>> +						 struct bpf_insn *insn)
>>> +{
>>> +	__u8 value_reg = si->src_reg;
>>> +	__u8 skb_reg = si->dst_reg;
>>> +
>>> +#ifdef CONFIG_NET_CLS_ACT
>>> +	__u8 tmp_reg = BPF_REG_AX;
>>> +
>>> +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
>>> +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
>>
>> Can't we get rid of tcf_bpf_act() and cls_bpf_classify() changes altogether by just doing:
>>
>>    /* BPF_WRITE: __sk_buff->tstamp = a */
>>    skb->mono_delivery_time = !skb->tc_at_ingress && a;
>>    skb->tstamp = a;
> It will then assume the bpf prog is writing a mono time.
> Although mono should always be the case now,  this assumption will be
> an issue in the future if we need to support non-mono.

Right, for that we should probably instrument the verifier to track the
base based on the ktime helper call once we get to that point.

>> (Untested) pseudo code:
>>
>>    // or see comment on common SKB_FLAGS_OFFSET define or such
>>    BUILD_BUG_ON(TC_AT_INGRESS_OFFSET != SKB_MONO_DELIVERY_TIME_OFFSET)
>>
>>    BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, SKB_MONO_DELIVERY_TIME_OFFSET)
>>    BPF_ALU32_IMM(BPF_OR, tmp_reg, SKB_MONO_DELIVERY_TIME_MASK)
>>    BPF_JMP32_IMM(BPF_JSET, tmp_reg, TC_AT_INGRESS_MASK, <clear>)
> This can save a BPF_ALU32_IMM(BPF_AND).  I will do that
> together in the follow up. Thanks for the idea !

Yeah the JSET comes in handy here.

>>    BPF_JMP32_REG(BPF_JGE, value_reg, tmp_reg, <store>)
>> <clear>:
>>    BPF_ALU32_IMM(BPF_AND, tmp_reg, ~SKB_MONO_DELIVERY_TIME_MASK)
>> <store>:
>>    BPF_STX_MEM(BPF_B, skb_reg, tmp_reg, SKB_MONO_DELIVERY_TIME_OFFSET)
>>    BPF_STX_MEM(BPF_DW, skb_reg, value_reg, offsetof(struct sk_buff, tstamp))
>>
>> (There's a small hack with the BPF_JGE for tmp_reg, so constant blinding for AX doesn't
>> get into our way.)
>>
>>> +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 3);
>>> +	/* Writing __sk_buff->tstamp at ingress as the (rcv) timestamp.
>>> +	 * Clear the skb->mono_delivery_time.
>>> +	 */
>>> +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
>>> +			      SKB_MONO_DELIVERY_TIME_OFFSET);
>>> +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
>>> +				~SKB_MONO_DELIVERY_TIME_MASK);
>>> +	*insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg,
>>> +			      SKB_MONO_DELIVERY_TIME_OFFSET);
>>> +#endif
>>> +
>>> +	/* skb->tstamp = tstamp */
>>> +	*insn++ = BPF_STX_MEM(BPF_DW, skb_reg, value_reg,
>>> +			      offsetof(struct sk_buff, tstamp));
>>> +	return insn;
>>> +}
>>> +


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v6 net-next 11/13] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress
  2022-03-03 22:55       ` Daniel Borkmann
@ 2022-03-03 23:42         ` Martin KaFai Lau
  0 siblings, 0 replies; 27+ messages in thread
From: Martin KaFai Lau @ 2022-03-03 23:42 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team, Willem de Bruijn

On Thu, Mar 03, 2022 at 11:55:55PM +0100, Daniel Borkmann wrote:
> On 3/3/22 9:43 PM, Martin KaFai Lau wrote:
> > On Thu, Mar 03, 2022 at 02:00:37PM +0100, Daniel Borkmann wrote:
> > > On 3/2/22 8:56 PM, Martin KaFai Lau wrote:

> > > > +static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_insn *si,
> > > > +						 struct bpf_insn *insn)
> > > > +{
> > > > +	__u8 value_reg = si->src_reg;
> > > > +	__u8 skb_reg = si->dst_reg;
> > > > +
> > > > +#ifdef CONFIG_NET_CLS_ACT
> > > > +	__u8 tmp_reg = BPF_REG_AX;
> > > > +
> > > > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
> > > > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
> > > 
> > > Can't we get rid of tcf_bpf_act() and cls_bpf_classify() changes altogether by just doing:
> > > 
> > >    /* BPF_WRITE: __sk_buff->tstamp = a */
> > >    skb->mono_delivery_time = !skb->tc_at_ingress && a;
> > >    skb->tstamp = a;
> > It will then assume the bpf prog is writing a mono time.
> > Although mono should always be the case now,  this assumption will be
> > an issue in the future if we need to support non-mono.
> 
> Right, for that we should probably instrument the verifier to track the
> base based on the ktime helper call once we get to that point.
The bpf prog does not necessarily write a value which is obtained from a
bpf ktime helper.  It could be a value based on the previous skb sent
from a cgroup.
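
For example, a hypothetical tc-bpf fragment where the written value is
map-derived rather than coming from a ktime helper (edt_map, cg_key and
PACING_GAP_NS are made-up names):

	__u64 *last_edt = bpf_map_lookup_elem(&edt_map, &cg_key);

	if (last_edt)
		/* the base is whatever was stored, not bpf_ktime_get_ns() */
		skb->tstamp = *last_edt + PACING_GAP_NS;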

> 
> > > (Untested) pseudo code:
> > > 
> > >    // or see comment on common SKB_FLAGS_OFFSET define or such
> > >    BUILD_BUG_ON(TC_AT_INGRESS_OFFSET != SKB_MONO_DELIVERY_TIME_OFFSET)
> > > 
> > >    BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, SKB_MONO_DELIVERY_TIME_OFFSET)
> > >    BPF_ALU32_IMM(BPF_OR, tmp_reg, SKB_MONO_DELIVERY_TIME_MASK)
> > >    BPF_JMP32_IMM(BPF_JSET, tmp_reg, TC_AT_INGRESS_MASK, <clear>)
> > This can save a BPF_ALU32_IMM(BPF_AND).  I will do that
> > together in the follow up. Thanks for the idea !
> 
> Yeah the JSET comes in handy here.
> 
> > >    BPF_JMP32_REG(BPF_JGE, value_reg, tmp_reg, <store>)
> > > <clear>:
> > >    BPF_ALU32_IMM(BPF_AND, tmp_reg, ~SKB_MONO_DELIVERY_TIME_MASK)
> > > <store>:
> > >    BPF_STX_MEM(BPF_B, skb_reg, tmp_reg, SKB_MONO_DELIVERY_TIME_OFFSET)
> > >    BPF_STX_MEM(BPF_DW, skb_reg, value_reg, offsetof(struct sk_buff, tstamp))
> > > 
> > > (There's a small hack with the BPF_JGE for tmp_reg, so constant blinding for AX doesn't
> > > get into our way.)
> > > 
> > > > +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 3);
> > > > +	/* Writing __sk_buff->tstamp at ingress as the (rcv) timestamp.
> > > > +	 * Clear the skb->mono_delivery_time.
> > > > +	 */
> > > > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
> > > > +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> > > > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
> > > > +				~SKB_MONO_DELIVERY_TIME_MASK);
> > > > +	*insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg,
> > > > +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> > > > +#endif
> > > > +
> > > > +	/* skb->tstamp = tstamp */
> > > > +	*insn++ = BPF_STX_MEM(BPF_DW, skb_reg, value_reg,
> > > > +			      offsetof(struct sk_buff, tstamp));
> > > > +	return insn;
> > > > +}
> > > > +
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-03-03 23:42 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-02 19:55 [PATCH v6 net-next 0/13] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
2022-03-02 19:55 ` [PATCH v6 net-next 01/13] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Martin KaFai Lau
2022-03-02 19:55 ` [PATCH v6 net-next 02/13] net: Add skb_clear_tstamp() to keep the mono delivery_time Martin KaFai Lau
2022-03-02 19:55 ` [PATCH v6 net-next 03/13] net: Handle delivery_time in skb->tstamp during network tapping with af_packet Martin KaFai Lau
2022-03-03 10:48   ` Daniel Borkmann
2022-03-03 19:17     ` Martin KaFai Lau
2022-03-02 19:55 ` [PATCH v6 net-next 04/13] net: Clear mono_delivery_time bit in __skb_tstamp_tx() Martin KaFai Lau
2022-03-02 19:55 ` [PATCH v6 net-next 05/13] net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() Martin KaFai Lau
2022-03-02 19:55 ` [PATCH v6 net-next 06/13] net: ip: Handle delivery_time in ip defrag Martin KaFai Lau
2022-03-02 19:56 ` [PATCH v6 net-next 07/13] net: ipv6: Handle delivery_time in ipv6 defrag Martin KaFai Lau
2022-03-02 19:56 ` [PATCH v6 net-next 08/13] net: ipv6: Get rcv timestamp if needed when handling hop-by-hop IOAM option Martin KaFai Lau
2022-03-02 19:56 ` [PATCH v6 net-next 09/13] net: Get rcv tstamp if needed in nfnetlink_{log, queue}.c Martin KaFai Lau
2022-03-02 19:56 ` [PATCH v6 net-next 10/13] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally Martin KaFai Lau
2022-03-02 20:30   ` Eric Dumazet
2022-03-02 22:33     ` Martin KaFai Lau
2022-03-02 23:41       ` Eric Dumazet
2022-03-03  0:19         ` Martin KaFai Lau
2022-03-03  0:47           ` Eric Dumazet
2022-03-02 19:56 ` [PATCH v6 net-next 11/13] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress Martin KaFai Lau
2022-03-03 13:00   ` Daniel Borkmann
2022-03-03 20:43     ` Martin KaFai Lau
2022-03-03 22:55       ` Daniel Borkmann
2022-03-03 23:42         ` Martin KaFai Lau
2022-03-02 19:56 ` [PATCH v6 net-next 12/13] bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_skb_delivery_time() Martin KaFai Lau
2022-03-02 19:56 ` [PATCH v6 net-next 13/13] bpf: selftests: test skb->tstamp in redirect_neigh Martin KaFai Lau
2022-03-03  0:36 ` [PATCH v6 net-next 0/13] Preserve mono delivery time (EDT) in skb->tstamp Eric Dumazet
2022-03-03 14:50 ` patchwork-bot+netdevbpf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.