* [PATCH v4 net-next 0/8] Preserve mono delivery time (EDT) in skb->tstamp
@ 2022-02-11  7:12 Martin KaFai Lau
  2022-02-11  7:12 ` [PATCH v4 net-next 1/8] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Martin KaFai Lau
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: Martin KaFai Lau @ 2022-02-11  7:12 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

skb->tstamp was first used as the (rcv) timestamp.
Its major usage is to report the timestamp to the user (e.g. SO_TIMESTAMP).

Later, skb->tstamp was also set to the (future) delivery_time (e.g. EDT in TCP)
during egress and used by the qdisc (e.g. sch_fq) to decide when
the skb can be passed to the dev.

Currently, there is no way to tell whether skb->tstamp has the (rcv)
timestamp or the delivery_time, so it is always reset to 0 whenever
forwarded between egress and ingress.

While it makes sense to always clear the (rcv) timestamp in skb->tstamp
to avoid confusing sch_fq, which expects the delivery_time, it is a
performance issue [0] to clear the delivery_time if the skb finally
egresses to a fq@phy-dev.

This set keeps the mono delivery time and makes it available to
the final egress interface.  Please see the individual patches for
details.

[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf

v4:
netdev:
- Push the skb_clear_delivery_time() from
  ip_local_deliver() and ip6_input() to
  ip_local_deliver_finish() and ip6_input_finish()
  to accommodate the ipvs forward path.
  This is the notable change in v4 at the netdev side.

    - Patch 3/8 first does the skb_clear_delivery_time() after
      sch_handle_ingress() in dev.c and this will make the
      tc-bpf forward path work via the bpf_redirect_*() helper.

    - The next patch 4/8 (new in v4) will then postpone the
      skb_clear_delivery_time() from dev.c to
      the ip_local_deliver_finish() and ip6_input_finish() after
      taking care of the tstamp usage in the ip defrag case.
      This will make the kernel forward path also work, e.g.
      the ip[6]_forward().

- Fixed a case missed in v3: the skb->mono_delivery_time bit was not
  set when sending TCP rst/ack in some cases (e.g. from a ctl_sk).
  That case happens in ip_send_unicast_reply() and
  tcp_v6_send_response().  It is fixed in patch 1/8 (and
  then patch 3/8) in v4.

bpf:
- Adding __sk_buff->delivery_time_type instead of
  __sk_buff->mono_delivery_time as in v3.  The tc-bpf can stay with
  one __sk_buff->tstamp instead of having two 'time' fields
  with one being 0 while the other is not.
  tc-bpf can use the new __sk_buff->delivery_time_type to tell
  what is stored in __sk_buff->tstamp.
- bpf_skb_set_delivery_time() helper is added to convert
  __sk_buff->tstamp from a non-mono delivery_time to a
  mono delivery_time.
- Most of the convert_ctx_access() bpf insn rewrite in v3
  is gone, so no new rewrite added for __sk_buff->tstamp.
  The only rewrite added is for reading the new
  __sk_buff->delivery_time_type.
- Added selftests, test_tc_dtime.c

v3:
- Feedback from v2 was that using shinfo(skb)->tx_flags could be racy.
- Considered reusing a few bits in skb->tstamp to represent
  different semantics; besides more code churn, it would break
  the bpf usecase which currently can write and then read back
  the skb->tstamp.
- Went back to the v1 idea of adding a bit to the skb and addressed the
  feedback on v1:
- Added one bit, skb->mono_delivery_time, to flag that
  the skb->tstamp has the mono delivery_time (EDT), instead
  of adding a bit to flag whether the skb->tstamp has been forwarded.
- Instead of resetting the delivery_time back to the (rcv) timestamp
  during the recvmsg syscall, which may be too late and not useful,
  the delivery_time reset in v3 happens earlier, once the stack
  knows that the skb will be delivered locally.
- Handled the tapping@ingress case by af_packet
- No need to change the (rcv) timestamp to mono clock base as in v1.
  The added one bit to flag skb->mono_delivery_time is enough
  to keep the EDT delivery_time during forward.
- Added logic to the bpf side so that the existing bpf programs
  running at ingress can still get the (rcv) timestamp
  when reading __sk_buff->tstamp.  A new __sk_buff->mono_delivery_time
  is also added.  A test is still needed for this piece.

Martin KaFai Lau (8):
  net: Add skb->mono_delivery_time to distinguish mono delivery_time
    from (rcv) timestamp
  net: Add skb_clear_tstamp() to keep the mono delivery_time
  net: Set skb->mono_delivery_time and clear it after
    sch_handle_ingress()
  net: Postpone skb_clear_delivery_time() until knowing the skb is
    delivered locally
  bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress
  bpf: Clear skb->mono_delivery_time bit if needed after running
    tc-bpf@egress
  bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_delivery_time()
  bpf: selftests: test skb->tstamp in redirect_neigh

 drivers/net/loopback.c                        |   2 +-
 include/linux/filter.h                        |  33 +-
 include/linux/skbuff.h                        |  64 ++-
 include/net/inet_frag.h                       |   1 +
 include/uapi/linux/bpf.h                      |  35 +-
 net/bridge/br_forward.c                       |   2 +-
 net/bridge/netfilter/nf_conntrack_bridge.c    |   5 +-
 net/core/dev.c                                |   4 +-
 net/core/filter.c                             |  85 +++-
 net/core/skbuff.c                             |   8 +-
 net/ipv4/inet_fragment.c                      |   1 +
 net/ipv4/ip_forward.c                         |   2 +-
 net/ipv4/ip_fragment.c                        |   1 +
 net/ipv4/ip_input.c                           |   1 +
 net/ipv4/ip_output.c                          |   6 +-
 net/ipv4/tcp_output.c                         |  16 +-
 net/ipv6/ip6_input.c                          |   1 +
 net/ipv6/ip6_output.c                         |   7 +-
 net/ipv6/netfilter.c                          |   5 +-
 net/ipv6/tcp_ipv6.c                           |   2 +-
 net/netfilter/ipvs/ip_vs_xmit.c               |   6 +-
 net/netfilter/nf_dup_netdev.c                 |   2 +-
 net/netfilter/nf_flow_table_ip.c              |   4 +-
 net/netfilter/nft_fwd_netdev.c                |   2 +-
 net/openvswitch/vport.c                       |   2 +-
 net/packet/af_packet.c                        |   4 +-
 net/sched/act_bpf.c                           |   7 +-
 net/sched/cls_bpf.c                           |   8 +-
 net/xfrm/xfrm_interface.c                     |   2 +-
 tools/include/uapi/linux/bpf.h                |  35 +-
 .../selftests/bpf/prog_tests/tc_redirect.c    | 434 ++++++++++++++++++
 .../selftests/bpf/progs/test_tc_dtime.c       | 348 ++++++++++++++
 32 files changed, 1078 insertions(+), 57 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_dtime.c

-- 
2.30.2



* [PATCH v4 net-next 1/8] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
  2022-02-11  7:12 [PATCH v4 net-next 0/8] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
@ 2022-02-11  7:12 ` Martin KaFai Lau
  2022-02-12 19:13   ` Cong Wang
  2022-02-15 20:49   ` Daniel Borkmann
  2022-02-11  7:12 ` [PATCH v4 net-next 2/8] net: Add skb_clear_tstamp() to keep the mono delivery_time Martin KaFai Lau
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 19+ messages in thread
From: Martin KaFai Lau @ 2022-02-11  7:12 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

skb->tstamp was first used as the (rcv) timestamp.
Its major usage is to report the timestamp to the user (e.g. SO_TIMESTAMP).

Later, skb->tstamp was also set to the (future) delivery_time (e.g. EDT in TCP)
during egress and used by the qdisc (e.g. sch_fq) to decide when
the skb can be passed to the dev.

Currently, there is no way to tell whether skb->tstamp has the (rcv)
timestamp or the delivery_time, so it is always reset to 0 whenever
forwarded between egress and ingress.

While it makes sense to always clear the (rcv) timestamp in skb->tstamp
to avoid confusing sch_fq, which expects the delivery_time, it is a
performance issue [0] to clear the delivery_time if the skb finally
egresses to a fq@phy-dev.  For example, when forwarding from egress to
ingress and then finally back to egress:

            tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                     ^              ^
                                     reset          reset

This patch adds one bit, skb->mono_delivery_time, to flag that skb->tstamp
is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.

The current use case is to keep the TCP mono delivery_time (EDT) and
to be used with sch_fq.  A later patch will also allow tc-bpf@ingress
to read and change the mono delivery_time.

In the future, another bit (e.g. skb->user_delivery_time) can be added
for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.

[ This patch is a prep work.  The following patch will
  get the other parts of the stack ready first.  Then another patch
  after that will finally set the skb->mono_delivery_time. ]

A skb_set_delivery_time() function is added.  It is used by tcp_output.c
and during ip[6] fragmentation to assign the delivery_time to
skb->tstamp and also set skb->mono_delivery_time.

A note on the change in ip_send_unicast_reply() in ip_output.c:
it is only used by TCP to send reset/ack out of a ctl_sk.
Like the new skb_set_delivery_time(), this patch sets
skb->mono_delivery_time to 0 for now as a placeholder.
It will be enabled in a later patch.
A similar case in tcp_ipv6 is handled with
skb_set_delivery_time() in tcp_v6_send_response().

[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h                     | 13 +++++++++++++
 net/bridge/netfilter/nf_conntrack_bridge.c |  5 +++--
 net/ipv4/ip_output.c                       |  7 +++++--
 net/ipv4/tcp_output.c                      | 16 +++++++++-------
 net/ipv6/ip6_output.c                      |  5 +++--
 net/ipv6/netfilter.c                       |  5 +++--
 net/ipv6/tcp_ipv6.c                        |  2 +-
 7 files changed, 37 insertions(+), 16 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a5adbf6b51e8..32c793de3801 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -747,6 +747,10 @@ typedef unsigned char *sk_buff_data_t;
  *	@dst_pending_confirm: need to confirm neighbour
  *	@decrypted: Decrypted SKB
  *	@slow_gro: state present at GRO time, slower prepare step required
+ *	@mono_delivery_time: When set, skb->tstamp has the
+ *		delivery_time in mono clock base (i.e. EDT).  Otherwise, the
+ *		skb->tstamp has the (rcv) timestamp at ingress and
+ *		delivery_time at egress.
  *	@napi_id: id of the NAPI struct this skb came from
  *	@sender_cpu: (aka @napi_id) source CPU in XPS
  *	@secmark: security marking
@@ -917,6 +921,7 @@ struct sk_buff {
 	__u8			decrypted:1;
 #endif
 	__u8			slow_gro:1;
+	__u8			mono_delivery_time:1;
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
@@ -3925,6 +3930,14 @@ static inline ktime_t net_timedelta(ktime_t t)
 	return ktime_sub(ktime_get_real(), t);
 }
 
+static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
+					 bool mono)
+{
+	skb->tstamp = kt;
+	/* Setting mono_delivery_time will be enabled later */
+	skb->mono_delivery_time = 0;
+}
+
 static inline u8 skb_metadata_len(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->meta_len;
diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c
index fdbed3158555..ebfb2a5c59e4 100644
--- a/net/bridge/netfilter/nf_conntrack_bridge.c
+++ b/net/bridge/netfilter/nf_conntrack_bridge.c
@@ -32,6 +32,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
 					   struct sk_buff *))
 {
 	int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	unsigned int hlen, ll_rs, mtu;
 	ktime_t tstamp = skb->tstamp;
 	struct ip_frag_state state;
@@ -81,7 +82,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
 			if (iter.frag)
 				ip_fraglist_prepare(skb, &iter);
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, data, skb);
 			if (err || !iter.frag)
 				break;
@@ -112,7 +113,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
 			goto blackhole;
 		}
 
-		skb2->tstamp = tstamp;
+		skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
 		err = output(net, sk, data, skb2);
 		if (err)
 			goto blackhole;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 0c0574eb5f5b..7af5d1849bc9 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -761,6 +761,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 {
 	struct iphdr *iph;
 	struct sk_buff *skb2;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	struct rtable *rt = skb_rtable(skb);
 	unsigned int mtu, hlen, ll_rs;
 	struct ip_fraglist_iter iter;
@@ -852,7 +853,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 				}
 			}
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, skb);
 
 			if (!err)
@@ -908,7 +909,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		/*
 		 *	Put this fragment into the sending queue.
 		 */
-		skb2->tstamp = tstamp;
+		skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
 		err = output(net, sk, skb2);
 		if (err)
 			goto fail;
@@ -1727,6 +1728,8 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 			  arg->csumoffset) = csum_fold(csum_add(nskb->csum,
 								arg->csum));
 		nskb->ip_summed = CHECKSUM_NONE;
+		/* Setting mono_delivery_time will be enabled later */
+		nskb->mono_delivery_time = 0;
 		ip_push_pending_frames(sk, &fl4);
 	}
 out:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e76bf1e9251e..2319531267c6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1253,7 +1253,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	tp = tcp_sk(sk);
 	prior_wstamp = tp->tcp_wstamp_ns;
 	tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
-	skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
+	skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true);
 	if (clone_it) {
 		oskb = skb;
 
@@ -1589,7 +1589,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 
 	skb_split(skb, buff, len);
 
-	buff->tstamp = skb->tstamp;
+	skb_set_delivery_time(buff, skb->tstamp, true);
 	tcp_fragment_tstamp(skb, buff);
 
 	old_factor = tcp_skb_pcount(skb);
@@ -2616,7 +2616,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 		if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
 			/* "skb_mstamp_ns" is used as a start point for the retransmit timer */
-			skb->skb_mstamp_ns = tp->tcp_wstamp_ns = tp->tcp_clock_cache;
+			tp->tcp_wstamp_ns = tp->tcp_clock_cache;
+			skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true);
 			list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
 			tcp_init_tso_segs(skb, mss_now);
 			goto repair; /* Skip network transmission */
@@ -3541,11 +3542,12 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	now = tcp_clock_ns();
 #ifdef CONFIG_SYN_COOKIES
 	if (unlikely(synack_type == TCP_SYNACK_COOKIE && ireq->tstamp_ok))
-		skb->skb_mstamp_ns = cookie_init_timestamp(req, now);
+		skb_set_delivery_time(skb, cookie_init_timestamp(req, now),
+				      true);
 	else
 #endif
 	{
-		skb->skb_mstamp_ns = now;
+		skb_set_delivery_time(skb, now, true);
 		if (!tcp_rsk(req)->snt_synack) /* Timestamp first SYNACK */
 			tcp_rsk(req)->snt_synack = tcp_skb_timestamp_us(skb);
 	}
@@ -3594,7 +3596,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb,
 				synack_type, &opts);
 
-	skb->skb_mstamp_ns = now;
+	skb_set_delivery_time(skb, now, true);
 	tcp_add_tx_delay(skb, tp);
 
 	return skb;
@@ -3771,7 +3773,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 
 	err = tcp_transmit_skb(sk, syn_data, 1, sk->sk_allocation);
 
-	syn->skb_mstamp_ns = syn_data->skb_mstamp_ns;
+	skb_set_delivery_time(syn, syn_data->skb_mstamp_ns, true);
 
 	/* Now full SYN+DATA was cloned and sent (or not),
 	 * remove the SYN from the original skb (syn_data)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 0c6c971ce0a5..55665f3f7a77 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -813,6 +813,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	struct rt6_info *rt = (struct rt6_info *)skb_dst(skb);
 	struct ipv6_pinfo *np = skb->sk && !dev_recursion_level() ?
 				inet6_sk(skb->sk) : NULL;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	struct ip6_frag_state state;
 	unsigned int mtu, hlen, nexthdr_offset;
 	ktime_t tstamp = skb->tstamp;
@@ -903,7 +904,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			if (iter.frag)
 				ip6_fraglist_prepare(skb, &iter);
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, skb);
 			if (!err)
 				IP6_INC_STATS(net, ip6_dst_idev(&rt->dst),
@@ -962,7 +963,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		/*
 		 *	Put this fragment into the sending queue.
 		 */
-		frag->tstamp = tstamp;
+		skb_set_delivery_time(frag, tstamp, mono_delivery_time);
 		err = output(net, sk, frag);
 		if (err)
 			goto fail;
diff --git a/net/ipv6/netfilter.c b/net/ipv6/netfilter.c
index 6ab710b5a1a8..1da332450d98 100644
--- a/net/ipv6/netfilter.c
+++ b/net/ipv6/netfilter.c
@@ -121,6 +121,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 				  struct sk_buff *))
 {
 	int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	ktime_t tstamp = skb->tstamp;
 	struct ip6_frag_state state;
 	u8 *prevhdr, nexthdr = 0;
@@ -186,7 +187,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			if (iter.frag)
 				ip6_fraglist_prepare(skb, &iter);
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, data, skb);
 			if (err || !iter.frag)
 				break;
@@ -219,7 +220,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			goto blackhole;
 		}
 
-		skb2->tstamp = tstamp;
+		skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
 		err = output(net, sk, data, skb2);
 		if (err)
 			goto blackhole;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0c648bf07f39..d4fc6641afa4 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -992,7 +992,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
 		} else {
 			mark = sk->sk_mark;
 		}
-		buff->tstamp = tcp_transmit_time(sk);
+		skb_set_delivery_time(buff, tcp_transmit_time(sk), true);
 	}
 	fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark) ?: mark;
 	fl6.fl6_dport = t1->dest;
-- 
2.30.2



* [PATCH v4 net-next 2/8] net: Add skb_clear_tstamp() to keep the mono delivery_time
  2022-02-11  7:12 [PATCH v4 net-next 0/8] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
  2022-02-11  7:12 ` [PATCH v4 net-next 1/8] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Martin KaFai Lau
@ 2022-02-11  7:12 ` Martin KaFai Lau
  2022-02-11  7:12 ` [PATCH v4 net-next 3/8] net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() Martin KaFai Lau
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Martin KaFai Lau @ 2022-02-11  7:12 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.

If skb->tstamp has the mono delivery_time, clearing it can hurt
performance when the skb finally transmits out to a fq@phy-dev.

The earlier patch added a skb->mono_delivery_time bit to
flag that skb->tstamp is carrying the mono delivery_time.

This patch adds a skb_clear_tstamp() helper which keeps
the mono delivery_time and clears everything else.

The delivery_time clearing is postponed until the stack knows the
skb will be delivered locally.  It will be done in a later patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 drivers/net/loopback.c           |  2 +-
 include/linux/skbuff.h           | 10 +++++++++-
 net/bridge/br_forward.c          |  2 +-
 net/core/filter.c                |  6 +++---
 net/core/skbuff.c                |  2 +-
 net/ipv4/ip_forward.c            |  2 +-
 net/ipv6/ip6_output.c            |  2 +-
 net/netfilter/ipvs/ip_vs_xmit.c  |  6 +++---
 net/netfilter/nf_dup_netdev.c    |  2 +-
 net/netfilter/nf_flow_table_ip.c |  4 ++--
 net/netfilter/nft_fwd_netdev.c   |  2 +-
 net/openvswitch/vport.c          |  2 +-
 net/xfrm/xfrm_interface.c        |  2 +-
 13 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index ed0edf5884ef..70a38fa09299 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -74,7 +74,7 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
 	skb_tx_timestamp(skb);
 
 	/* do not fool net_timestamp_check() with various clock bases */
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	skb_orphan(skb);
 
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 32c793de3801..07e618f8b41a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3938,6 +3938,14 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
 	skb->mono_delivery_time = 0;
 }
 
+static inline void skb_clear_tstamp(struct sk_buff *skb)
+{
+	if (skb->mono_delivery_time)
+		return;
+
+	skb->tstamp = 0;
+}
+
 static inline u8 skb_metadata_len(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->meta_len;
@@ -4794,7 +4802,7 @@ static inline void skb_set_redirected(struct sk_buff *skb, bool from_ingress)
 #ifdef CONFIG_NET_REDIRECT
 	skb->from_ingress = from_ingress;
 	if (skb->from_ingress)
-		skb->tstamp = 0;
+		skb_clear_tstamp(skb);
 #endif
 }
 
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index ec646656dbf1..02bb620d3b8d 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -62,7 +62,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);
 
 int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
 		       net, sk, skb, NULL, skb->dev,
 		       br_dev_queue_push_xmit);
diff --git a/net/core/filter.c b/net/core/filter.c
index f497ca7a16d2..a2d712be4985 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2107,7 +2107,7 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	dev_xmit_recursion_inc();
 	ret = dev_queue_xmit(skb);
@@ -2176,7 +2176,7 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
 		skb = skb_expand_head(skb, hh_len);
@@ -2274,7 +2274,7 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb,
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
 		skb = skb_expand_head(skb, hh_len);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0118f0afaa4f..3e3da8fdf8f5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5350,7 +5350,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 
 	ipvs_reset(skb);
 	skb->mark = 0;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);
 
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 00ec819f949b..92ba3350274b 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -79,7 +79,7 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
 
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	return dst_output(net, sk, skb);
 }
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 55665f3f7a77..9fc1b08cf622 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -440,7 +440,7 @@ static inline int ip6_forward_finish(struct net *net, struct sock *sk,
 	}
 #endif
 
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	return dst_output(net, sk, skb);
 }
 
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index d2e5a8f644b8..029171379884 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -610,7 +610,7 @@ static inline int ip_vs_tunnel_xmit_prepare(struct sk_buff *skb,
 		nf_reset_ct(skb);
 		skb_forward_csum(skb);
 		if (skb->dev)
-			skb->tstamp = 0;
+			skb_clear_tstamp(skb);
 	}
 	return ret;
 }
@@ -652,7 +652,7 @@ static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb,
 	if (!local) {
 		skb_forward_csum(skb);
 		if (skb->dev)
-			skb->tstamp = 0;
+			skb_clear_tstamp(skb);
 		NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb,
 			NULL, skb_dst(skb)->dev, dst_output);
 	} else
@@ -674,7 +674,7 @@ static inline int ip_vs_send_or_cont(int pf, struct sk_buff *skb,
 		ip_vs_drop_early_demux_sk(skb);
 		skb_forward_csum(skb);
 		if (skb->dev)
-			skb->tstamp = 0;
+			skb_clear_tstamp(skb);
 		NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb,
 			NULL, skb_dst(skb)->dev, dst_output);
 	} else
diff --git a/net/netfilter/nf_dup_netdev.c b/net/netfilter/nf_dup_netdev.c
index a579e59ee5c5..7873bd1389c3 100644
--- a/net/netfilter/nf_dup_netdev.c
+++ b/net/netfilter/nf_dup_netdev.c
@@ -19,7 +19,7 @@ static void nf_do_netdev_egress(struct sk_buff *skb, struct net_device *dev)
 		skb_push(skb, skb->mac_len);
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	dev_queue_xmit(skb);
 }
 
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 889cf88d3dba..f1d387129f02 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -376,7 +376,7 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 	nf_flow_nat_ip(flow, skb, thoff, dir, iph);
 
 	ip_decrease_ttl(iph);
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	if (flow_table->flags & NF_FLOWTABLE_COUNTER)
 		nf_ct_acct_update(flow->ct, tuplehash->tuple.dir, skb->len);
@@ -611,7 +611,7 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 	nf_flow_nat_ipv6(flow, skb, dir, ip6h);
 
 	ip6h->hop_limit--;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	if (flow_table->flags & NF_FLOWTABLE_COUNTER)
 		nf_ct_acct_update(flow->ct, tuplehash->tuple.dir, skb->len);
diff --git a/net/netfilter/nft_fwd_netdev.c b/net/netfilter/nft_fwd_netdev.c
index fa9301ca6033..4b2b0946c0b6 100644
--- a/net/netfilter/nft_fwd_netdev.c
+++ b/net/netfilter/nft_fwd_netdev.c
@@ -140,7 +140,7 @@ static void nft_fwd_neigh_eval(const struct nft_expr *expr,
 		return;
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	neigh_xmit(neigh_table, dev, addr, skb);
 out:
 	regs->verdict.code = verdict;
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index cf2ce5812489..82a74f998966 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -507,7 +507,7 @@ void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto)
 	}
 
 	skb->dev = vport->dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	vport->ops->send(skb);
 	return;
 
diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 57448fc519fc..4991e99ced9a 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -190,7 +190,7 @@ static void xfrmi_dev_uninit(struct net_device *dev)
 
 static void xfrmi_scrub_packet(struct sk_buff *skb, bool xnet)
 {
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	skb->pkt_type = PACKET_HOST;
 	skb->skb_iif = 0;
 	skb->ignore_df = 0;
-- 
2.30.2



* [PATCH v4 net-next 3/8] net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()
  2022-02-11  7:12 [PATCH v4 net-next 0/8] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
  2022-02-11  7:12 ` [PATCH v4 net-next 1/8] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Martin KaFai Lau
  2022-02-11  7:12 ` [PATCH v4 net-next 2/8] net: Add skb_clear_tstamp() to keep the mono delivery_time Martin KaFai Lau
@ 2022-02-11  7:12 ` Martin KaFai Lau
  2022-02-15 21:04   ` Daniel Borkmann
  2022-02-11  7:12 ` [PATCH v4 net-next 4/8] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally Martin KaFai Lau
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Martin KaFai Lau @ 2022-02-11  7:12 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

This patch sets skb->mono_delivery_time to flag that skb->tstamp
is used as the mono delivery_time (EDT) instead of the (rcv) timestamp.

skb_clear_delivery_time() is added to clear the delivery_time and set
back to the (rcv) timestamp after sch_handle_ingress() such that
the tc-bpf prog can use bpf_redirect_*() to forward it to the egress
of another iface and keep the EDT delivery_time.

The next patch will postpone the skb_clear_delivery_time() until the
stack learns that the skb is being delivered locally, and that will
make the other kernel forwarding paths (ip[6]_forward) able to keep
the delivery_time as well.  Thus, like the previous patches on using
the skb->mono_delivery_time bit, calling skb_clear_delivery_time()
is not done within CONFIG_NET_INGRESS to avoid too much code
churn in this set.

Before sch_handle_ingress(), another case that needs to clear the
delivery_time is network tapping (e.g. af_packet by tcpdump).  Regardless
of tapping at the ingress or egress, the tapped skb is received by the
af_packet socket, so it is ingress to the af_packet socket and it expects
the (rcv) timestamp.

When tapping at egress, dev_queue_xmit_nit() is used.  It already
expects that skb->tstamp may have the delivery_time, so it does
skb_clone()+net_timestamp_set() to ensure the cloned skb has
the (rcv) timestamp before passing it to the af_packet sk.
This patch only adds clearing of the skb->mono_delivery_time
bit in net_timestamp_set().

When tapping at ingress, it currently expects skb->tstamp to be either 0
or the (rcv) timestamp.  Meaning, the tapping-at-ingress path already
expects that skb->tstamp could be 0, and it will get the (rcv) timestamp
by ktime_get_real() when needed.

There are two cases for tapping at ingress:

In one case, af_packet queues the skb to its sk_receive_queue.  The skb
is either not shared or a new clone is created.  skb_clear_delivery_time()
is called to clear the delivery_time (if any) before the skb is queued to
the sk_receive_queue.

In the other case, the ingress skb is copied directly to the rx_ring
and tpacket_get_timestamp() is used to get the (rcv) timestamp.
skb_tstamp() is used in tpacket_get_timestamp() to check
the skb->mono_delivery_time bit before returning skb->tstamp.
As mentioned earlier, ingress tapping already expects that
the skb may not have the (rcv) timestamp (because no sk has asked
for it) and handles this case by calling ktime_get_real() directly.

In __skb_tstamp_tx(), the egress skb is cloned and the clone is queued
to the sk_error_queue.  The outgoing skb may have the mono delivery_time
while the (rcv) timestamp is expected for the clone, so the
skb->mono_delivery_time bit is also cleared on the clone.
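The tstamp/mono_delivery_time convention described above can be sketched
as a small userspace model (the struct and sim_* names here are
hypothetical stand-ins; the real helpers are skb_set_delivery_time(),
skb_clear_delivery_time() and skb_tstamp() on struct sk_buff in the diff
below):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal model of an skb for illustration only. */
struct sim_skb {
	uint64_t tstamp;         /* ktime_t */
	bool mono_delivery_time; /* tstamp holds EDT, not (rcv) time */
};

/* Models skb_set_delivery_time(): the bit is only set for a
 * non-zero mono delivery_time. */
static void sim_set_delivery_time(struct sim_skb *skb, uint64_t kt, bool mono)
{
	skb->tstamp = kt;
	skb->mono_delivery_time = kt && mono;
}

/* Models skb_clear_delivery_time(): on local delivery (or tapping),
 * a delivery_time is replaced by the (rcv) timestamp, or by 0 when
 * no receiver asked for timestamps (netstamp_needed_key off). */
static void sim_clear_delivery_time(struct sim_skb *skb, uint64_t now_real,
				    bool netstamp_needed)
{
	if (skb->mono_delivery_time) {
		skb->mono_delivery_time = false;
		skb->tstamp = netstamp_needed ? now_real : 0;
	}
}

/* Models skb_tstamp(): readers expecting a (rcv) timestamp see 0
 * while the field holds a delivery_time. */
static uint64_t sim_tstamp(const struct sim_skb *skb)
{
	return skb->mono_delivery_time ? 0 : skb->tstamp;
}
```

A reader such as tpacket_get_timestamp() then sees 0 for a
delivery_time and falls back to ktime_get_real(), exactly the behavior
the ingress tapping path already handles.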

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h | 27 +++++++++++++++++++++++++--
 net/core/dev.c         |  5 ++++-
 net/core/skbuff.c      |  6 ++++--
 net/ipv4/ip_output.c   |  3 +--
 net/packet/af_packet.c |  4 +++-
 5 files changed, 37 insertions(+), 8 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 07e618f8b41a..0e09e75fa787 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3934,8 +3934,23 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
 					 bool mono)
 {
 	skb->tstamp = kt;
-	/* Setting mono_delivery_time will be enabled later */
-	skb->mono_delivery_time = 0;
+	skb->mono_delivery_time = kt && mono;
+}
+
+DECLARE_STATIC_KEY_FALSE(netstamp_needed_key);
+
+/* It is used in the ingress path to clear the delivery_time.
+ * If needed, set the skb->tstamp to the (rcv) timestamp.
+ */
+static inline void skb_clear_delivery_time(struct sk_buff *skb)
+{
+	if (unlikely(skb->mono_delivery_time)) {
+		skb->mono_delivery_time = 0;
+		if (static_branch_unlikely(&netstamp_needed_key))
+			skb->tstamp = ktime_get_real();
+		else
+			skb->tstamp = 0;
+	}
 }
 
 static inline void skb_clear_tstamp(struct sk_buff *skb)
@@ -3946,6 +3961,14 @@ static inline void skb_clear_tstamp(struct sk_buff *skb)
 	skb->tstamp = 0;
 }
 
+static inline ktime_t skb_tstamp(const struct sk_buff *skb)
+{
+	if (unlikely(skb->mono_delivery_time))
+		return 0;
+
+	return skb->tstamp;
+}
+
 static inline u8 skb_metadata_len(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->meta_len;
diff --git a/net/core/dev.c b/net/core/dev.c
index f5ef51601081..f41707ab2fb9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2000,7 +2000,8 @@ void net_dec_egress_queue(void)
 EXPORT_SYMBOL_GPL(net_dec_egress_queue);
 #endif
 
-static DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
+DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
+EXPORT_SYMBOL(netstamp_needed_key);
 #ifdef CONFIG_JUMP_LABEL
 static atomic_t netstamp_needed_deferred;
 static atomic_t netstamp_wanted;
@@ -2061,6 +2062,7 @@ EXPORT_SYMBOL(net_disable_timestamp);
 static inline void net_timestamp_set(struct sk_buff *skb)
 {
 	skb->tstamp = 0;
+	skb->mono_delivery_time = 0;
 	if (static_branch_unlikely(&netstamp_needed_key))
 		__net_timestamp(skb);
 }
@@ -5220,6 +5222,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
 			goto out;
 	}
 #endif
+	skb_clear_delivery_time(skb);
 	skb_reset_redirect(skb);
 skip_classify:
 	if (pfmemalloc && !skb_pfmemalloc_protocol(skb))
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3e3da8fdf8f5..93dc763da8cb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4817,10 +4817,12 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 		skb_shinfo(skb)->tskey = skb_shinfo(orig_skb)->tskey;
 	}
 
-	if (hwtstamps)
+	if (hwtstamps) {
 		*skb_hwtstamps(skb) = *hwtstamps;
-	else
+	} else {
 		skb->tstamp = ktime_get_real();
+		skb->mono_delivery_time = 0;
+	}
 
 	__skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
 }
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7af5d1849bc9..bfe08feb5d82 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1728,8 +1728,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 			  arg->csumoffset) = csum_fold(csum_add(nskb->csum,
 								arg->csum));
 		nskb->ip_summed = CHECKSUM_NONE;
-		/* Setting mono_delivery_time will be enabled later */
-		nskb->mono_delivery_time = 0;
+		nskb->mono_delivery_time = !!transmit_time;
 		ip_push_pending_frames(sk, &fl4);
 	}
 out:
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index ab87f22cc7ec..1b93ce1a5600 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -460,7 +460,7 @@ static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
 		return TP_STATUS_TS_RAW_HARDWARE;
 
 	if ((flags & SOF_TIMESTAMPING_SOFTWARE) &&
-	    ktime_to_timespec64_cond(skb->tstamp, ts))
+	    ktime_to_timespec64_cond(skb_tstamp(skb), ts))
 		return TP_STATUS_TS_SOFTWARE;
 
 	return 0;
@@ -2199,6 +2199,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
 	spin_lock(&sk->sk_receive_queue.lock);
 	po->stats.stats1.tp_packets++;
 	sock_skb_set_dropcount(sk, skb);
+	skb_clear_delivery_time(skb);
 	__skb_queue_tail(&sk->sk_receive_queue, skb);
 	spin_unlock(&sk->sk_receive_queue.lock);
 	sk->sk_data_ready(sk);
@@ -2377,6 +2378,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	po->stats.stats1.tp_packets++;
 	if (copy_skb) {
 		status |= TP_STATUS_COPY;
+		skb_clear_delivery_time(copy_skb);
 		__skb_queue_tail(&sk->sk_receive_queue, copy_skb);
 	}
 	spin_unlock(&sk->sk_receive_queue.lock);
-- 
2.30.2



* [PATCH v4 net-next 4/8] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally
  2022-02-11  7:12 [PATCH v4 net-next 0/8] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2022-02-11  7:12 ` [PATCH v4 net-next 3/8] net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() Martin KaFai Lau
@ 2022-02-11  7:12 ` Martin KaFai Lau
  2022-02-11  7:13 ` [PATCH v4 net-next 5/8] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress Martin KaFai Lau
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Martin KaFai Lau @ 2022-02-11  7:12 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

This patch postpones the delivery_time clearing until the stack knows
the skb is being delivered locally.  That allows other kernel
forwarding paths (e.g. ip[6]_forward) to keep the delivery_time as well.

An earlier attempt was to do skb_clear_delivery_time() in
ip_local_deliver() and ip6_input().  The discussion [0] requested moving
it one step later, into ip_local_deliver_finish()
and ip6_input_finish(), so that the delivery_time can also be kept
for the ip_vs forwarding path.  To do that, this patch also
needs to take care of the (rcv) timestamp use case in ip_is_fragment():
the defrag path must now expect a delivery_time in skb->tstamp, so the
mono_delivery_time bit is saved in inet_frag_queue such that the
delivery_time (if any) can be restored in the final defragmented skb.

The ipv6 defrag is done in ip6_protocol_deliver_rcu() when figuring
out how to handle nexthdr, and IPPROTO_FRAGMENT (44) is one of the
ipv6 extension headers.  ip6_protocol_deliver_rcu() runs after
ip6_input_finish(), where skb_clear_delivery_time() has
already been done, so no change is needed there.
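The defrag save/restore can be sketched as follows (a hypothetical
userspace model; the real fields are inet_frag_queue::stamp and the new
inet_frag_queue::mono_delivery_time, set in ip_frag_queue() and restored
in inet_frag_reasm_finish() in the diff below):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustration-only stand-ins for sk_buff and inet_frag_queue. */
struct frag_skb {
	uint64_t tstamp;
	bool mono_delivery_time;
};

struct frag_queue {
	uint64_t stamp;
	bool mono_delivery_time;
};

/* Models ip_frag_queue(): remember both the tstamp and its
 * mono_delivery_time interpretation from the fragment. */
static void frag_enqueue(struct frag_queue *q, const struct frag_skb *skb)
{
	q->stamp = skb->tstamp;
	q->mono_delivery_time = skb->mono_delivery_time;
}

/* Models inet_frag_reasm_finish(): restore both on the head skb,
 * so a forwarded delivery_time survives reassembly. */
static void frag_reasm_finish(const struct frag_queue *q, struct frag_skb *head)
{
	head->tstamp = q->stamp;
	head->mono_delivery_time = q->mono_delivery_time;
}
```

Without the extra bit, the reassembled skb's tstamp would be ambiguous:
a qdisc could misread a (rcv) timestamp as an EDT or vice versa.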

[0]: https://lore.kernel.org/netdev/ca728d81-80e8-3767-d5e-d44f6ad96e43@ssi.bg/

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/inet_frag.h  | 1 +
 net/core/dev.c           | 1 -
 net/ipv4/inet_fragment.c | 1 +
 net/ipv4/ip_fragment.c   | 1 +
 net/ipv4/ip_input.c      | 1 +
 net/ipv6/ip6_input.c     | 1 +
 6 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 63540be0fc34..c0e517f31d82 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -90,6 +90,7 @@ struct inet_frag_queue {
 	ktime_t			stamp;
 	int			len;
 	int			meat;
+	bool			mono_delivery_time;
 	__u8			flags;
 	u16			max_size;
 	struct fqdir		*fqdir;
diff --git a/net/core/dev.c b/net/core/dev.c
index f41707ab2fb9..041cef7473fd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5222,7 +5222,6 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
 			goto out;
 	}
 #endif
-	skb_clear_delivery_time(skb);
 	skb_reset_redirect(skb);
 skip_classify:
 	if (pfmemalloc && !skb_pfmemalloc_protocol(skb))
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 341096807100..63948f6aeca0 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -572,6 +572,7 @@ void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
 	skb_mark_not_on_list(head);
 	head->prev = NULL;
 	head->tstamp = q->stamp;
+	head->mono_delivery_time = q->mono_delivery_time;
 }
 EXPORT_SYMBOL(inet_frag_reasm_finish);
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index fad803d2d711..fb153569889e 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -349,6 +349,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		qp->iif = dev->ifindex;
 
 	qp->q.stamp = skb->tstamp;
+	qp->q.mono_delivery_time = skb->mono_delivery_time;
 	qp->q.meat += skb->len;
 	qp->ecn |= ecn;
 	add_frag_mem_limit(qp->q.fqdir, skb->truesize);
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index d94f9f7e60c3..95f7bb052784 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -226,6 +226,7 @@ void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int protocol)
 
 static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
+	skb_clear_delivery_time(skb);
 	__skb_pull(skb, skb_network_header_len(skb));
 
 	rcu_read_lock();
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index d4b1e2c5aa76..5b5ea35635f9 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -459,6 +459,7 @@ void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int nexthdr,
 
 static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
+	skb_clear_delivery_time(skb);
 	rcu_read_lock();
 	ip6_protocol_deliver_rcu(net, skb, 0, false);
 	rcu_read_unlock();
-- 
2.30.2



* [PATCH v4 net-next 5/8] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress
  2022-02-11  7:12 [PATCH v4 net-next 0/8] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
                   ` (3 preceding siblings ...)
  2022-02-11  7:12 ` [PATCH v4 net-next 4/8] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally Martin KaFai Lau
@ 2022-02-11  7:13 ` Martin KaFai Lau
  2022-02-15 23:30   ` Daniel Borkmann
  2022-02-11  7:13 ` [PATCH v4 net-next 6/8] bpf: Clear skb->mono_delivery_time bit if needed after running tc-bpf@egress Martin KaFai Lau
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Martin KaFai Lau @ 2022-02-11  7:13 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

The current tc-bpf@ingress reads and writes the __sk_buff->tstamp
as a (rcv) timestamp.  This patch stays backward compatible with that
(rcv) timestamp expectation when the skb->tstamp has a mono delivery_time.

If needed, the patch first saves the mono delivery_time.  Depending on
the static key "netstamp_needed_key", it then resets the skb->tstamp to
either 0 or ktime_get_real() before running the tc-bpf@ingress.  After
the tc-bpf prog returns, if the (rcv) timestamp in skb->tstamp has not
been changed, it will restore the earlier saved mono delivery_time.

The current logic to run tc-bpf@ingress is refactored into a new
bpf_prog_run_at_ingress() function and shared between cls_bpf and act_bpf.
The new delivery_time save/restore logic above is also done in
this function.
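The save/restore around the prog run can be sketched in userspace (a
hypothetical model; the real code is bpf_prog_run_at_ingress() in the
diff below, where the push/pull of mac_len and bpf_compute_data_pointers()
are omitted here for brevity):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct ing_skb {
	uint64_t tstamp;
	bool mono_delivery_time;
};

typedef uint32_t (*ing_prog_t)(struct ing_skb *skb);

/* Example progs: one leaves __sk_buff->tstamp alone, one writes 0. */
static uint32_t prog_keep(struct ing_skb *skb) { (void)skb; return 1; }
static uint32_t prog_clear(struct ing_skb *skb) { skb->tstamp = 0; return 2; }

static uint32_t ing_run(ing_prog_t prog, struct ing_skb *skb,
			uint64_t now_real, bool netstamp_needed)
{
	uint64_t tstamp = 0, saved_mono_dtime = 0;
	uint32_t res;

	/* Hide the delivery_time: present the (rcv) timestamp (or 0). */
	if (skb->mono_delivery_time) {
		saved_mono_dtime = skb->tstamp;
		skb->mono_delivery_time = false;
		skb->tstamp = tstamp = netstamp_needed ? now_real : 0;
	}

	res = prog(skb);

	/* The prog did not change __sk_buff->tstamp: restore the EDT. */
	if (saved_mono_dtime && skb->tstamp == tstamp) {
		skb->tstamp = saved_mono_dtime;
		skb->mono_delivery_time = true;
	}

	return res;
}
```

A prog that writes any new value (including 0) keeps its own value, so
existing progs that treat the field as a (rcv) timestamp see no change.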

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/filter.h | 28 ++++++++++++++++++++++++++++
 net/sched/act_bpf.c    |  5 +----
 net/sched/cls_bpf.c    |  6 +-----
 3 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index d23e999dc032..e43e1701a80e 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -699,6 +699,34 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
 	cb->data_end  = skb->data + skb_headlen(skb);
 }
 
+static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog,
+						   struct sk_buff *skb)
+{
+	ktime_t tstamp, saved_mono_dtime = 0;
+	int filter_res;
+
+	if (unlikely(skb->mono_delivery_time)) {
+		saved_mono_dtime = skb->tstamp;
+		skb->mono_delivery_time = 0;
+		if (static_branch_unlikely(&netstamp_needed_key))
+			skb->tstamp = tstamp = ktime_get_real();
+		else
+			skb->tstamp = tstamp = 0;
+	}
+
+	/* It is safe to push/pull even if skb_shared() */
+	__skb_push(skb, skb->mac_len);
+	bpf_compute_data_pointers(skb);
+	filter_res = bpf_prog_run(prog, skb);
+	__skb_pull(skb, skb->mac_len);
+
+	/* __sk_buff->tstamp was not changed, restore the delivery_time */
+	if (unlikely(saved_mono_dtime) && skb_tstamp(skb) == tstamp)
+		skb_set_delivery_time(skb, saved_mono_dtime, true);
+
+	return filter_res;
+}
+
 /* Similar to bpf_compute_data_pointers(), except that save orginal
  * data in cb->data and cb->meta_data for restore.
  */
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index a77d8908e737..14c3bd0a5088 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -45,10 +45,7 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
 
 	filter = rcu_dereference(prog->filter);
 	if (at_ingress) {
-		__skb_push(skb, skb->mac_len);
-		bpf_compute_data_pointers(skb);
-		filter_res = bpf_prog_run(filter, skb);
-		__skb_pull(skb, skb->mac_len);
+		filter_res = bpf_prog_run_at_ingress(filter, skb);
 	} else {
 		bpf_compute_data_pointers(skb);
 		filter_res = bpf_prog_run(filter, skb);
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index df19a847829e..036b2e1f74af 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -93,11 +93,7 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 		if (tc_skip_sw(prog->gen_flags)) {
 			filter_res = prog->exts_integrated ? TC_ACT_UNSPEC : 0;
 		} else if (at_ingress) {
-			/* It is safe to push/pull even if skb_shared() */
-			__skb_push(skb, skb->mac_len);
-			bpf_compute_data_pointers(skb);
-			filter_res = bpf_prog_run(prog->filter, skb);
-			__skb_pull(skb, skb->mac_len);
+			filter_res = bpf_prog_run_at_ingress(prog->filter, skb);
 		} else {
 			bpf_compute_data_pointers(skb);
 			filter_res = bpf_prog_run(prog->filter, skb);
-- 
2.30.2



* [PATCH v4 net-next 6/8] bpf: Clear skb->mono_delivery_time bit if needed after running tc-bpf@egress
  2022-02-11  7:12 [PATCH v4 net-next 0/8] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
                   ` (4 preceding siblings ...)
  2022-02-11  7:13 ` [PATCH v4 net-next 5/8] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress Martin KaFai Lau
@ 2022-02-11  7:13 ` Martin KaFai Lau
  2022-02-11  7:13 ` [PATCH v4 net-next 7/8] bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_delivery_time() Martin KaFai Lau
  2022-02-11  7:13 ` [PATCH v4 net-next 8/8] bpf: selftests: test skb->tstamp in redirect_neigh Martin KaFai Lau
  7 siblings, 0 replies; 19+ messages in thread
From: Martin KaFai Lau @ 2022-02-11  7:13 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

The tc-bpf@egress reads and writes the skb->tstamp as the delivery_time.
If tc-bpf@egress sets skb->tstamp to 0, skb->mono_delivery_time should
also be cleared.  This is done in cls_bpf.c and act_bpf.c after
running the tc-bpf@egress.
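The fixup amounts to the following invariant check (a hypothetical
userspace sketch; the real two-line change sits after bpf_prog_run() in
cls_bpf_classify() and tcf_bpf_act() in the diff below):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct eg_skb {
	uint64_t tstamp;
	bool mono_delivery_time;
};

/* After tc-bpf@egress, a tstamp the prog zeroed must also drop the
 * mono_delivery_time bit, so a later qdisc does not treat the skb as
 * still carrying an EDT. */
static void eg_post_bpf_fixup(struct eg_skb *skb)
{
	if (skb->mono_delivery_time && !skb->tstamp)
		skb->mono_delivery_time = false;
}
```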

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/sched/act_bpf.c | 2 ++
 net/sched/cls_bpf.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 14c3bd0a5088..d89470976500 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -49,6 +49,8 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
 	} else {
 		bpf_compute_data_pointers(skb);
 		filter_res = bpf_prog_run(filter, skb);
+		if (unlikely(skb->mono_delivery_time && !skb->tstamp))
+			skb->mono_delivery_time = 0;
 	}
 	if (skb_sk_is_prefetched(skb) && filter_res != TC_ACT_OK)
 		skb_orphan(skb);
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 036b2e1f74af..c251d1df1fa3 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -97,6 +97,8 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 		} else {
 			bpf_compute_data_pointers(skb);
 			filter_res = bpf_prog_run(prog->filter, skb);
+			if (unlikely(skb->mono_delivery_time && !skb->tstamp))
+				skb->mono_delivery_time = 0;
 		}
 
 		if (prog->exts_integrated) {
-- 
2.30.2



* [PATCH v4 net-next 7/8] bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_delivery_time()
  2022-02-11  7:12 [PATCH v4 net-next 0/8] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
                   ` (5 preceding siblings ...)
  2022-02-11  7:13 ` [PATCH v4 net-next 6/8] bpf: Clear skb->mono_delivery_time bit if needed after running tc-bpf@egress Martin KaFai Lau
@ 2022-02-11  7:13 ` Martin KaFai Lau
  2022-02-11  7:13 ` [PATCH v4 net-next 8/8] bpf: selftests: test skb->tstamp in redirect_neigh Martin KaFai Lau
  7 siblings, 0 replies; 19+ messages in thread
From: Martin KaFai Lau @ 2022-02-11  7:13 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

* __sk_buff->delivery_time_type:
This patch adds __sk_buff->delivery_time_type.  It tells whether the
delivery_time is stored in __sk_buff->tstamp or not.

It is most useful at ingress, to tell whether the __sk_buff->tstamp
has the (rcv) timestamp or the delivery_time.  If delivery_time_type
is 0 (BPF_SKB_DELIVERY_TIME_NONE), it has the (rcv) timestamp.

Two non-zero types are defined for the delivery_time_type,
BPF_SKB_DELIVERY_TIME_MONO and BPF_SKB_DELIVERY_TIME_UNSPEC.  UNSPEC
can only happen at egress because only a mono delivery_time can be
forwarded to ingress for now.  The clock of an UNSPEC delivery_time
can be deduced from skb->sk->sk_clockid, which is how
sch_etf does it as well.

Thus, while delivery_time_type provides a (rcv) timestamp
vs delivery_time signal to tc-bpf@ingress, it should not change the
existing way of doing things for tc-bpf@egress, other than spelling
it out more explicitly in the new __sk_buff->delivery_time_type
instead of having the tc-bpf prog deduce it by checking whether the sk
is tcp (mono EDT) or by checking sk->sk_clockid for non-tcp.

delivery_time_type is read only.  Its convert_ctx_access() requires
the skb's mono_delivery_time bit and tc_at_ingress bit.
They are moved up in sk_buff so that the bpf rewrite can be done at a
fixed offset.  tc_skip_classify is moved together with tc_at_ingress.
To free up one bit for mono_delivery_time, csum_not_inet (a bit
currently used by sctp) is moved down.

* Provide forwarded delivery_time to tc-bpf@ingress:
With the help of the new delivery_time_type, the tc-bpf has a way
to tell whether the __sk_buff->tstamp has the (rcv) timestamp or
the delivery_time.  At bpf load time, the verifier will learn whether
the bpf prog has accessed the new __sk_buff->delivery_time_type.
If it has, it means the tc-bpf@ingress expects that
skb->tstamp may hold the delivery_time.  The kernel will then keep
the forwarded delivery_time in skb->tstamp.  This is done by adding a
new prog->delivery_time_access bit.

Since the tc-bpf@ingress can access the delivery_time,
it also needs to clear the skb->mono_delivery_time after
running the bpf prog if 0 has been written to skb->tstamp.  This
is the same as for tc-bpf@egress in the previous patch.

For tail calls, the callee follows the __sk_buff->tstamp
expectation of its caller at ingress.  If the caller does not have
its prog->delivery_time_access set, the callee prog will not see
the forwarded delivery_time in __sk_buff->tstamp and will see
the (rcv) timestamp instead.  If needed, a new
attach_type can be added in the future to allow the tc-bpf to
explicitly specify its expectation on the __sk_buff->tstamp.

* bpf_skb_set_delivery_time():
The bpf_skb_set_delivery_time() helper is added to allow setting both
the delivery_time and the delivery_time_type at the same time.  If the
tc-bpf does not need to change the delivery_time_type, it can directly
write to the __sk_buff->tstamp as the existing tc-bpf already
does.  The helper is most useful at ingress, to change the
__sk_buff->tstamp from the (rcv) timestamp to
a mono delivery_time and then bpf_redirect_*().

bpf only has a mono clock helper (bpf_ktime_get_ns), the
currently known use case is the mono EDT for fq, and
only a mono delivery_time can be kept during forwarding for now,
so bpf_skb_set_delivery_time() only supports setting
BPF_SKB_DELIVERY_TIME_MONO.  It can be extended later when use cases
come up and the forwarding path also supports other clock bases.
Making this function inline is left as a future exercise.
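The decision tree that bpf_convert_dtime_type_read() emits as BPF
instructions (in the diff below) can be written out in C for clarity
(a hypothetical illustration, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
	DTIME_NONE,   /* BPF_SKB_DELIVERY_TIME_NONE */
	DTIME_UNSPEC, /* BPF_SKB_DELIVERY_TIME_UNSPEC */
	DTIME_MONO,   /* BPF_SKB_DELIVERY_TIME_MONO */
};

/* mono bit set -> MONO; tstamp == 0 -> NONE; a non-zero, non-mono
 * tstamp is only a delivery_time at egress, so it is UNSPEC there
 * and NONE (i.e. a (rcv) timestamp) at ingress. */
static int dtime_type(bool mono_delivery_time, uint64_t tstamp,
		      bool tc_at_ingress)
{
	if (mono_delivery_time)
		return DTIME_MONO;
	if (!tstamp)
		return DTIME_NONE;
	return tc_at_ingress ? DTIME_NONE : DTIME_UNSPEC;
}
```

This also shows why the mono_delivery_time and tc_at_ingress bits need
fixed offsets: the rewritten BPF loads them directly as single bytes.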

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/filter.h         |  7 ++-
 include/linux/skbuff.h         | 20 ++++++---
 include/uapi/linux/bpf.h       | 35 ++++++++++++++-
 net/core/filter.c              | 79 +++++++++++++++++++++++++++++++++-
 tools/include/uapi/linux/bpf.h | 35 ++++++++++++++-
 5 files changed, 164 insertions(+), 12 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index e43e1701a80e..00bbde352ad0 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -572,7 +572,8 @@ struct bpf_prog {
 				has_callchain_buf:1, /* callchain buffer allocated? */
 				enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */
 				call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */
-				call_get_func_ip:1; /* Do we call get_func_ip() */
+				call_get_func_ip:1, /* Do we call get_func_ip() */
+				delivery_time_access:1; /* Accessed __sk_buff->delivery_time_type */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
@@ -705,7 +706,7 @@ static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog,
 	ktime_t tstamp, saved_mono_dtime = 0;
 	int filter_res;
 
-	if (unlikely(skb->mono_delivery_time)) {
+	if (unlikely(skb->mono_delivery_time) && !prog->delivery_time_access) {
 		saved_mono_dtime = skb->tstamp;
 		skb->mono_delivery_time = 0;
 		if (static_branch_unlikely(&netstamp_needed_key))
@@ -723,6 +724,8 @@ static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog,
 	/* __sk_buff->tstamp was not changed, restore the delivery_time */
 	if (unlikely(saved_mono_dtime) && skb_tstamp(skb) == tstamp)
 		skb_set_delivery_time(skb, saved_mono_dtime, true);
+	if (unlikely(skb->mono_delivery_time && !skb->tstamp))
+		skb->mono_delivery_time = 0;
 
 	return filter_res;
 }
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0e09e75fa787..fb7146be48f7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -893,22 +893,23 @@ struct sk_buff {
 	__u8			vlan_present:1;	/* See PKT_VLAN_PRESENT_BIT */
 	__u8			csum_complete_sw:1;
 	__u8			csum_level:2;
-	__u8			csum_not_inet:1;
 	__u8			dst_pending_confirm:1;
+	__u8			mono_delivery_time:1;
+
+#ifdef CONFIG_NET_CLS_ACT
+	__u8			tc_skip_classify:1;
+	__u8			tc_at_ingress:1;
+#endif
 #ifdef CONFIG_IPV6_NDISC_NODETYPE
 	__u8			ndisc_nodetype:2;
 #endif
-
+	__u8			csum_not_inet:1;
 	__u8			ipvs_property:1;
 	__u8			inner_protocol_type:1;
 	__u8			remcsum_offload:1;
 #ifdef CONFIG_NET_SWITCHDEV
 	__u8			offload_fwd_mark:1;
 	__u8			offload_l3_fwd_mark:1;
-#endif
-#ifdef CONFIG_NET_CLS_ACT
-	__u8			tc_skip_classify:1;
-	__u8			tc_at_ingress:1;
 #endif
 	__u8			redirected:1;
 #ifdef CONFIG_NET_REDIRECT
@@ -921,7 +922,6 @@ struct sk_buff {
 	__u8			decrypted:1;
 #endif
 	__u8			slow_gro:1;
-	__u8			mono_delivery_time:1;
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
@@ -999,10 +999,16 @@ struct sk_buff {
 /* if you move pkt_vlan_present around you also must adapt these constants */
 #ifdef __BIG_ENDIAN_BITFIELD
 #define PKT_VLAN_PRESENT_BIT	7
+#define TC_AT_INGRESS_MASK		(1 << 0)
+#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 2)
 #else
 #define PKT_VLAN_PRESENT_BIT	0
+#define TC_AT_INGRESS_MASK		(1 << 7)
+#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 5)
 #endif
 #define PKT_VLAN_PRESENT_OFFSET	offsetof(struct sk_buff, __pkt_vlan_present_offset)
+#define TC_AT_INGRESS_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
+#define SKB_MONO_DELIVERY_TIME_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
 
 #ifdef __KERNEL__
 /*
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 16a7574292a5..b36771b1bfa1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5076,6 +5076,31 @@ union bpf_attr {
  *		associated to *xdp_md*, at *offset*.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * long bpf_skb_set_delivery_time(struct sk_buff *skb, u64 dtime, u32 dtime_type)
+ *	Description
+ *		Set a *dtime* (delivery time) to the __sk_buff->tstamp and also
+ *		change the __sk_buff->delivery_time_type to *dtime_type*.
+ *
+ *		Only BPF_SKB_DELIVERY_TIME_MONO is supported in *dtime_type*
+ *		and it is the only delivery_time_type that will be kept
+ *		after bpf_redirect_*().
+ *		Only ipv4 and ipv6 skb->protocol is supported.
+ *
+ *		If there is no need to change the __sk_buff->delivery_time_type,
+ *		the delivery_time can be directly written to __sk_buff->tstamp
+ *		instead.
+ *
+ *		This function is most useful when it needs to set a
+ *		mono delivery_time to __sk_buff->tstamp and then
+ *		bpf_redirect_*() to the egress of an iface.  For example,
+ *		changing the (rcv) timestamp in __sk_buff->tstamp at
+ *		ingress to a mono delivery time and then bpf_redirect_*()
+ *		to sch_fq@phy-dev.
+ *	Return
+ *		0 on success.
+ *		**-EINVAL** for invalid input
+ *		**-EOPNOTSUPP** for unsupported delivery_time_type and protocol
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5269,6 +5294,7 @@ union bpf_attr {
 	FN(xdp_get_buff_len),		\
 	FN(xdp_load_bytes),		\
 	FN(xdp_store_bytes),		\
+	FN(skb_set_delivery_time),	\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5458,6 +5484,12 @@ union {					\
 	__u64 :64;			\
 } __attribute__((aligned(8)))
 
+enum {
+	BPF_SKB_DELIVERY_TIME_NONE,
+	BPF_SKB_DELIVERY_TIME_UNSPEC,
+	BPF_SKB_DELIVERY_TIME_MONO,
+};
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
@@ -5498,7 +5530,8 @@ struct __sk_buff {
 	__u32 gso_segs;
 	__bpf_md_ptr(struct bpf_sock *, sk);
 	__u32 gso_size;
-	__u32 :32;		/* Padding, future use. */
+	__u8  delivery_time_type;
+	__u32 :24;		/* Padding, future use. */
 	__u64 hwtstamp;
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index a2d712be4985..0e79a6ca4a95 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7159,6 +7159,33 @@ static const struct bpf_func_proto bpf_sk_assign_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_skb_set_delivery_time, struct sk_buff *, skb,
+	   u64, dtime, u32, dtime_type)
+{
+	if (!dtime)
+		return -EINVAL;
+
+	/* skb_clear_delivery_time() is done for inet protocol */
+	if (dtime_type != BPF_SKB_DELIVERY_TIME_MONO ||
+	    (skb->protocol != htons(ETH_P_IP) &&
+	     skb->protocol != htons(ETH_P_IPV6)))
+		return -EOPNOTSUPP;
+
+	skb->mono_delivery_time = 1;
+	skb->tstamp = dtime;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_skb_set_delivery_time_proto = {
+	.func		= bpf_skb_set_delivery_time,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 static const u8 *bpf_search_tcp_opt(const u8 *op, const u8 *opend,
 				    u8 search_kind, const u8 *magic,
 				    u8 magic_len, bool *eol)
@@ -7746,6 +7773,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_gen_syncookie_proto;
 	case BPF_FUNC_sk_assign:
 		return &bpf_sk_assign_proto;
+	case BPF_FUNC_skb_set_delivery_time:
+		return &bpf_skb_set_delivery_time_proto;
 #endif
 	default:
 		return bpf_sk_base_func_proto(func_id);
@@ -8085,7 +8114,9 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type
 			return false;
 		info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
 		break;
-	case offsetofend(struct __sk_buff, gso_size) ... offsetof(struct __sk_buff, hwtstamp) - 1:
+	case offsetof(struct __sk_buff, delivery_time_type):
+		return false;
+	case offsetofend(struct __sk_buff, delivery_time_type) ... offsetof(struct __sk_buff, hwtstamp) - 1:
 		/* Explicitly prohibit access to padding in __sk_buff. */
 		return false;
 	default:
@@ -8432,6 +8463,8 @@ static bool tc_cls_act_is_valid_access(int off, int size,
 		break;
 	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
 		return false;
+	case offsetof(struct __sk_buff, delivery_time_type):
+		return size == sizeof(__u8);
 	}
 
 	return bpf_skb_is_valid_access(off, size, type, prog, info);
@@ -8848,6 +8881,45 @@ static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si,
 	return insn;
 }
 
+static struct bpf_insn *bpf_convert_dtime_type_read(const struct bpf_insn *si,
+						    struct bpf_insn *insn)
+{
+	__u8 value_reg = si->dst_reg;
+	__u8 skb_reg = si->src_reg;
+	__u8 tmp_reg = BPF_REG_AX;
+
+	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
+			      SKB_MONO_DELIVERY_TIME_OFFSET);
+	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
+				SKB_MONO_DELIVERY_TIME_MASK);
+	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
+	/* value_reg = BPF_SKB_DELIVERY_TIME_MONO */
+	*insn++ = BPF_MOV32_IMM(value_reg, BPF_SKB_DELIVERY_TIME_MONO);
+	*insn++ = BPF_JMP_A(IS_ENABLED(CONFIG_NET_CLS_ACT) ? 10 : 5);
+
+	*insn++ = BPF_LDX_MEM(BPF_DW, tmp_reg, skb_reg,
+			      offsetof(struct sk_buff, tstamp));
+	*insn++ = BPF_JMP_IMM(BPF_JNE, tmp_reg, 0, 2);
+	/* value_reg = BPF_SKB_DELIVERY_TIME_NONE */
+	*insn++ = BPF_MOV32_IMM(value_reg, BPF_SKB_DELIVERY_TIME_NONE);
+	*insn++ = BPF_JMP_A(IS_ENABLED(CONFIG_NET_CLS_ACT) ? 6 : 1);
+
+#ifdef CONFIG_NET_CLS_ACT
+	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
+	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, TC_AT_INGRESS_MASK);
+	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
+	/* At ingress, value_reg = 0 */
+	*insn++ = BPF_MOV32_IMM(value_reg, 0);
+	*insn++ = BPF_JMP_A(1);
+#endif
+
+	/* value_reg = BPF_SKB_DELIVERY_TIME_UNSPEC */
+	*insn++ = BPF_MOV32_IMM(value_reg, BPF_SKB_DELIVERY_TIME_UNSPEC);
+
+	/* 15 insns with CONFIG_NET_CLS_ACT */
+	return insn;
+}
+
 static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 				  const struct bpf_insn *si,
 				  struct bpf_insn *insn_buf,
@@ -9169,6 +9241,11 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 							     target_size));
 		break;
 
+	case offsetof(struct __sk_buff, delivery_time_type):
+		insn = bpf_convert_dtime_type_read(si, insn);
+		prog->delivery_time_access = 1;
+		break;
+
 	case offsetof(struct __sk_buff, gso_segs):
 		insn = bpf_convert_shinfo_access(si, insn);
 		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct skb_shared_info, gso_segs),
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 16a7574292a5..b36771b1bfa1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5076,6 +5076,31 @@ union bpf_attr {
  *		associated to *xdp_md*, at *offset*.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * long bpf_skb_set_delivery_time(struct sk_buff *skb, u64 dtime, u32 dtime_type)
+ *	Description
+ *		Set a *dtime* (delivery time) to the __sk_buff->tstamp and also
+ *		change the __sk_buff->delivery_time_type to *dtime_type*.
+ *
+ *		Only BPF_SKB_DELIVERY_TIME_MONO is supported in *dtime_type*
+ *		and it is the only delivery_time_type that will be kept
+ *		after bpf_redirect_*().
+ *		Only ipv4 and ipv6 skb->protocol are supported.
+ *
+ *		If there is no need to change the __sk_buff->delivery_time_type,
+ *		the delivery_time can be directly written to __sk_buff->tstamp
+ *		instead.
+ *
+ *		This helper is most useful when a mono delivery_time
+ *		needs to be set in __sk_buff->tstamp before
+ *		bpf_redirect_*() to the egress of an iface.  For example,
+ *		changing the (rcv) timestamp in __sk_buff->tstamp at
+ *		ingress to a mono delivery_time and then bpf_redirect_*()
+ *		to sch_fq@phy-dev.
+ *	Return
+ *		0 on success.
+ *		**-EINVAL** for invalid input
+ *		**-EOPNOTSUPP** for unsupported delivery_time_type and protocol
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5269,6 +5294,7 @@ union bpf_attr {
 	FN(xdp_get_buff_len),		\
 	FN(xdp_load_bytes),		\
 	FN(xdp_store_bytes),		\
+	FN(skb_set_delivery_time),	\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5458,6 +5484,12 @@ union {					\
 	__u64 :64;			\
 } __attribute__((aligned(8)))
 
+enum {
+	BPF_SKB_DELIVERY_TIME_NONE,
+	BPF_SKB_DELIVERY_TIME_UNSPEC,
+	BPF_SKB_DELIVERY_TIME_MONO,
+};
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
@@ -5498,7 +5530,8 @@ struct __sk_buff {
 	__u32 gso_segs;
 	__bpf_md_ptr(struct bpf_sock *, sk);
 	__u32 gso_size;
-	__u32 :32;		/* Padding, future use. */
+	__u8  delivery_time_type;
+	__u32 :24;		/* Padding, future use. */
 	__u64 hwtstamp;
 };
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

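As an editorial aside for readers tracing the instruction sequence emitted by
bpf_convert_dtime_type_read() in the patch above: read as plain C, the emitted
branches are roughly equivalent to the sketch below. This is an illustrative
model only, not kernel code; `have_cls_act` stands in for CONFIG_NET_CLS_ACT
being enabled and `tc_at_ingress` for the skb's tc_at_ingress bit.

```c
#include <assert.h>

enum {
	BPF_SKB_DELIVERY_TIME_NONE,	/* 0 */
	BPF_SKB_DELIVERY_TIME_UNSPEC,	/* 1 */
	BPF_SKB_DELIVERY_TIME_MONO,	/* 2 */
};

/* Plain-C model of the branches in bpf_convert_dtime_type_read():
 * mono bit set => MONO; zero tstamp => NONE; otherwise a non-mono,
 * non-zero tstamp reads as NONE at tc ingress (the rcv timestamp is
 * hidden there) and UNSPEC everywhere else.
 */
static int dtime_type(int mono_delivery_time, unsigned long long tstamp,
		      int have_cls_act, int tc_at_ingress)
{
	if (mono_delivery_time)
		return BPF_SKB_DELIVERY_TIME_MONO;
	if (!tstamp)
		return BPF_SKB_DELIVERY_TIME_NONE;
	if (have_cls_act && tc_at_ingress)
		return BPF_SKB_DELIVERY_TIME_NONE;
	return BPF_SKB_DELIVERY_TIME_UNSPEC;
}
```

The ordering matters: the mono bit is tested before the tstamp load, matching
the jump targets in the emitted 15-instruction sequence.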
* [PATCH v4 net-next 8/8] bpf: selftests: test skb->tstamp in redirect_neigh
  2022-02-11  7:12 [PATCH v4 net-next 0/8] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
                   ` (6 preceding siblings ...)
  2022-02-11  7:13 ` [PATCH v4 net-next 7/8] bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_delivery_time() Martin KaFai Lau
@ 2022-02-11  7:13 ` Martin KaFai Lau
  7 siblings, 0 replies; 19+ messages in thread
From: Martin KaFai Lau @ 2022-02-11  7:13 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

This patch adds tests on forwarding the delivery_time for
the following cases
- tcp/udp + ip4/ip6 + bpf_redirect_neigh
- tcp/udp + ip4/ip6 + ip[6]_forward
- bpf_skb_set_delivery_time
- The old rcv timestamp expectation on tc-bpf@ingress

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 .../selftests/bpf/prog_tests/tc_redirect.c    | 434 ++++++++++++++++++
 .../selftests/bpf/progs/test_tc_dtime.c       | 348 ++++++++++++++
 2 files changed, 782 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_dtime.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tc_redirect.c b/tools/testing/selftests/bpf/prog_tests/tc_redirect.c
index c2426df58e17..3036154916a7 100644
--- a/tools/testing/selftests/bpf/prog_tests/tc_redirect.c
+++ b/tools/testing/selftests/bpf/prog_tests/tc_redirect.c
@@ -17,6 +17,8 @@
 #include <linux/if_tun.h>
 #include <linux/limits.h>
 #include <linux/sysctl.h>
+#include <linux/time_types.h>
+#include <linux/net_tstamp.h>
 #include <sched.h>
 #include <stdbool.h>
 #include <stdio.h>
@@ -29,6 +31,11 @@
 #include "test_tc_neigh_fib.skel.h"
 #include "test_tc_neigh.skel.h"
 #include "test_tc_peer.skel.h"
+#include "test_tc_dtime.skel.h"
+
+#ifndef TCP_TX_DELAY
+#define TCP_TX_DELAY 37
+#endif
 
 #define NS_SRC "ns_src"
 #define NS_FWD "ns_fwd"
@@ -61,6 +68,7 @@
 #define CHK_PROG_PIN_FILE "/sys/fs/bpf/test_tc_chk"
 
 #define TIMEOUT_MILLIS 10000
+#define NSEC_PER_SEC 1000000000ULL
 
 #define log_err(MSG, ...) \
 	fprintf(stderr, "(%s:%d: errno: %s) " MSG "\n", \
@@ -440,6 +448,431 @@ static int set_forwarding(bool enable)
 	return 0;
 }
 
+static void rcv_tstamp(int fd, const char *expected, size_t s)
+{
+	struct __kernel_timespec pkt_ts = {};
+	char ctl[CMSG_SPACE(sizeof(pkt_ts))];
+	struct timespec now_ts;
+	struct msghdr msg = {};
+	__u64 now_ns, pkt_ns;
+	struct cmsghdr *cmsg;
+	struct iovec iov;
+	char data[32];
+	int ret;
+
+	iov.iov_base = data;
+	iov.iov_len = sizeof(data);
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+	msg.msg_control = &ctl;
+	msg.msg_controllen = sizeof(ctl);
+
+	ret = recvmsg(fd, &msg, 0);
+	if (!ASSERT_EQ(ret, s, "recvmsg"))
+		return;
+	ASSERT_STRNEQ(data, expected, s, "expected rcv data");
+
+	cmsg = CMSG_FIRSTHDR(&msg);
+	if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
+	    cmsg->cmsg_type == SO_TIMESTAMPNS_NEW)
+		memcpy(&pkt_ts, CMSG_DATA(cmsg), sizeof(pkt_ts));
+
+	pkt_ns = pkt_ts.tv_sec * NSEC_PER_SEC + pkt_ts.tv_nsec;
+	ASSERT_NEQ(pkt_ns, 0, "pkt rcv tstamp");
+
+	ret = clock_gettime(CLOCK_REALTIME, &now_ts);
+	ASSERT_OK(ret, "clock_gettime");
+	now_ns = now_ts.tv_sec * NSEC_PER_SEC + now_ts.tv_nsec;
+
+	if (ASSERT_GE(now_ns, pkt_ns, "check rcv tstamp"))
+		ASSERT_LT(now_ns - pkt_ns, 5 * NSEC_PER_SEC,
+			  "check rcv tstamp");
+}
+
+static void snd_tstamp(int fd, char *b, size_t s)
+{
+	struct sock_txtime opt = { .clockid = CLOCK_TAI };
+	char ctl[CMSG_SPACE(sizeof(__u64))];
+	struct timespec now_ts;
+	struct msghdr msg = {};
+	struct cmsghdr *cmsg;
+	struct iovec iov;
+	__u64 now_ns;
+	int ret;
+
+	ret = clock_gettime(CLOCK_TAI, &now_ts);
+	ASSERT_OK(ret, "clock_get_time(CLOCK_TAI)");
+	now_ns = now_ts.tv_sec * NSEC_PER_SEC + now_ts.tv_nsec;
+
+	iov.iov_base = b;
+	iov.iov_len = s;
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+	msg.msg_control = &ctl;
+	msg.msg_controllen = sizeof(ctl);
+
+	cmsg = CMSG_FIRSTHDR(&msg);
+	cmsg->cmsg_level = SOL_SOCKET;
+	cmsg->cmsg_type = SCM_TXTIME;
+	cmsg->cmsg_len = CMSG_LEN(sizeof(now_ns));
+	*(__u64 *)CMSG_DATA(cmsg) = now_ns;
+
+	ret = setsockopt(fd, SOL_SOCKET, SO_TXTIME, &opt, sizeof(opt));
+	ASSERT_OK(ret, "setsockopt(SO_TXTIME)");
+
+	ret = sendmsg(fd, &msg, 0);
+	ASSERT_EQ(ret, s, "sendmsg");
+}
+
+static void test_inet_dtime(int family, int type, const char *addr, __u16 port)
+{
+	int opt = 1, accept_fd = -1, client_fd = -1, listen_fd, err;
+	char buf[] = "testing testing";
+	struct nstoken *nstoken;
+
+	nstoken = open_netns(NS_DST);
+	if (!ASSERT_OK_PTR(nstoken, "setns dst"))
+		return;
+	listen_fd = start_server(family, type, addr, port, 0);
+	close_netns(nstoken);
+
+	if (!ASSERT_GE(listen_fd, 0, "listen"))
+		return;
+
+	/* Ensure the kernel puts the (rcv) timestamp on all skbs */
+	err = setsockopt(listen_fd, SOL_SOCKET, SO_TIMESTAMPNS_NEW,
+			 &opt, sizeof(opt));
+	if (!ASSERT_OK(err, "setsockopt(SO_TIMESTAMPNS_NEW)"))
+		goto done;
+
+	if (type == SOCK_STREAM) {
+		/* Ensure the kernel sets EDT when sending out rst/ack
+		 * from the kernel's ctl_sk.
+		 */
+		err = setsockopt(listen_fd, SOL_TCP, TCP_TX_DELAY, &opt,
+				 sizeof(opt));
+		if (!ASSERT_OK(err, "setsockopt(TCP_TX_DELAY)"))
+			goto done;
+	}
+
+	nstoken = open_netns(NS_SRC);
+	if (!ASSERT_OK_PTR(nstoken, "setns src"))
+		goto done;
+	client_fd = connect_to_fd(listen_fd, TIMEOUT_MILLIS);
+	close_netns(nstoken);
+
+	if (!ASSERT_GE(client_fd, 0, "connect_to_fd"))
+		goto done;
+
+	if (type == SOCK_STREAM) {
+		int n;
+
+		accept_fd = accept(listen_fd, NULL, NULL);
+		if (!ASSERT_GE(accept_fd, 0, "accept"))
+			goto done;
+
+		n = write(client_fd, buf, sizeof(buf));
+		if (!ASSERT_EQ(n, sizeof(buf), "send to server"))
+			goto done;
+		rcv_tstamp(accept_fd, buf, sizeof(buf));
+	} else {
+		snd_tstamp(client_fd, buf, sizeof(buf));
+		rcv_tstamp(listen_fd, buf, sizeof(buf));
+	}
+
+done:
+	close(listen_fd);
+	if (accept_fd != -1)
+		close(accept_fd);
+	if (client_fd != -1)
+		close(client_fd);
+}
+
+static int netns_load_dtime_bpf(struct test_tc_dtime *skel)
+{
+	struct nstoken *nstoken;
+
+#define PIN_FNAME(__file) "/sys/fs/bpf/" #__file
+#define PIN(__prog) ({							\
+		int err = bpf_program__pin(skel->progs.__prog, PIN_FNAME(__prog)); \
+		if (!ASSERT_OK(err, "pin " #__prog))		\
+			goto fail;					\
+		})
+
+	/* setup ns_src tc progs */
+	nstoken = open_netns(NS_SRC);
+	if (!ASSERT_OK_PTR(nstoken, "setns " NS_SRC))
+		return -1;
+	PIN(egress_host);
+	PIN(ingress_host);
+	SYS("tc qdisc add dev veth_src clsact");
+	SYS("tc filter add dev veth_src ingress bpf da object-pinned "
+	    PIN_FNAME(ingress_host));
+	SYS("tc filter add dev veth_src egress bpf da object-pinned "
+	    PIN_FNAME(egress_host));
+	close_netns(nstoken);
+
+	/* setup ns_dst tc progs */
+	nstoken = open_netns(NS_DST);
+	if (!ASSERT_OK_PTR(nstoken, "setns " NS_DST))
+		return -1;
+	PIN(egress_host);
+	PIN(ingress_host);
+	SYS("tc qdisc add dev veth_dst clsact");
+	SYS("tc filter add dev veth_dst ingress bpf da object-pinned "
+	    PIN_FNAME(ingress_host));
+	SYS("tc filter add dev veth_dst egress bpf da object-pinned "
+	    PIN_FNAME(egress_host));
+	close_netns(nstoken);
+
+	/* setup ns_fwd tc progs */
+	nstoken = open_netns(NS_FWD);
+	if (!ASSERT_OK_PTR(nstoken, "setns " NS_FWD))
+		return -1;
+	PIN(ingress_fwdns_prio100);
+	PIN(egress_fwdns_prio100);
+	PIN(ingress_fwdns_prio101);
+	PIN(egress_fwdns_prio101);
+	SYS("tc qdisc add dev veth_dst_fwd clsact");
+	SYS("tc filter add dev veth_dst_fwd ingress prio 100 bpf da object-pinned "
+	    PIN_FNAME(ingress_fwdns_prio100));
+	SYS("tc filter add dev veth_dst_fwd ingress prio 101 bpf da object-pinned "
+	    PIN_FNAME(ingress_fwdns_prio101));
+	SYS("tc filter add dev veth_dst_fwd egress prio 100 bpf da object-pinned "
+	    PIN_FNAME(egress_fwdns_prio100));
+	SYS("tc filter add dev veth_dst_fwd egress prio 101 bpf da object-pinned "
+	    PIN_FNAME(egress_fwdns_prio101));
+	SYS("tc qdisc add dev veth_src_fwd clsact");
+	SYS("tc filter add dev veth_src_fwd ingress prio 100 bpf da object-pinned "
+	    PIN_FNAME(ingress_fwdns_prio100));
+	SYS("tc filter add dev veth_src_fwd ingress prio 101 bpf da object-pinned "
+	    PIN_FNAME(ingress_fwdns_prio101));
+	SYS("tc filter add dev veth_src_fwd egress prio 100 bpf da object-pinned "
+	    PIN_FNAME(egress_fwdns_prio100));
+	SYS("tc filter add dev veth_src_fwd egress prio 101 bpf da object-pinned "
+	    PIN_FNAME(egress_fwdns_prio101));
+	close_netns(nstoken);
+
+#undef PIN
+
+	return 0;
+
+fail:
+	close_netns(nstoken);
+	return -1;
+}
+
+enum {
+	INGRESS_FWDNS_P100,
+	INGRESS_FWDNS_P101,
+	EGRESS_FWDNS_P100,
+	EGRESS_FWDNS_P101,
+	INGRESS_ENDHOST,
+	EGRESS_ENDHOST,
+	SET_DTIME,
+	__MAX_CNT,
+};
+
+const char *cnt_names[] = {
+	"ingress_fwdns_p100",
+	"ingress_fwdns_p101",
+	"egress_fwdns_p100",
+	"egress_fwdns_p101",
+	"ingress_endhost",
+	"egress_endhost",
+	"set_dtime",
+};
+
+enum {
+	TCP_IP6_CLEAR_DTIME,
+	TCP_IP4,
+	TCP_IP6,
+	UDP_IP4,
+	UDP_IP6,
+	TCP_IP4_RT_FWD,
+	TCP_IP6_RT_FWD,
+	UDP_IP4_RT_FWD,
+	UDP_IP6_RT_FWD,
+	UKN_TEST,
+	__NR_TESTS,
+};
+
+const char *test_names[] = {
+	"tcp ip6 clear dtime",
+	"tcp ip4",
+	"tcp ip6",
+	"udp ip4",
+	"udp ip6",
+	"tcp ip4 rt fwd",
+	"tcp ip6 rt fwd",
+	"udp ip4 rt fwd",
+	"udp ip6 rt fwd",
+};
+
+static const char *dtime_cnt_str(int test, int cnt)
+{
+	static char name[64];
+
+	snprintf(name, sizeof(name), "%s %s", test_names[test], cnt_names[cnt]);
+
+	return name;
+}
+
+static const char *dtime_err_str(int test, int cnt)
+{
+	static char name[64];
+
+	snprintf(name, sizeof(name), "%s %s errs", test_names[test],
+		 cnt_names[cnt]);
+
+	return name;
+}
+
+static void test_tcp_clear_dtime(struct test_tc_dtime *skel)
+{
+	int i, t = TCP_IP6_CLEAR_DTIME;
+	__u32 *dtimes = skel->bss->dtimes[t];
+	__u32 *errs = skel->bss->errs[t];
+
+	skel->bss->test = t;
+	test_inet_dtime(AF_INET6, SOCK_STREAM, IP6_DST, 0);
+
+	ASSERT_EQ(dtimes[INGRESS_FWDNS_P100], 0,
+		  dtime_cnt_str(t, INGRESS_FWDNS_P100));
+	ASSERT_EQ(dtimes[INGRESS_FWDNS_P101], 0,
+		  dtime_cnt_str(t, INGRESS_FWDNS_P101));
+	ASSERT_GT(dtimes[EGRESS_FWDNS_P100], 0,
+		  dtime_cnt_str(t, EGRESS_FWDNS_P100));
+	ASSERT_EQ(dtimes[EGRESS_FWDNS_P101], 0,
+		  dtime_cnt_str(t, EGRESS_FWDNS_P101));
+	ASSERT_GT(dtimes[EGRESS_ENDHOST], 0,
+		  dtime_cnt_str(t, EGRESS_ENDHOST));
+	ASSERT_GT(dtimes[INGRESS_ENDHOST], 0,
+		  dtime_cnt_str(t, INGRESS_ENDHOST));
+
+	for (i = INGRESS_FWDNS_P100; i < __MAX_CNT; i++)
+		ASSERT_EQ(errs[i], 0, dtime_err_str(t, i));
+}
+
+static void test_tcp_dtime(struct test_tc_dtime *skel, int family, bool bpf_fwd)
+{
+	__u32 *dtimes, *errs;
+	const char *addr;
+	int i, t;
+
+	if (family == AF_INET) {
+		t = bpf_fwd ? TCP_IP4 : TCP_IP4_RT_FWD;
+		addr = IP4_DST;
+	} else {
+		t = bpf_fwd ? TCP_IP6 : TCP_IP6_RT_FWD;
+		addr = IP6_DST;
+	}
+
+	dtimes = skel->bss->dtimes[t];
+	errs = skel->bss->errs[t];
+
+	skel->bss->test = t;
+	test_inet_dtime(family, SOCK_STREAM, addr, 0);
+
+	/* fwdns_prio100 prog does not read delivery_time_type, so
+	 * the kernel puts the (rcv) timestamp in __sk_buff->tstamp
+	 */
+	ASSERT_EQ(dtimes[INGRESS_FWDNS_P100], 0,
+		  dtime_cnt_str(t, INGRESS_FWDNS_P100));
+	for (i = INGRESS_FWDNS_P101; i < SET_DTIME; i++)
+		ASSERT_GT(dtimes[i], 0, dtime_cnt_str(t, i));
+
+	for (i = INGRESS_FWDNS_P100; i < __MAX_CNT; i++)
+		ASSERT_EQ(errs[i], 0, dtime_err_str(t, i));
+}
+
+static void test_udp_dtime(struct test_tc_dtime *skel, int family, bool bpf_fwd)
+{
+	__u32 *dtimes, *errs;
+	const char *addr;
+	int i, t;
+
+	if (family == AF_INET) {
+		t = bpf_fwd ? UDP_IP4 : UDP_IP4_RT_FWD;
+		addr = IP4_DST;
+	} else {
+		t = bpf_fwd ? UDP_IP6 : UDP_IP6_RT_FWD;
+		addr = IP6_DST;
+	}
+
+	dtimes = skel->bss->dtimes[t];
+	errs = skel->bss->errs[t];
+
+	skel->bss->test = t;
+	test_inet_dtime(family, SOCK_DGRAM, addr, 0);
+
+	ASSERT_EQ(dtimes[INGRESS_FWDNS_P100], 0,
+		  dtime_cnt_str(t, INGRESS_FWDNS_P100));
+	/* non-mono delivery_time is not forwarded */
+	ASSERT_EQ(dtimes[INGRESS_FWDNS_P101], 0,
+		  dtime_cnt_str(t, INGRESS_FWDNS_P101));
+	for (i = EGRESS_FWDNS_P100; i < SET_DTIME; i++)
+		ASSERT_GT(dtimes[i], 0, dtime_cnt_str(t, i));
+
+	for (i = INGRESS_FWDNS_P100; i < __MAX_CNT; i++)
+		ASSERT_EQ(errs[i], 0, dtime_err_str(t, i));
+}
+
+static void test_tc_redirect_dtime(struct netns_setup_result *setup_result)
+{
+	struct test_tc_dtime *skel;
+	struct nstoken *nstoken;
+	int err;
+
+	skel = test_tc_dtime__open();
+	if (!ASSERT_OK_PTR(skel, "test_tc_dtime__open"))
+		return;
+
+	skel->rodata->IFINDEX_SRC = setup_result->ifindex_veth_src_fwd;
+	skel->rodata->IFINDEX_DST = setup_result->ifindex_veth_dst_fwd;
+
+	err = test_tc_dtime__load(skel);
+	if (!ASSERT_OK(err, "test_tc_dtime__load"))
+		goto done;
+
+	if (netns_load_dtime_bpf(skel))
+		goto done;
+
+	nstoken = open_netns(NS_FWD);
+	if (!ASSERT_OK_PTR(nstoken, "setns fwd"))
+		goto done;
+	err = set_forwarding(false);
+	close_netns(nstoken);
+	if (!ASSERT_OK(err, "disable forwarding"))
+		goto done;
+
+	test_tcp_clear_dtime(skel);
+
+	test_tcp_dtime(skel, AF_INET, true);
+	test_tcp_dtime(skel, AF_INET6, true);
+	test_udp_dtime(skel, AF_INET, true);
+	test_udp_dtime(skel, AF_INET6, true);
+
+	/* Test the kernel ip[6]_forward path instead
+	 * of bpf_redirect_neigh().
+	 */
+	nstoken = open_netns(NS_FWD);
+	if (!ASSERT_OK_PTR(nstoken, "setns fwd"))
+		goto done;
+	err = set_forwarding(true);
+	close_netns(nstoken);
+	if (!ASSERT_OK(err, "enable forwarding"))
+		goto done;
+
+	test_tcp_dtime(skel, AF_INET, false);
+	test_tcp_dtime(skel, AF_INET6, false);
+	test_udp_dtime(skel, AF_INET, false);
+	test_udp_dtime(skel, AF_INET6, false);
+
+done:
+	test_tc_dtime__destroy(skel);
+}
+
 static void test_tc_redirect_neigh_fib(struct netns_setup_result *setup_result)
 {
 	struct nstoken *nstoken = NULL;
@@ -787,6 +1220,7 @@ static void *test_tc_redirect_run_tests(void *arg)
 	RUN_TEST(tc_redirect_peer_l3);
 	RUN_TEST(tc_redirect_neigh);
 	RUN_TEST(tc_redirect_neigh_fib);
+	RUN_TEST(tc_redirect_dtime);
 	return NULL;
 }
 
diff --git a/tools/testing/selftests/bpf/progs/test_tc_dtime.c b/tools/testing/selftests/bpf/progs/test_tc_dtime.c
new file mode 100644
index 000000000000..b52ed4e4696b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tc_dtime.c
@@ -0,0 +1,348 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <linux/bpf.h>
+#include <linux/stddef.h>
+#include <linux/pkt_cls.h>
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#include <sys/socket.h>
+
+/* veth_src --- veth_src_fwd --- veth_dst_fwd --- veth_dst
+ *           |                                 |
+ *  ns_src   |              ns_fwd             |   ns_dst
+ *
+ * ns_src and ns_dst: ENDHOST namespace
+ *            ns_fwd: Forwarding namespace
+ */
+
+#define ctx_ptr(field)		(void *)(long)(field)
+
+#define ip4_src			__bpf_htonl(0xac100164) /* 172.16.1.100 */
+#define ip4_dst			__bpf_htonl(0xac100264) /* 172.16.2.100 */
+
+#define ip6_src			{ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, \
+				  0x00, 0x01, 0xde, 0xad, 0xbe, 0xef, 0xca, 0xfe }
+#define ip6_dst			{ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, \
+				  0x00, 0x02, 0xde, 0xad, 0xbe, 0xef, 0xca, 0xfe }
+
+#define v6_equal(a, b)		(a.s6_addr32[0] == b.s6_addr32[0] && \
+				 a.s6_addr32[1] == b.s6_addr32[1] && \
+				 a.s6_addr32[2] == b.s6_addr32[2] && \
+				 a.s6_addr32[3] == b.s6_addr32[3])
+
+volatile const __u32 IFINDEX_SRC;
+volatile const __u32 IFINDEX_DST;
+
+#define EGRESS_ENDHOST_MAGIC	0x0b9fbeef
+#define INGRESS_FWDNS_MAGIC	0x1b9fbeef
+#define EGRESS_FWDNS_MAGIC	0x2b9fbeef
+
+enum {
+	INGRESS_FWDNS_P100,
+	INGRESS_FWDNS_P101,
+	EGRESS_FWDNS_P100,
+	EGRESS_FWDNS_P101,
+	INGRESS_ENDHOST,
+	EGRESS_ENDHOST,
+	SET_DTIME,
+	__MAX_CNT,
+};
+
+enum {
+	TCP_IP6_CLEAR_DTIME,
+	TCP_IP4,
+	TCP_IP6,
+	UDP_IP4,
+	UDP_IP6,
+	TCP_IP4_RT_FWD,
+	TCP_IP6_RT_FWD,
+	UDP_IP4_RT_FWD,
+	UDP_IP6_RT_FWD,
+	UKN_TEST,
+	__NR_TESTS,
+};
+
+enum {
+	SRC_NS = 1,
+	DST_NS,
+};
+
+__u32 dtimes[__NR_TESTS][__MAX_CNT] = {};
+__u32 errs[__NR_TESTS][__MAX_CNT] = {};
+__u32 test = 0;
+
+static void inc_dtimes(__u32 idx)
+{
+	if (test < __NR_TESTS)
+		dtimes[test][idx]++;
+	else
+		dtimes[UKN_TEST][idx]++;
+}
+
+static void inc_errs(__u32 idx)
+{
+	if (test < __NR_TESTS)
+		errs[test][idx]++;
+	else
+		errs[UKN_TEST][idx]++;
+}
+
+static int skb_proto(int type)
+{
+	return type & 0xff;
+}
+
+static int skb_ns(int type)
+{
+	return (type >> 8) & 0xff;
+}
+
+static bool fwdns_clear_dtime(void)
+{
+	return test == TCP_IP6_CLEAR_DTIME;
+}
+
+static bool bpf_fwd(void)
+{
+	return test < TCP_IP4_RT_FWD;
+}
+
+/* -1: parse error: TC_ACT_SHOT
+ *  0: not testing traffic: TC_ACT_OK
+ * >0: first byte is the inet_proto, second byte has the netns
+ *     of the sender
+ */
+static int skb_get_type(struct __sk_buff *skb)
+{
+	void *data_end = ctx_ptr(skb->data_end);
+	void *data = ctx_ptr(skb->data);
+	__u8 inet_proto = 0, ns = 0;
+	struct ipv6hdr *ip6h;
+	struct iphdr *iph;
+
+	switch (skb->protocol) {
+	case __bpf_htons(ETH_P_IP):
+		iph = data + sizeof(struct ethhdr);
+		if (iph + 1 > data_end)
+			return -1;
+		if (iph->saddr == ip4_src)
+			ns = SRC_NS;
+		else if (iph->saddr == ip4_dst)
+			ns = DST_NS;
+		inet_proto = iph->protocol;
+		break;
+	case __bpf_htons(ETH_P_IPV6):
+		ip6h = data + sizeof(struct ethhdr);
+		if (ip6h + 1 > data_end)
+			return -1;
+		if (v6_equal(ip6h->saddr, (struct in6_addr)ip6_src))
+			ns = SRC_NS;
+		else if (v6_equal(ip6h->saddr, (struct in6_addr)ip6_dst))
+			ns = DST_NS;
+		inet_proto = ip6h->nexthdr;
+		break;
+	default:
+		return 0;
+	}
+
+	if ((inet_proto != IPPROTO_TCP && inet_proto != IPPROTO_UDP) || !ns)
+		return 0;
+
+	return (ns << 8 | inet_proto);
+}
+
+/* format: direction@iface@netns
+ * egress@veth_(src|dst)@ns_(src|dst)
+ */
+SEC("tc")
+int egress_host(struct __sk_buff *skb)
+{
+	int skb_type;
+
+	skb_type = skb_get_type(skb);
+	if (skb_type == -1)
+		return TC_ACT_SHOT;
+	if (!skb_type)
+		return TC_ACT_OK;
+
+	if (skb_proto(skb_type) == IPPROTO_TCP) {
+		if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO &&
+		    skb->tstamp)
+			inc_dtimes(EGRESS_ENDHOST);
+		else
+			inc_errs(EGRESS_ENDHOST);
+	} else {
+		if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_UNSPEC &&
+		    skb->tstamp)
+			inc_dtimes(EGRESS_ENDHOST);
+		else
+			inc_errs(EGRESS_ENDHOST);
+	}
+
+	skb->tstamp = EGRESS_ENDHOST_MAGIC;
+
+	return TC_ACT_OK;
+}
+
+/* ingress@veth_(src|dst)@ns_(src|dst) */
+SEC("tc")
+int ingress_host(struct __sk_buff *skb)
+{
+	int skb_type;
+
+	skb_type = skb_get_type(skb);
+	if (skb_type == -1)
+		return TC_ACT_SHOT;
+	if (!skb_type)
+		return TC_ACT_OK;
+
+	if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO &&
+	    skb->tstamp == EGRESS_FWDNS_MAGIC)
+		inc_dtimes(INGRESS_ENDHOST);
+	else
+		inc_errs(INGRESS_ENDHOST);
+
+	return TC_ACT_OK;
+}
+
+/* ingress@veth_(src|dst)_fwd@ns_fwd priority 100 */
+SEC("tc")
+int ingress_fwdns_prio100(struct __sk_buff *skb)
+{
+	int skb_type;
+
+	skb_type = skb_get_type(skb);
+	if (skb_type == -1)
+		return TC_ACT_SHOT;
+	if (!skb_type)
+		return TC_ACT_OK;
+
+	/* delivery_time is only available at ingress
+	 * if the tc-bpf prog reads skb->delivery_time_type.
+	 */
+	if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
+		inc_errs(INGRESS_FWDNS_P100);
+
+	if (fwdns_clear_dtime())
+		skb->tstamp = 0;
+
+	return TC_ACT_UNSPEC;
+}
+
+/* egress@veth_(src|dst)_fwd@ns_fwd priority 100 */
+SEC("tc")
+int egress_fwdns_prio100(struct __sk_buff *skb)
+{
+	int skb_type;
+
+	skb_type = skb_get_type(skb);
+	if (skb_type == -1)
+		return TC_ACT_SHOT;
+	if (!skb_type)
+		return TC_ACT_OK;
+
+	/* delivery_time is always available at egress even if
+	 * the tc-bpf prog did not read the delivery_time_type.
+	 */
+	if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO &&
+	    skb->tstamp == INGRESS_FWDNS_MAGIC)
+		inc_dtimes(EGRESS_FWDNS_P100);
+	else
+		inc_errs(EGRESS_FWDNS_P100);
+
+	if (fwdns_clear_dtime())
+		skb->tstamp = 0;
+
+	return TC_ACT_UNSPEC;
+}
+
+/* ingress@veth_(src|dst)_fwd@ns_fwd priority 101 */
+SEC("tc")
+int ingress_fwdns_prio101(struct __sk_buff *skb)
+{
+	__u64 expected_dtime = EGRESS_ENDHOST_MAGIC;
+	int skb_type;
+
+	skb_type = skb_get_type(skb);
+	if (skb_type == -1 || !skb_type)
		/* Should have been handled in prio100 */
+		return TC_ACT_SHOT;
+
+	if (skb_proto(skb_type) == IPPROTO_UDP)
+		expected_dtime = 0;
+
+	if (skb->delivery_time_type) {
+		if (fwdns_clear_dtime() ||
+		    skb->delivery_time_type != BPF_SKB_DELIVERY_TIME_MONO ||
+		    skb->tstamp != expected_dtime)
+			inc_errs(INGRESS_FWDNS_P101);
+		else
+			inc_dtimes(INGRESS_FWDNS_P101);
+	} else {
+		if (!fwdns_clear_dtime() && expected_dtime)
+			inc_errs(INGRESS_FWDNS_P101);
+	}
+
+	if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO) {
+		skb->tstamp = INGRESS_FWDNS_MAGIC;
+	} else {
+		if (bpf_skb_set_delivery_time(skb, INGRESS_FWDNS_MAGIC,
+					      BPF_SKB_DELIVERY_TIME_MONO))
+			inc_errs(SET_DTIME);
+		if (!bpf_skb_set_delivery_time(skb, INGRESS_FWDNS_MAGIC,
+					       BPF_SKB_DELIVERY_TIME_UNSPEC))
+			inc_errs(SET_DTIME);
+	}
+
+	if (skb_ns(skb_type) == SRC_NS)
+		return bpf_fwd() ?
+			bpf_redirect_neigh(IFINDEX_DST, NULL, 0, 0) : TC_ACT_OK;
+	else
+		return bpf_fwd() ?
+			bpf_redirect_neigh(IFINDEX_SRC, NULL, 0, 0) : TC_ACT_OK;
+}
+
+/* egress@veth_(src|dst)_fwd@ns_fwd priority 101 */
+SEC("tc")
+int egress_fwdns_prio101(struct __sk_buff *skb)
+{
+	int skb_type;
+
+	skb_type = skb_get_type(skb);
+	if (skb_type == -1 || !skb_type)
		/* Should have been handled in prio100 */
+		return TC_ACT_SHOT;
+
+	if (skb->delivery_time_type) {
+		if (fwdns_clear_dtime() ||
+		    skb->delivery_time_type != BPF_SKB_DELIVERY_TIME_MONO ||
+		    skb->tstamp != INGRESS_FWDNS_MAGIC)
+			inc_errs(EGRESS_FWDNS_P101);
+		else
+			inc_dtimes(EGRESS_FWDNS_P101);
+	} else {
+		if (!fwdns_clear_dtime())
+			inc_errs(EGRESS_FWDNS_P101);
+	}
+
+	if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_MONO) {
+		skb->tstamp = EGRESS_FWDNS_MAGIC;
+	} else {
+		if (bpf_skb_set_delivery_time(skb, EGRESS_FWDNS_MAGIC,
+					      BPF_SKB_DELIVERY_TIME_MONO))
+			inc_errs(SET_DTIME);
+		if (!bpf_skb_set_delivery_time(skb, EGRESS_FWDNS_MAGIC,
+					       BPF_SKB_DELIVERY_TIME_UNSPEC))
+			inc_errs(SET_DTIME);
+	}
+
+	return TC_ACT_OK;
+}
+
+char __license[] SEC("license") = "GPL";
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

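A small editorial note on the selftest above: skb_get_type() encodes the
sender's netns in the second byte and the inet proto in the first, which the
skb_proto()/skb_ns() accessors then unpack. The standalone sketch below mirrors
that encoding; pack_skb_type() is a hypothetical helper added here only to show
the round trip, it does not exist in the selftest.

```c
#include <assert.h>

#define SRC_NS 1
#define DST_NS 2

/* Pack the sender ns into the high byte and the inet proto into the
 * low byte, matching skb_get_type()'s (ns << 8 | inet_proto) return.
 */
static int pack_skb_type(unsigned char ns, unsigned char inet_proto)
{
	return ns << 8 | inet_proto;
}

/* Accessors as in test_tc_dtime.c */
static int skb_proto(int type)
{
	return type & 0xff;
}

static int skb_ns(int type)
{
	return (type >> 8) & 0xff;
}
```

Since 0 means "not testing traffic" and -1 means "parse error", any valid
packed value is strictly positive, which is what the progs rely on.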
* Re: [PATCH v4 net-next 1/8] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
  2022-02-11  7:12 ` [PATCH v4 net-next 1/8] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Martin KaFai Lau
@ 2022-02-12 19:13   ` Cong Wang
  2022-02-12 23:27     ` Martin KaFai Lau
  2022-02-15 20:49   ` Daniel Borkmann
  1 sibling, 1 reply; 19+ messages in thread
From: Cong Wang @ 2022-02-12 19:13 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Willem de Bruijn

On Thu, Feb 10, 2022 at 11:12:38PM -0800, Martin KaFai Lau wrote:
> skb->tstamp was first used as the (rcv) timestamp.
> The major usage is to report it to the user (e.g. SO_TIMESTAMP).
> 
> Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
> during egress and used by the qdisc (e.g. sch_fq) to make decision on when
> the skb can be passed to the dev.
> 
> Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
> or the delivery_time, so it is always reset to 0 whenever forwarded
> between egress and ingress.
> 
> While it makes sense to always clear the (rcv) timestamp in skb->tstamp
> to avoid confusing sch_fq that expects the delivery_time, it is a
> performance issue [0] to clear the delivery_time if the skb finally
> egress to a fq@phy-dev.  For example, when forwarding from egress to
> ingress and then finally back to egress:
> 
>             tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
>                                      ^              ^
>                                      reset          reset
> 
> This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
> is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
> 
> The current use case is to keep the TCP mono delivery_time (EDT) and
> to be used with sch_fq.  A later patch will also allow tc-bpf@ingress
> to read and change the mono delivery_time.

Can you be more specific? How is the fq in the hostns even visible to
the container ns? More importantly, why are the packets always going out
from the container to eth0?

From the sender's point of view, it can't see the hostns and can't even
know whether the packets are routed to eth0 or to other containers on the
same host. So I don't see how this makes sense.

Crossing netns is pretty much like delivering on the wire; *generally
speaking*, if the skb metadata is not preserved on the wire, it probably
should not be preserved when crossing netns either.

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 net-next 1/8] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
  2022-02-12 19:13   ` Cong Wang
@ 2022-02-12 23:27     ` Martin KaFai Lau
  0 siblings, 0 replies; 19+ messages in thread
From: Martin KaFai Lau @ 2022-02-12 23:27 UTC (permalink / raw)
  To: Cong Wang
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Willem de Bruijn

On Sat, Feb 12, 2022 at 11:13:53AM -0800, Cong Wang wrote:
> On Thu, Feb 10, 2022 at 11:12:38PM -0800, Martin KaFai Lau wrote:
> > skb->tstamp was first used as the (rcv) timestamp.
> > The major usage is to report it to the user (e.g. SO_TIMESTAMP).
> > 
> > Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
> > during egress and used by the qdisc (e.g. sch_fq) to make decision on when
> > the skb can be passed to the dev.
> > 
> > Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
> > or the delivery_time, so it is always reset to 0 whenever forwarded
> > between egress and ingress.
> > 
> > While it makes sense to always clear the (rcv) timestamp in skb->tstamp
> > to avoid confusing sch_fq that expects the delivery_time, it is a
> > performance issue [0] to clear the delivery_time if the skb finally
> > egress to a fq@phy-dev.  For example, when forwarding from egress to
> > ingress and then finally back to egress:
> > 
> >             tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
> >                                      ^              ^
> >                                      reset          reset
> > 
> > This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
> > is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
> > 
> > The current use case is to keep the TCP mono delivery_time (EDT) and
> > to be used with sch_fq.  A later patch will also allow tc-bpf@ingress
> > to read and change the mono delivery_time.
> 
> Can you be more specific? How is the fq in the hostns even visible to
> the container ns? More importantly, why are the packets always going out
> from the container to eth0?
> 
> From the sender's point of view, it can't see the hostns and can't even
> know whether the packets are routed to eth0 or to other containers on the
> same host. So I don't see how this makes sense.
The sender does not need to know whether fq is installed anywhere or
how the packet will be routed; that is completely orthogonal.
Today, TCP always sets the EDT without knowing where the packet
will be routed or whether fq (or any other lower-layer code) installed
anywhere in the routing path will be using it.

> Crossing netns is pretty much like delivering on the wire; *generally
> speaking*, if the skb metadata is not preserved on the wire, it probably
> should not be preserved when crossing netns either.
There are many fields in the skb that are not cleared.  In general, a
field is cleared when it needs to be, e.g. skb->sk in the veth case
above, and sk has info that is not even in the tcp/ip packet itself.
The delivery time had to be cleared because there was no way to
distinguish between the rcv timestamp and the delivery time.

^ permalink raw reply	[flat|nested] 19+ messages in thread

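[Editorial note: for context on the point that TCP sets the EDT without
knowing the path, the sketch below illustrates what an earliest-departure-time
stamp conceptually is: the next skb may not leave before the previous one plus
its serialization time at the pacing rate. This is a toy model under stated
assumptions, not the kernel's TCP pacing code, which is considerably more
involved.]

```c
#include <assert.h>

#define NSEC_PER_SEC 1000000000ULL

/* Toy EDT computation: departure time of the next skb, given the
 * previous skb's EDT, the current time, the skb length in bytes and
 * a pacing rate in bytes per second.  Purely illustrative.
 */
static unsigned long long next_edt_ns(unsigned long long now_ns,
				      unsigned long long prev_edt_ns,
				      unsigned int skb_len,
				      unsigned long long rate_bytes_per_sec)
{
	/* Never schedule in the past */
	unsigned long long base = prev_edt_ns > now_ns ? prev_edt_ns : now_ns;

	/* Add the time it takes to serialize skb_len bytes at the rate */
	return base + (unsigned long long)skb_len * NSEC_PER_SEC /
		rate_bytes_per_sec;
}
```

A qdisc like sch_fq then simply holds each skb until its EDT, which is why
preserving the mono delivery_time across a forwarding hop matters.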
* Re: [PATCH v4 net-next 1/8] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
  2022-02-11  7:12 ` [PATCH v4 net-next 1/8] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Martin KaFai Lau
  2022-02-12 19:13   ` Cong Wang
@ 2022-02-15 20:49   ` Daniel Borkmann
  2022-02-16  6:11     ` Martin KaFai Lau
  1 sibling, 1 reply; 19+ messages in thread
From: Daniel Borkmann @ 2022-02-15 20:49 UTC (permalink / raw)
  To: Martin KaFai Lau, bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, David Miller, Eric Dumazet,
	Jakub Kicinski, kernel-team, Willem de Bruijn

On 2/11/22 8:12 AM, Martin KaFai Lau wrote:
[...]
> The current use case is to keep the TCP mono delivery_time (EDT) and
> to be used with sch_fq.  A later patch will also allow tc-bpf@ingress
> to read and change the mono delivery_time.
> 
> In the future, another bit (e.g. skb->user_delivery_time) can be added
[...]
> ---
>   include/linux/skbuff.h                     | 13 +++++++++++++
>   net/bridge/netfilter/nf_conntrack_bridge.c |  5 +++--
>   net/ipv4/ip_output.c                       |  7 +++++--
>   net/ipv4/tcp_output.c                      | 16 +++++++++-------
>   net/ipv6/ip6_output.c                      |  5 +++--
>   net/ipv6/netfilter.c                       |  5 +++--
>   net/ipv6/tcp_ipv6.c                        |  2 +-
>   7 files changed, 37 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index a5adbf6b51e8..32c793de3801 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -747,6 +747,10 @@ typedef unsigned char *sk_buff_data_t;
>    *	@dst_pending_confirm: need to confirm neighbour
>    *	@decrypted: Decrypted SKB
>    *	@slow_gro: state present at GRO time, slower prepare step required
> + *	@mono_delivery_time: When set, skb->tstamp has the
> + *		delivery_time in mono clock base (i.e. EDT).  Otherwise, the
> + *		skb->tstamp has the (rcv) timestamp at ingress and
> + *		delivery_time at egress.
>    *	@napi_id: id of the NAPI struct this skb came from
>    *	@sender_cpu: (aka @napi_id) source CPU in XPS
>    *	@secmark: security marking
> @@ -917,6 +921,7 @@ struct sk_buff {
>   	__u8			decrypted:1;
>   #endif
>   	__u8			slow_gro:1;
> +	__u8			mono_delivery_time:1;
>   

Don't you also need to extend sch_fq to treat any skb without the mono_delivery_time
mark as if it hadn't been marked with a delivery time?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 net-next 3/8] net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()
  2022-02-11  7:12 ` [PATCH v4 net-next 3/8] net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() Martin KaFai Lau
@ 2022-02-15 21:04   ` Daniel Borkmann
  0 siblings, 0 replies; 19+ messages in thread
From: Daniel Borkmann @ 2022-02-15 21:04 UTC (permalink / raw)
  To: Martin KaFai Lau, bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, David Miller, Eric Dumazet,
	Jakub Kicinski, kernel-team, Willem de Bruijn

On 2/11/22 8:12 AM, Martin KaFai Lau wrote:
[...]
> +
> +DECLARE_STATIC_KEY_FALSE(netstamp_needed_key);
> +
> +/* It is used in the ingress path to clear the delivery_time.
> + * If needed, set the skb->tstamp to the (rcv) timestamp.
> + */
> +static inline void skb_clear_delivery_time(struct sk_buff *skb)
> +{
> +	if (unlikely(skb->mono_delivery_time)) {
> +		skb->mono_delivery_time = 0;
> +		if (static_branch_unlikely(&netstamp_needed_key))
> +			skb->tstamp = ktime_get_real();
> +		else
> +			skb->tstamp = 0;
> +	}
>   }
>   
>   static inline void skb_clear_tstamp(struct sk_buff *skb)
> @@ -3946,6 +3961,14 @@ static inline void skb_clear_tstamp(struct sk_buff *skb)
>   	skb->tstamp = 0;
>   }
>   
> +static inline ktime_t skb_tstamp(const struct sk_buff *skb)
> +{
> +	if (unlikely(skb->mono_delivery_time))
> +		return 0;
> +
> +	return skb->tstamp;
> +}
> +
>   static inline u8 skb_metadata_len(const struct sk_buff *skb)
>   {

Just a small nit, but I don't think the conditional on skb->mono_delivery_time
should be marked unlikely() here (and in the other patches as well). For container
workloads this is very likely.
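For context on the nit: unlikely() is just a __builtin_expect() wrapper that steers the compiler's branch layout, so annotating a branch that is hot for container workloads as unlikely pessimizes the common path. A hedged userspace sketch (the macro definitions mirror the kernel's; the function is illustrative, not kernel code):

```c
#include <assert.h>

/* Same shape as the kernel's annotations in include/linux/compiler.h. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Illustrative only: the annotation does not change the result, just
 * which side of the branch the compiler lays out as the fall-through. */
static int has_mono_delivery_time(int mono_delivery_time)
{
	if (unlikely(mono_delivery_time)) /* "very likely" on a container host */
		return 1;
	return 0;
}
```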

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 net-next 5/8] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress
  2022-02-11  7:13 ` [PATCH v4 net-next 5/8] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress Martin KaFai Lau
@ 2022-02-15 23:30   ` Daniel Borkmann
  2022-02-16  5:51     ` Martin KaFai Lau
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Borkmann @ 2022-02-15 23:30 UTC (permalink / raw)
  To: Martin KaFai Lau, bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, David Miller, Eric Dumazet,
	Jakub Kicinski, kernel-team, Willem de Bruijn

On 2/11/22 8:13 AM, Martin KaFai Lau wrote:
> The current tc-bpf@ingress reads and writes the __sk_buff->tstamp
> as a (rcv) timestamp.  This patch is to backward compatible with the
> (rcv) timestamp expectation when the skb->tstamp has a mono delivery_time.
> 
> If needed, the patch first saves the mono delivery_time.  Depending on
> the static key "netstamp_needed_key", it then resets the skb->tstamp to
> either 0 or ktime_get_real() before running the tc-bpf@ingress.  After
> the tc-bpf prog returns, if the (rcv) timestamp in skb->tstamp has not
> been changed, it will restore the earlier saved mono delivery_time.
> 
> The current logic to run tc-bpf@ingress is refactored to a new
> bpf_prog_run_at_ingress() function and shared between cls_bpf and act_bpf.
> The above new delivery_time save/restore logic is also done together in
> this function.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>   include/linux/filter.h | 28 ++++++++++++++++++++++++++++
>   net/sched/act_bpf.c    |  5 +----
>   net/sched/cls_bpf.c    |  6 +-----
>   3 files changed, 30 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index d23e999dc032..e43e1701a80e 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -699,6 +699,34 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
>   	cb->data_end  = skb->data + skb_headlen(skb);
>   }
>   
> +static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog,
> +						   struct sk_buff *skb)
> +{
> +	ktime_t tstamp, saved_mono_dtime = 0;
> +	int filter_res;
> +
> +	if (unlikely(skb->mono_delivery_time)) {
> +		saved_mono_dtime = skb->tstamp;
> +		skb->mono_delivery_time = 0;
> +		if (static_branch_unlikely(&netstamp_needed_key))
> +			skb->tstamp = tstamp = ktime_get_real();
> +		else
> +			skb->tstamp = tstamp = 0;
> +	}
> +
> +	/* It is safe to push/pull even if skb_shared() */
> +	__skb_push(skb, skb->mac_len);
> +	bpf_compute_data_pointers(skb);
> +	filter_res = bpf_prog_run(prog, skb);
> +	__skb_pull(skb, skb->mac_len);
> +
> +	/* __sk_buff->tstamp was not changed, restore the delivery_time */
> +	if (unlikely(saved_mono_dtime) && skb_tstamp(skb) == tstamp)
> +		skb_set_delivery_time(skb, saved_mono_dtime, true);

So the above detour is for skb->tstamp backwards compatibility, so that users will see
real time. I don't see why we special-case {cls,act}_bpf only, given this will also be
the case for other subsystems (e.g. netfilter) when they read the plain skb->tstamp and
get the egress one instead of ktime_get_real() upon deferred skb_clear_delivery_time().

If we would generally ignore it, then the above bpf_prog_run_at_ingress() save/restore
detour is not needed (so patches 5/6 should be dropped). (Meaning, if we need to special-case
{cls,act}_bpf only, we could also have gone for a simpler bpf-only solution..)

> +	return filter_res;
> +}
> +
>   /* Similar to bpf_compute_data_pointers(), except that save orginal
>    * data in cb->data and cb->meta_data for restore.
>    */
> diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
> index a77d8908e737..14c3bd0a5088 100644
> --- a/net/sched/act_bpf.c
> +++ b/net/sched/act_bpf.c
> @@ -45,10 +45,7 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
>   
>   	filter = rcu_dereference(prog->filter);
>   	if (at_ingress) {
> -		__skb_push(skb, skb->mac_len);
> -		bpf_compute_data_pointers(skb);
> -		filter_res = bpf_prog_run(filter, skb);
> -		__skb_pull(skb, skb->mac_len);
> +		filter_res = bpf_prog_run_at_ingress(filter, skb);
>   	} else {
>   		bpf_compute_data_pointers(skb);
>   		filter_res = bpf_prog_run(filter, skb);
> diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
> index df19a847829e..036b2e1f74af 100644
> --- a/net/sched/cls_bpf.c
> +++ b/net/sched/cls_bpf.c
> @@ -93,11 +93,7 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>   		if (tc_skip_sw(prog->gen_flags)) {
>   			filter_res = prog->exts_integrated ? TC_ACT_UNSPEC : 0;
>   		} else if (at_ingress) {
> -			/* It is safe to push/pull even if skb_shared() */
> -			__skb_push(skb, skb->mac_len);
> -			bpf_compute_data_pointers(skb);
> -			filter_res = bpf_prog_run(prog->filter, skb);
> -			__skb_pull(skb, skb->mac_len);
> +			filter_res = bpf_prog_run_at_ingress(prog->filter, skb);
>   		} else {
>   			bpf_compute_data_pointers(skb);
>   			filter_res = bpf_prog_run(prog->filter, skb);
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 net-next 5/8] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress
  2022-02-15 23:30   ` Daniel Borkmann
@ 2022-02-16  5:51     ` Martin KaFai Lau
  2022-02-17  0:03       ` Daniel Borkmann
  0 siblings, 1 reply; 19+ messages in thread
From: Martin KaFai Lau @ 2022-02-16  5:51 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team, Willem de Bruijn

On Wed, Feb 16, 2022 at 12:30:53AM +0100, Daniel Borkmann wrote:
> On 2/11/22 8:13 AM, Martin KaFai Lau wrote:
> > The current tc-bpf@ingress reads and writes the __sk_buff->tstamp
> > as a (rcv) timestamp.  This patch is to backward compatible with the
> > (rcv) timestamp expectation when the skb->tstamp has a mono delivery_time.
> > 
> > If needed, the patch first saves the mono delivery_time.  Depending on
> > the static key "netstamp_needed_key", it then resets the skb->tstamp to
> > either 0 or ktime_get_real() before running the tc-bpf@ingress.  After
> > the tc-bpf prog returns, if the (rcv) timestamp in skb->tstamp has not
> > been changed, it will restore the earlier saved mono delivery_time.
> > 
> > The current logic to run tc-bpf@ingress is refactored to a new
> > bpf_prog_run_at_ingress() function and shared between cls_bpf and act_bpf.
> > The above new delivery_time save/restore logic is also done together in
> > this function.
> > 
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >   include/linux/filter.h | 28 ++++++++++++++++++++++++++++
> >   net/sched/act_bpf.c    |  5 +----
> >   net/sched/cls_bpf.c    |  6 +-----
> >   3 files changed, 30 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index d23e999dc032..e43e1701a80e 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -699,6 +699,34 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
> >   	cb->data_end  = skb->data + skb_headlen(skb);
> >   }
> > +static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog,
> > +						   struct sk_buff *skb)
> > +{
> > +	ktime_t tstamp, saved_mono_dtime = 0;
> > +	int filter_res;
> > +
> > +	if (unlikely(skb->mono_delivery_time)) {
> > +		saved_mono_dtime = skb->tstamp;
> > +		skb->mono_delivery_time = 0;
> > +		if (static_branch_unlikely(&netstamp_needed_key))
> > +			skb->tstamp = tstamp = ktime_get_real();
> > +		else
> > +			skb->tstamp = tstamp = 0;
> > +	}
> > +
> > +	/* It is safe to push/pull even if skb_shared() */
> > +	__skb_push(skb, skb->mac_len);
> > +	bpf_compute_data_pointers(skb);
> > +	filter_res = bpf_prog_run(prog, skb);
> > +	__skb_pull(skb, skb->mac_len);
> > +
> > +	/* __sk_buff->tstamp was not changed, restore the delivery_time */
> > +	if (unlikely(saved_mono_dtime) && skb_tstamp(skb) == tstamp)
> > +		skb_set_delivery_time(skb, saved_mono_dtime, true);
> 
> So above detour is for skb->tstamp backwards compatibility so users will see real time.
> I don't see why we special case {cls,act}_bpf-only, given this will also be the case
> for other subsystems (e.g. netfilter) when they read access plain skb->tstamp and get
> the egress one instead of ktime_get_real() upon deferred skb_clear_delivery_time().
> 
> If we would generally ignore it, then the above bpf_prog_run_at_ingress() save/restore
> detour is not needed (so patch 5/6 should be dropped). (Meaning, if we need to special
> case {cls,act}_bpf only, we could also have gone for simpler bpf-only solution..)
The limitation here is that there is only one skb->tstamp field.  I don't see
how a bpf-only solution (or not) would make a difference here.

Regarding netfilter (good point!), I only see it used in nfnetlink_log.c
and nfnetlink_queue.c.  Like the tapping cases (earlier than the bpf run-point)
and other ingress cases in general, they cannot assume the rcv timestamp is
always there, so they can be changed like af_packet in patch 3,
which is a straightforward change.  I can make the change in v5.

Going back to cls_bpf at ingress.  If the concern is code cleanliness,
how about removing this dance for now while the current rcv tstamp usage
at ingress is unclear?  Meaning, keep the delivery_time (if any) in skb->tstamp.
This dance could be brought back later if breakage and a legit use case
were reported.  A new bpf prog will have to use __sk_buff->delivery_time_type
regardless if it wants to use skb->tstamp as the delivery_time, so it won't
assume the delivery_time is always in skb->tstamp, and it will be fine even
if this dance is brought back later.

I prefer to keep it as-is to avoid unlikely breakage, but I am open to
removing it for now.

Regarding patch 6, it is unrelated.  It needs to clear the
mono_delivery_time bit if the bpf prog writes 0 to skb->tstamp.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 net-next 1/8] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
  2022-02-15 20:49   ` Daniel Borkmann
@ 2022-02-16  6:11     ` Martin KaFai Lau
  0 siblings, 0 replies; 19+ messages in thread
From: Martin KaFai Lau @ 2022-02-16  6:11 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team, Willem de Bruijn

On Tue, Feb 15, 2022 at 09:49:07PM +0100, Daniel Borkmann wrote:
> On 2/11/22 8:12 AM, Martin KaFai Lau wrote:
> [...]
> > The current use case is to keep the TCP mono delivery_time (EDT) and
> > to be used with sch_fq.  A later patch will also allow tc-bpf@ingress
> > to read and change the mono delivery_time.
> > 
> > In the future, another bit (e.g. skb->user_delivery_time) can be added
> [...]
> > ---
> >   include/linux/skbuff.h                     | 13 +++++++++++++
> >   net/bridge/netfilter/nf_conntrack_bridge.c |  5 +++--
> >   net/ipv4/ip_output.c                       |  7 +++++--
> >   net/ipv4/tcp_output.c                      | 16 +++++++++-------
> >   net/ipv6/ip6_output.c                      |  5 +++--
> >   net/ipv6/netfilter.c                       |  5 +++--
> >   net/ipv6/tcp_ipv6.c                        |  2 +-
> >   7 files changed, 37 insertions(+), 16 deletions(-)
> > 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index a5adbf6b51e8..32c793de3801 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -747,6 +747,10 @@ typedef unsigned char *sk_buff_data_t;
> >    *	@dst_pending_confirm: need to confirm neighbour
> >    *	@decrypted: Decrypted SKB
> >    *	@slow_gro: state present at GRO time, slower prepare step required
> > + *	@mono_delivery_time: When set, skb->tstamp has the
> > + *		delivery_time in mono clock base (i.e. EDT).  Otherwise, the
> > + *		skb->tstamp has the (rcv) timestamp at ingress and
> > + *		delivery_time at egress.
> >    *	@napi_id: id of the NAPI struct this skb came from
> >    *	@sender_cpu: (aka @napi_id) source CPU in XPS
> >    *	@secmark: security marking
> > @@ -917,6 +921,7 @@ struct sk_buff {
> >   	__u8			decrypted:1;
> >   #endif
> >   	__u8			slow_gro:1;
> > +	__u8			mono_delivery_time:1;
> 
> Don't you also need to extend sch_fq to treat any non-mono_delivery_time marked
> skb similar as if it hadn't been marked with a delivery time?
The current skb does not have clock base info, so sch_fq does not depend
on it to work either.  fq_packet_beyond_horizon() should already be a good guard
against a clock base difference.  An extra check on the new mono_delivery_time
does not necessarily add value on top.
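The guard referred to here can be sketched roughly as follows (an approximation of the kind of check fq_packet_beyond_horizon() performs, with made-up names and values; not the sch_fq code itself): a tstamp far beyond now + horizon, e.g. a CLOCK_REALTIME value compared against a mono "now", falls outside the horizon and is not scheduled normally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rough userspace model: an skb whose tstamp is beyond the fq horizon
 * (now + horizon, both meant to be in the mono base) is out of range.
 * A realtime-clock tstamp (~54 years of ns since 1970) lands far past
 * any sane mono now + horizon, so it is caught by the same check. */
static bool beyond_horizon(int64_t skb_tstamp, int64_t now_mono, int64_t horizon_ns)
{
	return skb_tstamp > now_mono + horizon_ns;
}
```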

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 net-next 5/8] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress
  2022-02-16  5:51     ` Martin KaFai Lau
@ 2022-02-17  0:03       ` Daniel Borkmann
  2022-02-17  1:50         ` Martin KaFai Lau
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Borkmann @ 2022-02-17  0:03 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team, Willem de Bruijn

On 2/16/22 6:51 AM, Martin KaFai Lau wrote:
> On Wed, Feb 16, 2022 at 12:30:53AM +0100, Daniel Borkmann wrote:
>> On 2/11/22 8:13 AM, Martin KaFai Lau wrote:
>>> The current tc-bpf@ingress reads and writes the __sk_buff->tstamp
>>> as a (rcv) timestamp.  This patch is to backward compatible with the
>>> (rcv) timestamp expectation when the skb->tstamp has a mono delivery_time.
>>>
>>> If needed, the patch first saves the mono delivery_time.  Depending on
>>> the static key "netstamp_needed_key", it then resets the skb->tstamp to
>>> either 0 or ktime_get_real() before running the tc-bpf@ingress.  After
>>> the tc-bpf prog returns, if the (rcv) timestamp in skb->tstamp has not
>>> been changed, it will restore the earlier saved mono delivery_time.
>>>
>>> The current logic to run tc-bpf@ingress is refactored to a new
>>> bpf_prog_run_at_ingress() function and shared between cls_bpf and act_bpf.
>>> The above new delivery_time save/restore logic is also done together in
>>> this function.
>>>
>>> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
>>> ---
>>>    include/linux/filter.h | 28 ++++++++++++++++++++++++++++
>>>    net/sched/act_bpf.c    |  5 +----
>>>    net/sched/cls_bpf.c    |  6 +-----
>>>    3 files changed, 30 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>>> index d23e999dc032..e43e1701a80e 100644
>>> --- a/include/linux/filter.h
>>> +++ b/include/linux/filter.h
>>> @@ -699,6 +699,34 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
>>>    	cb->data_end  = skb->data + skb_headlen(skb);
>>>    }
>>> +static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog,
>>> +						   struct sk_buff *skb)
>>> +{
>>> +	ktime_t tstamp, saved_mono_dtime = 0;
>>> +	int filter_res;
>>> +
>>> +	if (unlikely(skb->mono_delivery_time)) {
>>> +		saved_mono_dtime = skb->tstamp;
>>> +		skb->mono_delivery_time = 0;
>>> +		if (static_branch_unlikely(&netstamp_needed_key))
>>> +			skb->tstamp = tstamp = ktime_get_real();
>>> +		else
>>> +			skb->tstamp = tstamp = 0;
>>> +	}
>>> +
>>> +	/* It is safe to push/pull even if skb_shared() */
>>> +	__skb_push(skb, skb->mac_len);
>>> +	bpf_compute_data_pointers(skb);
>>> +	filter_res = bpf_prog_run(prog, skb);
>>> +	__skb_pull(skb, skb->mac_len);
>>> +
>>> +	/* __sk_buff->tstamp was not changed, restore the delivery_time */
>>> +	if (unlikely(saved_mono_dtime) && skb_tstamp(skb) == tstamp)
>>> +		skb_set_delivery_time(skb, saved_mono_dtime, true);
>>
>> So above detour is for skb->tstamp backwards compatibility so users will see real time.
>> I don't see why we special case {cls,act}_bpf-only, given this will also be the case
>> for other subsystems (e.g. netfilter) when they read access plain skb->tstamp and get
>> the egress one instead of ktime_get_real() upon deferred skb_clear_delivery_time().
>>
>> If we would generally ignore it, then the above bpf_prog_run_at_ingress() save/restore
>> detour is not needed (so patch 5/6 should be dropped). (Meaning, if we need to special
>> case {cls,act}_bpf only, we could also have gone for simpler bpf-only solution..)
> The limitation here is there is only one skb->tstamp field.  I don't see
> a bpf-only solution or not will make a difference here.

A BPF-only solution would probably just treat the skb->tstamp as (semi-)opaque,
meaning there are no further bits on clock type needed in the skb, but given that the
environment is controlled by an orchestrator, it can decide which tstamps to
retain and which to reset (e.g. by looking at skb->sk). (The other approach is
exposing info on the clock base, as done here to some degree for mono/real.)

> Regarding the netfilter (good point!), I only see it is used in nfnetlink_log.c
> and nfnetlink_queue.c.  Like the tapping cases (earlier than the bpf run-point)
> and in general other ingress cases, it cannot assume the rcv timestamp is
> always there, so they can be changed like af_packet in patch 3
> which is a straight forward change.  I can make the change in v5.
> 
> Going back to the cls_bpf at ingress.  If the concern is on code cleanliness,
> how about removing this dance for now while the current rcv tstamp usage is
> unclear at ingress.  Meaning keep the delivery_time (if any) in skb->tstamp.
> This dance could be brought in later when there was breakage and legit usecase
> reported.  The new bpf prog will have to use the __sk_buff->delivery_time_type
> regardless if it wants to use skb->tstamp as the delivery_time, so they won't
> assume delivery_time is always in skb->tstamp and it will be fine even this
> dance would be brought back in later.

Yes, imho, this is still better than the bpf_prog_run_at_ingress() workaround.
Ideally, we know when we call helpers like ktime_get_ns() that the clock will
be mono. We could track that on the verifier side in the register type, and when we
end up writing to skb->tstamp, we could implicitly also set the clock base bits
in the skb for the ctx rewrite, telling that it's of type 'mono'. Same for reading:
we could add a __sk_buff->tstamp_type which the program can access (imo tstamp_type
is more generic than a __sk_buff->delivery_time_type). If someone needs
ktime_get_clocktai_ns() for sch_etf in the future, it could be a similar tracking
mechanism. Also, setting skb->tstamp to 0 ...

> Regarding patch 6, it is unrelated.  It needs to clear the
> mono_delivery_time bit if the bpf writes 0 to the skb->tstamp.

... doesn't need to be done as code after bpf_prog_run(), but should be brought
closer to where we write to the ctx and the verifier generates the relevant insns.
Imo, that's better than having this outside in bpf_prog_run(), where it is then
checked no matter what the program was doing or whether it even accessed tstamp.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 net-next 5/8] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress
  2022-02-17  0:03       ` Daniel Borkmann
@ 2022-02-17  1:50         ` Martin KaFai Lau
  2022-02-17  6:52           ` Martin KaFai Lau
  0 siblings, 1 reply; 19+ messages in thread
From: Martin KaFai Lau @ 2022-02-17  1:50 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team, Willem de Bruijn

On Thu, Feb 17, 2022 at 01:03:21AM +0100, Daniel Borkmann wrote:
> On 2/16/22 6:51 AM, Martin KaFai Lau wrote:
> > On Wed, Feb 16, 2022 at 12:30:53AM +0100, Daniel Borkmann wrote:
> > > On 2/11/22 8:13 AM, Martin KaFai Lau wrote:
> > > > The current tc-bpf@ingress reads and writes the __sk_buff->tstamp
> > > > as a (rcv) timestamp.  This patch is to backward compatible with the
> > > > (rcv) timestamp expectation when the skb->tstamp has a mono delivery_time.
> > > > 
> > > > If needed, the patch first saves the mono delivery_time.  Depending on
> > > > the static key "netstamp_needed_key", it then resets the skb->tstamp to
> > > > either 0 or ktime_get_real() before running the tc-bpf@ingress.  After
> > > > the tc-bpf prog returns, if the (rcv) timestamp in skb->tstamp has not
> > > > been changed, it will restore the earlier saved mono delivery_time.
> > > > 
> > > > The current logic to run tc-bpf@ingress is refactored to a new
> > > > bpf_prog_run_at_ingress() function and shared between cls_bpf and act_bpf.
> > > > The above new delivery_time save/restore logic is also done together in
> > > > this function.
> > > > 
> > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > ---
> > > >    include/linux/filter.h | 28 ++++++++++++++++++++++++++++
> > > >    net/sched/act_bpf.c    |  5 +----
> > > >    net/sched/cls_bpf.c    |  6 +-----
> > > >    3 files changed, 30 insertions(+), 9 deletions(-)
> > > > 
> > > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > > index d23e999dc032..e43e1701a80e 100644
> > > > --- a/include/linux/filter.h
> > > > +++ b/include/linux/filter.h
> > > > @@ -699,6 +699,34 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
> > > >    	cb->data_end  = skb->data + skb_headlen(skb);
> > > >    }
> > > > +static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog,
> > > > +						   struct sk_buff *skb)
> > > > +{
> > > > +	ktime_t tstamp, saved_mono_dtime = 0;
> > > > +	int filter_res;
> > > > +
> > > > +	if (unlikely(skb->mono_delivery_time)) {
> > > > +		saved_mono_dtime = skb->tstamp;
> > > > +		skb->mono_delivery_time = 0;
> > > > +		if (static_branch_unlikely(&netstamp_needed_key))
> > > > +			skb->tstamp = tstamp = ktime_get_real();
> > > > +		else
> > > > +			skb->tstamp = tstamp = 0;
> > > > +	}
> > > > +
> > > > +	/* It is safe to push/pull even if skb_shared() */
> > > > +	__skb_push(skb, skb->mac_len);
> > > > +	bpf_compute_data_pointers(skb);
> > > > +	filter_res = bpf_prog_run(prog, skb);
> > > > +	__skb_pull(skb, skb->mac_len);
> > > > +
> > > > +	/* __sk_buff->tstamp was not changed, restore the delivery_time */
> > > > +	if (unlikely(saved_mono_dtime) && skb_tstamp(skb) == tstamp)
> > > > +		skb_set_delivery_time(skb, saved_mono_dtime, true);
> > > 
> > > So above detour is for skb->tstamp backwards compatibility so users will see real time.
> > > I don't see why we special case {cls,act}_bpf-only, given this will also be the case
> > > for other subsystems (e.g. netfilter) when they read access plain skb->tstamp and get
> > > the egress one instead of ktime_get_real() upon deferred skb_clear_delivery_time().
> > > 
> > > If we would generally ignore it, then the above bpf_prog_run_at_ingress() save/restore
> > > detour is not needed (so patch 5/6 should be dropped). (Meaning, if we need to special
> > > case {cls,act}_bpf only, we could also have gone for simpler bpf-only solution..)
> > The limitation here is there is only one skb->tstamp field.  I don't see
> > a bpf-only solution or not will make a difference here.
> 
> A BPF-only solution would probably just treat the skb->tstamp as (semi-)opaque,
> meaning, there're no further bits on clock type needed in skb, but given the
> environment is controlled by an orchestrator it can decide which tstamps to
> retain or which to reset (e.g. by looking at skb->sk). (The other approach is
> exposing info on clock base as done here to some degree for mono/real.)
hmm... I think we may be talking about different things.

Using a bit or not still does not change the fact that
there is only one skb->tstamp field, which may hold either a delivery
time or a rcv tstamp.  If the delivery time is reset before
forwarding to ingress, or the delivery time was never there, then
the skb will be stamped with the rcv timestamp at ingress.
The bpf prog needs a way to distinguish between them.
skb->sk can at most tell the clock base, and only if skb->tstamp
does indeed have the delivery_time.

> 
> > Regarding the netfilter (good point!), I only see it is used in nfnetlink_log.c
> > and nfnetlink_queue.c.  Like the tapping cases (earlier than the bpf run-point)
> > and in general other ingress cases, it cannot assume the rcv timestamp is
> > always there, so they can be changed like af_packet in patch 3
> > which is a straight forward change.  I can make the change in v5.
> > 
> > Going back to the cls_bpf at ingress.  If the concern is on code cleanliness,
> > how about removing this dance for now while the current rcv tstamp usage is
> > unclear at ingress.  Meaning keep the delivery_time (if any) in skb->tstamp.
> > This dance could be brought in later when there was breakage and legit usecase
> > reported.  The new bpf prog will have to use the __sk_buff->delivery_time_type
> > regardless if it wants to use skb->tstamp as the delivery_time, so they won't
> > assume delivery_time is always in skb->tstamp and it will be fine even this
> > dance would be brought back in later.
> 
> Yes, imho, this is still better than the bpf_prog_run_at_ingress() workaround.
ic. so it is ok to remove the mono dtime save/restore logic here and only bring
it back if legit breakage is reported?

btw, did you look at patch 7, which added __sk_buff->delivery_time_type
and bpf_set_delivery_time()?

> Ideally, we know when we call helpers like ktime_get_ns() that the clock will
> be mono. We could track that on verifier side in the register type, and when we
> end up writing to skb->tstamp, we could implicitly also set the clock base bits
> in skb for the ctx rewrite telling that it's of type 'mono'. Same for reading,
> we could add __sk_buff->tstamp_type which program can access (imo tstamp_type
> is more generic than a __sk_buff->delivery_time_type). If someone needs
> ktime_get_clocktai_ns() for sch_etf in future, it could be similar tracking
> mechanism. Also setting skb->tstamp to 0 ...
hmm... I think this is talking about a way to automatically
update __sk_buff->delivery_time_type (the mono_delivery_time bit) and
also avoid adding the new bpf_set_delivery_time() helper in patch 7?

There may be cases where the time does not come from the ktime_get_ns()
helper and cannot be decided statically, e.g. what if we want to set the
current skb->tstamp based on when the previous skb was sent in a cgroup?
There will be cases coming up that require a runtime decision.
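The kind of runtime case meant here can be sketched like this (a purely illustrative userspace model; the pacing policy and all names are invented): the new tstamp is derived from stored state rather than directly from a clock helper, so its clock base cannot be inferred by the verifier at load time, and the prog would declare it itself, e.g. via the bpf_set_delivery_time()-style helper in patch 7.

```c
#include <assert.h>
#include <stdint.h>

/* Invented example policy: pace skbs of a "cgroup" at least min_gap ns
 * apart, based on the previous skb's EDT kept in map-like state.  The
 * value destined for skb->tstamp is computed, not returned by
 * ktime_get_ns(), so static tracking of a helper's return type would
 * not cover it. */
struct cgroup_state {
	int64_t last_edt; /* EDT handed to the previous skb, ns, mono */
};

static int64_t next_edt(struct cgroup_state *st, int64_t now_mono, int64_t min_gap)
{
	int64_t edt = st->last_edt + min_gap;

	if (edt < now_mono) /* no backlog: send now */
		edt = now_mono;
	st->last_edt = edt;
	return edt;
}
```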

Also, imo, it may be surprising behavior for a user who only
changed __sk_buff->tstamp and then figured out that
__sk_buff->delivery_time_type was also changed in
the background.

Besides, I am not sure whether the compiler will optimize away the 2nd read of
__sk_buff->delivery_time_type.  The bpf prog may need a
READ_ONCE(__sk_buff->delivery_time_type) after writing __sk_buff->tstamp.
We could add volatile to the delivery_time_type in the UAPI, but I think
it is overkill.

It is better to treat tstamp and delivery_time_type separately
for direct access, and have skb_set_delivery_time() change
both of them together.  Also, more checks can be done in
skb_set_delivery_time(), which also has the flexibility to return
errors.

For TCP, it will be already in mono, so skb_set_delivery_time()
is usually not needed.
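The pairing argued for above can be modeled like this (a userspace sketch mirroring the shape of skb_set_delivery_time() from patch 1 of this series; the struct name is invented, and this is not the kernel code): the setter keeps the tstamp and its type bit consistent, e.g. writing 0 also clears the mono bit, which independent direct field writes would not guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the skb fields involved. */
struct skb_model {
	int64_t tstamp;
	bool mono_delivery_time;
};

/* One setter changes both fields together: a zero kt cannot claim to be
 * a mono delivery time, so the bit is cleared along with the value. */
static void set_delivery_time(struct skb_model *skb, int64_t kt, bool mono)
{
	skb->tstamp = kt;
	skb->mono_delivery_time = kt && mono;
}
```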

Regarding the name delivery_time_type vs tstamp_type, I thought about
that and finally picked delivery_time_type because I want
a clear separation from the rcv tstamp for now, until there is more
clarity on how the rcv tstamp is used in tc-bpf.

> > Regarding patch 6, it is unrelated.  It needs to clear the
> > mono_delivery_time bit if the bpf writes 0 to the skb->tstamp.
> 
> ... doesn't need to be done as code after bpf_prog_run(), but should be brought
> closer to when we write to the ctx where verifier generates the relevant insns.
> Imo, that's better than having this outside in bpf_prog_run() which is then
> checked no matter what program was doing or even accessing tstamp.
This would also change the mono_delivery_time bit in the background,
similar to the points above.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 net-next 5/8] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress
  2022-02-17  1:50         ` Martin KaFai Lau
@ 2022-02-17  6:52           ` Martin KaFai Lau
  0 siblings, 0 replies; 19+ messages in thread
From: Martin KaFai Lau @ 2022-02-17  6:52 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team, Willem de Bruijn

On Wed, Feb 16, 2022 at 05:50:43PM -0800, Martin KaFai Lau wrote:
> On Thu, Feb 17, 2022 at 01:03:21AM +0100, Daniel Borkmann wrote:
> > On 2/16/22 6:51 AM, Martin KaFai Lau wrote:
> > > On Wed, Feb 16, 2022 at 12:30:53AM +0100, Daniel Borkmann wrote:
> > > > On 2/11/22 8:13 AM, Martin KaFai Lau wrote:
> > > > > The current tc-bpf@ingress reads and writes the __sk_buff->tstamp
> > > > > as a (rcv) timestamp.  This patch is for backward compatibility with the
> > > > > (rcv) timestamp expectation when the skb->tstamp has a mono delivery_time.
> > > > > 
> > > > > If needed, the patch first saves the mono delivery_time.  Depending on
> > > > > the static key "netstamp_needed_key", it then resets the skb->tstamp to
> > > > > either 0 or ktime_get_real() before running the tc-bpf@ingress.  After
> > > > > the tc-bpf prog returns, if the (rcv) timestamp in skb->tstamp has not
> > > > > been changed, it will restore the earlier saved mono delivery_time.
> > > > > 
> > > > > The current logic to run tc-bpf@ingress is refactored to a new
> > > > > bpf_prog_run_at_ingress() function and shared between cls_bpf and act_bpf.
> > > > > The above new delivery_time save/restore logic is also done together in
> > > > > this function.
> > > > > 
> > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > > ---
> > > > >    include/linux/filter.h | 28 ++++++++++++++++++++++++++++
> > > > >    net/sched/act_bpf.c    |  5 +----
> > > > >    net/sched/cls_bpf.c    |  6 +-----
> > > > >    3 files changed, 30 insertions(+), 9 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > > > index d23e999dc032..e43e1701a80e 100644
> > > > > --- a/include/linux/filter.h
> > > > > +++ b/include/linux/filter.h
> > > > > @@ -699,6 +699,34 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
> > > > >    	cb->data_end  = skb->data + skb_headlen(skb);
> > > > >    }
> > > > > +static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog,
> > > > > +						   struct sk_buff *skb)
> > > > > +{
> > > > > +	ktime_t tstamp, saved_mono_dtime = 0;
> > > > > +	int filter_res;
> > > > > +
> > > > > +	if (unlikely(skb->mono_delivery_time)) {
> > > > > +		saved_mono_dtime = skb->tstamp;
> > > > > +		skb->mono_delivery_time = 0;
> > > > > +		if (static_branch_unlikely(&netstamp_needed_key))
> > > > > +			skb->tstamp = tstamp = ktime_get_real();
> > > > > +		else
> > > > > +			skb->tstamp = tstamp = 0;
> > > > > +	}
> > > > > +
> > > > > +	/* It is safe to push/pull even if skb_shared() */
> > > > > +	__skb_push(skb, skb->mac_len);
> > > > > +	bpf_compute_data_pointers(skb);
> > > > > +	filter_res = bpf_prog_run(prog, skb);
> > > > > +	__skb_pull(skb, skb->mac_len);
> > > > > +
> > > > > +	/* __sk_buff->tstamp was not changed, restore the delivery_time */
> > > > > +	if (unlikely(saved_mono_dtime) && skb_tstamp(skb) == tstamp)
> > > > > +		skb_set_delivery_time(skb, saved_mono_dtime, true);
> > > > 
> > > > So above detour is for skb->tstamp backwards compatibility so users will see real time.
> > > > I don't see why we special case {cls,act}_bpf-only, given this will also be the case
> > > > for other subsystems (e.g. netfilter) when they read access plain skb->tstamp and get
> > > > the egress one instead of ktime_get_real() upon deferred skb_clear_delivery_time().
> > > > 
> > > > If we would generally ignore it, then the above bpf_prog_run_at_ingress() save/restore
> > > > detour is not needed (so patch 5/6 should be dropped). (Meaning, if we need to special
> > > > case {cls,act}_bpf only, we could also have gone for simpler bpf-only solution..)
> > > The limitation here is that there is only one skb->tstamp field.  I don't see
> > > how a bpf-only solution would make a difference here.
> > 
> > A BPF-only solution would probably just treat the skb->tstamp as (semi-)opaque,
> > meaning, there're no further bits on clock type needed in skb, but given the
> > environment is controlled by an orchestrator it can decide which tstamps to
> > retain or which to reset (e.g. by looking at skb->sk). (The other approach is
> > exposing info on clock base as done here to some degree for mono/real.)
> hmm... I think we may be talking about different things.
> 
> Using a bit or not still does not change the fact that
> there is only one skb->tstamp field which may have a delivery
> time or rcv tstamp.  If the delivery time is reset before
> forwarding to ingress or the delivery time was never there, then
> it will be stamped with the rcv timestamp at ingress.
> The bpf needs a way to distinguish between them.
> skb->sk can at most tell the clock base if skb->tstamp
> does indeed have the delivery_time.
> 
> > 
> > > Regarding the netfilter (good point!), I only see it used in nfnetlink_log.c
> > > and nfnetlink_queue.c.  Like the tapping cases (earlier than the bpf run-point)
> > > and in general other ingress cases, it cannot assume the rcv timestamp is
> > > always there, so they can be changed like af_packet in patch 3,
> > > which is a straightforward change.  I can make the change in v5.
> > > 
> > > Going back to the cls_bpf at ingress.  If the concern is on code cleanliness,
> > > how about removing this dance for now while the current rcv tstamp usage is
> > > unclear at ingress.  Meaning keep the delivery_time (if any) in skb->tstamp.
> > > This dance could be brought back later if breakage and a legit usecase are
> > > reported.  The new bpf prog will have to use the __sk_buff->delivery_time_type
> > > anyway if it wants to use skb->tstamp as the delivery_time, so it won't
> > > assume the delivery_time is always in skb->tstamp, and it will be fine even if
> > > this dance is brought back later.
> > 
> > Yes, imho, this is still better than the bpf_prog_run_at_ingress() workaround.
> ic. so it is ok to remove the mono dtime save/restore logic here and only
> bring it back if legit breakage is reported?
Here is another idea on this, which I think is a better middle-ground
solution that removes the dance here.

When reading __sk_buff->tstamp, the verifier can do a rewrite if needed.
The rewrite will depend on whether the __sk_buff->delivery_time_type
has been read or not.

If delivery_time_type is not read, the verifier will rewrite the read to this:
/* BPF_READ: __u64 a = __sk_buff->tstamp; */
if (!skb->tc_at_ingress || !skb->mono_delivery_time)
	a = skb->tstamp;
else
	a = 0;

That will be consistent with the expectation of other kernel ingress
paths (either 0 or rcv tstamp).

If __sk_buff->delivery_time_type is read, no rewrite is needed
and skb->tstamp will be read as is.
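
Modeled in plain C, the effect of the rewrite would be roughly this.  This
is only a sketch of the observable behavior; the actual change would be BPF
instructions emitted by the verifier, and the names here are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the skb state a tc-bpf prog sees at ingress. */
struct skb_model {
	uint64_t tstamp;
	bool mono_delivery_time;
	bool tc_at_ingress;
};

/* Effect of the proposed ctx rewrite on a __sk_buff->tstamp read.
 * @dtype_was_read: whether the prog also read delivery_time_type.
 */
static uint64_t model_read_tstamp(const struct skb_model *skb,
				  bool dtype_was_read)
{
	if (dtype_was_read)
		return skb->tstamp;	/* prog knows what the field holds */
	if (!skb->tc_at_ingress || !skb->mono_delivery_time)
		return skb->tstamp;	/* 0 or rcv tstamp, as expected */
	return 0;			/* hide the mono delivery time */
}
```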

> btw, did you look at patch 7 which added the __sk_buff->delivery_time_type
> and bpf_set_delivery_time()?
> 
> > Ideally, we know when we call helpers like ktime_get_ns() that the clock will
> > be mono. We could track that on verifier side in the register type, and when we
> > end up writing to skb->tstamp, we could implicitly also set the clock base bits
> > in skb for the ctx rewrite telling that it's of type 'mono'. Same for reading,
> > we could add __sk_buff->tstamp_type which program can access (imo tstamp_type
> > is more generic than a __sk_buff->delivery_time_type). If someone needs
> > ktime_get_clocktai_ns() for sch_etf in future, it could be similar tracking
> > mechanism. Also setting skb->tstamp to 0 ...
> hmm... I think it is talking about a way to automatically
> update the __sk_buff->delivery_time_type (mono_delivery_time bit) and
> also avoid adding the new bpf_set_delivery_time() helper in patch 7?
> 
> There may be cases where the time does not come from the ktime_get_ns()
> helper and cannot be decided statically, e.g. what if we want to set the
> current skb->tstamp based on when the previous skb was sent in a cgroup.
> There will be cases coming up that require a runtime decision.
> 
> Also, imo, it may be a surprise behavior for the user who only
> changed __sk_buff->tstamp but then figured out
> __sk_buff->delivery_time_type is also changed in
> the background.
> 
> Besides, I am not sure if the compiler will optimize away the 2nd read on
> __sk_buff->delivery_time_type.  The bpf prog may need a
> READ_ONCE(__sk_buff->delivery_time_type) after writing __sk_buff->tstamp.
> We can add volatile to the delivery_time_type in the UAPI but I think
> it is overkill.
> 
> It is better to treat tstamp and delivery_time_type separately
> for direct access, and have skb_set_delivery_time() change
> both of them together.  Also, more checks can be done in
> skb_set_delivery_time(), which has the flexibility to return
> errors.
> 
> For TCP, the tstamp will already be in mono, so skb_set_delivery_time()
> is usually not needed.
> 
> Regarding the name delivery_time_type vs tstamp_type, I thought about
> that and finally picked the delivery_time_type because I want
> a clear separation from rcv tstamp for now until there is more
> clarity on how rcv tstamp is used in tc-bpf.
> 
> > > Regarding patch 6, it is unrelated.  It needs to clear the
> > > mono_delivery_time bit if the bpf writes 0 to the skb->tstamp.
> > 
> > ... doesn't need to be done as code after bpf_prog_run(), but should be brought
> > closer to when we write to the ctx where verifier generates the relevant insns.
> > Imo, that's better than having this outside in bpf_prog_run() which is then
> > checked no matter what program was doing or even accessing tstamp.
> This will also change the mono_delivery_time bit in the background
> and will be similar to the above points.


end of thread, other threads:[~2022-02-17  6:52 UTC | newest]

Thread overview: 19+ messages
2022-02-11  7:12 [PATCH v4 net-next 0/8] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
2022-02-11  7:12 ` [PATCH v4 net-next 1/8] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Martin KaFai Lau
2022-02-12 19:13   ` Cong Wang
2022-02-12 23:27     ` Martin KaFai Lau
2022-02-15 20:49   ` Daniel Borkmann
2022-02-16  6:11     ` Martin KaFai Lau
2022-02-11  7:12 ` [PATCH v4 net-next 2/8] net: Add skb_clear_tstamp() to keep the mono delivery_time Martin KaFai Lau
2022-02-11  7:12 ` [PATCH v4 net-next 3/8] net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() Martin KaFai Lau
2022-02-15 21:04   ` Daniel Borkmann
2022-02-11  7:12 ` [PATCH v4 net-next 4/8] net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally Martin KaFai Lau
2022-02-11  7:13 ` [PATCH v4 net-next 5/8] bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress Martin KaFai Lau
2022-02-15 23:30   ` Daniel Borkmann
2022-02-16  5:51     ` Martin KaFai Lau
2022-02-17  0:03       ` Daniel Borkmann
2022-02-17  1:50         ` Martin KaFai Lau
2022-02-17  6:52           ` Martin KaFai Lau
2022-02-11  7:13 ` [PATCH v4 net-next 6/8] bpf: Clear skb->mono_delivery_time bit if needed after running tc-bpf@egress Martin KaFai Lau
2022-02-11  7:13 ` [PATCH v4 net-next 7/8] bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_delivery_time() Martin KaFai Lau
2022-02-11  7:13 ` [PATCH v4 net-next 8/8] bpf: selftests: test skb->tstamp in redirect_neigh Martin KaFai Lau
