* [RFC PATCH v3 net-next 0/4] Preserve mono delivery time (EDT) in skb->tstamp
@ 2022-01-21  7:30 Martin KaFai Lau
  2022-01-21  7:30 ` [RFC PATCH v3 net-next 1/4] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Martin KaFai Lau
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Martin KaFai Lau @ 2022-01-21  7:30 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

skb->tstamp was first used as the (rcv) timestamp in the real-time clock base.
The major usage is to report it to the user (e.g. SO_TIMESTAMP).

Later, skb->tstamp was also set as the (future) delivery_time (e.g. EDT in TCP)
during egress and used by the qdisc (e.g. sch_fq) to decide when
the skb can be passed to the dev.

Currently, there is no way to tell whether skb->tstamp has the (rcv) timestamp
or the delivery_time, so it is always reset to 0 whenever the skb is forwarded
between egress and ingress.

While it makes sense to always clear the (rcv) timestamp in skb->tstamp
to avoid confusing sch_fq, which expects the delivery_time, it is a
performance issue [0] to clear the delivery_time if the skb finally
egresses to a fq@phy-dev.

v3:
- Feedback from v2 was that using shinfo(skb)->tx_flags could be racy.
- Considered reusing a few bits in skb->tstamp to represent
  different semantics; aside from more code churn, it would break
  the bpf usecase which currently can write and then read back
  skb->tstamp.
- Went back to the v1 idea of adding a bit to the skb and addressed
  the feedback on v1:
- Added one bit, skb->mono_delivery_time, to flag that
  skb->tstamp has the mono delivery_time (EDT), instead
  of adding a bit to flag whether skb->tstamp has been forwarded or not.
- Instead of resetting the delivery_time back to the (rcv) timestamp
  during the recvmsg syscall, which may be too late and not useful,
  the delivery_time reset in v3 happens earlier, once the stack
  knows that the skb will be delivered locally.
- Handled the tapping@ingress case in af_packet.
- No need to change the (rcv) timestamp to the mono clock base as in v1.
  The added bit skb->mono_delivery_time is enough
  to keep the EDT delivery_time during forwarding.
- Added logic on the bpf side so that existing bpf programs
  running at ingress can still get the (rcv) timestamp
  when reading __sk_buff->tstamp.  A new __sk_buff->mono_delivery_time
  is also added.  A test is still needed for this piece.

Martin KaFai Lau (4):
  net: Add skb->mono_delivery_time to distinguish mono delivery_time
    from (rcv) timestamp
  net: Add skb_clear_tstamp() to keep the mono delivery_time
  net: Set skb->mono_delivery_time and clear it when delivering locally
  bpf: Add __sk_buff->mono_delivery_time and handle __sk_buff->tstamp
    based on tc_at_ingress

 drivers/net/loopback.c                     |   2 +-
 include/linux/filter.h                     |  31 ++++-
 include/linux/skbuff.h                     |  64 ++++++++--
 include/uapi/linux/bpf.h                   |   1 +
 net/bridge/br_forward.c                    |   2 +-
 net/bridge/netfilter/nf_conntrack_bridge.c |   5 +-
 net/core/dev.c                             |   4 +-
 net/core/filter.c                          | 140 +++++++++++++++++++--
 net/core/skbuff.c                          |   8 +-
 net/ipv4/ip_forward.c                      |   2 +-
 net/ipv4/ip_input.c                        |   1 +
 net/ipv4/ip_output.c                       |   5 +-
 net/ipv4/tcp_output.c                      |  16 +--
 net/ipv6/ip6_input.c                       |   1 +
 net/ipv6/ip6_output.c                      |   7 +-
 net/ipv6/netfilter.c                       |   5 +-
 net/netfilter/ipvs/ip_vs_xmit.c            |   6 +-
 net/netfilter/nf_dup_netdev.c              |   2 +-
 net/netfilter/nf_flow_table_ip.c           |   4 +-
 net/netfilter/nft_fwd_netdev.c             |   2 +-
 net/openvswitch/vport.c                    |   2 +-
 net/packet/af_packet.c                     |   4 +-
 net/sched/act_bpf.c                        |   5 +-
 net/sched/cls_bpf.c                        |   6 +-
 net/xfrm/xfrm_interface.c                  |   2 +-
 tools/include/uapi/linux/bpf.h             |   1 +
 26 files changed, 265 insertions(+), 63 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH v3 net-next 1/4] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
  2022-01-21  7:30 [RFC PATCH v3 net-next 0/4] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
@ 2022-01-21  7:30 ` Martin KaFai Lau
  2022-01-22 15:32   ` Willem de Bruijn
  2022-01-21  7:30 ` [RFC PATCH v3 net-next 2/4] net: Add skb_clear_tstamp() to keep the mono delivery_time Martin KaFai Lau
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2022-01-21  7:30 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

skb->tstamp was first used as the (rcv) timestamp in the real-time clock base.
The major usage is to report it to the user (e.g. SO_TIMESTAMP).

Later, skb->tstamp was also set as the (future) delivery_time (e.g. EDT in TCP)
during egress and used by the qdisc (e.g. sch_fq) to decide when
the skb can be passed to the dev.

Currently, there is no way to tell whether skb->tstamp has the (rcv) timestamp
or the delivery_time, so it is always reset to 0 whenever the skb is forwarded
between egress and ingress.

While it makes sense to always clear the (rcv) timestamp in skb->tstamp
to avoid confusing sch_fq, which expects the delivery_time, it is a
performance issue [0] to clear the delivery_time if the skb finally
egresses to a fq@phy-dev.  For example, when forwarding from egress to
ingress and then finally back to egress:

            tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                     ^              ^
                                     reset          reset

[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf

This patch adds one bit skb->mono_delivery_time to flag that skb->tstamp
is storing the mono delivery_time instead of the (rcv) timestamp.

The current use case is to keep the TCP mono delivery_time (EDT) so it
can be used with sch_fq.  A later patch will also allow tc-bpf to read
and change the mono delivery_time.

In the future, another bit (e.g. skb->user_delivery_time) can be added
for SCM_TXTIME, where the clock base is tracked by sk->sk_clockid.

[ This patch is prep work.  The following patch will
  get the other parts of the stack ready first.  Another patch
  after that will finally set skb->mono_delivery_time. ]

A skb_set_delivery_time() helper is added.  It is used by tcp_output.c
and during ip[6] fragmentation to assign the delivery_time to
skb->tstamp and also set skb->mono_delivery_time.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h                     | 13 +++++++++++++
 net/bridge/netfilter/nf_conntrack_bridge.c |  5 +++--
 net/ipv4/ip_output.c                       |  5 +++--
 net/ipv4/tcp_output.c                      | 16 +++++++++-------
 net/ipv6/ip6_output.c                      |  5 +++--
 net/ipv6/netfilter.c                       |  5 +++--
 6 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bf11e1fbd69b..b9e20187242a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -720,6 +720,10 @@ typedef unsigned char *sk_buff_data_t;
  *	@dst_pending_confirm: need to confirm neighbour
  *	@decrypted: Decrypted SKB
  *	@slow_gro: state present at GRO time, slower prepare step required
+ *	@mono_delivery_time: When set, skb->tstamp has the
+ *		delivery_time in mono clock base (i.e. EDT).  Otherwise, the
+ *		skb->tstamp has the (rcv) timestamp at ingress and
+ *		delivery_time at egress.
  *	@napi_id: id of the NAPI struct this skb came from
  *	@sender_cpu: (aka @napi_id) source CPU in XPS
  *	@secmark: security marking
@@ -890,6 +894,7 @@ struct sk_buff {
 	__u8			decrypted:1;
 #endif
 	__u8			slow_gro:1;
+	__u8			mono_delivery_time:1;
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
@@ -3903,6 +3908,14 @@ static inline ktime_t net_invalid_timestamp(void)
 	return 0;
 }
 
+static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
+					 bool mono)
+{
+	skb->tstamp = kt;
+	/* Setting mono_delivery_time will be enabled later */
+	/* skb->mono_delivery_time = kt && mono; */
+}
+
 static inline u8 skb_metadata_len(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->meta_len;
diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c
index fdbed3158555..ebfb2a5c59e4 100644
--- a/net/bridge/netfilter/nf_conntrack_bridge.c
+++ b/net/bridge/netfilter/nf_conntrack_bridge.c
@@ -32,6 +32,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
 					   struct sk_buff *))
 {
 	int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	unsigned int hlen, ll_rs, mtu;
 	ktime_t tstamp = skb->tstamp;
 	struct ip_frag_state state;
@@ -81,7 +82,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
 			if (iter.frag)
 				ip_fraglist_prepare(skb, &iter);
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, data, skb);
 			if (err || !iter.frag)
 				break;
@@ -112,7 +113,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
 			goto blackhole;
 		}
 
-		skb2->tstamp = tstamp;
+		skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
 		err = output(net, sk, data, skb2);
 		if (err)
 			goto blackhole;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 57c1d8431386..6ba08d9bdf8e 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -754,6 +754,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 {
 	struct iphdr *iph;
 	struct sk_buff *skb2;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	struct rtable *rt = skb_rtable(skb);
 	unsigned int mtu, hlen, ll_rs;
 	struct ip_fraglist_iter iter;
@@ -836,7 +837,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 				ip_fraglist_prepare(skb, &iter);
 			}
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, skb);
 
 			if (!err)
@@ -892,7 +893,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		/*
 		 *	Put this fragment into the sending queue.
 		 */
-		skb2->tstamp = tstamp;
+		skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
 		err = output(net, sk, skb2);
 		if (err)
 			goto fail;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5079832af5c1..261e9584f81e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1253,7 +1253,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	tp = tcp_sk(sk);
 	prior_wstamp = tp->tcp_wstamp_ns;
 	tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
-	skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
+	skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true);
 	if (clone_it) {
 		oskb = skb;
 
@@ -1589,7 +1589,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 
 	skb_split(skb, buff, len);
 
-	buff->tstamp = skb->tstamp;
+	skb_set_delivery_time(buff, skb->tstamp, true);
 	tcp_fragment_tstamp(skb, buff);
 
 	old_factor = tcp_skb_pcount(skb);
@@ -2616,7 +2616,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 		if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
 			/* "skb_mstamp_ns" is used as a start point for the retransmit timer */
-			skb->skb_mstamp_ns = tp->tcp_wstamp_ns = tp->tcp_clock_cache;
+			tp->tcp_wstamp_ns = tp->tcp_clock_cache;
+			skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true);
 			list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
 			tcp_init_tso_segs(skb, mss_now);
 			goto repair; /* Skip network transmission */
@@ -3541,11 +3542,12 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	now = tcp_clock_ns();
 #ifdef CONFIG_SYN_COOKIES
 	if (unlikely(synack_type == TCP_SYNACK_COOKIE && ireq->tstamp_ok))
-		skb->skb_mstamp_ns = cookie_init_timestamp(req, now);
+		skb_set_delivery_time(skb, cookie_init_timestamp(req, now),
+				      true);
 	else
 #endif
 	{
-		skb->skb_mstamp_ns = now;
+		skb_set_delivery_time(skb, now, true);
 		if (!tcp_rsk(req)->snt_synack) /* Timestamp first SYNACK */
 			tcp_rsk(req)->snt_synack = tcp_skb_timestamp_us(skb);
 	}
@@ -3594,7 +3596,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb,
 				synack_type, &opts);
 
-	skb->skb_mstamp_ns = now;
+	skb_set_delivery_time(skb, now, true);
 	tcp_add_tx_delay(skb, tp);
 
 	return skb;
@@ -3771,7 +3773,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 
 	err = tcp_transmit_skb(sk, syn_data, 1, sk->sk_allocation);
 
-	syn->skb_mstamp_ns = syn_data->skb_mstamp_ns;
+	skb_set_delivery_time(syn, syn_data->skb_mstamp_ns, true);
 
 	/* Now full SYN+DATA was cloned and sent (or not),
 	 * remove the SYN from the original skb (syn_data)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 2995f8d89e7e..4b873413edc2 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -813,6 +813,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	struct rt6_info *rt = (struct rt6_info *)skb_dst(skb);
 	struct ipv6_pinfo *np = skb->sk && !dev_recursion_level() ?
 				inet6_sk(skb->sk) : NULL;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	struct ip6_frag_state state;
 	unsigned int mtu, hlen, nexthdr_offset;
 	ktime_t tstamp = skb->tstamp;
@@ -903,7 +904,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			if (iter.frag)
 				ip6_fraglist_prepare(skb, &iter);
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, skb);
 			if (!err)
 				IP6_INC_STATS(net, ip6_dst_idev(&rt->dst),
@@ -962,7 +963,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		/*
 		 *	Put this fragment into the sending queue.
 		 */
-		frag->tstamp = tstamp;
+		skb_set_delivery_time(frag, tstamp, mono_delivery_time);
 		err = output(net, sk, frag);
 		if (err)
 			goto fail;
diff --git a/net/ipv6/netfilter.c b/net/ipv6/netfilter.c
index 6ab710b5a1a8..1da332450d98 100644
--- a/net/ipv6/netfilter.c
+++ b/net/ipv6/netfilter.c
@@ -121,6 +121,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 				  struct sk_buff *))
 {
 	int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
+	bool mono_delivery_time = skb->mono_delivery_time;
 	ktime_t tstamp = skb->tstamp;
 	struct ip6_frag_state state;
 	u8 *prevhdr, nexthdr = 0;
@@ -186,7 +187,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			if (iter.frag)
 				ip6_fraglist_prepare(skb, &iter);
 
-			skb->tstamp = tstamp;
+			skb_set_delivery_time(skb, tstamp, mono_delivery_time);
 			err = output(net, sk, data, skb);
 			if (err || !iter.frag)
 				break;
@@ -219,7 +220,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			goto blackhole;
 		}
 
-		skb2->tstamp = tstamp;
+		skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
 		err = output(net, sk, data, skb2);
 		if (err)
 			goto blackhole;
-- 
2.30.2



* [RFC PATCH v3 net-next 2/4] net: Add skb_clear_tstamp() to keep the mono delivery_time
  2022-01-21  7:30 [RFC PATCH v3 net-next 0/4] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
  2022-01-21  7:30 ` [RFC PATCH v3 net-next 1/4] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Martin KaFai Lau
@ 2022-01-21  7:30 ` Martin KaFai Lau
  2022-01-21  7:30 ` [RFC PATCH v3 net-next 3/4] net: Set skb->mono_delivery_time and clear it when delivering locally Martin KaFai Lau
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Martin KaFai Lau @ 2022-01-21  7:30 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.

If skb->tstamp has the mono delivery_time, clearing it can hurt
performance when the skb finally transmits out to a fq@phy-dev.

The earlier patch added a skb->mono_delivery_time bit to
flag that skb->tstamp carries the mono delivery_time.

This patch adds a skb_clear_tstamp() helper which keeps
the mono delivery_time and otherwise clears skb->tstamp.

The delivery_time clearing is postponed until the stack knows the
skb will be delivered locally.  That postponed clearing
will be done in a later patch, when the skb->mono_delivery_time
bit is finally turned on.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 drivers/net/loopback.c           |  2 +-
 include/linux/skbuff.h           | 10 +++++++++-
 net/bridge/br_forward.c          |  2 +-
 net/core/filter.c                |  6 +++---
 net/core/skbuff.c                |  2 +-
 net/ipv4/ip_forward.c            |  2 +-
 net/ipv6/ip6_output.c            |  2 +-
 net/netfilter/ipvs/ip_vs_xmit.c  |  6 +++---
 net/netfilter/nf_dup_netdev.c    |  2 +-
 net/netfilter/nf_flow_table_ip.c |  4 ++--
 net/netfilter/nft_fwd_netdev.c   |  2 +-
 net/openvswitch/vport.c          |  2 +-
 net/xfrm/xfrm_interface.c        |  2 +-
 13 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index ed0edf5884ef..70a38fa09299 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -74,7 +74,7 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
 	skb_tx_timestamp(skb);
 
 	/* do not fool net_timestamp_check() with various clock bases */
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	skb_orphan(skb);
 
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b9e20187242a..8de555513b94 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3916,6 +3916,14 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
 	/* skb->mono_delivery_time = kt && mono; */
 }
 
+static inline void skb_clear_tstamp(struct sk_buff *skb)
+{
+	if (skb->mono_delivery_time)
+		return;
+
+	skb->tstamp = 0;
+}
+
 static inline u8 skb_metadata_len(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->meta_len;
@@ -4772,7 +4780,7 @@ static inline void skb_set_redirected(struct sk_buff *skb, bool from_ingress)
 #ifdef CONFIG_NET_REDIRECT
 	skb->from_ingress = from_ingress;
 	if (skb->from_ingress)
-		skb->tstamp = 0;
+		skb_clear_tstamp(skb);
 #endif
 }
 
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index ec646656dbf1..02bb620d3b8d 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -62,7 +62,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);
 
 int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
 		       net, sk, skb, NULL, skb->dev,
 		       br_dev_queue_push_xmit);
diff --git a/net/core/filter.c b/net/core/filter.c
index 4603b7cd3cd1..4fc53d645a01 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2107,7 +2107,7 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	dev_xmit_recursion_inc();
 	ret = dev_queue_xmit(skb);
@@ -2176,7 +2176,7 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
 		skb = skb_expand_head(skb, hh_len);
@@ -2274,7 +2274,7 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb,
 	}
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
 		skb = skb_expand_head(skb, hh_len);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0118f0afaa4f..3e3da8fdf8f5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5350,7 +5350,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 
 	ipvs_reset(skb);
 	skb->mark = 0;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);
 
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 00ec819f949b..92ba3350274b 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -79,7 +79,7 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
 
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	return dst_output(net, sk, skb);
 }
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 4b873413edc2..0875ab61eb95 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -440,7 +440,7 @@ static inline int ip6_forward_finish(struct net *net, struct sock *sk,
 	}
 #endif
 
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	return dst_output(net, sk, skb);
 }
 
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index d2e5a8f644b8..029171379884 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -610,7 +610,7 @@ static inline int ip_vs_tunnel_xmit_prepare(struct sk_buff *skb,
 		nf_reset_ct(skb);
 		skb_forward_csum(skb);
 		if (skb->dev)
-			skb->tstamp = 0;
+			skb_clear_tstamp(skb);
 	}
 	return ret;
 }
@@ -652,7 +652,7 @@ static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb,
 	if (!local) {
 		skb_forward_csum(skb);
 		if (skb->dev)
-			skb->tstamp = 0;
+			skb_clear_tstamp(skb);
 		NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb,
 			NULL, skb_dst(skb)->dev, dst_output);
 	} else
@@ -674,7 +674,7 @@ static inline int ip_vs_send_or_cont(int pf, struct sk_buff *skb,
 		ip_vs_drop_early_demux_sk(skb);
 		skb_forward_csum(skb);
 		if (skb->dev)
-			skb->tstamp = 0;
+			skb_clear_tstamp(skb);
 		NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb,
 			NULL, skb_dst(skb)->dev, dst_output);
 	} else
diff --git a/net/netfilter/nf_dup_netdev.c b/net/netfilter/nf_dup_netdev.c
index a579e59ee5c5..7873bd1389c3 100644
--- a/net/netfilter/nf_dup_netdev.c
+++ b/net/netfilter/nf_dup_netdev.c
@@ -19,7 +19,7 @@ static void nf_do_netdev_egress(struct sk_buff *skb, struct net_device *dev)
 		skb_push(skb, skb->mac_len);
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	dev_queue_xmit(skb);
 }
 
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 889cf88d3dba..f1d387129f02 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -376,7 +376,7 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 	nf_flow_nat_ip(flow, skb, thoff, dir, iph);
 
 	ip_decrease_ttl(iph);
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	if (flow_table->flags & NF_FLOWTABLE_COUNTER)
 		nf_ct_acct_update(flow->ct, tuplehash->tuple.dir, skb->len);
@@ -611,7 +611,7 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 	nf_flow_nat_ipv6(flow, skb, dir, ip6h);
 
 	ip6h->hop_limit--;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 
 	if (flow_table->flags & NF_FLOWTABLE_COUNTER)
 		nf_ct_acct_update(flow->ct, tuplehash->tuple.dir, skb->len);
diff --git a/net/netfilter/nft_fwd_netdev.c b/net/netfilter/nft_fwd_netdev.c
index fa9301ca6033..4b2b0946c0b6 100644
--- a/net/netfilter/nft_fwd_netdev.c
+++ b/net/netfilter/nft_fwd_netdev.c
@@ -140,7 +140,7 @@ static void nft_fwd_neigh_eval(const struct nft_expr *expr,
 		return;
 
 	skb->dev = dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	neigh_xmit(neigh_table, dev, addr, skb);
 out:
 	regs->verdict.code = verdict;
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index cf2ce5812489..82a74f998966 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -507,7 +507,7 @@ void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto)
 	}
 
 	skb->dev = vport->dev;
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	vport->ops->send(skb);
 	return;
 
diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index 57448fc519fc..4991e99ced9a 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -190,7 +190,7 @@ static void xfrmi_dev_uninit(struct net_device *dev)
 
 static void xfrmi_scrub_packet(struct sk_buff *skb, bool xnet)
 {
-	skb->tstamp = 0;
+	skb_clear_tstamp(skb);
 	skb->pkt_type = PACKET_HOST;
 	skb->skb_iif = 0;
 	skb->ignore_df = 0;
-- 
2.30.2



* [RFC PATCH v3 net-next 3/4] net: Set skb->mono_delivery_time and clear it when delivering locally
  2022-01-21  7:30 [RFC PATCH v3 net-next 0/4] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
  2022-01-21  7:30 ` [RFC PATCH v3 net-next 1/4] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Martin KaFai Lau
  2022-01-21  7:30 ` [RFC PATCH v3 net-next 2/4] net: Add skb_clear_tstamp() to keep the mono delivery_time Martin KaFai Lau
@ 2022-01-21  7:30 ` Martin KaFai Lau
  2022-01-21 12:02   ` Julian Anastasov
  2022-01-21  7:30 ` [RFC PATCH v3 net-next 4/4] bpf: Add __sk_buff->mono_delivery_time and handle __sk_buff->tstamp based on tc_at_ingress Martin KaFai Lau
  2022-01-22 15:43 ` [RFC PATCH v3 net-next 0/4] Preserve mono delivery time (EDT) in skb->tstamp Willem de Bruijn
  4 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2022-01-21  7:30 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

This patch sets skb->mono_delivery_time to flag that skb->tstamp
is used as the mono delivery_time (EDT) instead of the (rcv) timestamp.

skb_clear_delivery_time() is added to clear the delivery_time and set
it back to the (rcv) timestamp if needed when the skb is being delivered
locally (to a sk).  skb_clear_delivery_time() is called in
ip_local_deliver() and ip6_input().  In most of the regular ingress
cases, skb->tstamp should already have the (rcv) timestamp.
For the egress-looped-back-to-ingress cases, the marking of the (rcv)
timestamp is postponed from dev.c to ip_local_deliver() and
ip6_input().

Another case that needs to clear the delivery_time is network
tapping (e.g. af_packet by tcpdump).  Regardless of tapping at ingress
or egress, the tapped skb is received by the af_packet socket, so
it is ingress to the af_packet socket and the (rcv) timestamp
is expected.

When tapping at egress, dev_queue_xmit_nit() is used.  It already
expects that skb->tstamp may have the delivery_time, so it does
skb_clone()+net_timestamp_set() to ensure the cloned skb has
the (rcv) timestamp before passing it to the af_packet sk.
This patch only adds clearing of the skb->mono_delivery_time
bit in net_timestamp_set().

When tapping at ingress, it currently expects skb->tstamp to be either 0
or the (rcv) timestamp.  Meaning, the tapping-at-ingress path
already handles the case where skb->tstamp could be 0 and will get
the (rcv) timestamp via ktime_get_real() when needed.

There are two cases for tapping at ingress:

In one case, af_packet queues the skb to its sk_receive_queue.  The skb
is either not shared or is a new clone.  skb_clear_delivery_time()
is called to clear the delivery_time (if any) before the skb is queued
to the sk_receive_queue.

In the other case, the ingress skb is directly copied to the rx_ring
and tpacket_get_timestamp() is used to get the (rcv) timestamp.
skb_tstamp() is used in tpacket_get_timestamp() to check
the skb->mono_delivery_time bit before returning skb->tstamp.
As mentioned earlier, tapping@ingress already expects that
the skb may not have the (rcv) timestamp (because no sk has asked
for it) and handles this case by calling ktime_get_real() directly.

In __skb_tstamp_tx(), the egress skb is cloned and the clone is queued
to the sk_error_queue.  The outgoing skb may have the mono delivery_time
while the (rcv) timestamp is expected for the clone, so the
skb->mono_delivery_time bit is also cleared on the clone.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/skbuff.h | 27 +++++++++++++++++++++++++--
 net/core/dev.c         |  4 +++-
 net/core/skbuff.c      |  6 ++++--
 net/ipv4/ip_input.c    |  1 +
 net/ipv6/ip6_input.c   |  1 +
 net/packet/af_packet.c |  4 +++-
 6 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8de555513b94..4677bb6c7279 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3912,8 +3912,23 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
 					 bool mono)
 {
 	skb->tstamp = kt;
-	/* Setting mono_delivery_time will be enabled later */
-	/* skb->mono_delivery_time = kt && mono; */
+	skb->mono_delivery_time = kt && mono;
+}
+
+DECLARE_STATIC_KEY_FALSE(netstamp_needed_key);
+
+/* skb is delivering locally.  If needed, set it to the (rcv) timestamp.
+ * Otherwise, clear the delivery time.
+ */
+static inline void skb_clear_delivery_time(struct sk_buff *skb)
+{
+	if (unlikely(skb->mono_delivery_time)) {
+		skb->mono_delivery_time = 0;
+		if (static_branch_unlikely(&netstamp_needed_key))
+			skb->tstamp = ktime_get_real();
+		else
+			skb->tstamp = 0;
+	}
 }
 
 static inline void skb_clear_tstamp(struct sk_buff *skb)
@@ -3924,6 +3939,14 @@ static inline void skb_clear_tstamp(struct sk_buff *skb)
 	skb->tstamp = 0;
 }
 
+static inline ktime_t skb_tstamp(const struct sk_buff *skb)
+{
+	if (unlikely(skb->mono_delivery_time))
+		return 0;
+
+	return skb->tstamp;
+}
+
 static inline u8 skb_metadata_len(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->meta_len;
diff --git a/net/core/dev.c b/net/core/dev.c
index 84a0d9542fe9..b4b392f5ef9f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2000,7 +2000,8 @@ void net_dec_egress_queue(void)
 EXPORT_SYMBOL_GPL(net_dec_egress_queue);
 #endif
 
-static DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
+DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
+EXPORT_SYMBOL(netstamp_needed_key);
 #ifdef CONFIG_JUMP_LABEL
 static atomic_t netstamp_needed_deferred;
 static atomic_t netstamp_wanted;
@@ -2061,6 +2062,7 @@ EXPORT_SYMBOL(net_disable_timestamp);
 static inline void net_timestamp_set(struct sk_buff *skb)
 {
 	skb->tstamp = 0;
+	skb->mono_delivery_time = 0;
 	if (static_branch_unlikely(&netstamp_needed_key))
 		__net_timestamp(skb);
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3e3da8fdf8f5..93dc763da8cb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4817,10 +4817,12 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 		skb_shinfo(skb)->tskey = skb_shinfo(orig_skb)->tskey;
 	}
 
-	if (hwtstamps)
+	if (hwtstamps) {
 		*skb_hwtstamps(skb) = *hwtstamps;
-	else
+	} else {
 		skb->tstamp = ktime_get_real();
+		skb->mono_delivery_time = 0;
+	}
 
 	__skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
 }
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 3a025c011971..35311ca75496 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -244,6 +244,7 @@ int ip_local_deliver(struct sk_buff *skb)
 	 */
 	struct net *net = dev_net(skb->dev);
 
+	skb_clear_delivery_time(skb);
 	if (ip_is_fragment(ip_hdr(skb))) {
 		if (ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER))
 			return 0;
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 80256717868e..84f93864b774 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -469,6 +469,7 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
 
 int ip6_input(struct sk_buff *skb)
 {
+	skb_clear_delivery_time(skb);
 	return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_IN,
 		       dev_net(skb->dev), NULL, skb, skb->dev, NULL,
 		       ip6_input_finish);
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 5bd409ab4cc2..ab55adff3500 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -460,7 +460,7 @@ static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
 		return TP_STATUS_TS_RAW_HARDWARE;
 
 	if ((flags & SOF_TIMESTAMPING_SOFTWARE) &&
-	    ktime_to_timespec64_cond(skb->tstamp, ts))
+	    ktime_to_timespec64_cond(skb_tstamp(skb), ts))
 		return TP_STATUS_TS_SOFTWARE;
 
 	return 0;
@@ -2195,6 +2195,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
 	spin_lock(&sk->sk_receive_queue.lock);
 	po->stats.stats1.tp_packets++;
 	sock_skb_set_dropcount(sk, skb);
+	skb_clear_delivery_time(skb);
 	__skb_queue_tail(&sk->sk_receive_queue, skb);
 	spin_unlock(&sk->sk_receive_queue.lock);
 	sk->sk_data_ready(sk);
@@ -2373,6 +2374,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	po->stats.stats1.tp_packets++;
 	if (copy_skb) {
 		status |= TP_STATUS_COPY;
+		skb_clear_delivery_time(copy_skb);
 		__skb_queue_tail(&sk->sk_receive_queue, copy_skb);
 	}
 	spin_unlock(&sk->sk_receive_queue.lock);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH v3 net-next 4/4] bpf: Add __sk_buff->mono_delivery_time and handle __sk_buff->tstamp based on tc_at_ingress
  2022-01-21  7:30 [RFC PATCH v3 net-next 0/4] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2022-01-21  7:30 ` [RFC PATCH v3 net-next 3/4] net: Set skb->mono_delivery_time and clear it when delivering locally Martin KaFai Lau
@ 2022-01-21  7:30 ` Martin KaFai Lau
  2022-01-21 18:50   ` sdf
  2022-01-22 15:43 ` [RFC PATCH v3 net-next 0/4] Preserve mono delivery time (EDT) in skb->tstamp Willem de Bruijn
  4 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2022-01-21  7:30 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
	Willem de Bruijn

__sk_buff->mono_delivery_time:
This patch adds __sk_buff->mono_delivery_time to
read and write the mono delivery_time in skb->tstamp.

The bpf rewrite is like:
/* BPF_READ: __u64 a = __sk_buff->mono_delivery_time; */
if (skb->mono_delivery_time)
	a = skb->tstamp;
else
	a = 0;

/* BPF_WRITE: __sk_buff->mono_delivery_time = a; */
skb->tstamp = a;
skb->mono_delivery_time = !!a;

__sk_buff->tstamp:
The bpf rewrite is like:
/* BPF_READ: __u64 a = __sk_buff->tstamp; */
if (skb->tc_at_ingress && skb->mono_delivery_time)
	a = 0;
else
	a = skb->tstamp;

/* BPF_WRITE: __sk_buff->tstamp = a; */
skb->tstamp = a;
if (skb->tc_at_ingress || !a)
	skb->mono_delivery_time = 0;

At egress, reading is the same as before: skb->tstamp always has
the delivery_time.  Writing will not change the (kernel)
skb->mono_delivery_time bit unless 0 is written, which clears it.
This is the same behavior as before.

(#) At ingress, the current bpf prog can only expect the
(rcv) timestamp.  Thus, both reading and writing are now treated as
operating on the (rcv) timestamp for the existing bpf prog.

During bpf load time, the verifier will learn if the
bpf prog has accessed the new __sk_buff->mono_delivery_time.

When reading at ingress, if the bpf prog does not access the
new __sk_buff->mono_delivery_time, it will be treated as the
existing behavior mentioned in (#) above.  If the (kernel) skb->tstamp
currently has a delivery_time,  it will temporarily be saved first, and
skb->tstamp will be set to either ktime_get_real() or zero.  After
the bpf prog has finished running, if it did not change
skb->tstamp,  the saved delivery_time will be restored
back to skb->tstamp.

When writing __sk_buff->tstamp at ingress, the
skb->mono_delivery_time will be cleared because of
the (#) mentioned above.

If the bpf prog does access the new __sk_buff->mono_delivery_time
at ingress, it indicates that the bpf prog is aware of this new
kernel support:
the (kernel) skb->tstamp can have the delivery_time or the
(rcv) timestamp at ingress.  If the __sk_buff->mono_delivery_time
is available, the __sk_buff->tstamp will not be available and
it will be zero.

The bpf rewrite needs to access the skb's mono_delivery_time
and tc_at_ingress bits.  They are moved up in sk_buff so
that the bpf rewrite can be done at a fixed offset.  tc_skip_classify
is moved together with tc_at_ingress.  To free one bit for
mono_delivery_time, csum_not_inet (currently used only by sctp)
is moved down.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/filter.h         |  31 +++++++-
 include/linux/skbuff.h         |  20 +++--
 include/uapi/linux/bpf.h       |   1 +
 net/core/filter.c              | 134 ++++++++++++++++++++++++++++++---
 net/sched/act_bpf.c            |   5 +-
 net/sched/cls_bpf.c            |   6 +-
 tools/include/uapi/linux/bpf.h |   1 +
 7 files changed, 171 insertions(+), 27 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 71fa57b88bfc..5cef695d6575 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -572,7 +572,8 @@ struct bpf_prog {
 				has_callchain_buf:1, /* callchain buffer allocated? */
 				enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */
 				call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */
-				call_get_func_ip:1; /* Do we call get_func_ip() */
+				call_get_func_ip:1, /* Do we call get_func_ip() */
+				delivery_time_access:1; /* Accessed __sk_buff->mono_delivery_time */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
@@ -699,6 +700,34 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
 	cb->data_end  = skb->data + skb_headlen(skb);
 }
 
+static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog,
+						   struct sk_buff *skb)
+{
+	ktime_t tstamp, delivery_time = 0;
+	int filter_res;
+
+	if (unlikely(skb->mono_delivery_time) && !prog->delivery_time_access) {
+		delivery_time = skb->tstamp;
+		skb->mono_delivery_time = 0;
+		if (static_branch_unlikely(&netstamp_needed_key))
+			skb->tstamp = tstamp = ktime_get_real();
+		else
+			skb->tstamp = tstamp = 0;
+	}
+
+	/* It is safe to push/pull even if skb_shared() */
+	__skb_push(skb, skb->mac_len);
+	bpf_compute_data_pointers(skb);
+	filter_res = bpf_prog_run(prog, skb);
+	__skb_pull(skb, skb->mac_len);
+
+	/* __sk_buff->tstamp was not changed, restore the delivery_time */
+	if (unlikely(delivery_time) && skb_tstamp(skb) == tstamp)
+		skb_set_delivery_time(skb, delivery_time, true);
+
+	return filter_res;
+}
+
 /* Similar to bpf_compute_data_pointers(), except that save orginal
  * data in cb->data and cb->meta_data for restore.
  */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4677bb6c7279..a14b04b86c13 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -866,22 +866,23 @@ struct sk_buff {
 	__u8			vlan_present:1;	/* See PKT_VLAN_PRESENT_BIT */
 	__u8			csum_complete_sw:1;
 	__u8			csum_level:2;
-	__u8			csum_not_inet:1;
 	__u8			dst_pending_confirm:1;
+	__u8			mono_delivery_time:1;
+
+#ifdef CONFIG_NET_CLS_ACT
+	__u8			tc_skip_classify:1;
+	__u8			tc_at_ingress:1;
+#endif
 #ifdef CONFIG_IPV6_NDISC_NODETYPE
 	__u8			ndisc_nodetype:2;
 #endif
-
+	__u8			csum_not_inet:1;
 	__u8			ipvs_property:1;
 	__u8			inner_protocol_type:1;
 	__u8			remcsum_offload:1;
 #ifdef CONFIG_NET_SWITCHDEV
 	__u8			offload_fwd_mark:1;
 	__u8			offload_l3_fwd_mark:1;
-#endif
-#ifdef CONFIG_NET_CLS_ACT
-	__u8			tc_skip_classify:1;
-	__u8			tc_at_ingress:1;
 #endif
 	__u8			redirected:1;
 #ifdef CONFIG_NET_REDIRECT
@@ -894,7 +895,6 @@ struct sk_buff {
 	__u8			decrypted:1;
 #endif
 	__u8			slow_gro:1;
-	__u8			mono_delivery_time:1;
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
@@ -972,10 +972,16 @@ struct sk_buff {
 /* if you move pkt_vlan_present around you also must adapt these constants */
 #ifdef __BIG_ENDIAN_BITFIELD
 #define PKT_VLAN_PRESENT_BIT	7
+#define TC_AT_INGRESS_SHIFT	0
+#define SKB_MONO_DELIVERY_TIME_SHIFT 2
 #else
 #define PKT_VLAN_PRESENT_BIT	0
+#define TC_AT_INGRESS_SHIFT	7
+#define SKB_MONO_DELIVERY_TIME_SHIFT 5
 #endif
 #define PKT_VLAN_PRESENT_OFFSET	offsetof(struct sk_buff, __pkt_vlan_present_offset)
+#define TC_AT_INGRESS_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
+#define SKB_MONO_DELIVERY_TIME_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
 
 #ifdef __KERNEL__
 /*
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b0383d371b9a..83725c891f3c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5437,6 +5437,7 @@ struct __sk_buff {
 	__u32 gso_size;
 	__u32 :32;		/* Padding, future use. */
 	__u64 hwtstamp;
+	__u64 mono_delivery_time;
 };
 
 struct bpf_tunnel_key {
diff --git a/net/core/filter.c b/net/core/filter.c
index 4fc53d645a01..db17812f0f8c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7832,6 +7832,7 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type
 			return false;
 		break;
 	case bpf_ctx_range(struct __sk_buff, tstamp):
+	case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
 		if (size != sizeof(__u64))
 			return false;
 		break;
@@ -7872,6 +7873,7 @@ static bool sk_filter_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, tstamp):
 	case bpf_ctx_range(struct __sk_buff, wire_len):
 	case bpf_ctx_range(struct __sk_buff, hwtstamp):
+	case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
 		return false;
 	}
 
@@ -7911,6 +7913,7 @@ static bool cg_skb_is_valid_access(int off, int size,
 		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
 			break;
 		case bpf_ctx_range(struct __sk_buff, tstamp):
+		case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
 			if (!bpf_capable())
 				return false;
 			break;
@@ -7943,6 +7946,7 @@ static bool lwt_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, tstamp):
 	case bpf_ctx_range(struct __sk_buff, wire_len):
 	case bpf_ctx_range(struct __sk_buff, hwtstamp):
+	case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
 		return false;
 	}
 
@@ -8169,6 +8173,7 @@ static bool tc_cls_act_is_valid_access(int off, int size,
 		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
 		case bpf_ctx_range(struct __sk_buff, tstamp):
 		case bpf_ctx_range(struct __sk_buff, queue_mapping):
+		case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
 			break;
 		default:
 			return false;
@@ -8445,6 +8450,7 @@ static bool sk_skb_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, tstamp):
 	case bpf_ctx_range(struct __sk_buff, wire_len):
 	case bpf_ctx_range(struct __sk_buff, hwtstamp):
+	case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
 		return false;
 	}
 
@@ -8603,6 +8609,114 @@ static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si,
 	return insn;
 }
 
+static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn *si,
+						struct bpf_insn *insn)
+{
+	__u8 value_reg = si->dst_reg;
+	__u8 skb_reg = si->src_reg;
+	__u8 tmp_reg = BPF_REG_AX;
+
+#ifdef CONFIG_NET_CLS_ACT
+	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
+	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, 1 << TC_AT_INGRESS_SHIFT);
+	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5);
+	/* @ingress, read __sk_buff->tstamp as the (rcv) timestamp,
+	 * so check the skb->mono_delivery_time.
+	 */
+	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
+			      SKB_MONO_DELIVERY_TIME_OFFSET);
+	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
+				1 << SKB_MONO_DELIVERY_TIME_SHIFT);
+	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
+	/* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */
+	*insn++ = BPF_MOV64_IMM(value_reg, 0);
+	*insn++ = BPF_JMP_A(1);
+#endif
+
+	*insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg,
+			      offsetof(struct sk_buff, tstamp));
+	return insn;
+}
+
+static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_insn *si,
+						 struct bpf_insn *insn)
+{
+	__u8 value_reg = si->src_reg;
+	__u8 skb_reg = si->dst_reg;
+	__u8 tmp_reg = BPF_REG_AX;
+
+	/* skb->tstamp = tstamp */
+	*insn++ = BPF_STX_MEM(BPF_DW, skb_reg, value_reg,
+			      offsetof(struct sk_buff, tstamp));
+
+#ifdef CONFIG_NET_CLS_ACT
+	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
+	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, 1 << TC_AT_INGRESS_SHIFT);
+	*insn++ = BPF_JMP32_IMM(BPF_JNE, tmp_reg, 0, 1);
+#endif
+
+	/* test tstamp != 0 */
+	*insn++ = BPF_JMP_IMM(BPF_JNE, value_reg, 0, 3);
+	/* writing __sk_buff->tstamp at ingress or writing 0,
+	 * clear the skb->mono_delivery_time.
+	 */
+	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
+			      SKB_MONO_DELIVERY_TIME_OFFSET);
+	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
+				~(1 << SKB_MONO_DELIVERY_TIME_SHIFT));
+	*insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg,
+			      SKB_MONO_DELIVERY_TIME_OFFSET);
+
+	return insn;
+}
+
+static struct bpf_insn *
+bpf_convert_mono_delivery_time_read(const struct bpf_insn *si,
+				    struct bpf_insn *insn)
+{
+	__u8 value_reg = si->dst_reg;
+	__u8 skb_reg = si->src_reg;
+	__u8 tmp_reg = BPF_REG_AX;
+
+	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
+			      SKB_MONO_DELIVERY_TIME_OFFSET);
+	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
+				1 << SKB_MONO_DELIVERY_TIME_SHIFT);
+	*insn++ = BPF_JMP32_IMM(BPF_JNE, tmp_reg, 0, 2);
+	*insn++ = BPF_MOV64_IMM(value_reg, 0);
+	*insn++ = BPF_JMP_A(1);
+	*insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg,
+			      offsetof(struct sk_buff, tstamp));
+
+	return insn;
+}
+
+static struct bpf_insn *
+bpf_convert_mono_delivery_time_write(const struct bpf_insn *si,
+				     struct bpf_insn *insn)
+{
+	__u8 value_reg = si->src_reg;
+	__u8 skb_reg = si->dst_reg;
+	__u8 tmp_reg = BPF_REG_AX;
+
+	*insn++ = BPF_STX_MEM(BPF_DW, skb_reg, value_reg,
+			      offsetof(struct sk_buff, tstamp));
+	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
+			      SKB_MONO_DELIVERY_TIME_OFFSET);
+	*insn++ = BPF_JMP_IMM(BPF_JNE, value_reg, 0, 2);
+	/* write zero, clear the skb->mono_delivery_time */
+	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
+				~(1 << SKB_MONO_DELIVERY_TIME_SHIFT));
+	*insn++ = BPF_JMP_A(1);
+	/* write non zero, set skb->mono_delivery_time */
+	*insn++ = BPF_ALU32_IMM(BPF_OR, tmp_reg,
+				1 << SKB_MONO_DELIVERY_TIME_SHIFT);
+	*insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg,
+			      SKB_MONO_DELIVERY_TIME_OFFSET);
+
+	return insn;
+}
+
 static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 				  const struct bpf_insn *si,
 				  struct bpf_insn *insn_buf,
@@ -8911,17 +9025,17 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 		BUILD_BUG_ON(sizeof_field(struct sk_buff, tstamp) != 8);
 
 		if (type == BPF_WRITE)
-			*insn++ = BPF_STX_MEM(BPF_DW,
-					      si->dst_reg, si->src_reg,
-					      bpf_target_off(struct sk_buff,
-							     tstamp, 8,
-							     target_size));
+			insn = bpf_convert_tstamp_write(si, insn);
 		else
-			*insn++ = BPF_LDX_MEM(BPF_DW,
-					      si->dst_reg, si->src_reg,
-					      bpf_target_off(struct sk_buff,
-							     tstamp, 8,
-							     target_size));
+			insn = bpf_convert_tstamp_read(si, insn);
+		break;
+
+	case offsetof(struct __sk_buff, mono_delivery_time):
+		if (type == BPF_WRITE)
+			insn = bpf_convert_mono_delivery_time_write(si, insn);
+		else
+			insn = bpf_convert_mono_delivery_time_read(si, insn);
+		prog->delivery_time_access = 1;
 		break;
 
 	case offsetof(struct __sk_buff, gso_segs):
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index a77d8908e737..14c3bd0a5088 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -45,10 +45,7 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act,
 
 	filter = rcu_dereference(prog->filter);
 	if (at_ingress) {
-		__skb_push(skb, skb->mac_len);
-		bpf_compute_data_pointers(skb);
-		filter_res = bpf_prog_run(filter, skb);
-		__skb_pull(skb, skb->mac_len);
+		filter_res = bpf_prog_run_at_ingress(filter, skb);
 	} else {
 		bpf_compute_data_pointers(skb);
 		filter_res = bpf_prog_run(filter, skb);
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index df19a847829e..036b2e1f74af 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -93,11 +93,7 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 		if (tc_skip_sw(prog->gen_flags)) {
 			filter_res = prog->exts_integrated ? TC_ACT_UNSPEC : 0;
 		} else if (at_ingress) {
-			/* It is safe to push/pull even if skb_shared() */
-			__skb_push(skb, skb->mac_len);
-			bpf_compute_data_pointers(skb);
-			filter_res = bpf_prog_run(prog->filter, skb);
-			__skb_pull(skb, skb->mac_len);
+			filter_res = bpf_prog_run_at_ingress(prog->filter, skb);
 		} else {
 			bpf_compute_data_pointers(skb);
 			filter_res = bpf_prog_run(prog->filter, skb);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b0383d371b9a..83725c891f3c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5437,6 +5437,7 @@ struct __sk_buff {
 	__u32 gso_size;
 	__u32 :32;		/* Padding, future use. */
 	__u64 hwtstamp;
+	__u64 mono_delivery_time;
 };
 
 struct bpf_tunnel_key {
-- 
2.30.2



* Re: [RFC PATCH v3 net-next 3/4] net: Set skb->mono_delivery_time and clear it when delivering locally
  2022-01-21  7:30 ` [RFC PATCH v3 net-next 3/4] net: Set skb->mono_delivery_time and clear it when delivering locally Martin KaFai Lau
@ 2022-01-21 12:02   ` Julian Anastasov
  2022-01-22  3:28     ` Martin KaFai Lau
  0 siblings, 1 reply; 15+ messages in thread
From: Julian Anastasov @ 2022-01-21 12:02 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Willem de Bruijn


	Hello,

On Thu, 20 Jan 2022, Martin KaFai Lau wrote:

> This patch sets the skb->mono_delivery_time to flag the skb->tstamp
> is used as the mono delivery_time (EDT) instead of the (rcv) timestamp.
> 
> skb_clear_delivery_time() is added to clear the delivery_time and set
> back to the (rcv) timestamp if needed when the skb is being delivered
> locally (to a sk).  skb_clear_delivery_time() is called in
> ip_local_deliver() and ip6_input().  In most of the regular ingress
> cases, the skb->tstamp should already have the (rcv) timestamp.
> For the egress loop back to ingress cases, the marking of the (rcv)
> timestamp is postponed from dev.c to ip_local_deliver() and
> ip6_input().
> 
> Another case that needs to clear the delivery_time is network
> tapping (e.g. af_packet by tcpdump).  Regardless of tapping at the ingress
> or egress,  the tapped skb is received by the af_packet socket, so
> it is ingress to the af_packet socket and it expects
> the (rcv) timestamp.
> 
> When tapping at egress, dev_queue_xmit_nit() is used.  It already
> expects that skb->tstamp may have the delivery_time,  so it does
> skb_clone()+net_timestamp_set() to ensure the cloned skb has
> the (rcv) timestamp before passing to the af_packet sk.
> This patch only adds to clear the skb->mono_delivery_time
> bit in net_timestamp_set().
> 
> When tapping at ingress, it currently expects that skb->tstamp is either 0
> or has the (rcv) timestamp.  That is, the ingress tapping path
> already expects that skb->tstamp could be 0 and will get
> the (rcv) timestamp by ktime_get_real() when needed.
> 
> There are two cases for tapping at ingress:
> 
> One case is af_packet queues the skb to its sk_receive_queue.  The skb
> is either not shared or new clone created.  The skb_clear_delivery_time()
> is called to clear the delivery_time (if any) before it is queued to the
> sk_receive_queue.
> 
> Another case, the ingress skb is directly copied to the rx_ring
> and tpacket_get_timestamp() is used to get the (rcv) timestamp.
> skb_tstamp() is used in tpacket_get_timestamp() to check
> the skb->mono_delivery_time bit before returning skb->tstamp.
> As mentioned earlier, tapping at ingress has already expected
> the skb may not have the (rcv) timestamp (because no sk has asked
> for it) and has handled this case by directly calling ktime_get_real().
> 
> In __skb_tstamp_tx, it clones the egress skb and queues the clone to the
> sk_error_queue.  The outgoing skb may have the mono delivery_time while
> the (rcv) timestamp is expected for the clone, so the
> skb->mono_delivery_time bit is also cleared from the clone.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/skbuff.h | 27 +++++++++++++++++++++++++--
>  net/core/dev.c         |  4 +++-
>  net/core/skbuff.c      |  6 ++++--
>  net/ipv4/ip_input.c    |  1 +
>  net/ipv6/ip6_input.c   |  1 +
>  net/packet/af_packet.c |  4 +++-
>  6 files changed, 37 insertions(+), 6 deletions(-)
> 

> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index 3a025c011971..35311ca75496 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -244,6 +244,7 @@ int ip_local_deliver(struct sk_buff *skb)
>  	 */
>  	struct net *net = dev_net(skb->dev);
>  
> +	skb_clear_delivery_time(skb);

	Is it safe to move this line into ip_local_deliver_finish ?

>  	if (ip_is_fragment(ip_hdr(skb))) {
>  		if (ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER))
>  			return 0;
> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> index 80256717868e..84f93864b774 100644
> --- a/net/ipv6/ip6_input.c
> +++ b/net/ipv6/ip6_input.c
> @@ -469,6 +469,7 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
>  
>  int ip6_input(struct sk_buff *skb)
>  {
> +	skb_clear_delivery_time(skb);

	Is it safe to move this line into ip6_input_finish?
The problem for both cases is that IPVS hooks at LOCAL_IN and
can decide to forward the packet by returning NF_STOLEN,
avoiding the _finish code. In short, before reaching the
_finish code it is still not decided whether the packet reaches
the sockets.

>  	return NF_HOOK(NFPROTO_IPV6, NF_INET_LOCAL_IN,
>  		       dev_net(skb->dev), NULL, skb, skb->dev, NULL,
>  		       ip6_input_finish);

Regards

--
Julian Anastasov <ja@ssi.bg>



* Re: [RFC PATCH v3 net-next 4/4] bpf: Add __sk_buff->mono_delivery_time and handle __sk_buff->tstamp based on tc_at_ingress
  2022-01-21  7:30 ` [RFC PATCH v3 net-next 4/4] bpf: Add __sk_buff->mono_delivery_time and handle __sk_buff->tstamp based on tc_at_ingress Martin KaFai Lau
@ 2022-01-21 18:50   ` sdf
  2022-01-21 20:56     ` Martin KaFai Lau
  0 siblings, 1 reply; 15+ messages in thread
From: sdf @ 2022-01-21 18:50 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Willem de Bruijn

On 01/20, Martin KaFai Lau wrote:
> __sk_buff->mono_delivery_time:
> This patch adds __sk_buff->mono_delivery_time to
> read and write the mono delivery_time in skb->tstamp.

> The bpf rewrite is like:
> /* BPF_READ: __u64 a = __sk_buff->mono_delivery_time; */
> if (skb->mono_delivery_time)
> 	a = skb->tstamp;
> else
> 	a = 0;

> /* BPF_WRITE: __sk_buff->mono_delivery_time = a; */
> skb->tstamp = a;
> skb->mono_delivery_time = !!a;

> __sk_buff->tstamp:
> The bpf rewrite is like:
> /* BPF_READ: __u64 a = __sk_buff->tstamp; */
> if (skb->tc_at_ingress && skb->mono_delivery_time)
> 	a = 0;
> else
> 	a = skb->tstamp;

> /* BPF_WRITE: __sk_buff->tstamp = a; */
> skb->tstamp = a;
> if (skb->tc_at_ingress || !a)
> 	skb->mono_delivery_time = 0;

> At egress, reading is the same as before.  All skb->tstamp
> is the delivery_time.  Writing will not change the (kernel)
> skb->mono_delivery_time also unless 0 is being written.  This
> will be the same behavior as before.

> (#) At ingress, the current bpf prog can only expect the
> (rcv) timestamp.  Thus, both reading and writing are now treated as
> operating on the (rcv) timestamp for the existing bpf prog.

> During bpf load time, the verifier will learn if the
> bpf prog has accessed the new __sk_buff->mono_delivery_time.

> When reading at ingress, if the bpf prog does not access the
> new __sk_buff->mono_delivery_time, it will be treated as the
> existing behavior as mentioned in (#) above.  If the (kernel) skb->tstamp
> currently has a delivery_time,  it will temporary be saved first and then
> set the skb->tstamp to either the ktime_get_real() or zero.  After
> the bpf prog finished running, if the bpf prog did not change
> the skb->tstamp,  the saved delivery_time will be restored
> back to the skb->tstamp.

> When writing __sk_buff->tstamp at ingress, the
> skb->mono_delivery_time will be cleared because of
> the (#) mentioned above.

> If the bpf prog does access the new __sk_buff->mono_delivery_time
> at ingress, it indicates that the bpf prog is aware of this new
> kernel support:
> the (kernel) skb->tstamp can have the delivery_time or the
> (rcv) timestamp at ingress.  If the __sk_buff->mono_delivery_time
> is available, the __sk_buff->tstamp will not be available and
> it will be zero.

> The bpf rewrite needs to access the skb's mono_delivery_time
> and tc_at_ingress bit.  They are moved up in sk_buff so
> that bpf rewrite can be done at a fixed offset.  tc_skip_classify
> is moved together with tc_at_ingress.  To get one bit for
> mono_delivery_time, csum_not_inet is moved down and this bit
> is currently used by sctp.

> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>   include/linux/filter.h         |  31 +++++++-
>   include/linux/skbuff.h         |  20 +++--
>   include/uapi/linux/bpf.h       |   1 +
>   net/core/filter.c              | 134 ++++++++++++++++++++++++++++++---
>   net/sched/act_bpf.c            |   5 +-
>   net/sched/cls_bpf.c            |   6 +-
>   tools/include/uapi/linux/bpf.h |   1 +
>   7 files changed, 171 insertions(+), 27 deletions(-)

> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 71fa57b88bfc..5cef695d6575 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -572,7 +572,8 @@ struct bpf_prog {
>   				has_callchain_buf:1, /* callchain buffer allocated? */
>   				enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */
>   				call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */
> -				call_get_func_ip:1; /* Do we call get_func_ip() */
> +				call_get_func_ip:1, /* Do we call get_func_ip() */
> +				delivery_time_access:1; /* Accessed __sk_buff->mono_delivery_time */
>   	enum bpf_prog_type	type;		/* Type of BPF program */
>   	enum bpf_attach_type	expected_attach_type; /* For some prog types */
>   	u32			len;		/* Number of filter blocks */
> @@ -699,6 +700,34 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
>   	cb->data_end  = skb->data + skb_headlen(skb);
>   }

> +static __always_inline u32 bpf_prog_run_at_ingress(const struct bpf_prog *prog,
> +						   struct sk_buff *skb)
> +{
> +	ktime_t tstamp, delivery_time = 0;
> +	int filter_res;
> +
> +	if (unlikely(skb->mono_delivery_time) && !prog->delivery_time_access) {
> +		delivery_time = skb->tstamp;
> +		skb->mono_delivery_time = 0;
> +		if (static_branch_unlikely(&netstamp_needed_key))
> +			skb->tstamp = tstamp = ktime_get_real();
> +		else
> +			skb->tstamp = tstamp = 0;
> +	}
> +
> +	/* It is safe to push/pull even if skb_shared() */
> +	__skb_push(skb, skb->mac_len);
> +	bpf_compute_data_pointers(skb);
> +	filter_res = bpf_prog_run(prog, skb);
> +	__skb_pull(skb, skb->mac_len);
> +
> +	/* __sk_buff->tstamp was not changed, restore the delivery_time */
> +	if (unlikely(delivery_time) && skb_tstamp(skb) == tstamp)
> +		skb_set_delivery_time(skb, delivery_time, true);
> +
> +	return filter_res;
> +}
> +
>   /* Similar to bpf_compute_data_pointers(), except that save orginal
>    * data in cb->data and cb->meta_data for restore.
>    */
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 4677bb6c7279..a14b04b86c13 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -866,22 +866,23 @@ struct sk_buff {
>   	__u8			vlan_present:1;	/* See PKT_VLAN_PRESENT_BIT */
>   	__u8			csum_complete_sw:1;
>   	__u8			csum_level:2;
> -	__u8			csum_not_inet:1;
>   	__u8			dst_pending_confirm:1;
> +	__u8			mono_delivery_time:1;
> +
> +#ifdef CONFIG_NET_CLS_ACT
> +	__u8			tc_skip_classify:1;
> +	__u8			tc_at_ingress:1;
> +#endif
>   #ifdef CONFIG_IPV6_NDISC_NODETYPE
>   	__u8			ndisc_nodetype:2;
>   #endif
> -
> +	__u8			csum_not_inet:1;
>   	__u8			ipvs_property:1;
>   	__u8			inner_protocol_type:1;
>   	__u8			remcsum_offload:1;
>   #ifdef CONFIG_NET_SWITCHDEV
>   	__u8			offload_fwd_mark:1;
>   	__u8			offload_l3_fwd_mark:1;
> -#endif
> -#ifdef CONFIG_NET_CLS_ACT
> -	__u8			tc_skip_classify:1;
> -	__u8			tc_at_ingress:1;
>   #endif
>   	__u8			redirected:1;
>   #ifdef CONFIG_NET_REDIRECT
> @@ -894,7 +895,6 @@ struct sk_buff {
>   	__u8			decrypted:1;
>   #endif
>   	__u8			slow_gro:1;
> -	__u8			mono_delivery_time:1;

>   #ifdef CONFIG_NET_SCHED
>   	__u16			tc_index;	/* traffic control index */
> @@ -972,10 +972,16 @@ struct sk_buff {
>   /* if you move pkt_vlan_present around you also must adapt these constants */
>   #ifdef __BIG_ENDIAN_BITFIELD
>   #define PKT_VLAN_PRESENT_BIT	7
> +#define TC_AT_INGRESS_SHIFT	0
> +#define SKB_MONO_DELIVERY_TIME_SHIFT 2
>   #else
>   #define PKT_VLAN_PRESENT_BIT	0
> +#define TC_AT_INGRESS_SHIFT	7
> +#define SKB_MONO_DELIVERY_TIME_SHIFT 5
>   #endif
>   #define PKT_VLAN_PRESENT_OFFSET	offsetof(struct sk_buff, __pkt_vlan_present_offset)
> +#define TC_AT_INGRESS_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
> +#define SKB_MONO_DELIVERY_TIME_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)

>   #ifdef __KERNEL__
>   /*
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b0383d371b9a..83725c891f3c 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5437,6 +5437,7 @@ struct __sk_buff {
>   	__u32 gso_size;
>   	__u32 :32;		/* Padding, future use. */
>   	__u64 hwtstamp;
> +	__u64 mono_delivery_time;
>   };

>   struct bpf_tunnel_key {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 4fc53d645a01..db17812f0f8c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -7832,6 +7832,7 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type
>   			return false;
>   		break;
>   	case bpf_ctx_range(struct __sk_buff, tstamp):
> +	case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
>   		if (size != sizeof(__u64))
>   			return false;
>   		break;
> @@ -7872,6 +7873,7 @@ static bool sk_filter_is_valid_access(int off, int size,
>   	case bpf_ctx_range(struct __sk_buff, tstamp):
>   	case bpf_ctx_range(struct __sk_buff, wire_len):
>   	case bpf_ctx_range(struct __sk_buff, hwtstamp):
> +	case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
>   		return false;
>   	}

> @@ -7911,6 +7913,7 @@ static bool cg_skb_is_valid_access(int off, int  
> size,
>   		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
>   			break;
>   		case bpf_ctx_range(struct __sk_buff, tstamp):
> +		case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
>   			if (!bpf_capable())
>   				return false;
>   			break;
> @@ -7943,6 +7946,7 @@ static bool lwt_is_valid_access(int off, int size,
>   	case bpf_ctx_range(struct __sk_buff, tstamp):
>   	case bpf_ctx_range(struct __sk_buff, wire_len):
>   	case bpf_ctx_range(struct __sk_buff, hwtstamp):
> +	case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
>   		return false;
>   	}

> @@ -8169,6 +8173,7 @@ static bool tc_cls_act_is_valid_access(int off, int  
> size,
>   		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
>   		case bpf_ctx_range(struct __sk_buff, tstamp):
>   		case bpf_ctx_range(struct __sk_buff, queue_mapping):
> +		case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
>   			break;
>   		default:
>   			return false;
> @@ -8445,6 +8450,7 @@ static bool sk_skb_is_valid_access(int off, int  
> size,
>   	case bpf_ctx_range(struct __sk_buff, tstamp):
>   	case bpf_ctx_range(struct __sk_buff, wire_len):
>   	case bpf_ctx_range(struct __sk_buff, hwtstamp):
> +	case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
>   		return false;
>   	}

> @@ -8603,6 +8609,114 @@ static struct bpf_insn  
> *bpf_convert_shinfo_access(const struct bpf_insn *si,
>   	return insn;
>   }


[..]

> +static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn  
> *si,
> +						struct bpf_insn *insn)
> +{
> +	__u8 value_reg = si->dst_reg;
> +	__u8 skb_reg = si->src_reg;
> +	__u8 tmp_reg = BPF_REG_AX;
> +
> +#ifdef CONFIG_NET_CLS_ACT
> +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
> +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, 1 << TC_AT_INGRESS_SHIFT);
> +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5);
> +	/* @ingress, read __sk_buff->tstamp as the (rcv) timestamp,
> +	 * so check the skb->mono_delivery_time.
> +	 */
> +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
> +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
> +				1 << SKB_MONO_DELIVERY_TIME_SHIFT);
> +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
> +	/* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */
> +	*insn++ = BPF_MOV64_IMM(value_reg, 0);
> +	*insn++ = BPF_JMP_A(1);
> +#endif
> +
> +	*insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg,
> +			      offsetof(struct sk_buff, tstamp));
> +	return insn;
> +}
> +
> +static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_insn  
> *si,
> +						 struct bpf_insn *insn)
> +{
> +	__u8 value_reg = si->src_reg;
> +	__u8 skb_reg = si->dst_reg;
> +	__u8 tmp_reg = BPF_REG_AX;
> +
> +	/* skb->tstamp = tstamp */
> +	*insn++ = BPF_STX_MEM(BPF_DW, skb_reg, value_reg,
> +			      offsetof(struct sk_buff, tstamp));
> +
> +#ifdef CONFIG_NET_CLS_ACT
> +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
> +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, 1 << TC_AT_INGRESS_SHIFT);
> +	*insn++ = BPF_JMP32_IMM(BPF_JNE, tmp_reg, 0, 1);
> +#endif
> +
> +	/* test tstamp != 0 */
> +	*insn++ = BPF_JMP_IMM(BPF_JNE, value_reg, 0, 3);
> +	/* writing __sk_buff->tstamp at ingress or writing 0,
> +	 * clear the skb->mono_delivery_time.
> +	 */
> +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
> +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
> +				~(1 << SKB_MONO_DELIVERY_TIME_SHIFT));
> +	*insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg,
> +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> +
> +	return insn;
> +}

I wonder if we'll see a regression from this. We read/write tstamp
frequently and I'm not sure we care about the forwarding case.

As future work/follow-up, do you think we can support cases like
bpf_prog_load(prog_type=SCHED_CLS expected_attach_type=TC_EGRESS) where
we can generate bytecode with only BPF_LDX_MEM/BPF_STX_MEM for skb->tstamp?
(essentially the bytecode as it was prior to your patch series)

Since we know that specific program will run only at egress,
I'm assuming we can generate simpler bytecode? (of course it needs more
work on cls_bpf to enforce this new expected_attach_type constraint)
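
For readers following along, the read/write rewrite above can be modeled
in plain C (a userspace sketch only; struct skb_model and the model_*
function names are made-up stand-ins for the relevant sk_buff bits, not
kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the bpf_convert_tstamp_read/write() semantics. */
struct skb_model {
	uint64_t tstamp;
	bool mono_delivery_time;
	bool tc_at_ingress;
};

/* BPF_READ: __u64 a = __sk_buff->tstamp; */
static uint64_t model_tstamp_read(const struct skb_model *skb)
{
	/* At ingress, a mono delivery_time is hidden and reads as 0. */
	if (skb->tc_at_ingress && skb->mono_delivery_time)
		return 0;
	return skb->tstamp;
}

/* BPF_WRITE: __sk_buff->tstamp = a; */
static void model_tstamp_write(struct skb_model *skb, uint64_t a)
{
	skb->tstamp = a;
	/* Writing at ingress, or writing 0, clears mono_delivery_time. */
	if (skb->tc_at_ingress || !a)
		skb->mono_delivery_time = false;
}
```

At egress, a non-zero write takes the fall-through path and leaves
skb->mono_delivery_time untouched.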

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v3 net-next 4/4] bpf: Add __sk_buff->mono_delivery_time and handle __sk_buff->tstamp based on tc_at_ingress
  2022-01-21 18:50   ` sdf
@ 2022-01-21 20:56     ` Martin KaFai Lau
  2022-01-21 22:33       ` sdf
  0 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2022-01-21 20:56 UTC (permalink / raw)
  To: sdf
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Willem de Bruijn

On Fri, Jan 21, 2022 at 10:50:20AM -0800, sdf@google.com wrote:
> On 01/20, Martin KaFai Lau wrote:
> > __sk_buff->mono_delivery_time:
> > This patch adds __sk_buff->mono_delivery_time to
> > read and write the mono delivery_time in skb->tstamp.
> 
> > The bpf rewrite is like:
> > /* BPF_READ: __u64 a = __sk_buff->mono_delivery_time; */
> > if (skb->mono_delivery_time)
> > 	a = skb->tstamp;
> > else
> > 	a = 0;
> 
> > /* BPF_WRITE: __sk_buff->mono_delivery_time = a; */
> > skb->tstamp = a;
> > skb->mono_delivery_time = !!a;
> 
> > __sk_buff->tstamp:
> > The bpf rewrite is like:
> > /* BPF_READ: __u64 a = __sk_buff->tstamp; */
> > if (skb->tc_at_ingress && skb->mono_delivery_time)
> > 	a = 0;
> > else
> > 	a = skb->tstamp;
> 
> > /* BPF_WRITE: __sk_buff->tstamp = a; */
> > skb->tstamp = a;
> > if (skb->tc_at_ingress || !a)
> > 	skb->mono_delivery_time = 0;
> 
> > At egress, reading is the same as before.  All skb->tstamp
> > is the delivery_time.  Writing will not change the (kernel)
> > skb->mono_delivery_time also unless 0 is being written.  This
> > will be the same behavior as before.
> 
> > (#) At ingress, the current bpf prog can only expect the
> > (rcv) timestamp.  Thus, both reading and writing are now treated as
> > operating on the (rcv) timestamp for the existing bpf prog.
> 
> > During bpf load time, the verifier will learn if the
> > bpf prog has accessed the new __sk_buff->mono_delivery_time.
> 
> > When reading at ingress, if the bpf prog does not access the
> > new __sk_buff->mono_delivery_time, it will be treated as the
> > existing behavior as mentioned in (#) above.  If the (kernel) skb->tstamp
> > currently has a delivery_time,  it will temporary be saved first and then
> > set the skb->tstamp to either the ktime_get_real() or zero.  After
> > the bpf prog finished running, if the bpf prog did not change
> > the skb->tstamp,  the saved delivery_time will be restored
> > back to the skb->tstamp.
> 
> > When writing __sk_buff->tstamp at ingress, the
> > skb->mono_delivery_time will be cleared because of
> > the (#) mentioned above.
> 
> > If the bpf prog does access the new __sk_buff->mono_delivery_time
> > at ingress, it indicates that the bpf prog is aware of this new
> > kernel support:
> > the (kernel) skb->tstamp can have the delivery_time or the
> > (rcv) timestamp at ingress.  If the __sk_buff->mono_delivery_time
> > is available, the __sk_buff->tstamp will not be available and
> > it will be zero.
> 
> > The bpf rewrite needs to access the skb's mono_delivery_time
> > and tc_at_ingress bit.  They are moved up in sk_buff so
> > that bpf rewrite can be done at a fixed offset.  tc_skip_classify
> > is moved together with tc_at_ingress.  To get one bit for
> > mono_delivery_time, csum_not_inet is moved down and this bit
> > is currently used by sctp.
> 
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >   include/linux/filter.h         |  31 +++++++-
> >   include/linux/skbuff.h         |  20 +++--
> >   include/uapi/linux/bpf.h       |   1 +
> >   net/core/filter.c              | 134 ++++++++++++++++++++++++++++++---
> >   net/sched/act_bpf.c            |   5 +-
> >   net/sched/cls_bpf.c            |   6 +-
> >   tools/include/uapi/linux/bpf.h |   1 +
> >   7 files changed, 171 insertions(+), 27 deletions(-)
> 
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index 71fa57b88bfc..5cef695d6575 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -572,7 +572,8 @@ struct bpf_prog {
> >   				has_callchain_buf:1, /* callchain buffer allocated? */
> >   				enforce_expected_attach_type:1, /* Enforce expected_attach_type
> > checking at attach time */
> >   				call_get_stack:1, /* Do we call bpf_get_stack() or
> > bpf_get_stackid() */
> > -				call_get_func_ip:1; /* Do we call get_func_ip() */
> > +				call_get_func_ip:1, /* Do we call get_func_ip() */
> > +				delivery_time_access:1; /* Accessed __sk_buff->mono_delivery_time */
> >   	enum bpf_prog_type	type;		/* Type of BPF program */
> >   	enum bpf_attach_type	expected_attach_type; /* For some prog types */
> >   	u32			len;		/* Number of filter blocks */
> > @@ -699,6 +700,34 @@ static inline void bpf_compute_data_pointers(struct
> > sk_buff *skb)
> >   	cb->data_end  = skb->data + skb_headlen(skb);
> >   }
> 
> > +static __always_inline u32 bpf_prog_run_at_ingress(const struct
> > bpf_prog *prog,
> > +						   struct sk_buff *skb)
> > +{
> > +	ktime_t tstamp, delivery_time = 0;
> > +	int filter_res;
> > +
> > +	if (unlikely(skb->mono_delivery_time) && !prog->delivery_time_access) {
> > +		delivery_time = skb->tstamp;
> > +		skb->mono_delivery_time = 0;
> > +		if (static_branch_unlikely(&netstamp_needed_key))
> > +			skb->tstamp = tstamp = ktime_get_real();
> > +		else
> > +			skb->tstamp = tstamp = 0;
> > +	}
> > +
> > +	/* It is safe to push/pull even if skb_shared() */
> > +	__skb_push(skb, skb->mac_len);
> > +	bpf_compute_data_pointers(skb);
> > +	filter_res = bpf_prog_run(prog, skb);
> > +	__skb_pull(skb, skb->mac_len);
> > +
> > +	/* __sk_buff->tstamp was not changed, restore the delivery_time */
> > +	if (unlikely(delivery_time) && skb_tstamp(skb) == tstamp)
> > +		skb_set_delivery_time(skb, delivery_time, true);
> > +
> > +	return filter_res;
> > +}
> > +
> >   /* Similar to bpf_compute_data_pointers(), except that save orginal
> >    * data in cb->data and cb->meta_data for restore.
> >    */
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 4677bb6c7279..a14b04b86c13 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -866,22 +866,23 @@ struct sk_buff {
> >   	__u8			vlan_present:1;	/* See PKT_VLAN_PRESENT_BIT */
> >   	__u8			csum_complete_sw:1;
> >   	__u8			csum_level:2;
> > -	__u8			csum_not_inet:1;
> >   	__u8			dst_pending_confirm:1;
> > +	__u8			mono_delivery_time:1;
> > +
> > +#ifdef CONFIG_NET_CLS_ACT
> > +	__u8			tc_skip_classify:1;
> > +	__u8			tc_at_ingress:1;
> > +#endif
> >   #ifdef CONFIG_IPV6_NDISC_NODETYPE
> >   	__u8			ndisc_nodetype:2;
> >   #endif
> > -
> > +	__u8			csum_not_inet:1;
> >   	__u8			ipvs_property:1;
> >   	__u8			inner_protocol_type:1;
> >   	__u8			remcsum_offload:1;
> >   #ifdef CONFIG_NET_SWITCHDEV
> >   	__u8			offload_fwd_mark:1;
> >   	__u8			offload_l3_fwd_mark:1;
> > -#endif
> > -#ifdef CONFIG_NET_CLS_ACT
> > -	__u8			tc_skip_classify:1;
> > -	__u8			tc_at_ingress:1;
> >   #endif
> >   	__u8			redirected:1;
> >   #ifdef CONFIG_NET_REDIRECT
> > @@ -894,7 +895,6 @@ struct sk_buff {
> >   	__u8			decrypted:1;
> >   #endif
> >   	__u8			slow_gro:1;
> > -	__u8			mono_delivery_time:1;
> 
> >   #ifdef CONFIG_NET_SCHED
> >   	__u16			tc_index;	/* traffic control index */
> > @@ -972,10 +972,16 @@ struct sk_buff {
> >   /* if you move pkt_vlan_present around you also must adapt these
> > constants */
> >   #ifdef __BIG_ENDIAN_BITFIELD
> >   #define PKT_VLAN_PRESENT_BIT	7
> > +#define TC_AT_INGRESS_SHIFT	0
> > +#define SKB_MONO_DELIVERY_TIME_SHIFT 2
> >   #else
> >   #define PKT_VLAN_PRESENT_BIT	0
> > +#define TC_AT_INGRESS_SHIFT	7
> > +#define SKB_MONO_DELIVERY_TIME_SHIFT 5
> >   #endif
> >   #define PKT_VLAN_PRESENT_OFFSET	offsetof(struct sk_buff,
> > __pkt_vlan_present_offset)
> > +#define TC_AT_INGRESS_OFFSET offsetof(struct sk_buff,
> > __pkt_vlan_present_offset)
> > +#define SKB_MONO_DELIVERY_TIME_OFFSET offsetof(struct sk_buff,
> > __pkt_vlan_present_offset)
> 
> >   #ifdef __KERNEL__
> >   /*
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index b0383d371b9a..83725c891f3c 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -5437,6 +5437,7 @@ struct __sk_buff {
> >   	__u32 gso_size;
> >   	__u32 :32;		/* Padding, future use. */
> >   	__u64 hwtstamp;
> > +	__u64 mono_delivery_time;
> >   };
> 
> >   struct bpf_tunnel_key {
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 4fc53d645a01..db17812f0f8c 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -7832,6 +7832,7 @@ static bool bpf_skb_is_valid_access(int off, int
> > size, enum bpf_access_type type
> >   			return false;
> >   		break;
> >   	case bpf_ctx_range(struct __sk_buff, tstamp):
> > +	case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
> >   		if (size != sizeof(__u64))
> >   			return false;
> >   		break;
> > @@ -7872,6 +7873,7 @@ static bool sk_filter_is_valid_access(int off, int
> > size,
> >   	case bpf_ctx_range(struct __sk_buff, tstamp):
> >   	case bpf_ctx_range(struct __sk_buff, wire_len):
> >   	case bpf_ctx_range(struct __sk_buff, hwtstamp):
> > +	case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
> >   		return false;
> >   	}
> 
> > @@ -7911,6 +7913,7 @@ static bool cg_skb_is_valid_access(int off, int
> > size,
> >   		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
> >   			break;
> >   		case bpf_ctx_range(struct __sk_buff, tstamp):
> > +		case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
> >   			if (!bpf_capable())
> >   				return false;
> >   			break;
> > @@ -7943,6 +7946,7 @@ static bool lwt_is_valid_access(int off, int size,
> >   	case bpf_ctx_range(struct __sk_buff, tstamp):
> >   	case bpf_ctx_range(struct __sk_buff, wire_len):
> >   	case bpf_ctx_range(struct __sk_buff, hwtstamp):
> > +	case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
> >   		return false;
> >   	}
> 
> > @@ -8169,6 +8173,7 @@ static bool tc_cls_act_is_valid_access(int off,
> > int size,
> >   		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
> >   		case bpf_ctx_range(struct __sk_buff, tstamp):
> >   		case bpf_ctx_range(struct __sk_buff, queue_mapping):
> > +		case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
> >   			break;
> >   		default:
> >   			return false;
> > @@ -8445,6 +8450,7 @@ static bool sk_skb_is_valid_access(int off, int
> > size,
> >   	case bpf_ctx_range(struct __sk_buff, tstamp):
> >   	case bpf_ctx_range(struct __sk_buff, wire_len):
> >   	case bpf_ctx_range(struct __sk_buff, hwtstamp):
> > +	case bpf_ctx_range(struct __sk_buff, mono_delivery_time):
> >   		return false;
> >   	}
> 
> > @@ -8603,6 +8609,114 @@ static struct bpf_insn
> > *bpf_convert_shinfo_access(const struct bpf_insn *si,
> >   	return insn;
> >   }
> 
> 
> [..]
> 
> > +static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn
> > *si,
> > +						struct bpf_insn *insn)
> > +{
> > +	__u8 value_reg = si->dst_reg;
> > +	__u8 skb_reg = si->src_reg;
> > +	__u8 tmp_reg = BPF_REG_AX;
> > +
> > +#ifdef CONFIG_NET_CLS_ACT
> > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
> > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, 1 << TC_AT_INGRESS_SHIFT);
> > +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5);
> > +	/* @ingress, read __sk_buff->tstamp as the (rcv) timestamp,
> > +	 * so check the skb->mono_delivery_time.
> > +	 */
> > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
> > +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
> > +				1 << SKB_MONO_DELIVERY_TIME_SHIFT);
> > +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
> > +	/* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */
> > +	*insn++ = BPF_MOV64_IMM(value_reg, 0);
> > +	*insn++ = BPF_JMP_A(1);
> > +#endif
> > +
> > +	*insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg,
> > +			      offsetof(struct sk_buff, tstamp));
> > +	return insn;
> > +}
> > +
> > +static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_insn
> > *si,
> > +						 struct bpf_insn *insn)
> > +{
> > +	__u8 value_reg = si->src_reg;
> > +	__u8 skb_reg = si->dst_reg;
> > +	__u8 tmp_reg = BPF_REG_AX;
> > +
> > +	/* skb->tstamp = tstamp */
> > +	*insn++ = BPF_STX_MEM(BPF_DW, skb_reg, value_reg,
> > +			      offsetof(struct sk_buff, tstamp));
> > +
> > +#ifdef CONFIG_NET_CLS_ACT
> > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
> > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, 1 << TC_AT_INGRESS_SHIFT);
> > +	*insn++ = BPF_JMP32_IMM(BPF_JNE, tmp_reg, 0, 1);
> > +#endif
> > +
> > +	/* test tstamp != 0 */
> > +	*insn++ = BPF_JMP_IMM(BPF_JNE, value_reg, 0, 3);
> > +	/* writing __sk_buff->tstamp at ingress or writing 0,
> > +	 * clear the skb->mono_delivery_time.
> > +	 */
> > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
> > +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
> > +				~(1 << SKB_MONO_DELIVERY_TIME_SHIFT));
> > +	*insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg,
> > +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> > +
> > +	return insn;
> > +}
> 
> I wonder if we'll see the regression from this. We read/write tstamp
> frequently and I'm not sure we care about the forwarding case.
>
> As a future work/follow up, do you think we can support cases like
> bpf_prog_load(prog_type=SCHED_CLS expected_attach_type=TC_EGRESS) where
> we can generate bytecode with only BPF_LDX_MEM/BPF_STX_MEM for skb->tstamp?
> (essentially a bytecode as it was prior to your patch series)
>
> Since we know that specific program will run only at egress,
> I'm assuming we can generate simpler bytecode? (of course it needs more
> work on cls_bpf to enforce this new expected_attach_type constraint)
The common (if not the only useful?) use case for reading/writing
skb->tstamp should be at egress now.  For this case, the patch adds a
test on skb->tc_at_ingress and a test that the written value is
non-zero.  The skb->mono_delivery_time bit should not be touched in
this common case at egress.  Even with expected_attach_type=TC_EGRESS,
it could save testing tc_at_ingress (3 bpf insns), but it still needs
to test that the written value is non-zero (1 bpf insn).  Regardless,
I doubt there is any meaningful difference from these two new tests,
considering the other things a typical bpf prog is doing (e.g. parsing
headers, map lookups...) and the other logic in the stack's egress path.

As for adding expected_attach_type=TC_EGRESS in the future for perf
reasons alone... hmm... I suspect it will make the API harder/more
confusing to use, but yeah, if there is a convincing difference to
justify it.

Unrelated to the perf topic but related to adding an
expected_attach_type: I had considered adding an expected_attach_type,
though not for perf reasons.

The consideration was to use expected_attach_type to distinguish the
__sk_buff->tstamp behavior at ingress, although I have a hard time
thinking of a reasonable use case for accessing __sk_buff->tstamp at
ingress other than printing the (rcv) timestamp out (which could also
be 0).

However, I guess we have to assume we cannot break the ingress
behavior now.  The dance in bpf_prog_run_at_ingress() in this patch
is to keep the (rcv) timestamp behavior for the existing tc-bpf@ingress.
My initial thought was to add expected_attach_type=TC_DELIVERY_TIME
to signal that the bpf prog expects skb->tstamp could have the
delivery_time at ingress, but learning this in
prog->delivery_time_access during verification should be just as good,
so I dismissed the expected_attach_type idea and saved the user from
remembering when to use a new expected_attach_type.
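
The bpf_prog_run_at_ingress() dance can be sketched in plain C (a
hypothetical userspace model; skb_model, prog_fn, noop_prog,
zeroing_prog, and the rcv_tstamp parameter are illustrative stand-ins,
not kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct skb_model {
	uint64_t tstamp;		/* delivery_time or (rcv) timestamp */
	bool mono_delivery_time;
};

typedef void (*prog_fn)(struct skb_model *skb);

static void run_at_ingress(struct skb_model *skb, prog_fn prog,
			   bool prog_uses_delivery_time,
			   uint64_t rcv_tstamp /* stands in for ktime_get_real() */)
{
	uint64_t saved = 0, shown = 0;

	/* An old prog at ingress only expects the (rcv) timestamp, so
	 * hide the delivery_time before the prog runs.
	 */
	if (skb->mono_delivery_time && !prog_uses_delivery_time) {
		saved = skb->tstamp;
		skb->mono_delivery_time = false;
		skb->tstamp = shown = rcv_tstamp;
	}

	prog(skb);

	/* The prog did not change skb->tstamp: restore the delivery_time. */
	if (saved && skb->tstamp == shown) {
		skb->tstamp = saved;
		skb->mono_delivery_time = true;
	}
}

/* Example progs: one leaves skb->tstamp alone, one writes 0 to it. */
static void noop_prog(struct skb_model *skb) { (void)skb; }
static void zeroing_prog(struct skb_model *skb) { skb->tstamp = 0; }
```

A prog that leaves skb->tstamp untouched gets the delivery_time restored
afterwards; one that writes to it keeps its own value and the flag stays
cleared.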

Thanks for the review!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v3 net-next 4/4] bpf: Add __sk_buff->mono_delivery_time and handle __sk_buff->tstamp based on tc_at_ingress
  2022-01-21 20:56     ` Martin KaFai Lau
@ 2022-01-21 22:33       ` sdf
  0 siblings, 0 replies; 15+ messages in thread
From: sdf @ 2022-01-21 22:33 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Willem de Bruijn

On 01/21, Martin KaFai Lau wrote:
> On Fri, Jan 21, 2022 at 10:50:20AM -0800, sdf@google.com wrote:
> > On 01/20, Martin KaFai Lau wrote:
> > > __sk_buff->mono_delivery_time:
> > > This patch adds __sk_buff->mono_delivery_time to
> > > read and write the mono delivery_time in skb->tstamp.
> >
> > > The bpf rewrite is like:
> > > /* BPF_READ: __u64 a = __sk_buff->mono_delivery_time; */
> > > if (skb->mono_delivery_time)
> > > 	a = skb->tstamp;
> > > else
> > > 	a = 0;
> >
> > > /* BPF_WRITE: __sk_buff->mono_delivery_time = a; */
> > > skb->tstamp = a;
> > > skb->mono_delivery_time = !!a;
> >
> > > __sk_buff->tstamp:
> > > The bpf rewrite is like:
> > > /* BPF_READ: __u64 a = __sk_buff->tstamp; */
> > > if (skb->tc_at_ingress && skb->mono_delivery_time)
> > > 	a = 0;
> > > else
> > > 	a = skb->tstamp;
> >
> > > /* BPF_WRITE: __sk_buff->tstamp = a; */
> > > skb->tstamp = a;
> > > if (skb->tc_at_ingress || !a)
> > > 	skb->mono_delivery_time = 0;
> >
> > > At egress, reading is the same as before.  All skb->tstamp
> > > is the delivery_time.  Writing will not change the (kernel)
> > > skb->mono_delivery_time also unless 0 is being written.  This
> > > will be the same behavior as before.
> >
> > > (#) At ingress, the current bpf prog can only expect the
> > > (rcv) timestamp.  Thus, both reading and writing are now treated as
> > > operating on the (rcv) timestamp for the existing bpf prog.
> >
> > > During bpf load time, the verifier will learn if the
> > > bpf prog has accessed the new __sk_buff->mono_delivery_time.
> >
> > > When reading at ingress, if the bpf prog does not access the
> > > new __sk_buff->mono_delivery_time, it will be treated as the
> > > existing behavior as mentioned in (#) above.  If the (kernel)  
> skb->tstamp
> > > currently has a delivery_time,  it will temporary be saved first and  
> then
> > > set the skb->tstamp to either the ktime_get_real() or zero.  After
> > > the bpf prog finished running, if the bpf prog did not change
> > > the skb->tstamp,  the saved delivery_time will be restored
> > > back to the skb->tstamp.
> >
> > > When writing __sk_buff->tstamp at ingress, the
> > > skb->mono_delivery_time will be cleared because of
> > > the (#) mentioned above.
> >
> > > If the bpf prog does access the new __sk_buff->mono_delivery_time
> > > at ingress, it indicates that the bpf prog is aware of this new
> > > kernel support:
> > > the (kernel) skb->tstamp can have the delivery_time or the
> > > (rcv) timestamp at ingress.  If the __sk_buff->mono_delivery_time
> > > is available, the __sk_buff->tstamp will not be available and
> > > it will be zero.
> >
> > > The bpf rewrite needs to access the skb's mono_delivery_time
> > > and tc_at_ingress bit.  They are moved up in sk_buff so
> > > that bpf rewrite can be done at a fixed offset.  tc_skip_classify
> > > is moved together with tc_at_ingress.  To get one bit for
> > > mono_delivery_time, csum_not_inet is moved down and this bit
> > > is currently used by sctp.
> >
> > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> >
> > [..]
> >
> > > +static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_insn
> > > *si,
> > > +						struct bpf_insn *insn)
> > > +{
> > > +	__u8 value_reg = si->dst_reg;
> > > +	__u8 skb_reg = si->src_reg;
> > > +	__u8 tmp_reg = BPF_REG_AX;
> > > +
> > > +#ifdef CONFIG_NET_CLS_ACT
> > > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,  
> TC_AT_INGRESS_OFFSET);
> > > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, 1 << TC_AT_INGRESS_SHIFT);
> > > +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 5);
> > > +	/* @ingress, read __sk_buff->tstamp as the (rcv) timestamp,
> > > +	 * so check the skb->mono_delivery_time.
> > > +	 */
> > > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
> > > +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> > > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
> > > +				1 << SKB_MONO_DELIVERY_TIME_SHIFT);
> > > +	*insn++ = BPF_JMP32_IMM(BPF_JEQ, tmp_reg, 0, 2);
> > > +	/* skb->mono_delivery_time is set, read 0 as the (rcv) timestamp. */
> > > +	*insn++ = BPF_MOV64_IMM(value_reg, 0);
> > > +	*insn++ = BPF_JMP_A(1);
> > > +#endif
> > > +
> > > +	*insn++ = BPF_LDX_MEM(BPF_DW, value_reg, skb_reg,
> > > +			      offsetof(struct sk_buff, tstamp));
> > > +	return insn;
> > > +}
> > > +
> > > +static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_insn *si,
> > > +						 struct bpf_insn *insn)
> > > +{
> > > +	__u8 value_reg = si->src_reg;
> > > +	__u8 skb_reg = si->dst_reg;
> > > +	__u8 tmp_reg = BPF_REG_AX;
> > > +
> > > +	/* skb->tstamp = tstamp */
> > > +	*insn++ = BPF_STX_MEM(BPF_DW, skb_reg, value_reg,
> > > +			      offsetof(struct sk_buff, tstamp));
> > > +
> > > +#ifdef CONFIG_NET_CLS_ACT
> > > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, TC_AT_INGRESS_OFFSET);
> > > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, 1 << TC_AT_INGRESS_SHIFT);
> > > +	*insn++ = BPF_JMP32_IMM(BPF_JNE, tmp_reg, 0, 1);
> > > +#endif
> > > +
> > > +	/* test tstamp != 0 */
> > > +	*insn++ = BPF_JMP_IMM(BPF_JNE, value_reg, 0, 3);
> > > +	/* writing __sk_buff->tstamp at ingress or writing 0,
> > > +	 * clear the skb->mono_delivery_time.
> > > +	 */
> > > +	*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
> > > +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> > > +	*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
> > > +				~(1 << SKB_MONO_DELIVERY_TIME_SHIFT));
> > > +	*insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg,
> > > +			      SKB_MONO_DELIVERY_TIME_OFFSET);
> > > +
> > > +	return insn;
> > > +}
> >
> > I wonder if we'll see a regression from this. We read/write tstamp
> > frequently and I'm not sure we care about the forwarding case.
> >
> > As a future work/follow up, do you think we can support cases like
> > bpf_prog_load(prog_type=SCHED_CLS expected_attach_type=TC_EGRESS) where
> > we can generate bytecode with only BPF_LDX_MEM/BPF_STX_MEM for skb->tstamp?
> > (essentially a bytecode as it was prior to your patch series)
> >
> > Since we know that that specific program will run only at egress,
> > I'm assuming we can generate simpler bytecode? (of course it needs more
> > work on cls_bpf to enforce this new expected_attach_type constraint)
> The common (if not the only useful?) use case for reading/writing
> skb->tstamp should be at egress now.  For this case, the patch added a
> test on skb->tc_at_ingress and a test that the written value is non-zero.
> The skb->mono_delivery_time bit is not touched in this common
> case at egress.  Even with expected_attach_type=TC_EGRESS, it could save
> the tc_at_ingress test (3 bpf insns), but it still needs to test that the
> written value is non-zero (1 bpf insn).  Regardless, I doubt there
> is any meaningful difference from these two new tests considering the other
> things a typical bpf prog is doing (e.g. parsing headers,
> map lookups...) and the other logic in the stack's egress path.

I'm mostly concerned about reading, where it seems like we can get back
to a single BPF_LDX_MEM at egress. But I agree that I'm probably
over-thinking it and there won't be any extra visible hit from those
3 insns that test skb->tc_at_ingress.
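To make the semantics of the quoted bytecode concrete, here is a standalone C model of what the generated read/write sequences compute. The struct and function names here are hypothetical userspace stand-ins, not kernel code; only the field names follow the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standalone model of the skb fields the generated insns touch. */
struct skb_model {
	uint64_t tstamp;
	bool mono_delivery_time;
	bool tc_at_ingress;
};

/* What bpf_convert_tstamp_read()'s insns compute: at ingress, a mono
 * delivery_time must not be reported as a (rcv) timestamp, so 0 is
 * read instead.
 */
uint64_t model_tstamp_read(const struct skb_model *skb)
{
	if (skb->tc_at_ingress && skb->mono_delivery_time)
		return 0;
	return skb->tstamp;
}

/* What bpf_convert_tstamp_write()'s insns do: store the value, then
 * clear mono_delivery_time when writing at ingress or writing 0.
 */
void model_tstamp_write(struct skb_model *skb, uint64_t tstamp)
{
	skb->tstamp = tstamp;
	if (skb->tc_at_ingress || !tstamp)
		skb->mono_delivery_time = false;
}
```

Note how the common egress case (non-zero write at !tc_at_ingress) leaves the mono_delivery_time bit untouched, matching the discussion above.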

> For adding expected_attach_type=TC_EGRESS in the future for the perf reason
> alone... hmm... I suspect it will make it harder/more confusing to use, but
> yeah, if there is a convincing difference to justify it.

> Unrelated to the perf topic but related to adding expected_attach_type,
> I had considered adding an expected_attach_type but not for the
> perf reason.

> The consideration was to use expected_attach_type to distinguish the
> __sk_buff->tstamp behavior on ingress, although I have a hard time
> thinking of a reasonable use case for accessing __sk_buff->tstamp at
> ingress other than printing the (rcv) timestamp out (which could also be 0).

> However, I guess we have to assume we cannot break the ingress
> behavior now.  The dance in bpf_prog_run_at_ingress() in this patch
> is to keep the (rcv) timestamp behavior for the existing tc-bpf@ingress.
> My initial thought was to add expected_attach_type=TC_DELIVERY_TIME
> to signal that the bpf prog expects skb->tstamp could have the
> delivery_time at ingress, but then I figured learning this in
> prog->delivery_time_access during verification should be as good, so I
> dismissed the expected_attach_type idea to spare the user from
> remembering when to use this new expected_attach_type.

Yeah, having TC_INGRESS/TC_EGRESS expected_attach_type can be helpful,
maybe something we'll eventually have to do. It could open up things
like rx_tstamp / tx_tstamp in __sk_buff (which can return tstamp or 0
depending on skb->mono_delivery_time). Seems like an improvement over the
current sometimes-rx-sometimes-tx tstamp :-)

> Thanks for the review !

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v3 net-next 3/4] net: Set skb->mono_delivery_time and clear it when delivering locally
  2022-01-21 12:02   ` Julian Anastasov
@ 2022-01-22  3:28     ` Martin KaFai Lau
  0 siblings, 0 replies; 15+ messages in thread
From: Martin KaFai Lau @ 2022-01-22  3:28 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team, Willem de Bruijn

On Fri, Jan 21, 2022 at 02:02:23PM +0200, Julian Anastasov wrote:
> > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> > index 3a025c011971..35311ca75496 100644
> > --- a/net/ipv4/ip_input.c
> > +++ b/net/ipv4/ip_input.c
> > @@ -244,6 +244,7 @@ int ip_local_deliver(struct sk_buff *skb)
> >  	 */
> >  	struct net *net = dev_net(skb->dev);
> >  
> > +	skb_clear_delivery_time(skb);
> 
> 	Is it safe to move this line into ip_local_deliver_finish ?
> 
> >  	if (ip_is_fragment(ip_hdr(skb))) {
> >  		if (ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER))
> >  			return 0;
> > diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> > index 80256717868e..84f93864b774 100644
> > --- a/net/ipv6/ip6_input.c
> > +++ b/net/ipv6/ip6_input.c
> > @@ -469,6 +469,7 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
> >  
> >  int ip6_input(struct sk_buff *skb)
> >  {
> > +	skb_clear_delivery_time(skb);
> 
> 	Is it safe to move this line into ip6_input_finish?
> The problem for both cases is that IPVS hooks at LOCAL_IN and
> can decide to forward the packet by returning NF_STOLEN and
> avoiding the _finish code. In short, before reaching the
> _finish code it is still not decided that packet reaches the
> sockets.
hmm...

Theoretically, it should be doable to push it later because the
ingress path cannot assume the (rcv) timestamp is always available,
so it should already expect to handle the 0 case and do ktime_get_real(),
e.g. in the tapping case used by af_packet.  The tradeoff is just
a later (rcv) timestamp and also more code churn, e.g.
somewhere around ip_is_fragment() may need to change.

My initial attempt was to call skb_clear_delivery_time()
right after sch_handle_ingress() in dev.c.  However, it did not take
much to make ip[6]_forward work also, so I pushed it down to here.  However,
that seems to make the other kernel forward paths inconsistent in terms of
the expectation of keeping the delivery_time.

I will give it a try in v4, but I am not very sure for now before looking
closer.  The worst case is to move it back to just after sch_handle_ingress()
so that the kernel forward paths will still behave consistently,
but I will give it a try first.
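As a hedged sketch of the behavior under discussion (the helper name comes from the patch; the body below is a standalone model, not the kernel implementation): clearing the delivery_time at the local-delivery point means dropping the mono EDT so that later code, such as the af_packet tap mentioned above, falls back to taking a fresh (rcv) timestamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standalone model, not kernel code. */
struct skb_model {
	uint64_t tstamp;		/* delivery_time or (rcv) timestamp */
	bool mono_delivery_time;	/* set when tstamp holds a mono EDT */
};

/* Model of skb_clear_delivery_time(): once the skb is known to be
 * delivered locally, a mono delivery_time is useless to the receiving
 * socket, so drop it; a real (rcv) timestamp is left untouched.
 */
void model_clear_delivery_time(struct skb_model *skb)
{
	if (skb->mono_delivery_time) {
		skb->mono_delivery_time = false;
		skb->tstamp = 0;	/* taps then take a fresh stamp */
	}
}
```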

Thanks for the review !

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v3 net-next 1/4] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
  2022-01-21  7:30 ` [RFC PATCH v3 net-next 1/4] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Martin KaFai Lau
@ 2022-01-22 15:32   ` Willem de Bruijn
  2022-01-22 19:52     ` Martin KaFai Lau
  2022-01-22 20:03     ` Martin KaFai Lau
  0 siblings, 2 replies; 15+ messages in thread
From: Willem de Bruijn @ 2022-01-22 15:32 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team

On Fri, Jan 21, 2022 at 2:30 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> skb->tstamp was first used as the (rcv) timestamp in real time clock base.
> The major usage is to report it to the user (e.g. SO_TIMESTAMP).
>
> Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
> during egress and used by the qdisc (e.g. sch_fq) to make decision on when
> the skb can be passed to the dev.
>
> Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
> or the delivery_time, so it is always reset to 0 whenever forwarded
> between egress and ingress.
>
> While it makes sense to always clear the (rcv) timestamp in skb->tstamp
> to avoid confusing sch_fq that expects the delivery_time, it is a
> performance issue [0] to clear the delivery_time if the skb finally
> egress to a fq@phy-dev.  For example, when forwarding from egress to
> ingress and then finally back to egress:
>
>             tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
>                                      ^              ^
>                                      reset          reset
>
> [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
>
> This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
> is storing the mono delivery_time instead of the (rcv) timestamp.
>
> The current use case is to keep the TCP mono delivery_time (EDT) and
> to be used with sch_fq.  The later patch will also allow tc-bpf to read
> and change the mono delivery_time.
>
> In the future, another bit (e.g. skb->user_delivery_time) can be added
> for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.
>
> [ This patch is a prep work.  The following patch will
>   get the other parts of the stack ready first.  Then another patch
>   after that will finally set the skb->mono_delivery_time. ]
>
> skb_set_delivery_time() function is added.  It is used by the tcp_output.c
> and during ip[6] fragmentation to assign the delivery_time to
> the skb->tstamp and also set the skb->mono_delivery_time.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/skbuff.h                     | 13 +++++++++++++
>  net/bridge/netfilter/nf_conntrack_bridge.c |  5 +++--
>  net/ipv4/ip_output.c                       |  5 +++--
>  net/ipv4/tcp_output.c                      | 16 +++++++++-------
>  net/ipv6/ip6_output.c                      |  5 +++--
>  net/ipv6/netfilter.c                       |  5 +++--
>  6 files changed, 34 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index bf11e1fbd69b..b9e20187242a 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -720,6 +720,10 @@ typedef unsigned char *sk_buff_data_t;
>   *     @dst_pending_confirm: need to confirm neighbour
>   *     @decrypted: Decrypted SKB
>   *     @slow_gro: state present at GRO time, slower prepare step required
> + *     @mono_delivery_time: When set, skb->tstamap has the

tstamp

> + *             delivery_time in mono clock base (i.e. EDT).  Otherwise, the
> + *             skb->tstamp has the (rcv) timestamp at ingress and
> + *             delivery_time at egress.
>   *     @napi_id: id of the NAPI struct this skb came from
>   *     @sender_cpu: (aka @napi_id) source CPU in XPS
>   *     @secmark: security marking
> @@ -890,6 +894,7 @@ struct sk_buff {
>         __u8                    decrypted:1;
>  #endif
>         __u8                    slow_gro:1;
> +       __u8                    mono_delivery_time:1;

This bit fills a hole, does not change sk_buff size, right?

>
>  #ifdef CONFIG_NET_SCHED
>         __u16                   tc_index;       /* traffic control index */
> @@ -3903,6 +3908,14 @@ static inline ktime_t net_invalid_timestamp(void)
>         return 0;
>  }
>
> +static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
> +                                        bool mono)
> +{
> +       skb->tstamp = kt;
> +       /* Setting mono_delivery_time will be enabled later */
> +       /* skb->mono_delivery_time = kt && mono; */
> +}
> +
>  static inline u8 skb_metadata_len(const struct sk_buff *skb)
>  {
>         return skb_shinfo(skb)->meta_len;
> diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c
> index fdbed3158555..ebfb2a5c59e4 100644
> --- a/net/bridge/netfilter/nf_conntrack_bridge.c
> +++ b/net/bridge/netfilter/nf_conntrack_bridge.c
> @@ -32,6 +32,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
>                                            struct sk_buff *))
>  {
>         int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
> +       bool mono_delivery_time = skb->mono_delivery_time;

This use of a local variable is just a style choice, not needed for
correctness, correct?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v3 net-next 0/4] Preserve mono delivery time (EDT) in skb->tstamp
  2022-01-21  7:30 [RFC PATCH v3 net-next 0/4] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
                   ` (3 preceding siblings ...)
  2022-01-21  7:30 ` [RFC PATCH v3 net-next 4/4] bpf: Add __sk_buff->mono_delivery_time and handle __sk_buff->tstamp based on tc_at_ingress Martin KaFai Lau
@ 2022-01-22 15:43 ` Willem de Bruijn
  2022-01-22 21:05   ` Martin KaFai Lau
  4 siblings, 1 reply; 15+ messages in thread
From: Willem de Bruijn @ 2022-01-22 15:43 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team

On Fri, Jan 21, 2022 at 2:30 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> skb->tstamp was first used as the (rcv) timestamp in real time clock base.
> The major usage is to report it to the user (e.g. SO_TIMESTAMP).
>
> Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
> during egress and used by the qdisc (e.g. sch_fq) to make decision on when
> the skb can be passed to the dev.
>
> Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
> or the delivery_time, so it is always reset to 0 whenever forwarded
> between egress and ingress.
>
> While it makes sense to always clear the (rcv) timestamp in skb->tstamp
> to avoid confusing sch_fq that expects the delivery_time, it is a
> performance issue [0] to clear the delivery_time if the skb finally
> egress to a fq@phy-dev.
>
> v3:
> - Feedback from v2 is using shinfo(skb)->tx_flags could be racy.
> - Considered to reuse a few bits in skb->tstamp to represent
>   different semantics, other than more code churns, it will break
>   the bpf usecase which currently can write and then read back
>   the skb->tstamp.
> - Went back to v1 idea on adding a bit to skb and address the
>   feedbacks on v1:
> - Added one bit skb->mono_delivery_time to flag that
>   the skb->tstamp has the mono delivery_time (EDT), instead
>   of adding a bit to flag if the skb->tstamp has been forwarded or not.
> - Instead of resetting the delivery_time back to the (rcv) timestamp
>   during recvmsg syscall which may be too late and not useful,
>   the delivery_time reset in v3 happens earlier once the stack
>   knows that the skb will be delivered locally.
> - Handled the tapping@ingress case by af_packet
> - No need to change the (rcv) timestamp to mono clock base as in v1.
>   The added one bit to flag skb->mono_delivery_time is enough
>   to keep the EDT delivery_time during forward.
> - Added logic to the bpf side to make the existing bpf
>   running at ingress can still get the (rcv) timestamp
>   when reading the __sk_buff->tstamp.  New __sk_buff->mono_delivery_time
>   is also added.  Test is still needed to test this piece.
>
> Martin KaFai Lau (4):
>   net: Add skb->mono_delivery_time to distinguish mono delivery_time
>     from (rcv) timestamp
>   net: Add skb_clear_tstamp() to keep the mono delivery_time
>   net: Set skb->mono_delivery_time and clear it when delivering locally
>   bpf: Add __sk_buff->mono_delivery_time and handle __sk_buff->tstamp
>     based on tc_at_ingress
>
>  drivers/net/loopback.c                     |   2 +-
>  include/linux/filter.h                     |  31 ++++-
>  include/linux/skbuff.h                     |  64 ++++++++--
>  include/uapi/linux/bpf.h                   |   1 +
>  net/bridge/br_forward.c                    |   2 +-
>  net/bridge/netfilter/nf_conntrack_bridge.c |   5 +-
>  net/core/dev.c                             |   4 +-
>  net/core/filter.c                          | 140 +++++++++++++++++++--
>  net/core/skbuff.c                          |   8 +-
>  net/ipv4/ip_forward.c                      |   2 +-
>  net/ipv4/ip_input.c                        |   1 +
>  net/ipv4/ip_output.c                       |   5 +-
>  net/ipv4/tcp_output.c                      |  16 +--
>  net/ipv6/ip6_input.c                       |   1 +
>  net/ipv6/ip6_output.c                      |   7 +-
>  net/ipv6/netfilter.c                       |   5 +-
>  net/netfilter/ipvs/ip_vs_xmit.c            |   6 +-
>  net/netfilter/nf_dup_netdev.c              |   2 +-
>  net/netfilter/nf_flow_table_ip.c           |   4 +-
>  net/netfilter/nft_fwd_netdev.c             |   2 +-
>  net/openvswitch/vport.c                    |   2 +-
>  net/packet/af_packet.c                     |   4 +-
>  net/sched/act_bpf.c                        |   5 +-
>  net/sched/cls_bpf.c                        |   6 +-
>  net/xfrm/xfrm_interface.c                  |   2 +-
>  tools/include/uapi/linux/bpf.h             |   1 +
>  26 files changed, 265 insertions(+), 63 deletions(-)
>
> --

This overall looks great to me.

The only effect on timestamping is slightly delayed receive timestamp
for packets arriving over a virtual loop (loopback, veth, ..)
interface, as the timestamp is now captured at the network protocol
input.

Actually, while writing that: timestamping is a socket level option,
not specific to IP. Might this break receive timestamping over such
interfaces for other protocols?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v3 net-next 1/4] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
  2022-01-22 15:32   ` Willem de Bruijn
@ 2022-01-22 19:52     ` Martin KaFai Lau
  2022-01-22 20:03     ` Martin KaFai Lau
  1 sibling, 0 replies; 15+ messages in thread
From: Martin KaFai Lau @ 2022-01-22 19:52 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team

On Sat, Jan 22, 2022 at 10:32:16AM -0500, Willem de Bruijn wrote:
> On Fri, Jan 21, 2022 at 2:30 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > skb->tstamp was first used as the (rcv) timestamp in real time clock base.
> > The major usage is to report it to the user (e.g. SO_TIMESTAMP).
> >
> > Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
> > during egress and used by the qdisc (e.g. sch_fq) to make decision on when
> > the skb can be passed to the dev.
> >
> > Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
> > or the delivery_time, so it is always reset to 0 whenever forwarded
> > between egress and ingress.
> >
> > While it makes sense to always clear the (rcv) timestamp in skb->tstamp
> > to avoid confusing sch_fq that expects the delivery_time, it is a
> > performance issue [0] to clear the delivery_time if the skb finally
> > egress to a fq@phy-dev.  For example, when forwarding from egress to
> > ingress and then finally back to egress:
> >
> >             tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
> >                                      ^              ^
> >                                      reset          reset
> >
> > [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf 
> >
> > This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
> > is storing the mono delivery_time instead of the (rcv) timestamp.
> >
> > The current use case is to keep the TCP mono delivery_time (EDT) and
> > to be used with sch_fq.  The later patch will also allow tc-bpf to read
> > and change the mono delivery_time.
> >
> > In the future, another bit (e.g. skb->user_delivery_time) can be added
> > for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.
> >
> > [ This patch is a prep work.  The following patch will
> >   get the other parts of the stack ready first.  Then another patch
> >   after that will finally set the skb->mono_delivery_time. ]
> >
> > skb_set_delivery_time() function is added.  It is used by the tcp_output.c
> > and during ip[6] fragmentation to assign the delivery_time to
> > the skb->tstamp and also set the skb->mono_delivery_time.
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  include/linux/skbuff.h                     | 13 +++++++++++++
> >  net/bridge/netfilter/nf_conntrack_bridge.c |  5 +++--
> >  net/ipv4/ip_output.c                       |  5 +++--
> >  net/ipv4/tcp_output.c                      | 16 +++++++++-------
> >  net/ipv6/ip6_output.c                      |  5 +++--
> >  net/ipv6/netfilter.c                       |  5 +++--
> >  6 files changed, 34 insertions(+), 15 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index bf11e1fbd69b..b9e20187242a 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -720,6 +720,10 @@ typedef unsigned char *sk_buff_data_t;
> >   *     @dst_pending_confirm: need to confirm neighbour
> >   *     @decrypted: Decrypted SKB
> >   *     @slow_gro: state present at GRO time, slower prepare step required
> > + *     @mono_delivery_time: When set, skb->tstamap has the
> 
> tstamp
> 
> > + *             delivery_time in mono clock base (i.e. EDT).  Otherwise, the
> > + *             skb->tstamp has the (rcv) timestamp at ingress and
> > + *             delivery_time at egress.
> >   *     @napi_id: id of the NAPI struct this skb came from
> >   *     @sender_cpu: (aka @napi_id) source CPU in XPS
> >   *     @secmark: security marking
> > @@ -890,6 +894,7 @@ struct sk_buff {
> >         __u8                    decrypted:1;
> >  #endif
> >         __u8                    slow_gro:1;
> > +       __u8                    mono_delivery_time:1;
> 
> This bit fills a hole, does not change sk_buff size, right?
> 
> >
> >  #ifdef CONFIG_NET_SCHED
> >         __u16                   tc_index;       /* traffic control index */
> > @@ -3903,6 +3908,14 @@ static inline ktime_t net_invalid_timestamp(void)
> >         return 0;
> >  }
> >
> > +static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
> > +                                        bool mono)
> > +{
> > +       skb->tstamp = kt;
> > +       /* Setting mono_delivery_time will be enabled later */
> > +       /* skb->mono_delivery_time = kt && mono; */
> > +}
> > +
> >  static inline u8 skb_metadata_len(const struct sk_buff *skb)
> >  {
> >         return skb_shinfo(skb)->meta_len;
> > diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c
> > index fdbed3158555..ebfb2a5c59e4 100644
> > --- a/net/bridge/netfilter/nf_conntrack_bridge.c
> > +++ b/net/bridge/netfilter/nf_conntrack_bridge.c
> > @@ -32,6 +32,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
> >                                            struct sk_buff *))
> >  {
> >         int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
> > +       bool mono_delivery_time = skb->mono_delivery_time;
> 
> This use of a local variable is just a style choice, not needed for
> correctness, correct?
It is needed.  The current code also saves the very first
skb->tstamp (two lines below):

> >         unsigned int hlen, ll_rs, mtu;
> >         ktime_t tstamp = skb->tstamp;

My understanding is that all frags reuse the first (/original) skb's tstamp info.
The frags are output()-ed one-by-one, meaning that when sending the later frags,
the first skb has already been output()-ed, so the first skb's tstamp info
needs to be saved in local variables.
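The lifetime issue can be sketched in a toy model (all `toy_*` names below are hypothetical; this only illustrates the reasoning, not the kernel's fragmentation code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model: output() consumes each fragment, and the first fragment
 * *is* the original skb, so its tstamp info must be copied to locals
 * before the loop that builds the later fragments.
 */
struct toy_skb {
	uint64_t tstamp;
	bool mono_delivery_time;
};

static uint64_t last_frag_tstamp;

void toy_output(struct toy_skb *skb)
{
	last_frag_tstamp = skb->tstamp;
	free(skb);			/* consumed, like the kernel's output() */
}

void toy_fragment(struct toy_skb *skb, int nfrags)
{
	/* Save before skb is handed off, as nf_br_ip_fragment() does. */
	uint64_t tstamp = skb->tstamp;
	bool mono = skb->mono_delivery_time;
	int i;

	toy_output(skb);		/* first fragment == original skb */

	for (i = 1; i < nfrags; i++) {
		struct toy_skb *frag = calloc(1, sizeof(*frag));

		/* Reading skb->tstamp here would be a use-after-free;
		 * the saved locals are the only safe copy.
		 */
		frag->tstamp = tstamp;
		frag->mono_delivery_time = mono;
		toy_output(frag);
	}
}
```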

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v3 net-next 1/4] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp
  2022-01-22 15:32   ` Willem de Bruijn
  2022-01-22 19:52     ` Martin KaFai Lau
@ 2022-01-22 20:03     ` Martin KaFai Lau
  1 sibling, 0 replies; 15+ messages in thread
From: Martin KaFai Lau @ 2022-01-22 20:03 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team

On Sat, Jan 22, 2022 at 10:32:16AM -0500, Willem de Bruijn wrote:
> > + *             delivery_time in mono clock base (i.e. EDT).  Otherwise, the
> > + *             skb->tstamp has the (rcv) timestamp at ingress and
> > + *             delivery_time at egress.
> >   *     @napi_id: id of the NAPI struct this skb came from
> >   *     @sender_cpu: (aka @napi_id) source CPU in XPS
> >   *     @secmark: security marking
> > @@ -890,6 +894,7 @@ struct sk_buff {
> >         __u8                    decrypted:1;
> >  #endif
> >         __u8                    slow_gro:1;
> > +       __u8                    mono_delivery_time:1;
> 
> This bit fills a hole, does not change sk_buff size, right?
[ Sorry, cut too much when sending the last reply ]

Correct.  With this +1 bit, it still has a 3-bit and a 1-byte hole
before tc_index when every CONFIG_* among these bits is turned on.
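The "fills a hole" point can be demonstrated with a toy bitfield layout (an illustration only, not the real struct sk_buff): a 1-bit field added to a byte that still has spare bits does not grow the struct.

```c
#include <assert.h>

/* Toy layout, not the real struct sk_buff: four 1-bit flags leave four
 * spare bits in their byte, so a fifth 1-bit field occupies an existing
 * hole instead of growing the struct.
 */
struct flags_before {
	unsigned char a:1, b:1, c:1, d:1;	/* 4 of 8 bits used */
	unsigned short tc_index;
};

struct flags_after {
	unsigned char a:1, b:1, c:1, d:1;
	unsigned char mono_delivery_time:1;	/* lands in the spare bits */
	unsigned short tc_index;
};
```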

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v3 net-next 0/4] Preserve mono delivery time (EDT) in skb->tstamp
  2022-01-22 15:43 ` [RFC PATCH v3 net-next 0/4] Preserve mono delivery time (EDT) in skb->tstamp Willem de Bruijn
@ 2022-01-22 21:05   ` Martin KaFai Lau
  0 siblings, 0 replies; 15+ messages in thread
From: Martin KaFai Lau @ 2022-01-22 21:05 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
	kernel-team

On Sat, Jan 22, 2022 at 10:43:57AM -0500, Willem de Bruijn wrote:
> This overall looks great to me.
> 
> The only effect on timestamping is slightly delayed receive timestamp
> for packets arriving over a virtual loop (loopback, veth, ..)
> interface, as the timestamp is now captured at the network protocol
> input.
> 
> Actually, while writing that: timestamping is a socket level option,
> not specific to IP. Might this break receive timestamping over such
> interfaces for other protocols?
The EDT (and the skb->mono_delivery_time bit) is only set for TCP,
which is IP only.  For other non-IP skb->protocol values, that bit is
not set, so the skb->tstamp is still cleared as before.

Thanks for the review !

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-01-22 21:05 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-21  7:30 [RFC PATCH v3 net-next 0/4] Preserve mono delivery time (EDT) in skb->tstamp Martin KaFai Lau
2022-01-21  7:30 ` [RFC PATCH v3 net-next 1/4] net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp Martin KaFai Lau
2022-01-22 15:32   ` Willem de Bruijn
2022-01-22 19:52     ` Martin KaFai Lau
2022-01-22 20:03     ` Martin KaFai Lau
2022-01-21  7:30 ` [RFC PATCH v3 net-next 2/4] net: Add skb_clear_tstamp() to keep the mono delivery_time Martin KaFai Lau
2022-01-21  7:30 ` [RFC PATCH v3 net-next 3/4] net: Set skb->mono_delivery_time and clear it when delivering locally Martin KaFai Lau
2022-01-21 12:02   ` Julian Anastasov
2022-01-22  3:28     ` Martin KaFai Lau
2022-01-21  7:30 ` [RFC PATCH v3 net-next 4/4] bpf: Add __sk_buff->mono_delivery_time and handle __sk_buff->tstamp based on tc_at_ingress Martin KaFai Lau
2022-01-21 18:50   ` sdf
2022-01-21 20:56     ` Martin KaFai Lau
2022-01-21 22:33       ` sdf
2022-01-22 15:43 ` [RFC PATCH v3 net-next 0/4] Preserve mono delivery time (EDT) in skb->tstamp Willem de Bruijn
2022-01-22 21:05   ` Martin KaFai Lau
